Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andrew Morton wrote:

> uhm, maybe.  It's getting toward the time when we should try to get -mm
> vaguely compiling and booting on some machines, which means stopping
> merging new stuff.  I left that too late in the 2.6.23 cycle.

Huh? mm1 works fine here.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Andrew Morton
On Tue, 20 Nov 2007 12:18:10 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Tue, 20 Nov 2007, Mel Gorman wrote:
> 
> > Went back and revisited this. Allocating them at boot-time is below but
> > essentially it is a silly and it makes sense to just have two zonelists
> > where one of them is for __GFP_THISNODE. Implementation wise, this involves
> > dropping the last patch in the set and the overall result is still a 
> > reduction
> > in the number of zonelists.
> 
> Allright with me. Andrew could we get this patchset merged?

uhm, maybe.  It's getting toward the time when we should try to get -mm
vaguely compiling and booting on some machines, which means stopping
merging new stuff.  I left that too late in the 2.6.23 cycle.

otoh it'd be nice to get mainline fixed up a bit too :(
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Mel Gorman
On (20/11/07 12:18), Christoph Lameter didst pronounce:
> On Tue, 20 Nov 2007, Mel Gorman wrote:
> 
> > Went back and revisited this. Allocating them at boot-time is below but
> > essentially it is a silly and it makes sense to just have two zonelists
> > where one of them is for __GFP_THISNODE. Implementation wise, this involves
> > dropping the last patch in the set and the overall result is still a 
> > reduction
> > in the number of zonelists.
> 
> Allright with me. Andrew could we get this patchset merged?
> 

They will not merge cleanly with a recent tree. I expect to be ready to
post a new set when regression tests complete later this evening.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Mel Gorman wrote:

> Hold off testing for the moment. Getting all the corner cases right for
> __GFP_THISNODE has turned too complicated to be considered as part of a larger
> patchset. I believe it makes sense to drop the final patch and settle with
> having two zonelists. One of these will be for __GFP_THISNODE allocations. We
> can then tackle removing that zonelist at a later date.

Ack.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Mel Gorman wrote:

> Went back and revisited this. Allocating them at boot-time is below but
> essentially it is a silly and it makes sense to just have two zonelists
> where one of them is for __GFP_THISNODE. Implementation wise, this involves
> dropping the last patch in the set and the overall result is still a reduction
> in the number of zonelists.

Allright with me. Andrew could we get this patchset merged?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Mel Gorman
On (20/11/07 10:14), Lee Schermerhorn didst pronounce:
> On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote:
> > On (09/11/07 07:45), Christoph Lameter didst pronounce:
> > > On Fri, 9 Nov 2007, Mel Gorman wrote:
> > > 
> > > >  struct page * fastcall
> > > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > > struct zonelist *zonelist)
> > > >  {
> > > > +   /*
> > > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If 
> > > > the
> > > > +* cost of allocating on the stack or the stack usage becomes
> > > > +* noticable, allocate the nodemasks per node at boot or 
> > > > compile time
> > > > +*/
> > > > +   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > > +   nodemask_t nodemask;
> > > 
> > > Hmmm.. This places a potentially big structure on the stack. nodemask can 
> > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
> > 
> > Went back and revisited this. Allocating them at boot-time is below but
> > essentially it is a silly and it makes sense to just have two zonelists
> > where one of them is for __GFP_THISNODE. Implementation wise, this involves
> > dropping the last patch in the set and the overall result is still a 
> > reduction
> > in the number of zonelists.
> 
> Hi, Mel:
> 
> I'll try this out [n 24-rc2-mm1 or later]. 

Hold off testing for the moment. Getting all the corner cases right for
__GFP_THISNODE has turned too complicated to be considered as part of a larger
patchset. I believe it makes sense to drop the final patch and settle with
having two zonelists. One of these will be for __GFP_THISNODE allocations. We
can then tackle removing that zonelist at a later date.

This will still remove the hack, reduce the number of zonelists and
improve how MPOL_BIND policies are applied so it is still an overall
win.

I should have a revised patchset with the final one dropped posted in a
few hours.

> I have a series with yet
> another rework of policy reference counting that will be easier/cleaner
> atop your series w/o the external zonelist hung off 'BIND policies.  So,
> I'm hoping your series goes into -mm "real soon now".
> 

Sounds good.

> Question:  Just wondering why you didn't embed the '_THISNODE nodemask
> in the pgdat_t--initialized at boot/node-hot-add time? 

At the time, to save space in the pg_data_t structure but it is not
worth the complexity.

> Perhaps under
> #ifdef CONFIG_NUMA. The size is known at build time, right?  And
> wouldn't that be way smaller than an additional zonelist?  
> 

Yes it would.

> Lee
> 
> > 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> > linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
> > linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
> > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h   
> > 2007-11-19 19:27:15.0 +
> > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
> > 2007-11-19 19:28:55.0 +
> > @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
> >  extern struct page *
> >  FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
> > struct zonelist *, nodemask_t *nodemask));
> > -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
> >  
> >  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> > unsigned int order)
> > @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n
> > if (nid < 0)
> > nid = numa_node_id();
> >  
> > -   /* Use a temporary nodemask for __GFP_THISNODE allocations */
> > if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > -   nodemask_t nodemask;
> > -
> > return __alloc_pages_nodemask(gfp_mask, order,
> > node_zonelist(nid),
> > -   nodemask_thisnode(nid, ));
> > +   NODE_DATA(nid)->nodemask_thisnode);
> > }
> >  
> > return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> > linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 
> > linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h
> > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h
> > 2007-11-19 19:27:15.0 +
> > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 
> > 2007-11-19 19:28:55.0 +
> > @@ -519,6 +519,9 @@ typedef struct pglist_data {
> > struct zone node_zones[MAX_NR_ZONES];
> > struct zonelist node_zonelist;
> > int nr_zones;
> > +
> > +   /* nodemask suitable for __GFP_THISNODE */
> > +   nodemask_t *nodemask_thisnode;
> >  #ifdef CONFIG_FLAT_NODE_MEM_MAP
> > struct page *node_mem_map;
> >  #endif
> > diff -rup -X 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Lee Schermerhorn
On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote:
> On (09/11/07 07:45), Christoph Lameter didst pronounce:
> > On Fri, 9 Nov 2007, Mel Gorman wrote:
> > 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >   struct zonelist *zonelist)
> > >  {
> > > + /*
> > > +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +  * cost of allocating on the stack or the stack usage becomes
> > > +  * noticable, allocate the nodemasks per node at boot or compile time
> > > +  */
> > > + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > + nodemask_t nodemask;
> > 
> > Hmmm.. This places a potentially big structure on the stack. nodemask can 
> > contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
> 
> Went back and revisited this. Allocating them at boot-time is below but
> essentially it is a silly and it makes sense to just have two zonelists
> where one of them is for __GFP_THISNODE. Implementation wise, this involves
> dropping the last patch in the set and the overall result is still a reduction
> in the number of zonelists.

Hi, Mel:

I'll try this out [n 24-rc2-mm1 or later].  I have a series with yet
another rework of policy reference counting that will be easier/cleaner
atop your series w/o the external zonelist hung off 'BIND policies.  So,
I'm hoping your series goes into -mm "real soon now".

Question:  Just wondering why you didn't embed the '_THISNODE nodemask
in the pgdat_t--initialized at boot/node-hot-add time?  Perhaps under
#ifdef CONFIG_NUMA. The size is known at build time, right?  And
wouldn't that be way smaller than an additional zonelist?  

Lee

> 
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
> linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
> --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
> 2007-11-19 19:27:15.0 +
> +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h  
> 2007-11-19 19:28:55.0 +
> @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
>  extern struct page *
>  FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
>   struct zonelist *, nodemask_t *nodemask));
> -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
>  
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>   unsigned int order)
> @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n
>   if (nid < 0)
>   nid = numa_node_id();
>  
> - /* Use a temporary nodemask for __GFP_THISNODE allocations */
>   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> - nodemask_t nodemask;
> -
>   return __alloc_pages_nodemask(gfp_mask, order,
>   node_zonelist(nid),
> - nodemask_thisnode(nid, ));
> + NODE_DATA(nid)->nodemask_thisnode);
>   }
>  
>   return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 
> linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h
> --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h  
> 2007-11-19 19:27:15.0 +
> +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h   
> 2007-11-19 19:28:55.0 +
> @@ -519,6 +519,9 @@ typedef struct pglist_data {
>   struct zone node_zones[MAX_NR_ZONES];
>   struct zonelist node_zonelist;
>   int nr_zones;
> +
> + /* nodemask suitable for __GFP_THISNODE */
> + nodemask_t *nodemask_thisnode;
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP
>   struct page *node_mem_map;
>  #endif
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 
> linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
> --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-11-19 
> 19:27:15.0 +
> +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c  
> 2007-11-19 19:28:55.0 +
> @@ -1695,28 +1695,36 @@ got_pg:
>  }
>  
>  /* Creates a nodemask suitable for GFP_THISNODE allocations */
> -nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask)
> +static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat)
>  {
> - nodes_clear(*nodemask);
> - node_set(nid, *nodemask);
> + nodemask_t *nodemask_thisnode;
>  
> - return nodemask;
> + /* Only a machine with multiple nodes needs the nodemask */
> + if (!NUMA_BUILD || num_online_nodes() == 1)
> + return;
> + 
> + /* Allocate the nodemask. Serious if it fails, but not world ending */
> + nodemask_thisnode = 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Mel Gorman
On (09/11/07 07:45), Christoph Lameter didst pronounce:
> On Fri, 9 Nov 2007, Mel Gorman wrote:
> 
> >  struct page * fastcall
> >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > struct zonelist *zonelist)
> >  {
> > +   /*
> > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > +* cost of allocating on the stack or the stack usage becomes
> > +* noticable, allocate the nodemasks per node at boot or compile time
> > +*/
> > +   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > +   nodemask_t nodemask;
> 
> Hmmm.. This places a potentially big structure on the stack. nodemask can 
> contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?

Went back and revisited this. Allocating them at boot-time is below but
essentially it is a silly and it makes sense to just have two zonelists
where one of them is for __GFP_THISNODE. Implementation wise, this involves
dropping the last patch in the set and the overall result is still a reduction
in the number of zonelists.

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
--- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h   
2007-11-19 19:27:15.0 +
+++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
2007-11-19 19:28:55.0 +
@@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
 extern struct page *
 FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
struct zonelist *, nodemask_t *nodemask));
-extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
 
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
@@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();
 
-   /* Use a temporary nodemask for __GFP_THISNODE allocations */
if (unlikely(gfp_mask & __GFP_THISNODE)) {
-   nodemask_t nodemask;
-
return __alloc_pages_nodemask(gfp_mask, order,
node_zonelist(nid),
-   nodemask_thisnode(nid, ));
+   NODE_DATA(nid)->nodemask_thisnode);
}
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid));
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 
linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h
--- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h
2007-11-19 19:27:15.0 +
+++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 
2007-11-19 19:28:55.0 +
@@ -519,6 +519,9 @@ typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelist;
int nr_zones;
+
+   /* nodemask suitable for __GFP_THISNODE */
+   nodemask_t *nodemask_thisnode;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 
linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
--- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c   2007-11-19 
19:27:15.0 +
+++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
2007-11-19 19:28:55.0 +
@@ -1695,28 +1695,36 @@ got_pg:
 }
 
 /* Creates a nodemask suitable for GFP_THISNODE allocations */
-nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask)
+static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat)
 {
-   nodes_clear(*nodemask);
-   node_set(nid, *nodemask);
+   nodemask_t *nodemask_thisnode;
 
-   return nodemask;
+   /* Only a machine with multiple nodes needs the nodemask */
+   if (!NUMA_BUILD || num_online_nodes() == 1)
+   return;
+   
+   /* Allocate the nodemask. Serious if it fails, but not world ending */
+   nodemask_thisnode = alloc_bootmem_node(pgdat, sizeof(nodemask_t));
+   if (!nodemask_thisnode) {
+   printk(KERN_WARNING
+   "thisnode nodemask allocation failed."
+   "There may be sub-optimal NUMA placement.\n");
+   return;
+   }
+
+   /* Initialise the nodemask to only cover the current node */
+   nodes_clear(*nodemask_thisnode);
+   node_set(pgdat->node_id, *nodemask_thisnode);
+   pgdat->nodemask_thisnode = nodemask_thisnode;
 }
 
 struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
 {
-   /*
-* Use a temporary nodemask for __GFP_THISNODE allocations. If the
-

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Mel Gorman
On (09/11/07 07:45), Christoph Lameter didst pronounce:
 On Fri, 9 Nov 2007, Mel Gorman wrote:
 
   struct page * fastcall
   __alloc_pages(gfp_t gfp_mask, unsigned int order,
  struct zonelist *zonelist)
   {
  +   /*
  +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
  +* cost of allocating on the stack or the stack usage becomes
  +* noticable, allocate the nodemasks per node at boot or compile time
  +*/
  +   if (unlikely(gfp_mask  __GFP_THISNODE)) {
  +   nodemask_t nodemask;
 
 Hmmm.. This places a potentially big structure on the stack. nodemask can 
 contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
 gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?

Went back and revisited this. Allocating them at boot-time is below but
essentially it is a silly and it makes sense to just have two zonelists
where one of them is for __GFP_THISNODE. Implementation wise, this involves
dropping the last patch in the set and the overall result is still a reduction
in the number of zonelists.

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
--- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h   
2007-11-19 19:27:15.0 +
+++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
2007-11-19 19:28:55.0 +
@@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
 extern struct page *
 FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
struct zonelist *, nodemask_t *nodemask));
-extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
 
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
@@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n
if (nid  0)
nid = numa_node_id();
 
-   /* Use a temporary nodemask for __GFP_THISNODE allocations */
if (unlikely(gfp_mask  __GFP_THISNODE)) {
-   nodemask_t nodemask;
-
return __alloc_pages_nodemask(gfp_mask, order,
node_zonelist(nid),
-   nodemask_thisnode(nid, nodemask));
+   NODE_DATA(nid)-nodemask_thisnode);
}
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid));
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 
linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h
--- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h
2007-11-19 19:27:15.0 +
+++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 
2007-11-19 19:28:55.0 +
@@ -519,6 +519,9 @@ typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelist;
int nr_zones;
+
+   /* nodemask suitable for __GFP_THISNODE */
+   nodemask_t *nodemask_thisnode;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 
linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
--- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c   2007-11-19 
19:27:15.0 +
+++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
2007-11-19 19:28:55.0 +
@@ -1695,28 +1695,36 @@ got_pg:
 }
 
 /* Creates a nodemask suitable for GFP_THISNODE allocations */
-nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask)
+static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat)
 {
-   nodes_clear(*nodemask);
-   node_set(nid, *nodemask);
+   nodemask_t *nodemask_thisnode;
 
-   return nodemask;
+   /* Only a machine with multiple nodes needs the nodemask */
+   if (!NUMA_BUILD || num_online_nodes() == 1)
+   return;
+   
+   /* Allocate the nodemask. Serious if it fails, but not world ending */
+   nodemask_thisnode = alloc_bootmem_node(pgdat, sizeof(nodemask_t));
+   if (!nodemask_thisnode) {
+   printk(KERN_WARNING
+   thisnode nodemask allocation failed.
+   There may be sub-optimal NUMA placement.\n);
+   return;
+   }
+
+   /* Initialise the nodemask to only cover the current node */
+   nodes_clear(*nodemask_thisnode);
+   node_set(pgdat-node_id, *nodemask_thisnode);
+   pgdat-nodemask_thisnode = nodemask_thisnode;
 }
 
 struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
 {
-   /*
-* Use a temporary nodemask for __GFP_THISNODE allocations. If the
-* cost of allocating on the 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Lee Schermerhorn
On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote:
 On (09/11/07 07:45), Christoph Lameter didst pronounce:
  On Fri, 9 Nov 2007, Mel Gorman wrote:
  
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
 struct zonelist *zonelist)
{
   + /*
   +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
   +  * cost of allocating on the stack or the stack usage becomes
   +  * noticable, allocate the nodemasks per node at boot or compile time
   +  */
   + if (unlikely(gfp_mask  __GFP_THISNODE)) {
   + nodemask_t nodemask;
  
  Hmmm.. This places a potentially big structure on the stack. nodemask can 
  contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
  gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
 
 Went back and revisited this. Allocating them at boot-time is below but
 essentially it is a silly and it makes sense to just have two zonelists
 where one of them is for __GFP_THISNODE. Implementation wise, this involves
 dropping the last patch in the set and the overall result is still a reduction
 in the number of zonelists.

Hi, Mel:

I'll try this out [n 24-rc2-mm1 or later].  I have a series with yet
another rework of policy reference counting that will be easier/cleaner
atop your series w/o the external zonelist hung off 'BIND policies.  So,
I'm hoping your series goes into -mm real soon now.

Question:  Just wondering why you didn't embed the '_THISNODE nodemask
in the pgdat_t--initialized at boot/node-hot-add time?  Perhaps under
#ifdef CONFIG_NUMA. The size is known at build time, right?  And
wouldn't that be way smaller than an additional zonelist?  

Lee

 
 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
 linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
 linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
 --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
 2007-11-19 19:27:15.0 +
 +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h  
 2007-11-19 19:28:55.0 +
 @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
  extern struct page *
  FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
   struct zonelist *, nodemask_t *nodemask));
 -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
  
  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
   unsigned int order)
 @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n
   if (nid  0)
   nid = numa_node_id();
  
 - /* Use a temporary nodemask for __GFP_THISNODE allocations */
   if (unlikely(gfp_mask  __GFP_THISNODE)) {
 - nodemask_t nodemask;
 -
   return __alloc_pages_nodemask(gfp_mask, order,
   node_zonelist(nid),
 - nodemask_thisnode(nid, nodemask));
 + NODE_DATA(nid)-nodemask_thisnode);
   }
  
   return __alloc_pages(gfp_mask, order, node_zonelist(nid));
 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
 linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 
 linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h
 --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h  
 2007-11-19 19:27:15.0 +
 +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h   
 2007-11-19 19:28:55.0 +
 @@ -519,6 +519,9 @@ typedef struct pglist_data {
   struct zone node_zones[MAX_NR_ZONES];
   struct zonelist node_zonelist;
   int nr_zones;
 +
 + /* nodemask suitable for __GFP_THISNODE */
 + nodemask_t *nodemask_thisnode;
  #ifdef CONFIG_FLAT_NODE_MEM_MAP
   struct page *node_mem_map;
  #endif
 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
 linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 
 linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
 --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-11-19 
 19:27:15.0 +
 +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c  
 2007-11-19 19:28:55.0 +
 @@ -1695,28 +1695,36 @@ got_pg:
  }
  
  /* Creates a nodemask suitable for GFP_THISNODE allocations */
 -nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask)
 +static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat)
  {
 - nodes_clear(*nodemask);
 - node_set(nid, *nodemask);
 + nodemask_t *nodemask_thisnode;
  
 - return nodemask;
 + /* Only a machine with multiple nodes needs the nodemask */
 + if (!NUMA_BUILD || num_online_nodes() == 1)
 + return;
 + 
 + /* Allocate the nodemask. Serious if it fails, but not world ending */
 + nodemask_thisnode = alloc_bootmem_node(pgdat, sizeof(nodemask_t));
 + if (!nodemask_thisnode) {
 + printk(KERN_WARNING
 +

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Mel Gorman
On (20/11/07 10:14), Lee Schermerhorn didst pronounce:
 On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote:
  On (09/11/07 07:45), Christoph Lameter didst pronounce:
   On Fri, 9 Nov 2007, Mel Gorman wrote:
   
 struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
 {
+   /*
+* Use a temporary nodemask for __GFP_THISNODE allocations. If 
the
+* cost of allocating on the stack or the stack usage becomes
+* noticable, allocate the nodemasks per node at boot or 
compile time
+*/
+   if (unlikely(gfp_mask  __GFP_THISNODE)) {
+   nodemask_t nodemask;
   
   Hmmm.. This places a potentially big structure on the stack. nodemask can 
   contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
   gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
  
  Went back and revisited this. Allocating them at boot-time is below but
  essentially it is a silly and it makes sense to just have two zonelists
  where one of them is for __GFP_THISNODE. Implementation wise, this involves
  dropping the last patch in the set and the overall result is still a 
  reduction
  in the number of zonelists.
 
 Hi, Mel:
 
 I'll try this out [n 24-rc2-mm1 or later]. 

Hold off testing for the moment. Getting all the corner cases right for
__GFP_THISNODE has turned too complicated to be considered as part of a larger
patchset. I believe it makes sense to drop the final patch and settle with
having two zonelists. One of these will be for __GFP_THISNODE allocations. We
can then tackle removing that zonelist at a later date.

This will still remove the hack, reduce the number of zonelists and
improve how MPOL_BIND policies are applied so it is still an overall
win.

I should have a revised patchset with the final one dropped posted in a
few hours.

 I have a series with yet
 another rework of policy reference counting that will be easier/cleaner
 atop your series w/o the external zonelist hung off 'BIND policies.  So,
 I'm hoping your series goes into -mm real soon now.
 

Sounds good.

 Question:  Just wondering why you didn't embed the '_THISNODE nodemask
 in the pgdat_t--initialized at boot/node-hot-add time? 

At the time, to save space in the pg_data_t structure but it is not
worth the complexity.

 Perhaps under
 #ifdef CONFIG_NUMA. The size is known at build time, right?  And
 wouldn't that be way smaller than an additional zonelist?  
 

Yes it would.

 Lee
 
  
  diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
  linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 
  linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
  --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h   
  2007-11-19 19:27:15.0 +
  +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h
  2007-11-19 19:28:55.0 +
  @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
   extern struct page *
   FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
  struct zonelist *, nodemask_t *nodemask));
  -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
   
   static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
  unsigned int order)
  @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n
  if (nid  0)
  nid = numa_node_id();
   
  -   /* Use a temporary nodemask for __GFP_THISNODE allocations */
  if (unlikely(gfp_mask  __GFP_THISNODE)) {
  -   nodemask_t nodemask;
  -
  return __alloc_pages_nodemask(gfp_mask, order,
  node_zonelist(nid),
  -   nodemask_thisnode(nid, nodemask));
  +   NODE_DATA(nid)-nodemask_thisnode);
  }
   
  return __alloc_pages(gfp_mask, order, node_zonelist(nid));
  diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
  linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 
  linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h
  --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h
  2007-11-19 19:27:15.0 +
  +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 
  2007-11-19 19:28:55.0 +
  @@ -519,6 +519,9 @@ typedef struct pglist_data {
  struct zone node_zones[MAX_NR_ZONES];
  struct zonelist node_zonelist;
  int nr_zones;
  +
  +   /* nodemask suitable for __GFP_THISNODE */
  +   nodemask_t *nodemask_thisnode;
   #ifdef CONFIG_FLAT_NODE_MEM_MAP
  struct page *node_mem_map;
   #endif
  diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
  linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 
  linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c
  --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c   

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Mel Gorman wrote:

 Went back and revisited this. Allocating them at boot-time is below but
 essentially it is a silly and it makes sense to just have two zonelists
 where one of them is for __GFP_THISNODE. Implementation wise, this involves
 dropping the last patch in the set and the overall result is still a reduction
 in the number of zonelists.

Allright with me. Andrew could we get this patchset merged?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Mel Gorman wrote:

 Hold off testing for the moment. Getting all the corner cases right for
 __GFP_THISNODE has turned too complicated to be considered as part of a larger
 patchset. I believe it makes sense to drop the final patch and settle with
 having two zonelists. One of these will be for __GFP_THISNODE allocations. We
 can then tackle removing that zonelist at a later date.

Ack.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Mel Gorman
On (20/11/07 12:18), Christoph Lameter didst pronounce:
 On Tue, 20 Nov 2007, Mel Gorman wrote:
 
  Went back and revisited this. Allocating them at boot-time is below but
  essentially it is a silly and it makes sense to just have two zonelists
  where one of them is for __GFP_THISNODE. Implementation wise, this involves
  dropping the last patch in the set and the overall result is still a 
  reduction
  in the number of zonelists.
 
 Allright with me. Andrew could we get this patchset merged?
 

They will not merge cleanly with a recent tree. I expect to be ready to
post a new set when regression tests complete later this evening.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Andrew Morton
On Tue, 20 Nov 2007 12:18:10 -0800 (PST)
Christoph Lameter [EMAIL PROTECTED] wrote:

 On Tue, 20 Nov 2007, Mel Gorman wrote:
 
  Went back and revisited this. Allocating them at boot-time is below but
  essentially it is a silly and it makes sense to just have two zonelists
  where one of them is for __GFP_THISNODE. Implementation wise, this involves
  dropping the last patch in the set and the overall result is still a 
  reduction
  in the number of zonelists.
 
 Allright with me. Andrew could we get this patchset merged?

uhm, maybe.  It's getting toward the time when we should try to get -mm
vaguely compiling and booting on some machines, which means stopping
merging new stuff.  I left that too late in the 2.6.23 cycle.

otoh it'd be nice to get mainline fixed up a bit too :(
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andrew Morton wrote:

 uhm, maybe.  It's getting toward the time when we should try to get -mm
 vaguely compiling and booting on some machines, which means stopping
 merging new stuff.  I left that too late in the 2.6.23 cycle.

Huh? mm1 works fine here.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-12 Thread Christoph Lameter
On Sun, 11 Nov 2007, Mel Gorman wrote:

> If MPOL_BIND is in effect, the allocation will be filtered based on the
> current allowed nodemask. If they specify THISNODE and the specified
> node or current node is not in the mask, I would expect the allocation
> to fail. Is that unexpected to anybody?

Currently GFP_THISNODE with MPOL_BIND results an allocation on the first 
node.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-12 Thread Christoph Lameter
On Sun, 11 Nov 2007, Mel Gorman wrote:

 If MPOL_BIND is in effect, the allocation will be filtered based on the
 current allowed nodemask. If they specify THISNODE and the specified
 node or current node is not in the mask, I would expect the allocation
 to fail. Is that unexpected to anybody?

Currently GFP_THISNODE with MPOL_BIND results an allocation on the first 
node.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-11 Thread Mel Gorman
On (09/11/07 09:26), Christoph Lameter didst pronounce:
> On Fri, 9 Nov 2007, Lee Schermerhorn wrote:
> 
> > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
> > > is no nid to base the allocation on, so we "fallback" to numa_node_id()
> > > [ almost like the nid had been specified as -1 ].
> > > 
> > > So I guess this is logical -- but I wonder, do we have any callers of
> > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
> > > alloc_pages_node() exists?
> > 
> > I don't know if we have any current callers that do this, but absent any
> > documentation specifying otherwise, Mel's implementation matches what
> > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
> > However, we could specify that THISNODE is ignored in __alloc_pages()
> > and recommend the use of alloc_pages_node() passing numa_node_id() as
> > the nid parameter to achieve the behavior.  This would eliminate the
> > check for 'THISNODE in __alloc_pages().  Just mask it off before calling
> > down to __alloc_pages_internal().
> > 
> > Does this make sense?
> 
> I like consistency. If someone absolutely wants a local page then 
> specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 
> 

Agreed.

> What happens though if an MPOL_BIND policy is in effect? The node used 
> must then be the nearest node from the policy mask
> 

If MPOL_BIND is in effect, the allocation will be filtered based on the
current allowed nodemask. If they specify THISNODE and the specified
node or current node is not in the mask, I would expect the allocation
to fail. Is that unexpected to anybody?

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-11 Thread Mel Gorman
On (09/11/07 09:26), Christoph Lameter didst pronounce:
 On Fri, 9 Nov 2007, Lee Schermerhorn wrote:
 
   On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
   is no nid to base the allocation on, so we fallback to numa_node_id()
   [ almost like the nid had been specified as -1 ].
   
   So I guess this is logical -- but I wonder, do we have any callers of
   alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
   alloc_pages_node() exists?
  
  I don't know if we have any current callers that do this, but absent any
  documentation specifying otherwise, Mel's implementation matches what
  I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
  However, we could specify that THISNODE is ignored in __alloc_pages()
  and recommend the use of alloc_pages_node() passing numa_node_id() as
  the nid parameter to achieve the behavior.  This would eliminate the
  check for 'THISNODE in __alloc_pages().  Just mask it off before calling
  down to __alloc_pages_internal().
  
  Does this make sense?
 
 I like consistency. If someone absolutely wants a local page then 
 specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 
 

Agreed.

 What happens though if an MPOL_BIND policy is in effect? The node used 
 must then be the nearest node from the policy mask
 

If MPOL_BIND is in effect, the allocation will be filtered based on the
current allowed nodemask. If they specify THISNODE and the specified
node or current node is not in the mask, I would expect the allocation
to fail. Is that unexpected to anybody?

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Nishanth Aravamudan wrote:

> > Indeed, this probably needs to be validated... Sigh, more interleaving
> > of policies and everything else...
> 
> Hrm, more importantly, isn't this an existing issue? Maybe should be
> resolved separately from the one zonelist patches.

GFP_THISNODE with alloc_pages() currently yields an allocation from the 
first node of the MPOL_BIND zonelist. So its the lowest node of the set.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [10:16:07 -0800], Nishanth Aravamudan wrote:
> On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote:
> > On Fri, 9 Nov 2007, Lee Schermerhorn wrote:
> > 
> > > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
> > > > is no nid to base the allocation on, so we "fallback" to numa_node_id()
> > > > [ almost like the nid had been specified as -1 ].
> > > > 
> > > > So I guess this is logical -- but I wonder, do we have any callers of
> > > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
> > > > alloc_pages_node() exists?
> > > 
> > > I don't know if we have any current callers that do this, but absent any
> > > documentation specifying otherwise, Mel's implementation matches what
> > > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
> > > However, we could specify that THISNODE is ignored in __alloc_pages()
> > > and recommend the use of alloc_pages_node() passing numa_node_id() as
> > > the nid parameter to achieve the behavior.  This would eliminate the
> > > check for 'THISNODE in __alloc_pages().  Just mask it off before calling
> > > down to __alloc_pages_internal().
> > > 
> > > Does this make sense?
> > 
> > I like consistency. If someone absolutely wants a local page then
> > specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 
> 
> Fair enough.
> 
> > What happens though if an MPOL_BIND policy is in effect? The node used
> > must then be the nearest node from the policy mask
> 
> Indeed, this probably needs to be validated... Sigh, more interleaving
> of policies and everything else...

Hrm, more importantly, isn't this an existing issue? Maybe should be
resolved separately from the one zonelist patches.

-Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [12:18:52 -0500], Lee Schermerhorn wrote:
> On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote:
> > On 09.11.2007 [16:14:55 +], Mel Gorman wrote:
> > > On (09/11/07 07:45), Christoph Lameter didst pronounce:
> > > > On Fri, 9 Nov 2007, Mel Gorman wrote:
> > > > 
> > > > >  struct page * fastcall
> > > > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > > >   struct zonelist *zonelist)
> > > > >  {
> > > > > + /*
> > > > > +  * Use a temporary nodemask for __GFP_THISNODE allocations. If 
> > > > > the
> > > > > +  * cost of allocating on the stack or the stack usage becomes
> > > > > +  * noticable, allocate the nodemasks per node at boot or 
> > > > > compile time
> > > > > +  */
> > > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > > > + nodemask_t nodemask;
> > > > 
> > > > Hmmm.. This places a potentially big structure on the stack. nodemask 
> > > > can 
> > > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> > > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
> > > > 
> > > 
> > > That is what I was hinting at in the comment as a possible solution.
> > > 
> > > > > +
> > > > > + return __alloc_pages_internal(gfp_mask, order,
> > > > > + zonelist, nodemask_thisnode(numa_node_id(), 
> > > > > ));
> > > > 
> > > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node
> > > > and *not* the local numa node id. Only if the node specified to
> > > > alloc_pages nodes is -1 will this work.
> > > > 
> > > 
> > > alloc_pages_node() calls __alloc_pages_nodemask() though where in this
> > > function if I'm reading it right is called without a node id. Given no
> > > other details on the nid, the current one seemed a logical choice.
> > 
> > Yeah, I guess the context here matters (and is a little hard to follow
> > because thare are a few places that change in different ways here):
> > 
> > For allocating pages from a particular node (GFP_THISNODE with nid),
> > the nid clearly must be specified. This only happens with
> > alloc_pages_node(), AFAICT. So, in that interface, the right thing is
> > done and the appropriate nodemask will be built.
> 
> I agree.  In an earlier patch, Mel was ignoring nid and using
> numa_node_id() here.  This was causing your [Nish's] hugetlb pool
> allocation patches to fail.  Mel fixed that ~9oct07.  

Yep, and that's why I'm on the Cc, I think :)

> > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
> > is no nid to base the allocation on, so we "fallback" to numa_node_id()
> > [ almost like the nid had been specified as -1 ].
> > 
> > So I guess this is logical -- but I wonder, do we have any callers of
> > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
> > alloc_pages_node() exists?
> 
> I don't know if we have any current callers that do this, but absent any
> documentation specifying otherwise, Mel's implementation matches what
> I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
> However, we could specify that THISNODE is ignored in __alloc_pages()
> and recommend the use of alloc_pages_node() passing numa_node_id() as
> the nid parameter to achieve the behavior.  This would eliminate the
> check for 'THISNODE in __alloc_pages().  Just mask it off before calling
> down to __alloc_pages_internal().
> 
> Does this make sense?

The caller could also just use -1 as the nid, since then
alloc_pages_node() should do the numa_node_id() for the caller... But I
agree, there is no documentation saying GFP_THISNODE is *not* allowed
for alloc_pages(), so we should probably handle it

-Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote:
> On Fri, 9 Nov 2007, Lee Schermerhorn wrote:
> 
> > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
> > > is no nid to base the allocation on, so we "fallback" to numa_node_id()
> > > [ almost like the nid had been specified as -1 ].
> > > 
> > > So I guess this is logical -- but I wonder, do we have any callers of
> > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
> > > alloc_pages_node() exists?
> > 
> > I don't know if we have any current callers that do this, but absent any
> > documentation specifying otherwise, Mel's implementation matches what
> > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
> > However, we could specify that THISNODE is ignored in __alloc_pages()
> > and recommend the use of alloc_pages_node() passing numa_node_id() as
> > the nid parameter to achieve the behavior.  This would eliminate the
> > check for 'THISNODE in __alloc_pages().  Just mask it off before calling
> > down to __alloc_pages_internal().
> > 
> > Does this make sense?
> 
> I like consistency. If someone absolutely wants a local page then
> specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 

Fair enough.

> What happens though if an MPOL_BIND policy is in effect? The node used
> must then be the nearest node from the policy mask

Indeed, this probably needs to be validated... Sigh, more interleaving
of policies and everything else...

-Nish


-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Lee Schermerhorn wrote:

> > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
> > is no nid to base the allocation on, so we "fallback" to numa_node_id()
> > [ almost like the nid had been specified as -1 ].
> > 
> > So I guess this is logical -- but I wonder, do we have any callers of
> > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
> > alloc_pages_node() exists?
> 
> I don't know if we have any current callers that do this, but absent any
> documentation specifying otherwise, Mel's implementation matches what
> I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
> However, we could specify that THISNODE is ignored in __alloc_pages()
> and recommend the use of alloc_pages_node() passing numa_node_id() as
> the nid parameter to achieve the behavior.  This would eliminate the
> check for 'THISNODE in __alloc_pages().  Just mask it off before calling
> down to __alloc_pages_internal().
> 
> Does this make sense?

I like consistency. If someone absolutely wants a local page then 
specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 

What happens though if an MPOL_BIND policy is in effect? The node used 
must then be the nearest node from the policy mask


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Lee Schermerhorn
On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote:
> On 09.11.2007 [16:14:55 +], Mel Gorman wrote:
> > On (09/11/07 07:45), Christoph Lameter didst pronounce:
> > > On Fri, 9 Nov 2007, Mel Gorman wrote:
> > > 
> > > >  struct page * fastcall
> > > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > > struct zonelist *zonelist)
> > > >  {
> > > > +   /*
> > > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If 
> > > > the
> > > > +* cost of allocating on the stack or the stack usage becomes
> > > > +* noticable, allocate the nodemasks per node at boot or 
> > > > compile time
> > > > +*/
> > > > +   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > > +   nodemask_t nodemask;
> > > 
> > > Hmmm.. This places a potentially big structure on the stack. nodemask can 
> > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
> > > 
> > 
> > That is what I was hinting at in the comment as a possible solution.
> > 
> > > > +
> > > > +   return __alloc_pages_internal(gfp_mask, order,
> > > > +   zonelist, nodemask_thisnode(numa_node_id(), 
> > > > ));
> > > 
> > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node
> > > and *not* the local numa node id. Only if the node specified to
> > > alloc_pages nodes is -1 will this work.
> > > 
> > 
> > alloc_pages_node() calls __alloc_pages_nodemask() though where in this
> > function if I'm reading it right is called without a node id. Given no
> > other details on the nid, the current one seemed a logical choice.
> 
> Yeah, I guess the context here matters (and is a little hard to follow
> because thare are a few places that change in different ways here):
> 
> For allocating pages from a particular node (GFP_THISNODE with nid),
> the nid clearly must be specified. This only happens with
> alloc_pages_node(), AFAICT. So, in that interface, the right thing is
> done and the appropriate nodemask will be built.

I agree.  In an earlier patch, Mel was ignoring nid and using
numa_node_id() here.  This was causing your [Nish's] hugetlb pool
allocation patches to fail.  Mel fixed that ~9oct07.  

> 
> On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
> is no nid to base the allocation on, so we "fallback" to numa_node_id()
> [ almost like the nid had been specified as -1 ].
> 
> So I guess this is logical -- but I wonder, do we have any callers of
> alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
> alloc_pages_node() exists?

I don't know if we have any current callers that do this, but absent any
documentation specifying otherwise, Mel's implementation matches what
I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
However, we could specify that THISNODE is ignored in __alloc_pages()
and recommend the use of alloc_pages_node() passing numa_node_id() as
the nid parameter to achieve the behavior.  This would eliminate the
check for 'THISNODE in __alloc_pages().  Just mask it off before calling
down to __alloc_pages_internal().

Does this make sense?

Lee


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [16:14:55 +], Mel Gorman wrote:
> On (09/11/07 07:45), Christoph Lameter didst pronounce:
> > On Fri, 9 Nov 2007, Mel Gorman wrote:
> > 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >   struct zonelist *zonelist)
> > >  {
> > > + /*
> > > +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +  * cost of allocating on the stack or the stack usage becomes
> > > +  * noticable, allocate the nodemasks per node at boot or compile time
> > > +  */
> > > + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > + nodemask_t nodemask;
> > 
> > Hmmm.. This places a potentially big structure on the stack. nodemask can 
> > contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
> > 
> 
> That is what I was hinting at in the comment as a possible solution.
> 
> > > +
> > > + return __alloc_pages_internal(gfp_mask, order,
> > > + zonelist, nodemask_thisnode(numa_node_id(), ));
> > 
> > Argh GFP_THISNODE must use the nid passed to alloc_pages_node
> > and *not* the local numa node id. Only if the node specified to
> > alloc_pages nodes is -1 will this work.
> > 
> 
> alloc_pages_node() calls __alloc_pages_nodemask() though where in this
> function if I'm reading it right is called without a node id. Given no
> other details on the nid, the current one seemed a logical choice.

Yeah, I guess the context here matters (and is a little hard to follow
because thare are a few places that change in different ways here):

For allocating pages from a particular node (GFP_THISNODE with nid),
the nid clearly must be specified. This only happens with
alloc_pages_node(), AFAICT. So, in that interface, the right thing is
done and the appropriate nodemask will be built.

On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
is no nid to base the allocation on, so we "fallback" to numa_node_id()
[ almost like the nid had been specified as -1 ].

So I guess this is logical -- but I wonder, do we have any callers of
alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
alloc_pages_node() exists?

Thanks,
Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Mel Gorman wrote:

> > Argh GFP_THISNODE must use the nid passed to alloc_pages_node and 
> > *not* the local numa node id. Only if the node specified to alloc_pages 
> > nodes is -1 will this work.
> > 
> 
> alloc_pages_node() calls __alloc_pages_nodemask() though where in this
> function if I'm reading it right is called without a node id. Given no
> other details on the nid, the current one seemed a logical choice.

If the function only called when the first zone == numa_node_id then 
the use of numa_node_id here is okay.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Mel Gorman
On (09/11/07 07:45), Christoph Lameter didst pronounce:
> On Fri, 9 Nov 2007, Mel Gorman wrote:
> 
> >  struct page * fastcall
> >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > struct zonelist *zonelist)
> >  {
> > +   /*
> > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > +* cost of allocating on the stack or the stack usage becomes
> > +* noticable, allocate the nodemasks per node at boot or compile time
> > +*/
> > +   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > +   nodemask_t nodemask;
> 
> Hmmm.. This places a potentially big structure on the stack. nodemask can 
> contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
> gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
> 

That is what I was hinting at in the comment as a possible solution.

> > +
> > +   return __alloc_pages_internal(gfp_mask, order,
> > +   zonelist, nodemask_thisnode(numa_node_id(), ));
> 
> Argh GFP_THISNODE must use the nid passed to alloc_pages_node and 
> *not* the local numa node id. Only if the node specified to alloc_pages 
> nodes is -1 will this work.
> 

alloc_pages_node() calls __alloc_pages_nodemask() though where in this
function if I'm reading it right is called without a node id. Given no
other details on the nid, the current one seemed a logical choice.

What I did notice when rechecking is I left the warning about THISNODE
in by accident :(

-- 
-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Mel Gorman wrote:

>  struct page * fastcall
>  __alloc_pages(gfp_t gfp_mask, unsigned int order,
>   struct zonelist *zonelist)
>  {
> + /*
> +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> +  * cost of allocating on the stack or the stack usage becomes
> +  * noticable, allocate the nodemasks per node at boot or compile time
> +  */
> + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> + nodemask_t nodemask;

Hmmm.. This places a potentially big structure on the stack. nodemask can 
contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?

> +
> + return __alloc_pages_internal(gfp_mask, order,
> + zonelist, nodemask_thisnode(numa_node_id(), ));

Argh GFP_THISNODE must use the nid passed to alloc_pages_node and 
*not* the local numa node id. Only if the node specified to alloc_pages 
nodes is -1 will this work.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Mel Gorman wrote:

  struct page * fastcall
  __alloc_pages(gfp_t gfp_mask, unsigned int order,
   struct zonelist *zonelist)
  {
 + /*
 +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
 +  * cost of allocating on the stack or the stack usage becomes
 +  * noticable, allocate the nodemasks per node at boot or compile time
 +  */
 + if (unlikely(gfp_mask  __GFP_THISNODE)) {
 + nodemask_t nodemask;

Hmmm.. This places a potentially big structure on the stack. nodemask can 
contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?

 +
 + return __alloc_pages_internal(gfp_mask, order,
 + zonelist, nodemask_thisnode(numa_node_id(), nodemask));

Argh GFP_THISNODE must use the nid passed to alloc_pages_node and 
*not* the local numa node id. Only if the node specified to alloc_pages 
nodes is -1 will this work.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Mel Gorman
On (09/11/07 07:45), Christoph Lameter didst pronounce:
 On Fri, 9 Nov 2007, Mel Gorman wrote:
 
   struct page * fastcall
   __alloc_pages(gfp_t gfp_mask, unsigned int order,
  struct zonelist *zonelist)
   {
  +   /*
  +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
  +* cost of allocating on the stack or the stack usage becomes
  +* noticable, allocate the nodemasks per node at boot or compile time
  +*/
  +   if (unlikely(gfp_mask  __GFP_THISNODE)) {
  +   nodemask_t nodemask;
 
 Hmmm.. This places a potentially big structure on the stack. nodemask can 
 contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
 gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
 

That is what I was hinting at in the comment as a possible solution.

  +
  +   return __alloc_pages_internal(gfp_mask, order,
  +   zonelist, nodemask_thisnode(numa_node_id(), nodemask));
 
 Argh GFP_THISNODE must use the nid passed to alloc_pages_node and 
 *not* the local numa node id. Only if the node specified to alloc_pages 
 nodes is -1 will this work.
 

alloc_pages_node() calls __alloc_pages_nodemask() though where in this
function if I'm reading it right is called without a node id. Given no
other details on the nid, the current one seemed a logical choice.

What I did notice when rechecking is I left the warning about THISNODE
in by accident :(

-- 
-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Mel Gorman wrote:

  Argh GFP_THISNODE must use the nid passed to alloc_pages_node and 
  *not* the local numa node id. Only if the node specified to alloc_pages 
  nodes is -1 will this work.
  
 
 alloc_pages_node() calls __alloc_pages_nodemask() though where in this
 function if I'm reading it right is called without a node id. Given no
 other details on the nid, the current one seemed a logical choice.

If the function only called when the first zone == numa_node_id then 
the use of numa_node_id here is okay.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [16:14:55 +], Mel Gorman wrote:
 On (09/11/07 07:45), Christoph Lameter didst pronounce:
  On Fri, 9 Nov 2007, Mel Gorman wrote:
  
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
 struct zonelist *zonelist)
{
   + /*
   +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
   +  * cost of allocating on the stack or the stack usage becomes
   +  * noticable, allocate the nodemasks per node at boot or compile time
   +  */
   + if (unlikely(gfp_mask  __GFP_THISNODE)) {
   + nodemask_t nodemask;
  
  Hmmm.. This places a potentially big structure on the stack. nodemask can 
  contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
  gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
  
 
 That is what I was hinting at in the comment as a possible solution.
 
   +
   + return __alloc_pages_internal(gfp_mask, order,
   + zonelist, nodemask_thisnode(numa_node_id(), nodemask));
  
  Argh GFP_THISNODE must use the nid passed to alloc_pages_node
  and *not* the local numa node id. Only if the node specified to
  alloc_pages nodes is -1 will this work.
  
 
 alloc_pages_node() calls __alloc_pages_nodemask() though where in this
 function if I'm reading it right is called without a node id. Given no
 other details on the nid, the current one seemed a logical choice.

Yeah, I guess the context here matters (and is a little hard to follow
because thare are a few places that change in different ways here):

For allocating pages from a particular node (GFP_THISNODE with nid),
the nid clearly must be specified. This only happens with
alloc_pages_node(), AFAICT. So, in that interface, the right thing is
done and the appropriate nodemask will be built.

On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
is no nid to base the allocation on, so we fallback to numa_node_id()
[ almost like the nid had been specified as -1 ].

So I guess this is logical -- but I wonder, do we have any callers of
alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
alloc_pages_node() exists?

Thanks,
Nish

-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Lee Schermerhorn
On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote:
 On 09.11.2007 [16:14:55 +], Mel Gorman wrote:
  On (09/11/07 07:45), Christoph Lameter didst pronounce:
   On Fri, 9 Nov 2007, Mel Gorman wrote:
   
 struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
 {
+   /*
+* Use a temporary nodemask for __GFP_THISNODE allocations. If 
the
+* cost of allocating on the stack or the stack usage becomes
+* noticable, allocate the nodemasks per node at boot or 
compile time
+*/
+   if (unlikely(gfp_mask  __GFP_THISNODE)) {
+   nodemask_t nodemask;
   
   Hmmm.. This places a potentially big structure on the stack. nodemask can 
   contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
   gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?
   
  
  That is what I was hinting at in the comment as a possible solution.
  
+
+   return __alloc_pages_internal(gfp_mask, order,
+   zonelist, nodemask_thisnode(numa_node_id(), 
nodemask));
   
   Argh GFP_THISNODE must use the nid passed to alloc_pages_node
   and *not* the local numa node id. Only if the node specified to
   alloc_pages nodes is -1 will this work.
   
  
  alloc_pages_node() calls __alloc_pages_nodemask() though where in this
  function if I'm reading it right is called without a node id. Given no
  other details on the nid, the current one seemed a logical choice.
 
 Yeah, I guess the context here matters (and is a little hard to follow
 because thare are a few places that change in different ways here):
 
 For allocating pages from a particular node (GFP_THISNODE with nid),
 the nid clearly must be specified. This only happens with
 alloc_pages_node(), AFAICT. So, in that interface, the right thing is
 done and the appropriate nodemask will be built.

I agree.  In an earlier patch, Mel was ignoring nid and using
numa_node_id() here.  This was causing your [Nish's] hugetlb pool
allocation patches to fail.  Mel fixed that ~9oct07.  

 
 On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
 is no nid to base the allocation on, so we fallback to numa_node_id()
 [ almost like the nid had been specified as -1 ].
 
 So I guess this is logical -- but I wonder, do we have any callers of
 alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
 alloc_pages_node() exists?

I don't know if we have any current callers that do this, but absent any
documentation specifying otherwise, Mel's implementation matches what
I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
However, we could specify that THISNODE is ignored in __alloc_pages()
and recommend the use of alloc_pages_node() passing numa_node_id() as
the nid parameter to achieve the behavior.  This would eliminate the
check for 'THISNODE in __alloc_pages().  Just mask it off before calling
down to __alloc_pages_internal().

Does this make sense?

Lee


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Lee Schermerhorn wrote:

  On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
  is no nid to base the allocation on, so we fallback to numa_node_id()
  [ almost like the nid had been specified as -1 ].
  
  So I guess this is logical -- but I wonder, do we have any callers of
  alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
  alloc_pages_node() exists?
 
 I don't know if we have any current callers that do this, but absent any
 documentation specifying otherwise, Mel's implementation matches what
 I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
 However, we could specify that THISNODE is ignored in __alloc_pages()
 and recommend the use of alloc_pages_node() passing numa_node_id() as
 the nid parameter to achieve the behavior.  This would eliminate the
 check for 'THISNODE in __alloc_pages().  Just mask it off before calling
 down to __alloc_pages_internal().
 
 Does this make sense?

I like consistency. If someone absolutely wants a local page then 
specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 

What happens though if an MPOL_BIND policy is in effect? The node used 
must then be the nearest node from the policy mask


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [12:18:52 -0500], Lee Schermerhorn wrote:
 On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote:
  On 09.11.2007 [16:14:55 +], Mel Gorman wrote:
   On (09/11/07 07:45), Christoph Lameter didst pronounce:
On Fri, 9 Nov 2007, Mel Gorman wrote:

  struct page * fastcall
  __alloc_pages(gfp_t gfp_mask, unsigned int order,
   struct zonelist *zonelist)
  {
 + /*
 +  * Use a temporary nodemask for __GFP_THISNODE allocations. If 
 the
 +  * cost of allocating on the stack or the stack usage becomes
 +  * noticable, allocate the nodemasks per node at boot or 
 compile time
 +  */
 + if (unlikely(gfp_mask  __GFP_THISNODE)) {
 + nodemask_t nodemask;

Hmmm.. This places a potentially big structure on the stack. nodemask 
can 
contain up to 1024 bits which means 128 bytes. Maybe keep an array of 
gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]?

   
   That is what I was hinting at in the comment as a possible solution.
   
 +
 + return __alloc_pages_internal(gfp_mask, order,
 + zonelist, nodemask_thisnode(numa_node_id(), 
 nodemask));

Argh GFP_THISNODE must use the nid passed to alloc_pages_node
and *not* the local numa node id. Only if the node specified to
alloc_pages nodes is -1 will this work.

   
   alloc_pages_node() calls __alloc_pages_nodemask() though where in this
   function if I'm reading it right is called without a node id. Given no
   other details on the nid, the current one seemed a logical choice.
  
  Yeah, I guess the context here matters (and is a little hard to follow
  because thare are a few places that change in different ways here):
  
  For allocating pages from a particular node (GFP_THISNODE with nid),
  the nid clearly must be specified. This only happens with
  alloc_pages_node(), AFAICT. So, in that interface, the right thing is
  done and the appropriate nodemask will be built.
 
 I agree.  In an earlier patch, Mel was ignoring nid and using
 numa_node_id() here.  This was causing your [Nish's] hugetlb pool
 allocation patches to fail.  Mel fixed that ~9oct07.  

Yep, and that's why I'm on the Cc, I think :)

  On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
  is no nid to base the allocation on, so we fallback to numa_node_id()
  [ almost like the nid had been specified as -1 ].
  
  So I guess this is logical -- but I wonder, do we have any callers of
  alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
  alloc_pages_node() exists?
 
 I don't know if we have any current callers that do this, but absent any
 documentation specifying otherwise, Mel's implementation matches what
 I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
 However, we could specify that THISNODE is ignored in __alloc_pages()
 and recommend the use of alloc_pages_node() passing numa_node_id() as
 the nid parameter to achieve the behavior.  This would eliminate the
 check for 'THISNODE in __alloc_pages().  Just mask it off before calling
 down to __alloc_pages_internal().
 
 Does this make sense?

The caller could also just use -1 as the nid, since then
alloc_pages_node() should do the numa_node_id() for the caller... But I
agree, there is no documentation saying GFP_THISNODE is *not* allowed
for alloc_pages(), so we should probably handle it

-Nish

-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote:
 On Fri, 9 Nov 2007, Lee Schermerhorn wrote:
 
   On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
   is no nid to base the allocation on, so we fallback to numa_node_id()
   [ almost like the nid had been specified as -1 ].
   
   So I guess this is logical -- but I wonder, do we have any callers of
   alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
   alloc_pages_node() exists?
  
  I don't know if we have any current callers that do this, but absent any
  documentation specifying otherwise, Mel's implementation matches what
  I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
  However, we could specify that THISNODE is ignored in __alloc_pages()
  and recommend the use of alloc_pages_node() passing numa_node_id() as
  the nid parameter to achieve the behavior.  This would eliminate the
  check for 'THISNODE in __alloc_pages().  Just mask it off before calling
  down to __alloc_pages_internal().
  
  Does this make sense?
 
 I like consistency. If someone absolutely wants a local page then
 specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 

Fair enough.

 What happens though if an MPOL_BIND policy is in effect? The node used
 must then be the nearest node from the policy mask

Indeed, this probably needs to be validated... Sigh, more interleaving
of policies and everything else...

-Nish


-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Christoph Lameter
On Fri, 9 Nov 2007, Nishanth Aravamudan wrote:

  Indeed, this probably needs to be validated... Sigh, more interleaving
  of policies and everything else...
 
 Hrm, more importantly, isn't this an existing issue? Maybe should be
 resolved separately from the one zonelist patches.

GFP_THISNODE with alloc_pages() currently yields an allocation from the 
first node of the MPOL_BIND zonelist. So its the lowest node of the set.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-11-09 Thread Nishanth Aravamudan
On 09.11.2007 [10:16:07 -0800], Nishanth Aravamudan wrote:
 On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote:
  On Fri, 9 Nov 2007, Lee Schermerhorn wrote:
  
On the other hand, if we call alloc_pages() with GFP_THISNODE set, there
is no nid to base the allocation on, so we fallback to numa_node_id()
[ almost like the nid had been specified as -1 ].

So I guess this is logical -- but I wonder, do we have any callers of
alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when
alloc_pages_node() exists?
   
   I don't know if we have any current callers that do this, but absent any
   documentation specifying otherwise, Mel's implementation matches what
   I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE.
   However, we could specify that THISNODE is ignored in __alloc_pages()
   and recommend the use of alloc_pages_node() passing numa_node_id() as
   the nid parameter to achieve the behavior.  This would eliminate the
   check for 'THISNODE in __alloc_pages().  Just mask it off before calling
   down to __alloc_pages_internal().
   
   Does this make sense?
  
  I like consistency. If someone absolutely wants a local page then
  specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. 
 
 Fair enough.
 
  What happens though if an MPOL_BIND policy is in effect? The node used
  must then be the nearest node from the policy mask
 
 Indeed, this probably needs to be validated... Sigh, more interleaving
 of policies and everything else...

Hrm, more importantly, isn't this an existing issue? Maybe should be
resolved separately from the one zonelist patches.

-Nish

-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-10 Thread Mel Gorman
On (10/10/07 11:53), Lee Schermerhorn didst pronounce:
> On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
> 
> > 
> > Subject: Use specified node ID with GFP_THISNODE if available
> > 
> > It had been assumed that __GFP_THISNODE meant allocating from the local
> > node and only the local node. However, users of alloc_pages_node() may also
> > specify GFP_THISNODE. In this case, only the specified node should be used.
> > This patch will allocate pages only from the requested node when 
> > GFP_THISNODE
> > is used with alloc_pages_node().
> > 
> > [EMAIL PROTECTED]: Detailed analysis of problem]
> > Found-by: Lee Schermerhorn <[EMAIL PROTECTED]>
> > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> > 
> 
> 
> Mel:  I applied this patch [to your v8 series--the most recent, I
> think?] and it does fix the problem.  However, now I'm tripping over
> this warning in __alloc_pages_nodemask:
> 
>   /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
>   WARN_ON(gfp_mask & __GFP_THISNODE);
> 
> for each huge page allocated.  Rather slow as my console is a virtual
> serial line and the warning includes the stack traceback.
> 
> I think we want to just drop this warning, but maybe you have a tighter
> condition that you want to warn about?
> 

I should drop the warning. The nature of the comment and the WARN_ON was
rooted in my belief that "THISNODE means this node I am running on" and the
warning was defensive programming just in case the assumption was broken. Now
we know the assumption was wrong and the warning is bogus.

Thanks Lee.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-10 Thread Nishanth Aravamudan
On 10.10.2007 [11:53:40 -0400], Lee Schermerhorn wrote:
> On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
> 
> > 
> > Subject: Use specified node ID with GFP_THISNODE if available
> > 
> > It had been assumed that __GFP_THISNODE meant allocating from the local
> > node and only the local node. However, users of alloc_pages_node() may also
> > specify GFP_THISNODE. In this case, only the specified node should be used.
> > This patch will allocate pages only from the requested node when 
> > GFP_THISNODE
> > is used with alloc_pages_node().
> > 
> > [EMAIL PROTECTED]: Detailed analysis of problem]
> > Found-by: Lee Schermerhorn <[EMAIL PROTECTED]>
> > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> > 
> 
> 
> Mel:  I applied this patch [to your v8 series--the most recent, I
> think?] and it does fix the problem.  However, now I'm tripping over
> this warning in __alloc_pages_nodemask:
> 
>   /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
>   WARN_ON(gfp_mask & __GFP_THISNODE);
> 
> for each huge page allocated.  Rather slow as my console is a virtual
> serial line and the warning includes the stack traceback.
> 
> I think we want to just drop this warning, but maybe you have a tighter
> condition that you want to warn about?

Sigh, sorry Mel. I see this too on my box. I purely checked the
functionality and didn't think to check the logs, as the tests worked :/

I think it's quite clear that the WARN_ON() makes no sense now, since
alloc_pages_node() now calls __alloc_pages_nodemask().

-Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-10 Thread Lee Schermerhorn
On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:

> 
> Subject: Use specified node ID with GFP_THISNODE if available
> 
> It had been assumed that __GFP_THISNODE meant allocating from the local
> node and only the local node. However, users of alloc_pages_node() may also
> specify GFP_THISNODE. In this case, only the specified node should be used.
> This patch will allocate pages only from the requested node when GFP_THISNODE
> is used with alloc_pages_node().
> 
> [EMAIL PROTECTED]: Detailed analysis of problem]
> Found-by: Lee Schermerhorn <[EMAIL PROTECTED]>
> Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> 


Mel:  I applied this patch [to your v8 series--the most recent, I
think?] and it does fix the problem.  However, now I'm tripping over
this warning in __alloc_pages_nodemask:

/* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
WARN_ON(gfp_mask & __GFP_THISNODE);

for each huge page allocated.  Rather slow as my console is a virtual
serial line and the warning includes the stack traceback.

I think we want to just drop this warning, but maybe you have a tighter
condition that you want to warn about?

Lee

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-10 Thread Lee Schermerhorn
On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
snip
 
 Subject: Use specified node ID with GFP_THISNODE if available
 
 It had been assumed that __GFP_THISNODE meant allocating from the local
 node and only the local node. However, users of alloc_pages_node() may also
 specify GFP_THISNODE. In this case, only the specified node should be used.
 This patch will allocate pages only from the requested node when GFP_THISNODE
 is used with alloc_pages_node().
 
 [EMAIL PROTECTED]: Detailed analysis of problem]
 Found-by: Lee Schermerhorn [EMAIL PROTECTED]
 Signed-off-by: Mel Gorman [EMAIL PROTECTED]
 
snip

Mel:  I applied this patch [to your v8 series--the most recent, I
think?] and it does fix the problem.  However, now I'm tripping over
this warning in __alloc_pages_nodemask:

/* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
WARN_ON(gfp_mask  __GFP_THISNODE);

for each huge page allocated.  Rather slow as my console is a virtual
serial line and the warning includes the stack traceback.

I think we want to just drop this warning, but maybe you have a tighter
condition that you want to warn about?

Lee

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-10 Thread Nishanth Aravamudan
On 10.10.2007 [11:53:40 -0400], Lee Schermerhorn wrote:
 On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
 snip
  
  Subject: Use specified node ID with GFP_THISNODE if available
  
  It had been assumed that __GFP_THISNODE meant allocating from the local
  node and only the local node. However, users of alloc_pages_node() may also
  specify GFP_THISNODE. In this case, only the specified node should be used.
  This patch will allocate pages only from the requested node when 
  GFP_THISNODE
  is used with alloc_pages_node().
  
  [EMAIL PROTECTED]: Detailed analysis of problem]
  Found-by: Lee Schermerhorn [EMAIL PROTECTED]
  Signed-off-by: Mel Gorman [EMAIL PROTECTED]
  
 snip
 
 Mel:  I applied this patch [to your v8 series--the most recent, I
 think?] and it does fix the problem.  However, now I'm tripping over
 this warning in __alloc_pages_nodemask:
 
   /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
   WARN_ON(gfp_mask  __GFP_THISNODE);
 
 for each huge page allocated.  Rather slow as my console is a virtual
 serial line and the warning includes the stack traceback.
 
 I think we want to just drop this warning, but maybe you have a tighter
 condition that you want to warn about?

Sigh, sorry Mel. I see this too on my box. I purely checked the
functionality and didn't think to check the logs, as the tests worked :/

I think it's quite clear that the WARN_ON() makes no sense now, since
alloc_pages_node() now calls __alloc_pages_nodemask().

-Nish

-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-10 Thread Mel Gorman
On (10/10/07 11:53), Lee Schermerhorn didst pronounce:
 On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
 snip
  
  Subject: Use specified node ID with GFP_THISNODE if available
  
  It had been assumed that __GFP_THISNODE meant allocating from the local
  node and only the local node. However, users of alloc_pages_node() may also
  specify GFP_THISNODE. In this case, only the specified node should be used.
  This patch will allocate pages only from the requested node when 
  GFP_THISNODE
  is used with alloc_pages_node().
  
  [EMAIL PROTECTED]: Detailed analysis of problem]
  Found-by: Lee Schermerhorn [EMAIL PROTECTED]
  Signed-off-by: Mel Gorman [EMAIL PROTECTED]
  
 snip
 
 Mel:  I applied this patch [to your v8 series--the most recent, I
 think?] and it does fix the problem.  However, now I'm tripping over
 this warning in __alloc_pages_nodemask:
 
   /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
   WARN_ON(gfp_mask  __GFP_THISNODE);
 
 for each huge page allocated.  Rather slow as my console is a virtual
 serial line and the warning includes the stack traceback.
 
 I think we want to just drop this warning, but maybe you have a tighter
 condition that you want to warn about?
 

I should drop the warning. The nature of the comment and the WARN_ON was
rooted in my belief that THISNODE means this node I am running on and the
warning was defensive programming just in case the assumption was broken. Now
we know the assumption was wrong and the warning is bogus.

Thanks Lee.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Christoph Lameter
On Tue, 9 Oct 2007, Nishanth Aravamudan wrote:

> > > And nodemask_thisnode() always gives us a nodemask with only the node
> > > the current process is running on set, I think?
> > > 
> > 
> > Yes, I interpreted THISNODE to mean "this node I am running on".
> > Callers seemed to expect this but the memoryless needs it to be "this
> > node I am running on unless I specify a node in which case I mean that
> > node.".
> 
> I think that is only true (THISNODE = local node) if the callpath is not
> via alloc_pages_node(). If the callpath is via alloc_pages_node(), then
> it depends on whether the nid parameter is -1 (in which case it is also
> local node) or anything (in which case it is the nid specified). Ah,
> reading further along, that's exactly what your changelog indicates too
> :)

Right. THISNODE means the node we are on or the node that we indicated we 
want to allocate from. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Nishanth Aravamudan
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote:
> First, sorry for being so slow to respond. I was getting ill towards the end
> of last week and am worse now. Brain is in total mush as a result. Thanks
> Lee for finding this problem and thanks to Nish for investigating it properly.
> 
> Comments and candidate fix to one zonelist are below.
> 
> On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
> > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> > > 
> > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> > > to use memory only from a node local to the CPU. As we can now filter the
> > > zonelist based on a nodemask, we filter the standard node zonelist for 
> > > zones
> > > on the local node when GFP_THISNODE is specified.
> > > 
> > > When GFP_THISNODE is used, a temporary nodemask is created with only the
> > > node local to the CPU set. This allows us to eliminate the second 
> > > zonelist.
> > > 
> > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> > > Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
> > 
> > 
> > 
> > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> > > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
> > > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h  
> > > 2007-09-28 15:49:57.0 +0100
> > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 
> > > 2007-09-28 15:55:03.0 +0100
> > 
> > [Reordering the chunks to make my comments a little more logical]
> > 
> > 
> > 
> > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > > +static inline struct zonelist *node_zonelist(int nid)
> > >  {
> > > - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > > + return _DATA(nid)->node_zonelist;
> > >  }
> > > 
> > >  #ifndef HAVE_ARCH_FREE_PAGE
> > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
> > >   if (nid < 0)
> > >   nid = numa_node_id();
> > > 
> > > - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> > > + return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> > >  }
> > 
> > This is alloc_pages_node(), and converting the nid to a zonelist means
> > that lower levels (specifically __alloc_pages() here) are not aware of
> > nids, as far as I can tell.
> 
> Yep, this is correct.
> 
> > This isn't a change, I just want to make
> > sure I understand...
> > 
> > 
> > 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >   struct zonelist *zonelist)
> > >  {
> > > + /*
> > > +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +  * cost of allocating on the stack or the stack usage becomes
> > > +  * noticable, allocate the nodemasks per node at boot or compile time
> > > +  */
> > > + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > + nodemask_t nodemask;
> > > +
> > > + return __alloc_pages_internal(gfp_mask, order,
> > > + zonelist, nodemask_thisnode());
> > > + }
> > > +
> > >   return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> > >  }
> > 
> > 
> > 
> > So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> > nodemask_thisnode() for a nodemask...
> > 
> 
> Also correct.
> 
> > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> > > +{
> > > + /* Build a nodemask for just this node */
> > > + int nid = numa_node_id();
> > > +
> > > + nodes_clear(*nodemask);
> > > + node_set(nid, *nodemask);
> > > +
> > > + return nodemask;
> > > +}
> > 
> > 
> > 
> > And nodemask_thisnode() always gives us a nodemask with only the node
> > the current process is running on set, I think?
> > 
> 
> Yes, I interpreted THISNODE to mean "this node I am running on". Callers
> seemed to expect this but the memoryless needs it to be "this node I am
> running on unless I specify a node in which case I mean that node.".
> 
> > That seems really wrong -- and would explain what Lee was seeing while
> > using my patches for the hugetlb pool allocator to use THISNODE
> > allocations. All the allocations would end up coming from whatever node
> > the process happened to be running on. This obviously messes up hugetlb
> > accounting, as I rely on THISNODE requests returning NULL if they go
> > off-node.
> > 
> > I'm not sure how this would be fixed, as __alloc_pages() no longer has
> > the nid to set in the mask.
> > 
> > Am I wrong in my analysis?
> > 
> 
> No, you seem to be right on the ball. Can you review the following patch
> please and determine if it fixes the problem in a satisfactory manner? I
> think it does and your tests seemed to give proper values with this patch
> applied but brain no worky work and a second opinion is needed.
> 
> 
> Subject: Use specified node ID with GFP_THISNODE if available
> 
> It had been assumed that __GFP_THISNODE meant allocating from the local
> node and only the 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Nishanth Aravamudan
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote:
> First, sorry for being so slow to respond. I was getting ill towards the end
> of last week and am worse now. Brain is in total mush as a result. Thanks
> Lee for finding this problem and thanks to Nish for investigating it properly.
> 
> Comments and candidate fix to one zonelist are below.
> 
> On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
> > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> > > 
> > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> > > to use memory only from a node local to the CPU. As we can now filter the
> > > zonelist based on a nodemask, we filter the standard node zonelist for 
> > > zones
> > > on the local node when GFP_THISNODE is specified.
> > > 
> > > When GFP_THISNODE is used, a temporary nodemask is created with only the
> > > node local to the CPU set. This allows us to eliminate the second 
> > > zonelist.
> > > 
> > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> > > Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
> > 
> > 
> > 
> > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> > > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
> > > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h  
> > > 2007-09-28 15:49:57.0 +0100
> > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 
> > > 2007-09-28 15:55:03.0 +0100
> > 
> > [Reordering the chunks to make my comments a little more logical]
> > 
> > 
> > 
> > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > > +static inline struct zonelist *node_zonelist(int nid)
> > >  {
> > > - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > > + return _DATA(nid)->node_zonelist;
> > >  }
> > > 
> > >  #ifndef HAVE_ARCH_FREE_PAGE
> > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
> > >   if (nid < 0)
> > >   nid = numa_node_id();
> > > 
> > > - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> > > + return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> > >  }
> > 
> > This is alloc_pages_node(), and converting the nid to a zonelist means
> > that lower levels (specifically __alloc_pages() here) are not aware of
> > nids, as far as I can tell.
> 
> Yep, this is correct.
> 
> > This isn't a change, I just want to make
> > sure I understand...
> > 
> > 
> > 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >   struct zonelist *zonelist)
> > >  {
> > > + /*
> > > +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +  * cost of allocating on the stack or the stack usage becomes
> > > +  * noticable, allocate the nodemasks per node at boot or compile time
> > > +  */
> > > + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > + nodemask_t nodemask;
> > > +
> > > + return __alloc_pages_internal(gfp_mask, order,
> > > + zonelist, nodemask_thisnode());
> > > + }
> > > +
> > >   return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> > >  }
> > 
> > 
> > 
> > So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> > nodemask_thisnode() for a nodemask...
> > 
> 
> Also correct.
> 
> > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> > > +{
> > > + /* Build a nodemask for just this node */
> > > + int nid = numa_node_id();
> > > +
> > > + nodes_clear(*nodemask);
> > > + node_set(nid, *nodemask);
> > > +
> > > + return nodemask;
> > > +}
> > 
> > 
> > 
> > And nodemask_thisnode() always gives us a nodemask with only the node
> > the current process is running on set, I think?
> > 
> 
> Yes, I interpreted THISNODE to mean "this node I am running on".
> Callers seemed to expect this but the memoryless needs it to be "this
> node I am running on unless I specify a node in which case I mean that
> node.".

I think that is only true (THISNODE = local node) if the callpath is not
via alloc_pages_node(). If the callpath is via alloc_pages_node(), then
it depends on whether the nid parameter is -1 (in which case it is also
local node) or anything (in which case it is the nid specified). Ah,
reading further along, that's exactly what your changelog indicates too
:)

> > That seems really wrong -- and would explain what Lee was seeing while
> > using my patches for the hugetlb pool allocator to use THISNODE
> > allocations. All the allocations would end up coming from whatever node
> > the process happened to be running on. This obviously messes up hugetlb
> > accounting, as I rely on THISNODE requests returning NULL if they go
> > off-node.
> > 
> > I'm not sure how this would be fixed, as __alloc_pages() no longer has
> > the nid to set in the mask.
> > 
> > Am I wrong in my analysis?
> > 
> 
> No, you seem to be right on the ball. Can you review the following patch
> please and determine if 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Mel Gorman
First, sorry for being so slow to respond. I was getting ill towards the end
of last week and am worse now. Brain is in total mush as a result. Thanks
Lee for finding this problem and thanks to Nish for investigating it properly.

Comments and candidate fix to one zonelist are below.

On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
> On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> > 
> > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> > to use memory only from a node local to the CPU. As we can now filter the
> > zonelist based on a nodemask, we filter the standard node zonelist for zones
> > on the local node when GFP_THISNODE is specified.
> > 
> > When GFP_THISNODE is used, a temporary nodemask is created with only the
> > node local to the CPU set. This allows us to eliminate the second zonelist.
> > 
> > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> > Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> 
> 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
> > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h
> > 2007-09-28 15:49:57.0 +0100
> > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h   
> > 2007-09-28 15:55:03.0 +0100
> 
> [Reordering the chunks to make my comments a little more logical]
> 
> 
> 
> > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > +static inline struct zonelist *node_zonelist(int nid)
> >  {
> > -   return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > +   return _DATA(nid)->node_zonelist;
> >  }
> > 
> >  #ifndef HAVE_ARCH_FREE_PAGE
> > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
> > if (nid < 0)
> > nid = numa_node_id();
> > 
> > -   return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> > +   return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> >  }
> 
> This is alloc_pages_node(), and converting the nid to a zonelist means
> that lower levels (specifically __alloc_pages() here) are not aware of
> nids, as far as I can tell.

Yep, this is correct.

> This isn't a change, I just want to make
> sure I understand...
> 
> 
> 
> >  struct page * fastcall
> >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > struct zonelist *zonelist)
> >  {
> > +   /*
> > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > +* cost of allocating on the stack or the stack usage becomes
> > +* noticable, allocate the nodemasks per node at boot or compile time
> > +*/
> > +   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > +   nodemask_t nodemask;
> > +
> > +   return __alloc_pages_internal(gfp_mask, order,
> > +   zonelist, nodemask_thisnode());
> > +   }
> > +
> > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> >  }
> 
> 
> 
> So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> nodemask_thisnode() for a nodemask...
> 

Also correct.

> > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> > +{
> > +   /* Build a nodemask for just this node */
> > +   int nid = numa_node_id();
> > +
> > +   nodes_clear(*nodemask);
> > +   node_set(nid, *nodemask);
> > +
> > +   return nodemask;
> > +}
> 
> 
> 
> And nodemask_thisnode() always gives us a nodemask with only the node
> the current process is running on set, I think?
> 

Yes, I interpreted THISNODE to mean "this node I am running on". Callers
seemed to expect this but the memoryless needs it to be "this node I am
running on unless I specify a node in which case I mean that node.".

> That seems really wrong -- and would explain what Lee was seeing while
> using my patches for the hugetlb pool allocator to use THISNODE
> allocations. All the allocations would end up coming from whatever node
> the process happened to be running on. This obviously messes up hugetlb
> accounting, as I rely on THISNODE requests returning NULL if they go
> off-node.
> 
> I'm not sure how this would be fixed, as __alloc_pages() no longer has
> the nid to set in the mask.
> 
> Am I wrong in my analysis?
> 

No, you seem to be right on the ball. Can you review the following patch
please and determine if it fixes the problem in a satisfactory manner? I
think it does and your tests seemed to give proper values with this patch
applied but brain no worky work and a second opinion is needed.


Subject: Use specified node ID with GFP_THISNODE if available

It had been assumed that __GFP_THISNODE meant allocating from the local
node and only the local node. However, users of alloc_pages_node() may also
specify GFP_THISNODE. In this case, only the specified node should be used.
This patch will allocate pages only from the requested node when GFP_THISNODE
is used with alloc_pages_node().

[EMAIL PROTECTED]: Detailed 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Mel Gorman
First, sorry for being so slow to respond. I was getting ill towards the end
of last week and am worse now. Brain is in total mush as a result. Thanks
Lee for finding this problem and thanks to Nish for investigating it properly.

Comments and candidate fix to one zonelist are below.

On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
 On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
  
  Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
  to use memory only from a node local to the CPU. As we can now filter the
  zonelist based on a nodemask, we filter the standard node zonelist for zones
  on the local node when GFP_THISNODE is specified.
  
  When GFP_THISNODE is used, a temporary nodemask is created with only the
  node local to the CPU set. This allows us to eliminate the second zonelist.
  
  Signed-off-by: Mel Gorman [EMAIL PROTECTED]
  Acked-by: Christoph Lameter [EMAIL PROTECTED]
 
 snip
 
  diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
  linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
  linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
  --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h
  2007-09-28 15:49:57.0 +0100
  +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h   
  2007-09-28 15:55:03.0 +0100
 
 [Reordering the chunks to make my comments a little more logical]
 
 snip
 
  -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
  +static inline struct zonelist *node_zonelist(int nid)
   {
  -   return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags);
  +   return NODE_DATA(nid)-node_zonelist;
   }
  
   #ifndef HAVE_ARCH_FREE_PAGE
  @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
  if (nid  0)
  nid = numa_node_id();
  
  -   return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
  +   return __alloc_pages(gfp_mask, order, node_zonelist(nid));
   }
 
 This is alloc_pages_node(), and converting the nid to a zonelist means
 that lower levels (specifically __alloc_pages() here) are not aware of
 nids, as far as I can tell.

Yep, this is correct.

 This isn't a change, I just want to make
 sure I understand...
 
 snip
 
   struct page * fastcall
   __alloc_pages(gfp_t gfp_mask, unsigned int order,
  struct zonelist *zonelist)
   {
  +   /*
  +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
  +* cost of allocating on the stack or the stack usage becomes
  +* noticable, allocate the nodemasks per node at boot or compile time
  +*/
  +   if (unlikely(gfp_mask  __GFP_THISNODE)) {
  +   nodemask_t nodemask;
  +
  +   return __alloc_pages_internal(gfp_mask, order,
  +   zonelist, nodemask_thisnode(nodemask));
  +   }
  +
  return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
   }
 
 snip
 
 So alloc_pages_node() calls here and for THISNODE allocations, we go ask
 nodemask_thisnode() for a nodemask...
 

Also correct.

  +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
  +{
  +   /* Build a nodemask for just this node */
  +   int nid = numa_node_id();
  +
  +   nodes_clear(*nodemask);
  +   node_set(nid, *nodemask);
  +
  +   return nodemask;
  +}
 
 snip
 
 And nodemask_thisnode() always gives us a nodemask with only the node
 the current process is running on set, I think?
 

Yes, I interpreted THISNODE to mean this node I am running on. Callers
seemed to expect this but the memoryless needs it to be this node I am
running on unless I specify a node in which case I mean that node..

 That seems really wrong -- and would explain what Lee was seeing while
 using my patches for the hugetlb pool allocator to use THISNODE
 allocations. All the allocations would end up coming from whatever node
 the process happened to be running on. This obviously messes up hugetlb
 accounting, as I rely on THISNODE requests returning NULL if they go
 off-node.
 
 I'm not sure how this would be fixed, as __alloc_pages() no longer has
 the nid to set in the mask.
 
 Am I wrong in my analysis?
 

No, you seem to be right on the ball. Can you review the following patch
please and determine if it fixes the problem in a satisfactory manner? I
think it does and your tests seemed to give proper values with this patch
applied but brain no worky work and a second opinion is needed.


Subject: Use specified node ID with GFP_THISNODE if available

It had been assumed that __GFP_THISNODE meant allocating from the local
node and only the local node. However, users of alloc_pages_node() may also
specify GFP_THISNODE. In this case, only the specified node should be used.
This patch will allocate pages only from the requested node when GFP_THISNODE
is used with alloc_pages_node().

[EMAIL PROTECTED]: Detailed analysis of problem]
Found-by: Lee Schermerhorn [EMAIL PROTECTED]
Signed-off-by: Mel Gorman [EMAIL PROTECTED]

--- 
 include/linux/gfp.h |   10 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Nishanth Aravamudan
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote:
 First, sorry for being so slow to respond. I was getting ill towards the end
 of last week and am worse now. Brain is in total mush as a result. Thanks
 Lee for finding this problem and thanks to Nish for investigating it properly.
 
 Comments and candidate fix to one zonelist are below.
 
 On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
  On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
   
   Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
   to use memory only from a node local to the CPU. As we can now filter the
   zonelist based on a nodemask, we filter the standard node zonelist for 
   zones
   on the local node when GFP_THISNODE is specified.
   
   When GFP_THISNODE is used, a temporary nodemask is created with only the
   node local to the CPU set. This allows us to eliminate the second 
   zonelist.
   
   Signed-off-by: Mel Gorman [EMAIL PROTECTED]
   Acked-by: Christoph Lameter [EMAIL PROTECTED]
  
  snip
  
   diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
   linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
   linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
   --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h  
   2007-09-28 15:49:57.0 +0100
   +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 
   2007-09-28 15:55:03.0 +0100
  
  [Reordering the chunks to make my comments a little more logical]
  
  snip
  
   -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
   +static inline struct zonelist *node_zonelist(int nid)
{
   - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags);
   + return NODE_DATA(nid)-node_zonelist;
}
   
#ifndef HAVE_ARCH_FREE_PAGE
   @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
 if (nid  0)
 nid = numa_node_id();
   
   - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
   + return __alloc_pages(gfp_mask, order, node_zonelist(nid));
}
  
  This is alloc_pages_node(), and converting the nid to a zonelist means
  that lower levels (specifically __alloc_pages() here) are not aware of
  nids, as far as I can tell.
 
 Yep, this is correct.
 
  This isn't a change, I just want to make
  sure I understand...
  
  snip
  
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
 struct zonelist *zonelist)
{
   + /*
   +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
   +  * cost of allocating on the stack or the stack usage becomes
   +  * noticable, allocate the nodemasks per node at boot or compile time
   +  */
   + if (unlikely(gfp_mask  __GFP_THISNODE)) {
   + nodemask_t nodemask;
   +
   + return __alloc_pages_internal(gfp_mask, order,
   + zonelist, nodemask_thisnode(nodemask));
   + }
   +
 return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
}
  
  snip
  
  So alloc_pages_node() calls here and for THISNODE allocations, we go ask
  nodemask_thisnode() for a nodemask...
  
 
 Also correct.
 
   +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
   +{
   + /* Build a nodemask for just this node */
   + int nid = numa_node_id();
   +
   + nodes_clear(*nodemask);
   + node_set(nid, *nodemask);
   +
   + return nodemask;
   +}
  
  snip
  
  And nodemask_thisnode() always gives us a nodemask with only the node
  the current process is running on set, I think?
  
 
 Yes, I interpreted THISNODE to mean this node I am running on.
 Callers seemed to expect this but the memoryless needs it to be this
 node I am running on unless I specify a node in which case I mean that
 node..

I think that is only true (THISNODE = local node) if the callpath is not
via alloc_pages_node(). If the callpath is via alloc_pages_node(), then
it depends on whether the nid parameter is -1 (in which case it is also
local node) or anything (in which case it is the nid specified). Ah,
reading further along, that's exactly what your changelog indicates too
:)

  That seems really wrong -- and would explain what Lee was seeing while
  using my patches for the hugetlb pool allocator to use THISNODE
  allocations. All the allocations would end up coming from whatever node
  the process happened to be running on. This obviously messes up hugetlb
  accounting, as I rely on THISNODE requests returning NULL if they go
  off-node.
  
  I'm not sure how this would be fixed, as __alloc_pages() no longer has
  the nid to set in the mask.
  
  Am I wrong in my analysis?
  
 
 No, you seem to be right on the ball. Can you review the following patch
 please and determine if it fixes the problem in a satisfactory manner? I
 think it does and your tests seemed to give proper values with this patch
 applied but brain no worky work and a second opinion is needed.
 
 
 Subject: Use specified node ID with GFP_THISNODE if available
 
 It had been 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Nishanth Aravamudan
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote:
 First, sorry for being so slow to respond. I was getting ill towards the end
 of last week and am worse now. Brain is in total mush as a result. Thanks
 Lee for finding this problem and thanks to Nish for investigating it properly.
 
 Comments and candidate fix to one zonelist are below.
 
 On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
  On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
   
   Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
   to use memory only from a node local to the CPU. As we can now filter the
   zonelist based on a nodemask, we filter the standard node zonelist for 
   zones
   on the local node when GFP_THISNODE is specified.
   
   When GFP_THISNODE is used, a temporary nodemask is created with only the
   node local to the CPU set. This allows us to eliminate the second 
   zonelist.
   
   Signed-off-by: Mel Gorman [EMAIL PROTECTED]
   Acked-by: Christoph Lameter [EMAIL PROTECTED]
  
  snip
  
   diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
   linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
   linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
   --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h  
   2007-09-28 15:49:57.0 +0100
   +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 
   2007-09-28 15:55:03.0 +0100
  
  [Reordering the chunks to make my comments a little more logical]
  
  snip
  
   -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
   +static inline struct zonelist *node_zonelist(int nid)
{
   - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags);
   + return NODE_DATA(nid)-node_zonelist;
}
   
#ifndef HAVE_ARCH_FREE_PAGE
   @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
 if (nid  0)
 nid = numa_node_id();
   
   - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
   + return __alloc_pages(gfp_mask, order, node_zonelist(nid));
}
  
  This is alloc_pages_node(), and converting the nid to a zonelist means
  that lower levels (specifically __alloc_pages() here) are not aware of
  nids, as far as I can tell.
 
 Yep, this is correct.
 
  This isn't a change, I just want to make
  sure I understand...
  
  snip
  
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
 struct zonelist *zonelist)
{
   + /*
   +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
   +  * cost of allocating on the stack or the stack usage becomes
   +  * noticable, allocate the nodemasks per node at boot or compile time
   +  */
   + if (unlikely(gfp_mask  __GFP_THISNODE)) {
   + nodemask_t nodemask;
   +
   + return __alloc_pages_internal(gfp_mask, order,
   + zonelist, nodemask_thisnode(nodemask));
   + }
   +
 return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
}
  
  snip
  
  So alloc_pages_node() calls here and for THISNODE allocations, we go ask
  nodemask_thisnode() for a nodemask...
  
 
 Also correct.
 
   +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
   +{
   + /* Build a nodemask for just this node */
   + int nid = numa_node_id();
   +
   + nodes_clear(*nodemask);
   + node_set(nid, *nodemask);
   +
   + return nodemask;
   +}
  
  snip
  
  And nodemask_thisnode() always gives us a nodemask with only the node
  the current process is running on set, I think?
  
 
 Yes, I interpreted THISNODE to mean this node I am running on. Callers
 seemed to expect this but the memoryless needs it to be this node I am
 running on unless I specify a node in which case I mean that node..
 
  That seems really wrong -- and would explain what Lee was seeing while
  using my patches for the hugetlb pool allocator to use THISNODE
  allocations. All the allocations would end up coming from whatever node
  the process happened to be running on. This obviously messes up hugetlb
  accounting, as I rely on THISNODE requests returning NULL if they go
  off-node.
  
  I'm not sure how this would be fixed, as __alloc_pages() no longer has
  the nid to set in the mask.
  
  Am I wrong in my analysis?
  
 
 No, you seem to be right on the ball. Can you review the following patch
 please and determine if it fixes the problem in a satisfactory manner? I
 think it does and your tests seemed to give proper values with this patch
 applied but brain no worky work and a second opinion is needed.
 
 
 Subject: Use specified node ID with GFP_THISNODE if available
 
 It had been assumed that __GFP_THISNODE meant allocating from the local
 node and only the local node. However, users of alloc_pages_node() may also
 specify GFP_THISNODE. In this case, only the specified node should be used.
 This patch will allocate pages only from the requested node when GFP_THISNODE
 is used with alloc_pages_node().
 
 [EMAIL PROTECTED]: Detailed 

Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-09 Thread Christoph Lameter
On Tue, 9 Oct 2007, Nishanth Aravamudan wrote:

   And nodemask_thisnode() always gives us a nodemask with only the node
   the current process is running on set, I think?
   
  
  Yes, I interpreted THISNODE to mean this node I am running on.
  Callers seemed to expect this but the memoryless needs it to be this
  node I am running on unless I specify a node in which case I mean that
  node..
 
 I think that is only true (THISNODE = local node) if the callpath is not
 via alloc_pages_node(). If the callpath is via alloc_pages_node(), then
 it depends on whether the nid parameter is -1 (in which case it is also
 local node) or anything (in which case it is the nid specified). Ah,
 reading further along, that's exactly what your changelog indicates too
 :)

Right. THISNODE means the node we are on or the node that we indicated we 
want to allocate from. 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-08 Thread Nishanth Aravamudan
On 08.10.2007 [18:56:05 -0700], Christoph Lameter wrote:
> On Mon, 8 Oct 2007, Nishanth Aravamudan wrote:
> 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >   struct zonelist *zonelist)
> > >  {
> > > + /*
> > > +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +  * cost of allocating on the stack or the stack usage becomes
> > > +  * noticable, allocate the nodemasks per node at boot or compile time
> > > +  */
> > > + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > + nodemask_t nodemask;
> > > +
> > > + return __alloc_pages_internal(gfp_mask, order,
> > > + zonelist, nodemask_thisnode());
> > > + }
> > > +
> > >   return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> > >  }
> > 
> > 
> > 
> > So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> > nodemask_thisnode() for a nodemask...
> 
> H... nodemask_thisnode needs to be passed the zonelist.
> 
> > And nodemask_thisnode() always gives us a nodemask with only the node
> > the current process is running on set, I think?
> 
> Right.
> 
> 
> > That seems really wrong -- and would explain what Lee was seeing while
> > using my patches for the hugetlb pool allocator to use THISNODE
> > allocations. All the allocations would end up coming from whatever node
> > the process happened to be running on. This obviously messes up hugetlb
> > accounting, as I rely on THISNODE requests returning NULL if they go
> > off-node.
> > 
> > I'm not sure how this would be fixed, as __alloc_pages() no longer has
> > the nid to set in the mask.
> > 
> > Am I wrong in my analysis?
> 
> No you are right on target. The thisnode function must determine the
> node from the first zone of the zonelist.

It seems like I would zonelist_node_idx() for this, along the lines of:

static nodemask_t *nodemask_thisnode(nodemask_t *nodemask,
struct zonelist *zonelist)
{
int nid = zonelist_node_idx(zonelist);
/* Build a nodemask for just this node */
nodes_clear(*nodemask);
node_set(nid, *nodemask);

return nodemask;
}

But I think I need to check that zonelist->_zonerefs->zone is !NULL, given this
definition of zonelist_node_idx()

static inline int zonelist_node_idx(struct zoneref *zoneref)
{
#ifdef CONFIG_NUMA
/* zone_to_nid not available in this context */
return zoneref->zone->node;
#else
return 0;
#endif /* CONFIG_NUMA */
}

and this comment in __alloc_pages_internal():


z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */

if (unlikely(!z->zone)) {
/*
 * Happens if we have an empty zonelist as a result of
 * GFP_THISNODE being used on a memoryless node
 */
return NULL;
}
...

It seems like zoneref->zone may be NULL in zonelist_node_idx()? Maybe
someone else should look into resolving this :)

Thanks,
Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-08 Thread Christoph Lameter
On Mon, 8 Oct 2007, Nishanth Aravamudan wrote:

> >  struct page * fastcall
> >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > struct zonelist *zonelist)
> >  {
> > +   /*
> > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > +* cost of allocating on the stack or the stack usage becomes
> > +* noticable, allocate the nodemasks per node at boot or compile time
> > +*/
> > +   if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > +   nodemask_t nodemask;
> > +
> > +   return __alloc_pages_internal(gfp_mask, order,
> > +   zonelist, nodemask_thisnode());
> > +   }
> > +
> > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> >  }
> 
> 
> 
> So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> nodemask_thisnode() for a nodemask...

H... nodemask_thisnode needs to be passed the zonelist.

> And nodemask_thisnode() always gives us a nodemask with only the node
> the current process is running on set, I think?

Right.

 
> That seems really wrong -- and would explain what Lee was seeing while
> using my patches for the hugetlb pool allocator to use THISNODE
> allocations. All the allocations would end up coming from whatever node
> the process happened to be running on. This obviously messes up hugetlb
> accounting, as I rely on THISNODE requests returning NULL if they go
> off-node.
> 
> I'm not sure how this would be fixed, as __alloc_pages() no longer has
> the nid to set in the mask.
> 
> Am I wrong in my analysis?

No you are right on target. The thisnode function must determine the node 
from the first zone of the zonelist.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-08 Thread Nishanth Aravamudan
On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> 
> Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> to use memory only from a node local to the CPU. As we can now filter the
> zonelist based on a nodemask, we filter the standard node zonelist for zones
> on the local node when GFP_THISNODE is specified.
> 
> When GFP_THISNODE is used, a temporary nodemask is created with only the
> node local to the CPU set. This allows us to eliminate the second zonelist.
> 
> Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> Acked-by: Christoph Lameter <[EMAIL PROTECTED]>



> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
> linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h  
> 2007-09-28 15:49:57.0 +0100
> +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 
> 2007-09-28 15:55:03.0 +0100

[Reordering the chunks to make my comments a little more logical]



> -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> +static inline struct zonelist *node_zonelist(int nid)
>  {
> - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> + return _DATA(nid)->node_zonelist;
>  }
> 
>  #ifndef HAVE_ARCH_FREE_PAGE
> @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
>   if (nid < 0)
>   nid = numa_node_id();
> 
> - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> + return __alloc_pages(gfp_mask, order, node_zonelist(nid));
>  }

This is alloc_pages_node(), and converting the nid to a zonelist means
that lower levels (specifically __alloc_pages() here) are not aware of
nids, as far as I can tell. This isn't a change, I just want to make
sure I understand...



>  struct page * fastcall
>  __alloc_pages(gfp_t gfp_mask, unsigned int order,
>   struct zonelist *zonelist)
>  {
> + /*
> +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> +  * cost of allocating on the stack or the stack usage becomes
> +  * noticable, allocate the nodemasks per node at boot or compile time
> +  */
> + if (unlikely(gfp_mask & __GFP_THISNODE)) {
> + nodemask_t nodemask;
> +
> + return __alloc_pages_internal(gfp_mask, order,
> + zonelist, nodemask_thisnode());
> + }
> +
>   return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
>  }



So alloc_pages_node() calls here and for THISNODE allocations, we go ask
nodemask_thisnode() for a nodemask...

> +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> +{
> + /* Build a nodemask for just this node */
> + int nid = numa_node_id();
> +
> + nodes_clear(*nodemask);
> + node_set(nid, *nodemask);
> +
> + return nodemask;
> +}



And nodemask_thisnode() always gives us a nodemask with only the node
the current process is running on set, I think?

That seems really wrong -- and would explain what Lee was seeing while
using my patches for the hugetlb pool allocator to use THISNODE
allocations. All the allocations would end up coming from whatever node
the process happened to be running on. This obviously messes up hugetlb
accounting, as I rely on THISNODE requests returning NULL if they go
off-node.

I'm not sure how this would be fixed, as __alloc_pages() no longer has
the nid to set in the mask.

Am I wrong in my analysis?

Thanks,
Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-08 Thread Nishanth Aravamudan
On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
 
 Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
 to use memory only from a node local to the CPU. As we can now filter the
 zonelist based on a nodemask, we filter the standard node zonelist for zones
 on the local node when GFP_THISNODE is specified.
 
 When GFP_THISNODE is used, a temporary nodemask is created with only the
 node local to the CPU set. This allows us to eliminate the second zonelist.
 
 Signed-off-by: Mel Gorman [EMAIL PROTECTED]
 Acked-by: Christoph Lameter [EMAIL PROTECTED]

snip

 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
 linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 
 linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
 --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h  
 2007-09-28 15:49:57.0 +0100
 +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 
 2007-09-28 15:55:03.0 +0100

[Reordering the chunks to make my comments a little more logical]

snip

 -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 +static inline struct zonelist *node_zonelist(int nid)
  {
 - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags);
 + return NODE_DATA(nid)-node_zonelist;
  }
 
  #ifndef HAVE_ARCH_FREE_PAGE
 @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
   if (nid  0)
   nid = numa_node_id();
 
 - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 + return __alloc_pages(gfp_mask, order, node_zonelist(nid));
  }

This is alloc_pages_node(), and converting the nid to a zonelist means
that lower levels (specifically __alloc_pages() here) are not aware of
nids, as far as I can tell. This isn't a change, I just want to make
sure I understand...

snip

  struct page * fastcall
  __alloc_pages(gfp_t gfp_mask, unsigned int order,
   struct zonelist *zonelist)
  {
 + /*
 +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
 +  * cost of allocating on the stack or the stack usage becomes
 +  * noticable, allocate the nodemasks per node at boot or compile time
 +  */
 + if (unlikely(gfp_mask  __GFP_THISNODE)) {
 + nodemask_t nodemask;
 +
 + return __alloc_pages_internal(gfp_mask, order,
 + zonelist, nodemask_thisnode(nodemask));
 + }
 +
   return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
  }

snip

So alloc_pages_node() calls here and for THISNODE allocations, we go ask
nodemask_thisnode() for a nodemask...

 +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
 +{
 + /* Build a nodemask for just this node */
 + int nid = numa_node_id();
 +
 + nodes_clear(*nodemask);
 + node_set(nid, *nodemask);
 +
 + return nodemask;
 +}

snip

And nodemask_thisnode() always gives us a nodemask with only the node
the current process is running on set, I think?

That seems really wrong -- and would explain what Lee was seeing while
using my patches for the hugetlb pool allocator to use THISNODE
allocations. All the allocations would end up coming from whatever node
the process happened to be running on. This obviously messes up hugetlb
accounting, as I rely on THISNODE requests returning NULL if they go
off-node.

I'm not sure how this would be fixed, as __alloc_pages() no longer has
the nid to set in the mask.

Am I wrong in my analysis?

Thanks,
Nish

-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-08 Thread Christoph Lameter
On Mon, 8 Oct 2007, Nishanth Aravamudan wrote:

   struct page * fastcall
   __alloc_pages(gfp_t gfp_mask, unsigned int order,
  struct zonelist *zonelist)
   {
  +   /*
  +* Use a temporary nodemask for __GFP_THISNODE allocations. If the
  +* cost of allocating on the stack or the stack usage becomes
  +* noticable, allocate the nodemasks per node at boot or compile time
  +*/
  +   if (unlikely(gfp_mask  __GFP_THISNODE)) {
  +   nodemask_t nodemask;
  +
  +   return __alloc_pages_internal(gfp_mask, order,
  +   zonelist, nodemask_thisnode(nodemask));
  +   }
  +
  return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
   }
 
 snip
 
 So alloc_pages_node() calls here and for THISNODE allocations, we go ask
 nodemask_thisnode() for a nodemask...

H... nodemask_thisnode needs to be passed the zonelist.

 And nodemask_thisnode() always gives us a nodemask with only the node
 the current process is running on set, I think?

Right.

 
 That seems really wrong -- and would explain what Lee was seeing while
 using my patches for the hugetlb pool allocator to use THISNODE
 allocations. All the allocations would end up coming from whatever node
 the process happened to be running on. This obviously messes up hugetlb
 accounting, as I rely on THISNODE requests returning NULL if they go
 off-node.
 
 I'm not sure how this would be fixed, as __alloc_pages() no longer has
 the nid to set in the mask.
 
 Am I wrong in my analysis?

No you are right on target. The thisnode function must determine the node 
from the first zone of the zonelist.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask

2007-10-08 Thread Nishanth Aravamudan
On 08.10.2007 [18:56:05 -0700], Christoph Lameter wrote:
 On Mon, 8 Oct 2007, Nishanth Aravamudan wrote:
 
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
 struct zonelist *zonelist)
{
   + /*
   +  * Use a temporary nodemask for __GFP_THISNODE allocations. If the
   +  * cost of allocating on the stack or the stack usage becomes
   +  * noticable, allocate the nodemasks per node at boot or compile time
   +  */
   + if (unlikely(gfp_mask  __GFP_THISNODE)) {
   + nodemask_t nodemask;
   +
   + return __alloc_pages_internal(gfp_mask, order,
   + zonelist, nodemask_thisnode(nodemask));
   + }
   +
 return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
}
  
  snip
  
  So alloc_pages_node() calls here and for THISNODE allocations, we go ask
  nodemask_thisnode() for a nodemask...
 
 H... nodemask_thisnode needs to be passed the zonelist.
 
  And nodemask_thisnode() always gives us a nodemask with only the node
  the current process is running on set, I think?
 
 Right.
 
 
  That seems really wrong -- and would explain what Lee was seeing while
  using my patches for the hugetlb pool allocator to use THISNODE
  allocations. All the allocations would end up coming from whatever node
  the process happened to be running on. This obviously messes up hugetlb
  accounting, as I rely on THISNODE requests returning NULL if they go
  off-node.
  
  I'm not sure how this would be fixed, as __alloc_pages() no longer has
  the nid to set in the mask.
  
  Am I wrong in my analysis?
 
 No you are right on target. The thisnode function must determine the
 node from the first zone of the zonelist.

It seems like I would zonelist_node_idx() for this, along the lines of:

static nodemask_t *nodemask_thisnode(nodemask_t *nodemask,
struct zonelist *zonelist)
{
int nid = zonelist_node_idx(zonelist);
/* Build a nodemask for just this node */
nodes_clear(*nodemask);
node_set(nid, *nodemask);

return nodemask;
}

But I think I need to check that zonelist-_zonerefs-zone is !NULL, given this
definition of zonelist_node_idx()

static inline int zonelist_node_idx(struct zoneref *zoneref)
{
#ifdef CONFIG_NUMA
/* zone_to_nid not available in this context */
return zoneref-zone-node;
#else
return 0;
#endif /* CONFIG_NUMA */
}

and this comment in __alloc_pages_internal():


z = zonelist-_zonerefs;  /* the list of zones suitable for gfp_mask */

if (unlikely(!z-zone)) {
/*
 * Happens if we have an empty zonelist as a result of
 * GFP_THISNODE being used on a memoryless node
 */
return NULL;
}
...

It seems like zoneref-zone may be NULL in zonelist_node_idx()? Maybe
someone else should look into resolving this :)

Thanks,
Nish

-- 
Nishanth Aravamudan [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/