Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
> Do you know when the patch that adds 
> bitmap_onto(), which is a name I think I noticed you liking, will be 
> available?

Thanks in good part to your various questions regarding it, I realized
a significant error in the central piece of code in that patch, the
routine lib/bitmap.c:bitmap_onto().  I should be done in a few hours
reworking it, and will post it then.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, Paul Jackson wrote:

> --- 2.6.24-mm1.orig/include/linux/mempolicy.h 2008-02-15 00:11:10.0 
> -0800
> +++ 2.6.24-mm1/include/linux/mempolicy.h  2008-02-15 15:16:16.031031424 
> -0800
> @@ -8,6 +8,14 @@
>   * Copyright 2003,2004 Andi Kleen SuSE Labs
>   */
>  
> +/*
> + * The 'policy' field of 'struct mempolicy' has both a mode and
> + * some flags packed into it.  The flags (MPOL_F_* below) occupy
> + * the high bit positions (MPOL_MODE_FLAGS), and the mempolicy
> + * modes (the "Policies" below) are encoded in the remaining low
> + * bit positions.
> + */
> +
>  /* Policies */
>  enum {
>   MPOL_DEFAULT,
> @@ -18,16 +26,12 @@ enum {
>  };
>  
>  /*
> - * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
> - * constants defined in the above enum.  The upper bits represent optional
> - * set_mempolicy() or mbind() MPOL_F_* mode flags.
> + * Optional flags that modify nodemask numbering.
>   */
> -#define MPOL_FLAG_SHIFT  (8)
> -#define MPOL_MODE_MASK   ((1 << MPOL_FLAG_SHIFT) - 1)
> -
> -/* Flags for set_mempolicy */
> -#define MPOL_F_STATIC_NODES  (1 << MPOL_FLAG_SHIFT)
> -#define MPOL_MODE_FLAGS  (MPOL_F_STATIC_NODES)   /* legal 
> set_mempolicy() MPOL_F_* flags */
> +#define MPOL_F_RELATIVE_NODES (1<<14)/* remapped relative to 
> cpuset */
> +#define MPOL_F_STATIC_NODES (1<<15)  /* unremapped physical masks */
> +#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
> + /* combined MPOL_F_* mask flags 
> */
>  
>  /* Flags for get_mempolicy */
>  #define MPOL_F_NODE  (1<<0)  /* return next IL mode instead of node mask */
> @@ -128,14 +132,14 @@ static inline int mpol_equal(struct memp
>  
>  #define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
>  
> -static inline unsigned char mpol_mode(unsigned short mode)
> +static inline unsigned short mpol_mode(unsigned short mode)
>  {
> - return mode & MPOL_MODE_MASK;
> + return mode & ~MPOL_MODE_FLAGS;
>  }
>  
>  static inline unsigned short mpol_flags(unsigned short mode)
>  {
> - return mode & ~MPOL_MODE_MASK;
> + return mode & MPOL_MODE_FLAGS;
>  }
>  
>  /*
> @@ -201,7 +205,7 @@ static inline int mpol_equal(struct memp
>  
>  #define mpol_set_vma_default(vma) do {} while(0)
>  
> -static inline unsigned char mpol_mode(unsigned short mode)
> +static inline unsigned short mpol_mode(unsigned short mode)
>  {
>   return 0;
>  }
> --- 2.6.24-mm1.orig/mm/mempolicy.c2008-02-15 00:18:35.0 -0800
> +++ 2.6.24-mm1/mm/mempolicy.c 2008-02-15 08:16:52.034431591 -0800
> @@ -884,8 +884,6 @@ asmlinkage long sys_mbind(unsigned long 
>  
>   if (mpol_mode(mode) >= MPOL_MAX)
>   return -EINVAL;
> - if (mpol_flags(mode) & ~MPOL_MODE_FLAGS)
> - return -EINVAL;
>   err = get_nodes(, nmask, maxnode);
>   if (err)
>   return err;
> @@ -898,13 +896,9 @@ asmlinkage long sys_set_mempolicy(int mo
>  {
>   int err;
>   nodemask_t nodes;
> - unsigned short flags;
>  
>   if (mpol_mode(mode) >= MPOL_MAX)
>   return -EINVAL;
> - if (mpol_flags(mode) & ~MPOL_MODE_FLAGS)
> - return -EINVAL;
> - flags = mpol_flags(mode) & MPOL_MODE_FLAGS;
>   err = get_nodes(, nmask, maxnode);
>   if (err)
>   return err;
> 

There's been significant changes in this area since my last posting, but I 
agree that doing a slight variation of this is better.

On that topic, I am ready to post the updated patchset but I'd like to do 
it with your bitmap_onto() patch so that I can fully implement and test 
MPOL_F_RELATIVE_NODES.  Do you know when the patch that adds 
bitmap_onto(), which is a name I think I noticed you liking, will be 
available?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
David responding to pj:
> > I don't think so.  It's not possible to detemine if exactly the low
> > eight bits of the 'policy' short are a valid mode, -however- that
> > "eight" is a spurious detail.  Remove it.
> 
> Sure it is, you can determine if the low MPOL_FLAG_SHIFT bits are a valid 
> mode by masking off the upper bits and testing if the result is >= 
> MPOL_MAX.

Sorry ... my response was ambiguous, so not surprisingly you read it
the other way than I intended it and missed my point.

Let me try this again.

Yes, if we drop that MPOL_FLAG_SHIFT, we can't tell if exactly the low
eight bits are valid, -however- we can still tell if the low bits not
used by our MPOL_F_* flag bits are valid, and that's sufficient.

I included a PATCH in my last reply, in order to demonstrate this.


> Well, it's still an implementation detail that needs to be explicitly 
> defined, otherwise we have no way to separate mode from flags when the 
> user passes in their int formal as part of either syscall.

No.

That is not an implementation detail that must be explicitly defined
in order to separate the mode from the flags.

One can separate out the high order flag bits by simply masking them off.


> We should make sure to return -EINVAL to a user specifying an invalid flag 
> if, for example, they are using an older kernel that doesn't support what 
> they're asking for.

I agree that we must return -EINVAL if the user specifies an invalid
flag.  I have always agreed with this.

==

> It would be possible to do all of this in both sys_set_mempolicy() and 
> sys_mbind() by masking off the set of possible modes and checking the 
> result for being >= MPOL_MAX:

Bingo!!

That's what I've been saying all along.

I'm not quite sure how we got here; this seems to me such a sharp turn
from what went before that I fear I've misread you, but I suppose I
should just be glad we're in agreement, despite the strange journey.

You go on to suggest a couple of variations of this check, by first
masking off the high order flag bits:

if ((mode & MPOL_MODE_FLAGS) >= MPOL_MAX)
return -EINVAL;

and then, in a follow-on message, refining that line of thought with:

unsigned short flags;

flags = mode & MPOL_MODE_FLAGS;
mode &= ~MPOL_MODE_FLAGS;
if (mode < 0 || mode >= MPOL_MAX)
return -EINVAL;

Actually, I don't think you need to add either variation, as I think you
-already- have that check, in both sys_mbind() and sys_set_mempolicy(),
as the check:

if (mpol_mode(mode) >= MPOL_MAX)
return -EINVAL;

With one minor tweak (changing the return type of mpol_mode() from
uchar to ushort), here again is the PATCH of my last reply, with this
check present (actually, it's a check of yours, unchanged from what has
been in your patch for days now).

Does this PATCH do what you have in mind?

---
 include/linux/mempolicy.h |   30 +-
 mm/mempolicy.c|6 --
 2 files changed, 17 insertions(+), 19 deletions(-)

--- 2.6.24-mm1.orig/include/linux/mempolicy.h   2008-02-15 00:11:10.0 
-0800
+++ 2.6.24-mm1/include/linux/mempolicy.h2008-02-15 15:16:16.031031424 
-0800
@@ -8,6 +8,14 @@
  * Copyright 2003,2004 Andi Kleen SuSE Labs
  */
 
+/*
+ * The 'policy' field of 'struct mempolicy' has both a mode and
+ * some flags packed into it.  The flags (MPOL_F_* below) occupy
+ * the high bit positions (MPOL_MODE_FLAGS), and the mempolicy
+ * modes (the "Policies" below) are encoded in the remaining low
+ * bit positions.
+ */
+
 /* Policies */
 enum {
MPOL_DEFAULT,
@@ -18,16 +26,12 @@ enum {
 };
 
 /*
- * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
- * constants defined in the above enum.  The upper bits represent optional
- * set_mempolicy() or mbind() MPOL_F_* mode flags.
+ * Optional flags that modify nodemask numbering.
  */
-#define MPOL_FLAG_SHIFT(8)
-#define MPOL_MODE_MASK ((1 << MPOL_FLAG_SHIFT) - 1)
-
-/* Flags for set_mempolicy */
-#define MPOL_F_STATIC_NODES(1 << MPOL_FLAG_SHIFT)
-#define MPOL_MODE_FLAGS(MPOL_F_STATIC_NODES)   /* legal 
set_mempolicy() MPOL_F_* flags */
+#define MPOL_F_RELATIVE_NODES (1<<14)  /* remapped relative to cpuset 
*/
+#define MPOL_F_STATIC_NODES (1<<15)/* unremapped physical masks */
+#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
+   /* combined MPOL_F_* mask flags 
*/
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE(1<<0)  /* return next IL mode instead of node mask */
@@ -128,14 +132,14 @@ static inline int mpol_equal(struct memp
 
 #define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
 
-static inline unsigned char mpol_mode(unsigned short mode)
+static inline unsigned short mpol_mode(unsigned short mode)
 {
-   return mode & MPOL_MODE_MASK;
+   return mode & ~MPOL_MODE_FLAGS;

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, David Rientjes wrote:

> It would be possible to do all of this in both sys_set_mempolicy() and 
> sys_mbind() by masking off the set of possible modes and checking the 
> result for being >= MPOL_MAX:
> 
>   if ((mode & MPOL_MODE_FLAGS) >= MPOL_MAX)
>   return -EINVAL;

Actually, in sys_setmempolicy() or sys_mbind() where mode is an int:

unsigned short flags;

flags = mode & MPOL_MODE_FLAGS;
mode &= ~MPOL_MODE_FLAGS;
if (mode < 0 || mode >= MPOL_MAX)
return -EINVAL;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, Paul Jackson wrote:

> I don't think so.  It's not possible to detemine if exactly the low
> eight bits of the 'policy' short are a valid mode, -however- that
> "eight" is a spurious detail.  Remove it.
> 

Sure it is, you can determine if the low MPOL_FLAG_SHIFT bits are a valid 
mode by masking off the upper bits and testing if the result is >= 
MPOL_MAX.

> Instead of specifying that the 'policy' short consists of two bytes,
> one for mode and one for flags, rather specify that the policy fields
> consists of some high bits for flags, and the remaining low bits
> (however many remain) for the mode.
> 

Well, it's still an implementation detail that needs to be explicitly 
defined, otherwise we have no way to separate mode from flags when the 
user passes in their int formal as part of either syscall.

And given the recent requirements for a perfectly formed set_mempolicy() 
or mbind() syscall (as a result of the discussion about allowing non-empty 
nodemasks with MPOL_DEFAULT), we need to ensure that invalid flags are not 
being passed.

We should make sure to return -EINVAL to a user specifying an invalid flag 
if, for example, they are using an older kernel that doesn't support what 
they're asking for.

It would be possible to do all of this in both sys_set_mempolicy() and 
sys_mbind() by masking off the set of possible modes and checking the 
result for being >= MPOL_MAX:

if ((mode & MPOL_MODE_FLAGS) >= MPOL_MAX)
return -EINVAL;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, Paul Jackson wrote:

> So that last line should be:
> 
> > 1,3,5   4-105,7,9
> 

What about this case where one of the relative nodes wraps around to 
represent an already set node in the result?

relativemems_allowedresult
1,3,6   4-8 5,7 or 5-7 ?

Neither result is immediately obvious to me logically: either your result 
has less weight than your relative nodemask (seems like a bad thing) or 
your relative nodemask really isn't all that relative to begin with (it's 
the same result as 1-3, 6-8, 11-13, etc).

Or is this just a less-than-desired side-effect of relative nodemasks that 
we're willing to live with given its other advantages?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
> So basically the "relative" nodemask that is passed with 
> MPOL_F_RELATIVE_NODES is wrapped around the allowed nodes?
> 
>   relative nodemask   mems_allowedresult
>   1,3,5   4   4
>   1,3,5   4-6 4-6
>   1,3,5   4-8 4-5,7
>   1,3,5   4-104,6,8
> 
> Is that correct?

By my calculation, all but the last line is correct.

We use zero-based numbering, so relative node '1' is the
'second' node, and the 'second' node in allowed nodes 4-10
is node 5, not 4.  Similarly for relative nodes '3' and '5'.

So that last line should be:

>   1,3,5   4-105,7,9


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
> like MPOL_F_RELATIVE_NODES or whatever Paul decides to call it 

MPOL_F_RELATIVE_NODES it is.

I see no compelling need for this name to track whatever
we name the relative bitmap operator.  They are related,
but not the same thing.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
David wrote:
> But without MPOL_FLAG_SHIFT it becomes 
> impossible to determine whether a user passed an invalid flag.

I don't think so.  It's not possible to detemine if exactly the low
eight bits of the 'policy' short are a valid mode, -however- that
"eight" is a spurious detail.  Remove it.

Instead of specifying that the 'policy' short consists of two bytes,
one for mode and one for flags, rather specify that the policy fields
consists of some high bits for flags, and the remaining low bits
(however many remain) for the mode.

That in turn exposes the opportunity to remove a couple of checks
for "(mpol_flags(mode) & ~MPOL_MODE_FLAGS)" being zero, and that
in turn makes it obvious that the 'flags' variable in
sys_set_mempolicy() is unused.

Consider the following patch, on top of the latest you sent me a day
ago.

---
 include/linux/mempolicy.h |   26 +++---
 mm/mempolicy.c|6 --
 2 files changed, 15 insertions(+), 17 deletions(-)

--- 2.6.24-mm1.orig/include/linux/mempolicy.h   2008-02-15 00:11:10.0 
-0800
+++ 2.6.24-mm1/include/linux/mempolicy.h2008-02-15 01:13:51.597894429 
-0800
@@ -8,6 +8,14 @@
  * Copyright 2003,2004 Andi Kleen SuSE Labs
  */
 
+/*
+ * The 'policy' field of 'struct mempolicy' has both a mode and
+ * some flags packed into it.  The flags (MPOL_F_* below) occupy
+ * the high bit positions (MPOL_MODE_FLAGS), and the mempolicy
+ * modes (the "Policies" below) are encoded in the remaining low
+ * bit positions.
+ */
+
 /* Policies */
 enum {
MPOL_DEFAULT,
@@ -18,16 +26,12 @@ enum {
 };
 
 /*
- * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
- * constants defined in the above enum.  The upper bits represent optional
- * set_mempolicy() or mbind() MPOL_F_* mode flags.
+ * Optional flags that modify nodemask numbering.
  */
-#define MPOL_FLAG_SHIFT(8)
-#define MPOL_MODE_MASK ((1 << MPOL_FLAG_SHIFT) - 1)
-
-/* Flags for set_mempolicy */
-#define MPOL_F_STATIC_NODES(1 << MPOL_FLAG_SHIFT)
-#define MPOL_MODE_FLAGS(MPOL_F_STATIC_NODES)   /* legal 
set_mempolicy() MPOL_F_* flags */
+#define MPOL_F_RELATIVE_NODES (1<<14)  /* remapped relative to cpuset 
*/
+#define MPOL_F_STATIC_NODES (1<<15)/* unremapped physical masks */
+#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
+   /* combined MPOL_F_* mask flags 
*/
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE(1<<0)  /* return next IL mode instead of node mask */
@@ -130,12 +134,12 @@ static inline int mpol_equal(struct memp
 
 static inline unsigned char mpol_mode(unsigned short mode)
 {
-   return mode & MPOL_MODE_MASK;
+   return mode & ~MPOL_MODE_FLAGS;
 }
 
 static inline unsigned short mpol_flags(unsigned short mode)
 {
-   return mode & ~MPOL_MODE_MASK;
+   return mode & MPOL_MODE_FLAGS;
 }
 
 /*
--- 2.6.24-mm1.orig/mm/mempolicy.c  2008-02-15 00:18:35.0 -0800
+++ 2.6.24-mm1/mm/mempolicy.c   2008-02-15 01:23:53.775748002 -0800
@@ -884,8 +884,6 @@ asmlinkage long sys_mbind(unsigned long 
 
if (mpol_mode(mode) >= MPOL_MAX)
return -EINVAL;
-   if (mpol_flags(mode) & ~MPOL_MODE_FLAGS)
-   return -EINVAL;
err = get_nodes(, nmask, maxnode);
if (err)
return err;
@@ -898,13 +896,9 @@ asmlinkage long sys_set_mempolicy(int mo
 {
int err;
nodemask_t nodes;
-   unsigned short flags;
 
if (mpol_mode(mode) >= MPOL_MAX)
return -EINVAL;
-   if (mpol_flags(mode) & ~MPOL_MODE_FLAGS)
-   return -EINVAL;
-   flags = mpol_flags(mode) & MPOL_MODE_FLAGS;
err = get_nodes(, nmask, maxnode);
if (err)
return err;


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
David wrote:
 But without MPOL_FLAG_SHIFT it becomes 
 impossible to determine whether a user passed an invalid flag.

I don't think so.  It's not possible to detemine if exactly the low
eight bits of the 'policy' short are a valid mode, -however- that
eight is a spurious detail.  Remove it.

Instead of specifying that the 'policy' short consists of two bytes,
one for mode and one for flags, rather specify that the policy fields
consists of some high bits for flags, and the remaining low bits
(however many remain) for the mode.

That in turn exposes the opportunity to remove a couple of checks
for (mpol_flags(mode)  ~MPOL_MODE_FLAGS) being zero, and that
in turn makes it obvious that the 'flags' variable in
sys_set_mempolicy() is unused.

Consider the following patch, on top of the latest you sent me a day
ago.

---
 include/linux/mempolicy.h |   26 +++---
 mm/mempolicy.c|6 --
 2 files changed, 15 insertions(+), 17 deletions(-)

--- 2.6.24-mm1.orig/include/linux/mempolicy.h   2008-02-15 00:11:10.0 
-0800
+++ 2.6.24-mm1/include/linux/mempolicy.h2008-02-15 01:13:51.597894429 
-0800
@@ -8,6 +8,14 @@
  * Copyright 2003,2004 Andi Kleen SuSE Labs
  */
 
+/*
+ * The 'policy' field of 'struct mempolicy' has both a mode and
+ * some flags packed into it.  The flags (MPOL_F_* below) occupy
+ * the high bit positions (MPOL_MODE_FLAGS), and the mempolicy
+ * modes (the Policies below) are encoded in the remaining low
+ * bit positions.
+ */
+
 /* Policies */
 enum {
MPOL_DEFAULT,
@@ -18,16 +26,12 @@ enum {
 };
 
 /*
- * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
- * constants defined in the above enum.  The upper bits represent optional
- * set_mempolicy() or mbind() MPOL_F_* mode flags.
+ * Optional flags that modify nodemask numbering.
  */
-#define MPOL_FLAG_SHIFT(8)
-#define MPOL_MODE_MASK ((1  MPOL_FLAG_SHIFT) - 1)
-
-/* Flags for set_mempolicy */
-#define MPOL_F_STATIC_NODES(1  MPOL_FLAG_SHIFT)
-#define MPOL_MODE_FLAGS(MPOL_F_STATIC_NODES)   /* legal 
set_mempolicy() MPOL_F_* flags */
+#define MPOL_F_RELATIVE_NODES (114)  /* remapped relative to cpuset 
*/
+#define MPOL_F_STATIC_NODES (115)/* unremapped physical masks */
+#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
+   /* combined MPOL_F_* mask flags 
*/
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE(10)  /* return next IL mode instead of node mask */
@@ -130,12 +134,12 @@ static inline int mpol_equal(struct memp
 
 static inline unsigned char mpol_mode(unsigned short mode)
 {
-   return mode  MPOL_MODE_MASK;
+   return mode  ~MPOL_MODE_FLAGS;
 }
 
 static inline unsigned short mpol_flags(unsigned short mode)
 {
-   return mode  ~MPOL_MODE_MASK;
+   return mode  MPOL_MODE_FLAGS;
 }
 
 /*
--- 2.6.24-mm1.orig/mm/mempolicy.c  2008-02-15 00:18:35.0 -0800
+++ 2.6.24-mm1/mm/mempolicy.c   2008-02-15 01:23:53.775748002 -0800
@@ -884,8 +884,6 @@ asmlinkage long sys_mbind(unsigned long 
 
if (mpol_mode(mode) = MPOL_MAX)
return -EINVAL;
-   if (mpol_flags(mode)  ~MPOL_MODE_FLAGS)
-   return -EINVAL;
err = get_nodes(nodes, nmask, maxnode);
if (err)
return err;
@@ -898,13 +896,9 @@ asmlinkage long sys_set_mempolicy(int mo
 {
int err;
nodemask_t nodes;
-   unsigned short flags;
 
if (mpol_mode(mode) = MPOL_MAX)
return -EINVAL;
-   if (mpol_flags(mode)  ~MPOL_MODE_FLAGS)
-   return -EINVAL;
-   flags = mpol_flags(mode)  MPOL_MODE_FLAGS;
err = get_nodes(nodes, nmask, maxnode);
if (err)
return err;


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
 like MPOL_F_RELATIVE_NODES or whatever Paul decides to call it 

MPOL_F_RELATIVE_NODES it is.

I see no compelling need for this name to track whatever
we name the relative bitmap operator.  They are related,
but not the same thing.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
 So basically the relative nodemask that is passed with 
 MPOL_F_RELATIVE_NODES is wrapped around the allowed nodes?
 
   relative nodemask   mems_allowedresult
   1,3,5   4   4
   1,3,5   4-6 4-6
   1,3,5   4-8 4-5,7
   1,3,5   4-104,6,8
 
 Is that correct?

By my calculation, all but the last line is correct.

We use zero-based numbering, so relative node '1' is the
'second' node, and the 'second' node in allowed nodes 4-10
is node 5, not 4.  Similarly for relative nodes '3' and '5'.

So that last line should be:

   1,3,5   4-105,7,9


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, David Rientjes wrote:

 It would be possible to do all of this in both sys_set_mempolicy() and 
 sys_mbind() by masking off the set of possible modes and checking the 
 result for being = MPOL_MAX:
 
   if ((mode  MPOL_MODE_FLAGS) = MPOL_MAX)
   return -EINVAL;

Actually, in sys_setmempolicy() or sys_mbind() where mode is an int:

unsigned short flags;

flags = mode  MPOL_MODE_FLAGS;
mode = ~MPOL_MODE_FLAGS;
if (mode  0 || mode = MPOL_MAX)
return -EINVAL;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, Paul Jackson wrote:

 So that last line should be:
 
  1,3,5   4-105,7,9
 

What about this case where one of the relative nodes wraps around to 
represent an already set node in the result?

relativemems_allowedresult
1,3,6   4-8 5,7 or 5-7 ?

Neither result is immediately obvious to me logically: either your result 
has less weight than your relative nodemask (seems like a bad thing) or 
your relative nodemask really isn't all that relative to begin with (it's 
the same result as 1-3, 6-8, 11-13, etc).

Or is this just a less-than-desired side-effect of relative nodemasks that 
we're willing to live with given its other advantages?

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, Paul Jackson wrote:

 I don't think so.  It's not possible to detemine if exactly the low
 eight bits of the 'policy' short are a valid mode, -however- that
 eight is a spurious detail.  Remove it.
 

Sure it is, you can determine if the low MPOL_FLAG_SHIFT bits are a valid 
mode by masking off the upper bits and testing if the result is = 
MPOL_MAX.

 Instead of specifying that the 'policy' short consists of two bytes,
 one for mode and one for flags, rather specify that the policy fields
 consists of some high bits for flags, and the remaining low bits
 (however many remain) for the mode.
 

Well, it's still an implementation detail that needs to be explicitly 
defined, otherwise we have no way to separate mode from flags when the 
user passes in their int formal as part of either syscall.

And given the recent requirements for a perfectly formed set_mempolicy() 
or mbind() syscall (as a result of the discussion about allowing non-empty 
nodemasks with MPOL_DEFAULT), we need to ensure that invalid flags are not 
being passed.

We should make sure to return -EINVAL to a user specifying an invalid flag 
if, for example, they are using an older kernel that doesn't support what 
they're asking for.

It would be possible to do all of this in both sys_set_mempolicy() and 
sys_mbind() by masking off the set of possible modes and checking the 
result for being = MPOL_MAX:

if ((mode  MPOL_MODE_FLAGS) = MPOL_MAX)
return -EINVAL;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
David responding to pj:
  I don't think so.  It's not possible to detemine if exactly the low
  eight bits of the 'policy' short are a valid mode, -however- that
  eight is a spurious detail.  Remove it.
 
 Sure it is, you can determine if the low MPOL_FLAG_SHIFT bits are a valid 
 mode by masking off the upper bits and testing if the result is = 
 MPOL_MAX.

Sorry ... my response was ambiguous, so not surprisingly you read it
the other way than I intended it and missed my point.

Let me try this again.

Yes, if we drop that MPOL_FLAG_SHIFT, we can't tell if exactly the low
eight bits are valid, -however- we can still tell if the low bits not
used by our MPOL_F_* flag bits are valid, and that's sufficient.

I included a PATCH in my last reply, in order to demonstrate this.


 Well, it's still an implementation detail that needs to be explicitly 
 defined, otherwise we have no way to separate mode from flags when the 
 user passes in their int formal as part of either syscall.

No.

That is not an implementation detail that must be explicitly defined
in order to separate the mode from the flags.

One can separate out the high order flag bits by simply masking them off.


 We should make sure to return -EINVAL to a user specifying an invalid flag 
 if, for example, they are using an older kernel that doesn't support what 
 they're asking for.

I agree that we must return -EINVAL if the user specifies an invalid
flag.  I have always agreed with this.

==

 It would be possible to do all of this in both sys_set_mempolicy() and 
 sys_mbind() by masking off the set of possible modes and checking the 
 result for being = MPOL_MAX:

Bingo!!

That's what I've been saying all along.

I'm not quite sure how we got here; this seems to me such a sharp turn
from what went before that I fear I've misread you, but I suppose I
should just be glad we're in agreement, despite the strange journey.

You go on to suggest a couple of variations of this check, by first
masking off the high order flag bits:

if ((mode  MPOL_MODE_FLAGS) = MPOL_MAX)
return -EINVAL;

and then, in a follow-on message, refining that line of thought with:

unsigned short flags;

flags = mode  MPOL_MODE_FLAGS;
mode = ~MPOL_MODE_FLAGS;
if (mode  0 || mode = MPOL_MAX)
return -EINVAL;

Actually, I don't think you need to add either variation, as I think you
-already- have that check, in both sys_mbind() and sys_set_mempolicy(),
as the check:

if (mpol_mode(mode) = MPOL_MAX)
return -EINVAL;

With one minor tweak (changing the return type of mpol_mode() from
uchar to ushort), here again is the PATCH of my last reply, with this
check present (actually, it's a check of yours, unchanged from what has
been in your patch for days now).

Does this PATCH do what you have in mind?

---
 include/linux/mempolicy.h |   30 +-
 mm/mempolicy.c|6 --
 2 files changed, 17 insertions(+), 19 deletions(-)

--- 2.6.24-mm1.orig/include/linux/mempolicy.h   2008-02-15 00:11:10.0 
-0800
+++ 2.6.24-mm1/include/linux/mempolicy.h2008-02-15 15:16:16.031031424 
-0800
@@ -8,6 +8,14 @@
  * Copyright 2003,2004 Andi Kleen SuSE Labs
  */
 
+/*
+ * The 'policy' field of 'struct mempolicy' has both a mode and
+ * some flags packed into it.  The flags (MPOL_F_* below) occupy
+ * the high bit positions (MPOL_MODE_FLAGS), and the mempolicy
+ * modes (the Policies below) are encoded in the remaining low
+ * bit positions.
+ */
+
 /* Policies */
 enum {
MPOL_DEFAULT,
@@ -18,16 +26,12 @@ enum {
 };
 
 /*
- * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
- * constants defined in the above enum.  The upper bits represent optional
- * set_mempolicy() or mbind() MPOL_F_* mode flags.
+ * Optional flags that modify nodemask numbering.
  */
-#define MPOL_FLAG_SHIFT(8)
-#define MPOL_MODE_MASK ((1  MPOL_FLAG_SHIFT) - 1)
-
-/* Flags for set_mempolicy */
-#define MPOL_F_STATIC_NODES(1  MPOL_FLAG_SHIFT)
-#define MPOL_MODE_FLAGS(MPOL_F_STATIC_NODES)   /* legal 
set_mempolicy() MPOL_F_* flags */
+#define MPOL_F_RELATIVE_NODES (114)  /* remapped relative to cpuset 
*/
+#define MPOL_F_STATIC_NODES (115)/* unremapped physical masks */
+#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
+   /* combined MPOL_F_* mask flags 
*/
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE(10)  /* return next IL mode instead of node mask */
@@ -128,14 +132,14 @@ static inline int mpol_equal(struct memp
 
 #define mpol_set_vma_default(vma) ((vma)-vm_policy = NULL)
 
-static inline unsigned char mpol_mode(unsigned short mode)
+static inline unsigned short mpol_mode(unsigned short mode)
 {
-   return mode  MPOL_MODE_MASK;
+   return mode  ~MPOL_MODE_FLAGS;
 }
 
 static inline unsigned short 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread David Rientjes
On Fri, 15 Feb 2008, Paul Jackson wrote:

 --- 2.6.24-mm1.orig/include/linux/mempolicy.h 2008-02-15 00:11:10.0 
 -0800
 +++ 2.6.24-mm1/include/linux/mempolicy.h  2008-02-15 15:16:16.031031424 
 -0800
 @@ -8,6 +8,14 @@
   * Copyright 2003,2004 Andi Kleen SuSE Labs
   */
  
 +/*
 + * The 'policy' field of 'struct mempolicy' has both a mode and
 + * some flags packed into it.  The flags (MPOL_F_* below) occupy
 + * the high bit positions (MPOL_MODE_FLAGS), and the mempolicy
 + * modes (the Policies below) are encoded in the remaining low
 + * bit positions.
 + */
 +
  /* Policies */
  enum {
   MPOL_DEFAULT,
 @@ -18,16 +26,12 @@ enum {
  };
  
  /*
 - * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
 - * constants defined in the above enum.  The upper bits represent optional
 - * set_mempolicy() or mbind() MPOL_F_* mode flags.
 + * Optional flags that modify nodemask numbering.
   */
 -#define MPOL_FLAG_SHIFT  (8)
 -#define MPOL_MODE_MASK   ((1  MPOL_FLAG_SHIFT) - 1)
 -
 -/* Flags for set_mempolicy */
 -#define MPOL_F_STATIC_NODES  (1  MPOL_FLAG_SHIFT)
 -#define MPOL_MODE_FLAGS  (MPOL_F_STATIC_NODES)   /* legal 
 set_mempolicy() MPOL_F_* flags */
 +#define MPOL_F_RELATIVE_NODES (114)/* remapped relative to 
 cpuset */
 +#define MPOL_F_STATIC_NODES (115)  /* unremapped physical masks */
 +#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
 + /* combined MPOL_F_* mask flags 
 */
  
  /* Flags for get_mempolicy */
  #define MPOL_F_NODE  (10)  /* return next IL mode instead of node mask */
 @@ -128,14 +132,14 @@ static inline int mpol_equal(struct memp
  
  #define mpol_set_vma_default(vma) ((vma)-vm_policy = NULL)
  
 -static inline unsigned char mpol_mode(unsigned short mode)
 +static inline unsigned short mpol_mode(unsigned short mode)
  {
 - return mode  MPOL_MODE_MASK;
 + return mode  ~MPOL_MODE_FLAGS;
  }
  
  static inline unsigned short mpol_flags(unsigned short mode)
  {
 - return mode  ~MPOL_MODE_MASK;
 + return mode  MPOL_MODE_FLAGS;
  }
  
  /*
 @@ -201,7 +205,7 @@ static inline int mpol_equal(struct memp
  
  #define mpol_set_vma_default(vma) do {} while(0)
  
 -static inline unsigned char mpol_mode(unsigned short mode)
 +static inline unsigned short mpol_mode(unsigned short mode)
  {
   return 0;
  }
 --- 2.6.24-mm1.orig/mm/mempolicy.c2008-02-15 00:18:35.0 -0800
 +++ 2.6.24-mm1/mm/mempolicy.c 2008-02-15 08:16:52.034431591 -0800
 @@ -884,8 +884,6 @@ asmlinkage long sys_mbind(unsigned long 
  
   if (mpol_mode(mode) = MPOL_MAX)
   return -EINVAL;
 - if (mpol_flags(mode)  ~MPOL_MODE_FLAGS)
 - return -EINVAL;
   err = get_nodes(nodes, nmask, maxnode);
   if (err)
   return err;
 @@ -898,13 +896,9 @@ asmlinkage long sys_set_mempolicy(int mo
  {
   int err;
   nodemask_t nodes;
 - unsigned short flags;
  
   if (mpol_mode(mode) = MPOL_MAX)
   return -EINVAL;
 - if (mpol_flags(mode)  ~MPOL_MODE_FLAGS)
 - return -EINVAL;
 - flags = mpol_flags(mode)  MPOL_MODE_FLAGS;
   err = get_nodes(nodes, nmask, maxnode);
   if (err)
   return err;
 

There's been significant changes in this area since my last posting, but I 
agree that doing a slight variation of this is better.

On that topic, I am ready to post the updated patchset but I'd like to do 
it with your bitmap_onto() patch so that I can fully implement and test 
MPOL_F_RELATIVE_NODES.  Do you know when the patch that adds 
bitmap_onto(), which is a name I think I noticed you liking, will be 
available?

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-15 Thread Paul Jackson
 Do you know when the patch that adds 
 bitmap_onto(), which is a name I think I noticed you liking, will be 
 available?

Thanks in good part to your various questions regarding it, I realized
a significant error in the central piece of code in that patch, the
routine lib/bitmap.c:bitmap_onto().  I should be done in a few hours
reworking it, and will post it then.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, David Rientjes wrote:

> We'll need to decide whether mpol_equal() is determining the equality of 
> the currently effected mempolicy (whereas policy->user_nodemask is 
> irrelevant) or the whole intended mempolicy overall.
> 

I've done the latter in the latest patchset.

Since mpol_equal() is only used for determining whether nearby vmas can be 
merged, it is only logical to merge them if they share the mempolicy 
intent of the user if MPOL_F_STATIC_NODES or any flag that makes sense of 
policy->user_nodemask is used.

> > Could we have mpol_to_str() mark policies which are
> > MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES?  Perhaps
> > by adding a suffix of "|relative" or "|static" or some
> > such.
> > 
> 
> I'd like to keep it in the same format as the tmpfs mount option which is 
> '=relative' and '=static'.
> 

In preparation for mode flags in addition to MPOL_F_STATIC_NODES to be 
added (like MPOL_F_RELATIVE_NODES or whatever Paul decides to call it 
based on the latest feedback, I've prepared mpol_to_str() to have this 
format

interleave=static|relative=1-3

as viewable from /proc/pid/numa_maps.  This is also how tmpfs mount 
options will be specified.

 [ Of course the above example can't happen since MPOL_F_STATIC_NODES and
   MPOL_F_RELATIVE_NODES are disjoint, but it's sufficient for the
   example. ]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, Paul Jackson wrote:

> In mempolicy.h, the lines:
> 
> /*
>  * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
>  * constants defined in enum mempolicy_mode.  The upper bits represent
>  * optional set_mempolicy() MPOL_F_* mode flags.
>  */
> #define MPOL_FLAG_SHIFT (8)
> #define MPOL_MODE_MASK  ((1 << MPOL_FLAG_SHIFT) - 1)
> 
> /* Flags for set_mempolicy */
> #define MPOL_F_STATIC_NODES (1 << MPOL_FLAG_SHIFT)
> #define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES)   /* legal set_mempolicy() 
> MPOL_* mode flags */
> 
> could be simplified, to:
> 
> /*
>  * Optional flags that modify nodemask numbering.
>  */
> #define MPOL_F_RELATIVE_NODES (1<<14) /* remapped relative to cpuset 
> */
> #define MPOL_F_STATIC_NODES (1<<15)   /* unremapped physical 
> masks */
> #define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
>   /* combined MPOL_F_* mask flags 
> */
> 
> (really, that MPOL_FLAG_SHIFT is just unnecessary distracting detail.)
> 

It would be easy to define mpol_mode() and mpol_flags() in terms of 
MPOL_MODE_FLAGS as well, yes.  But without MPOL_FLAG_SHIFT it becomes 
impossible to determine whether a user passed an invalid flag.

I think we're all in agreement that passing an invalid flag bit should be 
rejected with -EINVAL.  So to do that we need MPOL_MODE_MASK to expicitly 
define the parts of the int *policy passed from set_mempolicy() that 
represent the mode.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, Paul Jackson wrote:

> No and yes.  The manner in which too many nodes (as requested in a
> RELATIVE mask) are folded into too small a cpuset is not actually
> that critical, so long as it doesn't come up empty.  However, what
> I'll be recommending, in a follow-up patch, will be folding the
> larger set into the smaller one modulo the size of the smaller one.
> 

So basically the "relative" nodemask that is passed with 
MPOL_F_RELATIVE_NODES is wrapped around the allowed nodes?

relative nodemask   mems_allowedresult
1,3,5   4   4
1,3,5   4-6 4-6
1,3,5   4-8 4-5,7
1,3,5   4-104,6,8

Is that correct?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, Paul Jackson wrote:

> If we ever call mpol_rebind_policy() with an 
> MPOL_PREFERRED|MPOL_F_STATIC_NODES
> policy when the preferred_node doesn't happen to be in the current cpuset,
> then would the following lines loose the preferred node setting, such that
> it didn't get applied again correctly if that node came back into our allowed
> cpuset ?
> 
> case MPOL_PREFERRED:
> if (!remap && !node_isset(pol->v.preferred_node, *newmask))
> pol->v.preferred_node = -1;
> 

That's already been corrected as a result of a discussion between Lee and 
myself (please see the incremental patch that I sent you privately when I 
sent the patchset along).

> Should the mpol_equal() algorithm change, in the case of either
> MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES, to checking
> user_nodemask equality, -instead- of the "switch(mode)"
> mode specific tests?
> 

That's a good question.

We'll need to decide whether mpol_equal() is determining the equality of 
the currently effected mempolicy (whereas policy->user_nodemask is 
irrelevant) or the whole intended mempolicy overall.

I didn't originally modify mpol_equal() because I preferred the former.  
Is there a compelling case for the latter where mpol_equal() is used in 
the tree that would require this change?

> Could we have mpol_to_str() mark policies which are
> MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES?  Perhaps
> by adding a suffix of "|relative" or "|static" or some
> such.
> 

I'd like to keep it in the same format as the tmpfs mount option which is 
'=relative' and '=static'.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
I had taken a vow of silence on this one, but couldn't resist one more
round.

David wrote:
> My preference (both mode and flags stored in the same member of struct 
> mempolicy):
> 
>Advantages:
> 
>   - completely consistent with the userspace API of passing modes
> and flags together in a pointer to an int, and

True -- the necessary overloaded ugliness of the kernel-user system
call API is propogated throughout mempolicy.c.  However, I wouldn't
necessarily call that an advantage.

>   - does not require additional formals to be added to several
> functions, including functions outside mm/mempolicy.c.

... but does require the use of the new mpol_flags() and mpol_mode()
macros by code in mmshmem.c, outside of mm/mempolicy.c

>Disadvantage:
> 
>   - use of mpol_mode() throughout mm/mempolicy.c code to mask
> off optional mode flags for conditionals or switch statements.

This will cause a bug in the future, that escapes into the wild, when
someone forgets one of these.  I'll bet on that.  It's fragile, because
(1) such errors are easy to make, and (2) hard to catch.

> Your preference (separate mode and flags members in struct mempolicy):
> 
>Advantages:
> 
>   - clearer implementation when dealing with modes: all existing
> statements involving pol->policy can remain unchanged.

Clear, robust code - that's the biggie.

>Disadvantages:
> 
>   - requires additional formals to be added to several functions,
> including functions outside mm/mempolicy.c, and

True -- though by this argument, we'd routinely aggregate multiple flags
and small words into single integer parameters, just to minimize the
parameter count.  Putting two flags in one parameter is a false
simplification, unless required by circumstance, such as communicating
with deep space probes or across the system call boundary with existing
API's.

Across the system call boundary, we have little choice, for compatibility
reasons.  But kernel internal interfaces are not so constrained, and the
ugliness at the system call boundary can be quarantined from most of the
mempolicy.c code.

>   - takes additional space in struct mempolicy (two bytes) which
> could eventually be used for something else.

To be clear to others, as you know, we're not talking here about
growing the sizeof(struct mempolicy) at present, but rather about using
some currently unused bytes in the struct.

More often than not, when someone adds complexity that is not needed at
present, because it might be needed in the future, they are making it
harder to maintain the code, not easier.

The single most important thing we can do to improve future
maintainability of code is to make it more readable.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
David wrote:
> So let's say, like my first example from the previous email, that you have 
> MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES over nodes 3-4 and your cpuset's 
> mems is only nodes 5-7.  This would interleave over no nodes.  Correct?

Given what I said yesterday, that would be a correct conclusion.

However, as I just posted, what I said yesterday was wrong.

Using the same table format as I used in what I just posted, we see that
the two nodes 5 and 6 would be included in the interleave.  Each requested
node gets folded modulo the size (weight) of the current mems_allowed,
into some actual node to be included in the interleave.

  Given
N: how many nodes are allowed by the tasks cpuset (3, here [5,6,7])

   nm := (n % N)r := m-th set node in allowed
requested mod 3 result (in set 5,6,7 allowed)
   30   5
   41   6

> it just cares about the order.

MPOL_F_RELATIVE_NODES is for use by jobs consisting of multiple compute
threads which care about how many CPUs they run on, how many memory nodes
they have, and which nodes are closest to which CPUs.  In the typical case
all CPUs are equivalent, and all memory nodes are equivalent, except for
the basic topology issues of which nodes are near which CPUs, and in the
bigger NUMA systems (the less uniform NUMA systems, I should say), whether
all these CPUs and nodes are "close to" each other in the larger system.

> I don't understand the use case for this

The use case is the original motivating use case for cpusets, and for
my customer audience still one of the largest use cases of interest.

They are running N-way tightly coupled parallel threads, with sometimes
extreme sensitivity to data placement.  The job may run for hours, even
days, with exactly N threads, on exactly N CPUs, carefully placing each
data page on the memory node nearest the CPU that will access it most
frequently.  They may be using OpenMP, synchronizing many times per
second across many hundreds of threads.  If -any- of the threads takes
a few per-cent longer due to poor page placement, the entire job slows
as much.  If various threads, at uncorrelated times, each slow a few
per-cent, then the entire job can take twice as long.  When you're
computing tomorrows weather forecast, getting the result the day after
tomorrow is rather useless.

Now, if the job is critical enough, the customer simply won't interfere
with it or move it to other CPUs or any of that nonsense.  But when
running a mix of such jobs, under a batch scheduler such as PBS or LSF,
where the jobs have varying priority and varying resource needs, then
the batch scheduler will, intentionally, move less critical jobs about,
sometimes suspending them entirely, in order to provide the more
critical jobs with exactly what they need, while keeping the
utilization of the rest of the system as high as possible.

> It comes 
> with the premise that the task doesn't already know it's cpuset mems 
> (otherwise, the current implementation without MPOL_F_STATIC_NODES would 
> work fine for this)

No.

The current "classical" implementation doesn't work adequately.

I think I've already explained this before, so will just copy my
previous explanation here.  Perhaps you missed it the first time,
or perhaps something in it isn't clear enough.

=
This second, cpuset relative, mode is required:

1) to provide a race-free means for an application to place its memory
   when the application considers all physical nodes essentially
   equivalent, and just wants to lay things out relative to whatever
   cpuset it happens to be running in, and

2) to provide a practical means, without the need for constantly
   reprobing ones current cpuset placement, for an application to
   specify cpuset-relative placement to be applied whenever the
   application is placed in a larger cpuset, even if the application
   is presently in a smaller cpuset.

Without it, the application has to first query its current physical
cpuset placement, then calculate its desired relative placement into
terms of the current physical nodes it finds itself on, and then issue
various mbind or set_mempolicy calls using those current physical node
numbers, praying that it wasn't migrated to other physical nodes at the
same time, which would have invalidated its current placement
assumptions in its relative to physical node number calculations.

And without it, if an application is currently placed in a smaller cpuset
than it would prefer, it cannot specify how it would want its mempolicies
should it ever subsequently be moved to a larger cpuset.  This leaves such
an application with little option other than to constantly reprobe its
cpuset placement, in order to know if and when it needs to reissue its
mbind and set_mempolicy calls because it 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
Paul, responding (incorrectly) to David:
> > So, for example, if the task is bound by mems 1-3, and it asks for 
> > MPOL_INTERLEAVE over 2-4, then initially the mempolicy is only effected 
> > over node 3 and if it's later expanded to mems 1-8, then the mempolicy is 
> > effected over nodes 3-5, right?
> > 
> > And if the mems change to 3-8, the mempolicy is remapped to 5-7 even 
> > though 3-5 (which it already was interleaving over) is still accessible?
> 
> Yes and yes, for this cpuset relative numbering mode.

Paul, correcting himself ...

No and yes.  The manner in which too many nodes (as requested in a
RELATIVE mask) are folded into too small a cpuset is not actually
that critical, so long as it doesn't come up empty.  However, what
I'll be recommending, in a follow-up patch, will be folding the
larger set into the smaller one modulo the size of the smaller one.

In your example above, the requested policy asked for nodes 2-4,
so that means it is trying to lay out a memory policy with at least
five (0-4) nodes in consideration.  But the cpuset is only allowing
three (1-3) nodes at the moment.  So for each requested node, we take
its number modulo three to see which of the available nodes to include
in the interleave.

  Given
N: how many nodes are allowed by the tasks cpuset (3, here)

   nm := (n % N)r := m-th set node in allowed
requested mod 3 result (in set 1, 2, 3 allowed)
   22   3
   30   1
   41   2

Hence your first example would result in an interleave over physical
nodes 1, 2, and 3 (the last column above.) ... not just over 3.

I intend to post patches you can use to lib/bitmap.c and the related
cpumask and nodemask apparatus that compute the above for you.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
A few more review comments on details of this patch set ...

=

In mempolicy.h, the lines:

/*
 * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
 * constants defined in enum mempolicy_mode.  The upper bits represent
 * optional set_mempolicy() MPOL_F_* mode flags.
 */
#define MPOL_FLAG_SHIFT (8)
#define MPOL_MODE_MASK  ((1 << MPOL_FLAG_SHIFT) - 1)

/* Flags for set_mempolicy */
#define MPOL_F_STATIC_NODES (1 << MPOL_FLAG_SHIFT)
#define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES)   /* legal set_mempolicy() 
MPOL_* mode flags */

could be simplified, to:

/*
 * Optional flags that modify nodemask numbering.
 */
#define MPOL_F_RELATIVE_NODES (1<<14)   /* remapped relative to cpuset 
*/
#define MPOL_F_STATIC_NODES (1<<15) /* unremapped physical masks */
#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
/* combined MPOL_F_* mask flags 
*/

(really, that MPOL_FLAG_SHIFT is just unnecessary distracting detail.)

=

If we ever call mpol_rebind_policy() with an MPOL_PREFERRED|MPOL_F_STATIC_NODES
policy when the preferred_node doesn't happen to be in the current cpuset,
then would the following lines loose the preferred node setting, such that
it didn't get applied again correctly if that node came back into our allowed
cpuset ?

case MPOL_PREFERRED:
if (!remap && !node_isset(pol->v.preferred_node, *newmask))
pol->v.preferred_node = -1;

=

Instead of the double negative and the use of a new
word 'remap' meaning the opposite of STATIC, as in"

int remap;
...
remap = !(mpol_flags(pol->policy) & MPOL_F_STATIC_NODES);
...
if (!remap)

How about something more direct, as in:

int static_nodes;
...
static_nodes = mpol_flags(pol->policy) & MPOL_F_STATIC_NODES;
...
if (static_nodes)

Each additional logic inversion and additional name that isn't
trivially based on an existing name just makes this code more
difficult to read.

Of course, if we used bit fields for these additional flags,
instead of #defines and masks, the above would get even simpler,
eliminating the extra define's, and replacing the above three lines
with just the one line:

if (pol->f_static_nodes)

But I already failed with that suggestion ... grumble, grumble ...

=

Should the mpol_equal() algorithm change, in the case of either
MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES, to checking
user_nodemask equality, -instead- of the "switch(mode)"
mode specific tests?

=

Could we have mpol_to_str() mark policies which are
MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES?  Perhaps
by adding a suffix of "|relative" or "|static" or some
such.


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
A few more review comments on details of this patch set ...

=

In mempolicy.h, the lines:

/*
 * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
 * constants defined in enum mempolicy_mode.  The upper bits represent
 * optional set_mempolicy() MPOL_F_* mode flags.
 */
#define MPOL_FLAG_SHIFT (8)
#define MPOL_MODE_MASK  ((1  MPOL_FLAG_SHIFT) - 1)

/* Flags for set_mempolicy */
#define MPOL_F_STATIC_NODES (1  MPOL_FLAG_SHIFT)
#define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES)   /* legal set_mempolicy() 
MPOL_* mode flags */

could be simplified, to:

/*
 * Optional flags that modify nodemask numbering.
 */
#define MPOL_F_RELATIVE_NODES (114)   /* remapped relative to cpuset 
*/
#define MPOL_F_STATIC_NODES (115) /* unremapped physical masks */
#define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
/* combined MPOL_F_* mask flags 
*/

(really, that MPOL_FLAG_SHIFT is just unnecessary distracting detail.)

=

If we ever call mpol_rebind_policy() with an MPOL_PREFERRED|MPOL_F_STATIC_NODES
policy when the preferred_node doesn't happen to be in the current cpuset,
then would the following lines loose the preferred node setting, such that
it didn't get applied again correctly if that node came back into our allowed
cpuset ?

case MPOL_PREFERRED:
if (!remap  !node_isset(pol-v.preferred_node, *newmask))
pol-v.preferred_node = -1;

=

Instead of the double negative and the use of a new
word 'remap' meaning the opposite of STATIC, as in

int remap;
...
remap = !(mpol_flags(pol-policy)  MPOL_F_STATIC_NODES);
...
if (!remap)

How about something more direct, as in:

int static_nodes;
...
static_nodes = mpol_flags(pol-policy)  MPOL_F_STATIC_NODES;
...
if (static_nodes)

Each additional logic inversion and additional name that isn't
trivially based on an existing name just makes this code more
difficult to read.

Of course, if we used bit fields for these additional flags,
instead of #defines and masks, the above would get even simpler,
eliminating the extra define's, and replacing the above three lines
with just the one line:

if (pol-f_static_nodes)

But I already failed with that suggestion ... grumble, grumble ...

=

Should the mpol_equal() algorithm change, in the case of either
MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES, to checking
user_nodemask equality, -instead- of the switch(mode)
mode specific tests?

=

Could we have mpol_to_str() mark policies which are
MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES?  Perhaps
by adding a suffix of |relative or |static or some
such.


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
Paul, responding (incorrectly) to David:
  So, for example, if the task is bound by mems 1-3, and it asks for 
  MPOL_INTERLEAVE over 2-4, then initially the mempolicy is only effected 
  over node 3 and if it's later expanded to mems 1-8, then the mempolicy is 
  effected over nodes 3-5, right?
  
  And if the mems change to 3-8, the mempolicy is remapped to 5-7 even 
  though 3-5 (which it already was interleaving over) is still accessible?
 
 Yes and yes, for this cpuset relative numbering mode.

Paul, correcting himself ...

No and yes.  The manner in which too many nodes (as requested in a
RELATIVE mask) are folded into too small a cpuset is not actually
that critical, so long as it doesn't come up empty.  However, what
I'll be recommending, in a follow-up patch, will be folding the
larger set into the smaller one modulo the size of the smaller one.

In your example above, the requested policy asked for nodes 2-4,
so that means it is trying to lay out a memory policy with at least
five (0-4) nodes in consideration.  But the cpuset is only allowing
three (1-3) nodes at the moment.  So for each requested node, we take
its number modulo three to see which of the available nodes to include
in the interleave.

  Given
N: how many nodes are allowed by the tasks cpuset (3, here)

   nm := (n % N)r := m-th set node in allowed
requested mod 3 result (in set 1, 2, 3 allowed)
   22   3
   30   1
   41   2

Hence your first example would result in an interleave over physical
nodes 1, 2, and 3 (the last column above.) ... not just over 3.

I intend to post patches you can use to lib/bitmap.c and the related
cpumask and nodemask apparatus that compute the above for you.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
David wrote:
 So let's say, like my first example from the previous email, that you have 
 MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES over nodes 3-4 and your cpuset's 
 mems is only nodes 5-7.  This would interleave over no nodes.  Correct?

Given what I said yesterday, that would be a correct conclusion.

However, as I just posted, what I said yesterday was wrong.

Using the same table format as I used in what I just posted, we see that
the two nodes 5 and 6 would be included in the interleave.  Each requested
node gets folded modulo the size (weight) of the current mems_allowed,
into some actual node to be included in the interleave.

  Given
N: how many nodes are allowed by the tasks cpuset (3, here [5,6,7])

   nm := (n % N)r := m-th set node in allowed
requested mod 3 result (in set 5,6,7 allowed)
   30   5
   41   6

 it just cares about the order.

MPOL_F_RELATIVE_NODES is for use by jobs consisting of multiple compute
threads which care about how many CPUs they run on, how many memory nodes
they have, and which nodes are closest to which CPUs.  In the typical case
all CPUs are equivalent, and all memory nodes are equivalent, except for
the basic topology issues of which nodes are near which CPUs, and in the
bigger NUMA systems (the less uniform NUMA systems, I should say), whether
all these CPUs and nodes are close to each other in the larger system.

 I don't understand the use case for this

The use case is the original motivating use case for cpusets, and for
my customer audience still one of the largest use cases of interest.

They are running N-way tightly coupled parallel threads, with sometimes
extreme sensitivity to data placement.  The job may run for hours, even
days, with exactly N threads, on exactly N CPUs, carefully placing each
data page on the memory node nearest the CPU that will access it most
frequently.  They may be using OpenMP, synchronizing many times per
second across many hundreds of threads.  If -any- of the threads takes
a few per-cent longer due to poor page placement, the entire job slows
as much.  If various threads, at uncorrelated times, each slow a few
per-cent, then the entire job can take twice as long.  When you're
computing tomorrows weather forecast, getting the result the day after
tomorrow is rather useless.

Now, if the job is critical enough, the customer simply won't interfere
with it or move it to other CPUs or any of that nonsense.  But when
running a mix of such jobs, under a batch scheduler such as PBS or LSF,
where the jobs have varying priority and varying resource needs, then
the batch scheduler will, intentionally, move less critical jobs about,
sometimes suspending them entirely, in order to provide the more
critical jobs with exactly what they need, while keeping the
utilization of the rest of the system as high as possible.

 It comes 
 with the premise that the task doesn't already know it's cpuset mems 
 (otherwise, the current implementation without MPOL_F_STATIC_NODES would 
 work fine for this)

No.

The current classical implementation doesn't work adequately.

I think I've already explained this before, so will just copy my
previous explanation here.  Perhaps you missed it the first time,
or perhaps something in it isn't clear enough.

=
This second, cpuset relative, mode is required:

1) to provide a race-free means for an application to place its memory
   when the application considers all physical nodes essentially
   equivalent, and just wants to lay things out relative to whatever
   cpuset it happens to be running in, and

2) to provide a practical means, without the need for constantly
   reprobing ones current cpuset placement, for an application to
   specify cpuset-relative placement to be applied whenever the
   application is placed in a larger cpuset, even if the application
   is presently in a smaller cpuset.

Without it, the application has to first query its current physical
cpuset placement, then calculate its desired relative placement into
terms of the current physical nodes it finds itself on, and then issue
various mbind or set_mempolicy calls using those current physical node
numbers, praying that it wasn't migrated to other physical nodes at the
same time, which would have invalidated its current placement
assumptions in its relative to physical node number calculations.

And without it, if an application is currently placed in a smaller cpuset
than it would prefer, it cannot specify how it would want its mempolicies
should it ever subsequently be moved to a larger cpuset.  This leaves such
an application with little option other than to constantly reprobe its
cpuset placement, in order to know if and when it needs to reissue its
mbind and set_mempolicy calls because it gained access 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread Paul Jackson
I had taken a vow of silence on this one, but couldn't resist one more
round.

David wrote:
 My preference (both mode and flags stored in the same member of struct 
 mempolicy):
 
Advantages:
 
   - completely consistent with the userspace API of passing modes
 and flags together in a pointer to an int, and

True -- the necessary overloaded ugliness of the kernel-user system
call API is propogated throughout mempolicy.c.  However, I wouldn't
necessarily call that an advantage.

   - does not require additional formals to be added to several
 functions, including functions outside mm/mempolicy.c.

... but does require the use of the new mpol_flags() and mpol_mode()
macros by code in mmshmem.c, outside of mm/mempolicy.c

Disadvantage:
 
   - use of mpol_mode() throughout mm/mempolicy.c code to mask
 off optional mode flags for conditionals or switch statements.

This will cause a bug in the future, that escapes into the wild, when
someone forgets one of these.  I'll bet on that.  It's fragile, because
(1) such errors are easy to make, and (2) hard to catch.

 Your preference (separate mode and flags members in struct mempolicy):
 
Advantages:
 
   - clearer implementation when dealing with modes: all existing
 statements involving pol-policy can remain unchanged.

Clear, robust code - that's the biggie.

Disadvantages:
 
   - requires additional formals to be added to several functions,
 including functions outside mm/mempolicy.c, and

True -- though by this argument, we'd routinely aggregate multiple flags
and small words into single integer parameters, just to minimize the
parameter count.  Putting two flags in one parameter is a false
simplification, unless required by circumstance, such as communicating
with deep space probes or across the system call boundary with existing
API's.

Across the system call boundary, we have little choice, for compatibility
reasons.  But kernel internal interfaces are not so constrained, and the
ugliness at the system call boundary can be quarantined from most of the
mempolicy.c code.

   - takes additional space in struct mempolicy (two bytes) which
 could eventually be used for something else.

To be clear to others, as you know, we're not talking here about
growing the sizeof(struct mempolicy) at present, but rather about using
some currently unused bytes in the struct.

More often than not, when someone adds complexity that is not needed at
present, because it might be needed in the future, they are making it
harder to maintain the code, not easier.

The single most important thing we can do to improve future
maintainability of code is to make it more readable.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, Paul Jackson wrote:

 If we ever call mpol_rebind_policy() with an 
 MPOL_PREFERRED|MPOL_F_STATIC_NODES
 policy when the preferred_node doesn't happen to be in the current cpuset,
 then would the following lines loose the preferred node setting, such that
 it didn't get applied again correctly if that node came back into our allowed
 cpuset ?
 
 case MPOL_PREFERRED:
 if (!remap  !node_isset(pol-v.preferred_node, *newmask))
 pol-v.preferred_node = -1;
 

That's already been corrected as a result of a discussion between Lee and 
myself (please see the incremental patch that I sent you privately when I 
sent the patchset along).

 Should the mpol_equal() algorithm change, in the case of either
 MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES, to checking
 user_nodemask equality, -instead- of the switch(mode)
 mode specific tests?
 

That's a good question.

We'll need to decide whether mpol_equal() is determining the equality of 
the currently effected mempolicy (whereas policy-user_nodemask is 
irrelevant) or the whole intended mempolicy overall.

I didn't originally modify mpol_equal() because I preferred the former.  
Is there a compelling case for the latter where mpol_equal() is used in 
the tree that would require this change?

 Could we have mpol_to_str() mark policies which are
 MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES?  Perhaps
 by adding a suffix of |relative or |static or some
 such.
 

I'd like to keep it in the same format as the tmpfs mount option which is 
'=relative' and '=static'.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, Paul Jackson wrote:

 In mempolicy.h, the lines:
 
 /*
  * The lower MPOL_FLAG_SHIFT bits of the policy mode represent the MPOL_*
  * constants defined in enum mempolicy_mode.  The upper bits represent
  * optional set_mempolicy() MPOL_F_* mode flags.
  */
 #define MPOL_FLAG_SHIFT (8)
 #define MPOL_MODE_MASK  ((1  MPOL_FLAG_SHIFT) - 1)
 
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES (1  MPOL_FLAG_SHIFT)
 #define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES)   /* legal set_mempolicy() 
 MPOL_* mode flags */
 
 could be simplified, to:
 
 /*
  * Optional flags that modify nodemask numbering.
  */
 #define MPOL_F_RELATIVE_NODES (114) /* remapped relative to cpuset 
 */
 #define MPOL_F_STATIC_NODES (115)   /* unremapped physical 
 masks */
 #define MPOL_MODE_FLAGS (MPOL_F_RELATIVE_NODES|MPOL_F_STATIC_NODES)
   /* combined MPOL_F_* mask flags 
 */
 
 (really, that MPOL_FLAG_SHIFT is just unnecessary distracting detail.)
 

It would be easy to define mpol_mode() and mpol_flags() in terms of 
MPOL_MODE_FLAGS as well, yes.  But without MPOL_FLAG_SHIFT it becomes 
impossible to determine whether a user passed an invalid flag.

I think we're all in agreement that passing an invalid flag bit should be 
rejected with -EINVAL.  So to do that we need MPOL_MODE_MASK to expicitly 
define the parts of the int *policy passed from set_mempolicy() that 
represent the mode.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-14 Thread David Rientjes
On Thu, 14 Feb 2008, David Rientjes wrote:

 We'll need to decide whether mpol_equal() is determining the equality of 
 the currently effected mempolicy (whereas policy-user_nodemask is 
 irrelevant) or the whole intended mempolicy overall.
 

I've done the latter in the latest patchset.

Since mpol_equal() is only used for determining whether nearby vmas can be 
merged, it is only logical to merge them if they share the mempolicy 
intent of the user if MPOL_F_STATIC_NODES or any flag that makes sense of 
policy-user_nodemask is used.

  Could we have mpol_to_str() mark policies which are
  MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES?  Perhaps
  by adding a suffix of |relative or |static or some
  such.
  
 
 I'd like to keep it in the same format as the tmpfs mount option which is 
 '=relative' and '=static'.
 

In preparation for mode flags in addition to MPOL_F_STATIC_NODES to be 
added (like MPOL_F_RELATIVE_NODES or whatever Paul decides to call it 
based on the latest feedback, I've prepared mpol_to_str() to have this 
format

interleave=static|relative=1-3

as viewable from /proc/pid/numa_maps.  This is also how tmpfs mount 
options will be specified.

 [ Of course the above example can't happen since MPOL_F_STATIC_NODES and
   MPOL_F_RELATIVE_NODES are disjoint, but it's sufficient for the
   example. ]
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Paul Jackson wrote:

> Yes, if an application considers nodes to be interchangeable, I'm
> trying to avoid having that application -have- to know its current
> cpuset placement, for two reasons:
> 
> For one thing, it's racey.  It's cpuset placement could change,
> unbeknownst to it, between the time it queried it, and the time
> that it issued the mbind or set_mempolicy call.
> 
> For the other thing, it's not always possible.  If the application
> is currently in a cpuset that is smaller than it's preferred
> configuration, it would not be possible to express its preferred
> memory policies using just the smaller number of memory nodes
> allowed by its current cpuset placement.  How do you say "put
> this on my third node" if you don't have a third node and you
> can only speak of the nodes you currently have?
> 

So let's say, like my first example from the previous email, that you have 
MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES over nodes 3-4 and your cpuset's 
mems is only nodes 5-7.  This would interleave over no nodes.  Correct?

It seems like MPOL_F_RELATIVE_NODES is primarily designed to maintain a 
certain order among the nodes it effects the mempolicy over.  It comes 
with the premise that the task doesn't already know it's cpuset mems 
(otherwise, the current implementation without MPOL_F_STATIC_NODES would 
work fine for this) so it doesn't really care what nodes it allocates 
pages on, it just cares about the order.

This works for MPOL_PREFERRED and MPOL_BIND as well, right?

I don't understand the use case for this (at all), but if you have 
workloads that require this type of setting then I can implement this as 
part of my series.  I just want to confirm that there are real world cases 
backing this so that we don't have flags with highly highly specialized 
cornercases.

 [ If a user _does_ specify MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES
   as part of their syscall, then we'll simply return -EINVAL. ]

> > Well, I didn't cave on anything 
> 
> ;)   Your simple "ok" was ambiguous enough that we were able to
>  read into it whatever we wanted to.
> 
> But I've made my case on that issue (involving the separate or
> packed policy flag field).  So I probably won't say more, and
> I expect to live with whatever you choose, after any further
> input from Lee or others.
> 

Well, there's advantages and disadvantages to either approach.

My preference (both mode and flags stored in the same member of struct 
mempolicy):

   Advantages:

- completely consistent with the userspace API of passing modes
  and flags together in a pointer to an int, and

- does not require additional formals to be added to several
  functions, including functions outside mm/mempolicy.c.

   Disadvantage:

- use of mpol_mode() throughout mm/mempolicy.c code to mask
  off optional mode flags for conditionals or switch statements.

Your preference (separate mode and flags members in struct mempolicy):

   Advantages:

- clearer implementation when dealing with modes: all existing
  statements involving pol->policy can remain unchanged.

   Disadvantages:

- requires additional formals to be added to several functions,
  including functions outside mm/mempolicy.c, and

- takes additional space in struct mempolicy (two bytes) which
  could eventually be used for something else.

In both cases the testing of mode flags is the same as before:

if (pol->policy & MPOL_F_STATIC_NODES) {
...
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
David wrote:
> You're specifically trying to avoid having the application know about its 
> cpuset placement with regard to mems at the time it sets up its mempolicy, 
> right?  Otherwise it could already setup this relative nodemask by 
> selecting node 2, from your example above, in its mems_allowed and it 
> would always be remapped appropriately.

Yes, if an application considers nodes to be interchangeable, I'm
trying to avoid having that application -have- to know its current
cpuset placement, for two reasons:

For one thing, it's racey.  It's cpuset placement could change,
unbeknownst to it, between the time it queried it, and the time
that it issued the mbind or set_mempolicy call.

For the other thing, it's not always possible.  If the application
is currently in a cpuset that is smaller than it's preferred
configuration, it would not be possible to express its preferred
memory policies using just the smaller number of memory nodes
allowed by its current cpuset placement.  How do you say "put
this on my third node" if you don't have a third node and you
can only speak of the nodes you currently have?


> So, for example, if the task is bound by mems 1-3, and it asks for 
> MPOL_INTERLEAVE over 2-4, then initially the mempolicy is only effected 
> over node 3 and if it's later expanded to mems 1-8, then the mempolicy is 
> effected over nodes 3-5, right?
> 
> And if the mems change to 3-8, the mempolicy is remapped to 5-7 even 
> though 3-5 (which it already was interleaving over) is still accessible?

Yes and yes, for this cpuset relative numbering mode.

> Does MPOL_INTERLEAVE | MPOL_F_STATIC_NODES | MPOL_F_PAULS_NEW_FLAG make 
> any logical sense?  If it does, I think we're going to be writing some 
> very complex remap code in our future.


No -- MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES (which is what I'll
likely call my new flag) are mutually exclusive.

The allowed modes and flags would be:
  Choose exactly one of:
MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE
  Plus zero or one of:
MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES
  where, if you choose neither of tne MPOL_F_*_NODES flags,
  then you get the classic, compatible nodemask handling.

> Well, I didn't cave on anything 

;)   Your simple "ok" was ambiguous enough that we were able to
 read into it whatever we wanted to.

But I've made my case on that issue (involving the separate or
packed policy flag field).  So I probably won't say more, and
I expect to live with whatever you choose, after any further
input from Lee or others.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

> > Yeah, it gets tricky.  I'm not too sure about only pulling the mode and 
> > flags apart at mpol_new() time because perhaps, in the future, there will 
> > be flag and mode combinations that are only applicable for set_mempolicy() 
> > and not for mbind(), for example.  I can imagine that someday we may add a 
> > flag that applies only to a task mempolicy and not to a VMA mempolicy.
> 
> True.  Altho' at such a time, I'd probably argue for testing for and
> rejecting invalid mode/flag combinations in the respective
> functions :-).
> 

Yeah, and that would require the modes and flags to be split apart in 
sys_set_mempolicy() and sys_mbind() or at least in do_set_mempolicy() or 
do_mbind().  So if we see the possibility that we want to test for mode 
and flag combinations that perhaps differ between the set_mempolicy() and 
mbind() case, we have to do it in either of those two places.  I think we 
should do it there, as early as possible, like I did in my first patchset.

> OK.  I'm "caving"... :-)  Different views of consistency.  But,
> eventually, I hope we can replace the separate mode[, flags] and
> nodemask in the 'sb_info with a mempolicy and keep the details of modes,
> flags, etc. internal to mempolicy.c.
> 

I agree, I think keeping all of these implementation details inside 
mm/mempolicy.c is the best practice.  We'll still need to expose some of 
the details, such as the parsing of '=static' in the tmpfs mount option to 
add the MPOL_F_STATIC_NODES flag to the policy, but situations like that 
should be rare.

For extendability, I agree that the struct shmem_sb_info member should be 
a pointer to a mempolicy and not an int.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

> > > If we just want to preserve existing behavior, we can define an
> > > additional mode flag that we set in the sbinfo policy in
> > > shmem_parse_mpol() and test in mpol_new().  If we want to be able to
> > > specify existing or new behavior, we can use the same flag, but set it
> > > or not based on an additional qualifier specified via the mount option.
> > > 
> > 
> > Shmem areas between cpusets with disjoint mems_allowed seems like an error 
> > in userspace to me and I would prefer that mpol_new() reject it outright.
> 
> I don't think so, and I'd like to explain why.  Perhaps this is too big
> a topic to cover in the context of this patch response.  I think I'll
> start another thread.  Suffice it to say here that cpusets and
> mempolicies don't constrain, in anyway, the ability of tasks in a cpuset
> to attach to shmem segments created by tasks in other, perhaps disjoint,
> cpusets.  The cpusets and mempolicies DO constrain page allocations,
> however.  This can result in OOM faults for tasks in cpusets whose
> mems_allowed contains no overlap with a shared policy installed by the
> creating task.  This can be avoided if the task that creates the shmem
> segment faults in all of the pages and SHM_LOCKs them down.  Whether
> this is sufficient for all applications is subject of the other
> discussion.  In any case, I think we need to point this out in mempolicy
> and cpuset man pages.
> 

I don't think we're likely to see examples of actual code written to do 
this being posted to refute my comment.

I think what you said above is right and that it should be allowed.  
You're advocating for allowing task A to attach to shmem segments created 
by task B while task A has its own mempolicy that restricts its own 
allocations but still being able to access the shmem segments of task B, 
providing that task A will not fault them back in itself.

You're right, that's not a scenario that I was hoping to address to 
MPOL_F_STATIC_NODES.

> > Since we're allowed to remap the node to a different node than the user 
> > specified with either syscall, the current behavior is that "one node is 
> > as good as another."  In other words, we're trying to accomodate the mode 
> > first by setting pol->v.preferred_node to some accessible node and only 
> > setting that to the user-supplied node if it is available.
> > 
> > If the node isn't available and the user specifically asked that it is not 
> > remapped (with MPOL_F_STATIC_NODES), then I felt local allocation was best 
> > compared to remapping to a node that may be unrelated to the VMA or task.  
> > This preserves a sense of the current behavior that "one node is as good 
> > as another," but the user specifically asked for no remap.
> 
> Yeah.  I went back and read the update to the mempolicy doc where you
> make it clear that this is the semantic of 'STATIC_NODES.  You also
> mentioned it in the patch description, but I didn't "get it".  Sorry.  
> 
> Still a comment in the code would, I think, help future spelunkers.
> 

Ok, I'll clarify the intention in this code that we agreed was better:

case MPOL_PREFERRED:
if (!remap) {
int nid = first_node(pol->user_nodemask);

if (node_isset(nid, *newmask))
pol->v.preferred_node = nid;
else
pol->v.preferred_node = -1;
} else
pol->v.preferred_node = 
node_remap(pol->v.preferred_node,
*mpolmask, *newmask);
break;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Wed, 2008-02-13 at 10:48 -0800, David Rientjes wrote:
> On Wed, 13 Feb 2008, Lee Schermerhorn wrote:
> 
> > > >  2) Those 'mpol_mode()' wrappers on all those mempolicy->policy
> > > > evaluations look dangerous to me.  It looks like a code bug
> > > > waiting to happen.  Unless I miss my guess, if you forget one of
> > > > those wrappers (or someone making a future change misses one), no
> > > > one will notice until sometime at runtime, someone makes use of the
> > > > new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
> > > > end up executing code that we didn't expect to execute just then.
> > > > 
> > > > I urge you to reconsider, and keep it so that the 'policy' field of 
> > > > struct
> > > > mempolicy naturally evaluates to just the policy.  There should be 
> > > > just one
> > > > pair of places, on entry to, and exit from, the kernel, where the 
> > > > code is
> > > > aware of the need to pack the mode and the policy into a single 
> > > > word.
> > > > 
> > > 
> > > Ok.
> > 
> > I was hoping David wouldn't cave on this point. ;-)  I do agree with
> > David that if we ever end up with nr_modes + nr_mode_flags > 16, the API
> > will become too complex for myself, at least, to comprehend.  I don't
> > feel too strongly either way, and I'll explain more below, but first:
> > 
> 
> Hmm, I don't remember "caving" on this point; Paul asked me to reconsider 
> and I said that I would.  I think it's also helpful to have more 
> discussion on the matter by including other people's opinions.

> 
> > I do agree with Paul that there is a danger of missing one of these in
> > converting existing code.  In fact, I missed one in my ref counting
> > rework patch that I will ultimately rebase atop David's final version
> > [I'm already dependent on Mel Gorman's patch series].  To catch this, I
> > decided to rename the "policy" member to "mode".  This also aligns the
> > struct better with the numa_memory_policy doc [I think of the "policy"
> > as the tuple:  mode + optional nodemask].  For future code, I think we
> > could add comments to the struct definition warning that one should use
> > the wrappers to access the mode or mode flags.
> > 
> 
> I considered this when I implemented it the way I did, and that was part 
> of the reason I was in favor of enum mempolicy_mode so much: functions 
> that had a formal of type 'enum mempolicy_mode' were only the mode and 
> formals of type 'int' or 'unsigned short' were the combination.  I even 
> stated that difference in Documentation/vm/numa_memory_policy.txt in the 
> first patchset.
> 
> > However, let's step back and look at the usage of the mempolicy struct.
> > In earlier mail, David mentioned that it is passed around outside of
> > mempolicy.c.  While this is true, the only place that I see where code
> > outside of mempolicy.c deals with the policy/mode and nodemask
> > separately--as opposed to using an opaque mempolicy struct--is in the
> > shmem mount option parsing.  Everywhere else, as far as I can see, just
> > uses the struct mempolicy.  Even slab/slub call into slab_node() in
> > mempolicy.c to extract the target node for the policy.
> > 
> 
> Yes, struct shmem_sb_info stores the 'policy' member as well so it would 
> need to include a new 'flags' member as well.  And the interface between 
> the two, mpol_shared_policy_init() would need another formal for the 
> flags.
> 
> > So, if David does accept Paul's plea to separate the mode and the mode
> > flags in the structure, I would suggest that we do this in mpol_new().
> > That is, in {sys|do}_{set_mempolicy|mbind}(), maintain the API format
> > with the mode flags ORed into the mode.  mpol_new() can pull them apart
> > into different struct mempolicy members.  The rest of mempolicy.c can
> > use them directly from the struct without the helpers.  get_mempolicy()
> > will need to pack the mode and flags back together--perhaps with an
> > inverse helper function.
> > 
> 
> Yeah, it gets tricky.  I'm not too sure about only pulling the mode and 
> flags apart at mpol_new() time because perhaps, in the future, there will 
> be flag and mode combinations that are only applicable for set_mempolicy() 
> and not for mbind(), for example.  I can imagine that someday we may add a 
> flag that applies only to a task mempolicy and not to a VMA mempolicy.

True.  Altho' at such a time, I'd probably argue for testing for and
rejecting invalid mode/flag combinations in the respective
functions :-).


> 
> > As for the shmem mount option parsing:  For now, I'd suggest that we
> > keep the mode flags OR'd into the "policy" member of the shmem_sb_info
> > struct -- i.e., the "API format"--to preserve the mpol_new() interface.
> 
> I don't like this because its not consistent.  It increases the confusion 
> of what contains a mode and what contains the combination.  I think the 
> safest thing to do, if we separate the mode and flags out in struct 
> 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Paul Jackson wrote:

> No -- MPOL_F_STATIC_NODES does not handle my second case.  Notice the
> phrase 'cpuset relative.'
> 
> In my second case, nodes are numbered relative to the cpuset.  If you
> say "node 2" then you mean whatever is the third (since numbering is
> zero based) node in your current cpuset.
> 

Ok, so this truly is a new feature that isn't addressed either by the 
current implementation or my patchset.  Fair enough.

You're specifically trying to avoid having the application know about its 
cpuset placement with regard to mems at the time it sets up its mempolicy, 
right?  Otherwise it could already setup this relative nodemask by 
selecting node 2, from your example above, in its mems_allowed and it 
would always be remapped appropriately.

> In this mode, "node 2" doesn't mean what the system calls "node 2"; it
> means the third node in whatever is ones current cpuset placement (if
> your cpuset even has that many nodes), and mempolicies using this mode
> are automatically remapped by the kernel, anytime the cpuset placement
> changes.
> 
> This second, cpuset relative, mode is required:
> 
> 1) to provide a race-free means for an application to place its memory
>when the application considers all physical nodes essentially
>equivalent, and just wants to lay things out relative to whatever
>cpuset it happens to be running in, and
> 
> 2) to provide a practical means, without the need for constantly
>reprobing ones current cpuset placement, for an application to
>specify cpuset-relative placement to be applied whenever the
>application is placed in a larger cpuset, even if the application
>is presently in a smaller cpuset.
> 

So, for example, if the task is bound by mems 1-3, and it asks for 
MPOL_INTERLEAVE over 2-4, then initially the mempolicy is only effected 
over node 3 and if it's later expanded to mems 1-8, then the mempolicy is 
effected over nodes 3-5, right?

And if the mems change to 3-8, the mempolicy is remapped to 5-7 even 
though 3-5 (which it already was interleaving over) is still accessible?

> I agree, David, that this present MPOL_F_STATIC_NODES patch handles the
> case of a growing cpuset (or hotplug added nodes) for the static mapped
> case (node "2" means physical system node "2", always.)  But this
> present patch, by design, does not address the case of a growing cpuset
> for the case where an application actually wants its mempolicies remapped.
> 

Does MPOL_INTERLEAVE | MPOL_F_STATIC_NODES | MPOL_F_PAULS_NEW_FLAG make 
any logical sense?  If it does, I think we're going to be writing some 
very complex remap code in our future.

> > Ahh, since policy->cpuset_mems_allowed is only meaningful in the 
> > non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
> > of this patchset, we can certainly do that.  I'm wondering whether future 
> > expansions will require them to be separated again, however.
> 
> I'd suggest we let future expansions deal with their own needs.  We
> don't usually pad internal (not externally visible) data structures
> in anticipation that someone else might need the space in the future.
> 
> At least earlier, Andi Kleen, when he was the primary author and sole
> active maintainer of this mempolicy code, was always keen to avoid
> expanding the size of 'struct mempolicy' by another nodemask.
> 
> I have not done the calculations myself to know how vital it is to
> keep the size of struct mempolicy to a minimum.  It certainly seems
> worth a bit of effort, however, if adding this union of these two
> nodemasks doesn't complicate the code too horribly much.
> 

I think it will work very nicely and the benefit is immediately obvious 
for systems that have large nodemasks.

> > > I urge you to reconsider, and keep it so that the 'policy' field of 
> > > struct
> > > mempolicy naturally evaluates to just the policy.  There should be 
> > > just one
> > > pair of places, on entry to, and exit from, the kernel, where the 
> > > code is
> > > aware of the need to pack the mode and the policy into a single word.
> > > 
> > 
> > Ok.
> 
> Cool.  Thanks.  (I'm glad you caved ... ;).  Looking forward in my inbox, I 
> see
> that Lee has some suggestions on where to handle the conversion between the
> packed mode and the separate fields.  I'm too lazy to think about that more,
> and will likely acquiesce to whatever you and Lee agree to.
> 

Well, I didn't cave on anything, I said that we can reconsider it in the 
hopes that other people would add their feedback.  I think continuing to 
discuss this matter with yourself and Lee (and whomever else is 
interested) will lead us to the correct solution.  Since this is an 
internal implementation detail, I think it's important to hear other 
people's opinions since we're the ones who will be hacking the code in the 
future so it's really our opinions that matter.

> > The user's nodemask is always stored unaltered in 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
David wrote:
> Yeah, it gets tricky.

If you can stand to read my code, look at that big patch I sent you
and Lee, dated "Sun, 23 Dec 2007 22:24:02 -0600", under the Subject
of "Re: [RFC] cpuset relative memory policies - second choice",
where I did this, keeping the 'policy' field unchanged in struct
mempolicy and adding additional fields for the (three, in my case)
new modes.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

> > >  2) Those 'mpol_mode()' wrappers on all those mempolicy->policy
> > > evaluations look dangerous to me.  It looks like a code bug
> > > waiting to happen.  Unless I miss my guess, if you forget one of
> > > those wrappers (or someone making a future change misses one), no
> > > one will notice until sometime at runtime, someone makes use of the
> > > new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
> > > end up executing code that we didn't expect to execute just then.
> > > 
> > > I urge you to reconsider, and keep it so that the 'policy' field of 
> > > struct
> > > mempolicy naturally evaluates to just the policy.  There should be 
> > > just one
> > > pair of places, on entry to, and exit from, the kernel, where the 
> > > code is
> > > aware of the need to pack the mode and the policy into a single word.
> > > 
> > 
> > Ok.
> 
> I was hoping David wouldn't cave on this point. ;-)  I do agree with
> David that if we ever end up with nr_modes + nr_mode_flags > 16, the API
> will become too complex for myself, at least, to comprehend.  I don't
> feel too strongly either way, and I'll explain more below, but first:
> 

Hmm, I don't remember "caving" on this point; Paul asked me to reconsider 
and I said that I would.  I think it's also helpful to have more 
discussion on the matter by including other people's opinions.

> I do agree with Paul that there is a danger of missing one of these in
> converting existing code.  In fact, I missed one in my ref counting
> rework patch that I will ultimately rebase atop David's final version
> [I'm already dependent on Mel Gorman's patch series].  To catch this, I
> decided to rename the "policy" member to "mode".  This also aligns the
> struct better with the numa_memory_policy doc [I think of the "policy"
> as the tuple:  mode + optional nodemask].  For future code, I think we
> could add comments to the struct definition warning that one should use
> the wrappers to access the mode or mode flags.
> 

I considered this when I implemented it the way I did, and that was part 
of the reason I was in favor of enum mempolicy_mode so much: functions 
that had a formal of type 'enum mempolicy_mode' were only the mode and 
formals of type 'int' or 'unsigned short' were the combination.  I even 
stated that difference in Documentation/vm/numa_memory_policy.txt in the 
first patchset.

> However, let's step back and look at the usage of the mempolicy struct.
> In earlier mail, David mentioned that it is passed around outside of
> mempolicy.c.  While this is true, the only place that I see where code
> outside of mempolicy.c deals with the policy/mode and nodemask
> separately--as opposed to using an opaque mempolicy struct--is in the
> shmem mount option parsing.  Everywhere else, as far as I can see, just
> uses the struct mempolicy.  Even slab/slub call into slab_node() in
> mempolicy.c to extract the target node for the policy.
> 

Yes, struct shmem_sb_info stores the 'policy' member as well so it would 
need to include a new 'flags' member as well.  And the interface between 
the two, mpol_shared_policy_init() would need another formal for the 
flags.

> So, if David does accept Paul's plea to separate the mode and the mode
> flags in the structure, I would suggest that we do this in mpol_new().
> That is, in {sys|do}_{set_mempolicy|mbind}(), maintain the API format
> with the mode flags ORed into the mode.  mpol_new() can pull them apart
> into different struct mempolicy members.  The rest of mempolicy.c can
> use them directly from the struct without the helpers.  get_mempolicy()
> will need to pack the mode and flags back together--perhaps with an
> inverse helper function.
> 

Yeah, it gets tricky.  I'm not too sure about only pulling the mode and 
flags apart at mpol_new() time because perhaps, in the future, there will 
be flag and mode combinations that are only applicable for set_mempolicy() 
and not for mbind(), for example.  I can imagine that someday we may add a 
flag that applies only to a task mempolicy and not to a VMA mempolicy.

> As for the shmem mount option parsing:  For now, I'd suggest that we
> keep the mode flags OR'd into the "policy" member of the shmem_sb_info
> struct -- i.e., the "API format"--to preserve the mpol_new() interface.

I don't like this because its not consistent.  It increases the confusion 
of what contains a mode and what contains the combination.  I think the 
safest thing to do, if we separate the mode and flags out in struct 
mempolicy is to do it everywhere.  All helper functions will need 
additional formals similar to what I did in the first patchset (that was 
actually unpopular) and struct shmem_sb_info need to be modified to 
include the additional member.

In other words, I'm all in favor of storing the mode and flags however we 
want, but I think it should be done consistenty throughout the tree.  
While 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
David, responding to pj, write:
> >  2) a cpuset relative mode that correctly handled the case of cpusets
> > growing larger (increased numbers of nodes.)
> > 
> > I'd like to persuade you to add case (2) as well.  But I suspect,
> > given that case (2) was never as compelling to you as it was to me
> > in our December discussions, that I'll have little luck doing that.
> > 
> 
> MPOL_F_STATIC_NODES already handles the second case because you can 
> specify nodes that aren't currently accessible because of your cpuset in 
> the hopes that eventually they will become accessible.

No -- MPOL_F_STATIC_NODES does not handle my second case.  Notice the
phrase 'cpuset relative.'

In my second case, nodes are numbered relative to the cpuset.  If you
say "node 2" then you mean whatever is the third (since numbering is
zero based) node in your current cpuset.

This cpuset relative mode that I'm proposing is an entirely different
mode of numbering nodes than anything we've seen so far (except for our
side discussions with just yourself, Lee and Christoph, last December.)

In this mode, "node 2" doesn't mean what the system calls "node 2"; it
means the third node in whatever is ones current cpuset placement (if
your cpuset even has that many nodes), and mempolicies using this mode
are automatically remapped by the kernel, anytime the cpuset placement
changes.

This second, cpuset relative, mode is required:

1) to provide a race-free means for an application to place its memory
   when the application considers all physical nodes essentially
   equivalent, and just wants to lay things out relative to whatever
   cpuset it happens to be running in, and

2) to provide a practical means, without the need for constantly
   reprobing ones current cpuset placement, for an application to
   specify cpuset-relative placement to be applied whenever the
   application is placed in a larger cpuset, even if the application
   is presently in a smaller cpuset.

Without it, the application has to first query its current physical
cpuset placement, then calculate its desired relative placement into
terms of the current physical nodes it finds itself on, and then issue
various mbind or set_mempolicy calls using those current physical node
numbers, praying that it wasn't migrated to other physical nodes at the
same time, which would have invalidated its current placement
assumptions in its relative to physical node number calculations.

And without it, if an application is currently placed in a smaller cpuset
than it would prefer, it cannot specify how it would want its mempolicies
should it ever subsequently be moved to a larger cpuset.  This leaves such
an application with little option other than to constantly reprobe its
cpuset placement, in order to know if and when it needs to reissue its
mbind and set_mempolicy calls because it gained access to a larger cpuset.

I agree, David, that this present MPOL_F_STATIC_NODES patch handles the
case of a growing cpuset (or hotplug added nodes) for the static mapped
case (node "2" means physical system node "2", always.)  But this
present patch, by design, does not address the case of a growing cpuset
for the case where an application actually wants its mempolicies remapped.

My original code remapping mempolicies when cpuset placement changes,
that has been in the kernel for a couple of years now, was -supposed-
to handle this relative case.  But it is flawed, as it fails to meet
the requirements I list above.  We can't just change the way that mode
works now, for compatibility reasons.  So we need to add a new cpuset
relative mode, just as you're adding a STATIC mode, that addresses the
above requirements for a cpuset relative, remapped, numbering of
mempolicy nodes.


> Other than that, perhaps if you can elaborate more on what you imagined 
> supporting as far as cpusets growing larger (or supporting node hotplug, 
> which is the same type of problem), we can discuss that further before I 
> resend my patches.

Ok - I tried to elaborate, above.  You (David), Lee and Christoph will
perhaps recognize this elaboration, as it essentially repeats things
I said in our earlier discussion in December and early January.

Yes - hotplug presents the same problems as cpusets growing larger.

If this makes sense to you, David, and you'd like to include this
second, cpuset relative mode, in your patchset, that would be excellent.

Given that I have not been very good at explaining this second mode
you might choose not to do that; in that case I'll have to follow up,
after your second patch shapes up, with a patch of my own, adding this
cpuset relative mode.


> > I believe you can overlay these two nodemasks using a union, and
> > avoid making struct mempolicy bigger.
> 
> Ahh, since policy->cpuset_mems_allowed is only meaningful in the 
> non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
> of this patchset, we can certainly do that.  I'm wondering whether future 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Tue, 2008-02-12 at 20:18 -0800, David Rientjes wrote:
> On Tue, 12 Feb 2008, Lee Schermerhorn wrote:
> 
> > > Adds another member to struct mempolicy,
> > > 
> > >   nodemask_t user_nodemask
> > > 
> > > that stores the the nodemask that the user passed when he or she created
> > > the mempolicy via set_mempolicy() or mbind().  When using
> > > MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
> > > passed nodemask is always used when determining the preferred node,
> > > building the MPOL_BIND zonelist, or creating the interleave nodemask.
> > > This happens whenever the policy is rebound, including when a task's
> > > cpuset assignment changes or the cpuset's mems are changed.
> > 
> > When you say that the user's nodemask is "always used" you mean "after
> > cpuset contextualization", right?  I.e., after masking with mems_allowed
> > [which also restricts to nodes with memory].  That is what the code
> > still does.  
> > 
> 
> Yeah, I'm not introducing a loophole so that tasks can access memory to 
> which they don't have access.  I've clarified that for the next version.

I though so.  The description just seemed ambiguous to me.  Thanks for
the clarification.
> 
> > > It is also possible to mount tmpfs with the static nodemask behavior when
> > > specifying a node or nodemask.  To do this, simply add "=static"
> > > immediately following the mempolicy mode at mount time:
> > > 
> > >   mount -o remount mpol=interleave=static:1-3
> > > 
> > > Also removes mpol_check_policy() and folds some of its logic into
> > > mpol_new() since it is now mostly obsoleted.
> > 
> > Couple of comments here:
> > 
> > 1) we've discussed the issue of returning EINVAL for non-empty nodemasks
> > with MPOL_DEFAULT.  By removing this restriction, we run the risk of
> > breaking applications if we should ever want to define a semantic to
> > non-empty node mask for MPOL_DEFAULT.   I think this is probably
> > unlikely, but consider the new mode flags part of the mode/policy
> > parameter:  by not rejecting undefined flags, we may break applications
> > by later defining additional flags.  I'd argue for fairly strict
> > argument checking.
> > 
> 
> As I've mentioned in a parallel thread, I've folded the entire logic of 
> mpol_check_policy() as it stands this minute in Linus' tree into 
> mpol_new().

OK.  
> 
> > 2) Before this patch, policy_nodemask from the shmem_sb_info was NOT
> > masked with allowed nodes because we passed this mask directly to
> > mpol_new() without "contextualization".  We DO guarantee that this
> > policy nodemask contains only nodes with memory [see
> > shmem_parse_mpol()].  Now that you've moved the contextualization to
> > mpol_new(), we are masking this policy with the cpuset mems allowed.
> > This is a change in behavior.  Do we want this?  I.e., are we preserving
> > the "intent" of the system administrator in setting the tmpfs policy?
> > Maybe they wanted to shared interleaved shmem areas between cpusets with
> > disjoint mems allowed.  Nothing prevents this...
> > 
> 
> We're still saving the intent in the new policy->user_nodemask field so 
> any future rebinds will still allow unaccessible nodes be effected by the 
> mempolicy if the permissions are changed later.
> 
> > If we just want to preserve existing behavior, we can define an
> > additional mode flag that we set in the sbinfo policy in
> > shmem_parse_mpol() and test in mpol_new().  If we want to be able to
> > specify existing or new behavior, we can use the same flag, but set it
> > or not based on an additional qualifier specified via the mount option.
> > 
> 
> Shmem areas between cpusets with disjoint mems_allowed seems like an error 
> in userspace to me and I would prefer that mpol_new() reject it outright.

I don't think so, and I'd like to explain why.  Perhaps this is too big
a topic to cover in the context of this patch response.  I think I'll
start another thread.  Suffice it to say here that cpusets and
mempolicies don't constrain, in anyway, the ability of tasks in a cpuset
to attach to shmem segments created by tasks in other, perhaps disjoint,
cpusets.  The cpusets and mempolicies DO constrain page allocations,
however.  This can result in OOM faults for tasks in cpusets whose
mems_allowed contains no overlap with a shared policy installed by the
creating task.  This can be avoided if the task that creates the shmem
segment faults in all of the pages and SHM_LOCKs them down.  Whether
this is sufficient for all applications is subject of the other
discussion.  In any case, I think we need to point this out in mempolicy
and cpuset man pages.

> 
> > > @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum 
> > > mempolicy_mode mode,
> > >   return ERR_PTR(-ENOMEM);
> > >   flags &= MPOL_MODE_FLAGS;
> > >   atomic_set(>refcnt, 1);
> > > + cpuset_update_task_memory_state();
> > > + nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
> > >   switch (mode) {
> > 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Wed, 2008-02-13 at 01:36 -0800, David Rientjes wrote:
> On Wed, 13 Feb 2008, Paul Jackson wrote:
> 
> > The infamous unpublished (except to a few) patch I drafted on Christmas
> > (Dec 25, 2007) basically added two new modes for how mempolicy
> > nodemasks were to be resolved:
> >  1) a static, no remap, mode, such as in this present patchset, and
> >  2) a cpuset relative mode that correctly handled the case of cpusets
> > growing larger (increased numbers of nodes.)
> > 
> > I'd like to persuade you to add case (2) as well.  But I suspect,
> > given that case (2) was never as compelling to you as it was to me
> > in our December discussions, that I'll have little luck doing that.
> > 
> 
> MPOL_F_STATIC_NODES already handles the second case because you can 
> specify nodes that aren't currently accessible because of your cpuset in 
> the hopes that eventually they will become accessible.  It's possible to 
> pass a nodemask with all bits set so that when the cpuset's set of nodes 
> expand, the mempolicy is effected over the intersection of the two on 
> rebind.
> 
> Other than that, perhaps if you can elaborate more on what you imagined 
> supporting as far as cpusets growing larger (or supporting node hotplug, 
> which is the same type of problem), we can discuss that further before I 
> resend my patches.
> 
> > Notes and questions on your current patchset:
> > 
> >  1) It looks like you -add- a second nodemask to struct mempolicy:
> > nodemask_t cpuset_mems_allowed; /* mempolicy relative to these 
> > nodes */
> > nodemask_t user_nodemask;   /* nodemask passed by user */
> > 
> > I believe you can overlay these two nodemasks using a union, and avoid 
> > making
> > struct mempolicy bigger.
> > 
> 
> Ahh, since policy->cpuset_mems_allowed is only meaningful in the 
> non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
> of this patchset, we can certainly do that.  I'm wondering whether future 
> expansions will require them to be separated again, however.
> 
> >  2) Those 'mpol_mode()' wrappers on all those mempolicy->policy
> > evaluations look dangerous to me.  It looks like a code bug
> > waiting to happen.  Unless I miss my guess, if you forget one of
> > those wrappers (or someone making a future change misses one), no
> > one will notice until sometime at runtime, someone makes use of the
> > new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
> > end up executing code that we didn't expect to execute just then.
> > 
> > I urge you to reconsider, and keep it so that the 'policy' field of 
> > struct
> > mempolicy naturally evaluates to just the policy.  There should be just 
> > one
> > pair of places, on entry to, and exit from, the kernel, where the code 
> > is
> > aware of the need to pack the mode and the policy into a single word.
> > 
> 
> Ok.

I was hoping David wouldn't cave on this point. ;-)  I do agree with
David that if we ever end up with nr_modes + nr_mode_flags > 16, the API
will become too complex for myself, at least, to comprehend.  I don't
feel too strongly either way, and I'll explain more below, but first:

I do agree with Paul that there is a danger of missing one of these in
converting existing code.  In fact, I missed one in my ref counting
rework patch that I will ultimately rebase atop David's final version
[I'm already dependent on Mel Gorman's patch series].  To catch this, I
decided to rename the "policy" member to "mode".  This also aligns the
struct better with the numa_memory_policy doc [I think of the "policy"
as the tuple:  mode + optional nodemask].  For future code, I think we
could add comments to the struct definition warning that one should use
the wrappers to access the mode or mode flags.

However, let's step back and look at the usage of the mempolicy struct.
In earlier mail, David mentioned that it is passed around outside of
mempolicy.c.  While this is true, the only place that I see where code
outside of mempolicy.c deals with the policy/mode and nodemask
separately--as opposed to using an opaque mempolicy struct--is in the
shmem mount option parsing.  Everywhere else, as far as I can see, just
uses the struct mempolicy.  Even slab/slub call into slab_node() in
mempolicy.c to extract the target node for the policy.

So, if David does accept Paul's plea to separate the mode and the mode
flags in the structure, I would suggest that we do this in mpol_new().
That is, in {sys|do}_{set_mempolicy|mbind}(), maintain the API format
with the mode flags ORed into the mode.  mpol_new() can pull them apart
into different struct mempolicy members.  The rest of mempolicy.c can
use them directly from the struct without the helpers.  get_mempolicy()
will need to pack the mode and flags back together--perhaps with an
inverse helper function.

As for the shmem mount option parsing:  For now, I'd suggest that we
keep the mode flags OR'd into the "policy" 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Tue, 2008-02-12 at 21:06 -0800, David Rientjes wrote:
> On Tue, 12 Feb 2008, David Rientjes wrote:
> 
> > Since we're allowed to remap the node to a different node than the user 
> > specified with either syscall, the current behavior is that "one node is 
> > as good as another."  In other words, we're trying to accomodate the mode 
> > first by setting pol->v.preferred_node to some accessible node and only 
> > setting that to the user-supplied node if it is available.
> > 
> > If the node isn't available and the user specifically asked that it is not 
> > remapped (with MPOL_F_STATIC_NODES), then I felt local allocation was best 
> > compared to remapping to a node that may be unrelated to the VMA or task.  
> > This preserves a sense of the current behavior that "one node is as good 
> > as another," but the user specifically asked for no remap.
> > 
> 
> Upon further inspection, perhaps it's better to fallback to local 
> allocation when the preferred node is not available on rebind and then, if 
> it becomes available later, prefer it again.
> 
> In mpol_rebind_policy():
> 
>   case MPOL_PREFERRED:
>   if (!remap) {
>   int nid = first_node(pol->user_nodemask);
> 
>   if (node_isset(nid, *newmask))
>   pol->v.preferred_node = nid;
>   else {
>   /*
>* Fallback to local allocation since that
>* is the behavior in mpol_new().  The
>* node may eventually become available.
>*/
>   pol->v.preferred_node = -1;
>   }
>   } else
>   pol->v.preferred_node = 
> node_remap(pol->v.preferred_node,
>   *mpolmask, *newmask);
>   break;
> 
> What do you think, Lee?

I agree.  I was still puzzled by your patch in this area as it appeared
to me that you weren't remapping back to your original preferred node,
which seemed to be the whole point of "static".  I was going to ask
about this today, but you caught it first!

Lee

> 
>   David

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Paul Jackson wrote:

> The infamous unpublished (except to a few) patch I drafted on Christmas
> (Dec 25, 2007) basically added two new modes for how mempolicy
> nodemasks were to be resolved:
>  1) a static, no remap, mode, such as in this present patchset, and
>  2) a cpuset relative mode that correctly handled the case of cpusets
> growing larger (increased numbers of nodes.)
> 
> I'd like to persuade you to add case (2) as well.  But I suspect,
> given that case (2) was never as compelling to you as it was to me
> in our December discussions, that I'll have little luck doing that.
> 

MPOL_F_STATIC_NODES already handles the second case because you can 
specify nodes that aren't currently accessible because of your cpuset in 
the hopes that eventually they will become accessible.  It's possible to 
pass a nodemask with all bits set so that when the cpuset's set of nodes 
expand, the mempolicy is effected over the intersection of the two on 
rebind.

Other than that, perhaps if you can elaborate more on what you imagined 
supporting as far as cpusets growing larger (or supporting node hotplug, 
which is the same type of problem), we can discuss that further before I 
resend my patches.

> Notes and questions on your current patchset:
> 
>  1) It looks like you -add- a second nodemask to struct mempolicy:
>   nodemask_t cpuset_mems_allowed; /* mempolicy relative to these 
> nodes */
>   nodemask_t user_nodemask;   /* nodemask passed by user */
> 
> I believe you can overlay these two nodemasks using a union, and avoid 
> making
> struct mempolicy bigger.
> 

Ahh, since policy->cpuset_mems_allowed is only meaningful in the 
non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
of this patchset, we can certainly do that.  I'm wondering whether future 
expansions will require them to be separated again, however.

>  2) Those 'mpol_mode()' wrappers on all those mempolicy->policy
> evaluations look dangerous to me.  It looks like a code bug
> waiting to happen.  Unless I miss my guess, if you forget one of
> those wrappers (or someone making a future change misses one), no
> one will notice until sometime at runtime, someone makes use of the
> new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
> end up executing code that we didn't expect to execute just then.
> 
> I urge you to reconsider, and keep it so that the 'policy' field of struct
> mempolicy naturally evaluates to just the policy.  There should be just 
> one
> pair of places, on entry to, and exit from, the kernel, where the code is
> aware of the need to pack the mode and the policy into a single word.
> 

Ok.

>  3) Does your patchset allow the user to specify a nodemask that includes 
> nodes
> outside of the current cpuset with a MPOL_F_STATIC_NODES mode?  At first
> glance, I didn't see that it did, but I should have looked closer.  This
> strikes me as an essential aspect of this mode.
> 

It does, and that's why I was a little curious about your second case at 
the beginning of this email.

The user's nodemask is always stored unaltered in policy->user_nodemask.  
In mpol_new(), we intersect the current accessible nodes with that 
nodemask to determine if there's even a mempolicy to effect.  If not, 
mpol_new() returns ERR_PTR(-EINVAL) and we fall back to the existing 
policy (if any) for the VMA or task.  Otherwise, we use the intersection 
of those two nodemasks.

In mpol_rebind_policy() with MPOL_F_STATIC_NODES, we always intersect 
policy->user_nodemask with the set of accessible nodes (nodemask_t 
*newmask).  So if we can now access nodes that we previously couldn't, 
they are now part of the interleave nodemask.  A similiar situation occurs 
when building the zonelist for the MPOL_BIND case and you can see the 
change I made to MPOL_PREFERRED in the incremental patch I sent you.

The only way that newly-accessible nodes will not become a part of the 
mempolicy nodemask is when the user's nodemask and the set of accessible 
nodes is disjoint when the policy was created.

It is arguable whether we want to support that behavior as well (and we 
definitely could do it, it's not out of the scope of mempolicies).  Lee 
had specific requirements of rejecting nodemasks that had no nodes in the 
intersection based on the current implementation, but we could continue 
discussing the possibility of setting up mempolicies that are uneffected 
when they are created and only become policy when they are rebound later.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
Ok ... I read this patchset a little closer, and see a couple
of items worth noting.

The infamous unpublished (except to a few) patch I drafted on Christmas
(Dec 25, 2007) basically added two new modes for how mempolicy
nodemasks were to be resolved:
 1) a static, no remap, mode, such as in this present patchset, and
 2) a cpuset relative mode that correctly handled the case of cpusets
growing larger (increased numbers of nodes.)

I'd like to persuade you to add case (2) as well.  But I suspect,
given that case (2) was never as compelling to you as it was to me
in our December discussions, that I'll have little luck doing that.

If you choose not to add case (2) to your patchset, then I'll wait
until the dust settles on your patchset, and follow up with a second
patch, adding case (2) and presenting the justification for it then.

Notes and questions on your current patchset:

 1) It looks like you -add- a second nodemask to struct mempolicy:
nodemask_t cpuset_mems_allowed; /* mempolicy relative to these 
nodes */
nodemask_t user_nodemask;   /* nodemask passed by user */

I believe you can overlay these two nodemasks using a union, and avoid 
making
struct mempolicy bigger.

 2) Those 'mpol_mode()' wrappers on all those mempolicy->policy
evaluations look dangerous to me.  It looks like a code bug
waiting to happen.  Unless I miss my guess, if you forget one of
those wrappers (or someone making a future change misses one), no
one will notice until sometime at runtime, someone makes use of the
new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
end up executing code that we didn't expect to execute just then.

I urge you to reconsider, and keep it so that the 'policy' field of struct
mempolicy naturally evaluates to just the policy.  There should be just one
pair of places, on entry to, and exit from, the kernel, where the code is
aware of the need to pack the mode and the policy into a single word.

 3) Does your patchset allow the user to specify a nodemask that includes nodes
outside of the current cpuset with a MPOL_F_STATIC_NODES mode?  At first
glance, I didn't see that it did, but I should have looked closer.  This
strikes me as an essential aspect of this mode.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
Ok ... I read this patchset a little closer, and see a couple
of items worth noting.

The infamous unpublished (except to a few) patch I drafted on Christmas
(Dec 25, 2007) basically added two new modes for how mempolicy
nodemasks were to be resolved:
 1) a static, no remap, mode, such as in this present patchset, and
 2) a cpuset relative mode that correctly handled the case of cpusets
growing larger (increased numbers of nodes.)

I'd like to persuade you to add case (2) as well.  But I suspect,
given that case (2) was never as compelling to you as it was to me
in our December discussions, that I'll have little luck doing that.

If you choose not to add case (2) to your patchset, then I'll wait
until the dust settles on your patchset, and follow up with a second
patch, adding case (2) and presenting the justification for it then.

Notes and questions on your current patchset:

 1) It looks like you -add- a second nodemask to struct mempolicy:
nodemask_t cpuset_mems_allowed; /* mempolicy relative to these 
nodes */
nodemask_t user_nodemask;   /* nodemask passed by user */

I believe you can overlay these two nodemasks using a union, and avoid 
making
struct mempolicy bigger.

 2) Those 'mpol_mode()' wrappers on all those mempolicy-policy
evaluations look dangerous to me.  It looks like a code bug
waiting to happen.  Unless I miss my guess, if you forget one of
those wrappers (or someone making a future change misses one), no
one will notice until sometime at runtime, someone makes use of the
new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
end up executing code that we didn't expect to execute just then.

I urge you to reconsider, and keep it so that the 'policy' field of struct
mempolicy naturally evaluates to just the policy.  There should be just one
pair of places, on entry to, and exit from, the kernel, where the code is
aware of the need to pack the mode and the policy into a single word.

 3) Does your patchset allow the user to specify a nodemask that includes nodes
outside of the current cpuset with a MPOL_F_STATIC_NODES mode?  At first
glance, I didn't see that it did, but I should have looked closer.  This
strikes me as an essential aspect of this mode.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Paul Jackson wrote:

 The infamous unpublished (except to a few) patch I drafted on Christmas
 (Dec 25, 2007) basically added two new modes for how mempolicy
 nodemasks were to be resolved:
  1) a static, no remap, mode, such as in this present patchset, and
  2) a cpuset relative mode that correctly handled the case of cpusets
 growing larger (increased numbers of nodes.)
 
 I'd like to persuade you to add case (2) as well.  But I suspect,
 given that case (2) was never as compelling to you as it was to me
 in our December discussions, that I'll have little luck doing that.
 

MPOL_F_STATIC_NODES already handles the second case because you can 
specify nodes that aren't currently accessible because of your cpuset in 
the hopes that eventually they will become accessible.  It's possible to 
pass a nodemask with all bits set so that when the cpuset's set of nodes 
expand, the mempolicy is effected over the intersection of the two on 
rebind.

Other than that, perhaps if you can elaborate more on what you imagined 
supporting as far as cpusets growing larger (or supporting node hotplug, 
which is the same type of problem), we can discuss that further before I 
resend my patches.

 Notes and questions on your current patchset:
 
  1) It looks like you -add- a second nodemask to struct mempolicy:
   nodemask_t cpuset_mems_allowed; /* mempolicy relative to these 
 nodes */
   nodemask_t user_nodemask;   /* nodemask passed by user */
 
 I believe you can overlay these two nodemasks using a union, and avoid 
 making
 struct mempolicy bigger.
 

Ahh, since policy-cpuset_mems_allowed is only meaningful in the 
non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
of this patchset, we can certainly do that.  I'm wondering whether future 
expansions will require them to be separated again, however.

  2) Those 'mpol_mode()' wrappers on all those mempolicy-policy
 evaluations look dangerous to me.  It looks like a code bug
 waiting to happen.  Unless I miss my guess, if you forget one of
 those wrappers (or someone making a future change misses one), no
 one will notice until sometime at runtime, someone makes use of the
 new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
 end up executing code that we didn't expect to execute just then.
 
 I urge you to reconsider, and keep it so that the 'policy' field of struct
 mempolicy naturally evaluates to just the policy.  There should be just 
 one
 pair of places, on entry to, and exit from, the kernel, where the code is
 aware of the need to pack the mode and the policy into a single word.
 

Ok.

  3) Does your patchset allow the user to specify a nodemask that includes 
 nodes
 outside of the current cpuset with a MPOL_F_STATIC_NODES mode?  At first
 glance, I didn't see that it did, but I should have looked closer.  This
 strikes me as an essential aspect of this mode.
 

It does, and that's why I was a little curious about your second case at 
the beginning of this email.

The user's nodemask is always stored unaltered in policy-user_nodemask.  
In mpol_new(), we intersect the current accessible nodes with that 
nodemask to determine if there's even a mempolicy to effect.  If not, 
mpol_new() returns ERR_PTR(-EINVAL) and we fall back to the existing 
policy (if any) for the VMA or task.  Otherwise, we use the intersection 
of those two nodemasks.

In mpol_rebind_policy() with MPOL_F_STATIC_NODES, we always intersect 
policy-user_nodemask with the set of accessible nodes (nodemask_t 
*newmask).  So if we can now access nodes that we previously couldn't, 
they are now part of the interleave nodemask.  A similiar situation occurs 
when building the zonelist for the MPOL_BIND case and you can see the 
change I made to MPOL_PREFERRED in the incremental patch I sent you.

The only way that newly-accessible nodes will not become a part of the 
mempolicy nodemask is when the user's nodemask and the set of accessible 
nodes is disjoint when the policy was created.

It is arguable whether we want to support that behavior as well (and we 
definitely could do it, it's not out of the scope of mempolicies).  Lee 
had specific requirements of rejecting nodemasks that had no nodes in the 
intersection based on the current implementation, but we could continue 
discussing the possibility of setting up mempolicies that are uneffected 
when they are created and only become policy when they are rebound later.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Tue, 2008-02-12 at 20:18 -0800, David Rientjes wrote:
 On Tue, 12 Feb 2008, Lee Schermerhorn wrote:
 
   Adds another member to struct mempolicy,
   
 nodemask_t user_nodemask
   
   that stores the the nodemask that the user passed when he or she created
   the mempolicy via set_mempolicy() or mbind().  When using
   MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
   passed nodemask is always used when determining the preferred node,
   building the MPOL_BIND zonelist, or creating the interleave nodemask.
   This happens whenever the policy is rebound, including when a task's
   cpuset assignment changes or the cpuset's mems are changed.
  
  When you say that the user's nodemask is always used you mean after
  cpuset contextualization, right?  I.e., after masking with mems_allowed
  [which also restricts to nodes with memory].  That is what the code
  still does.  
  
 
 Yeah, I'm not introducing a loophole so that tasks can access memory to 
 which they don't have access.  I've clarified that for the next version.

I though so.  The description just seemed ambiguous to me.  Thanks for
the clarification.
 
   It is also possible to mount tmpfs with the static nodemask behavior when
   specifying a node or nodemask.  To do this, simply add =static
   immediately following the mempolicy mode at mount time:
   
 mount -o remount mpol=interleave=static:1-3
   
   Also removes mpol_check_policy() and folds some of its logic into
   mpol_new() since it is now mostly obsoleted.
  
  Couple of comments here:
  
  1) we've discussed the issue of returning EINVAL for non-empty nodemasks
  with MPOL_DEFAULT.  By removing this restriction, we run the risk of
  breaking applications if we should ever want to define a semantic to
  non-empty node mask for MPOL_DEFAULT.   I think this is probably
  unlikely, but consider the new mode flags part of the mode/policy
  parameter:  by not rejecting undefined flags, we may break applications
  by later defining additional flags.  I'd argue for fairly strict
  argument checking.
  
 
 As I've mentioned in a parallel thread, I've folded the entire logic of 
 mpol_check_policy() as it stands this minute in Linus' tree into 
 mpol_new().

OK.  
 
  2) Before this patch, policy_nodemask from the shmem_sb_info was NOT
  masked with allowed nodes because we passed this mask directly to
  mpol_new() without contextualization.  We DO guarantee that this
  policy nodemask contains only nodes with memory [see
  shmem_parse_mpol()].  Now that you've moved the contextualization to
  mpol_new(), we are masking this policy with the cpuset mems allowed.
  This is a change in behavior.  Do we want this?  I.e., are we preserving
  the intent of the system administrator in setting the tmpfs policy?
  Maybe they wanted to shared interleaved shmem areas between cpusets with
  disjoint mems allowed.  Nothing prevents this...
  
 
 We're still saving the intent in the new policy-user_nodemask field so 
 any future rebinds will still allow unaccessible nodes be effected by the 
 mempolicy if the permissions are changed later.
 
  If we just want to preserve existing behavior, we can define an
  additional mode flag that we set in the sbinfo policy in
  shmem_parse_mpol() and test in mpol_new().  If we want to be able to
  specify existing or new behavior, we can use the same flag, but set it
  or not based on an additional qualifier specified via the mount option.
  
 
 Shmem areas between cpusets with disjoint mems_allowed seems like an error 
 in userspace to me and I would prefer that mpol_new() reject it outright.

I don't think so, and I'd like to explain why.  Perhaps this is too big
a topic to cover in the context of this patch response.  I think I'll
start another thread.  Suffice it to say here that cpusets and
mempolicies don't constrain, in anyway, the ability of tasks in a cpuset
to attach to shmem segments created by tasks in other, perhaps disjoint,
cpusets.  The cpusets and mempolicies DO constrain page allocations,
however.  This can result in OOM faults for tasks in cpusets whose
mems_allowed contains no overlap with a shared policy installed by the
creating task.  This can be avoided if the task that creates the shmem
segment faults in all of the pages and SHM_LOCKs them down.  Whether
this is sufficient for all applications is subject of the other
discussion.  In any case, I think we need to point this out in mempolicy
and cpuset man pages.

 
   @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum 
   mempolicy_mode mode,
 return ERR_PTR(-ENOMEM);
 flags = MPOL_MODE_FLAGS;
 atomic_set(policy-refcnt, 1);
   + cpuset_update_task_memory_state();
   + nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
 switch (mode) {
 case MPOL_INTERLEAVE:
   - policy-v.nodes = *nodes;
   + if (nodes_empty(*nodes))
   + return ERR_PTR(-EINVAL);
   + policy-v.nodes = 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Wed, 2008-02-13 at 01:36 -0800, David Rientjes wrote:
 On Wed, 13 Feb 2008, Paul Jackson wrote:
 
  The infamous unpublished (except to a few) patch I drafted on Christmas
  (Dec 25, 2007) basically added two new modes for how mempolicy
  nodemasks were to be resolved:
   1) a static, no remap, mode, such as in this present patchset, and
   2) a cpuset relative mode that correctly handled the case of cpusets
  growing larger (increased numbers of nodes.)
  
  I'd like to persuade you to add case (2) as well.  But I suspect,
  given that case (2) was never as compelling to you as it was to me
  in our December discussions, that I'll have little luck doing that.
  
 
 MPOL_F_STATIC_NODES already handles the second case because you can 
 specify nodes that aren't currently accessible because of your cpuset in 
 the hopes that eventually they will become accessible.  It's possible to 
 pass a nodemask with all bits set so that when the cpuset's set of nodes 
 expand, the mempolicy is effected over the intersection of the two on 
 rebind.
 
 Other than that, perhaps if you can elaborate more on what you imagined 
 supporting as far as cpusets growing larger (or supporting node hotplug, 
 which is the same type of problem), we can discuss that further before I 
 resend my patches.
 
  Notes and questions on your current patchset:
  
   1) It looks like you -add- a second nodemask to struct mempolicy:
  nodemask_t cpuset_mems_allowed; /* mempolicy relative to these 
  nodes */
  nodemask_t user_nodemask;   /* nodemask passed by user */
  
  I believe you can overlay these two nodemasks using a union, and avoid 
  making
  struct mempolicy bigger.
  
 
 Ahh, since policy-cpuset_mems_allowed is only meaningful in the 
 non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
 of this patchset, we can certainly do that.  I'm wondering whether future 
 expansions will require them to be separated again, however.
 
   2) Those 'mpol_mode()' wrappers on all those mempolicy-policy
  evaluations look dangerous to me.  It looks like a code bug
  waiting to happen.  Unless I miss my guess, if you forget one of
  those wrappers (or someone making a future change misses one), no
  one will notice until sometime at runtime, someone makes use of the
  new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
  end up executing code that we didn't expect to execute just then.
  
  I urge you to reconsider, and keep it so that the 'policy' field of 
  struct
  mempolicy naturally evaluates to just the policy.  There should be just 
  one
  pair of places, on entry to, and exit from, the kernel, where the code 
  is
  aware of the need to pack the mode and the policy into a single word.
  
 
 Ok.

I was hoping David wouldn't cave on this point. ;-)  I do agree with
David that if we ever end up with nr_modes + nr_mode_flags  16, the API
will become too complex for myself, at least, to comprehend.  I don't
feel too strongly either way, and I'll explain more below, but first:

I do agree with Paul that there is a danger of missing one of these in
converting existing code.  In fact, I missed one in my ref counting
rework patch that I will ultimately rebase atop David's final version
[I'm already dependent on Mel Gorman's patch series].  To catch this, I
decided to rename the policy member to mode.  This also aligns the
struct better with the numa_memory_policy doc [I think of the policy
as the tuple:  mode + optional nodemask].  For future code, I think we
could add comments to the struct definition warning that one should use
the wrappers to access the mode or mode flags.

However, let's step back and look at the usage of the mempolicy struct.
In earlier mail, David mentioned that it is passed around outside of
mempolicy.c.  While this is true, the only place that I see where code
outside of mempolicy.c deals with the policy/mode and nodemask
separately--as opposed to using an opaque mempolicy struct--is in the
shmem mount option parsing.  Everywhere else, as far as I can see, just
uses the struct mempolicy.  Even slab/slub call into slab_node() in
mempolicy.c to extract the target node for the policy.

So, if David does accept Paul's plea to separate the mode and the mode
flags in the structure, I would suggest that we do this in mpol_new().
That is, in {sys|do}_{set_mempolicy|mbind}(), maintain the API format
with the mode flags ORed into the mode.  mpol_new() can pull them apart
into different struct mempolicy members.  The rest of mempolicy.c can
use them directly from the struct without the helpers.  get_mempolicy()
will need to pack the mode and flags back together--perhaps with an
inverse helper function.

As for the shmem mount option parsing:  For now, I'd suggest that we
keep the mode flags OR'd into the policy member of the shmem_sb_info
struct -- i.e., the API format--to preserve the mpol_new() interface.
At some point 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
David, responding to pj, write:
   2) a cpuset relative mode that correctly handled the case of cpusets
  growing larger (increased numbers of nodes.)
  
  I'd like to persuade you to add case (2) as well.  But I suspect,
  given that case (2) was never as compelling to you as it was to me
  in our December discussions, that I'll have little luck doing that.
  
 
 MPOL_F_STATIC_NODES already handles the second case because you can 
 specify nodes that aren't currently accessible because of your cpuset in 
 the hopes that eventually they will become accessible.

No -- MPOL_F_STATIC_NODES does not handle my second case.  Notice the
phrase 'cpuset relative.'

In my second case, nodes are numbered relative to the cpuset.  If you
say node 2 then you mean whatever is the third (since numbering is
zero based) node in your current cpuset.

This cpuset relative mode that I'm proposing is an entirely different
mode of numbering nodes than anything we've seen so far (except for our
side discussions with just yourself, Lee and Christoph, last December.)

In this mode, node 2 doesn't mean what the system calls node 2; it
means the third node in whatever is ones current cpuset placement (if
your cpuset even has that many nodes), and mempolicies using this mode
are automatically remapped by the kernel, anytime the cpuset placement
changes.

This second, cpuset relative, mode is required:

1) to provide a race-free means for an application to place its memory
   when the application considers all physical nodes essentially
   equivalent, and just wants to lay things out relative to whatever
   cpuset it happens to be running in, and

2) to provide a practical means, without the need for constantly
   reprobing ones current cpuset placement, for an application to
   specify cpuset-relative placement to be applied whenever the
   application is placed in a larger cpuset, even if the application
   is presently in a smaller cpuset.

Without it, the application has to first query its current physical
cpuset placement, then calculate its desired relative placement into
terms of the current physical nodes it finds itself on, and then issue
various mbind or set_mempolicy calls using those current physical node
numbers, praying that it wasn't migrated to other physical nodes at the
same time, which would have invalidated its current placement
assumptions in its relative to physical node number calculations.

And without it, if an application is currently placed in a smaller cpuset
than it would prefer, it cannot specify how it would want its mempolicies
should it ever subsequently be moved to a larger cpuset.  This leaves such
an application with little option other than to constantly reprobe its
cpuset placement, in order to know if and when it needs to reissue its
mbind and set_mempolicy calls because it gained access to a larger cpuset.

I agree, David, that this present MPOL_F_STATIC_NODES patch handles the
case of a growing cpuset (or hotplug added nodes) for the static mapped
case (node 2 means physical system node 2, always.)  But this
present patch, by design, does not address the case of a growing cpuset
for the case where an application actually wants its mempolicies remapped.

My original code remapping mempolicies when cpuset placement changes,
that has been in the kernel for a couple of years now, was -supposed-
to handle this relative case.  But it is flawed, as it fails to meet
the requirements I list above.  We can't just change the way that mode
works now, for compatibility reasons.  So we need to add a new cpuset
relative mode, just as you're adding a STATIC mode, that addresses the
above requirements for a cpuset relative, remapped, numbering of
mempolicy nodes.


 Other than that, perhaps if you can elaborate more on what you imagined 
 supporting as far as cpusets growing larger (or supporting node hotplug, 
 which is the same type of problem), we can discuss that further before I 
 resend my patches.

Ok - I tried to elaborate, above.  You (David), Lee and Christoph will
perhaps recognize this elaboration, as it essentially repeats things
I said in our earlier discussion in December and early January.

Yes - hotplug presents the same problems as cpusets growing larger.

If this makes sense to you, David, and you'd like to include this
second, cpuset relative mode, in your patchset, that would be excellent.

Given that I have not been very good at explaining this second mode
you might choose not to do that; in that case I'll have to follow up,
after your second patch shapes up, with a patch of my own, adding this
cpuset relative mode.


  I believe you can overlay these two nodemasks using a union, and
  avoid making struct mempolicy bigger.
 
 Ahh, since policy-cpuset_mems_allowed is only meaningful in the 
 non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
 of this patchset, we can certainly do that.  I'm wondering whether future 
 expansions will require them to be 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
David wrote:
 Yeah, it gets tricky.

If you can stand to read my code, look at that big patch I sent you
and Lee, dated Sun, 23 Dec 2007 22:24:02 -0600, under the Subject
of Re: [RFC] cpuset relative memory policies - second choice,
where I did this, keeping the 'policy' field unchanged in struct
mempolicy and adding additional fields for the (three, in my case)
new modes.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

2) Those 'mpol_mode()' wrappers on all those mempolicy-policy
   evaluations look dangerous to me.  It looks like a code bug
   waiting to happen.  Unless I miss my guess, if you forget one of
   those wrappers (or someone making a future change misses one), no
   one will notice until sometime at runtime, someone makes use of the
   new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
   end up executing code that we didn't expect to execute just then.
   
   I urge you to reconsider, and keep it so that the 'policy' field of 
   struct
   mempolicy naturally evaluates to just the policy.  There should be 
   just one
   pair of places, on entry to, and exit from, the kernel, where the 
   code is
   aware of the need to pack the mode and the policy into a single word.
   
  
  Ok.
 
 I was hoping David wouldn't cave on this point. ;-)  I do agree with
 David that if we ever end up with nr_modes + nr_mode_flags  16, the API
 will become too complex for myself, at least, to comprehend.  I don't
 feel too strongly either way, and I'll explain more below, but first:
 

Hmm, I don't remember caving on this point; Paul asked me to reconsider 
and I said that I would.  I think it's also helpful to have more 
discussion on the matter by including other people's opinions.

 I do agree with Paul that there is a danger of missing one of these in
 converting existing code.  In fact, I missed one in my ref counting
 rework patch that I will ultimately rebase atop David's final version
 [I'm already dependent on Mel Gorman's patch series].  To catch this, I
 decided to rename the policy member to mode.  This also aligns the
 struct better with the numa_memory_policy doc [I think of the policy
 as the tuple:  mode + optional nodemask].  For future code, I think we
 could add comments to the struct definition warning that one should use
 the wrappers to access the mode or mode flags.
 

I considered this when I implemented it the way I did, and that was part 
of the reason I was in favor of enum mempolicy_mode so much: functions 
that had a formal of type 'enum mempolicy_mode' were only the mode and 
formals of type 'int' or 'unsigned short' were the combination.  I even 
stated that difference in Documentation/vm/numa_memory_policy.txt in the 
first patchset.

 However, let's step back and look at the usage of the mempolicy struct.
 In earlier mail, David mentioned that it is passed around outside of
 mempolicy.c.  While this is true, the only place that I see where code
 outside of mempolicy.c deals with the policy/mode and nodemask
 separately--as opposed to using an opaque mempolicy struct--is in the
 shmem mount option parsing.  Everywhere else, as far as I can see, just
 uses the struct mempolicy.  Even slab/slub call into slab_node() in
 mempolicy.c to extract the target node for the policy.
 

Yes, struct shmem_sb_info stores the 'policy' member as well so it would 
need to include a new 'flags' member as well.  And the interface between 
the two, mpol_shared_policy_init() would need another formal for the 
flags.

 So, if David does accept Paul's plea to separate the mode and the mode
 flags in the structure, I would suggest that we do this in mpol_new().
 That is, in {sys|do}_{set_mempolicy|mbind}(), maintain the API format
 with the mode flags ORed into the mode.  mpol_new() can pull them apart
 into different struct mempolicy members.  The rest of mempolicy.c can
 use them directly from the struct without the helpers.  get_mempolicy()
 will need to pack the mode and flags back together--perhaps with an
 inverse helper function.
 

Yeah, it gets tricky.  I'm not too sure about only pulling the mode and 
flags apart at mpol_new() time because perhaps, in the future, there will 
be flag and mode combinations that are only applicable for set_mempolicy() 
and not for mbind(), for example.  I can imagine that someday we may add a 
flag that applies only to a task mempolicy and not to a VMA mempolicy.

 As for the shmem mount option parsing:  For now, I'd suggest that we
 keep the mode flags OR'd into the policy member of the shmem_sb_info
 struct -- i.e., the API format--to preserve the mpol_new() interface.

I don't like this because its not consistent.  It increases the confusion 
of what contains a mode and what contains the combination.  I think the 
safest thing to do, if we separate the mode and flags out in struct 
mempolicy is to do it everywhere.  All helper functions will need 
additional formals similar to what I did in the first patchset (that was 
actually unpopular) and struct shmem_sb_info need to be modified to 
include the additional member.

In other words, I'm all in favor of storing the mode and flags however we 
want, but I think it should be done consistenty throughout the tree.  
While mpol_mode() wrappers may not be favorable all over the place (and I 
didn't miss any), it may be easier 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

   If we just want to preserve existing behavior, we can define an
   additional mode flag that we set in the sbinfo policy in
   shmem_parse_mpol() and test in mpol_new().  If we want to be able to
   specify existing or new behavior, we can use the same flag, but set it
   or not based on an additional qualifier specified via the mount option.
   
  
  Shmem areas between cpusets with disjoint mems_allowed seems like an error 
  in userspace to me and I would prefer that mpol_new() reject it outright.
 
 I don't think so, and I'd like to explain why.  Perhaps this is too big
 a topic to cover in the context of this patch response.  I think I'll
 start another thread.  Suffice it to say here that cpusets and
 mempolicies don't constrain, in anyway, the ability of tasks in a cpuset
 to attach to shmem segments created by tasks in other, perhaps disjoint,
 cpusets.  The cpusets and mempolicies DO constrain page allocations,
 however.  This can result in OOM faults for tasks in cpusets whose
 mems_allowed contains no overlap with a shared policy installed by the
 creating task.  This can be avoided if the task that creates the shmem
 segment faults in all of the pages and SHM_LOCKs them down.  Whether
 this is sufficient for all applications is subject of the other
 discussion.  In any case, I think we need to point this out in mempolicy
 and cpuset man pages.
 

I don't think we're likely to see examples of actual code written to do 
this being posted to refute my comment.

I think what you said above is right and that it should be allowed.  
You're advocating for allowing task A to attach to shmem segments created 
by task B while task A has its own mempolicy that restricts its own 
allocations but still being able to access the shmem segments of task B, 
providing that task A will not fault them back in itself.

You're right, that's not a scenario that I was hoping to address to 
MPOL_F_STATIC_NODES.

  Since we're allowed to remap the node to a different node than the user 
  specified with either syscall, the current behavior is that one node is 
  as good as another.  In other words, we're trying to accomodate the mode 
  first by setting pol-v.preferred_node to some accessible node and only 
  setting that to the user-supplied node if it is available.
  
  If the node isn't available and the user specifically asked that it is not 
  remapped (with MPOL_F_STATIC_NODES), then I felt local allocation was best 
  compared to remapping to a node that may be unrelated to the VMA or task.  
  This preserves a sense of the current behavior that one node is as good 
  as another, but the user specifically asked for no remap.
 
 Yeah.  I went back and read the update to the mempolicy doc where you
 make it clear that this is the semantic of 'STATIC_NODES.  You also
 mentioned it in the patch description, but I didn't get it.  Sorry.  
 
 Still a comment in the code would, I think, help future spelunkers.
 

Ok, I'll clarify the intention in this code that we agreed was better:

case MPOL_PREFERRED:
if (!remap) {
int nid = first_node(pol-user_nodemask);

if (node_isset(nid, *newmask))
pol-v.preferred_node = nid;
else
pol-v.preferred_node = -1;
} else
pol-v.preferred_node = 
node_remap(pol-v.preferred_node,
*mpolmask, *newmask);
break;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Lee Schermerhorn
On Wed, 2008-02-13 at 10:48 -0800, David Rientjes wrote:
 On Wed, 13 Feb 2008, Lee Schermerhorn wrote:
 
 2) Those 'mpol_mode()' wrappers on all those mempolicy-policy
evaluations look dangerous to me.  It looks like a code bug
waiting to happen.  Unless I miss my guess, if you forget one of
those wrappers (or someone making a future change misses one), no
one will notice until sometime at runtime, someone makes use of the
new fangled MPOL_F_STATIC_NODES mode in some new situation, and we
end up executing code that we didn't expect to execute just then.

I urge you to reconsider, and keep it so that the 'policy' field of 
struct
mempolicy naturally evaluates to just the policy.  There should be 
just one
pair of places, on entry to, and exit from, the kernel, where the 
code is
aware of the need to pack the mode and the policy into a single 
word.

   
   Ok.
  
  I was hoping David wouldn't cave on this point. ;-)  I do agree with
  David that if we ever end up with nr_modes + nr_mode_flags  16, the API
  will become too complex for myself, at least, to comprehend.  I don't
  feel too strongly either way, and I'll explain more below, but first:
  
 
 Hmm, I don't remember caving on this point; Paul asked me to reconsider 
 and I said that I would.  I think it's also helpful to have more 
 discussion on the matter by including other people's opinions.

 
  I do agree with Paul that there is a danger of missing one of these in
  converting existing code.  In fact, I missed one in my ref counting
  rework patch that I will ultimately rebase atop David's final version
  [I'm already dependent on Mel Gorman's patch series].  To catch this, I
  decided to rename the policy member to mode.  This also aligns the
  struct better with the numa_memory_policy doc [I think of the policy
  as the tuple:  mode + optional nodemask].  For future code, I think we
  could add comments to the struct definition warning that one should use
  the wrappers to access the mode or mode flags.
  
 
 I considered this when I implemented it the way I did, and that was part 
 of the reason I was in favor of enum mempolicy_mode so much: functions 
 that had a formal of type 'enum mempolicy_mode' were only the mode and 
 formals of type 'int' or 'unsigned short' were the combination.  I even 
 stated that difference in Documentation/vm/numa_memory_policy.txt in the 
 first patchset.
 
  However, let's step back and look at the usage of the mempolicy struct.
  In earlier mail, David mentioned that it is passed around outside of
  mempolicy.c.  While this is true, the only place that I see where code
  outside of mempolicy.c deals with the policy/mode and nodemask
  separately--as opposed to using an opaque mempolicy struct--is in the
  shmem mount option parsing.  Everywhere else, as far as I can see, just
  uses the struct mempolicy.  Even slab/slub call into slab_node() in
  mempolicy.c to extract the target node for the policy.
  
 
 Yes, struct shmem_sb_info stores the 'policy' member as well so it would 
 need to include a new 'flags' member as well.  And the interface between 
 the two, mpol_shared_policy_init() would need another formal for the 
 flags.
 
  So, if David does accept Paul's plea to separate the mode and the mode
  flags in the structure, I would suggest that we do this in mpol_new().
  That is, in {sys|do}_{set_mempolicy|mbind}(), maintain the API format
  with the mode flags ORed into the mode.  mpol_new() can pull them apart
  into different struct mempolicy members.  The rest of mempolicy.c can
  use them directly from the struct without the helpers.  get_mempolicy()
  will need to pack the mode and flags back together--perhaps with an
  inverse helper function.
  
 
 Yeah, it gets tricky.  I'm not too sure about only pulling the mode and 
 flags apart at mpol_new() time because perhaps, in the future, there will 
 be flag and mode combinations that are only applicable for set_mempolicy() 
 and not for mbind(), for example.  I can imagine that someday we may add a 
 flag that applies only to a task mempolicy and not to a VMA mempolicy.

True.  Altho' at such a time, I'd probably argue for testing for and
rejecting invalid mode/flag combinations in the respective
functions :-).


 
  As for the shmem mount option parsing:  For now, I'd suggest that we
  keep the mode flags OR'd into the policy member of the shmem_sb_info
  struct -- i.e., the API format--to preserve the mpol_new() interface.
 
 I don't like this because its not consistent.  It increases the confusion 
 of what contains a mode and what contains the combination.  I think the 
 safest thing to do, if we separate the mode and flags out in struct 
 mempolicy is to do it everywhere.  All helper functions will need 
 additional formals similar to what I did in the first patchset (that was 
 actually unpopular) and struct shmem_sb_info need to be 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Paul Jackson wrote:

 No -- MPOL_F_STATIC_NODES does not handle my second case.  Notice the
 phrase 'cpuset relative.'
 
 In my second case, nodes are numbered relative to the cpuset.  If you
 say node 2 then you mean whatever is the third (since numbering is
 zero based) node in your current cpuset.
 

Ok, so this truly is a new feature that isn't addressed either by the 
current implementation or my patchset.  Fair enough.

You're specifically trying to avoid having the application know about its 
cpuset placement with regard to mems at the time it sets up its mempolicy, 
right?  Otherwise it could already setup this relative nodemask by 
selecting node 2, from your example above, in its mems_allowed and it 
would always be remapped appropriately.

 In this mode, node 2 doesn't mean what the system calls node 2; it
 means the third node in whatever is ones current cpuset placement (if
 your cpuset even has that many nodes), and mempolicies using this mode
 are automatically remapped by the kernel, anytime the cpuset placement
 changes.
 
 This second, cpuset relative, mode is required:
 
 1) to provide a race-free means for an application to place its memory
when the application considers all physical nodes essentially
equivalent, and just wants to lay things out relative to whatever
cpuset it happens to be running in, and
 
 2) to provide a practical means, without the need for constantly
reprobing ones current cpuset placement, for an application to
specify cpuset-relative placement to be applied whenever the
application is placed in a larger cpuset, even if the application
is presently in a smaller cpuset.
 

So, for example, if the task is bound by mems 1-3, and it asks for 
MPOL_INTERLEAVE over 2-4, then initially the mempolicy is only effected 
over node 3 and if it's later expanded to mems 1-8, then the mempolicy is 
effected over nodes 3-5, right?

And if the mems change to 3-8, the mempolicy is remapped to 5-7 even 
though 3-5 (which it already was interleaving over) is still accessible?

 I agree, David, that this present MPOL_F_STATIC_NODES patch handles the
 case of a growing cpuset (or hotplug added nodes) for the static mapped
 case (node 2 means physical system node 2, always.)  But this
 present patch, by design, does not address the case of a growing cpuset
 for the case where an application actually wants its mempolicies remapped.
 

Does MPOL_INTERLEAVE | MPOL_F_STATIC_NODES | MPOL_F_PAULS_NEW_FLAG make 
any logical sense?  If it does, I think we're going to be writing some 
very complex remap code in our future.

  Ahh, since policy-cpuset_mems_allowed is only meaningful in the 
  non-MPOL_F_STATIC_NODES case, that probably will work.  For the purposes 
  of this patchset, we can certainly do that.  I'm wondering whether future 
  expansions will require them to be separated again, however.
 
 I'd suggest we let future expansions deal with their own needs.  We
 don't usually pad internal (not externally visible) data structures
 in anticipation that someone else might need the space in the future.
 
 At least earlier, Andi Kleen, when he was the primary author and sole
 active maintainer of this mempolicy code, was always keen to avoid
 expanding the size of 'struct mempolicy' by another nodemask.
 
 I have not done the calculations myself to know how vital it is to
 keep the size of struct mempolicy to a minimum.  It certainly seems
 worth a bit of effort, however, if adding this union of these two
 nodemasks doesn't complicate the code too horribly much.
 

I think it will work very nicely and the benefit is immediately obvious 
for systems that have large nodemasks.

   I urge you to reconsider, and keep it so that the 'policy' field of 
   struct
   mempolicy naturally evaluates to just the policy.  There should be 
   just one
   pair of places, on entry to, and exit from, the kernel, where the 
   code is
   aware of the need to pack the mode and the policy into a single word.
   
  
  Ok.
 
 Cool.  Thanks.  (I'm glad you caved ... ;).  Looking forward in my inbox, I 
 see
 that Lee has some suggestions on where to handle the conversion between the
 packed mode and the separate fields.  I'm too lazy to think about that more,
 and will likely acquiesce to whatever you and Lee agree to.
 

Well, I didn't cave on anything, I said that we can reconsider it in the 
hopes that other people would add their feedback.  I think continuing to 
discuss this matter with yourself and Lee (and whomever else is 
interested) will lead us to the correct solution.  Since this is an 
internal implementation detail, I think it's important to hear other 
people's opinions since we're the ones who will be hacking the code in the 
future so it's really our opinions that matter.

  The user's nodemask is always stored unaltered in policy-user_nodemask.  
 
 Ah - good.  I missed that.  Just to be sure I'm reading the code right,
 I take it 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

  Yeah, it gets tricky.  I'm not too sure about only pulling the mode and 
  flags apart at mpol_new() time because perhaps, in the future, there will 
  be flag and mode combinations that are only applicable for set_mempolicy() 
  and not for mbind(), for example.  I can imagine that someday we may add a 
  flag that applies only to a task mempolicy and not to a VMA mempolicy.
 
 True.  Altho' at such a time, I'd probably argue for testing for and
 rejecting invalid mode/flag combinations in the respective
 functions :-).
 

Yeah, and that would require the modes and flags to be split apart in 
sys_set_mempolicy() and sys_mbind() or at least in do_set_mempolicy() or 
do_mbind().  So if we see the possibility that we want to test for mode 
and flag combinations that perhaps differ between the set_mempolicy() and 
mbind() case, we have to do it in either of those two places.  I think we 
should do it there, as early as possible, like I did in my first patchset.

 OK.  I'm caving... :-)  Different views of consistency.  But,
 eventually, I hope we can replace the separate mode[, flags] and
 nodemask in the 'sb_info with a mempolicy and keep the details of modes,
 flags, etc. internal to mempolicy.c.
 

I agree, I think keeping all of these implementation details inside 
mm/mempolicy.c is the best practice.  We'll still need to expose some of 
the details, such as the parsing of '=static' in the tmpfs mount option to 
add the MPOL_F_STATIC_NODES flag to the policy, but situations like that 
should be rare.

For extendability, I agree that the struct shmem_sb_info member should be 
a pointer to a mempolicy and not an int.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread Paul Jackson
David wrote:
 You're specifically trying to avoid having the application know about its 
 cpuset placement with regard to mems at the time it sets up its mempolicy, 
 right?  Otherwise it could already setup this relative nodemask by 
 selecting node 2, from your example above, in its mems_allowed and it 
 would always be remapped appropriately.

Yes, if an application considers nodes to be interchangeable, I'm
trying to avoid having that application -have- to know its current
cpuset placement, for two reasons:

For one thing, it's racey.  It's cpuset placement could change,
unbeknownst to it, between the time it queried it, and the time
that it issued the mbind or set_mempolicy call.

For the other thing, it's not always possible.  If the application
is currently in a cpuset that is smaller than it's preferred
configuration, it would not be possible to express its preferred
memory policies using just the smaller number of memory nodes
allowed by its current cpuset placement.  How do you say put
this on my third node if you don't have a third node and you
can only speak of the nodes you currently have?


 So, for example, if the task is bound by mems 1-3, and it asks for 
 MPOL_INTERLEAVE over 2-4, then initially the mempolicy is only effected 
 over node 3 and if it's later expanded to mems 1-8, then the mempolicy is 
 effected over nodes 3-5, right?
 
 And if the mems change to 3-8, the mempolicy is remapped to 5-7 even 
 though 3-5 (which it already was interleaving over) is still accessible?

Yes and yes, for this cpuset relative numbering mode.

 Does MPOL_INTERLEAVE | MPOL_F_STATIC_NODES | MPOL_F_PAULS_NEW_FLAG make 
 any logical sense?  If it does, I think we're going to be writing some 
 very complex remap code in our future.


No -- MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES (which is what I'll
likely call my new flag) are mutually exclusive.

The allowed modes and flags would be:
  Choose exactly one of:
MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE
  Plus zero or one of:
MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES
  where, if you choose neither of tne MPOL_F_*_NODES flags,
  then you get the classic, compatible nodemask handling.

 Well, I didn't cave on anything 

;)   Your simple ok was ambiguous enough that we were able to
 read into it whatever we wanted to.

But I've made my case on that issue (involving the separate or
packed policy flag field).  So I probably won't say more, and
I expect to live with whatever you choose, after any further
input from Lee or others.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-13 Thread David Rientjes
On Wed, 13 Feb 2008, Paul Jackson wrote:

 Yes, if an application considers nodes to be interchangeable, I'm
 trying to avoid having that application -have- to know its current
 cpuset placement, for two reasons:
 
 For one thing, it's racey.  It's cpuset placement could change,
 unbeknownst to it, between the time it queried it, and the time
 that it issued the mbind or set_mempolicy call.
 
 For the other thing, it's not always possible.  If the application
 is currently in a cpuset that is smaller than it's preferred
 configuration, it would not be possible to express its preferred
 memory policies using just the smaller number of memory nodes
 allowed by its current cpuset placement.  How do you say put
 this on my third node if you don't have a third node and you
 can only speak of the nodes you currently have?
 

So let's say, like my first example from the previous email, that you have 
MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES over nodes 3-4 and your cpuset's 
mems is only nodes 5-7.  This would interleave over no nodes.  Correct?

It seems like MPOL_F_RELATIVE_NODES is primarily designed to maintain a 
certain order among the nodes it effects the mempolicy over.  It comes 
with the premise that the task doesn't already know it's cpuset mems 
(otherwise, the current implementation without MPOL_F_STATIC_NODES would 
work fine for this) so it doesn't really care what nodes it allocates 
pages on, it just cares about the order.

This works for MPOL_PREFERRED and MPOL_BIND as well, right?

I don't understand the use case for this (at all), but if you have 
workloads that require this type of setting then I can implement this as 
part of my series.  I just want to confirm that there are real world cases 
backing this so that we don't have flags with highly highly specialized 
cornercases.

 [ If a user _does_ specify MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES
   as part of their syscall, then we'll simply return -EINVAL. ]

  Well, I didn't cave on anything 
 
 ;)   Your simple ok was ambiguous enough that we were able to
  read into it whatever we wanted to.
 
 But I've made my case on that issue (involving the separate or
 packed policy flag field).  So I probably won't say more, and
 I expect to live with whatever you choose, after any further
 input from Lee or others.
 

Well, there's advantages and disadvantages to either approach.

My preference (both mode and flags stored in the same member of struct 
mempolicy):

   Advantages:

- completely consistent with the userspace API of passing modes
  and flags together in a pointer to an int, and

- does not require additional formals to be added to several
  functions, including functions outside mm/mempolicy.c.

   Disadvantage:

- use of mpol_mode() throughout mm/mempolicy.c code to mask
  off optional mode flags for conditionals or switch statements.

Your preference (separate mode and flags members in struct mempolicy):

   Advantages:

- clearer implementation when dealing with modes: all existing
  statements involving pol-policy can remain unchanged.

   Disadvantages:

- requires additional formals to be added to several functions,
  including functions outside mm/mempolicy.c, and

- takes additional space in struct mempolicy (two bytes) which
  could eventually be used for something else.

In both cases the testing of mode flags is the same as before:

if (pol-policy  MPOL_F_STATIC_NODES) {
...
}
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, David Rientjes wrote:

> Since we're allowed to remap the node to a different node than the user 
> specified with either syscall, the current behavior is that "one node is 
> as good as another."  In other words, we're trying to accomodate the mode 
> first by setting pol->v.preferred_node to some accessible node and only 
> setting that to the user-supplied node if it is available.
> 
> If the node isn't available and the user specifically asked that it is not 
> remapped (with MPOL_F_STATIC_NODES), then I felt local allocation was best 
> compared to remapping to a node that may be unrelated to the VMA or task.  
> This preserves a sense of the current behavior that "one node is as good 
> as another," but the user specifically asked for no remap.
> 

Upon further inspection, perhaps it's better to fallback to local 
allocation when the preferred node is not available on rebind and then, if 
it becomes available later, prefer it again.

In mpol_rebind_policy():

case MPOL_PREFERRED:
if (!remap) {
int nid = first_node(pol->user_nodemask);

if (node_isset(nid, *newmask))
pol->v.preferred_node = nid;
else {
/*
 * Fallback to local allocation since that
 * is the behavior in mpol_new().  The
 * node may eventually become available.
 */
pol->v.preferred_node = -1;
}
} else
pol->v.preferred_node = 
node_remap(pol->v.preferred_node,
*mpolmask, *newmask);
break;

What do you think, Lee?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Paul Jackson wrote:

> > I've redone my patchset based on the feedback that I've received
> 
> Will you be sending that along soon?  I was just getting into my
> review of this patchset, and I suppose it would be better to
> review the latest and greatest.
> 

Lee had some questions and comments that I've recently addressed so I 
wanted to give him a chance to respond before sending it out again in case 
more changes need to be made.

I'll email the current set to you now.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Lee Schermerhorn wrote:

> > Adds another member to struct mempolicy,
> > 
> > nodemask_t user_nodemask
> > 
> > that stores the the nodemask that the user passed when he or she created
> > the mempolicy via set_mempolicy() or mbind().  When using
> > MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
> > passed nodemask is always used when determining the preferred node,
> > building the MPOL_BIND zonelist, or creating the interleave nodemask.
> > This happens whenever the policy is rebound, including when a task's
> > cpuset assignment changes or the cpuset's mems are changed.
> 
> When you say that the user's nodemask is "always used" you mean "after
> cpuset contextualization", right?  I.e., after masking with mems_allowed
> [which also restricts to nodes with memory].  That is what the code
> still does.  
> 

Yeah, I'm not introducing a loophole so that tasks can access memory to 
which they don't have access.  I've clarified that for the next version.

> > It is also possible to mount tmpfs with the static nodemask behavior when
> > specifying a node or nodemask.  To do this, simply add "=static"
> > immediately following the mempolicy mode at mount time:
> > 
> > mount -o remount mpol=interleave=static:1-3
> > 
> > Also removes mpol_check_policy() and folds some of its logic into
> > mpol_new() since it is now mostly obsoleted.
> 
> Couple of comments here:
> 
> 1) we've discussed the issue of returning EINVAL for non-empty nodemasks
> with MPOL_DEFAULT.  By removing this restriction, we run the risk of
> breaking applications if we should ever want to define a semantic to
> non-empty node mask for MPOL_DEFAULT.   I think this is probably
> unlikely, but consider the new mode flags part of the mode/policy
> parameter:  by not rejecting undefined flags, we may break applications
> by later defining additional flags.  I'd argue for fairly strict
> argument checking.
> 

As I've mentioned in a parallel thread, I've folded the entire logic of 
mpol_check_policy() as it stands this minute in Linus' tree into 
mpol_new().

> 2) Before this patch, policy_nodemask from the shmem_sb_info was NOT
> masked with allowed nodes because we passed this mask directly to
> mpol_new() without "contextualization".  We DO guarantee that this
> policy nodemask contains only nodes with memory [see
> shmem_parse_mpol()].  Now that you've moved the contextualization to
> mpol_new(), we are masking this policy with the cpuset mems allowed.
> This is a change in behavior.  Do we want this?  I.e., are we preserving
> the "intent" of the system administrator in setting the tmpfs policy?
> Maybe they wanted to shared interleaved shmem areas between cpusets with
> disjoint mems allowed.  Nothing prevents this...
> 

We're still saving the intent in the new policy->user_nodemask field so 
any future rebinds will still allow unaccessible nodes be effected by the 
mempolicy if the permissions are changed later.

> If we just want to preserve existing behavior, we can define an
> additional mode flag that we set in the sbinfo policy in
> shmem_parse_mpol() and test in mpol_new().  If we want to be able to
> specify existing or new behavior, we can use the same flag, but set it
> or not based on an additional qualifier specified via the mount option.
> 

Shmem areas between cpusets with disjoint mems_allowed seems like an error 
in userspace to me and I would prefer that mpol_new() reject it outright.

> > @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
> > mode,
> > return ERR_PTR(-ENOMEM);
> > flags &= MPOL_MODE_FLAGS;
> > atomic_set(>refcnt, 1);
> > +   cpuset_update_task_memory_state();
> > +   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
> > switch (mode) {
> > case MPOL_INTERLEAVE:
> > -   policy->v.nodes = *nodes;
> > +   if (nodes_empty(*nodes))
> > +   return ERR_PTR(-EINVAL);
> > +   policy->v.nodes = cpuset_context_nmask;
> > if (nodes_weight(policy->v.nodes) == 0) {
> > kmem_cache_free(policy_cache, policy);
> > return ERR_PTR(-EINVAL);
> > }
> > break;
> > case MPOL_PREFERRED:
> > -   policy->v.preferred_node = first_node(*nodes);
> > +   policy->v.preferred_node = first_node(cpuset_context_nmask);
> > if (policy->v.preferred_node >= MAX_NUMNODES)
> > policy->v.preferred_node = -1;
> > break;
> > case MPOL_BIND:
> > -   policy->v.zonelist = bind_zonelist(nodes);
> > +   if (nodes_empty(*nodes))
> 
> Kosaki-san already pointed out the need to free the policy struct
> here.  
> 

I folded your fix to have only one

kmem_cache_free(policy_cache, policy); 
return ERR_PTR(error_code);

point in mpol_new().

> > @@ -1780,51 +1737,67 @@ static void mpol_rebind_policy(struct mempolicy 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Paul Jackson
David wrote:
> I've redone my patchset based on the feedback that I've received

Will you be sending that along soon?  I was just getting into my
review of this patchset, and I suppose it would be better to
review the latest and greatest.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Paul Jackson wrote:

> > 1) we've discussed the issue of returning EINVAL for non-empty nodemasks
> > with MPOL_DEFAULT.  By removing this restriction, we run the risk of
> > breaking applications if we should ever want to define a semantic to
> > non-empty node mask for MPOL_DEFAULT. 
> 
> The bigger risk, in my view, is breaking some piece of existing user code.
> Properly written user code wouldn't break, but that doesn't mean much.
> Changes, even minor corner case changes, often break something, so should
> not be done with out cause.  Whether or not code cleanup in mempolicy.c is
> sufficient cause here is not clear to me.
> 
> Future room for growth doesn't mean so much for me here; if we close one
> future alternative, we always have others, such as more mode flag bits.
> 

I've redone my patchset based on the feedback that I've received, and in 
my latest revisions I folded the entire equivalent of mpol_check_policy() 
into mpol_new().

Lee, you feel strongly that non-empty nodemasks passed with MPOL_DEFAULT 
should be considered invalid and rejected by the kernel, as the current 
implementation does.  I've brought up a counter-argument based on the 
set_mempolicy() man page and the Linux documentation that don't specify 
anything about what the nodemask shall contain if it's not a NULL pointer.

My position was to give the user the benefit of the doubt.  Because Linux 
has been vague in specifying what the nodemask shall contain, that (to me) 
means that it can contain anything.  It's undefined, in a standards sense.  
The only thing that we should look for is whether the user passed 
MPOL_DEFAULT as the first parameter and then we should effect that policy 
because it's clearly the intention.

I don't think there's a super strong case for either behavior, and that's 
why I just folded the mpol_check_policy() logic straight into mpol_new().  

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Paul Jackson
Lee wrote:
> 1) we've discussed the issue of returning EINVAL for non-empty nodemasks
> with MPOL_DEFAULT.  By removing this restriction, we run the risk of
> breaking applications if we should ever want to define a semantic to
> non-empty node mask for MPOL_DEFAULT. 

The bigger risk, in my view, is breaking some piece of existing user code.
Properly written user code wouldn't break, but that doesn't mean much.
Changes, even minor corner case changes, often break something, so should
not be done with out cause.  Whether or not code cleanup in mempolicy.c is
sufficient cause here is not clear to me.

Future room for growth doesn't mean so much for me here; if we close one
future alternative, we always have others, such as more mode flag bits.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Lee Schermerhorn wrote:

> PATCH mempolicy - consolidate mpol_new() error paths
> 
> Use common error path in mpol_new() for errors that we discover
> after allocation the new mempolicy structure.  Free the mempolicy
> in the common error path.
> 
> Signed-off-by: Lee Schermerhorn <[EMAIL PROTECTED]>
> 
>  mm/mempolicy.c |   16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> Index: Linux/mm/mempolicy.c
> ===
> --- Linux.orig/mm/mempolicy.c 2008-02-12 15:18:12.0 -0700
> +++ Linux/mm/mempolicy.c  2008-02-12 15:22:07.0 -0700
> @@ -156,6 +156,7 @@ static struct mempolicy *mpol_new(enum m
>  {
>   struct mempolicy *policy;
>   nodemask_t cpuset_context_nmask;
> + void *error_code = ERR_PTR(-EINVAL);
>  
>   pr_debug("setting mode %d flags %d nodes[0] %lx\n",
>mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
> @@ -172,8 +173,7 @@ static struct mempolicy *mpol_new(enum m
>   switch (mode) {
>   case MPOL_INTERLEAVE:
>   if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
> - kmem_cache_free(policy_cache, policy);
> - return ERR_PTR(-EINVAL);
> + goto free_mpol;
>   }
>   policy->v.nodes = cpuset_context_nmask;
>   break;

You can also get rid of the parentheses to make checkpatch.pl happy, too.

> @@ -184,14 +184,12 @@ static struct mempolicy *mpol_new(enum m
>   break;
>   case MPOL_BIND:
>   if (nodes_empty(*nodes)) {
> - kmem_cache_free(policy_cache, policy);
> - return ERR_PTR(-EINVAL);
> + goto free_mpol;
>   }
>   policy->v.zonelist = bind_zonelist(_context_nmask);
>   if (IS_ERR(policy->v.zonelist)) {
> - void *error_code = policy->v.zonelist;
> - kmem_cache_free(policy_cache, policy);
> - return error_code;
> + error_code = policy->v.zonelist;
> + goto free_mpol;
>   }
>   break;
>   default:
> @@ -201,6 +199,10 @@ static struct mempolicy *mpol_new(enum m
>   policy->cpuset_mems_allowed = cpuset_mems_allowed(current);
>   policy->user_nodemask = *nodes;
>   return policy;
> +
> +free_mpol:
> + kmem_cache_free(policy_cache, policy);
> + return error_code;
>  }
>  
>  static void gather_stats(struct page *, void *, int pte_dirty);
> 

I'll fold this into my patchset when I repost it, thanks.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Lee Schermerhorn
On Mon, 2008-02-11 at 11:56 -0800, David Rientjes wrote: 
> On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:
> 
> > > @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum 
> > > mempolicy_mode mode,
> > > return ERR_PTR(-ENOMEM);
> > > flags &= MPOL_MODE_FLAGS;
> > > atomic_set(>refcnt, 1);
> > > +   cpuset_update_task_memory_state();
> > > +   nodes_and(cpuset_context_nmask, *nodes, 
> > > cpuset_current_mems_allowed);
> > > switch (mode) {
> > > case MPOL_INTERLEAVE:
> > > -   policy->v.nodes = *nodes;
> > > +   if (nodes_empty(*nodes))
> > > +   return ERR_PTR(-EINVAL);
> > 
> > need kmem_cache_free(policy_cache, policy) before return?
> > 
> 
> Very good catch!
> 
> 
> 
> mempolicy: fix policy memory leak in mpol_new()
> 
> If mpol_new() cannot setup a new mempolicy because of an invalid argument
> provided by the user, avoid leaking the mempolicy that has been dynamically
> allocated.
> 
> Reported by KOSAKI Motohiro.
> 
> Cc: Paul Jackson <[EMAIL PROTECTED]>
> Cc: Christoph Lameter <[EMAIL PROTECTED]>
> Cc: Lee Schermerhorn <[EMAIL PROTECTED]>
> Cc: Andi Kleen <[EMAIL PROTECTED]>
> Cc: KOSAKI Motohiro <[EMAIL PROTECTED]>
> Signed-off-by: David Rientjes <[EMAIL PROTECTED]>
> ---
>  mm/mempolicy.c |   10 +-
>  1 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -171,13 +171,11 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
> mode,
>   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
>   switch (mode) {
>   case MPOL_INTERLEAVE:
> - if (nodes_empty(*nodes))
> - return ERR_PTR(-EINVAL);
> - policy->v.nodes = cpuset_context_nmask;
> - if (nodes_weight(policy->v.nodes) == 0) {
> + if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
>   kmem_cache_free(policy_cache, policy);
>   return ERR_PTR(-EINVAL);
>   }
> + policy->v.nodes = cpuset_context_nmask;
>   break;
>   case MPOL_PREFERRED:
>   policy->v.preferred_node = first_node(cpuset_context_nmask);
> @@ -185,8 +183,10 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
> mode,
>   policy->v.preferred_node = -1;
>   break;
>   case MPOL_BIND:
> - if (nodes_empty(*nodes))
> + if (nodes_empty(*nodes)) {
> + kmem_cache_free(policy_cache, policy);
>   return ERR_PTR(-EINVAL);
> + }
>   policy->v.zonelist = bind_zonelist(_context_nmask);
>   if (IS_ERR(policy->v.zonelist)) {
>   void *error_code = policy->v.zonelist;

With this patch, we now have 3 error paths from mpol_new that need to
free the mempolicy struct.  How about consolidating them with something
like this [uncompiled/untested]:

PATCH mempolicy - consolidate mpol_new() error paths

Use common error path in mpol_new() for errors that we discover
after allocation the new mempolicy structure.  Free the mempolicy
in the common error path.

Signed-off-by: Lee Schermerhorn <[EMAIL PROTECTED]>

 mm/mempolicy.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: Linux/mm/mempolicy.c
===
--- Linux.orig/mm/mempolicy.c   2008-02-12 15:18:12.0 -0700
+++ Linux/mm/mempolicy.c2008-02-12 15:22:07.0 -0700
@@ -156,6 +156,7 @@ static struct mempolicy *mpol_new(enum m
 {
struct mempolicy *policy;
nodemask_t cpuset_context_nmask;
+   void *error_code = ERR_PTR(-EINVAL);
 
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
@@ -172,8 +173,7 @@ static struct mempolicy *mpol_new(enum m
switch (mode) {
case MPOL_INTERLEAVE:
if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
-   kmem_cache_free(policy_cache, policy);
-   return ERR_PTR(-EINVAL);
+   goto free_mpol;
}
policy->v.nodes = cpuset_context_nmask;
break;
@@ -184,14 +184,12 @@ static struct mempolicy *mpol_new(enum m
break;
case MPOL_BIND:
if (nodes_empty(*nodes)) {
-   kmem_cache_free(policy_cache, policy);
-   return ERR_PTR(-EINVAL);
+   goto free_mpol;
}
policy->v.zonelist = bind_zonelist(_context_nmask);
if (IS_ERR(policy->v.zonelist)) {
-   void *error_code = policy->v.zonelist;
-   kmem_cache_free(policy_cache, policy);
-

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Lee Schermerhorn
On Mon, 2008-02-11 at 07:30 -0800, David Rientjes wrote: 
> Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses
> the node remap when the policy is rebound.
> 
> Adds another member to struct mempolicy,
> 
>   nodemask_t user_nodemask
> 
> that stores the the nodemask that the user passed when he or she created
> the mempolicy via set_mempolicy() or mbind().  When using
> MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
> passed nodemask is always used when determining the preferred node,
> building the MPOL_BIND zonelist, or creating the interleave nodemask.
> This happens whenever the policy is rebound, including when a task's
> cpuset assignment changes or the cpuset's mems are changed.

When you say that the user's nodemask is "always used" you mean "after
cpuset contextualization", right?  I.e., after masking with mems_allowed
[which also restricts to nodes with memory].  That is what the code
still does.  

> 
> This creates an interesting side-effect in that it allows the mempolicy
> "intent" to lie dormant and uneffected until it has access to the node(s)
> that it desires.  For example, if you currently ask for an interleaved
> policy over a set of nodes that you do not have access to, the mempolicy
> is not created and the task continues to use the equivalent of
> MPOL_DEFAULT.  

You get an error [EINVAL], right?  The target policy [task or vma]
remains unchanged.  That may or may not be MPOL_DEFAULT, depending on
how the task was started [via numactl()] or the success of prior
set_mempolicy()/mbind() calls.

> With this change, however, it is possible to create the
> same mempolicy; it is effected only when access to the nodes is acquired.
> 
> It is also possible to mount tmpfs with the static nodemask behavior when
> specifying a node or nodemask.  To do this, simply add "=static"
> immediately following the mempolicy mode at mount time:
> 
>   mount -o remount mpol=interleave=static:1-3
> 
> Also removes mpol_check_policy() and folds some of its logic into
> mpol_new() since it is now mostly obsoleted.

Couple of comments here:

1) we've discussed the issue of returning EINVAL for non-empty nodemasks
with MPOL_DEFAULT.  By removing this restriction, we run the risk of
breaking applications if we should ever want to define a semantic to
non-empty node mask for MPOL_DEFAULT.   I think this is probably
unlikely, but consider the new mode flags part of the mode/policy
parameter:  by not rejecting undefined flags, we may break applications
by later defining additional flags.  I'd argue for fairly strict
argument checking.

2) Before this patch, policy_nodemask from the shmem_sb_info was NOT
masked with allowed nodes because we passed this mask directly to
mpol_new() without "contextualization".  We DO guarantee that this
policy nodemask contains only nodes with memory [see
shmem_parse_mpol()].  Now that you've moved the contextualization to
mpol_new(), we are masking this policy with the cpuset mems allowed.
This is a change in behavior.  Do we want this?  I.e., are we preserving
the "intent" of the system administrator in setting the tmpfs policy?
Maybe they wanted to shared interleaved shmem areas between cpusets with
disjoint mems allowed.  Nothing prevents this...

If we just want to preserve existing behavior, we can define an
additional mode flag that we set in the sbinfo policy in
shmem_parse_mpol() and test in mpol_new().  If we want to be able to
specify existing or new behavior, we can use the same flag, but set it
or not based on an additional qualifier specified via the mount option.

[more below]

> 
> Cc: Paul Jackson <[EMAIL PROTECTED]>
> Cc: Christoph Lameter <[EMAIL PROTECTED]>
> Cc: Lee Schermerhorn <[EMAIL PROTECTED]>
> Cc: Andi Kleen <[EMAIL PROTECTED]>
> Signed-off-by: David Rientjes <[EMAIL PROTECTED]>
> ---
>  include/linux/mempolicy.h |4 +-
>  mm/mempolicy.c|  137 
> ++---
>  mm/shmem.c|2 +
>  3 files changed, 60 insertions(+), 83 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -26,7 +26,8 @@ enum mempolicy_mode {
>  #define MPOL_MODE_MASK   ((1 << MPOL_FLAG_SHIFT) - 1)
>  
>  /* Flags for set_mempolicy */
> -#define MPOL_MODE_FLAGS  (0) /* legal set_mempolicy() MPOL_* mode 
> flags */
> +#define MPOL_F_STATIC_NODES  (1 << MPOL_FLAG_SHIFT)
> +#define MPOL_MODE_FLAGS  (MPOL_F_STATIC_NODES)   /* legal 
> set_mempolicy() MPOL_* mode flags */
>  
>  /* Flags for get_mempolicy */
>  #define MPOL_F_NODE  (1<<0)  /* return next IL mode instead of node mask */
> @@ -82,6 +83,7 @@ struct mempolicy {
>   /* undefined for default */
>   } v;
>   nodemask_t cpuset_mems_allowed; /* mempolicy relative to these nodes */
> + nodemask_t user_nodemask;   /* nodemask passed by user */
>  };
>  
>  /*
> diff 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Lee Schermerhorn
On Mon, 2008-02-11 at 11:56 -0800, David Rientjes wrote: 
 On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:
 
   @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum 
   mempolicy_mode mode,
   return ERR_PTR(-ENOMEM);
   flags = MPOL_MODE_FLAGS;
   atomic_set(policy-refcnt, 1);
   +   cpuset_update_task_memory_state();
   +   nodes_and(cpuset_context_nmask, *nodes, 
   cpuset_current_mems_allowed);
   switch (mode) {
   case MPOL_INTERLEAVE:
   -   policy-v.nodes = *nodes;
   +   if (nodes_empty(*nodes))
   +   return ERR_PTR(-EINVAL);
  
  need kmem_cache_free(policy_cache, policy) before return?
  
 
 Very good catch!
 
 
 
 mempolicy: fix policy memory leak in mpol_new()
 
 If mpol_new() cannot setup a new mempolicy because of an invalid argument
 provided by the user, avoid leaking the mempolicy that has been dynamically
 allocated.
 
 Reported by KOSAKI Motohiro.
 
 Cc: Paul Jackson [EMAIL PROTECTED]
 Cc: Christoph Lameter [EMAIL PROTECTED]
 Cc: Lee Schermerhorn [EMAIL PROTECTED]
 Cc: Andi Kleen [EMAIL PROTECTED]
 Cc: KOSAKI Motohiro [EMAIL PROTECTED]
 Signed-off-by: David Rientjes [EMAIL PROTECTED]
 ---
  mm/mempolicy.c |   10 +-
  1 files changed, 5 insertions(+), 5 deletions(-)
 
 diff --git a/mm/mempolicy.c b/mm/mempolicy.c
 --- a/mm/mempolicy.c
 +++ b/mm/mempolicy.c
 @@ -171,13 +171,11 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
 mode,
   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
   switch (mode) {
   case MPOL_INTERLEAVE:
 - if (nodes_empty(*nodes))
 - return ERR_PTR(-EINVAL);
 - policy-v.nodes = cpuset_context_nmask;
 - if (nodes_weight(policy-v.nodes) == 0) {
 + if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
   kmem_cache_free(policy_cache, policy);
   return ERR_PTR(-EINVAL);
   }
 + policy-v.nodes = cpuset_context_nmask;
   break;
   case MPOL_PREFERRED:
   policy-v.preferred_node = first_node(cpuset_context_nmask);
 @@ -185,8 +183,10 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
 mode,
   policy-v.preferred_node = -1;
   break;
   case MPOL_BIND:
 - if (nodes_empty(*nodes))
 + if (nodes_empty(*nodes)) {
 + kmem_cache_free(policy_cache, policy);
   return ERR_PTR(-EINVAL);
 + }
   policy-v.zonelist = bind_zonelist(cpuset_context_nmask);
   if (IS_ERR(policy-v.zonelist)) {
   void *error_code = policy-v.zonelist;

With this patch, we now have 3 error paths from mpol_new that need to
free the mempolicy struct.  How about consolidating them with something
like this [uncompiled/untested]:

PATCH mempolicy - consolidate mpol_new() error paths

Use common error path in mpol_new() for errors that we discover
after allocation the new mempolicy structure.  Free the mempolicy
in the common error path.

Signed-off-by: Lee Schermerhorn [EMAIL PROTECTED]

 mm/mempolicy.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: Linux/mm/mempolicy.c
===
--- Linux.orig/mm/mempolicy.c   2008-02-12 15:18:12.0 -0700
+++ Linux/mm/mempolicy.c2008-02-12 15:22:07.0 -0700
@@ -156,6 +156,7 @@ static struct mempolicy *mpol_new(enum m
 {
struct mempolicy *policy;
nodemask_t cpuset_context_nmask;
+   void *error_code = ERR_PTR(-EINVAL);
 
pr_debug(setting mode %d flags %d nodes[0] %lx\n,
 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
@@ -172,8 +173,7 @@ static struct mempolicy *mpol_new(enum m
switch (mode) {
case MPOL_INTERLEAVE:
if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
-   kmem_cache_free(policy_cache, policy);
-   return ERR_PTR(-EINVAL);
+   goto free_mpol;
}
policy-v.nodes = cpuset_context_nmask;
break;
@@ -184,14 +184,12 @@ static struct mempolicy *mpol_new(enum m
break;
case MPOL_BIND:
if (nodes_empty(*nodes)) {
-   kmem_cache_free(policy_cache, policy);
-   return ERR_PTR(-EINVAL);
+   goto free_mpol;
}
policy-v.zonelist = bind_zonelist(cpuset_context_nmask);
if (IS_ERR(policy-v.zonelist)) {
-   void *error_code = policy-v.zonelist;
-   kmem_cache_free(policy_cache, policy);
-   return error_code;
+   error_code = policy-v.zonelist;
+   goto 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Lee Schermerhorn
On Mon, 2008-02-11 at 07:30 -0800, David Rientjes wrote: 
 Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses
 the node remap when the policy is rebound.
 
 Adds another member to struct mempolicy,
 
   nodemask_t user_nodemask
 
 that stores the the nodemask that the user passed when he or she created
 the mempolicy via set_mempolicy() or mbind().  When using
 MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
 passed nodemask is always used when determining the preferred node,
 building the MPOL_BIND zonelist, or creating the interleave nodemask.
 This happens whenever the policy is rebound, including when a task's
 cpuset assignment changes or the cpuset's mems are changed.

When you say that the user's nodemask is always used you mean after
cpuset contextualization, right?  I.e., after masking with mems_allowed
[which also restricts to nodes with memory].  That is what the code
still does.  

 
 This creates an interesting side-effect in that it allows the mempolicy
 intent to lie dormant and uneffected until it has access to the node(s)
 that it desires.  For example, if you currently ask for an interleaved
 policy over a set of nodes that you do not have access to, the mempolicy
 is not created and the task continues to use the equivalent of
 MPOL_DEFAULT.  

You get an error [EINVAL], right?  The target policy [task or vma]
remains unchanged.  That may or may not be MPOL_DEFAULT, depending on
how the task was started [via numactl()] or the success of prior
set_mempolicy()/mbind() calls.

 With this change, however, it is possible to create the
 same mempolicy; it is effected only when access to the nodes is acquired.
 
 It is also possible to mount tmpfs with the static nodemask behavior when
 specifying a node or nodemask.  To do this, simply add =static
 immediately following the mempolicy mode at mount time:
 
   mount -o remount mpol=interleave=static:1-3
 
 Also removes mpol_check_policy() and folds some of its logic into
 mpol_new() since it is now mostly obsoleted.

Couple of comments here:

1) we've discussed the issue of returning EINVAL for non-empty nodemasks
with MPOL_DEFAULT.  By removing this restriction, we run the risk of
breaking applications if we should ever want to define a semantic to
non-empty node mask for MPOL_DEFAULT.   I think this is probably
unlikely, but consider the new mode flags part of the mode/policy
parameter:  by not rejecting undefined flags, we may break applications
by later defining additional flags.  I'd argue for fairly strict
argument checking.

2) Before this patch, policy_nodemask from the shmem_sb_info was NOT
masked with allowed nodes because we passed this mask directly to
mpol_new() without contextualization.  We DO guarantee that this
policy nodemask contains only nodes with memory [see
shmem_parse_mpol()].  Now that you've moved the contextualization to
mpol_new(), we are masking this policy with the cpuset mems allowed.
This is a change in behavior.  Do we want this?  I.e., are we preserving
the intent of the system administrator in setting the tmpfs policy?
Maybe they wanted to shared interleaved shmem areas between cpusets with
disjoint mems allowed.  Nothing prevents this...

If we just want to preserve existing behavior, we can define an
additional mode flag that we set in the sbinfo policy in
shmem_parse_mpol() and test in mpol_new().  If we want to be able to
specify existing or new behavior, we can use the same flag, but set it
or not based on an additional qualifier specified via the mount option.

[more below]

 
 Cc: Paul Jackson [EMAIL PROTECTED]
 Cc: Christoph Lameter [EMAIL PROTECTED]
 Cc: Lee Schermerhorn [EMAIL PROTECTED]
 Cc: Andi Kleen [EMAIL PROTECTED]
 Signed-off-by: David Rientjes [EMAIL PROTECTED]
 ---
  include/linux/mempolicy.h |4 +-
  mm/mempolicy.c|  137 
 ++---
  mm/shmem.c|2 +
  3 files changed, 60 insertions(+), 83 deletions(-)
 
 diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
 --- a/include/linux/mempolicy.h
 +++ b/include/linux/mempolicy.h
 @@ -26,7 +26,8 @@ enum mempolicy_mode {
  #define MPOL_MODE_MASK   ((1  MPOL_FLAG_SHIFT) - 1)
  
  /* Flags for set_mempolicy */
 -#define MPOL_MODE_FLAGS  (0) /* legal set_mempolicy() MPOL_* mode 
 flags */
 +#define MPOL_F_STATIC_NODES  (1  MPOL_FLAG_SHIFT)
 +#define MPOL_MODE_FLAGS  (MPOL_F_STATIC_NODES)   /* legal 
 set_mempolicy() MPOL_* mode flags */
  
  /* Flags for get_mempolicy */
  #define MPOL_F_NODE  (10)  /* return next IL mode instead of node mask */
 @@ -82,6 +83,7 @@ struct mempolicy {
   /* undefined for default */
   } v;
   nodemask_t cpuset_mems_allowed; /* mempolicy relative to these nodes */
 + nodemask_t user_nodemask;   /* nodemask passed by user */
  };
  
  /*
 diff --git a/mm/mempolicy.c b/mm/mempolicy.c
 --- a/mm/mempolicy.c
 +++ b/mm/mempolicy.c
 @@ -113,58 

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Paul Jackson wrote:

  I've redone my patchset based on the feedback that I've received
 
 Will you be sending that along soon?  I was just getting into my
 review of this patchset, and I suppose it would be better to
 review the latest and greatest.
 

Lee had some questions and comments that I've recently addressed so I 
wanted to give him a chance to respond before sending it out again in case 
more changes need to be made.

I'll email the current set to you now.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Lee Schermerhorn wrote:

  Adds another member to struct mempolicy,
  
  nodemask_t user_nodemask
  
  that stores the the nodemask that the user passed when he or she created
  the mempolicy via set_mempolicy() or mbind().  When using
  MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
  passed nodemask is always used when determining the preferred node,
  building the MPOL_BIND zonelist, or creating the interleave nodemask.
  This happens whenever the policy is rebound, including when a task's
  cpuset assignment changes or the cpuset's mems are changed.
 
 When you say that the user's nodemask is always used you mean after
 cpuset contextualization, right?  I.e., after masking with mems_allowed
 [which also restricts to nodes with memory].  That is what the code
 still does.  
 

Yeah, I'm not introducing a loophole so that tasks can access memory to 
which they don't have access.  I've clarified that for the next version.

  It is also possible to mount tmpfs with the static nodemask behavior when
  specifying a node or nodemask.  To do this, simply add =static
  immediately following the mempolicy mode at mount time:
  
  mount -o remount mpol=interleave=static:1-3
  
  Also removes mpol_check_policy() and folds some of its logic into
  mpol_new() since it is now mostly obsoleted.
 
 Couple of comments here:
 
 1) we've discussed the issue of returning EINVAL for non-empty nodemasks
 with MPOL_DEFAULT.  By removing this restriction, we run the risk of
 breaking applications if we should ever want to define a semantic to
 non-empty node mask for MPOL_DEFAULT.   I think this is probably
 unlikely, but consider the new mode flags part of the mode/policy
 parameter:  by not rejecting undefined flags, we may break applications
 by later defining additional flags.  I'd argue for fairly strict
 argument checking.
 

As I've mentioned in a parallel thread, I've folded the entire logic of 
mpol_check_policy() as it stands this minute in Linus' tree into 
mpol_new().

 2) Before this patch, policy_nodemask from the shmem_sb_info was NOT
 masked with allowed nodes because we passed this mask directly to
 mpol_new() without contextualization.  We DO guarantee that this
 policy nodemask contains only nodes with memory [see
 shmem_parse_mpol()].  Now that you've moved the contextualization to
 mpol_new(), we are masking this policy with the cpuset mems allowed.
 This is a change in behavior.  Do we want this?  I.e., are we preserving
 the intent of the system administrator in setting the tmpfs policy?
 Maybe they wanted to shared interleaved shmem areas between cpusets with
 disjoint mems allowed.  Nothing prevents this...
 

We're still saving the intent in the new policy-user_nodemask field so 
any future rebinds will still allow unaccessible nodes be effected by the 
mempolicy if the permissions are changed later.

 If we just want to preserve existing behavior, we can define an
 additional mode flag that we set in the sbinfo policy in
 shmem_parse_mpol() and test in mpol_new().  If we want to be able to
 specify existing or new behavior, we can use the same flag, but set it
 or not based on an additional qualifier specified via the mount option.
 

Shmem areas between cpusets with disjoint mems_allowed seems like an error 
in userspace to me and I would prefer that mpol_new() reject it outright.

  @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
  mode,
  return ERR_PTR(-ENOMEM);
  flags = MPOL_MODE_FLAGS;
  atomic_set(policy-refcnt, 1);
  +   cpuset_update_task_memory_state();
  +   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
  switch (mode) {
  case MPOL_INTERLEAVE:
  -   policy-v.nodes = *nodes;
  +   if (nodes_empty(*nodes))
  +   return ERR_PTR(-EINVAL);
  +   policy-v.nodes = cpuset_context_nmask;
  if (nodes_weight(policy-v.nodes) == 0) {
  kmem_cache_free(policy_cache, policy);
  return ERR_PTR(-EINVAL);
  }
  break;
  case MPOL_PREFERRED:
  -   policy-v.preferred_node = first_node(*nodes);
  +   policy-v.preferred_node = first_node(cpuset_context_nmask);
  if (policy-v.preferred_node = MAX_NUMNODES)
  policy-v.preferred_node = -1;
  break;
  case MPOL_BIND:
  -   policy-v.zonelist = bind_zonelist(nodes);
  +   if (nodes_empty(*nodes))
 
 Kosaki-san already pointed out the need to free the policy struct
 here.  
 

I folded your fix to have only one

kmem_cache_free(policy_cache, policy); 
return ERR_PTR(error_code);

point in mpol_new().

  @@ -1780,51 +1737,67 @@ static void mpol_rebind_policy(struct mempolicy 
  *pol,
  if (nodes_equal(*mpolmask, *newmask))
  return;
   
  +   remap = !(mpol_flags(pol-policy)  MPOL_F_STATIC_NODES);
  

Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, David Rientjes wrote:

 Since we're allowed to remap the node to a different node than the user 
 specified with either syscall, the current behavior is that one node is 
 as good as another.  In other words, we're trying to accomodate the mode 
 first by setting pol-v.preferred_node to some accessible node and only 
 setting that to the user-supplied node if it is available.
 
 If the node isn't available and the user specifically asked that it is not 
 remapped (with MPOL_F_STATIC_NODES), then I felt local allocation was best 
 compared to remapping to a node that may be unrelated to the VMA or task.  
 This preserves a sense of the current behavior that one node is as good 
 as another, but the user specifically asked for no remap.
 

Upon further inspection, perhaps it's better to fallback to local 
allocation when the preferred node is not available on rebind and then, if 
it becomes available later, prefer it again.

In mpol_rebind_policy():

case MPOL_PREFERRED:
if (!remap) {
int nid = first_node(pol-user_nodemask);

if (node_isset(nid, *newmask))
pol-v.preferred_node = nid;
else {
/*
 * Fallback to local allocation since that
 * is the behavior in mpol_new().  The
 * node may eventually become available.
 */
pol-v.preferred_node = -1;
}
} else
pol-v.preferred_node = 
node_remap(pol-v.preferred_node,
*mpolmask, *newmask);
break;

What do you think, Lee?

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Paul Jackson
David wrote:
 I've redone my patchset based on the feedback that I've received

Will you be sending that along soon?  I was just getting into my
review of this patchset, and I suppose it would be better to
review the latest and greatest.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Paul Jackson wrote:

  1) we've discussed the issue of returning EINVAL for non-empty nodemasks
  with MPOL_DEFAULT.  By removing this restriction, we run the risk of
  breaking applications if we should ever want to define a semantic to
  non-empty node mask for MPOL_DEFAULT. 
 
 The bigger risk, in my view, is breaking some piece of existing user code.
 Properly written user code wouldn't break, but that doesn't mean much.
 Changes, even minor corner case changes, often break something, so should
 not be done with out cause.  Whether or not code cleanup in mempolicy.c is
 sufficient cause here is not clear to me.
 
 Future room for growth doesn't mean so much for me here; if we close one
 future alternative, we always have others, such as more mode flag bits.
 

I've redone my patchset based on the feedback that I've received, and in 
my latest revisions I folded the entire equivalent of mpol_check_policy() 
into mpol_new().

Lee, you feel strongly that non-empty nodemasks passed with MPOL_DEFAULT 
should be considered invalid and rejected by the kernel, as the current 
implementation does.  I've brought up a counter-argument based on the 
set_mempolicy() man page and the Linux documentation that don't specify 
anything about what the nodemask shall contain if it's not a NULL pointer.

My position was to give the user the benefit of the doubt.  Because Linux 
has been vague in specifying what the nodemask shall contain, that (to me) 
means that it can contain anything.  It's undefined, in a standards sense.  
The only thing that we should look for is whether the user passed 
MPOL_DEFAULT as the first parameter and then we should effect that policy 
because it's clearly the intention.

I don't think there's a super strong case for either behavior, and that's 
why I just folded the mpol_check_policy() logic straight into mpol_new().  

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread David Rientjes
On Tue, 12 Feb 2008, Lee Schermerhorn wrote:

 PATCH mempolicy - consolidate mpol_new() error paths
 
 Use common error path in mpol_new() for errors that we discover
 after allocation the new mempolicy structure.  Free the mempolicy
 in the common error path.
 
 Signed-off-by: Lee Schermerhorn [EMAIL PROTECTED]
 
  mm/mempolicy.c |   16 +---
  1 file changed, 9 insertions(+), 7 deletions(-)
 
 Index: Linux/mm/mempolicy.c
 ===
 --- Linux.orig/mm/mempolicy.c 2008-02-12 15:18:12.0 -0700
 +++ Linux/mm/mempolicy.c  2008-02-12 15:22:07.0 -0700
 @@ -156,6 +156,7 @@ static struct mempolicy *mpol_new(enum m
  {
   struct mempolicy *policy;
   nodemask_t cpuset_context_nmask;
 + void *error_code = ERR_PTR(-EINVAL);
  
   pr_debug(setting mode %d flags %d nodes[0] %lx\n,
mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 @@ -172,8 +173,7 @@ static struct mempolicy *mpol_new(enum m
   switch (mode) {
   case MPOL_INTERLEAVE:
   if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
 - kmem_cache_free(policy_cache, policy);
 - return ERR_PTR(-EINVAL);
 + goto free_mpol;
   }
   policy-v.nodes = cpuset_context_nmask;
   break;

You can also get rid of the parentheses to make checkpatch.pl happy, too.

 @@ -184,14 +184,12 @@ static struct mempolicy *mpol_new(enum m
   break;
   case MPOL_BIND:
   if (nodes_empty(*nodes)) {
 - kmem_cache_free(policy_cache, policy);
 - return ERR_PTR(-EINVAL);
 + goto free_mpol;
   }
   policy-v.zonelist = bind_zonelist(cpuset_context_nmask);
   if (IS_ERR(policy-v.zonelist)) {
 - void *error_code = policy-v.zonelist;
 - kmem_cache_free(policy_cache, policy);
 - return error_code;
 + error_code = policy-v.zonelist;
 + goto free_mpol;
   }
   break;
   default:
 @@ -201,6 +199,10 @@ static struct mempolicy *mpol_new(enum m
   policy-cpuset_mems_allowed = cpuset_mems_allowed(current);
   policy-user_nodemask = *nodes;
   return policy;
 +
 +free_mpol:
 + kmem_cache_free(policy_cache, policy);
 + return error_code;
  }
  
  static void gather_stats(struct page *, void *, int pte_dirty);
 

I'll fold this into my patchset when I repost it, thanks.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-12 Thread Paul Jackson
Lee wrote:
 1) we've discussed the issue of returning EINVAL for non-empty nodemasks
 with MPOL_DEFAULT.  By removing this restriction, we run the risk of
 breaking applications if we should ever want to define a semantic to
 non-empty node mask for MPOL_DEFAULT. 

The bigger risk, in my view, is breaking some piece of existing user code.
Properly written user code wouldn't break, but that doesn't mean much.
Changes, even minor corner case changes, often break something, so should
not be done with out cause.  Whether or not code cleanup in mempolicy.c is
sufficient cause here is not clear to me.

Future room for growth doesn't mean so much for me here; if we close one
future alternative, we always have others, such as more mode flag bits.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread David Rientjes
On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

> > @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
> > mode,
> > return ERR_PTR(-ENOMEM);
> > flags &= MPOL_MODE_FLAGS;
> > atomic_set(>refcnt, 1);
> > +   cpuset_update_task_memory_state();
> > +   nodes_and(cpuset_context_nmask, *nodes, 
> > cpuset_current_mems_allowed);
> > switch (mode) {
> > case MPOL_INTERLEAVE:
> > -   policy->v.nodes = *nodes;
> > +   if (nodes_empty(*nodes))
> > +   return ERR_PTR(-EINVAL);
> 
> need kmem_cache_free(policy_cache, policy) before return?
> 

Very good catch!



mempolicy: fix policy memory leak in mpol_new()

If mpol_new() cannot setup a new mempolicy because of an invalid argument
provided by the user, avoid leaking the mempolicy that has been dynamically
allocated.

Reported by KOSAKI Motohiro.

Cc: Paul Jackson <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Cc: Lee Schermerhorn <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: KOSAKI Motohiro <[EMAIL PROTECTED]>
Signed-off-by: David Rientjes <[EMAIL PROTECTED]>
---
 mm/mempolicy.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -171,13 +171,11 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
mode,
nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
switch (mode) {
case MPOL_INTERLEAVE:
-   if (nodes_empty(*nodes))
-   return ERR_PTR(-EINVAL);
-   policy->v.nodes = cpuset_context_nmask;
-   if (nodes_weight(policy->v.nodes) == 0) {
+   if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
}
+   policy->v.nodes = cpuset_context_nmask;
break;
case MPOL_PREFERRED:
policy->v.preferred_node = first_node(cpuset_context_nmask);
@@ -185,8 +183,10 @@ static struct mempolicy *mpol_new(enum mempolicy_mode mode,
policy->v.preferred_node = -1;
break;
case MPOL_BIND:
-   if (nodes_empty(*nodes))
+   if (nodes_empty(*nodes)) {
+   kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
+   }
policy->v.zonelist = bind_zonelist(_context_nmask);
if (IS_ERR(policy->v.zonelist)) {
void *error_code = policy->v.zonelist;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread Christoph Lameter
Looks conceptually good (I have not looked at the details).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread KOSAKI Motohiro
Hi David,

In general, I like this feature.
and I found no bug in patch [1/4] and [2/4].

> @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
> mode,
> return ERR_PTR(-ENOMEM);
> flags &= MPOL_MODE_FLAGS;
> atomic_set(>refcnt, 1);
> +   cpuset_update_task_memory_state();
> +   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
> switch (mode) {
> case MPOL_INTERLEAVE:
> -   policy->v.nodes = *nodes;
> +   if (nodes_empty(*nodes))
> +   return ERR_PTR(-EINVAL);

need kmem_cache_free(policy_cache, policy) before return?

> +   policy->v.nodes = cpuset_context_nmask;
> if (nodes_weight(policy->v.nodes) == 0) {
> kmem_cache_free(policy_cache, policy);
> return ERR_PTR(-EINVAL);
> }
> break;
> case MPOL_PREFERRED:
> -   policy->v.preferred_node = first_node(*nodes);
> +   policy->v.preferred_node = first_node(cpuset_context_nmask);
> if (policy->v.preferred_node >= MAX_NUMNODES)
> policy->v.preferred_node = -1;
> break;
> case MPOL_BIND:
> -   policy->v.zonelist = bind_zonelist(nodes);
> +   if (nodes_empty(*nodes))
> +   return ERR_PTR(-EINVAL);

ditto

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread David Rientjes
On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

  @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
  mode,
  return ERR_PTR(-ENOMEM);
  flags = MPOL_MODE_FLAGS;
  atomic_set(policy-refcnt, 1);
  +   cpuset_update_task_memory_state();
  +   nodes_and(cpuset_context_nmask, *nodes, 
  cpuset_current_mems_allowed);
  switch (mode) {
  case MPOL_INTERLEAVE:
  -   policy-v.nodes = *nodes;
  +   if (nodes_empty(*nodes))
  +   return ERR_PTR(-EINVAL);
 
 need kmem_cache_free(policy_cache, policy) before return?
 

Very good catch!



mempolicy: fix policy memory leak in mpol_new()

If mpol_new() cannot setup a new mempolicy because of an invalid argument
provided by the user, avoid leaking the mempolicy that has been dynamically
allocated.

Reported by KOSAKI Motohiro.

Cc: Paul Jackson [EMAIL PROTECTED]
Cc: Christoph Lameter [EMAIL PROTECTED]
Cc: Lee Schermerhorn [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]
Cc: KOSAKI Motohiro [EMAIL PROTECTED]
Signed-off-by: David Rientjes [EMAIL PROTECTED]
---
 mm/mempolicy.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -171,13 +171,11 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
mode,
nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
switch (mode) {
case MPOL_INTERLEAVE:
-   if (nodes_empty(*nodes))
-   return ERR_PTR(-EINVAL);
-   policy-v.nodes = cpuset_context_nmask;
-   if (nodes_weight(policy-v.nodes) == 0) {
+   if (nodes_empty(*nodes) || nodes_empty(cpuset_context_nmask)) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
}
+   policy-v.nodes = cpuset_context_nmask;
break;
case MPOL_PREFERRED:
policy-v.preferred_node = first_node(cpuset_context_nmask);
@@ -185,8 +183,10 @@ static struct mempolicy *mpol_new(enum mempolicy_mode mode,
policy-v.preferred_node = -1;
break;
case MPOL_BIND:
-   if (nodes_empty(*nodes))
+   if (nodes_empty(*nodes)) {
+   kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
+   }
policy-v.zonelist = bind_zonelist(cpuset_context_nmask);
if (IS_ERR(policy-v.zonelist)) {
void *error_code = policy-v.zonelist;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread Christoph Lameter
Looks conceptually good (I have not looked at the details).

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread KOSAKI Motohiro
Hi David,

In general, I like this feature.
and I found no bug in patch [1/4] and [2/4].

 @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
 mode,
 return ERR_PTR(-ENOMEM);
 flags = MPOL_MODE_FLAGS;
 atomic_set(policy-refcnt, 1);
 +   cpuset_update_task_memory_state();
 +   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
 switch (mode) {
 case MPOL_INTERLEAVE:
 -   policy-v.nodes = *nodes;
 +   if (nodes_empty(*nodes))
 +   return ERR_PTR(-EINVAL);

need kmem_cache_free(policy_cache, policy) before return?

 +   policy-v.nodes = cpuset_context_nmask;
 if (nodes_weight(policy-v.nodes) == 0) {
 kmem_cache_free(policy_cache, policy);
 return ERR_PTR(-EINVAL);
 }
 break;
 case MPOL_PREFERRED:
 -   policy-v.preferred_node = first_node(*nodes);
 +   policy-v.preferred_node = first_node(cpuset_context_nmask);
 if (policy-v.preferred_node = MAX_NUMNODES)
 policy-v.preferred_node = -1;
 break;
 case MPOL_BIND:
 -   policy-v.zonelist = bind_zonelist(nodes);
 +   if (nodes_empty(*nodes))
 +   return ERR_PTR(-EINVAL);

ditto

Thanks!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/