Re: [PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-25 Thread Nathan Zimmer
On Tue, Jul 24, 2012 at 09:38:21PM -0700, Hugh Dickins wrote:
> 
> I'm glad Andrew took out the stable Cc: 
Actually I did that.  I have a habit of thinking about performance issues as
bugs and that is not always the case.

> Please, what's wrong with the patch below, to replace the current
> two or three?  I don't have real NUMA myself: does it work?
Yes it works and spreads quite nicely. 

> Nathan, I've presumptuously put in your signoff, because
> you generally seemed happy to incorporate suggestions made.
I am always grateful for suggestions, advice, and help.

Nate

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-25 Thread KOSAKI Motohiro
> Please, what's wrong with the patch below, to replace the current
> two or three?  I don't have real NUMA myself: does it work?
> If it doesn't work, can you see why not?

It works. It doesn't match my preference, but I don't want to block your way.
This area is maintained by you, so please go ahead.

At least, inode bias is better than random.




Re: [PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-24 Thread Hugh Dickins
Nathan, Kosaki-san,

I have, at long last, reached the point of looking at this patchset.
And I'm puzzled as to why it has grown more complicated than what you
first sent out.

I've read through the various threads, and some of the changes I like.

I'm glad Andrew took out the stable Cc: obviously the interleave policy
was never intended for a filesystem of many small files, and it could
be that some usages with larger files have actually optimized to the
current node layout, and will regress with this change.  Let's keep it
simple and assume not; but if there are complaints, then we shall have
to make the new behaviour dependent on a mount option.

And I'm glad you switched from random number to rotor: I'm probably
missing the mark by orders of magnitude, but I always think of random
numbers as a precious resource, and was unsure if this deserved them.

But other changes just seem unnecessary to me.  And I don't see how
we can accuse you of being hackish, so long as we have that horrid
business of pseudo-vma on the shmem stack.  I believe the mempolicy
work was designed around vmas, then at the last moment had shmem
grafted on, and the quick way to shoehorn it in was the pseudo-vma.
It's just a way of massaging the info into a format that mempolicy.c
expects, and the arguments about addresses and offsets mystified me.

I did set out to replace the pseudo-vma by adding an alloc_page_mpol()
three years ago; but, no surprise, I got stuck when it came to
understanding the mpol reference counting, and had to move away.
Maybe we can revisit that once Kosaki-san has the refcounting fixed.

Please, what's wrong with the patch below, to replace the current
two or three?  I don't have real NUMA myself: does it work?
If it doesn't work, can you see why not?

Nathan, I've presumptuously put in your signoff, because
you generally seemed happy to incorporate suggestions made.
Kosaki-san, I'm sorry if this version annoys you, but I've not
seen an actual explanation as to why anything more is needed.

Hugh

From: Nathan Zimmer 
Subject: tmpfs: distribute interleave better across nodes

When tmpfs has the interleave memory policy, it always starts allocating
for each file from node 0 at offset 0.  When there are many small files,
the lower nodes fill up disproportionately.

This patch spreads out node usage by starting files at nodes other than
0, by using the inode number to bias the starting node for interleave.

Signed-off-by: Nathan Zimmer 
Signed-off-by: Hugh Dickins 
Cc: Christoph Lameter 
Cc: Nick Piggin 
Cc: Lee Schermerhorn 
Cc: KOSAKI Motohiro 
Cc: Rik van Riel 
Cc: Andi Kleen 
Cc: Andrew Morton 
---

 mm/shmem.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- v3.5/mm/shmem.c 2012-07-21 13:58:29.0 -0700
+++ linux/mm/shmem.c 2012-07-24 20:13:58.468797969 -0700
@@ -929,7 +929,8 @@ static struct page *shmem_swapin(swp_ent
 
    /* Create a pseudo vma that just contains the policy */
    pvma.vm_start = 0;
-   pvma.vm_pgoff = index;
+   /* Bias interleave by inode number to distribute better across nodes */
+   pvma.vm_pgoff = index + info->vfs_inode.i_ino;
    pvma.vm_ops = NULL;
    pvma.vm_policy = spol;
    return swapin_readahead(swap, gfp, &pvma, 0);
@@ -942,7 +943,8 @@ static struct page *shmem_alloc_page(gfp
 
    /* Create a pseudo vma that just contains the policy */
    pvma.vm_start = 0;
-   pvma.vm_pgoff = index;
+   /* Bias interleave by inode number to distribute better across nodes */
+   pvma.vm_pgoff = index + info->vfs_inode.i_ino;
    pvma.vm_ops = NULL;
    pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
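
[Editorial illustration, not part of the original mail.] The effect
described in the commit message can be sketched in user space. This is a
simplified model, not the kernel's code: it assumes every node from 0 to
nr_nodes-1 is in the interleave mask, so the node chosen for a page is
simply its offset modulo the node count. The names il_node, node_unbiased,
node_biased and count_first_pages are hypothetical.

```c
#include <assert.h>

/* Simplified model: all nodes 0..nr_nodes-1 are in the interleave mask. */
static unsigned int il_node(unsigned long pgoff, unsigned int nr_nodes)
{
    assert(nr_nodes > 0);
    return (unsigned int)(pgoff % nr_nodes);
}

/* Before the patch: the offset is the page index within the file. */
static unsigned int node_unbiased(unsigned long index, unsigned int nr_nodes)
{
    return il_node(index, nr_nodes);
}

/* After the patch: the inode number is added to the page index. */
static unsigned int node_biased(unsigned long index, unsigned long ino,
                                unsigned int nr_nodes)
{
    return il_node(index + ino, nr_nodes);
}

/*
 * Tally which node receives page 0 of each of nr_files files whose inode
 * numbers run consecutively from first_ino. counts must have nr_nodes
 * entries, zeroed by the caller.
 */
static void count_first_pages(unsigned long first_ino, unsigned long nr_files,
                              unsigned int nr_nodes, int biased,
                              unsigned long *counts)
{
    unsigned long i;

    for (i = 0; i < nr_files; i++) {
        unsigned long ino = first_ino + i;
        unsigned int node = biased ? node_biased(0, ino, nr_nodes)
                                   : node_unbiased(0, nr_nodes);
        counts[node]++;
    }
}
```

In this model, eight single-page files on four nodes all land on node 0
without the bias, and spread two per node with it.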
 


Re: [PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-23 Thread Nathan Zimmer

Yes I had failed to notice that.
I'll send a fix shortly.


On 07/23/2012 05:58 AM, Dan Carpenter wrote:
> On Mon, Jul 09, 2012 at 09:46:39AM -0500, Nathan Zimmer wrote:
> > +static unsigned long shmem_interleave(struct vm_area_struct *vma,
> > +   unsigned long addr)
> > +{
> > +   unsigned long offset;
> > +
> > +   /* Use the vm_files prefered node as the initial offset. */
> > +   offset = (unsigned long *) vma->vm_private_data;
>
> Should this be?:
> offset = (unsigned long)vma->vm_private_data;
>
> offset is an unsigned long, not a pointer.  ->vm_private_data is a
> void pointer.
>
> It causes a GCC warning:
> mm/shmem.c: In function ‘shmem_interleave’:
> mm/shmem.c:1341:9: warning: assignment makes integer from pointer without a cast [enabled by default]
>
> > +
> > +   offset += ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +
> > +   return offset;
> > +}
> >  #endif
>
> regards,
> dan carpenter



Re: [PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-23 Thread Dan Carpenter
On Mon, Jul 09, 2012 at 09:46:39AM -0500, Nathan Zimmer wrote:
> +static unsigned long shmem_interleave(struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + unsigned long offset;
> +
> + /* Use the vm_files prefered node as the initial offset. */
> + offset = (unsigned long *) vma->vm_private_data;

Should this be?:
offset = (unsigned long)vma->vm_private_data;

offset is an unsigned long, not a pointer.  ->vm_private_data is a
void pointer.

It causes a GCC warning:
mm/shmem.c: In function ‘shmem_interleave’:
mm/shmem.c:1341:9: warning: assignment makes integer from pointer without a cast [enabled by default]
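
[Editorial illustration, not part of the original mail.] The point about
the cast can be shown with a minimal user-space sketch. The struct
fake_vma and the helpers stash_offset and load_offset are hypothetical
stand-ins, and uintptr_t is used here for portability; kernel code
typically casts straight to unsigned long, which has pointer width on
Linux.

```c
#include <stdint.h>

/* Hypothetical stand-in for the one vm_area_struct field used here. */
struct fake_vma {
    void *vm_private_data;
};

/* Store an integer bias in the pointer-typed field. */
static void stash_offset(struct fake_vma *vma, unsigned long off)
{
    vma->vm_private_data = (void *)(uintptr_t)off;
}

/*
 * Read the bias back. The cast must be to an integer type: casting to
 * (unsigned long *) would leave a pointer on the right-hand side, and
 * assigning that to an unsigned long is what draws the warning.
 */
static unsigned long load_offset(const struct fake_vma *vma)
{
    return (unsigned long)(uintptr_t)vma->vm_private_data;
}
```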

> +
> + offset += ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +
> + return offset;
> +}
>  #endif

regards,
dan carpenter


[PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-09 Thread Nathan Zimmer
The tmpfs superblock grants an offset for each inode as they are created. Each
inode then uses that offset to provide a preferred first node for its interleave
in the newly provided shmem_interleave.

Cc: Christoph Lameter 
Cc: Nick Piggin 
Cc: Hugh Dickins 
Cc: Lee Schermerhorn 
Cc: KOSAKI Motohiro 
Cc: Rik van Riel 
Signed-off-by: Nathan Zimmer 
---
 include/linux/mm.h       |    7 +++++++
 include/linux/shmem_fs.h |    3 +++
 mm/mempolicy.c           |    4 ++++
 mm/shmem.c               |   17 +++++++++++++++++
 4 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..651109e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -238,6 +238,13 @@ struct vm_operations_struct {
     */
    struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
                    unsigned long addr);
+
+   /*
+    * If the policy is interleave allow the vma to suggest a node.
+    */
+   unsigned long (*interleave)(struct vm_area_struct *vma,
+                   unsigned long addr);
+
    int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
        const nodemask_t *to, unsigned long flags);
 #endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index bef2cf0..6995556 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -17,6 +17,7 @@ struct shmem_inode_info {
        char        *symlink;   /* unswappable short symlink */
    };
    struct shared_policy    policy;     /* NUMA memory alloc policy */
+   unsigned long       node_offset;    /* bias for interleaved nodes */
    struct list_head    swaplist;   /* chain of maybes on swap */
    struct list_head    xattr_list; /* list of shmem_xattr */
    struct inode        vfs_inode;
@@ -32,6 +33,8 @@ struct shmem_sb_info {
    kgid_t gid;         /* Mount gid for root directory */
    umode_t mode;       /* Mount mode for root directory */
    struct mempolicy *mpol;     /* default memory policy for mappings */
+   unsigned long next_pref_node;
+            /* next interleave bias to suggest for inodes */
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..e2cbe9e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1663,6 +1663,10 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
 {
    if (vma) {
        unsigned long off;
+       if (vma->vm_ops && vma->vm_ops->interleave) {
+           off = vma->vm_ops->interleave(vma, addr);
+           return offset_il_node(pol, vma, off);
+       }
 
        /*
         * for small pages, there is no difference between
diff --git a/mm/shmem.c b/mm/shmem.c
index d073252..e569338 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -922,6 +922,7 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
    pvma.vm_start = 0;
    pvma.vm_pgoff = index;
    pvma.vm_policy = spol;
+   pvma.vm_private_data = (void *) info->node_offset;
    if (pvma.vm_policy)
        pvma.vm_ops = &shmem_vm_ops;
    else
@@ -938,6 +939,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
    pvma.vm_start = 0;
    pvma.vm_pgoff = index;
    pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+   pvma.vm_private_data = (void *) info->node_offset;
    if (pvma.vm_policy)
        pvma.vm_ops = &shmem_vm_ops;
    else
@@ -1314,6 +1316,19 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
    index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
    return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
 }
+
+static unsigned long shmem_interleave(struct vm_area_struct *vma,
+                   unsigned long addr)
+{
+   unsigned long offset;
+
+   /* Use the vm_files prefered node as the initial offset. */
+   offset = (unsigned long *) vma->vm_private_data;
+
+   offset += ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+
+   return offset;
+}
 #endif
 
 int shmem_lock(struct file *file, int lock, struct user_struct *user)
@@ -1386,6 +1401,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
            inode->i_fop = &shmem_file_operations;
            mpol_shared_policy_init(&info->policy,
                         shmem_get_sbmpol(sbinfo));
+           info->node_offset = ++(sbinfo->next_pref_node);
            break;
        case S_IFDIR:
            inc_nlink(inode);
@@ -2871,6 +2887,7 @@ static const struct super_operations shmem_ops = {
 static const struct vm_operations_struct shmem_vm_ops = {
.fault  = 
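
[Editorial illustration, not part of the original mail.] The rotor scheme
in this v5 patch can be modelled in a few lines of user-space C. This is
only a sketch under stated assumptions: sb_model, inode_model,
inode_model_init and interleave_offset are hypothetical names standing in
for shmem_sb_info, shmem_inode_info, the assignment in shmem_get_inode and
shmem_interleave. It also assumes the pseudo-vma case, where vm_start is 0
and vm_pgoff is the page index, so the offset reduces to node_offset plus
index.

```c
/* Stand-in for the per-superblock rotor (next_pref_node in the patch). */
struct sb_model {
    unsigned long next_pref_node;
};

/* Stand-in for the per-inode bias (node_offset in the patch). */
struct inode_model {
    unsigned long node_offset;
};

/* Each new inode takes the next rotor value, as in shmem_get_inode. */
static void inode_model_init(struct sb_model *sb, struct inode_model *inode)
{
    inode->node_offset = ++sb->next_pref_node;
}

/*
 * Offset handed to the interleave code for page `index` of the file:
 * the inode's bias plus the page index.
 */
static unsigned long interleave_offset(const struct inode_model *inode,
                                       unsigned long index)
{
    return inode->node_offset + index;
}
```

Successive inodes therefore start their interleave one position apart,
which is the same outcome the later inode-number bias achieves without
the extra fields.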
