Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v3

2012-10-11 Thread Andi Kleen
> Alas, include/asm-generic/mman.h doesn't exist now.

git resolved it automagically

> 
> Does this change touch all the hugetlb-capable architectures?

I took another look at this. Not every hugetlb-capable architecture
needs it, only architectures with multiple hugetlb page sizes.

That is only x86, tile and powerpc.

I looked at tile and powerpc and they both have configurable
hugetlb page sizes. So it's somewhat awkward to add defines
for them.

Another disadvantage is that user programs would need
to know which page sizes are configured. That is definitely
awkward, but I don't know of any way around it.

Luckily there's a way in /sys to query this.
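
A minimal sketch of such a query from userspace follows. It assumes the
per-size directories live under /sys/kernel/mm/hugepages and are named
hugepages-<size>kB; the mail only says "/sys", so the path and the ex_*
helper names are illustrative assumptions.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Assumed sysfs location of the per-size hugepage directories. */
#define EX_HUGEPAGES_SYSFS "/sys/kernel/mm/hugepages"

/*
 * Parse a directory name such as "hugepages-2048kB" into a size in kB.
 * Returns 0 if the name does not have that form.
 */
unsigned long ex_hugepage_dir_to_kb(const char *name)
{
	unsigned long kb = 0;
	char suffix[3] = "";

	assert(name != NULL);
	if (sscanf(name, "hugepages-%lu%2s", &kb, suffix) != 2)
		return 0;
	if (strcmp(suffix, "kB") != 0)
		return 0;
	return kb;
}

/*
 * Collect the configured huge page sizes (in kB) from sysfs_dir.
 * Returns the number of sizes stored, or -1 if the directory
 * cannot be opened (for example, no hugetlb support).
 */
int ex_list_hugepage_sizes(const char *sysfs_dir,
			   unsigned long *sizes_kb, int max)
{
	DIR *dir = opendir(sysfs_dir);
	struct dirent *de;
	int n = 0;

	if (!dir)
		return -1;
	while (n < max && (de = readdir(dir)) != NULL) {
		unsigned long kb = ex_hugepage_dir_to_kb(de->d_name);

		if (kb)
			sizes_kb[n++] = kb;
	}
	closedir(dir);
	return n;
}
```

A program would call ex_list_hugepage_sizes(EX_HUGEPAGES_SYSFS, sizes, 8)
and take log2 of each size in bytes to get the value to encode in the
flags. The order of the returned sizes follows readdir() and is not sorted.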

-Andi

> 
> z:/usr/src/linux-3.6> grep -rl MAP_HUGETLB arch
> arch/alpha/include/asm/mman.h
> arch/xtensa/include/asm/mman.h
> arch/parisc/include/asm/mman.h
> arch/tile/include/asm/mman.h
> arch/sparc/include/asm/mman.h
> arch/powerpc/include/asm/mman.h
> arch/mips/include/asm/mman.h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v3

2012-10-09 Thread Andi Kleen

Thanks for the review.

> > I also exported the new flags to the user headers
> > (they were previously under __KERNEL__). Right now only symbols
> > for x86 and some other architecture for 1GB and 2MB are defined.
> > The interface should already work for all other architectures
> > though.
> 
> So some manpages need updating.  I'm not sure which - mmap(2) surely,
> but which for the IPC change?

mmap and shmget. Was already planned.
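
For the shmget side, a hedged sketch of what a caller might look like.
The bit position of the size field (EX_SHM_HUGE_SHIFT = 26) is an
assumption made for illustration; the excerpt of the patch in this thread
does not show the value. The ex_* names are hypothetical.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* Linux value; fallback if the libc hides it */
#endif

/* Assumption for illustration: the size field starts at bit 26. */
#define EX_SHM_HUGE_SHIFT 26

/* Combine SHM_HUGETLB with log2 of the wanted page size (0 = default). */
int ex_shm_huge_flags(int page_size_log)
{
	assert(page_size_log >= 0 && page_size_log < 32);
	return SHM_HUGETLB | (page_size_log << EX_SHM_HUGE_SHIFT);
}

/* Returns a segment id, or -1 with errno set if the request fails. */
int ex_create_huge_segment(size_t bytes, int page_size_log)
{
	return shmget(IPC_PRIVATE, bytes,
		      IPC_CREAT | 0600 | ex_shm_huge_flags(page_size_log));
}
```

Passing 0 as page_size_log leaves the upper bits clear, which is the
existing SHM_HUGETLB behaviour with the default page size.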

> 
> > v2: Port to new tree. Fix unmount.
> > v3: Ported to latest tree.
> > Acked-by: Rik van Riel <r...@redhat.com>
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
> > Signed-off-by: Andi Kleen <a...@linux.intel.com>
> > ---
> >  arch/x86/include/asm/mman.h |    3 ++
> >  fs/hugetlbfs/inode.c        |   63 ++-
> >  include/asm-generic/mman.h  |   13 +
> >  include/linux/hugetlb.h     |   12 +++-
> >  include/linux/shm.h         |   19 +
> >  ipc/shm.c                   |    3 +-
> >  mm/mmap.c                   |    5 ++-
> 
> Alas, include/asm-generic/mman.h doesn't exist now.
> 
> Does this change touch all the hugetlb-capable architectures?

Right now only symbols
for x86 and some other architecture for 1GB and 2MB are defined.
The interface should already work for all other architectures
though.

So they can add new symbols for their page sizes at their leisure.

> > return capable(CAP_IPC_LOCK) || in_group_p(shm_group);
> >  }
> >  
> > +static int get_hstate_idx(int page_size_log)
> 
> nitlet: "page_size_order" would be more kernely.  Or just "page_order".

It's not really an order, just the index.  I think I would prefer the
current name; order would be misleading.

For x86 it's only 0 and 1
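
To make the log-versus-index distinction concrete, here is a small
userspace model of the lookup. The two-entry table (2MB and 1GB with 4K
base pages) and all ex_* names are assumptions standing in for the
kernel's hstates array; it is a sketch, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PAGE_SHIFT	12	/* assumed: 4K base pages, as on x86 */
#define EX_NR_HSTATES	2

struct ex_hstate {
	unsigned int order;	/* huge page size is 1 << (order + PAGE_SHIFT) */
};

/* Model of an x86 setup: index 0 is 2MB (order 9), index 1 is 1GB (order 18). */
static struct ex_hstate ex_hstates[EX_NR_HSTATES] = { { 9 }, { 18 } };
static const int ex_default_hstate_idx = 0;

static struct ex_hstate *ex_size_to_hstate(unsigned long size)
{
	size_t i;

	for (i = 0; i < EX_NR_HSTATES; i++)
		if ((1UL << (ex_hstates[i].order + EX_PAGE_SHIFT)) == size)
			return &ex_hstates[i];
	return NULL;
}

/* Input is log2 of the page size in bytes; output is an array index. */
int ex_get_hstate_idx(int page_size_log)
{
	struct ex_hstate *h;

	assert(ex_default_hstate_idx < EX_NR_HSTATES);
	if (!page_size_log)
		return ex_default_hstate_idx;
	if (page_size_log < 0 ||
	    (size_t)page_size_log >= 8 * sizeof(unsigned long))
		return -1;
	h = ex_size_to_hstate(1UL << page_size_log);
	if (!h)
		return -1;
	return (int)(h - ex_hstates);
}
```

With this table, a log of 21 maps to index 0 and a log of 30 maps to
index 1, so the values coming back are just 0 and 1 while the values
going in are 21 and 30.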

> > +   if (IS_ERR(hugetlbfs_vfsmount[i])) {
> > +   pr_err(
> > +   "hugetlb: Cannot mount internal hugetlbfs for page size %uK",
> > +  ps_kb);
> > +   error = PTR_ERR(hugetlbfs_vfsmount[i]);
> > +   }
> > +   i++;
> > +   }
> > +   /* Non default hstates are optional */
> > +   if (hugetlbfs_vfsmount[default_hstate_idx])
> > +   return 0;
> 
> hm, so if I'm understanding this, the patch mounts hugetlbfs N times,
> once for each page size.  And presumably the shm code somehow selects
> one of these mounts, based on incoming flags.  And presumably if those
> flags are all-zero, the behaviour is unaltered.

Yes.
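
Each of those internal mounts is told its page size through a
"pagesize=<n>K" option string, as the quoted loop shows. A userspace
sketch of just that arithmetic, assuming 4K base pages (EX_PAGE_SHIFT is
an assumption, and ex_pagesize_option is a hypothetical helper):

```c
#include <assert.h>
#include <stdio.h>

#define EX_PAGE_SHIFT 12	/* assumed: 4K base pages, as on x86 */

/*
 * Build the "pagesize=<n>K" option string for a huge page of the given
 * order, using the same arithmetic as the quoted loop. Returns the
 * length that snprintf() reports.
 */
int ex_pagesize_option(unsigned int order, char *buf, size_t len)
{
	unsigned int ps_kb;

	/* order + PAGE_SHIFT is log2 of the size in bytes; -10 converts to kB */
	assert(order + EX_PAGE_SHIFT >= 10 && order + EX_PAGE_SHIFT - 10 < 32);
	ps_kb = 1U << (order + EX_PAGE_SHIFT - 10);
	return snprintf(buf, len, "pagesize=%uK", ps_kb);
}
```

For order 9 this gives "pagesize=2048K" and for order 18 it gives
"pagesize=1048576K", so the 50-byte buffer in the patch has ample room.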

> 
> Please update the changelog to describe all this - the overview of how
> the patch actually operates.

Ok.

> 
> Also, all this affects the /proc/mounts contents, yes?  Let's changelog
> that very-slightly-non-back-compatible user-visible change as well.

AFAIK not. The internal mounts are not visible. At least my laptop
doesn't show them.

> There's some overhead to doing all those additional mounts.  Can we
> quantify it?

On x86 it's one more mount (1GB). AFAIK it's just the sb structure, there's
nothing else preallocated. Maybe a couple hundred bytes per page size.

The number of huge page sizes is normally small, I don't think any architecture
has a large number.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v3

2012-10-09 Thread Andrew Morton
On Wed,  3 Oct 2012 15:24:23 -0700
Andi Kleen <a...@firstfloor.org> wrote:

> From: Andi Kleen <a...@linux.intel.com>
> 
> There was some desire in large applications using MAP_HUGETLB/SHM_HUGETLB
> to use 1GB huge pages on some mappings, and stay with 2MB on others. This
> is useful together with NUMA policy: use 2MB interleaving on some mappings,
> but 1GB on local mappings.
> 
> This patch extends the IPC/SHM syscall interfaces slightly to allow specifying
> the page size.
> 
> It borrows some upper bits in the existing flag arguments and allows encoding
> the log of the desired page size in addition to the *_HUGETLB flag.
> When 0 is specified the default size is used, this makes the change fully
> compatible.
> 
> Extending the internal hugetlb code to handle this is straight forward. Instead
> of a single mount it just keeps an array of them and selects the right
> mount based on the specified page size.
> 
> I also exported the new flags to the user headers
> (they were previously under __KERNEL__). Right now only symbols
> for x86 and some other architecture for 1GB and 2MB are defined.
> The interface should already work for all other architectures
> though.

So some manpages need updating.  I'm not sure which - mmap(2) surely,
but which for the IPC change?

> v2: Port to new tree. Fix unmount.
> v3: Ported to latest tree.
> Acked-by: Rik van Riel <r...@redhat.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
> Signed-off-by: Andi Kleen <a...@linux.intel.com>
> ---
>  arch/x86/include/asm/mman.h |    3 ++
>  fs/hugetlbfs/inode.c        |   63 ++-
>  include/asm-generic/mman.h  |   13 +
>  include/linux/hugetlb.h     |   12 +++-
>  include/linux/shm.h         |   19 +
>  ipc/shm.c                   |    3 +-
>  mm/mmap.c                   |    5 ++-

Alas, include/asm-generic/mman.h doesn't exist now.

Does this change touch all the hugetlb-capable architectures?

z:/usr/src/linux-3.6> grep -rl MAP_HUGETLB arch
arch/alpha/include/asm/mman.h
arch/xtensa/include/asm/mman.h
arch/parisc/include/asm/mman.h
arch/tile/include/asm/mman.h
arch/sparc/include/asm/mman.h
arch/powerpc/include/asm/mman.h
arch/mips/include/asm/mman.h

>
> ...
>
> @@ -933,9 +933,22 @@ static int can_do_hugetlb_shm(void)
>   return capable(CAP_IPC_LOCK) || in_group_p(shm_group);
>  }
>  
> +static int get_hstate_idx(int page_size_log)

nitlet: "page_size_order" would be more kernely.  Or just "page_order".

> +{
> + struct hstate *h;
> +
> + if (!page_size_log)
> + return default_hstate_idx;
> + h = size_to_hstate(1 << page_size_log);
> + if (!h)
> + return -1;
> + return h - hstates;
> +}
>
> ...
>
>  static int __init init_hugetlbfs_fs(void)
>  {
> + struct hstate *h;
>   int error;
> - struct vfsmount *vfsmount;
> + int i;
>  
>   error = bdi_init(&hugetlbfs_backing_dev_info);
>   if (error)
> @@ -1030,14 +1049,26 @@ static int __init init_hugetlbfs_fs(void)
>   if (error)
>   goto out;
>  
> - vfsmount = kern_mount(&hugetlbfs_fs_type);
> + i = 0;
> + for_each_hstate (h) {
> + char buf[50];
> + unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
>  
> - if (!IS_ERR(vfsmount)) {
> - hugetlbfs_vfsmount = vfsmount;
> - return 0;
> - }
> + snprintf(buf, sizeof buf, "pagesize=%uK", ps_kb);
> + hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
> + buf);
>  
> - error = PTR_ERR(vfsmount);
> + if (IS_ERR(hugetlbfs_vfsmount[i])) {
> + pr_err(
> + "hugetlb: Cannot mount internal hugetlbfs for page size %uK",
> +ps_kb);
> + error = PTR_ERR(hugetlbfs_vfsmount[i]);
> + }
> + i++;
> + }
> + /* Non default hstates are optional */
> + if (hugetlbfs_vfsmount[default_hstate_idx])
> + return 0;

hm, so if I'm understanding this, the patch mounts hugetlbfs N times,
once for each page size.  And presumably the shm code somehow selects
one of these mounts, based on incoming flags.  And presumably if those
flags are all-zero, the behaviour is unaltered.

Please update the changelog to describe all this - the overview of how
the patch actually operates.

Also, all this affects the /proc/mounts contents, yes?  Let's changelog
that very-slightly-non-back-compatible user-visible change as well.

There's some overhead to doing all those additional mounts.  Can we
quantify it?

>
> ...
>



[PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v3

2012-10-03 Thread Andi Kleen
From: Andi Kleen <a...@linux.intel.com>

There was some desire in large applications using MAP_HUGETLB/SHM_HUGETLB
to use 1GB huge pages on some mappings, and stay with 2MB on others. This
is useful together with NUMA policy: use 2MB interleaving on some mappings,
but 1GB on local mappings.

This patch extends the IPC/SHM syscall interfaces slightly to allow specifying
the page size.

It borrows some upper bits in the existing flag arguments and allows encoding
the log of the desired page size in addition to the *_HUGETLB flag.
When 0 is specified the default size is used, which keeps the change fully
backward compatible.

Extending the internal hugetlb code to handle this is straightforward. Instead
of a single mount it just keeps an array of them and selects the right
mount based on the specified page size.

I also exported the new flags to the user headers
(they were previously under __KERNEL__). Right now only symbols
for x86 and some other architecture for 1GB and 2MB are defined.
The interface should already work for all other architectures
though.

v2: Port to new tree. Fix unmount.
v3: Ported to latest tree.
Acked-by: Rik van Riel <r...@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
Signed-off-by: Andi Kleen <a...@linux.intel.com>
---
 arch/x86/include/asm/mman.h |    3 ++
 fs/hugetlbfs/inode.c        |   63 ++-
 include/asm-generic/mman.h  |   13 +
 include/linux/hugetlb.h     |   12 +++-
 include/linux/shm.h         |   19 +
 ipc/shm.c                   |    3 +-
 mm/mmap.c                   |    5 ++-
 7 files changed, 100 insertions(+), 18 deletions(-)
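
The flag encoding described in the changelog above can be sketched from
userspace as follows. The shift and mask values (EX_MAP_HUGE_SHIFT = 26,
EX_MAP_HUGE_MASK = 0x3f) are assumptions chosen for illustration, since
the hunk defining MAP_HUGE_SHIFT is not part of the text shown here; the
ex_* names are hypothetical and the MAP_* fallbacks are the x86 values.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS 0x20	/* x86 value; fallback if the libc hides it */
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000	/* x86 value; fallback if the libc hides it */
#endif

/* Assumed layout of the size field in the upper flag bits. */
#define EX_MAP_HUGE_SHIFT 26
#define EX_MAP_HUGE_MASK  0x3f

/* Place log2 of the page size into the upper bits (0 = default size). */
int ex_map_huge_encode(int page_size_log)
{
	assert(page_size_log >= 0 && page_size_log < 32);
	return page_size_log << EX_MAP_HUGE_SHIFT;
}

/* What the receiving side would do to recover the log from the flags. */
int ex_map_huge_decode(int flags)
{
	return (flags >> EX_MAP_HUGE_SHIFT) & EX_MAP_HUGE_MASK;
}

/* Anonymous huge page mapping; returns MAP_FAILED if the size is unavailable. */
void *ex_map_huge(size_t bytes, int page_size_log)
{
	return mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		    ex_map_huge_encode(page_size_log), -1, 0);
}
```

An existing caller that passes only MAP_HUGETLB decodes to 0 and so keeps
getting the default huge page size, which is the compatibility property
the changelog relies on.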

diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
index 593e51d..513b05f 100644
--- a/arch/x86/include/asm/mman.h
+++ b/arch/x86/include/asm/mman.h
@@ -3,6 +3,9 @@
 
 #define MAP_32BIT	0x40		/* only give out 32bit addresses */
 
+#define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
+#define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9460120..f6fb699 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -924,7 +924,7 @@ static struct file_system_type hugetlbfs_fs_type = {
 	.kill_sb	= kill_litter_super,
 };
 
-static struct vfsmount *hugetlbfs_vfsmount;
+static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
 
 static int can_do_hugetlb_shm(void)
 {
@@ -933,9 +933,22 @@ static int can_do_hugetlb_shm(void)
 	return capable(CAP_IPC_LOCK) || in_group_p(shm_group);
 }
 
+static int get_hstate_idx(int page_size_log)
+{
+	struct hstate *h;
+
+	if (!page_size_log)
+		return default_hstate_idx;
+	h = size_to_hstate(1 << page_size_log);
+	if (!h)
+		return -1;
+	return h - hstates;
+}
+
 struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 				size_t size, vm_flags_t acctflag,
-				struct user_struct **user, int creat_flags)
+				struct user_struct **user,
+				int creat_flags, int page_size_log)
 {
 	int error = -ENOMEM;
 	struct file *file;
@@ -945,9 +958,14 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	struct qstr quick_string;
 	struct hstate *hstate;
 	unsigned long num_pages;
+	int hstate_idx;
+
+	hstate_idx = get_hstate_idx(page_size_log);
+	if (hstate_idx < 0)
+		return ERR_PTR(-ENODEV);
 
 	*user = NULL;
-	if (!hugetlbfs_vfsmount)
+	if (!hugetlbfs_vfsmount[hstate_idx])
 		return ERR_PTR(-ENOENT);
 
 	if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
@@ -964,7 +982,7 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 		}
 	}
 
-	root = hugetlbfs_vfsmount->mnt_root;
+	root = hugetlbfs_vfsmount[hstate_idx]->mnt_root;
 	quick_string.name = name;
 	quick_string.len = strlen(quick_string.name);
 	quick_string.hash = 0;
@@ -972,7 +990,7 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	if (!path.dentry)
 		goto out_shm_unlock;
 
-	path.mnt = mntget(hugetlbfs_vfsmount);
+	path.mnt = mntget(hugetlbfs_vfsmount[hstate_idx]);
 	error = -ENOSPC;
 	inode = hugetlbfs_get_inode(root->d_sb, NULL, S_IFREG | S_IRWXUGO, 0);
 	if (!inode)
@@ -1012,8 +1030,9 @@ out_shm_unlock:
 
 static int __init init_hugetlbfs_fs(void)
 {
+	struct hstate *h;
 	int error;
-	struct vfsmount *vfsmount;
+	int i;
 
 	error = bdi_init(&hugetlbfs_backing_dev_info);
 	if (error)
@@ -1030,14 +1049,26 @@ static int __init init_hugetlbfs_fs(void)
 	if (error)
 		goto out;
 
-	vfsmount = kern_mount(&hugetlbfs_fs_type);
+	i = 0;
+	for_each_hstate (h) {
+		char buf[50];
+		unsigned ps_kb = 1U << 
