Re: [PATCH] hugetlb: ensure we do not reference a surplus page after handing it to buddy

2008-02-20 Thread Adam Litke

On Tue, 2008-02-19 at 15:30 -0800, Andrew Morton wrote:
> On Tue, 19 Feb 2008 12:41:51 -0600 Adam Litke <[EMAIL PROTECTED]> wrote:
> 
> > Indeed.  I'll take credit for this thinko...
> > 
> > > On Tue, 2008-02-19 at 18:28 +0000, Andy Whitcroft wrote:
> > > When we free a page via free_huge_page and we detect that we are in
> > > surplus, the page will be returned to the buddy allocator.  After this
> > > we no longer own the page.  However, at the end of free_huge_page we
> > > clear out our mapping pointer from page private.  Even where the page
> > > is not a surplus page, we free it to the hugepage pool, drop the pool
> > > locks and then clear page private.  In either case the page may have
> > > been reallocated.  BAD.
> > > 
> > > Make sure we clear out page private before we free the page.
> > > 
> > > Signed-off-by: Andy Whitcroft <[EMAIL PROTECTED]>
> > 
> > Acked-by: Adam Litke <[EMAIL PROTECTED]>
> 
> Was I right to assume that this is also needed in 2.6.24.x?

Yep.  Thanks Andrew.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hugetlb: ensure we do not reference a surplus page after handing it to buddy

2008-02-19 Thread Adam Litke
Indeed.  I'll take credit for this thinko...

On Tue, 2008-02-19 at 18:28 +0000, Andy Whitcroft wrote:
> When we free a page via free_huge_page and we detect that we are in
> surplus, the page will be returned to the buddy allocator.  After this
> we no longer own the page.  However, at the end of free_huge_page we
> clear out our mapping pointer from page private.  Even where the page
> is not a surplus page, we free it to the hugepage pool, drop the pool
> locks and then clear page private.  In either case the page may have
> been reallocated.  BAD.
> 
> Make sure we clear out page private before we free the page.
> 
> Signed-off-by: Andy Whitcroft <[EMAIL PROTECTED]>

Acked-by: Adam Litke <[EMAIL PROTECTED]>

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[PATCH] hugetlb: handle write-protection faults in follow_hugetlb_page

2007-12-06 Thread Adam Litke

The follow_hugetlb_page() fix I posted (merged as git commit
5b23dbe8173c212d6a326e35347b038705603d39) missed one case.  If the pte is
present, but not writable and write access is requested by the caller to
get_user_pages(), the code will do the wrong thing.  Rather than calling
hugetlb_fault to make the pte writable, it notes the presence of the pte
and continues.

This simple one-liner makes sure we also fault on the pte for this case.
Please apply.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
Acked-by: Dave Kleikamp <[EMAIL PROTECTED]>
---

 mm/hugetlb.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6121b57..6f97821 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -907,7 +907,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 */
pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
 
-   if (!pte || pte_none(*pte)) {
+   if (!pte || pte_none(*pte) || (write && !pte_write(*pte))) {
int ret;
 
		spin_unlock(&mm->page_table_lock);


[PATCH] hugetlb: follow_hugetlb_page for write access

2007-11-07 Thread Adam Litke

When calling get_user_pages(), a write flag is passed in by the caller to
indicate if write access is required on the faulted-in pages.  Currently,
follow_hugetlb_page() ignores this flag and always faults pages for
read-only access.  This can cause data corruption because a device driver
that calls get_user_pages() with write set will not expect COW faults to
occur on the returned pages.

This patch passes the write flag down to follow_hugetlb_page() and makes
sure hugetlb_fault() is called with the right write_access parameter.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 include/linux/hugetlb.h |2 +-
 mm/hugetlb.c|5 +++--
 mm/memory.c |2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3a19b03..31fa0a0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -19,7 +19,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
 void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eab8c42..b645985 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -621,7 +621,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
-   unsigned long *position, int *length, int i)
+   unsigned long *position, int *length, int i,
+   int write)
 {
unsigned long pfn_offset;
unsigned long vaddr = *position;
@@ -643,7 +644,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
		int ret;

		spin_unlock(&mm->page_table_lock);
-		ret = hugetlb_fault(mm, vma, vaddr, 0);
+		ret = hugetlb_fault(mm, vma, vaddr, write);
		spin_lock(&mm->page_table_lock);
		if (!(ret & VM_FAULT_ERROR))
			continue;
diff --git a/mm/memory.c b/mm/memory.c
index f82b359..1bcd444 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,7 +1039,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,

		if (is_vm_hugetlb_page(vma)) {
			i = follow_hugetlb_page(mm, vma, pages, vmas,
-					&start, &len, i);
+					&start, &len, i, write);
			continue;
		}
continue;
}
 


Re: [2.6 patch] mm/hugetlb.c: make a function static

2007-10-24 Thread Adam Litke
On Wed, 2007-10-24 at 18:23 +0200, Adrian Bunk wrote:
> return_unused_surplus_pages() can become static.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
Acked-by: Adam Litke <[EMAIL PROTECTED]>

> 
> ---
> 8932fe99341629d50863643229d25666e9f44e03 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b809ec..b65da0d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -323,7 +323,7 @@ free:
>   * allocated to satisfy the reservation must be explicitly freed if they were
>   * never used.
>   */
> -void return_unused_surplus_pages(unsigned long unused_resv_pages)
> +static void return_unused_surplus_pages(unsigned long unused_resv_pages)
>  {
>       static int nid = -1;
>   struct page *page;
> 
> 
-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[PATCH] hugetlb: Fix pool resizing corner case V2

2007-10-03 Thread Adam Litke

Changes in V2:
 - Removed now unnecessary check as suggested by Ken Chen

When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action  Pool Counters (Total, Free, Resv)
==  =
Set pool to 1 page  1 1 0
Map 1 page MAP_PRIVATE  1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages 3 2 0
Map 2 pages MAP_SHARED  3 2 2
Set pool to 2 pages 2 1 2 <-- Mistake, should be 3 2 2
Touch the 2 shared pages2 0 1 <-- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.  Andrew, this is a standalone fix suitable for
mainline.  It is also now corrected in my latest dynamic pool resizing
patchset which I will send out soon.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
Acked-by: Ken Chen <[EMAIL PROTECTED]>
---

 mm/hugetlb.c |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..b6b3b64 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+			if (count >= nr_huge_pages)
+				return;
			if (PageHighMem(page))
				continue;
			list_del(&page->lru);
update_and_free_page(page);
free_huge_pages--;
free_huge_pages_node[page_to_nid(page)]--;
-   if (count >= nr_huge_pages)
-   return;
}
}
 }
@@ -247,11 +247,9 @@ static unsigned long set_max_huge_pages(unsigned long count)
if (!alloc_fresh_huge_page())
return nr_huge_pages;
}
-   if (count >= nr_huge_pages)
-   return nr_huge_pages;
 
	spin_lock(&hugetlb_lock);
-   count = max(count, resv_huge_pages);
+   count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
try_to_free_low(count);
while (count < nr_huge_pages) {
struct page *page = dequeue_huge_page(NULL, 0);


Re: [PATCH] hugetlb: Fix pool resizing corner case

2007-10-03 Thread Adam Litke
On Wed, 2007-10-03 at 10:40 -0700, Dave Hansen wrote:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 84c795e..7af3908 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
> > for (i = 0; i < MAX_NUMNODES; ++i) {
> > struct page *page, *next;
> > 		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> > +			if (count >= nr_huge_pages)
> > +				return;
> > 			if (PageHighMem(page))
> > 				continue;
> > 			list_del(&page->lru);
> > update_and_free_page(page);
> > free_huge_pages--;
> > free_huge_pages_node[page_to_nid(page)]--;
> > -   if (count >= nr_huge_pages)
> > -   return;
> > }
> > }
> >  }
> 
> That's an excellent problem description.  I'm just a bit hazy on how the
> patch fixes it. :)
> 
> What is the actual error in this loop?  The fact that we can go trying
> to free pages when the count is actually OK?

The above hunk serves only to change the behavior of try_to_free_low()
so that rather than always freeing _at_least_ one huge page, it can
return without having freed any pages. 

> BTW, try_to_free_low(count) kinda sucks for a function name.  Is that
> count the number of pages we're trying to end up with, or the total
> number of low pages that we're trying to free?

I agree the name sucks, but this is a bugfix patch.

> Also, as I look at try_to_free_low(), why do we need to #ifdef it out in
> the case of !HIGHMEM?  If we have CONFIG_HIGHMEM=yes, we still might not
> have any _actual_ high memory.  So, the loop obviously doesn't *hurt*
> when there is no high memory.

Maybe, but not really in-scope of what this patch is trying to
accomplish.

> > @@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
> > return nr_huge_pages;
> > 
> > 	spin_lock(&hugetlb_lock);
> > -   count = max(count, resv_huge_pages);
> > +   count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
> > try_to_free_low(count);
> > while (count < nr_huge_pages) {
> > struct page *page = dequeue_huge_page(NULL, 0);
> 
> The real problem with this line is that "count" is too ambiguous. :)

I agree, count is almost always a bad variable name :)

> We could rewrite the original max() line this way:
> 
>   if (resv_huge_pages > nr_of_pages_to_end_up_with)
>   nr_of_pages_to_end_up_with = resv_huge_pages;
>   try_to_make_the_total_nr_of_hpages(nr_of_pages_to_end_up_with);
> 
> Which makes it more clear that you're setting the number of total pages
> to the number of reserved pages, which is obviously screwy.
> 
> OK, so this is actually saying: "count can never go below
> resv_huge_pages+nr_huge_pages"?

Not quite.  Count can never go below the number of reserved pages plus
pages allocated to MAP_PRIVATE mappings.  That number is computed by:
(resv + (total - free)).

> Could we change try_to_free_low() to free a distinct number of pages?
> 
>   if (count > free_huge_pages)
>   count = free_huge_pages;
>   try_to_free_nr_huge_pages(count);
> 
> I feel a bit sketchy about the "resv_huge_pages + nr_huge_pages -
> free_huge_pages" logic.  Could you elaborate a bit there on what the
> rules are?

The key is that we don't want to shrink the pool below the number of
pages we are committed to keeping around.  Before this patch, we only
accounted for the pages we plan to hand out (reserved huge pages) but
not the ones we've already handed out (total - free).  Does that make
sense?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[PATCH] hugetlb: Fix pool resizing corner case

2007-10-03 Thread Adam Litke

When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action  Pool Counters (Total, Free, Resv)
==  =
Set pool to 1 page  1 1 0
Map 1 page MAP_PRIVATE  1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages 3 2 0
Map 2 pages MAP_SHARED  3 2 2
Set pool to 2 pages 2 1 2 <-- Mistake, should be 3 2 2
Touch the 2 shared pages2 0 1 <-- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.  Andrew, this is a standalone fix suitable for
mainline.  It is also now corrected in my latest dynamic pool resizing
patchset which I will send out soon.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 mm/hugetlb.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..7af3908 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+			if (count >= nr_huge_pages)
+				return;
			if (PageHighMem(page))
				continue;
			list_del(&page->lru);
update_and_free_page(page);
free_huge_pages--;
free_huge_pages_node[page_to_nid(page)]--;
-   if (count >= nr_huge_pages)
-   return;
}
}
 }
@@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
return nr_huge_pages;
 
	spin_lock(&hugetlb_lock);
-   count = max(count, resv_huge_pages);
+   count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
try_to_free_low(count);
while (count < nr_huge_pages) {
struct page *page = dequeue_huge_page(NULL, 0);


[PATCH] hugetlb: Fix pool resizing corner case

2007-10-03 Thread Adam Litke

When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action  Pool Counters (Total, Free, Resv)
==  =
Set pool to 1 page  1 1 0
Map 1 page MAP_PRIVATE  1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages 3 2 0
Map 2 pages MAP_SHARED  3 2 2
Set pool to 2 pages 2 1 2 -- Mistake, should be 3 2 2
Touch the 2 shared pages2 0 1 -- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.  Andrew, this is a standalone fix suitable for
mainline.  It is also now corrected in my latest dynamic pool resizing
patchset which I will send out soon.

Signed-off-by: Adam Litke [EMAIL PROTECTED]
---

 mm/hugetlb.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..7af3908 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
for (i = 0; i  MAX_NUMNODES; ++i) {
struct page *page, *next;
list_for_each_entry_safe(page, next, hugepage_freelists[i], 
lru) {
+   if (count = nr_huge_pages)
+   return;
if (PageHighMem(page))
continue;
list_del(page-lru);
update_and_free_page(page);
free_huge_pages--;
free_huge_pages_node[page_to_nid(page)]--;
-   if (count = nr_huge_pages)
-   return;
}
}
 }
@@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
return nr_huge_pages;
 
spin_lock(hugetlb_lock);
-   count = max(count, resv_huge_pages);
+   count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
try_to_free_low(count);
while (count  nr_huge_pages) {
struct page *page = dequeue_huge_page(NULL, 0);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hugetlb: Fix pool resizing corner case

2007-10-03 Thread Adam Litke
On Wed, 2007-10-03 at 10:40 -0700, Dave Hansen wrote:
  diff --git a/mm/hugetlb.c b/mm/hugetlb.c
  index 84c795e..7af3908 100644
  --- a/mm/hugetlb.c
  +++ b/mm/hugetlb.c
  @@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
  for (i = 0; i  MAX_NUMNODES; ++i) {
  struct page *page, *next;
  list_for_each_entry_safe(page, next, hugepage_freelists[i], 
  lru) {
  +   if (count = nr_huge_pages)
  +   return;
  if (PageHighMem(page))
  continue;
  list_del(page-lru);
  update_and_free_page(page);
  free_huge_pages--;
  free_huge_pages_node[page_to_nid(page)]--;
  -   if (count = nr_huge_pages)
  -   return;
  }
  }
   }
 
 That's an excellent problem description.  I'm just a bit hazy on how the
 patch fixes it. :)
 
 What is the actual error in this loop?  The fact that we can go trying
 to free pages when the count is actually OK?

The above hunk serves only to change the behavior of try_to_free_low()
so that rather than always freeing _at_least_ one huge page, it can
return without having freed any pages. 

 BTW, try_to_free_low(count) kinda sucks for a function name.  Is that
 count the number of pages we're trying to end up with, or the total
 number of low pages that we're trying to free?

I agree the name sucks, but this is a bugfix patch.

 Also, as I look at try_to_free_low(), why do we need to #ifdef it out in
 the case of !HIGHMEM?  If we have CONFIG_HIGHMEM=yes, we still might not
 have any _actual_ high memory.  So, they loop obviously doesn't *hurt*
 when there is no high memory.

Maybe, but not really in-scope of what this patch is trying to
accomplish.

  @@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long 
  count)
  return nr_huge_pages;
  
  spin_lock(hugetlb_lock);
  -   count = max(count, resv_huge_pages);
  +   count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
  try_to_free_low(count);
  while (count  nr_huge_pages) {
  struct page *page = dequeue_huge_page(NULL, 0);
 
 The real problem with this line is that count is too ambiguous. :)

I agree, count is almost always a bad variable name :)

 We could rewrite the original max() line this way:
 
   if (resv_huge_pages  nr_of_pages_to_end_up_with)
   nr_of_pages_to_end_up_with = resv_huge_pages;
   try_to_make_the_total_nr_of_hpages(nr_of_pages_to_end_up_with);
 
 Which makes it more clear that you're setting the number of total pages
 to the number of reserved pages, which is obviously screwy.
 
 OK, so this is actually saying: count can never go below
 resv_huge_pages+nr_huge_pages?

Not quite.  Count can never go below the number of reserved pages plus
pages allocated to MAP_PRIVATE mappings.  That number is computed by:
(resv + (total - free)).

 Could we change try_to_free_low() to free a distinct number of pages?
 
   if (count  free_huge_pages)
   count = free_huge_pages;
   try_to_free_nr_huge_pages(count);
 
 I feel a bit sketchy about the resv_huge_pages + nr_huge_pages -
 free_huge_pages logic.  Could you elaborate a bit there on what the
 rules are?

The key is that we don't want to shrink the pool below the number of
pages we are committed to keeping around.  Before this patch, we only
accounted for the pages we plan to hand out (reserved huge pages) but
not the ones we've already handed out (total - free).  Does that make
sense?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] hugetlb: Fix pool resizing corner case V2

2007-10-03 Thread Adam Litke

Changes in V2:
 - Removed now unnecessary check as suggested by Ken Chen

When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action  Pool Counters (Total, Free, Resv)
==  =
Set pool to 1 page  1 1 0
Map 1 page MAP_PRIVATE  1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages 3 2 0
Map 2 pages MAP_SHARED  3 2 2
Set pool to 2 pages 2 1 2 -- Mistake, should be 3 2 2
Touch the 2 shared pages2 0 1 -- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.  Andrew, this is a standalone fix suitable for
mainline.  It is also now corrected in my latest dynamic pool resizing
patchset which I will send out soon.

Signed-off-by: Adam Litke [EMAIL PROTECTED]
Acked-by: Ken Chen [EMAIL PROTECTED]
---

 mm/hugetlb.c |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..b6b3b64 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+			if (count >= nr_huge_pages)
+				return;
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
 			update_and_free_page(page);
 			free_huge_pages--;
 			free_huge_pages_node[page_to_nid(page)]--;
-			if (count >= nr_huge_pages)
-				return;
 		}
 	}
 }
@@ -247,11 +247,9 @@ static unsigned long set_max_huge_pages(unsigned long count)
 		if (!alloc_fresh_huge_page())
 			return nr_huge_pages;
 	}
-	if (count >= nr_huge_pages)
-		return nr_huge_pages;
 
 	spin_lock(&hugetlb_lock);
-	count = max(count, resv_huge_pages);
+	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
 	try_to_free_low(count);
 	while (count < nr_huge_pages) {
 		struct page *page = dequeue_huge_page(NULL, 0);


Re: [PATCH] Fix find_next_best_node (Re: [BUG] 2.6.23-rc3-mm1 Kernel panic - not syncing: DMA: Memory would be corrupted)

2007-08-28 Thread Adam Litke
On Fri, 2007-08-24 at 15:53 +0900, Yasunori Goto wrote:
> I found find_next_best_node() was wrong.
> I confirmed boot up by the following patch.
> Mel-san, Kamalesh-san, could you try this?

FYI: This patch also allows the alloc-instantiate-race testcase in
libhugetlbfs to pass again :)

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[PATCH] Fix VM_FAULT flags conversion for hugetlb

2007-08-14 Thread Adam Litke
It seems a simple mistake was made when converting follow_hugetlb_page()
over to the VM_FAULT flags bitmask stuff:
(commit  83c54070ee1a2d05c89793884bea1a03f2851ed4).

By using the wrong bitmask, hugetlb_fault() failures are not being
recognized.  This results in an infinite loop whenever
follow_hugetlb_page is involved in a failed fault.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7ca59d..de4cf45 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -643,7 +643,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
	spin_unlock(&mm->page_table_lock);
	ret = hugetlb_fault(mm, vma, vaddr, 0);
	spin_lock(&mm->page_table_lock);
-   if (!(ret & VM_FAULT_MAJOR))
+   if (!(ret & VM_FAULT_ERROR))
continue;
 
remainder = 0;

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[Documentation] Page Table Layout diagrams

2007-08-08 Thread Adam Litke
Hello all.  In an effort to understand how the page tables are laid out
across various architectures I put together some diagrams.  I have
posted them on the linux-mm wiki: http://linux-mm.org/PageTableStructure
and I hope they will be useful to others.  

Just to make sure I am not spreading misinformation, could a few of you
experts take a quick look at the three diagrams I've got finished so far
and point out any errors I have made?  Thanks.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH] remove hugetlb_instantiation_mutex

2007-08-03 Thread Adam Litke
On Mon, 2007-07-30 at 15:15 +0800, Zhang, Yanmin wrote:
> On Fri, 2007-07-27 at 11:37 -0500, Adam Litke wrote:
> > Hey... I am amazed at how quickly you came back with a patch for this :)
> > Thanks for looking at it.  Unfortunately there is one show-stopper and I
> > have some reservations (pun definitely intended) with your approach:
> Thanks for your great comments.

Sorry for such a long delay in responding.  I have been pretty busy
lately.

> > First, your patch does not pass the libhugetlbfs test
> > 'alloc-instantiate-race' which was written to tickle the race which
> > the mutex was introduced to solve.  Your patch works for shared
> > mappings, but not for the private case.
> My testing about private might not be thorough. Function hugetlb_cow has a 
> race
> for multi-thread to fault on the same private page index. But after I fixed 
> it,
> alloc-instantiate-race still failed.
> 
> I tried to google the source code tarball of libhugetlbfs test suite, but 
> couldn't
> find it. Would you like to send me a copy of the test source codes?

http://libhugetlbfs.ozlabs.org/releases/libhugetlbfs-1.2-pre1.tar.gz

The tarball will contain a test called alloc-instantiate-race.  Make
sure to run it in private and shared mode.  Let me know what you find
out.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH] remove hugetlb_instantiation_mutex

2007-07-27 Thread Adam Litke
Hey... I am amazed at how quickly you came back with a patch for this :)
Thanks for looking at it.  Unfortunately there is one show-stopper and I
have some reservations (pun definitely intended) with your approach:

First, your patch does not pass the libhugetlbfs test
'alloc-instantiate-race' which was written to tickle the race which
the mutex was introduced to solve.  Your patch works for shared
mappings, but not for the private case.

Second, the introduction of another pair of global counters triggers my
internal warning system... These global counters are known to cause
problems with NUMA and cpusets.  Have you considered these interactions?
Additionally, the commit/rollback logic you are using closely parallels
what we already do with the huge page reservation mechanism.  Is there
any way you could integrate your stuff into the reservation system to
avoid all the duplicated logic?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH] spinlock in function hugetlb_fault could be deleted

2007-07-23 Thread Adam Litke

Hello.  hugetlb_instantiation_mutex is an extremely heavy-weight lock
whose days are numbered (hopefully).  It exists primarily to arbitrate
a race condition where n (n > 1) threads of execution race to satisfy
the same page fault for a process.  Even though only one hugetlb page
is needed, if (n) are not available, the application can receive a
bogus VM_FAULT_OOM.

Anyway, the hugetlb_instantiation_mutex approach has few friends
around here, so rather than making the code rely more heavily upon it,
perhaps you could focus your efforts on helping us remove it.

On 7/23/07, Zhang, Yanmin <[EMAIL PROTECTED]> wrote:

Function hugetlb_fault needn't hold spinlock mm->page_table_lock,
because when hugetlb_fault is called:
1) mm->mmap_sem is held already;
2) hugetlb_instantiation_mutex is held by hugetlb_fault, which prevents
other threads/processes from entering this critical area. It's impossible
for other threads/processes to change the page table now.


--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center


Re: [PATCH] hugetlb: use set_compound_page_dtor

2007-07-16 Thread Adam Litke

On 7/13/07, Akinobu Mita <[EMAIL PROTECTED]> wrote:

Use appropriate accessor function to set compound page destructor
function.

Cc:  William Irwin <[EMAIL PROTECTED]>
Signed-off-by: Akinobu Mita <[EMAIL PROTECTED]>


Acked-by: Adam Litke <[EMAIL PROTECTED]>

--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center


Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-12 Thread Adam Litke

On 6/12/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:

Adam Litke <[EMAIL PROTECTED]> writes:

> Here's another breakage as a result of shared memory stacked files :(
>
> The NUMA policy for a VMA is determined by checking the following (in the 
order
> given):
>
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
>
> By switching to stacked files for shared memory, get_policy() is now always 
set
> to shm_get_policy which is a wrapper function.  This causes us to stop at step
> 1, which yields NULL for hugetlb instead of task->mempolicy which was the
> previous (and correct) result.
>
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops.  Andi and Christoph, does this look right to you?

I'm confused.

I agree that the behavior you describe is correct.
However I only see two code paths where get_policy is called and
both of them take a NULL result and change it to task->mempolicy:


The coffee hasn't quite absorbed yet, but don't those two code paths
take a NULL result from get_policy() and turn it into default_policy,
not task->mempolicy?


From mm/mempolicy.c

> long do_get_mempolicy(int *policy, nodemask_t *nmask,
>   unsigned long addr, unsigned long flags)
> {
>   int err;
>   struct mm_struct *mm = current->mm;
>   struct vm_area_struct *vma = NULL;
>   struct mempolicy *pol = current->mempolicy;
>
>   cpuset_update_task_memory_state();
>   if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>   return -EINVAL;
>   if (flags & MPOL_F_ADDR) {
>   down_read(&mm->mmap_sem);
>   vma = find_vma_intersection(mm, addr, addr+1);
>   if (!vma) {
>   up_read(&mm->mmap_sem);
>   return -EFAULT;
>   }
>   if (vma->vm_ops && vma->vm_ops->get_policy)
>   pol = vma->vm_ops->get_policy(vma, addr);
>   else
>   pol = vma->vm_policy;
>   } else if (addr)
>   return -EINVAL;
>
>   if (!pol)
>   pol = &default_policy;



> /* Return effective policy for a VMA */
> static struct mempolicy * get_vma_policy(struct task_struct *task,
>   struct vm_area_struct *vma, unsigned long addr)
> {
>   struct mempolicy *pol = task->mempolicy;
>
>   if (vma) {
>   if (vma->vm_ops && vma->vm_ops->get_policy)
>   pol = vma->vm_ops->get_policy(vma, addr);
>   else if (vma->vm_policy &&
>   vma->vm_policy->policy != MPOL_DEFAULT)
>   pol = vma->vm_policy;
>   }
>   if (!pol)
>   pol = &default_policy;
>   return pol;
> }


Does this perhaps need to be:
> Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 4fefbad..8d2672d 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct
> *vma, unsigned long addr)

+   pol = NULL;
>
>   if (sfd->vm_ops->get_policy)
>   pol = sfd->vm_ops->get_policy(vma, addr);
> - else
> + else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
>   pol = vma->vm_policy;
>   return pol;
>  }
>  #endif


afaict this would provide no way for pol to be set to task->mempolicy
for hugetlb per my comment above.

--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center


[shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-11 Thread Adam Litke
Here's another breakage as a result of shared memory stacked files :(

The NUMA policy for a VMA is determined by checking the following (in the order
given):

1) vma->vm_ops->get_policy() (if defined)
2) vma->vm_policy (if defined)
3) task->mempolicy (if defined)
4) Fall back to default_policy

By switching to stacked files for shared memory, get_policy() is now always set
to shm_get_policy which is a wrapper function.  This causes us to stop at step
1, which yields NULL for hugetlb instead of task->mempolicy which was the
previous (and correct) result.

This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
wrapped vm_ops.  Andi and Christoph, does this look right to you?

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>

diff --git a/ipc/shm.c b/ipc/shm.c
index 4fefbad..8d2672d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct 
*vma, unsigned long addr)
 
if (sfd->vm_ops->get_policy)
pol = sfd->vm_ops->get_policy(vma, addr);
-   else
+   else if (vma->vm_policy)
pol = vma->vm_policy;
+   else
+   pol = current->mempolicy;
return pol;
 }
 #endif

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH] shm: Fix the filename of hugetlb sysv shared memory

2007-06-11 Thread Adam Litke

On 6/8/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:

-struct file *hugetlb_zero_setup(size_t size)
+struct file *hugetlb_file_setup(const char *name, size_t size)


The bulk of this patch seems to handle renaming this function.  Is
that really necessary?

--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center


Re: 2.6.21 numa policy and huge pages not working

2007-06-11 Thread Adam Litke
On Sat, 2007-06-09 at 21:10 -0700, dean gaudet wrote:
> On Tue, 15 May 2007, William Lee Irwin III wrote:
> 
> > On Tue, May 15, 2007 at 10:41:06PM -0700, dean gaudet wrote:
> > > prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB 
> > > and 
> > > the interleave policy would be respected.  as of 2.6.21 it doesn't seem 
> > > to 
> > > respect the policy on SHM_HUGETLB request.
> > > see test program below.
> > > output from pre-2.6.21:
> > > 2ab19620 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 
> > > N2=8 N3=8
> > > 2ab19a20 default file=/SYSV\040(deleted) dirty=16384 active=0 
> > > N0=4096 N1=4096 N2=4096 N3=4096
> > > output from 2.6.21:
> > > 2b49b1c0 default file=/10\040(deleted) huge dirty=32 N3=32
> > > 2b49b5c0 default file=/SYSV\040(deleted) dirty=16384 active=0 
> > > N0=4096 N1=4096 N2=4096 N3=4096
> > > was this an intentional behaviour change?  it seems to be only affecting 
> > > SHM_HUGETLB allocations.  (i haven't tested hugetlbfs yet.)
> > > run with "numactl --interleave=all ./shmtest"
> > 
> > This was not intentional. I'll search for where it broke.
> 
> ok i've narrowed it some... maybe.

Thanks a lot for the detailed information.  I am on it.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages

2007-04-19 Thread Adam Litke

On 4/19/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:

@@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
unsigned from, unsigned to)
 {
if (!PageUptodate(page)) {
-   if (to - from != PAGE_CACHE_SIZE) {
+   if (to - from != page_cache_size(file->f_mapping)) {


Where do you introduce page_cache_size()?  Is this added by a
different set of patches I should have applied first?

--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center


Re: [patch 1/2] hugetlb: add resv argument to hugetlb_file_setup

2007-03-26 Thread Adam Litke
On Fri, 2007-03-23 at 15:42 -0700, Ken Chen wrote:
> rename hugetlb_zero_setup() to hugetlb_file_setup() to better match
> function name convention like shmem implementation.  Also add an
> argument to the function to indicate whether file setup should reserve
> hugetlb page upfront or not.
> 
> Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

This patch doesn't really look bad at all, but...

I am worried that what might seem nice and clean right now will slowly
get worse.  This implements an interface on top of another interface
(char device on top of a filesystem).  What is the next hugetlbfs
function that will need a boolean switch to handle a character device
special case?

Am I just worrying too much here?  Although my pagetable_operations
patches aren't the most popular right now, they do have at least one
advantage IMO: they enable side-by-side implementation of the interfaces
as opposed to stacking them.  Keeping them separate removes the need for
if ((vm_flags & VM_HUGETLB) && (is_hugetlbfs_chardev())) checking. 

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread Adam Litke
On Wed, 2007-03-21 at 15:51 -0400, [EMAIL PROTECTED] wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.
> 
> Do you have an exit strategy for the *old* interface?

Not really.  Hugetlbfs needs to be kept around for a number of reasons.
It was designed to support MAP_SHARED mappings and IPC shm segments.  It
is probably still the best interface for those jobs.  Of course
hugetlbfs has lots of users so we must preserve the interface for them.

But... once hugetlbfs is abstracted behind pagetable_operations, you
would have the option of configuring it out of the kernel without losing
access to huge pages by other means (such as the character device).

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



pagetable_ops: Hugetlb character device example

2007-03-21 Thread Adam Litke
The main reason I am advocating a set of pagetable_operations is to
enable the development of a new hugetlb interface.  During the hugetlb
BoF at OLS last year, we talked about a character device that would
behave like /dev/zero.  Many of the people there just wanted to create
MAP_PRIVATE hugetlb mappings without all the fuss of the hugetlbfs
filesystem.  /dev/zero is a familiar interface for
getting anonymous memory so bringing that model to huge pages would make
programming for anonymous huge pages easier.
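
For context, the /dev/zero idiom being borrowed is roughly the following. This is a userspace sketch, not part of the patch: a MAP_PRIVATE mapping of the device yields zero-filled memory that is copied-on-write at first store.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of private, zero-filled memory via /dev/zero.
 * Returns NULL on failure; the caller owns the mapping. */
static void *map_zero_pages(size_t len)
{
	int fd = open("/dev/zero", O_RDWR);
	if (fd < 0)
		return NULL;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping keeps its own reference */
	return p == MAP_FAILED ? NULL : p;
}
```

The proposed hugetlb device would follow the same shape, with each COW fault handing back a huge page instead of a base page.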

The pagetable_operations API opens up possibilities to do some
additional (and completely sane) things.  For example, I have a patch
that alters the character device code below to make use of a hugetlb
ZERO_PAGE.  This eliminates almost all the up-front fault time, allowing
pages to be COW'ed only when first written to.  We cannot do things like
this with hugetlbfs anymore because we have a set of complex semantics
to preserve.

The following patch is an example of what a simple pagetable_operations
consumer could look like.  It does depend on some other cleanups I am
working on (removal of is_file_hugepages(), ...hugetlbfs/inode.c vs.
mm/hugetlb.c separation, etc).  So it is unlikely to apply to any trees
you may have.  I do think it makes a useful illustration of what
legitimate things can be done with a pagetable_operations interface.

commit be72df1c616fb662693a8d4410ce3058f20c71f3
Author: Adam Litke <[EMAIL PROTECTED]>
Date:   Tue Feb 13 14:18:21 2007 -0800

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index fc11063..c5e755b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_IPMI_HANDLER)  += ipmi/
 
 obj-$(CONFIG_HANGCHECK_TIMER)  += hangcheck-timer.o
 obj-$(CONFIG_TCG_TPM)  += tpm/
+obj-$(CONFIG_HUGETLB_PAGE) += page.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/char/page.c b/drivers/char/page.c
new file mode 100644
index 000..e903028
--- /dev/null
+++ b/drivers/char/page.c
@@ -0,0 +1,133 @@
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+
+static const struct {
+   unsigned intminor;
+   char*name;
+   umode_t mode;
+} devlist[] = {
+   {1, "page-huge", S_IRUGO | S_IWUGO},
+};
+
+static struct page *page_nopage(struct vm_area_struct *vma,
+   unsigned long address, int *unused)
+{
+   BUG();
+   return NULL;
+}
+
+static struct vm_operations_struct page_vm_ops = {
+   .nopage = page_nopage,
+};
+
+static int page_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+   unsigned long address, int write_access)
+{
+   pte_t *ptep;
+   pte_t entry, new_entry;
+   int ret;
+   static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+
+   ptep = huge_pte_alloc(mm, address);
+   if (!ptep)
+   return VM_FAULT_OOM;
+
+   mutex_lock(&hugetlb_instantiation_mutex);
+   entry = *ptep;
+   if (pte_none(entry)) {
+   struct page *page;
+
+   page = alloc_huge_page(vma, address);
+   if (!page)
+   return VM_FAULT_OOM;
+   clear_huge_page(page, address);
+
+   ret = VM_FAULT_MINOR;
+   spin_lock(&mm->page_table_lock);
+   if (!pte_none(*ptep))
+   goto out;
+   add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
+   new_entry = make_huge_pte(vma, page, 0);
+   set_huge_pte_at(mm, address, ptep, new_entry);
+   goto out;
+   }
+
+   spin_lock(&mm->page_table_lock);
+   /* Check for a racing update before calling hugetlb_cow */
+   if (likely(pte_same(entry, *ptep)))
+   if (write_access && !pte_write(entry))
+   ret = hugetlb_cow(mm, vma, address, ptep, entry);
+
+out:
+   spin_unlock(&mm->page_table_lock);
+   mutex_unlock(&hugetlb_instantiation_mutex);
+   return ret;
+}
+
+
+static struct pagetable_operations_struct page_pagetable_ops = {
+   .copy_vma   = copy_hugetlb_page_range,
+   .pin_pages  = follow_hugetlb_page,
+   .unmap_page_range   = unmap_hugepage_range,
+   .change_protection  = hugetlb_change_protection,
+   .free_pgtable_range = hugetlb_free_pgd_range,
+   .fault  = page_fault,
+};
+
+static int page_mmap(struct file * file, struct vm_area_struct *vma)
+{
+   if (vma->vm_flags & VM_SHARED)
+   return -EINVAL;
+
+   if (vma->vm_pgoff)
+   return -EINVAL;
+
+   if (vma->vm_start & ~HPAGE_MASK)
+   return -EINVAL;
+
+   if (vma->vm_end & ~HPAGE_MASK)
+   return -EINVAL;
+
+   if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
+   return -EINVAL;
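
Stripped of the hugetlb details, page_fault() above follows the usual "check, allocate without the fine-grained lock, re-check under the lock" idiom. A userspace sketch of just that shape, with all names hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* A lazily instantiated slot, filled in at most once. */
struct slot {
	pthread_mutex_t lock;
	void *ptr;
};

static void *slot_instantiate(struct slot *s, size_t size)
{
	if (s->ptr)			/* unlocked fast-path check */
		return s->ptr;
	void *fresh = malloc(size);	/* expensive work, no lock held */
	if (!fresh)
		return NULL;
	pthread_mutex_lock(&s->lock);
	if (!s->ptr)			/* re-check under the lock */
		s->ptr = fresh;
	else
		free(fresh);		/* lost the race; discard ours */
	void *ret = s->ptr;
	pthread_mutex_unlock(&s->lock);
	return ret;
}
```

In the kernel version the "expensive work" is alloc_huge_page() plus clear_huge_page(), and the re-check is the second pte_none() test under page_table_lock.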

Re: [PATCH] NOMMU: Supply get_unmapped_area() to fix NOMMU SYSV SHM

2007-03-21 Thread Adam Litke
On Wed, 2007-03-21 at 17:44 +, David Howells wrote:
> From: David Howells <[EMAIL PROTECTED]>
> 
> Supply a get_unmapped_area() to fix NOMMU SYSV SHM support.
> 
> Signed-Off-By: David Howells <[EMAIL PROTECTED]>

Hmm, yes... So it would seem my fix for the shm stacked files stuff was
not quite complete.

Acked-by: Adam Litke <[EMAIL PROTECTED]>
-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.

2007-03-21 Thread Adam Litke
On Wed, 2007-03-21 at 15:18 +1100, Nick Piggin wrote:
> Adam Litke wrote:
> > Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
> > ---
> > 
> >  include/linux/mm.h |   25 +
> >  1 files changed, 25 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60e0e4a..7089323 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -98,6 +98,7 @@ struct vm_area_struct {
> >  
> > /* Function pointers to deal with this struct. */
> > struct vm_operations_struct * vm_ops;
> > +   const struct pagetable_operations_struct * pagetable_ops;
> >  
> > /* Information about our backing store: */
> > unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
> 
> Can you remind me why this isn't in vm_ops?

We didn't want to bloat the size of the vm_ops struct for all of its
users.

> Also, it is going to be hugepage-only, isn't it? So should the naming be
> changed to reflect that? And #ifdef it...

They are doing some interesting things on Cell that could take advantage
of this.

> > @@ -218,6 +219,30 @@ struct vm_operations_struct {
> >  };
> >  
> >  struct mmu_gather;
> > +
> > +struct pagetable_operations_struct {
> > +   int (*fault)(struct mm_struct *mm,
> > +   struct vm_area_struct *vma,
> > +   unsigned long address, int write_access);
> 
> I got dibs on fault ;)
> 
> My callback is a sanitised one that basically abstracts the details of the
> virtual memory mapping away, so it is usable by drivers and filesystems.
> 
> You actually want to bypass the normal fault handling because it doesn't
> know how to deal with your virtual memory mapping. Hmm, the best suggestion
> I can come up with is handle_mm_fault... unless you can think of a better
> name for me to use.

How about I use handle_pte_fault?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.

2007-03-21 Thread Adam Litke
On Tue, 2007-03-20 at 16:24 -0700, Dave Hansen wrote:
> On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> > 
> > +#define has_pt_op(vma, op) \
> > +   ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> > +#define pt_op(vma, call) \
> > +   ((vma)->pagetable_ops->call) 
> 
> Can you get rid of these macros?  I think they make it a wee bit harder
> to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  
> 
> +   if (has_pt_op(vma, copy_vma))
> +   return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
> 
> +   if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
> +   return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);
> 
> I guess it does lead to some longish lines.  Does it start looking
> really nasty?

Yeah, it starts to look pretty bad.  Some of these calls are in code
that is already indented several times.

> If you're going to have them, it might just be best to put a single
> unlikely() around the macro definitions themselves to keep anybody from
> having to open-code it for any of the users.  

It should be pretty easy to wrap has_pt_op() with an unlikely().  Good
suggestion.
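
A sketch of what folding the hint into the macro might look like, with unlikely() stubbed as its usual __builtin_expect expansion so the snippet compiles outside the kernel (illustrative only, not the final patch):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stub for the kernel's branch-prediction hint. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Minimal stand-ins for the structures under discussion. */
struct pagetable_ops { int (*copy_vma)(void); };
struct vma_sketch { const struct pagetable_ops *pagetable_ops; };

/* The hint lives inside the predicate macro, so every caller
 * gets it without open-coding unlikely() at each call site. */
#define has_pt_op(vma, op) \
	(unlikely((vma)->pagetable_ops && (vma)->pagetable_ops->op))
#define pt_op(vma, call) ((vma)->pagetable_ops->call)

static int dummy_copy(void) { return 42; }
```

Since pagetable_ops is NULL for the vast majority of VMAs, marking the whole predicate unlikely keeps the common path as the fall-through branch.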

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[PATCH 4/7] unmap_page_range for hugetlb

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c|3 ++-
 include/linux/hugetlb.h |4 ++--
 mm/hugetlb.c|   12 
 mm/memory.c |   10 --
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d0b4b46..198efa7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
v_offset = 0;
 
__unmap_hugepage_range(vma,
-   vma->vm_start + v_offset, vma->vm_end);
+   vma->vm_start + v_offset, vma->vm_end, 0);
}
 }
 
@@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
+   .unmap_page_range   = unmap_hugepage_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3f3e7a6..502c2f8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36db012..d902fb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end)
+   unsigned long end, long *zap_work)
 {
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
	list_del(&page->lru);
put_page(page);
}
+
+   if (zap_work)
+   *zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end, long *zap_work)
 {
/*
 * It is undesirable to test vma->vm_file as it should be non-null
@@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 */
if (vma->vm_file) {
	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-   __unmap_hugepage_range(vma, start, end);
+   __unmap_hugepage_range(vma, start, end, zap_work);
	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
+   return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 01256cf..a3bcaf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
tlb_start_valid = 1;
}
 
-   if (unlikely(is_vm_hugetlb_page(vma))) {
-   unmap_hugepage_range(vma, start, end);
-   zap_work -= (end - start) /
-   (HPAGE_SIZE / PAGE_SIZE);
-   start = end;
-   } else
+   if (unlikely(has_pt_op(vma, unmap_page_range)))
+   start = pt_op(vma, unmap_page_range)
+   (vma, start, end, &zap_work);
+   else
start = unmap_page_range(*tlbp, vma,
start, end, &zap_work, details);
 

[PATCH 6/7] free_pgtable_range for hugetlb

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/memory.c  |6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3de5d93..823a9e3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -570,6 +570,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
+   .free_pgtable_range = hugetlb_free_pgd_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index a3bcaf3..d2f28e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
anon_vma_unlink(vma);
unlink_file_vma(vma);
 
-   if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   if (has_pt_op(vma, free_pgtable_range)) {
+   pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
 * Optimization: gather nearby vmas into one call down
 */
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-  && !is_vm_hugetlb_page(next)) {
+  && !has_pt_op(next, free_pgtable_range)) {
vma = next;
next = vma->vm_next;
anon_vma_unlink(vma);


[PATCH 7/7] hugetlbfs fault handler

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/hugetlb.c |4 +++-
 mm/memory.c  |4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 823a9e3..29e65c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -571,6 +571,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
.free_pgtable_range = hugetlb_free_pgd_range,
+   .fault  = hugetlb_fault,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d902fb9..e0f0607 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long vaddr = *position;
int remainder = *length;
 
+   BUG_ON(!has_pt_op(vma, fault));
+
	spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -606,7 +608,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
int ret;
 
	spin_unlock(&mm->page_table_lock);
-   ret = hugetlb_fault(mm, vma, vaddr, 0);
+   ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
	spin_lock(&mm->page_table_lock);
if (ret == VM_FAULT_MINOR)
continue;
diff --git a/mm/memory.c b/mm/memory.c
index d2f28e7..bd7c243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2499,8 +2499,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
count_vm_event(PGFAULT);
 
-   if (unlikely(is_vm_hugetlb_page(vma)))
-   return hugetlb_fault(mm, vma, address, write_access);
+   if (unlikely(has_pt_op(vma, fault)))
+   return pt_op(vma, fault)(mm, vma, address, write_access);
 
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
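
The dispatch shape this hunk completes (per-VMA ops table consulted first, generic path as the fallback) reduces to a small sketch; the names here are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*fault_fn)(unsigned long address, int write_access);

struct pt_ops_sketch { fault_fn fault; };
struct pt_vma { const struct pt_ops_sketch *pagetable_ops; };

/* Generic handler, used when the VMA provides no override. */
static int generic_fault(unsigned long addr, int write)
{
	(void)addr; (void)write;
	return 100;
}

/* An override, as hugetlb_fault would be via hugetlbfs_pagetable_ops. */
static int huge_fault(unsigned long addr, int write)
{
	(void)addr; (void)write;
	return 200;
}

/* Equivalent of: if (has_pt_op(vma, fault)) return pt_op(vma, fault)(...); */
static int handle_fault(struct pt_vma *vma, unsigned long addr, int write)
{
	if (vma->pagetable_ops && vma->pagetable_ops->fault)
		return vma->pagetable_ops->fault(addr, write);
	return generic_fault(addr, write);
}
```

This is why the series can drop the scattered is_vm_hugetlb_page() tests: the special case is selected by the ops pointer rather than by a flag check at every call site.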


[PATCH 3/7] pin_pages for hugetlb

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/memory.c  |6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2452dde..d0b4b46 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ const struct file_operations hugetlbfs_file_operations = {
 
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
+   .pin_pages  = follow_hugetlb_page,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index 69bb0b3..01256cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct 
mm_struct *mm,
|| !(vm_flags & vma->vm_flags))
return i ? : -EFAULT;
 
-   if (is_vm_hugetlb_page(vma)) {
-   i = follow_hugetlb_page(mm, vma, pages, vmas,
-   &start, &len, i);
+   if (has_pt_op(vma, pin_pages)) {
+   i = pt_op(vma, pin_pages)(mm, vma, pages,
+   vmas, &start, &len, i);
continue;
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/7] change_protection for hugetlb

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/mprotect.c|5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 198efa7..3de5d93 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -569,6 +569,7 @@ static const struct pagetable_operations_struct 
hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
+   .change_protection  = hugetlb_change_protection,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
dirty_accountable = 1;
}
 
-   if (is_vm_hugetlb_page(vma))
-   hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+   if (has_pt_op(vma, change_protection))
+   pt_op(vma, change_protection)(vma, start, end,
+   vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, 
dirty_accountable);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);


[PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)

2007-03-19 Thread Adam Litke

Andrew, given the favorable review of these patches the last time around, would
you consider them for the -mm tree?  Does anyone else have any objections?

The page tables for hugetlb mappings are handled differently than page tables
for normal pages.  Rather than integrating multiple page size support into the
core VM (which would tremendously complicate the code) some hooks were created.
This allows hugetlb special cases to be handled "out of line" by a separate
interface.

Hugetlbfs was the huge page interface chosen.  At the time, large database
users were the only big users of huge pages and the hugetlbfs design meets
their needs pretty well.  Over time, hugetlbfs has been expanded to enable new
uses of huge page memory with varied results.  As features are added, the
semantics become a permanent part of the Linux API.  This makes maintenance of
hugetlbfs an increasingly difficult task and inhibits the addition of features
and functionality in support of ever-changing hardware.

To remedy the situation, I propose an API (currently called
pagetable_operations).  All of the current hugetlbfs-specific hooks are moved
into an operations struct that is attached to VMAs.  The end result is a more
explicit and IMO a cleaner interface between hugetlbfs and the core VM.  We are
then free to add other hugetlb interfaces (such as a /dev/zero-styled character
device) that can operate either in concert with or independent of hugetlbfs.

There should be no measurable performance impact for normal page users (we're
checking if pagetable_ops != NULL instead of checking for vm_flags &
VM_HUGETLB).  Of course we do increase the VMA size by one pointer.  For huge
pages, there is an added indirection for pt_op() calls.  This patch series does
not change the logic of the hugetlbfs operations, just moves them into the
pagetable_operations struct.

I did some pretty basic benchmarking of these patches on ppc64, x86, and x86_64
to get a feel for the fast-path performance impact.  The following tables show
kernbench performance comparisons between a clean 2.6.20 kernel and one with my
patches applied.  These numbers seem well within statistical noise to me.

Changes since V1:
- Made hugetlbfs_pagetable_ops const (Thanks Arjan)

--

KernBench Comparison (ppc64)
----------------------------
                   2.6.20-clean   2.6.20-pgtable_ops   pct. diff
User    CPU time         708.82               708.59        0.03
System  CPU time          62.50                62.58       -0.13
Total   CPU time         771.32               771.17        0.02
Elapsed time             115.40               115.35        0.04

KernBench Comparison (x86)
--------------------------
                   2.6.20-clean   2.6.20-pgtable_ops   pct. diff
User    CPU time        1382.62              1381.88        0.05
System  CPU time         146.06               146.86       -0.55
Total   CPU time        1528.68              1528.74       -0.00
Elapsed time             394.92               396.70       -0.45

KernBench Comparison (x86_64)
-----------------------------
                   2.6.20-clean   2.6.20-pgtable_ops   pct. diff
User    CPU time         559.39               557.97        0.25
System  CPU time          65.10                66.17       -1.64
Total   CPU time         624.49               624.14        0.06
Elapsed time             158.54               158.59       -0.03

The lack of a performance impact makes sense to me.  The following is a
simplified instruction comparison for each case:

2.6.20-clean:
    /* Load vm_flags */
    mov     0x18(ecx),eax
    /* Test for VM_HUGETLB */
    test    $0x40,eax
    /* If set, jump to call stub */
    jne     c0148f04
    ...
    /* copy_hugetlb_page_range call */
c0148f04:
    mov     0xff98(ebp),ecx
    mov     0xff9c(ebp),edx
    mov     0xffa0(ebp),eax
    call    c01536e0

2.6.20-pgtable_ops:
    /* Load pagetable_ops pointer */
    mov     0x48(ecx),eax
    /* Test if it's NULL */
    test    eax,eax
    /* If so, jump away to main code */
    je      c0148ba1
    /* Lookup the operation's function pointer */
    mov     0x4(eax),ebx
    /* Test if it's NULL */
    test    ebx,ebx
    /* If so, jump away to main code */
    je      c0148ba1
    /* pagetable operation call */
    mov     0xff9c(ebp),edx
    mov     0xffa0(ebp),eax
    call    *ebx

For the common case (vma->pagetable_ops == NULL), we do almost the same thing
as the current code: load and test.  The third instruction is different in that
we jump for the common case instead of jumping in the hugetlb case.

[PATCH 1/7] Introduce the pagetable_operations and associated helper macros.

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 include/linux/mm.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60e0e4a..7089323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,6 +98,7 @@ struct vm_area_struct {
 
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
+   const struct pagetable_operations_struct * pagetable_ops;
 
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
@@ -218,6 +219,30 @@ struct vm_operations_struct {
 };
 
 struct mmu_gather;
+
+struct pagetable_operations_struct {
+   int (*fault)(struct mm_struct *mm,
+   struct vm_area_struct *vma,
+   unsigned long address, int write_access);
+   int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
+   struct vm_area_struct *vma);
+   int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
+   struct page **pages, struct vm_area_struct **vmas,
+   unsigned long *position, int *length, int i);
+   void (*change_protection)(struct vm_area_struct *vma,
+   unsigned long address, unsigned long end, pgprot_t newprot);
+   unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
+   unsigned long address, unsigned long end, long *zap_work);
+   void (*free_pgtable_range)(struct mmu_gather **tlb,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling);
+};
+
+#define has_pt_op(vma, op) \
+   ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
+#define pt_op(vma, call) \
+   ((vma)->pagetable_ops->call)
+
 struct inode;
 
 #define page_private(page) ((page)->private)


[PATCH 2/7] copy_vma for hugetlbfs

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |6 ++
 mm/memory.c  |4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8c718a3..2452dde 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 static const struct super_operations hugetlbfs_ops;
 static const struct address_space_operations hugetlbfs_aops;
 const struct file_operations hugetlbfs_file_operations;
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops;
 static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
 
@@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct 
vm_area_struct *vma)
 */
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
+   vma->pagetable_ops = &hugetlbfs_pagetable_ops;
 
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
@@ -563,6 +565,10 @@ const struct file_operations hugetlbfs_file_operations = {
.get_unmapped_area  = hugetlb_get_unmapped_area,
 };
 
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
+   .copy_vma   = copy_hugetlb_page_range,
+};
+
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
.create = hugetlbfs_create,
.lookup = simple_lookup,
diff --git a/mm/memory.c b/mm/memory.c
index e7066e7..69bb0b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
return 0;
}
 
-   if (is_vm_hugetlb_page(vma))
-   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+   if (has_pt_op(vma, copy_vma))
+   return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
 
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);


[PATCH 4/7] unmap_page_range for hugetlb

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c|3 ++-
 include/linux/hugetlb.h |4 ++--
 mm/hugetlb.c|   12 
 mm/memory.c |   10 --
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d0b4b46..198efa7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, 
pgoff_t pgoff)
v_offset = 0;
 
__unmap_hugepage_range(vma,
-   vma->vm_start + v_offset, vma->vm_end);
+   vma->vm_start + v_offset, vma->vm_end, 0);
}
 }
 
@@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
+   .unmap_page_range   = unmap_hugepage_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3f3e7a6..502c2f8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct 
*vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user 
*, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct 
page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned 
long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned 
long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, 
unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned 
long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36db012..d902fb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end)
+   unsigned long end, long *zap_work)
 {
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, 
unsigned long start,
list_del(&page->lru);
put_page(page);
}
+
+   if (zap_work)
+   *zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end, long *zap_work)
 {
/*
 * It is undesirable to test vma->vm_file as it should be non-null
@@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, 
unsigned long start,
 */
if (vma->vm_file) {
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-   __unmap_hugepage_range(vma, start, end);
+   __unmap_hugepage_range(vma, start, end, zap_work);
spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
+   return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 01256cf..a3bcaf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
tlb_start_valid = 1;
}
 
-   if (unlikely(is_vm_hugetlb_page(vma))) {
-   unmap_hugepage_range(vma, start, end);
-   zap_work -= (end - start) /
-   (HPAGE_SIZE / PAGE_SIZE);
-   start = end;
-   } else
+   if (unlikely(has_pt_op(vma, unmap_page_range)))
+   start = pt_op(vma, unmap_page_range)
+   (vma, start, end, zap_work);
+   else
start = unmap_page_range(*tlbp, vma,
start, end, zap_work, details);
 

[PATCH 6/7] free_pgtable_range for hugetlb

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/memory.c  |6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3de5d93..823a9e3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -570,6 +570,7 @@ static const struct pagetable_operations_struct 
hugetlbfs_pagetable_ops = {
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
+   .free_pgtable_range = hugetlb_free_pgd_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index a3bcaf3..d2f28e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct 
vm_area_struct *vma,
anon_vma_unlink(vma);
unlink_file_vma(vma);
 
-   if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   if (has_pt_op(vma, free_pgtable_range)) {
+   pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
 * Optimization: gather nearby vmas into one call down
 */
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-   && !is_vm_hugetlb_page(next)) {
+   && !has_pt_op(next, free_pgtable_range)) {
vma = next;
next = vma->vm_next;
anon_vma_unlink(vma);


[PATCH 7/7] hugetlbfs fault handler

2007-03-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/hugetlb.c |4 +++-
 mm/memory.c  |4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 823a9e3..29e65c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -571,6 +571,7 @@ static const struct pagetable_operations_struct 
hugetlbfs_pagetable_ops = {
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
.free_pgtable_range = hugetlb_free_pgd_range,
+   .fault  = hugetlb_fault,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d902fb9..e0f0607 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
unsigned long vaddr = *position;
int remainder = *length;
 
+   BUG_ON(!has_pt_op(vma, fault));
+
spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -606,7 +608,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
int ret;
 
spin_unlock(&mm->page_table_lock);
-   ret = hugetlb_fault(mm, vma, vaddr, 0);
+   ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
spin_lock(&mm->page_table_lock);
if (ret == VM_FAULT_MINOR)
continue;
diff --git a/mm/memory.c b/mm/memory.c
index d2f28e7..bd7c243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2499,8 +2499,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 
count_vm_event(PGFAULT);
 
-   if (unlikely(is_vm_hugetlb_page(vma)))
-   return hugetlb_fault(mm, vma, address, write_access);
+   if (unlikely(has_pt_op(vma, fault)))
+   return pt_op(vma, fault)(mm, vma, address, write_access);
 
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);


Re: [PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments

2007-03-07 Thread Adam Litke
On Wed, 2007-03-07 at 16:03 -0700, Eric W. Biederman wrote:
> Bill Irwin <[EMAIL PROTECTED]> writes:
> 
> > On Thu, Mar 01, 2007 at 03:46:08PM -0800, Adam Litke wrote:
> >>  static inline int is_file_hugepages(struct file *file)
> >>  {
> >> -  return file->f_op == &hugetlbfs_file_operations;
> >> +  if (file->f_op == &hugetlbfs_file_operations)
> >> +  return 1;
> >> +  if (is_file_shm_hugepages(file))
> >> +  return 1;
> >> +
> >> +  return 0;
> >>  }
> > ...
> >> +int is_file_shm_hugepages(struct file *file)
> >> +{
> >> +  int ret = 0;
> >> +
> >> +  if (file->f_op == &shm_file_operations) {
> >> +  struct shm_file_data *sfd;
> >> +  sfd = shm_file_data(file);
> >> +  ret = is_file_hugepages(sfd->file);
> >> +  }
> >> +  return ret;
> >
> > A comment to prepare others for the impending doubletake might be nice.
> > Or maybe just open-coding the equality check for hugetlbfs_file_operations
> > in is_file_shm_hugepages() if others find it as jarring as I. Please
> > extend my ack to any follow-up fiddling with that.
> 
> You did notice we are testing a different struct file?
> 
> > The patch addresses relatively straightforward issues and naturally at
> > that.
> 
> The whole concept is recursive so I'm not certain being a recursive check
> is that bad but I understand the point.
> 
> I think the right answer is most likely to add an extra file method or
> two so we can remove the need for is_file_hugepages.
> 
> There are still 4 calls to is_file_hugepages in ipc/shm.c and
> 2 calls in mm/mmap.c not counting the one in is_file_shm_hugepages.
> 
> The special cases make it difficult to properly wrap hugetlbfs files
> with another file, which is why we have the weird special case above.

:) Enter my remove-is_file_hugepages() patches (which I posted a few
weeks ago).  I'll rework them and repost soon.  That should help to make
all of this cleaner.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: Kernel Oops with shm namespace cleanups

2007-03-02 Thread Adam Litke
On Thu, 2007-03-01 at 16:08 -0800, Bill Irwin wrote:
> On Wed, Feb 28, 2007 at 02:13:29PM -0600, Adam Litke wrote:
> > Hey.  While testing 2.6.21-rc2 with libhugetlbfs, the shm-fork test case
> > causes the kernel to oops.  To reproduce:  Execute 'make check' in the
> > latest libhugetlbfs source on a 2.6.21-rc2 kernel with 100 huge pages
> > allocated.  Using fewer huge pages will likely also trigger the oops.
> > Libhugetlbfs can be downloaded from:
> > http://libhugetlbfs.ozlabs.org/snapshots/libhugetlbfs-dev-20070228.tar.gz
> 
> Looks like I should grab these testcases for the sake of due diligence
> (not to say I intend to alter maintenance style from primarily review,
> approval, and bugfixing, not that I've been doing as much of any of those
> as I should). To which architectures and/or distributions have the
> userspace bits been ported, or otherwise run/tested on? A quick sniff
> test on an Altix suggests SLES and/or ia64 may trip up the scripts:

Right now we support x86, powerpc, and x86_64.  Segment remapping and
hugetlb malloc won't work on ia64 until long format vhpt is supported (I
suspect).  But the test framework should be adaptable to other
architectures.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments

2007-03-01 Thread Adam Litke

This patch provides the following hugetlb-related fixes to the recent stacked
shm files changes:
 - Update is_file_hugepages() so it will recognize hugetlb shm segments.
 - get_unmapped_area must be called with the nested file struct to handle
   the sfd->file->f_op->get_unmapped_area == NULL case.
 - The fsync f_op must be wrapped since it is specified in the hugetlbfs
   f_ops.

This is based on proposed fixes from Eric Biederman that were debugged and
tested by me.  Without it, attempting to use hugetlb shared memory segments
on powerpc (and likely ia64) will kill your box.

Please Apply.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 include/linux/hugetlb.h |8 +++-
 include/linux/shm.h |5 +
 ipc/shm.c   |   32 ++--
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a60995a..3f3e7a6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include <linux/mempolicy.h>
+#include <linux/shm.h>
 #include <asm/tlbflush.h>
 
 struct ctl_table;
@@ -168,7 +169,12 @@ void hugetlb_put_quota(struct address_space *mapping);
 
 static inline int is_file_hugepages(struct file *file)
 {
-   return file->f_op == &hugetlbfs_file_operations;
+   if (file->f_op == &hugetlbfs_file_operations)
+   return 1;
+   if (is_file_shm_hugepages(file))
+   return 1;
+
+   return 0;
 }
 
 static inline void set_file_hugepages(struct file *file)
diff --git a/include/linux/shm.h b/include/linux/shm.h
index a2c896a..ad2e3af 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -96,12 +96,17 @@ struct shmid_kernel /* private to the kernel */
 
 #ifdef CONFIG_SYSVIPC
long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr);
+extern int is_file_shm_hugepages(struct file *file);
 #else
 static inline long do_shmat(int shmid, char __user *shmaddr,
int shmflg, unsigned long *addr)
 {
return -ENOSYS;
 }
+static inline int is_file_shm_hugepages(struct file *file)
+{
+   return 0;
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/ipc/shm.c b/ipc/shm.c
index eb57e22..abf864d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -285,21 +285,41 @@ static int shm_release(struct inode *ino, struct file *file)
return 0;
 }
 
-#ifndef CONFIG_MMU
+static int shm_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+   int (*fsync) (struct file *, struct dentry *, int datasync);
+   struct shm_file_data *sfd = shm_file_data(file);
+   int ret = -EINVAL;
+
+   fsync = sfd->file->f_op->fsync;
+   if (fsync)
+   ret = fsync(sfd->file, sfd->file->f_path.dentry, datasync);
+   return ret;
+}
+
 static unsigned long shm_get_unmapped_area(struct file *file,
unsigned long addr, unsigned long len, unsigned long pgoff,
unsigned long flags)
 {
struct shm_file_data *sfd = shm_file_data(file);
-   return sfd->file->f_op->get_unmapped_area(sfd->file, addr, len, pgoff,
-   flags);
+   return get_unmapped_area(sfd->file, addr, len, pgoff, flags);
+}
+
+int is_file_shm_hugepages(struct file *file)
+{
+   int ret = 0;
+
+   if (file->f_op == &shm_file_operations) {
+   struct shm_file_data *sfd;
+   sfd = shm_file_data(file);
+   ret = is_file_hugepages(sfd->file);
+   }
+   return ret;
 }
-#else
-#define shm_get_unmapped_area NULL
-#endif
 
 static const struct file_operations shm_file_operations = {
.mmap   = shm_mmap,
+   .fsync  = shm_fsync,
.release= shm_release,
.get_unmapped_area  = shm_get_unmapped_area,
 };
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Kernel Oops with shm namespace cleanups

2007-02-28 Thread Adam Litke
[C00779AD7630] [C005C68C] .run_local_timers+0x1c/0x30
[C00779AD76B0] [C0021358] .timer_interrupt+0x80/0x498
[C00779AD7790] [C0003580] decrementer_common+0x100/0x180
--- Exception: 901 at ._spin_lock+0x30/0x44
LR = .vma_link+0x6c/0x1d4
[C00779AD7A80] [C05C8C08] 0xc05c8c08 (unreliable)
[C00779AD7B30] [C0098D18] .do_mmap_pgoff+0x650/0x818
[C00779AD7C40] [C028F204] .do_shmat+0x304/0x454
[C00779AD7D30] [C0289660] .compat_sys_shmat+0x34/0x94
[C00779AD7DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8
[C00779AD7E30] [C000872C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#2!
Call Trace:
[C00779B934C0] [C000F588] .show_stack+0x68/0x1b4 (unreliable)
[C00779B93570] [C007C5E0] .softlockup_tick+0xec/0x140
[C00779B93630] [C005C68C] .run_local_timers+0x1c/0x30
[C00779B936B0] [C0021358] .timer_interrupt+0x80/0x498
[C00779B93790] [C0003580] decrementer_common+0x100/0x180
--- Exception: 901 at ._spin_lock+0x2c/0x44
LR = .vma_link+0x6c/0x1d4
[C00779B93A80] [C05C8C08] 0xc05c8c08 (unreliable)
[C00779B93B30] [C0098D18] .do_mmap_pgoff+0x650/0x818
[C00779B93C40] [C028F204] .do_shmat+0x304/0x454
[C00779B93D30] [C0289660] .compat_sys_shmat+0x34/0x94
[C00779B93DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8
[C00779B93E30] [C000872C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#3!
Call Trace:
[C00779AD34C0] [C000F588] .show_stack+0x68/0x1b4 (unreliable)
[C00779AD3570] [C007C5E0] .softlockup_tick+0xec/0x140
[C00779AD3630] [C005C68C] .run_local_timers+0x1c/0x30
[C00779AD36B0] [C0021358] .timer_interrupt+0x80/0x498
[C00779AD3790] [C0003580] decrementer_common+0x100/0x180
--- Exception: 901 at ._spin_lock+0x2c/0x44
LR = .vma_link+0x6c/0x1d4
[C00779AD3A80] [C05C8C08] 0xc05c8c08 (unreliable)
[C00779AD3B30] [C0098D18] .do_mmap_pgoff+0x650/0x818
[C00779AD3C40] [C028F204] .do_shmat+0x304/0x454
[C00779AD3D30] [C0289660] .compat_sys_shmat+0x34/0x94
[C00779AD3DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8
[C00779AD3E30] [C000872C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#1!
Call Trace:
[C00779BB34C0] [C000F588] .show_stack+0x68/0x1b4 (unreliable)
[C00779BB3570] [C007C5E0] .softlockup_tick+0xec/0x140
[C00779BB3630] [C005C68C] .run_local_timers+0x1c/0x30
[C00779BB36B0] [C0021358] .timer_interrupt+0x80/0x498
[C00779BB3790] [C0003580] decrementer_common+0x100/0x180
--- Exception: 901 at ._spin_lock+0x2c/0x44
LR = .vma_link+0x6c/0x1d4
[C00779BB3A80] [C05C8C08] 0xc05c8c08 (unreliable)
[C00779BB3B30] [C0098D18] .do_mmap_pgoff+0x650/0x818
[C00779BB3C40] [C028F204] .do_shmat+0x304/0x454
[C00779BB3D30] [C0289660] .compat_sys_shmat+0x34/0x94
[C00779BB3DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8
[C00779BB3E30] [C000872C] syscall_exit+0x0/0x40


-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API

2007-02-19 Thread Adam Litke
On Mon, 2007-02-19 at 19:43 +0100, Arjan van de Ven wrote:
> On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote:
> > The page tables for hugetlb mappings are handled differently than page 
> > tables
> > for normal pages.  Rather than integrating multiple page size support into 
> > the
> > main VM (which would tremendously complicate the code) some hooks were 
> > created.
> > This allows hugetlb special cases to be handled "out of line" by a separate
> > interface.
> 
> ok it makes sense to clean this up.. what I don't like is that there
> STILL are all the double cases... for this to work and be worth it both
> the common case and the hugetlb case should be using the ops structure
> always! Anything else and you're just replacing bad code with bad
> code ;(

Hmm.  Do you think everyone would support an extra pointer indirection
for every handle_pte_fault() call?  If not, then I definitely wouldn't
mind creating a default_pagetable_ops and calling into that.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.

2007-02-19 Thread Adam Litke
On Mon, 2007-02-19 at 19:41 +0100, Arjan van de Ven wrote:
> On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote:
> > Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
> > ---
> > 
> >  include/linux/mm.h |   25 +
> >  1 files changed, 25 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2d2c08d..a2fa66d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -98,6 +98,7 @@ struct vm_area_struct {
> >  
> > /* Function pointers to deal with this struct. */
> > struct vm_operations_struct * vm_ops;
> > +   struct pagetable_operations_struct * pagetable_ops;
> >  
> 
> please make it at least const, those things have no business ever being
> written to right? And by making them const the compiler helps catch
> that, and as bonus the data gets moved to rodata so that it won't share
> cachelines with anything that gets dirty

Yep I agree.  Changed.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 6/7] free_pgtable_range for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/memory.c  |6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1016694..3461f9b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
+   .free_pgtable_range = hugetlb_free_pgd_range,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index aa6b06e..442e6b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
anon_vma_unlink(vma);
unlink_file_vma(vma);
 
-   if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   if (has_pt_op(vma, free_pgtable_range)) {
+   pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
 * Optimization: gather nearby vmas into one call down
 */
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-  && !is_vm_hugetlb_page(next)) {
+  && !has_pt_op(next, free_pgtable_range)) {
vma = next;
next = vma->vm_next;
anon_vma_unlink(vma);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 7/7] hugetlbfs fault handler

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/hugetlb.c |4 +++-
 mm/memory.c  |4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3461f9b..1de73c1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,6 +568,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
.free_pgtable_range = hugetlb_free_pgd_range,
+   .fault  = hugetlb_fault,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3bcc0db..63de369 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -588,6 +588,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long vaddr = *position;
int remainder = *length;
 
+   BUG_ON(!has_pt_op(vma, fault));
+
spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -604,7 +606,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
int ret;
 
spin_unlock(&mm->page_table_lock);
-   ret = hugetlb_fault(mm, vma, vaddr, 0);
+   ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
spin_lock(&mm->page_table_lock);
if (ret == VM_FAULT_MINOR)
continue;
diff --git a/mm/memory.c b/mm/memory.c
index 442e6b2..c8e17b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2455,8 +2455,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
count_vm_event(PGFAULT);
 
-   if (unlikely(is_vm_hugetlb_page(vma)))
-   return hugetlb_fault(mm, vma, address, write_access);
+   if (unlikely(has_pt_op(vma, fault)))
+   return pt_op(vma, fault)(mm, vma, address, write_access);
 
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/7] change_protection for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/mprotect.c|5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 146a4b7..1016694 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -566,6 +566,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
+   .change_protection  = hugetlb_change_protection,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
dirty_accountable = 1;
}
 
-   if (is_vm_hugetlb_page(vma))
-   hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+   if (has_pt_op(vma, change_protection))
+   pt_op(vma, change_protection)(vma, start, end,
+   vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/7] unmap_page_range for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c|3 ++-
 include/linux/hugetlb.h |4 ++--
 mm/hugetlb.c|   12 
 mm/memory.c |   10 --
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2d1dd84..146a4b7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
v_offset = 0;
 
__unmap_hugepage_range(vma,
-   vma->vm_start + v_offset, vma->vm_end);
+   vma->vm_start + v_offset, vma->vm_end, 0);
}
 }
 
@@ -565,6 +565,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
+   .unmap_page_range   = unmap_hugepage_range,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a60995a..add92b3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -16,8 +16,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cb362f7..3bcc0db 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end)
+   unsigned long end, long *zap_work)
 {
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -397,10 +397,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
list_del(&page->lru);
put_page(page);
}
+
+   if (zap_work)
+   *zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end, long *zap_work)
 {
/*
 * It is undesirable to test vma->vm_file as it should be non-null
@@ -412,9 +415,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 */
if (vma->vm_file) {
spin_lock(>vm_file->f_mapping->i_mmap_lock);
-   __unmap_hugepage_range(vma, start, end);
+   __unmap_hugepage_range(vma, start, end, zap_work);
spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
+   return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 9467c65..aa6b06e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
tlb_start_valid = 1;
}
 
-   if (unlikely(is_vm_hugetlb_page(vma))) {
-   unmap_hugepage_range(vma, start, end);
-   zap_work -= (end - start) /
-   (HPAGE_SIZE / PAGE_SIZE);
-   start = end;
-   } else
+   if (unlikely(has_pt_op(vma, unmap_page_range)))
+   start = pt_op(vma, unmap_page_range)
+   (vma, start, end, &zap_work);
+   else
start = unmap_page_range(*tlbp, vma,
start, end, &zap_work, details);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/7] pin_pages for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |1 +
 mm/memory.c  |6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c0a7984..2d1dd84 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -564,6 +564,7 @@ const struct file_operations hugetlbfs_file_operations = {
 
 static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
+   .pin_pages  = follow_hugetlb_page,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index 80eafd5..9467c65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
|| !(vm_flags & vma->vm_flags))
return i ? : -EFAULT;
 
-   if (is_vm_hugetlb_page(vma)) {
-   i = follow_hugetlb_page(mm, vma, pages, vmas,
-   &start, &len, i);
+   if (has_pt_op(vma, pin_pages)) {
+   i = pt_op(vma, pin_pages)(mm, vma, pages,
+   vmas, &start, &len, i);
continue;
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/7] copy_vma for hugetlbfs

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 fs/hugetlbfs/inode.c |6 ++
 mm/memory.c  |4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4f4cd13..c0a7984 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 static struct super_operations hugetlbfs_ops;
 static const struct address_space_operations hugetlbfs_aops;
 const struct file_operations hugetlbfs_file_operations;
+static struct pagetable_operations_struct hugetlbfs_pagetable_ops;
 static struct inode_operations hugetlbfs_dir_inode_operations;
 static struct inode_operations hugetlbfs_inode_operations;
 
@@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct 
vm_area_struct *vma)
 */
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
+   vma->pagetable_ops = &hugetlbfs_pagetable_ops;
 
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
@@ -560,6 +562,10 @@ const struct file_operations hugetlbfs_file_operations = {
.get_unmapped_area  = hugetlb_get_unmapped_area,
 };
 
+static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
+   .copy_vma   = copy_hugetlb_page_range,
+};
+
 static struct inode_operations hugetlbfs_dir_inode_operations = {
.create = hugetlbfs_create,
.lookup = simple_lookup,
diff --git a/mm/memory.c b/mm/memory.c
index ef09f0a..80eafd5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
return 0;
}
 
-   if (is_vm_hugetlb_page(vma))
-   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+   if (has_pt_op(vma, copy_vma))
+   return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
 
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);


[PATCH 0/7] [RFC] hugetlb: pagetable_operations API

2007-02-19 Thread Adam Litke

The page tables for hugetlb mappings are handled differently than page tables
for normal pages.  Rather than integrating multiple page size support into the
main VM (which would tremendously complicate the code) some hooks were created.
This allows hugetlb special cases to be handled "out of line" by a separate
interface.

Hugetlbfs was the huge page interface chosen.  At the time, large database
users were the only big users of huge pages and the hugetlbfs design meets
their needs pretty well.  Over time, hugetlbfs has been expanded to enable new
uses of huge page memory with varied results.  As features are added, the
semantics become a permanent part of the Linux API.  This makes maintenance of
hugetlbfs an increasingly difficult task and inhibits the addition of features
and functionality in support of ever-changing hardware.

To remedy the situation, I propose an API (currently called
pagetable_operations).  All of the current hugetlbfs-specific hooks are moved
into an operations struct that is attached to VMAs.  The end result is a more
explicit and IMO a cleaner interface between hugetlbfs and the core VM.  We are
then free to add other hugetlb interfaces (such as a /dev/zero-styled character
device) that can operate either in concert with or independent of hugetlbfs.

There should be no measurable performance impact for normal page users (we're
checking if pagetable_ops != NULL instead of checking for vm_flags &
VM_HUGETLB).  Of course we do increase the VMA size by one pointer.  For huge
pages, there is an added indirection for pt_op() calls.  This patch series does
not change the logic of the hugetlbfs operations, just moves them into the
pagetable_operations struct.

Comments?  Do you think it's as good of an idea as I do?


[PATCH 1/7] Introduce the pagetable_operations and associated helper macros.

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 include/linux/mm.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2d2c08d..a2fa66d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,6 +98,7 @@ struct vm_area_struct {
 
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
+   struct pagetable_operations_struct * pagetable_ops;
 
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
@@ -218,6 +219,30 @@ struct vm_operations_struct {
 };
 
 struct mmu_gather;
+
+struct pagetable_operations_struct {
+   int (*fault)(struct mm_struct *mm,
+   struct vm_area_struct *vma,
+   unsigned long address, int write_access);
+   int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
+   struct vm_area_struct *vma);
+   int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
+   struct page **pages, struct vm_area_struct **vmas,
+   unsigned long *position, int *length, int i);
+   void (*change_protection)(struct vm_area_struct *vma,
+   unsigned long address, unsigned long end, pgprot_t newprot);
+   unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
+   unsigned long address, unsigned long end, long *zap_work);
+   void (*free_pgtable_range)(struct mmu_gather **tlb,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling);
+};
+
+#define has_pt_op(vma, op) \
+   ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
+#define pt_op(vma, call) \
+   ((vma)->pagetable_ops->call)
+
 struct inode;
 
 #define page_private(page) ((page)->private)


[PATCH 4/7] unmap_page_range for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke [EMAIL PROTECTED]
---

 fs/hugetlbfs/inode.c|3 ++-
 include/linux/hugetlb.h |4 ++--
 mm/hugetlb.c|   12 
 mm/memory.c |   10 --
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2d1dd84..146a4b7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, 
pgoff_t pgoff)
v_offset = 0;
 
__unmap_hugepage_range(vma,
-   vma->vm_start + v_offset, vma->vm_end);
+   vma->vm_start + v_offset, vma->vm_end, 0);
}
 }
 
@@ -565,6 +565,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
+   .unmap_page_range   = unmap_hugepage_range,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a60995a..add92b3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -16,8 +16,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct 
*vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user 
*, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct 
page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned 
long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned 
long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, 
unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned 
long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cb362f7..3bcc0db 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end)
+   unsigned long end, long *zap_work)
 {
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -397,10 +397,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, 
unsigned long start,
list_del(&page->lru);
put_page(page);
}
+
+   if (zap_work)
+   *zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end, long *zap_work)
 {
/*
 * It is undesirable to test vma->vm_file as it should be non-null
@@ -412,9 +415,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, 
unsigned long start,
 */
if (vma->vm_file) {
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-   __unmap_hugepage_range(vma, start, end);
+   __unmap_hugepage_range(vma, start, end, zap_work);
spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
+   return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 9467c65..aa6b06e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
tlb_start_valid = 1;
}
 
-   if (unlikely(is_vm_hugetlb_page(vma))) {
-   unmap_hugepage_range(vma, start, end);
-   zap_work -= (end - start) /
-   (HPAGE_SIZE / PAGE_SIZE);
-   start = end;
-   } else
+   if (unlikely(has_pt_op(vma, unmap_page_range)))
+   start = pt_op(vma, unmap_page_range)
(vma, start, end, &zap_work);
+   else
start = unmap_page_range(*tlbp, vma,
start, end, &zap_work, details);
 


[PATCH 7/7] hugetlbfs fault handler

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke [EMAIL PROTECTED]
---

 fs/hugetlbfs/inode.c |1 +
 mm/hugetlb.c |4 +++-
 mm/memory.c  |4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3461f9b..1de73c1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,6 +568,7 @@ static struct pagetable_operations_struct 
hugetlbfs_pagetable_ops = {
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
.free_pgtable_range = hugetlb_free_pgd_range,
+   .fault  = hugetlb_fault,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3bcc0db..63de369 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -588,6 +588,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
unsigned long vaddr = *position;
int remainder = *length;
 
+   BUG_ON(!has_pt_op(vma, fault));
+
spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -604,7 +606,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
int ret;
 
spin_unlock(&mm->page_table_lock);
-   ret = hugetlb_fault(mm, vma, vaddr, 0);
+   ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
spin_lock(&mm->page_table_lock);
if (ret == VM_FAULT_MINOR)
continue;
diff --git a/mm/memory.c b/mm/memory.c
index 442e6b2..c8e17b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2455,8 +2455,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 
count_vm_event(PGFAULT);
 
-   if (unlikely(is_vm_hugetlb_page(vma)))
-   return hugetlb_fault(mm, vma, address, write_access);
+   if (unlikely(has_pt_op(vma, fault)))
+   return pt_op(vma, fault)(mm, vma, address, write_access);
 
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);


[PATCH 5/7] change_protection for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke [EMAIL PROTECTED]
---

 fs/hugetlbfs/inode.c |1 +
 mm/mprotect.c|5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 146a4b7..1016694 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -566,6 +566,7 @@ static struct pagetable_operations_struct 
hugetlbfs_pagetable_ops = {
.copy_vma   = copy_hugetlb_page_range,
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
+   .change_protection  = hugetlb_change_protection,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
dirty_accountable = 1;
}
 
-   if (is_vm_hugetlb_page(vma))
-   hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+   if (has_pt_op(vma, change_protection))
+   pt_op(vma, change_protection)(vma, start, end,
+   vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, 
dirty_accountable);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);


[PATCH 6/7] free_pgtable_range for hugetlb

2007-02-19 Thread Adam Litke

Signed-off-by: Adam Litke [EMAIL PROTECTED]
---

 fs/hugetlbfs/inode.c |1 +
 mm/memory.c  |6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1016694..3461f9b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ static struct pagetable_operations_struct 
hugetlbfs_pagetable_ops = {
.pin_pages  = follow_hugetlb_page,
.unmap_page_range   = unmap_hugepage_range,
.change_protection  = hugetlb_change_protection,
+   .free_pgtable_range = hugetlb_free_pgd_range,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index aa6b06e..442e6b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct 
vm_area_struct *vma,
anon_vma_unlink(vma);
unlink_file_vma(vma);
 
-   if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   if (has_pt_op(vma, free_pgtable_range)) {
+   pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
 * Optimization: gather nearby vmas into one call down
 */
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-   && !is_vm_hugetlb_page(next)) {
+   && !has_pt_op(next, free_pgtable_range)) {
vma = next;
next = vma->vm_next;
anon_vma_unlink(vma);


Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.

2007-02-19 Thread Adam Litke
On Mon, 2007-02-19 at 19:41 +0100, Arjan van de Ven wrote:
> On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote:
> > Signed-off-by: Adam Litke [EMAIL PROTECTED]
> > ---
> >
> >  include/linux/mm.h |   25 +
> >  1 files changed, 25 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2d2c08d..a2fa66d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -98,6 +98,7 @@ struct vm_area_struct {
> >
> > 	/* Function pointers to deal with this struct. */
> > 	struct vm_operations_struct * vm_ops;
> > +	struct pagetable_operations_struct * pagetable_ops;
> >
> 
> please make it at least const, those things have no business ever being
> written to, right?  And by making them const the compiler helps catch
> that, and as a bonus the data gets moved to rodata so that it won't share
> cachelines with anything that gets dirty.

Yep I agree.  Changed.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API

2007-02-19 Thread Adam Litke
On Mon, 2007-02-19 at 19:43 +0100, Arjan van de Ven wrote:
> On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote:
> > The page tables for hugetlb mappings are handled differently than page
> > tables for normal pages.  Rather than integrating multiple page size
> > support into the main VM (which would tremendously complicate the code)
> > some hooks were created.  This allows hugetlb special cases to be
> > handled out of line by a separate interface.
> 
> ok it makes sense to clean this up.. what I don't like is that there
> STILL are all the double cases... for this to work and be worth it both
> the common case and the hugetlb case should be using the ops structure
> always!  Anything else and you're just replacing bad code with bad
> code ;(

Hmm.  Do you think everyone would support an extra pointer indirection
for every handle_pte_fault() call?  If not, then I definitely wouldn't
mind creating a default_pagetable_ops and calling into that.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



[PATCH] Define the shmem_inode_info flags directly

2007-02-07 Thread Adam Litke

Andrew: This is a pretty basic and obvious cleanup IMO.  How about a ride in
-mm?

Defining flags in terms of other flags is always confusing.  Give them literal
values instead of defining them in terms of VM_flags.  While we're at it, move
them to a header file in preparation for the introduction of a
SHMEM_HUGETLB flag.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 include/linux/shmem_fs.h |4 
 mm/shmem.c   |4 
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f3c5189..3ea0b6e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -8,6 +8,10 @@
 
 #define SHMEM_NR_DIRECT 16
 
+/* These info->flags are used to handle pagein/truncate races efficiently */
+#define SHMEM_PAGEIN   0x0001
+#define SHMEM_TRUNCATE 0x0002
+
 struct shmem_inode_info {
spinlock_t  lock;
unsigned long   flags;
diff --git a/mm/shmem.c b/mm/shmem.c
index 70da7a0..a9bdb0d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,10 +66,6 @@
 
 #define VM_ACCT(size)(PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
 
-/* info->flags needs VM_flags to handle pagein/truncate races efficiently */
-#define SHMEM_PAGEINVM_READ
-#define SHMEM_TRUNCATE  VM_WRITE
-
 /* Definition to limit shmem_truncate's steps between cond_rescheds */
 #define LATENCY_LIMIT   64
 



[PATCH] Don't allow the stack to grow into hugetlb reserved regions

2007-01-29 Thread Adam Litke

When expanding the stack, we don't currently check if the VMA will cross into
an area of the address space that is reserved for hugetlb pages.  Subsequent
faults on the expanded portion of such a VMA will confuse the low-level MMU
code, resulting in an OOPS.  Check for this.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---

 mm/mmap.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9717337..2c6b163 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1477,6 +1477,7 @@ static int acct_stack_growth(struct vm_area_struct * vma, 
unsigned long size, un
 {
struct mm_struct *mm = vma->vm_mm;
struct rlimit *rlim = current->signal->rlim;
+   unsigned long new_start;
 
/* address space limit tests */
if (!may_expand_vm(mm, grow))
@@ -1496,6 +1497,12 @@ static int acct_stack_growth(struct vm_area_struct * 
vma, unsigned long size, un
return -ENOMEM;
}
 
+   /* Check to ensure the stack will not grow into a hugetlb-only region. */
+   new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
+   vma->vm_end - size;
+   if (is_hugepage_only_range(vma->vm_mm, new_start, size))
+   return -EFAULT;
+
/*
 * Overcommit..  This must be the final test, as it will
 * update security statistics.

