Re: [PATCH] hugetlb: ensure we do not reference a surplus page after handing it to buddy
On Tue, 2008-02-19 at 15:30 -0800, Andrew Morton wrote:
> On Tue, 19 Feb 2008 12:41:51 -0600 Adam Litke <[EMAIL PROTECTED]> wrote:
>
> > Indeed.  I'll take credit for this thinko...
> >
> > On Tue, 2008-02-19 at 18:28 +0000, Andy Whitcroft wrote:
> > > When we free a page via free_huge_page and we detect that we are in
> > > surplus the page will be returned to the buddy.  After this we no
> > > longer own the page.  However at the end of free_huge_page we clear
> > > out our mapping pointer from page private.  Even where the page is
> > > not a surplus we free the page to the hugepage pool, drop the pool
> > > locks and then clear page private.  In either case the page may have
> > > been reallocated.  BAD.
> > >
> > > Make sure we clear out page private before we free the page.
> > >
> > > Signed-off-by: Andy Whitcroft <[EMAIL PROTECTED]>
> >
> > Acked-by: Adam Litke <[EMAIL PROTECTED]>
>
> Was I right to assume that this is also needed in 2.6.24.x?

Yep.  Thanks Andrew.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH] hugetlb: ensure we do not reference a surplus page after handing it to buddy
Indeed.  I'll take credit for this thinko...

On Tue, 2008-02-19 at 18:28 +0000, Andy Whitcroft wrote:
> When we free a page via free_huge_page and we detect that we are in
> surplus the page will be returned to the buddy.  After this we no longer
> own the page.  However at the end of free_huge_page we clear out our
> mapping pointer from page private.  Even where the page is not a surplus
> we free the page to the hugepage pool, drop the pool locks and then clear
> page private.  In either case the page may have been reallocated.  BAD.
>
> Make sure we clear out page private before we free the page.
>
> Signed-off-by: Andy Whitcroft <[EMAIL PROTECTED]>

Acked-by: Adam Litke <[EMAIL PROTECTED]>

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
[PATCH] hugetlb: handle write-protection faults in follow_hugetlb_page
The follow_hugetlb_page() fix I posted (merged as git commit
5b23dbe8173c212d6a326e35347b038705603d39) missed one case.  If the pte is
present, but not writable, and write access is requested by the caller to
get_user_pages(), the code will do the wrong thing.  Rather than calling
hugetlb_fault() to make the pte writable, it notes the presence of the pte
and continues.

This simple one-liner makes sure we also fault on the pte for this case.
Please apply.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
Acked-by: Dave Kleikamp <[EMAIL PROTECTED]>
---
 mm/hugetlb.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6121b57..6f97821 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -907,7 +907,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);

-		if (!pte || pte_none(*pte)) {
+		if (!pte || pte_none(*pte) || (write && !pte_write(*pte))) {
 			int ret;

 			spin_unlock(&mm->page_table_lock);
[PATCH] hugetlb: follow_hugetlb_page for write access
When calling get_user_pages(), a write flag is passed in by the caller to
indicate if write access is required on the faulted-in pages.  Currently,
follow_hugetlb_page() ignores this flag and always faults pages for
read-only access.  This can cause data corruption because a device driver
that calls get_user_pages() with write set will not expect COW faults to
occur on the returned pages.

This patch passes the write flag down to follow_hugetlb_page() and makes
sure hugetlb_fault() is called with the right write_access parameter.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 include/linux/hugetlb.h |    2 +-
 mm/hugetlb.c            |    5 +++--
 mm/memory.c             |    2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3a19b03..31fa0a0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -19,7 +19,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
 void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eab8c42..b645985 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -621,7 +621,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i)
+			unsigned long *position, int *length, int i,
+			int write)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
@@ -643,7 +644,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;

 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = hugetlb_fault(mm, vma, vaddr, write);
 			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index f82b359..1bcd444 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,7 +1039,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
+						&start, &len, i, write);
 			continue;
 		}
Re: [2.6 patch] mm/hugetlb.c: make a function static
On Wed, 2007-10-24 at 18:23 +0200, Adrian Bunk wrote:
> return_unused_surplus_pages() can become static.
>
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

Acked-by: Adam Litke <[EMAIL PROTECTED]>

> ---
> 8932fe99341629d50863643229d25666e9f44e03
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b809ec..b65da0d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -323,7 +323,7 @@ free:
>   * allocated to satisfy the reservation must be explicitly freed if they were
>   * never used.
>   */
> -void return_unused_surplus_pages(unsigned long unused_resv_pages)
> +static void return_unused_surplus_pages(unsigned long unused_resv_pages)
>  {
>  	static int nid = -1;
>  	struct page *page;

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
[PATCH] hugetlb: Fix pool resizing corner case V2
Changes in V2:
 - Removed now unnecessary check as suggested by Ken Chen

When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action                          Pool Counters (Total, Free, Resv)
======                          =================================
Set pool to 1 page              1 1 0
Map 1 page MAP_PRIVATE          1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages             3 2 0
Map 2 pages MAP_SHARED          3 2 2
Set pool to 2 pages             2 1 2  <-- Mistake, should be 3 2 2
Touch the 2 shared pages        2 0 1  <-- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.

Andrew, this is a standalone fix suitable for mainline.  It is also now
corrected in my latest dynamic pool resizing patchset which I will send
out soon.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
Acked-by: Ken Chen <[EMAIL PROTECTED]>
---
 mm/hugetlb.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..b6b3b64 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+			if (count >= nr_huge_pages)
+				return;
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
 			update_and_free_page(page);
 			free_huge_pages--;
 			free_huge_pages_node[page_to_nid(page)]--;
-			if (count >= nr_huge_pages)
-				return;
 		}
 	}
 }
@@ -247,11 +247,9 @@ static unsigned long set_max_huge_pages(unsigned long count)
 		if (!alloc_fresh_huge_page())
 			return nr_huge_pages;
 	}
-	if (count >= nr_huge_pages)
-		return nr_huge_pages;

 	spin_lock(&hugetlb_lock);
-	count = max(count, resv_huge_pages);
+	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
 	try_to_free_low(count);
 	while (count < nr_huge_pages) {
 		struct page *page = dequeue_huge_page(NULL, 0);
Re: [PATCH] hugetlb: Fix pool resizing corner case
On Wed, 2007-10-03 at 10:40 -0700, Dave Hansen wrote:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 84c795e..7af3908 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
> >  	for (i = 0; i < MAX_NUMNODES; ++i) {
> >  		struct page *page, *next;
> >  		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> > +			if (count >= nr_huge_pages)
> > +				return;
> >  			if (PageHighMem(page))
> >  				continue;
> >  			list_del(&page->lru);
> >  			update_and_free_page(page);
> >  			free_huge_pages--;
> >  			free_huge_pages_node[page_to_nid(page)]--;
> > -			if (count >= nr_huge_pages)
> > -				return;
> >  		}
> >  	}
> >  }
>
> That's an excellent problem description.  I'm just a bit hazy on how the
> patch fixes it. :)
>
> What is the actual error in this loop?  The fact that we can go trying
> to free pages when the count is actually OK?

The above hunk serves only to change the behavior of try_to_free_low() so
that rather than always freeing _at_least_ one huge page, it can return
without having freed any pages.

> BTW, try_to_free_low(count) kinda sucks for a function name.  Is that
> count the number of pages we're trying to end up with, or the total
> number of low pages that we're trying to free?

I agree the name sucks, but this is a bugfix patch.

> Also, as I look at try_to_free_low(), why do we need to #ifdef it out in
> the case of !HIGHMEM?  If we have CONFIG_HIGHMEM=yes, we still might not
> have any _actual_ high memory.  So, the loop obviously doesn't *hurt*
> when there is no high memory.

Maybe, but not really in-scope of what this patch is trying to accomplish.
> > @@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
> >  		return nr_huge_pages;
> >
> >  	spin_lock(&hugetlb_lock);
> > -	count = max(count, resv_huge_pages);
> > +	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
> >  	try_to_free_low(count);
> >  	while (count < nr_huge_pages) {
> >  		struct page *page = dequeue_huge_page(NULL, 0);
>
> The real problem with this line is that "count" is too ambiguous. :)

I agree, count is almost always a bad variable name :)

> We could rewrite the original max() line this way:
>
> 	if (resv_huge_pages > nr_of_pages_to_end_up_with)
> 		nr_of_pages_to_end_up_with = resv_huge_pages;
> 	try_to_make_the_total_nr_of_hpages(nr_of_pages_to_end_up_with);
>
> Which makes it more clear that you're setting the number of total pages
> to the number of reserved pages, which is obviously screwy.
>
> OK, so this is actually saying: "count can never go below
> resv_huge_pages + nr_huge_pages"?

Not quite.  Count can never go below the number of reserved pages plus
pages allocated to MAP_PRIVATE mappings.  That number is computed by:
(resv + (total - free)).

> Could we change try_to_free_low() to free a distinct number of pages?
>
> 	if (count > free_huge_pages)
> 		count = free_huge_pages;
> 	try_to_free_nr_huge_pages(count);
>
> I feel a bit sketchy about the "resv_huge_pages + nr_huge_pages -
> free_huge_pages" logic.  Could you elaborate a bit there on what the
> rules are?

The key is that we don't want to shrink the pool below the number of pages
we are committed to keeping around.  Before this patch, we only accounted
for the pages we plan to hand out (reserved huge pages) but not the ones
we've already handed out (total - free).  Does that make sense?
-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
[PATCH] hugetlb: Fix pool resizing corner case
When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action                          Pool Counters (Total, Free, Resv)
======                          =================================
Set pool to 1 page              1 1 0
Map 1 page MAP_PRIVATE          1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages             3 2 0
Map 2 pages MAP_SHARED          3 2 2
Set pool to 2 pages             2 1 2  <-- Mistake, should be 3 2 2
Touch the 2 shared pages        2 0 1  <-- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.

Andrew, this is a standalone fix suitable for mainline.  It is also now
corrected in my latest dynamic pool resizing patchset which I will send
out soon.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 mm/hugetlb.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..7af3908 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+			if (count >= nr_huge_pages)
+				return;
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
 			update_and_free_page(page);
 			free_huge_pages--;
 			free_huge_pages_node[page_to_nid(page)]--;
-			if (count >= nr_huge_pages)
-				return;
 		}
 	}
 }
@@ -251,7 +251,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
 		return nr_huge_pages;

 	spin_lock(&hugetlb_lock);
-	count = max(count, resv_huge_pages);
+	count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
 	try_to_free_low(count);
 	while (count < nr_huge_pages) {
 		struct page *page = dequeue_huge_page(NULL, 0);
Re: [PATCH] Fix find_next_best_node (Re: [BUG] 2.6.23-rc3-mm1 Kernel panic - not syncing: DMA: Memory would be corrupted)
On Fri, 2007-08-24 at 15:53 +0900, Yasunori Goto wrote:
> I found find_next_best_node() was wrong.
> I confirmed boot up by the following patch.
> Mel-san, Kamalesh-san, could you try this?

FYI: This patch also allows the alloc-instantiate-race testcase in
libhugetlbfs to pass again :)

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
[PATCH] Fix VM_FAULT flags conversion for hugetlb
It seems a simple mistake was made when converting follow_hugetlb_page() over to the VM_FAULT flags bitmask stuff (commit 83c54070ee1a2d05c89793884bea1a03f2851ed4). By using the wrong bitmask, hugetlb_fault() failures are not being recognized. This results in an infinite loop whenever follow_hugetlb_page() is involved in a failed fault.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7ca59d..de4cf45 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -643,7 +643,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			spin_unlock(&mm->page_table_lock);
 			ret = hugetlb_fault(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
-			if (!(ret & VM_FAULT_MAJOR))
+			if (!(ret & VM_FAULT_ERROR))
 				continue;
 			remainder = 0;
--
Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Documentation] Page Table Layout diagrams
Hello all. In an effort to understand how the page tables are laid out across various architectures I put together some diagrams. I have posted them on the linux-mm wiki: http://linux-mm.org/PageTableStructure and I hope they will be useful to others. Just to make sure I am not spreading misinformation, could a few of you experts take a quick look at the three diagrams I've got finished so far and point out any errors I have made? Thanks. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] remove hugetlb_instantiation_mutex
On Mon, 2007-07-30 at 15:15 +0800, Zhang, Yanmin wrote:
> On Fri, 2007-07-27 at 11:37 -0500, Adam Litke wrote:
> > Hey... I am amazed at how quickly you came back with a patch for this :)
> > Thanks for looking at it. Unfortunately there is one show-stopper and I
> > have some reservations (pun definitely intended) with your approach:
> Thanks for your great comments. Sorry for such a long delay in responding.
> I have been pretty busy lately.
> > First, your patch does not pass the libhugetlbfs test
> > 'alloc-instantiate-race' which was written to tickle the race which
> > the mutex was introduced to solve. Your patch works for shared
> > mappings, but not for the private case.
> My testing about private might not be thorough. Function hugetlb_cow has a
> race for multi-thread to fault on the same private page index. But after I
> fixed it, alloc-instantiate-race still failed.
>
> I tried to google the source code tarball of libhugetlbfs test suite, but
> couldn't find it. Would you like to send me a copy of the test source codes?

http://libhugetlbfs.ozlabs.org/releases/libhugetlbfs-1.2-pre1.tar.gz

The tarball will contain a test called alloc-instantiate-race. Make sure to run it in private and shared mode. Let me know what you find out.

--
Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] remove hugetlb_instantiation_mutex
Hey... I am amazed at how quickly you came back with a patch for this :) Thanks for looking at it. Unfortunately there is one show-stopper and I have some reservations (pun definitely intended) with your approach:

First, your patch does not pass the libhugetlbfs test 'alloc-instantiate-race' which was written to tickle the race which the mutex was introduced to solve. Your patch works for shared mappings, but not for the private case.

Second, the introduction of another pair of global counters triggers my internal warning system... These global counters are known to cause problems with NUMA and cpusets. Have you considered these interactions?

Additionally, the commit/rollback logic you are using closely parallels what we already do with the huge page reservation mechanism. Is there any way you could integrate your stuff into the reservation system to avoid all the duplicated logic?

--
Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] spinlock in function hugetlb_fault could be deleted
Hello. hugetlb_instantiation_mutex is an extremely heavy-weight lock whose days are numbered (hopefully). It exists primarily to arbitrate a race condition where n (n > 1) threads of execution race to satisfy the same page fault for a process. Even though only one hugetlb page is needed, if (n) are not available, the application can receive a bogus VM_FAULT_OOM.

Anyway, the hugetlb_instantiation_mutex approach has few friends around here, so rather than making the code rely more heavily upon it, perhaps you could focus your efforts on helping us remove it.

On 7/23/07, Zhang, Yanmin <[EMAIL PROTECTED]> wrote:
> Function hugetlb_fault needn't hold spinlock mm->page_table_lock, because
> when hugetlb_fault is called:
> 1) mm->mmap_sem is held already;
> 2) hugetlb_instantiation_mutex is held by hugetlb_fault, which prevents
> other threads/processes from entering this critical area. It's impossible
> for other threads/processes to change the page table now.

--
Adam Litke ( agl at us.ibm.com ) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] hugetlb: use set_compound_page_dtor
On 7/13/07, Akinobu Mita <[EMAIL PROTECTED]> wrote: Use appropriate accessor function to set compound page destructor function. Cc: William Irwin <[EMAIL PROTECTED]> Signed-off-by: Akinobu Mita <[EMAIL PROTECTED]> Acked-by: Adam Litke <[EMAIL PROTECTED]> -- Adam Litke ( agl at us.ibm.com ) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [shm][hugetlb] Fix get_policy for stacked shared memory files
On 6/12/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote: Adam Litke <[EMAIL PROTECTED]> writes: > Here's another breakage as a result of shared memory stacked files :( > > The NUMA policy for a VMA is determined by checking the following (in the order > given): > > 1) vma->vm_ops->get_policy() (if defined) > 2) vma->vm_policy (if defined) > 3) task->mempolicy (if defined) > 4) Fall back to default_policy > > By switching to stacked files for shared memory, get_policy() is now always set > to shm_get_policy which is a wrapper function. This causes us to stop at step > 1, which yields NULL for hugetlb instead of task->mempolicy which was the > previous (and correct) result. > > This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the > wrapped vm_ops. Andi and Christoph, does this look right to you? I'm confused. I agree that the behavior you describe is correct. However I only see two code paths were get_policy is called and both of them take a NULL result and change it to task->mempolicy: The coffee hasn't quite absorbed yet, but don't those two code paths take a NULL result from get_policy() and turn it into default_policy, not task->mempolicy? 
From mm/mempolicy.c
> long do_get_mempolicy(int *policy, nodemask_t *nmask,
> 		unsigned long addr, unsigned long flags)
> {
> 	int err;
> 	struct mm_struct *mm = current->mm;
> 	struct vm_area_struct *vma = NULL;
> 	struct mempolicy *pol = current->mempolicy;
>
> 	cpuset_update_task_memory_state();
> 	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> 		return -EINVAL;
> 	if (flags & MPOL_F_ADDR) {
> 		down_read(&mm->mmap_sem);
> 		vma = find_vma_intersection(mm, addr, addr+1);
> 		if (!vma) {
> 			up_read(&mm->mmap_sem);
> 			return -EFAULT;
> 		}
> 		if (vma->vm_ops && vma->vm_ops->get_policy)
> 			pol = vma->vm_ops->get_policy(vma, addr);
> 		else
> 			pol = vma->vm_policy;
> 	} else if (addr)
> 		return -EINVAL;
>
> 	if (!pol)
> 		pol = &default_policy;

> /* Return effective policy for a VMA */
> static struct mempolicy * get_vma_policy(struct task_struct *task,
> 		struct vm_area_struct *vma, unsigned long addr)
> {
> 	struct mempolicy *pol = task->mempolicy;
>
> 	if (vma) {
> 		if (vma->vm_ops && vma->vm_ops->get_policy)
> 			pol = vma->vm_ops->get_policy(vma, addr);
> 		else if (vma->vm_policy &&
> 				vma->vm_policy->policy != MPOL_DEFAULT)
> 			pol = vma->vm_policy;
> 	}
> 	if (!pol)
> 		pol = &default_policy;
> 	return pol;
> }

Does this perhaps need to be:

> Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 4fefbad..8d2672d 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct *vma, unsigned long addr)
> +	pol = NULL;
>
> 	if (sfd->vm_ops->get_policy)
> 		pol = sfd->vm_ops->get_policy(vma, addr);
> -	else
> +	else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
> 		pol = vma->vm_policy;
> 	return pol;
> }
> #endif

afaict this would provide no way for pol to be set to task->mempolicy for hugetlb per my comment above.
-- Adam Litke ( agl at us.ibm.com ) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[shm][hugetlb] Fix get_policy for stacked shared memory files
Here's another breakage as a result of shared memory stacked files :(

The NUMA policy for a VMA is determined by checking the following (in the order given):

1) vma->vm_ops->get_policy() (if defined)
2) vma->vm_policy (if defined)
3) task->mempolicy (if defined)
4) Fall back to default_policy

By switching to stacked files for shared memory, get_policy() is now always set to shm_get_policy which is a wrapper function. This causes us to stop at step 1, which yields NULL for hugetlb instead of task->mempolicy which was the previous (and correct) result.

This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the wrapped vm_ops. Andi and Christoph, does this look right to you?

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>

diff --git a/ipc/shm.c b/ipc/shm.c
index 4fefbad..8d2672d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct *vma, unsigned long addr)
 	if (sfd->vm_ops->get_policy)
 		pol = sfd->vm_ops->get_policy(vma, addr);
-	else
+	else if (vma->vm_policy)
 		pol = vma->vm_policy;
+	else
+		pol = current->mempolicy;
 	return pol;
 }
 #endif

--
Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] shm: Fix the filename of hugetlb sysv shared memory
On 6/8/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote: -struct file *hugetlb_zero_setup(size_t size) +struct file *hugetlb_file_setup(const char *name, size_t size) The bulk of this patch seems to handle renaming this function. Is that really necessary? -- Adam Litke ( agl at us.ibm.com ) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21 numa policy and huge pages not working
On Sat, 2007-06-09 at 21:10 -0700, dean gaudet wrote: > On Tue, 15 May 2007, William Lee Irwin III wrote: > > > On Tue, May 15, 2007 at 10:41:06PM -0700, dean gaudet wrote: > > > prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB > > > and > > > the interleave policy would be respected. as of 2.6.21 it doesn't seem > > > to > > > respect the policy on SHM_HUGETLB request. > > > see test program below. > > > output from pre-2.6.21: > > > 2ab19620 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 > > > N2=8 N3=8 > > > 2ab19a20 default file=/SYSV\040(deleted) dirty=16384 active=0 > > > N0=4096 N1=4096 N2=4096 N3=4096 > > > output from 2.6.21: > > > 2b49b1c0 default file=/10\040(deleted) huge dirty=32 N3=32 > > > 2b49b5c0 default file=/SYSV\040(deleted) dirty=16384 active=0 > > > N0=4096 N1=4096 N2=4096 N3=4096 > > > was this an intentional behaviour change? it seems to be only affecting > > > SHM_HUGETLB allocations. (i haven't tested hugetlbfs yet.) > > > run with "numactl --interleave=all ./shmtest" > > > > This was not intentional. I'll search for where it broke. > > ok i've narrowed it some... maybe. Thanks a lot for the detailed information. I am on it. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
On 4/19/07, Christoph Lameter <[EMAIL PROTECTED]> wrote: @@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi unsigned from, unsigned to) { if (!PageUptodate(page)) { - if (to - from != PAGE_CACHE_SIZE) { + if (to - from != page_cache_size(file->f_mapping)) { Where do you introduce page_cache_size()? Is this added by a different set of patches I should have applied first? -- Adam Litke ( agl at us.ibm.com ) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/2] hugetlb: add resv argument to hugetlb_file_setup
On Fri, 2007-03-23 at 15:42 -0700, Ken Chen wrote: > rename hugetlb_zero_setup() to hugetlb_file_setup() to better match > function name convention like shmem implementation. Also add an > argument to the function to indicate whether file setup should reserve > hugetlb page upfront or not. > > Signed-off-by: Ken Chen <[EMAIL PROTECTED]> This patch doesn't really look bad at all, but... I am worried that what might seem nice and clean right now will slowly get worse. This implements an interface on top of another interface (char device on top of a filesystem). What is the next hugetlbfs function that will need a boolean switch to handle a character device special case? Am I just worrying too much here? Although my pagetable_operations patches aren't the most popular right now, they do have at least one advantage IMO: they enable side-by-side implementation of the interfaces as opposed to stacking them. Keeping them separate removes the need for if ((vm_flags & VM_HUGETLB) && (is_hugetlbfs_chardev())) checking. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: pagetable_ops: Hugetlb character device example
On Wed, 2007-03-21 at 15:51 -0400, [EMAIL PROTECTED] wrote: > On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said: > > The main reason I am advocating a set of pagetable_operations is to > > enable the development of a new hugetlb interface. > > Do you have an exit strategy for the *old* interface? Not really. Hugetlbfs needs to be kept around for a number of reasons. It was designed to support MAP_SHARED mappings and IPC shm segments. It is probably still the best interface for those jobs. Of course hugetlbfs has lots of users so we must preserve the interface for them. But... once hugetlbfs is abstracted behind pagetable_operations, you would have the option of configuring it out of the kernel without losing access to huge pages by other means (such as the character device). -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
pagetable_ops: Hugetlb character device example
The main reason I am advocating a set of pagetable_operations is to enable the development of a new hugetlb interface. During the hugetlb BOFS at OLS last year, we talked about a character device that would behave like /dev/zero. Many of the people were talking about how they just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss about the hugetlbfs filesystem. /dev/zero is a familiar interface for getting anonymous memory so bringing that model to huge pages would make programming for anonymous huge pages easier. The pagetable_operations API opens up possibilities to do some additional (and completely sane) things. For example, I have a patch that alters the character device code below to make use of a hugetlb ZERO_PAGE. This eliminates almost all the up-front fault time, allowing pages to be COW'ed only when first written to. We cannot do things like this with hugetlbfs anymore because we have a set of complex semantics to preserve. The following patch is an example of what a simple pagetable_operations consumer could look like. It does depend on some other cleanups I am working on (removal of is_file_hugepages(), ...hugetlbfs/inode.c vs. mm/hugetlb.c separation, etc). So it is unlikely to apply to any trees you may have. I do think it makes a useful illustration of what legitimate things can be done with a pagetable_operations interface. 
commit be72df1c616fb662693a8d4410ce3058f20c71f3 Author: Adam Litke <[EMAIL PROTECTED]> Date: Tue Feb 13 14:18:21 2007 -0800 diff --git a/drivers/char/Makefile b/drivers/char/Makefile index fc11063..c5e755b 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -100,6 +100,7 @@ obj-$(CONFIG_IPMI_HANDLER) += ipmi/ obj-$(CONFIG_HANGCHECK_TIMER) += hangcheck-timer.o obj-$(CONFIG_TCG_TPM) += tpm/ +obj-$(CONFIG_HUGETLB_PAGE) += page.o # Files generated that shall be removed upon make clean clean-files := consolemap_deftbl.c defkeymap.c diff --git a/drivers/char/page.c b/drivers/char/page.c new file mode 100644 index 000..e903028 --- /dev/null +++ b/drivers/char/page.c @@ -0,0 +1,133 @@ +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/init.h> +#include <linux/device.h> +#include <linux/fs.h> +#include <linux/pagemap.h> +#include <linux/hugetlb.h> + +static const struct { + unsigned int minor; + char *name; + umode_t mode; +} devlist[] = { + {1, "page-huge", S_IRUGO | S_IWUGO}, +}; + +static struct page *page_nopage(struct vm_area_struct *vma, + unsigned long address, int *unused) +{ + BUG(); + return NULL; +} + +static struct vm_operations_struct page_vm_ops = { + .nopage = page_nopage, +}; + +static int page_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int write_access) +{ + pte_t *ptep; + pte_t entry, new_entry; + int ret; + static DEFINE_MUTEX(hugetlb_instantiation_mutex); + + ptep = huge_pte_alloc(mm, address); + if (!ptep) + return VM_FAULT_OOM; + + mutex_lock(&hugetlb_instantiation_mutex); + entry = *ptep; + if (pte_none(entry)) { + struct page *page; + + page = alloc_huge_page(vma, address); + if (!page) + return VM_FAULT_OOM; + clear_huge_page(page, address); + + ret = VM_FAULT_MINOR; + spin_lock(&mm->page_table_lock); + if (!pte_none(*ptep)) + goto out; + add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE); + new_entry = make_huge_pte(vma, page, 0); + set_huge_pte_at(mm, address, ptep, new_entry); + goto out; + } + + spin_lock(&mm->page_table_lock); + /* Check for a racing update before calling hugetlb_cow */ + if (likely(pte_same(entry, *ptep))) + if (write_access && !pte_write(entry)) + ret = hugetlb_cow(mm, vma, address, ptep, entry); + +out: + spin_unlock(&mm->page_table_lock); + mutex_unlock(&hugetlb_instantiation_mutex); + return ret; +} + + +static struct pagetable_operations_struct page_pagetable_ops = { + .copy_vma = copy_hugetlb_page_range, + .pin_pages = follow_hugetlb_page, + .unmap_page_range = unmap_hugepage_range, + .change_protection = hugetlb_change_protection, + .free_pgtable_range = hugetlb_free_pgd_range, + .fault = page_fault, +}; + +static int page_mmap(struct file * file, struct vm_area_struct *vma) +{ + if (vma->vm_flags & VM_SHARED) + return -EINVAL; + + if (vma->vm_pgoff) + return -EINVAL; + + if (vma->vm_start & ~HPAGE_MASK) + return -EINVAL; + + if (vma->vm_end & ~HPAGE_MASK) + return -EINVAL; + + if (vma->vm_end - vma->vm_start < HPAGE_SIZE) + return -EINVAL;
Re: [PATCH] NOMMU: Supply get_unmapped_area() to fix NOMMU SYSV SHM
On Wed, 2007-03-21 at 17:44 +0000, David Howells wrote: > From: David Howells <[EMAIL PROTECTED]> > > Supply a get_unmapped_area() to fix NOMMU SYSV SHM support. > > Signed-Off-By: David Howells <[EMAIL PROTECTED]> Hmm, yes... So it would seem my fix for the shm stacked files stuff was not quite complete. Acked-by: Adam Litke <[EMAIL PROTECTED]> -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center
Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
On Wed, 2007-03-21 at 15:18 +1100, Nick Piggin wrote: > Adam Litke wrote: > > Signed-off-by: Adam Litke <[EMAIL PROTECTED]> > > --- > > > > include/linux/mm.h | 25 + > > 1 files changed, 25 insertions(+), 0 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 60e0e4a..7089323 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -98,6 +98,7 @@ struct vm_area_struct { > > > > /* Function pointers to deal with this struct. */ > > struct vm_operations_struct * vm_ops; > > + const struct pagetable_operations_struct * pagetable_ops; > > > > /* Information about our backing store: */ > > unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE > > Can you remind me why this isn't in vm_ops? We didn't want to bloat the size of the vm_ops struct for all of its users. > Also, it is going to be hugepage-only, isn't it? So should the naming be > changed to reflect that? And #ifdef it... They are doing some interesting things on Cell that could take advantage of this. > > @@ -218,6 +219,30 @@ struct vm_operations_struct { > > }; > > > > struct mmu_gather; > > + > > +struct pagetable_operations_struct { > > + int (*fault)(struct mm_struct *mm, > > + struct vm_area_struct *vma, > > + unsigned long address, int write_access); > > I got dibs on fault ;) > > My callback is a sanitised one that basically abstracts the details of the > virtual memory mapping away, so it is usable by drivers and filesystems. > > You actually want to bypass the normal fault handling because it doesn't > know how to deal with your virtual memory mapping. Hmm, the best suggestion > I can come up with is handle_mm_fault... unless you can think of a better > name for me to use. How about I use handle_pte_fault? 
-- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center
Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
On Tue, 2007-03-20 at 16:24 -0700, Dave Hansen wrote: > On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote: > > > > +#define has_pt_op(vma, op) \ > > + ((vma)->pagetable_ops && (vma)->pagetable_ops->op) > > +#define pt_op(vma, call) \ > > + ((vma)->pagetable_ops->call) > > Can you get rid of these macros? I think they make it a wee bit harder > to read. My brain doesn't properly parse the foo(arg)(bar) syntax. > > + if (has_pt_op(vma, copy_vma)) > + return pt_op(vma, copy_vma)(dst_mm, src_mm, vma); > > + if (vma->pagetable_ops && vma->pagetable_ops->copy_vma) > + return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma); > > I guess it does lead to some longish lines. Does it start looking really nasty? Yeah, it starts to look pretty bad. Some of these calls are in code that is already indented several times. > If you're going to have them, it might just be best to put a single > unlikely() around the macro definitions themselves to keep anybody from > having to open-code it for any of the users. It should be pretty easy to wrap has_pt_op() with an unlikely(). Good suggestion. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center
[PATCH 4/7] unmap_page_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c|3 ++- include/linux/hugetlb.h |4 ++-- mm/hugetlb.c| 12 mm/memory.c | 10 -- 4 files changed, 16 insertions(+), 13 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index d0b4b46..198efa7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff) v_offset = 0; __unmap_hugepage_range(vma, - vma->vm_start + v_offset, vma->vm_end); + vma->vm_start + v_offset, vma->vm_end, 0); } } @@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = { static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, .pin_pages = follow_hugetlb_page, + .unmap_page_range = unmap_hugepage_range, }; static const struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3f3e7a6..502c2f8 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma) int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int); -void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long); -void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long); +unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *); +void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *); int hugetlb_prefault(struct address_space *, struct vm_area_struct *); int hugetlb_report_meminfo(char *); int hugetlb_report_node_meminfo(int, char *); diff 
--git a/mm/hugetlb.c b/mm/hugetlb.c index 36db012..d902fb9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -356,7 +356,7 @@ nomem: } void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end) + unsigned long end, long *zap_work) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, list_del(&page->lru); put_page(page); } + + if (zap_work) + *zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE); } -void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end) +unsigned long unmap_hugepage_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, long *zap_work) { /* * It is undesirable to test vma->vm_file as it should be non-null @@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, */ if (vma->vm_file) { spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); - __unmap_hugepage_range(vma, start, end); + __unmap_hugepage_range(vma, start, end, zap_work); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); } + return end; } static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, diff --git a/mm/memory.c b/mm/memory.c index 01256cf..a3bcaf3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp, tlb_start_valid = 1; } - if (unlikely(is_vm_hugetlb_page(vma))) { - unmap_hugepage_range(vma, start, end); - zap_work -= (end - start) / - (HPAGE_SIZE / PAGE_SIZE); - start = end; - } else + if (unlikely(has_pt_op(vma, unmap_page_range))) + start = pt_op(vma, unmap_page_range) + (vma, start, end, &zap_work); + else start = unmap_page_range(*tlbp, vma, start, end, &zap_work, details);
[PATCH 6/7] free_pgtable_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/memory.c |6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 3de5d93..823a9e3 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -570,6 +570,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .pin_pages = follow_hugetlb_page, .unmap_page_range = unmap_hugepage_range, .change_protection = hugetlb_change_protection, + .free_pgtable_range = hugetlb_free_pgd_range, }; static const struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/memory.c b/mm/memory.c index a3bcaf3..d2f28e7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, anon_vma_unlink(vma); unlink_file_vma(vma); - if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + if (has_pt_op(vma, free_pgtable_range)) { + pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* * Optimization: gather nearby vmas into one call down */ while (next && next->vm_start <= vma->vm_end + PMD_SIZE - && !is_vm_hugetlb_page(next)) { + && !has_pt_op(next, free_pgtable_range)) { vma = next; next = vma->vm_next; anon_vma_unlink(vma);
[PATCH 7/7] hugetlbfs fault handler
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/hugetlb.c |4 +++- mm/memory.c |4 ++-- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 823a9e3..29e65c2 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -571,6 +571,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .unmap_page_range = unmap_hugepage_range, .change_protection = hugetlb_change_protection, .free_pgtable_range = hugetlb_free_pgd_range, + .fault = hugetlb_fault, }; static const struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d902fb9..e0f0607 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -590,6 +590,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long vaddr = *position; int remainder = *length; + BUG_ON(!has_pt_op(vma, fault)); + spin_lock(&mm->page_table_lock); while (vaddr < vma->vm_end && remainder) { pte_t *pte; @@ -606,7 +608,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, int ret; spin_unlock(&mm->page_table_lock); - ret = hugetlb_fault(mm, vma, vaddr, 0); + ret = pt_op(vma, fault)(mm, vma, vaddr, 0); spin_lock(&mm->page_table_lock); if (ret == VM_FAULT_MINOR) continue; diff --git a/mm/memory.c b/mm/memory.c index d2f28e7..bd7c243 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2499,8 +2499,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(PGFAULT); - if (unlikely(is_vm_hugetlb_page(vma))) - return hugetlb_fault(mm, vma, address, write_access); + if (unlikely(has_pt_op(vma, fault))) + return pt_op(vma, fault)(mm, vma, address, write_access); pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address);
[PATCH 3/7] pin_pages for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/memory.c |6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 2452dde..d0b4b46 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -567,6 +567,7 @@ const struct file_operations hugetlbfs_file_operations = { static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, + .pin_pages = follow_hugetlb_page, }; static const struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/memory.c b/mm/memory.c index 69bb0b3..01256cf 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, || !(vm_flags & vma->vm_flags)) return i ? : -EFAULT; - if (is_vm_hugetlb_page(vma)) { - i = follow_hugetlb_page(mm, vma, pages, vmas, - &start, &len, i); + if (has_pt_op(vma, pin_pages)) { + i = pt_op(vma, pin_pages)(mm, vma, pages, + vmas, &start, &len, i); continue; }
[PATCH 5/7] change_protection for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/mprotect.c |5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 198efa7..3de5d93 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -569,6 +569,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, .pin_pages = follow_hugetlb_page, .unmap_page_range = unmap_hugepage_range, + .change_protection = hugetlb_change_protection, }; static const struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/mprotect.c b/mm/mprotect.c index 3b8f3c0..172e204 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -201,8 +201,9 @@ success: dirty_accountable = 1; } - if (is_vm_hugetlb_page(vma)) - hugetlb_change_protection(vma, start, end, vma->vm_page_prot); + if (has_pt_op(vma, change_protection)) + pt_op(vma, change_protection)(vma, start, end, + vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
[PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
Andrew, given the favorable review of these patches the last time around, would you consider them for the -mm tree? Does anyone else have any objections? The page tables for hugetlb mappings are handled differently than page tables for normal pages. Rather than integrating multiple page size support into the core VM (which would tremendously complicate the code) some hooks were created. This allows hugetlb special cases to be handled "out of line" by a separate interface. Hugetlbfs was the huge page interface chosen. At the time, large database users were the only big users of huge pages and the hugetlbfs design meets their needs pretty well. Over time, hugetlbfs has been expanded to enable new uses of huge page memory with varied results. As features are added, the semantics become a permanent part of the Linux API. This makes maintenance of hugetlbfs an increasingly difficult task and inhibits the addition of features and functionality in support of ever-changing hardware. To remedy the situation, I propose an API (currently called pagetable_operations). All of the current hugetlbfs-specific hooks are moved into an operations struct that is attached to VMAs. The end result is a more explicit and IMO a cleaner interface between hugetlbfs and the core VM. We are then free to add other hugetlb interfaces (such as a /dev/zero-styled character device) that can operate either in concert with or independent of hugetlbfs. There should be no measurable performance impact for normal page users (we're checking if pagetable_ops != NULL instead of checking for vm_flags & VM_HUGETLB). Of course we do increase the VMA size by one pointer. For huge pages, there is an added indirection for pt_op() calls. This patch series does not change the logic of the hugetlbfs operations, just moves them into the pagetable_operations struct. I did some pretty basic benchmarking of these patches on ppc64, x86, and x86_64 to get a feel for the fast-path performance impact. 
The following tables show kernbench performance comparisons between a clean 2.6.20 kernel and one with my patches applied. These numbers seem well within statistical noise to me. Changes since V1: - Made hugetlbfs_pagetable_ops const (Thanks Arjan)

KernBench Comparison (ppc64)
----------------------------
                  2.6.20-clean  2.6.20-pgtable_ops  pct. diff
User CPU time           708.82              708.59       0.03
System CPU time          62.50               62.58      -0.13
Total CPU time          771.32              771.17       0.02
Elapsed time            115.40              115.35       0.04

KernBench Comparison (x86)
--------------------------
                  2.6.20-clean  2.6.20-pgtable_ops  pct. diff
User CPU time          1382.62             1381.88       0.05
System CPU time         146.06              146.86      -0.55
Total CPU time         1528.68             1528.74      -0.00
Elapsed time            394.92              396.70      -0.45

KernBench Comparison (x86_64)
-----------------------------
                  2.6.20-clean  2.6.20-pgtable_ops  pct. diff
User CPU time           559.39              557.97       0.25
System CPU time          65.10               66.17      -1.64
Total CPU time          624.49              624.14       0.06
Elapsed time            158.54              158.59      -0.03

The lack of a performance impact makes sense to me. The following is a simplified instruction comparison for each case:

2.6.20-clean                        2.6.20-pgtable_ops
------------                        ------------------
/* Load vm_flags */                 /* Load pagetable_ops pointer */
mov 0x18(ecx),eax                   mov 0x48(ecx),eax
/* Test for VM_HUGETLB */           /* Test if it's NULL */
test $0x40,eax                      test eax,eax
/* If set, jump to call stub */     /* If so, jump away to main code */
jne c0148f04                        je c0148ba1
...                                 /* Lookup the operation's function pointer */
                                    mov 0x4(eax),ebx
/* copy_hugetlb_page_range call */  /* Test if it's NULL */
c0148f04:                           test ebx,ebx
mov 0xff98(ebp),ecx                 /* If so, jump away to main code */
mov 0xff9c(ebp),edx                 je c0148ba1
mov 0xffa0(ebp),eax                 /* pagetable operation call */
call c01536e0                       mov 0xff9c(ebp),edx
                                    mov 0xffa0(ebp),eax
                                    call *ebx

For the common case (vma->pagetable_ops == NULL), we do almost the same thing as the current code: load and test. The third instruction is different in that we jump for the common case instead of jumping in the hugetlb case. I
[PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 include/linux/mm.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60e0e4a..7089323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,6 +98,7 @@ struct vm_area_struct {
 	/* Function pointers to deal with this struct. */
 	struct vm_operations_struct * vm_ops;
+	const struct pagetable_operations_struct * pagetable_ops;
 
 	/* Information about our backing store: */
 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
@@ -218,6 +219,30 @@ struct vm_operations_struct {
 };
 
 struct mmu_gather;
+
+struct pagetable_operations_struct {
+	int (*fault)(struct mm_struct *mm,
+		struct vm_area_struct *vma,
+		unsigned long address, int write_access);
+	int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
+		struct vm_area_struct *vma);
+	int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
+		struct page **pages, struct vm_area_struct **vmas,
+		unsigned long *position, int *length, int i);
+	void (*change_protection)(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot);
+	unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, long *zap_work);
+	void (*free_pgtable_range)(struct mmu_gather **tlb,
+		unsigned long addr, unsigned long end,
+		unsigned long floor, unsigned long ceiling);
+};
+
+#define has_pt_op(vma, op) \
+	((vma)->pagetable_ops && (vma)->pagetable_ops->op)
+#define pt_op(vma, call) \
+	((vma)->pagetable_ops->call)
+
 struct inode;
 
 #define page_private(page)	((page)->private)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
[PATCH 2/7] copy_vma for hugetlbfs
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    6 ++++++
 mm/memory.c          |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8c718a3..2452dde 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@ static const struct super_operations hugetlbfs_ops;
 static const struct address_space_operations hugetlbfs_aops;
 const struct file_operations hugetlbfs_file_operations;
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops;
 static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
@@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 */
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
+	vma->pagetable_ops = &hugetlbfs_pagetable_ops;
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -563,6 +565,10 @@ const struct file_operations hugetlbfs_file_operations = {
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 };
 
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
+	.copy_vma	= copy_hugetlb_page_range,
+};
+
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
 	.create		= hugetlbfs_create,
 	.lookup		= simple_lookup,
diff --git a/mm/memory.c b/mm/memory.c
index e7066e7..69bb0b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (has_pt_op(vma, copy_vma))
+		return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
 
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
[PATCH 3/7] pin_pages for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2452dde..d0b4b46 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma	= copy_hugetlb_page_range,
+	.pin_pages	= follow_hugetlb_page,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index 69bb0b3..01256cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				|| !(vm_flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						start, len, i);
+		if (has_pt_op(vma, pin_pages)) {
+			i = pt_op(vma, pin_pages)(mm, vma, pages,
+						vmas, start, len, i);
 			continue;
 		}
[PATCH 5/7] change_protection for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/mprotect.c        |    5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 198efa7..3de5d93 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -569,6 +569,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma	= copy_hugetlb_page_range,
 	.pin_pages	= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
 		dirty_accountable = 1;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+	if (has_pt_op(vma, change_protection))
+		pt_op(vma, change_protection)(vma, start, end,
+				vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot,
				dirty_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
[PATCH 4/7] unmap_page_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c    |    3 ++-
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   12 ++++++++----
 mm/memory.c             |   10 ++++------
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d0b4b46..198efa7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 			v_offset = 0;
 
 		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end);
+				vma->vm_start + v_offset, vma->vm_end, 0);
 	}
 }
@@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma	= copy_hugetlb_page_range,
 	.pin_pages	= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3f3e7a6..502c2f8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36db012..d902fb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end)
+			unsigned long end, long *zap_work)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		list_del(&page->lru);
 		put_page(page);
 	}
+
+	if (zap_work)
+		*zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, long *zap_work)
 {
 	/*
 	 * It is undesirable to test vma->vm_file as it should be non-null
@@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	 */
 	if (vma->vm_file) {
 		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-		__unmap_hugepage_range(vma, start, end);
+		__unmap_hugepage_range(vma, start, end, zap_work);
 		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 	}
+	return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 01256cf..a3bcaf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 			tlb_start_valid = 1;
 		}
 
-		if (unlikely(is_vm_hugetlb_page(vma))) {
-			unmap_hugepage_range(vma, start, end);
-			zap_work -= (end - start) /
-					(HPAGE_SIZE / PAGE_SIZE);
-			start = end;
-		} else
+		if (unlikely(has_pt_op(vma, unmap_page_range)))
+			start = pt_op(vma, unmap_page_range)
+					(vma, start, end, &zap_work);
+		else
 			start = unmap_page_range(*tlbp, vma, start, end,
						&zap_work, details);
[PATCH 6/7] free_pgtable_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3de5d93..823a9e3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -570,6 +570,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.pin_pages	= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index a3bcaf3..d2f28e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
 		anon_vma_unlink(vma);
 		unlink_file_vma(vma);
 
-		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+		if (has_pt_op(vma, free_pgtable_range)) {
+			pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !has_pt_op(next, free_pgtable_range)) {
 				vma = next;
 				next = vma->vm_next;
 				anon_vma_unlink(vma);
[PATCH 7/7] hugetlbfs fault handler
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/hugetlb.c         |    4 +++-
 mm/memory.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 823a9e3..29e65c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -571,6 +571,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
 	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault		= hugetlb_fault,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d902fb9..e0f0607 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	int remainder = *length;
 
+	BUG_ON(!has_pt_op(vma, fault));
+
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -606,7 +608,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
 			if (ret == VM_FAULT_MINOR)
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index d2f28e7..bd7c243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2499,8 +2499,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(has_pt_op(vma, fault)))
+		return pt_op(vma, fault)(mm, vma, address, write_access);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
Re: [PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments
On Wed, 2007-03-07 at 16:03 -0700, Eric W. Biederman wrote:
> Bill Irwin <[EMAIL PROTECTED]> writes:
> > On Thu, Mar 01, 2007 at 03:46:08PM -0800, Adam Litke wrote:
> >>  static inline int is_file_hugepages(struct file *file)
> >>  {
> >> -	return file->f_op == &hugetlbfs_file_operations;
> >> +	if (file->f_op == &hugetlbfs_file_operations)
> >> +		return 1;
> >> +	if (is_file_shm_hugepages(file))
> >> +		return 1;
> >> +
> >> +	return 0;
> >>  }
> > ...
> >> +int is_file_shm_hugepages(struct file *file)
> >> +{
> >> +	int ret = 0;
> >> +
> >> +	if (file->f_op == &shm_file_operations) {
> >> +		struct shm_file_data *sfd;
> >> +		sfd = shm_file_data(file);
> >> +		ret = is_file_hugepages(sfd->file);
> >> +	}
> >> +	return ret;
> >
> > A comment to prepare others for the impending doubletake might be nice.
> > Or maybe just open-coding the equality check for &hugetlbfs_file_operations
> > in is_file_shm_hugepages() if others find it as jarring as I. Please
> > extend my ack to any follow-up fiddling with that.
>
> You did notice we are testing a different struct file?
>
> > The patch addresses relatively straightforward issues and naturally at
> > that.
>
> The whole concept is recursive so I'm not certain being a recursive check
> is that bad but I understand the point.
>
> I think the right answer is most likely to add an extra file method or
> two so we can remove the need for is_file_hugepages.
>
> There are still 4 calls to is_file_hugepages in ipc/shm.c and
> 2 calls in mm/mmap.c not counting the one in is_file_shm_hugepages.
>
> The special cases make it difficult to properly wrap hugetlbfs files
> with another file, which is why we have the weird special case above. :)

Enter my remove-is_file_hugepages() patches (which I posted a few weeks ago).
I'll rework them and repost soon. That should help to make all of this cleaner.
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
Re: Kernel Oops with shm namespace cleanups
On Thu, 2007-03-01 at 16:08 -0800, Bill Irwin wrote:
> On Wed, Feb 28, 2007 at 02:13:29PM -0600, Adam Litke wrote:
> > Hey. While testing 2.6.21-rc2 with libhugetlbfs, the shm-fork test case
> > causes the kernel to oops. To reproduce: Execute 'make check' in the
> > latest libhugetlbfs source on a 2.6.21-rc2 kernel with 100 huge pages
> > allocated. Using fewer huge pages will likely also trigger the oops.
> > Libhugetlbfs can be downloaded from:
> > http://libhugetlbfs.ozlabs.org/snapshots/libhugetlbfs-dev-20070228.tar.gz
>
> Looks like I should grab these testcases for the sake of due diligence
> (not to say I intend to alter maintenance style from primarily review,
> approval, and bugfixing, not that I've been doing as much of any of those
> as I should). To which architectures and/or distributions have the
> userspace bits been ported, or otherwise run/tested on? A quick sniff
> test on an Altix suggests SLES and/or ia64 may trip up the scripts:

Right now we support x86, powerpc, and x86_64. Segment remapping and hugetlb
malloc won't work on ia64 until long format vhpt is supported (I suspect).
But the test framework should be adaptable to other architectures.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
[PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments
This patch provides the following hugetlb-related fixes to the recent stacked
shm files changes:
 - Update is_file_hugepages() so it will recognize hugetlb shm segments.
 - get_unmapped_area must be called with the nested file struct to handle the
   sfd->file->f_op->get_unmapped_area == NULL case.
 - The fsync f_op must be wrapped since it is specified in the hugetlbfs f_ops.

This is based on proposed fixes from Eric Biederman that were debugged and
tested by me. Without it, attempting to use hugetlb shared memory segments on
powerpc (and likely ia64) will kill your box. Please Apply.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 include/linux/hugetlb.h |    8 +++++++-
 include/linux/shm.h     |    5 +++++
 ipc/shm.c               |   32 ++++++++++++++++++++++++++------
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a60995a..3f3e7a6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include <linux/mempolicy.h>
+#include <linux/shm.h>
 #include <asm/tlbflush.h>
 
 struct ctl_table;
@@ -168,7 +169,12 @@ void hugetlb_put_quota(struct address_space *mapping);
 
 static inline int is_file_hugepages(struct file *file)
 {
-	return file->f_op == &hugetlbfs_file_operations;
+	if (file->f_op == &hugetlbfs_file_operations)
+		return 1;
+	if (is_file_shm_hugepages(file))
+		return 1;
+
+	return 0;
 }
 
 static inline void set_file_hugepages(struct file *file)
diff --git a/include/linux/shm.h b/include/linux/shm.h
index a2c896a..ad2e3af 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -96,12 +96,17 @@ struct shmid_kernel /* private to the kernel */
 #ifdef CONFIG_SYSVIPC
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr);
+extern int is_file_shm_hugepages(struct file *file);
 #else
 static inline long do_shmat(int shmid, char __user *shmaddr, int shmflg,
					unsigned long *addr)
 {
 	return -ENOSYS;
 }
+static inline int is_file_shm_hugepages(struct file *file)
+{
+	return 0;
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/ipc/shm.c b/ipc/shm.c
index eb57e22..abf864d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -285,21 +285,41 @@ static int shm_release(struct inode *ino, struct file *file)
 	return 0;
 }
 
-#ifndef CONFIG_MMU
+static int shm_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+	int (*fsync) (struct file *, struct dentry *, int datasync);
+	struct shm_file_data *sfd = shm_file_data(file);
+	int ret = -EINVAL;
+
+	fsync = sfd->file->f_op->fsync;
+	if (fsync)
+		ret = fsync(sfd->file, sfd->file->f_path.dentry, datasync);
+	return ret;
+}
+
 static unsigned long shm_get_unmapped_area(struct file *file,
	unsigned long addr, unsigned long len, unsigned long pgoff,
	unsigned long flags)
 {
 	struct shm_file_data *sfd = shm_file_data(file);
-	return sfd->file->f_op->get_unmapped_area(sfd->file, addr, len, pgoff,
-						flags);
+	return get_unmapped_area(sfd->file, addr, len, pgoff, flags);
+}
+
+int is_file_shm_hugepages(struct file *file)
+{
+	int ret = 0;
+
+	if (file->f_op == &shm_file_operations) {
+		struct shm_file_data *sfd;
+		sfd = shm_file_data(file);
+		ret = is_file_hugepages(sfd->file);
+	}
+	return ret;
 }
-#else
-#define shm_get_unmapped_area	NULL
-#endif
 
 static const struct file_operations shm_file_operations = {
 	.mmap		= shm_mmap,
+	.fsync		= shm_fsync,
 	.release	= shm_release,
 	.get_unmapped_area	= shm_get_unmapped_area,
 };
[PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments
Kernel Oops with shm namespace cleanups
[C00779AD7630] [C005C68C] .run_local_timers+0x1c/0x30 [C00779AD76B0] [C0021358] .timer_interrupt+0x80/0x498 [C00779AD7790] [C0003580] decrementer_common+0x100/0x180 --- Exception: 901 at ._spin_lock+0x30/0x44 LR = .vma_link+0x6c/0x1d4 [C00779AD7A80] [C05C8C08] 0xc05c8c08 (unreliable) [C00779AD7B30] [C0098D18] .do_mmap_pgoff+0x650/0x818 [C00779AD7C40] [C028F204] .do_shmat+0x304/0x454 [C00779AD7D30] [C0289660] .compat_sys_shmat+0x34/0x94 [C00779AD7DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8 [C00779AD7E30] [C000872C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#2! Call Trace: [C00779B934C0] [C000F588] .show_stack+0x68/0x1b4 (unreliable) [C00779B93570] [C007C5E0] .softlockup_tick+0xec/0x140 [C00779B93630] [C005C68C] .run_local_timers+0x1c/0x30 [C00779B936B0] [C0021358] .timer_interrupt+0x80/0x498 [C00779B93790] [C0003580] decrementer_common+0x100/0x180 --- Exception: 901 at ._spin_lock+0x2c/0x44 LR = .vma_link+0x6c/0x1d4 [C00779B93A80] [C05C8C08] 0xc05c8c08 (unreliable) [C00779B93B30] [C0098D18] .do_mmap_pgoff+0x650/0x818 [C00779B93C40] [C028F204] .do_shmat+0x304/0x454 [C00779B93D30] [C0289660] .compat_sys_shmat+0x34/0x94 [C00779B93DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8 [C00779B93E30] [C000872C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#3! Call Trace: [C00779AD34C0] [C000F588] .show_stack+0x68/0x1b4 (unreliable) [C00779AD3570] [C007C5E0] .softlockup_tick+0xec/0x140 [C00779AD3630] [C005C68C] .run_local_timers+0x1c/0x30 [C00779AD36B0] [C0021358] .timer_interrupt+0x80/0x498 [C00779AD3790] [C0003580] decrementer_common+0x100/0x180 --- Exception: 901 at ._spin_lock+0x2c/0x44 LR = .vma_link+0x6c/0x1d4 [C00779AD3A80] [C05C8C08] 0xc05c8c08 (unreliable) [C00779AD3B30] [C0098D18] .do_mmap_pgoff+0x650/0x818 [C00779AD3C40] [C028F204] .do_shmat+0x304/0x454 [C00779AD3D30] [C0289660] .compat_sys_shmat+0x34/0x94 [C00779AD3DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8 [C00779AD3E30] [C000872C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#1! 
Call Trace: [C00779BB34C0] [C000F588] .show_stack+0x68/0x1b4 (unreliable) [C00779BB3570] [C007C5E0] .softlockup_tick+0xec/0x140 [C00779BB3630] [C005C68C] .run_local_timers+0x1c/0x30 [C00779BB36B0] [C0021358] .timer_interrupt+0x80/0x498 [C00779BB3790] [C0003580] decrementer_common+0x100/0x180 --- Exception: 901 at ._spin_lock+0x2c/0x44 LR = .vma_link+0x6c/0x1d4 [C00779BB3A80] [C05C8C08] 0xc05c8c08 (unreliable) [C00779BB3B30] [C0098D18] .do_mmap_pgoff+0x650/0x818 [C00779BB3C40] [C028F204] .do_shmat+0x304/0x454 [C00779BB3D30] [C0289660] .compat_sys_shmat+0x34/0x94 [C00779BB3DC0] [C0014A20] .compat_sys_ipc+0x18c/0x1e8 [C00779BB3E30] [C000872C] syscall_exit+0x0/0x40 -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API
On Mon, 2007-02-19 at 19:43 +0100, Arjan van de Ven wrote: > On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote: > > The page tables for hugetlb mappings are handled differently than page > > tables > > for normal pages. Rather than integrating multiple page size support into > > the > > main VM (which would tremendously complicate the code) some hooks were > > created. > > This allows hugetlb special cases to be handled "out of line" by a separate > > interface. > > ok it makes sense to clean this up.. what I don't like is that there > STILL are all the double cases... for this to work and be worth it both > the common case and the hugetlb case should be using the ops structure > always! Anything else and you're just replacing bad code with bad > code ;( Hmm. Do you think everyone would support an extra pointer indirection for every handle_pte_fault() call? If not, then I definitely wouldn't mind creating a default_pagetable_ops and calling into that. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
On Mon, 2007-02-19 at 19:41 +0100, Arjan van de Ven wrote: > On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote: > > Signed-off-by: Adam Litke <[EMAIL PROTECTED]> > > --- > > > > include/linux/mm.h | 25 + > > 1 files changed, 25 insertions(+), 0 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 2d2c08d..a2fa66d 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -98,6 +98,7 @@ struct vm_area_struct { > > > > /* Function pointers to deal with this struct. */ > > struct vm_operations_struct * vm_ops; > > + struct pagetable_operations_struct * pagetable_ops; > > > > please make it at least const, those things have no business ever being > written to right? And by making them const the compiler helps catch > that, and as bonus the data gets moved to rodata so that it won't share > cachelines with anything that gets dirty Yep I agree. Changed. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 6/7] free_pgtable_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/memory.c |6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 1016694..3461f9b 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -567,6 +567,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .pin_pages = follow_hugetlb_page, .unmap_page_range = unmap_hugepage_range, .change_protection = hugetlb_change_protection, + .free_pgtable_range = hugetlb_free_pgd_range, }; static struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/memory.c b/mm/memory.c index aa6b06e..442e6b2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, anon_vma_unlink(vma); unlink_file_vma(vma); - if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + if (has_pt_op(vma, free_pgtable_range)) { + pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* * Optimization: gather nearby vmas into one call down */ while (next && next->vm_start <= vma->vm_end + PMD_SIZE - && !is_vm_hugetlb_page(next)) { + && !has_pt_op(next, free_pgtable_range)) { vma = next; next = vma->vm_next; anon_vma_unlink(vma); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/7] hugetlbfs fault handler
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/hugetlb.c |4 +++- mm/memory.c |4 ++-- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 3461f9b..1de73c1 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -568,6 +568,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .unmap_page_range = unmap_hugepage_range, .change_protection = hugetlb_change_protection, .free_pgtable_range = hugetlb_free_pgd_range, + .fault = hugetlb_fault, }; static struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3bcc0db..63de369 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -588,6 +588,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long vaddr = *position; int remainder = *length; + BUG_ON(!has_pt_op(vma, fault)); + spin_lock(>page_table_lock); while (vaddr < vma->vm_end && remainder) { pte_t *pte; @@ -604,7 +606,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, int ret; spin_unlock(>page_table_lock); - ret = hugetlb_fault(mm, vma, vaddr, 0); + ret = pt_op(vma, fault)(mm, vma, vaddr, 0); spin_lock(>page_table_lock); if (ret == VM_FAULT_MINOR) continue; diff --git a/mm/memory.c b/mm/memory.c index 442e6b2..c8e17b4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2455,8 +2455,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(PGFAULT); - if (unlikely(is_vm_hugetlb_page(vma))) - return hugetlb_fault(mm, vma, address, write_access); + if (unlikely(has_pt_op(vma, fault))) + return pt_op(vma, fault)(mm, vma, address, write_access); pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at 
http://www.tux.org/lkml/
[PATCH 5/7] change_protection for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/mprotect.c|5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 146a4b7..1016694 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -566,6 +566,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, .pin_pages = follow_hugetlb_page, .unmap_page_range = unmap_hugepage_range, + .change_protection = hugetlb_change_protection, }; static struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/mprotect.c b/mm/mprotect.c index 3b8f3c0..172e204 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -201,8 +201,9 @@ success: dirty_accountable = 1; } - if (is_vm_hugetlb_page(vma)) - hugetlb_change_protection(vma, start, end, vma->vm_page_prot); + if (has_pt_op(vma, change_protection)) + pt_op(vma, change_protection)(vma, start, end, + vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/7] unmap_page_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c|3 ++- include/linux/hugetlb.h |4 ++-- mm/hugetlb.c| 12 mm/memory.c | 10 -- 4 files changed, 16 insertions(+), 13 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 2d1dd84..146a4b7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff) v_offset = 0; __unmap_hugepage_range(vma, - vma->vm_start + v_offset, vma->vm_end); + vma->vm_start + v_offset, vma->vm_end, 0); } } @@ -565,6 +565,7 @@ const struct file_operations hugetlbfs_file_operations = { static struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, .pin_pages = follow_hugetlb_page, + .unmap_page_range = unmap_hugepage_range, }; static struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index a60995a..add92b3 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -16,8 +16,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma) int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int); -void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long); -void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long); +unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *); +void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *); int hugetlb_prefault(struct address_space *, struct vm_area_struct *); int hugetlb_report_meminfo(char *); int hugetlb_report_node_meminfo(int, char *); diff --git 
a/mm/hugetlb.c b/mm/hugetlb.c index cb362f7..3bcc0db 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -356,7 +356,7 @@ nomem: } void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end) + unsigned long end, long *zap_work) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -397,10 +397,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, list_del(>lru); put_page(page); } + + if (zap_work) + *zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE); } -void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end) +unsigned long unmap_hugepage_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, long *zap_work) { /* * It is undesirable to test vma->vm_file as it should be non-null @@ -412,9 +415,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, */ if (vma->vm_file) { spin_lock(>vm_file->f_mapping->i_mmap_lock); - __unmap_hugepage_range(vma, start, end); + __unmap_hugepage_range(vma, start, end, zap_work); spin_unlock(>vm_file->f_mapping->i_mmap_lock); } + return end; } static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, diff --git a/mm/memory.c b/mm/memory.c index 9467c65..aa6b06e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp, tlb_start_valid = 1; } - if (unlikely(is_vm_hugetlb_page(vma))) { - unmap_hugepage_range(vma, start, end); - zap_work -= (end - start) / - (HPAGE_SIZE / PAGE_SIZE); - start = end; - } else + if (unlikely(has_pt_op(vma, unmap_page_range))) + start = pt_op(vma, unmap_page_range) + (vma, start, end, _work); + else start = unmap_page_range(*tlbp, vma, start, end, _work, details); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at 
http://www.tux.org/lkml/
[PATCH 3/7] pin_pages for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |1 + mm/memory.c |6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index c0a7984..2d1dd84 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -564,6 +564,7 @@ const struct file_operations hugetlbfs_file_operations = { static struct pagetable_operations_struct hugetlbfs_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, + .pin_pages = follow_hugetlb_page, }; static struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/mm/memory.c b/mm/memory.c index 80eafd5..9467c65 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, || !(vm_flags & vma->vm_flags)) return i ? : -EFAULT; - if (is_vm_hugetlb_page(vma)) { - i = follow_hugetlb_page(mm, vma, pages, vmas, - , , i); + if (has_pt_op(vma, pin_pages)) { + i = pt_op(vma, pin_pages)(mm, vma, pages, + vmas, , , i); continue; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/7] copy_vma for hugetlbfs
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- fs/hugetlbfs/inode.c |6 ++ mm/memory.c |4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 4f4cd13..c0a7984 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -36,6 +36,7 @@ static struct super_operations hugetlbfs_ops; static const struct address_space_operations hugetlbfs_aops; const struct file_operations hugetlbfs_file_operations; +static struct pagetable_operations_struct hugetlbfs_pagetable_ops; static struct inode_operations hugetlbfs_dir_inode_operations; static struct inode_operations hugetlbfs_inode_operations; @@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) */ vma->vm_flags |= VM_HUGETLB | VM_RESERVED; vma->vm_ops = _vm_ops; + vma->pagetable_ops = _pagetable_ops; vma_len = (loff_t)(vma->vm_end - vma->vm_start); @@ -560,6 +562,10 @@ const struct file_operations hugetlbfs_file_operations = { .get_unmapped_area = hugetlb_get_unmapped_area, }; +static struct pagetable_operations_struct hugetlbfs_pagetable_ops = { + .copy_vma = copy_hugetlb_page_range, +}; + static struct inode_operations hugetlbfs_dir_inode_operations = { .create = hugetlbfs_create, .lookup = simple_lookup, diff --git a/mm/memory.c b/mm/memory.c index ef09f0a..80eafd5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, return 0; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (has_pt_op(vma, copy_vma)) + return pt_op(vma, copy_vma)(dst_mm, src_mm, vma); dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/7] [RFC] hugetlb: pagetable_operations API
The page tables for hugetlb mappings are handled differently than page tables for normal pages. Rather than integrating multiple page size support into the main VM (which would tremendously complicate the code) some hooks were created. This allows hugetlb special cases to be handled "out of line" by a separate interface. Hugetlbfs was the huge page interface chosen. At the time, large database users were the only big users of huge pages and the hugetlbfs design meets their needs pretty well. Over time, hugetlbfs has been expanded to enable new uses of huge page memory with varied results. As features are added, the semantics become a permanent part of the Linux API. This makes maintenance of hugetlbfs an increasingly difficult task and inhibits the addition of features and functionality in support of ever-changing hardware. To remedy the situation, I propose an API (currently called pagetable_operations). All of the current hugetlbfs-specific hooks are moved into an operations struct that is attached to VMAs. The end result is a more explicit and IMO a cleaner interface between hugetlbfs and the core VM. We are then free to add other hugetlb interfaces (such as a /dev/zero-styled character device) that can operate either in concert with or independent of hugetlbfs. There should be no measurable performance impact for normal page users (we're checking if pagetable_ops != NULL instead of checking for vm_flags & VM_HUGETLB). Of course we do increase the VMA size by one pointer. For huge pages, there is an added indirection for pt_op() calls. This patch series does not change the logic of the hugetlbfs operations, just moves them into the pagetable_operations struct. Comments? Do you think it's as good of an idea as I do?
[PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
Signed-off-by: Adam Litke <[EMAIL PROTECTED]> --- include/linux/mm.h | 25 + 1 files changed, 25 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2d2c08d..a2fa66d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -98,6 +98,7 @@ struct vm_area_struct { /* Function pointers to deal with this struct. */ struct vm_operations_struct * vm_ops; + struct pagetable_operations_struct * pagetable_ops; /* Information about our backing store: */ unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE @@ -218,6 +219,30 @@ struct vm_operations_struct { }; struct mmu_gather; + +struct pagetable_operations_struct { + int (*fault)(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, int write_access); + int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src, + struct vm_area_struct *vma); + int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma, + struct page **pages, struct vm_area_struct **vmas, + unsigned long *position, int *length, int i); + void (*change_protection)(struct vm_area_struct *vma, + unsigned long address, unsigned long end, pgprot_t newprot); + unsigned long (*unmap_page_range)(struct vm_area_struct *vma, + unsigned long address, unsigned long end, long *zap_work); + void (*free_pgtable_range)(struct mmu_gather **tlb, + unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling); +}; + +#define has_pt_op(vma, op) \ + ((vma)->pagetable_ops && (vma)->pagetable_ops->op) +#define pt_op(vma, call) \ + ((vma)->pagetable_ops->call) + struct inode; #define page_private(page) ((page)->private) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/7] [RFC] hugetlb: pagetable_operations API
The page tables for hugetlb mappings are handled differently than page tables for normal pages. Rather than integrating multiple page size support into the main VM (which would tremendously complicate the code) some hooks were created. This allows hugetlb special cases to be handled out of line by a separate interface. Hugetlbfs was the huge page interface chosen. At the time, large database users were the only big users of huge pages and the hugetlbfs design meets their needs pretty well. Over time, hugetlbfs has been expanded to enable new uses of huge page memory with varied results. As features are added, the semantics become a permanent part of the Linux API. This makes maintenance of hugetlbfs an increasingly difficult task and inhibits the addition of features and functionality in support of ever-changing hardware. To remedy the situation, I propose an API (currently called pagetable_operations). All of the current hugetlbfs-specific hooks are moved into an operations struct that is attached to VMAs. The end result is a more explicit and IMO a cleaner interface between hugetlbfs and the core VM. We are then free to add other hugetlb interfaces (such as a /dev/zero-styled character device) that can operate either in concert with or independent of hugetlbfs. There should be no measurable performance impact for normal page users (we're checking if pagetable_ops != NULL instead of checking for vm_flags VM_HUGETLB). Of course we do increase the VMA size by one pointer. For huge pages, there is an added indirection for pt_op() calls. This patch series does not change the logic of the the hugetlbfs operations, just moves them into the pagetable_operations struct. Comments? Do you think it's as good of an idea as I do? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
Signed-off-by: Adam Litke [EMAIL PROTECTED] --- include/linux/mm.h | 25 + 1 files changed, 25 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2d2c08d..a2fa66d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -98,6 +98,7 @@ struct vm_area_struct { /* Function pointers to deal with this struct. */ struct vm_operations_struct * vm_ops; + struct pagetable_operations_struct * pagetable_ops; /* Information about our backing store: */ unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE @@ -218,6 +219,30 @@ struct vm_operations_struct { }; struct mmu_gather; + +struct pagetable_operations_struct { + int (*fault)(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, int write_access); + int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src, + struct vm_area_struct *vma); + int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma, + struct page **pages, struct vm_area_struct **vmas, + unsigned long *position, int *length, int i); + void (*change_protection)(struct vm_area_struct *vma, + unsigned long address, unsigned long end, pgprot_t newprot); + unsigned long (*unmap_page_range)(struct vm_area_struct *vma, + unsigned long address, unsigned long end, long *zap_work); + void (*free_pgtable_range)(struct mmu_gather **tlb, + unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling); +}; + +#define has_pt_op(vma, op) \ + ((vma)-pagetable_ops (vma)-pagetable_ops-op) +#define pt_op(vma, call) \ + ((vma)-pagetable_ops-call) + struct inode; #define page_private(page) ((page)-private) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/7] copy_vma for hugetlbfs
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    6 ++++++
 mm/memory.c          |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4f4cd13..c0a7984 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 static struct super_operations hugetlbfs_ops;
 static const struct address_space_operations hugetlbfs_aops;
 const struct file_operations hugetlbfs_file_operations;
+static struct pagetable_operations_struct hugetlbfs_pagetable_ops;
 static struct inode_operations hugetlbfs_dir_inode_operations;
 static struct inode_operations hugetlbfs_inode_operations;
@@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 */
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
+	vma->pagetable_ops = &hugetlbfs_pagetable_ops;
 
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -560,6 +562,10 @@ const struct file_operations hugetlbfs_file_operations = {
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 };
 
+static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
+	.copy_vma	= copy_hugetlb_page_range,
+};
+
 static struct inode_operations hugetlbfs_dir_inode_operations = {
 	.create		= hugetlbfs_create,
 	.lookup		= simple_lookup,
diff --git a/mm/memory.c b/mm/memory.c
index ef09f0a..80eafd5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (has_pt_op(vma, copy_vma))
+		return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
 
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
[PATCH 3/7] pin_pages for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c0a7984..2d1dd84 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -564,6 +564,7 @@ const struct file_operations hugetlbfs_file_operations = {
 
 static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma	= copy_hugetlb_page_range,
+	.pin_pages	= follow_hugetlb_page,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index 80eafd5..9467c65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				|| !(vm_flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
+		if (has_pt_op(vma, pin_pages)) {
+			i = pt_op(vma, pin_pages)(mm, vma, pages,
+						vmas, &start, &len, i);
 			continue;
 		}
[PATCH 4/7] unmap_page_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c    |    3 ++-
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   12 ++++++++----
 mm/memory.c             |   10 ++++------
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2d1dd84..146a4b7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 			v_offset = 0;
 
 		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end);
+				vma->vm_start + v_offset, vma->vm_end, 0);
 	}
 }
@@ -565,6 +565,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma	= copy_hugetlb_page_range,
 	.pin_pages	= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a60995a..add92b3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -16,8 +16,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cb362f7..3bcc0db 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end)
+			unsigned long end, long *zap_work)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -397,10 +397,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		list_del(&page->lru);
 		put_page(page);
 	}
+
+	if (zap_work)
+		*zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, long *zap_work)
 {
 	/*
 	 * It is undesirable to test vma->vm_file as it should be non-null
@@ -412,9 +415,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	 */
 	if (vma->vm_file) {
 		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-		__unmap_hugepage_range(vma, start, end);
+		__unmap_hugepage_range(vma, start, end, zap_work);
 		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 	}
+	return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 9467c65..aa6b06e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 				tlb_start_valid = 1;
 			}
 
-			if (unlikely(is_vm_hugetlb_page(vma))) {
-				unmap_hugepage_range(vma, start, end);
-				zap_work -= (end - start) /
-						(HPAGE_SIZE / PAGE_SIZE);
-				start = end;
-			} else
+			if (unlikely(has_pt_op(vma, unmap_page_range)))
+				start = pt_op(vma, unmap_page_range)
+						(vma, start, end, &zap_work);
+			else
 				start = unmap_page_range(*tlbp, vma, start,
 						end, &zap_work, details);
[PATCH 7/7] hugetlbfs fault handler
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/hugetlb.c         |    4 +++-
 mm/memory.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3461f9b..1de73c1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,6 +568,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
 	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= hugetlb_fault,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3bcc0db..63de369 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -588,6 +588,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	int remainder = *length;
 
+	BUG_ON(!has_pt_op(vma, fault));
+
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -604,7 +606,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
 			if (ret == VM_FAULT_MINOR)
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index 442e6b2..c8e17b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2455,8 +2455,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(has_pt_op(vma, fault)))
+		return pt_op(vma, fault)(mm, vma, address, write_access);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
[PATCH 5/7] change_protection for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/mprotect.c        |    5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 146a4b7..1016694 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -566,6 +566,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma	= copy_hugetlb_page_range,
 	.pin_pages	= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
 		dirty_accountable = 1;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+	if (has_pt_op(vma, change_protection))
+		pt_op(vma, change_protection)(vma, start, end,
+						vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
[PATCH 6/7] free_pgtable_range for hugetlb
Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1016694..3461f9b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.pin_pages	= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index aa6b06e..442e6b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
 		anon_vma_unlink(vma);
 		unlink_file_vma(vma);
 
-		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+		if (has_pt_op(vma, free_pgtable_range)) {
+			pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !has_pt_op(next, free_pgtable_range)) {
 				vma = next;
 				next = vma->vm_next;
 				anon_vma_unlink(vma);
Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
On Mon, 2007-02-19 at 19:41 +0100, Arjan van de Ven wrote:
> On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote:
> > Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
> > ---
> >  include/linux/mm.h |   25 +++++++++++++++++++++++++
> >  1 files changed, 25 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2d2c08d..a2fa66d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -98,6 +98,7 @@ struct vm_area_struct {
> >  	/* Function pointers to deal with this struct. */
> >  	struct vm_operations_struct * vm_ops;
> > +	struct pagetable_operations_struct * pagetable_ops;
>
> please make it at least const; those things have no business ever being
> written to, right? And by making them const the compiler helps catch
> that, and as a bonus the data gets moved to rodata so that it won't
> share cachelines with anything that gets dirty.

Yep I agree. Changed.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API
On Mon, 2007-02-19 at 19:43 +0100, Arjan van de Ven wrote:
> On Mon, 2007-02-19 at 10:31 -0800, Adam Litke wrote:
> > The page tables for hugetlb mappings are handled differently than page
> > tables for normal pages. Rather than integrating multiple page size
> > support into the main VM (which would tremendously complicate the
> > code) some hooks were created. This allows hugetlb special cases to
> > be handled out of line by a separate interface.
>
> ok it makes sense to clean this up.. what I don't like is that there
> STILL are all the double cases... for this to work and be worth it both
> the common case and the hugetlb case should be using the ops structure
> always! Anything else and you're just replacing bad code with bad
> code ;(

Hmm. Do you think everyone would support an extra pointer indirection
for every handle_pte_fault() call? If not, then I definitely wouldn't
mind creating a default_pagetable_ops and calling into that.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Define the shmem_inode_info flags directly
Andrew: This is a pretty basic and obvious cleanup IMO. How about a ride in -mm?

Defining flags in terms of other flags is always confusing. Give them
literal values instead of defining them in terms of VM_ flags. While
we're at it, move them to a header file in preparation for the
introduction of a SHMEM_HUGETLB flag.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 include/linux/shmem_fs.h |    4 ++++
 mm/shmem.c               |    4 ----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f3c5189..3ea0b6e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -8,6 +8,10 @@
 
 #define SHMEM_NR_DIRECT 16
 
+/* These info->flags are used to handle pagein/truncate races efficiently */
+#define SHMEM_PAGEIN	0x0001
+#define SHMEM_TRUNCATE	0x0002
+
 struct shmem_inode_info {
 	spinlock_t		lock;
 	unsigned long		flags;
diff --git a/mm/shmem.c b/mm/shmem.c
index 70da7a0..a9bdb0d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,10 +66,6 @@
 
 #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
 
-/* info->flags needs VM_ flags to handle pagein/truncate races efficiently */
-#define SHMEM_PAGEIN	VM_READ
-#define SHMEM_TRUNCATE	VM_WRITE
-
 /* Definition to limit shmem_truncate's steps between cond_rescheds */
 #define LATENCY_LIMIT	64
[PATCH] Don't allow the stack to grow into hugetlb reserved regions
When expanding the stack, we don't currently check if the VMA will cross
into an area of the address space that is reserved for hugetlb pages.
Subsequent faults on the expanded portion of such a VMA will confuse the
low-level MMU code, resulting in an OOPS. Check for this.

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
---
 mm/mmap.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9717337..2c6b163 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1477,6 +1477,7 @@ static int acct_stack_growth(struct vm_area_struct * vma, unsigned long size, un
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct rlimit *rlim = current->signal->rlim;
+	unsigned long new_start;
 
 	/* address space limit tests */
 	if (!may_expand_vm(mm, grow))
@@ -1496,6 +1497,12 @@ static int acct_stack_growth(struct vm_area_struct * vma, unsigned long size, un
 		return -ENOMEM;
 	}
 
+	/* Check to make sure the stack will not grow into a hugetlb-only
+	 * region. */
+	new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
+			vma->vm_end - size;
+	if (is_hugepage_only_range(vma->vm_mm, new_start, size))
+		return -EFAULT;
+
 	/*
 	 * Overcommit.. This must be the final test, as it will
 	 * update security statistics.