hugetlb: fix hugepage allocation with memoryless nodes

Linux Kernel Mailing List Tue, 16 Oct 2007 11:05:42 -0700

Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=63b4613c3f0d4b724ba259dc6c201bb68b884e1a
Commit:     63b4613c3f0d4b724ba259dc6c201bb68b884e1a
Parent:     6b0c880dfefecedb9ad353014ed41505c32aca82
Author:     Nishanth Aravamudan <[EMAIL PROTECTED]>
AuthorDate: Tue Oct 16 01:26:24 2007 -0700
Committer:  Linus Torvalds <[EMAIL PROTECTED]>
CommitDate: Tue Oct 16 09:43:03 2007 -0700


    hugetlb: fix hugepage allocation with memoryless nodes
    
    Anton found a problem with the hugetlb pool allocation when some nodes have
    no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2).  Lee worked
    on versions that tried to fix it, but none were accepted.  Christoph has
    created a set of patches which allow for GFP_THISNODE allocations to fail
    if the node has no memory.
    
    Currently, alloc_fresh_huge_page() returns NULL when it is not able to
    allocate a huge page on the current node, as specified by its custom
    interleave variable.  The callers of this function, though, assume that a
    failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
    on the system at all.  This might not be the case, for instance, if we have
    an uneven NUMA system, and we happen to try to allocate a hugepage on a
    node with less memory and fail, while there is still plenty of free memory
    on the other nodes.
    
    To correct this, make alloc_fresh_huge_page() search through all online
    nodes before deciding no hugepages can be allocated.  Add a helper function
    for actually allocating the hugepage.  Use a new global nid iterator to
    control which nid to allocate on.
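
A minimal user-space sketch of the search described above, for illustration only. It is not the kernel code: NR_NODES, node_free[], alloc_on_node() and alloc_round_robin() are hypothetical stand-ins, and a simple modulo walk replaces next_node()/first_node() over node_online_map. The point it tries to show is that every node is tried once, starting from a shared iterator, before failure is reported, and that the iterator advances even after a success.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4              /* hypothetical node count for this sketch */

/* Hypothetical stand-in for kernel state: huge pages still available per node. */
static int node_free[NR_NODES] = { 2, 0, 0, 1 };
static int next_nid;            /* plays the role of hugetlb_next_nid */

/* Models a strict per-node allocation: succeeds only if nid itself has memory. */
static bool alloc_on_node(int nid)
{
    assert(nid >= 0 && nid < NR_NODES);
    if (node_free[nid] == 0)
        return false;
    node_free[nid]--;
    return true;
}

/* Try each node at most once, starting at next_nid.
 * Returns the node that supplied the page, or -1 if every node failed. */
static int alloc_round_robin(void)
{
    int start = next_nid;
    int used = -1;

    do {
        int nid = next_nid;

        if (alloc_on_node(nid))
            used = nid;
        /* Advance even on success so the next caller starts one node later. */
        next_nid = (nid + 1) % NR_NODES;
    } while (used < 0 && next_nid != start);

    return used;
}
```

With the initial values above, successive calls take pages from node 0, node 3 and node 0 again, and only the fourth call fails, after all four nodes have been tried.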
    
    Note: we expect particular semantics for __GFP_THISNODE, which are now
    enforced even for memoryless nodes.  That is, there should be no
    fallback to other nodes.  Therefore, we rely on the nid passed into
    alloc_pages_node() to be the nid the page comes from.  If this is
    incorrect, accounting will break.
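
The accounting concern in the note above can be pictured with a small user-space sketch. Everything here is hypothetical: struct fake_page, nr_pages_node[], account_alloc() and account_free() are illustrative names, and the assumption that the free path counts against the node the page actually lives on is made for the sake of the example. Under that assumption, counting an allocation against the requested nid only balances if the page really came from that node.

```c
#include <assert.h>

#define NR_NODES 2                      /* hypothetical node count */

static int nr_pages_node[NR_NODES];     /* per-node pool counters */

/* Hypothetical stand-in for struct page: records where the page really lives. */
struct fake_page {
    int actual_nid;
};

/* Allocation side: counts the page against the node that was requested. */
static void account_alloc(int requested_nid)
{
    assert(requested_nid >= 0 && requested_nid < NR_NODES);
    nr_pages_node[requested_nid]++;
}

/* Free side (assumed): counts against the node the page actually belongs to. */
static void account_free(const struct fake_page *page)
{
    nr_pages_node[page->actual_nid]--;
}
```

If a page requested on node 0 were silently served from node 1, an alloc/free pair would leave node 0 at +1 and node 1 at -1, which is the kind of breakage the note warns about.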
    
    Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
    memoryless nodes).
    
    Before on the ppc64 box:
    Trying to clear the hugetlb pool
    Done.       0 free
    Trying to resize the pool to 100
    Node 0 HugePages_Free:     25
    Node 1 HugePages_Free:     75
    Node 2 HugePages_Free:      0
    Node 3 HugePages_Free:      0
    Done. Initially     100 free
    Trying to resize the pool to 200
    Node 0 HugePages_Free:     50
    Node 1 HugePages_Free:    150
    Node 2 HugePages_Free:      0
    Node 3 HugePages_Free:      0
    Done.     200 free
    
    After:
    Trying to clear the hugetlb pool
    Done.       0 free
    Trying to resize the pool to 100
    Node 0 HugePages_Free:     50
    Node 1 HugePages_Free:     50
    Node 2 HugePages_Free:      0
    Node 3 HugePages_Free:      0
    Done. Initially     100 free
    Trying to resize the pool to 200
    Node 0 HugePages_Free:    100
    Node 1 HugePages_Free:    100
    Node 2 HugePages_Free:      0
    Node 3 HugePages_Free:      0
    Done.     200 free
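
One way to read the before/after numbers is the toy model below. It rests on assumptions that the commit message does not state: that nodes 2 and 3 are the memoryless ones, and that in the old scheme a request aimed at a memoryless node was satisfied from node 1 (FALLBACK_NID is a made-up constant; the real fallback order depends on the zonelists). fill_old() and fill_new() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

#define NR_NODES 4

/* Assumption: nodes 2 and 3 have no memory, as in the output above. */
static const int has_memory[NR_NODES] = { 1, 1, 0, 0 };

/* Assumption: without strict per-node allocation, requests for a
 * memoryless node end up on node 1. */
#define FALLBACK_NID 1

/* Old scheme: plain interleave; the allocator may fall back to another node. */
static void fill_old(int count, int pages[NR_NODES])
{
    int prev_nid = 0;

    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        int nid = (prev_nid + 1) % NR_NODES;

        prev_nid = nid;
        pages[has_memory[nid] ? nid : FALLBACK_NID]++;
    }
}

/* New scheme: strict per-node allocation, moving to the next node on failure. */
static void fill_new(int count, int pages[NR_NODES])
{
    int next_nid = 0;

    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        int start = next_nid;
        int done = 0;

        do {
            int nid = next_nid;

            if (has_memory[nid]) {
                pages[nid]++;
                done = 1;
            }
            next_nid = (nid + 1) % NR_NODES;
        } while (!done && next_nid != start);
    }
}
```

Filling a pool of 100 pages with this model gives 25/75/0/0 for the old scheme and 50/50/0/0 for the new one, matching the first resize step in each listing.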
    
    Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]>
    Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
    Cc: Adam Litke <[EMAIL PROTECTED]>
    Cc: David Gibson <[EMAIL PROTECTED]>
    Cc: Badari Pulavarty <[EMAIL PROTECTED]>
    Cc: Ken Chen <[EMAIL PROTECTED]>
    Cc: William Lee Irwin III <[EMAIL PROTECTED]>
    Cc: Lee Schermerhorn <[EMAIL PROTECTED]>
    Cc: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
    Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
---
 mm/hugetlb.c |   63 +++++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8fb86ba..82efecb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -32,6 +32,7 @@ static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 int hugetlb_dynamic_pool;
+static int hugetlb_next_nid;
 
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -165,36 +166,56 @@ static int adjust_pool_surplus(int delta)
        return ret;
 }
 
-static int alloc_fresh_huge_page(void)
+static struct page *alloc_fresh_huge_page_node(int nid)
 {
-       static int prev_nid;
        struct page *page;
-       int nid;
-
-       /*
-        * Copy static prev_nid to local nid, work on that, then copy it
-        * back to prev_nid afterwards: otherwise there's a window in which
-        * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
-        * But we don't need to use a spin_lock here: it really doesn't
-        * matter if occasionally a racer chooses the same nid as we do.
-        */
-       nid = next_node(prev_nid, node_online_map);
-       if (nid == MAX_NUMNODES)
-               nid = first_node(node_online_map);
-       prev_nid = nid;
 
-       page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
-                                       HUGETLB_PAGE_ORDER);
+       page = alloc_pages_node(nid,
+               htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+               HUGETLB_PAGE_ORDER);
        if (page) {
                set_compound_page_dtor(page, free_huge_page);
                spin_lock(&hugetlb_lock);
                nr_huge_pages++;
-               nr_huge_pages_node[page_to_nid(page)]++;
+               nr_huge_pages_node[nid]++;
                spin_unlock(&hugetlb_lock);
                put_page(page); /* free it into the hugepage allocator */
-               return 1;
        }
-       return 0;
+
+       return page;
+}
+
+static int alloc_fresh_huge_page(void)
+{
+       struct page *page;
+       int start_nid;
+       int next_nid;
+       int ret = 0;
+
+       start_nid = hugetlb_next_nid;
+
+       do {
+               page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+               if (page)
+                       ret = 1;
+               /*
+                * Use a helper variable to find the next node and then
+                * copy it back to hugetlb_next_nid afterwards:
+                * otherwise there's a window in which a racer might
+                * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+                * But we don't need to use a spin_lock here: it really
+                * doesn't matter if occasionally a racer chooses the
+                * same nid as we do.  Move nid forward in the mask even
+                * if we just successfully allocated a hugepage so that
+                * the next caller gets hugepages on the next node.
+                */
+               next_nid = next_node(hugetlb_next_nid, node_online_map);
+               if (next_nid == MAX_NUMNODES)
+                       next_nid = first_node(node_online_map);
+               hugetlb_next_nid = next_nid;
+       } while (!page && hugetlb_next_nid != start_nid);
+
+       return ret;
 }
 
 static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
@@ -365,6 +386,8 @@ static int __init hugetlb_init(void)
        for (i = 0; i < MAX_NUMNODES; ++i)
                INIT_LIST_HEAD(&hugepage_freelists[i]);
 
+       hugetlb_next_nid = first_node(node_online_map);
+
        for (i = 0; i < max_huge_pages; ++i) {
                if (!alloc_fresh_huge_page())
                        break;