Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-22 Thread Naoya Horiguchi
On 12/20/2017 06:53 PM, Michal Hocko wrote:
> On Wed 20-12-17 05:33:36, Naoya Horiguchi wrote:
>>
>> On 12/15/2017 06:33 PM, Michal Hocko wrote:
>>> Naoya,
>>> this has passed Mike's review (thanks for that!); you mentioned
>>> earlier that you can pass this through your testing machinery. While
>>> I've done some testing already, I would really appreciate it if you
>>> could do that as well. Review would be highly appreciated as well.
>>
>> Sorry for my slow response. I reviewed/tested this patchset and it
>> looks good to me overall.
> 
> No need to feel sorry. This doesn't have an urgent priority. Thanks for
> the review and testing. Can I assume your {Reviewed,Acked}-by or
> Tested-by?
> 

Yes, I tested again with the additional changes below, and hugetlb migration
works fine from mbind(2). Thank you very much for your work.
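
(For reference, a minimal sketch of the kind of mbind(2) exercise this covers;
the 2MB huge page size, the target node, and the mask width below are
illustrative assumptions, not the actual test harness:)

/*
 * Fault in one hugetlb page, then ask mbind(2) to migrate it, which
 * drives do_mbind() -> migrate_pages() -> new_page().
 * Assumes preallocated 2MB huge pages and at least two NUMA nodes
 * (both assumptions); link with -lnuma for the mbind() wrapper.
 */
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)          /* assumed 2MB huge page size */

int main(void)
{
        unsigned long nodemask = 1UL << 1;      /* hypothetical target: node 1 */
        char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0, HPAGE_SIZE);       /* fault the page in on the current node */

        if (mbind(p, HPAGE_SIZE, MPOL_BIND, &nodemask, 64,
                  MPOL_MF_MOVE | MPOL_MF_STRICT)) {
                perror("mbind");
                return 1;
        }
        puts("huge page migrated");
        return 0;
}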

Reviewed-by: Naoya Horiguchi 

for the series.

Thanks,
Naoya Horiguchi

>> I have one comment on the code path from mbind(2).
>> The callback passed to migrate_pages() in do_mbind() (i.e. new_page())
>> calls alloc_huge_page_noerr() which currently doesn't call 
>> SetPageHugeTemporary(),
>> so hugetlb migration fails when h->surplus_huge_pages >=
>> h->nr_overcommit_huge_pages.
> 
> Yes, I am aware of that. I should have been more explicit in the
> changelog. Sorry about that and thanks for pointing it out explicitly.
> To be honest I wasn't really sure what to do about this. The code path
> is really complex and it made my head spin. I fail to see why we have to
> call alloc_huge_page and mess with reservations at all.
> 
>> I don't think this is a bug, but it would be better if mbind(2) worked
>> more like other migration callers such as move_pages(2)/migrate_pages(2).
> 
> If the fix is as easy as the following I will add it to the pile.
> Otherwise I would prefer to do this separately after I find some more
> time to understand the callpath.
> ---
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e035002d3fb6..08a4af411e25 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -345,10 +345,9 @@ struct huge_bootmem_page {
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>   unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
> -struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> - unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>   nodemask_t *nmask);
> +struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>   pgoff_t idx);
>  
> @@ -526,7 +525,7 @@ struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
>  #define alloc_huge_page_node(h, nid) NULL
>  #define alloc_huge_page_nodemask(h, preferred_nid, nmask) NULL
> -#define alloc_huge_page_noerr(v, a, r) NULL
> +#define alloc_huge_page_vma(vma, address) NULL
>  #define alloc_bootmem_huge_page(h) NULL
>  #define hstate_file(f) NULL
>  #define hstate_sizelog(s) NULL
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4426c5b23a20..e00deabe6d17 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1672,6 +1672,25 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>   return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
>  }
>  
> +/* mempolicy aware migration callback */
> +struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address)
> +{
> + struct mempolicy *mpol;
> + nodemask_t *nodemask;
> + struct page *page;
> + struct hstate *h;
> + gfp_t gfp_mask;
> + int node;
> +
> + h = hstate_vma(vma);
> + gfp_mask = htlb_alloc_mask(h);
> + node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> + page = alloc_huge_page_nodemask(h, node, nodemask);
> + mpol_cond_put(mpol);
> +
> + return page;
> +}
> +
>  /*
>   * Increase the hugetlb pool such that it can accommodate a reservation
>   * of size 'delta'.
> @@ -2077,20 +2096,6 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>   return ERR_PTR(-ENOSPC);
>  }
>  
> -/*
> - * alloc_huge_page()'s wrapper which simply returns the page if allocation
> - * succeeds, otherwise NULL. This function is called from new_vma_page(),
> - * where no ERR_VALUE is expected to be returned.
> - */
> -struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> - unsigned long addr, int avoid_reserve)
> -{
> - struct page *page = alloc_huge_page(vma, addr, avoid_reserve);
> - if (IS_ERR(page))
> - page = NULL;
> - return page;
> -}
> -
>  int alloc_bootmem_huge_page(struct hstate *h)
>   __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
>  int __alloc_bootmem_huge_page(struct hstate *h)

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-22 Thread Michal Hocko
On Thu 21-12-17 15:35:28, Mike Kravetz wrote:
[...]
> You can add,
> 
> Reviewed-by: Mike Kravetz 
> 
> I had some concerns about transferring huge page state during migration
> not specific to this patch, so I did a bunch of testing.

On Fri 22-12-17 08:58:48, Naoya Horiguchi wrote:
[...]
> Yes, I tested again with the additional changes below, and hugetlb migration
> works fine from mbind(2). Thank you very much for your work.
> 
> Reviewed-by: Naoya Horiguchi 
> 
> for the series.

Thanks a lot to both of you! I have added the changelog to the last
patch. I am currently busy as hell, so I will likely not send the whole
thing before the new year, but please double-check the changelog if you
find some more time.
---
From 65e7dda5dfe90fa656285729a270855e93637caa Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 22 Dec 2017 10:31:18 +0100
Subject: [PATCH] hugetlb, mempolicy: fix the mbind hugetlb migration

do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
pages. alloc_huge_page_noerr uses alloc_huge_page, which is a high-level
allocation function that has to take care of reserves, overcommit and
hugetlb cgroup accounting.  None of that is really required for page
migration, because the new page is only temporary and will either
replace the original page or be dropped. This is essentially the same
as for other migration call paths, so there shouldn't be any reason to
handle mbind in a special way.

The current implementation is even suboptimal because the migration
might fail just because the hugetlb cgroup limit is reached, or the
overcommit is saturated.

Fix this by making mbind like other hugetlb migration paths. Add
a new migration helper alloc_huge_page_vma as a wrapper around
alloc_huge_page_nodemask with additional mempolicy handling.

alloc_huge_page_noerr has no more users and it can go.

Reviewed-by: Mike Kravetz 
Reviewed-by: Naoya Horiguchi 
Signed-off-by: Michal Hocko 
---
 include/linux/hugetlb.h |  5 ++---
 mm/hugetlb.c            | 33 +++--
 mm/mempolicy.c          |  3 +--
 3 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e035002d3fb6..08a4af411e25 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -345,10 +345,9 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
-struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
-   unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask);
+struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long 
address);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
 
@@ -526,7 +525,7 @@ struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_huge_page_nodemask(h, preferred_nid, nmask) NULL
-#define alloc_huge_page_noerr(v, a, r) NULL
+#define alloc_huge_page_vma(vma, address) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_sizelog(s) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4426c5b23a20..e00deabe6d17 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1672,6 +1672,25 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
 }
 
+/* mempolicy aware migration callback */
+struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address)
+{
+   struct mempolicy *mpol;
+   nodemask_t *nodemask;
+   struct page *page;
+   struct hstate *h;
+   gfp_t gfp_mask;
+   int node;
+
+   h = hstate_vma(vma);
+   gfp_mask = htlb_alloc_mask(h);
+   node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+   page = alloc_huge_page_nodemask(h, node, nodemask);
+   mpol_cond_put(mpol);
+
+   return page;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2077,20 +2096,6 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOSPC);
 }
 
-/*
- * alloc_huge_page()'s wrapper which simply returns the page if allocation
- * succeeds, otherwise NULL. This function is called from new_vma_page(),
- * where no ERR_VALUE is expected to be returned.
- */
-struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
-   unsigned long addr, int avoid_reserve)
-{
-   struct page *page = alloc_huge_page(vma, addr, avoid_reserve);
-   if (IS_ERR(page))
-   page = NULL;
-   return page;
-}
-
 int alloc_bootmem_huge_page(struct hstate *h)
    __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
 int __alloc_bootmem_huge_page(struct hstate *h)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..96823fa07f38 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1121,8 +1121,7 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
    }
 
    if (PageHuge(page)) {
-   BUG_ON(!vma);
-   return alloc_huge_page_noerr(vma, address, 1);
+   return alloc_huge_page_vma(vma, address);

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-21 Thread Mike Kravetz
On 12/20/2017 11:28 PM, Michal Hocko wrote:
> On Wed 20-12-17 14:43:03, Mike Kravetz wrote:
>> On 12/20/2017 01:53 AM, Michal Hocko wrote:
>>> On Wed 20-12-17 05:33:36, Naoya Horiguchi wrote:
>>>> I have one comment on the code path from mbind(2).
>>>> The callback passed to migrate_pages() in do_mbind() (i.e. new_page())
>>>> calls alloc_huge_page_noerr(), which currently doesn't call
>>>> SetPageHugeTemporary(), so hugetlb migration fails when
>>>> h->surplus_huge_pages >= h->nr_overcommit_huge_pages.
>>>
>>> Yes, I am aware of that. I should have been more explicit in the
>>> changelog. Sorry about that and thanks for pointing it out explicitly.
>>> To be honest I wasn't really sure what to do about this. The code path
>>> is really complex and it made my head spin. I fail to see why we have to
>>> call alloc_huge_page and mess with reservations at all.
>>
>> Oops!  I missed that in my review.
>>
>> Since alloc_huge_page was called with avoid_reserve == 1, it should not
>> do anything with reserve counts.  One potential issue with the existing
>> code is cgroup accounting done by alloc_huge_page.  When the new target
>> page is allocated, it is charged against the cgroup even though the original
>> page is still accounted for.  If we are 'at the cgroup limit', the migration
>> may fail because of this.
> 
> Yeah, the existing code seems just broken. I strongly suspect that the
> allocation API for hugetlb was so complicated that this was just a
> natural result of confusion as follow-up changes were made on top.
> 
>> I like your new code below as it explicitly takes reserve and cgroup
>> accounting out of the picture for migration.  Let me think about it
>> for another day before providing a Reviewed-by.
> 
> Thanks a lot!

You can add,

Reviewed-by: Mike Kravetz 

I had some concerns about transferring huge page state during migration
not specific to this patch, so I did a bunch of testing.

-- 
Mike Kravetz

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-20 Thread Michal Hocko
On Wed 20-12-17 14:43:03, Mike Kravetz wrote:
> On 12/20/2017 01:53 AM, Michal Hocko wrote:
> > On Wed 20-12-17 05:33:36, Naoya Horiguchi wrote:
> >> I have one comment on the code path from mbind(2).
> >> The callback passed to migrate_pages() in do_mbind() (i.e. new_page())
> >> calls alloc_huge_page_noerr() which currently doesn't call 
> >> SetPageHugeTemporary(),
> >> so hugetlb migration fails when h->surplus_huge_pages >=
> >> h->nr_overcommit_huge_pages.
> > 
> > Yes, I am aware of that. I should have been more explicit in the
> > changelog. Sorry about that and thanks for pointing it out explicitly.
> > To be honest I wasn't really sure what to do about this. The code path
> > is really complex and it made my head spin. I fail to see why we have to
> > call alloc_huge_page and mess with reservations at all.
> 
> Oops!  I missed that in my review.
> 
> Since alloc_huge_page was called with avoid_reserve == 1, it should not
> do anything with reserve counts.  One potential issue with the existing
> code is cgroup accounting done by alloc_huge_page.  When the new target
> page is allocated, it is charged against the cgroup even though the original
> page is still accounted for.  If we are 'at the cgroup limit', the migration
> may fail because of this.

Yeah, the existing code seems just broken. I strongly suspect that the
allocation API for hugetlb was so complicated that this was just a
natural result of confusion as follow-up changes were made on top.

> I like your new code below as it explicitly takes reserve and cgroup
> accounting out of the picture for migration.  Let me think about it
> for another day before providing a Reviewed-by.

Thanks a lot!
-- 
Michal Hocko
SUSE Labs

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-20 Thread Mike Kravetz
On 12/20/2017 01:53 AM, Michal Hocko wrote:
> On Wed 20-12-17 05:33:36, Naoya Horiguchi wrote:
>> I have one comment on the code path from mbind(2).
>> The callback passed to migrate_pages() in do_mbind() (i.e. new_page())
>> calls alloc_huge_page_noerr() which currently doesn't call 
>> SetPageHugeTemporary(),
>> so hugetlb migration fails when h->surplus_huge_pages >=
>> h->nr_overcommit_huge_pages.
> 
> Yes, I am aware of that. I should have been more explicit in the
> changelog. Sorry about that and thanks for pointing it out explicitly.
> To be honest I wasn't really sure what to do about this. The code path
> is really complex and it made my head spin. I fail to see why we have to
> call alloc_huge_page and mess with reservations at all.

Oops!  I missed that in my review.

Since alloc_huge_page was called with avoid_reserve == 1, it should not
do anything with reserve counts.  One potential issue with the existing
code is cgroup accounting done by alloc_huge_page.  When the new target
page is allocated, it is charged against the cgroup even though the original
page is still accounted for.  If we are 'at the cgroup limit', the migration
may fail because of this.
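
(To make that concrete, a sketch that reads the hugetlb cgroup counters from
userspace; the v1 controller mount point, the "test" cgroup name, and the 2MB
page size are assumptions for illustration, not part of the patch:)

/*
 * Print the hugetlb cgroup headroom.  With the existing mbind path the
 * migration target page is charged while the source page is still
 * accounted, so less than one huge page of headroom below the limit
 * can make the migration fail.
 */
#include <stdio.h>

static long read_long(const char *path)
{
        FILE *f = fopen(path, "r");
        long v = -1;

        if (f) {
                if (fscanf(f, "%ld", &v) != 1)
                        v = -1;
                fclose(f);
        }
        return v;
}

int main(void)
{
        /* hypothetical v1 hugetlb cgroup of the test process */
        const char *dir = "/sys/fs/cgroup/hugetlb/test";
        char path[256];
        long usage, limit;

        snprintf(path, sizeof(path), "%s/hugetlb.2MB.usage_in_bytes", dir);
        usage = read_long(path);
        snprintf(path, sizeof(path), "%s/hugetlb.2MB.limit_in_bytes", dir);
        limit = read_long(path);

        printf("usage=%ld limit=%ld headroom=%ld bytes\n",
               usage, limit, limit - usage);
        return 0;
}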

I like your new code below as it explicitly takes reserve and cgroup
accounting out of the picture for migration.  Let me think about it
for another day before providing a Reviewed-by.

-- 
Mike Kravetz

>> I don't think this is a bug, but it would be better if mbind(2) worked
>> more like other migration callers such as move_pages(2)/migrate_pages(2).
> 
> If the fix is as easy as the following I will add it to the pile.
> Otherwise I would prefer to do this separately after I find some more
> time to understand the callpath.
> ---
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e035002d3fb6..08a4af411e25 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -345,10 +345,9 @@ struct huge_bootmem_page {
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>   unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
> -struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> - unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>   nodemask_t *nmask);
> +struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>   pgoff_t idx);
>  
> @@ -526,7 +525,7 @@ struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
>  #define alloc_huge_page_node(h, nid) NULL
>  #define alloc_huge_page_nodemask(h, preferred_nid, nmask) NULL
> -#define alloc_huge_page_noerr(v, a, r) NULL
> +#define alloc_huge_page_vma(vma, address) NULL
>  #define alloc_bootmem_huge_page(h) NULL
>  #define hstate_file(f) NULL
>  #define hstate_sizelog(s) NULL
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4426c5b23a20..e00deabe6d17 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1672,6 +1672,25 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>   return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
>  }
>  
> +/* mempolicy aware migration callback */
> +struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address)
> +{
> + struct mempolicy *mpol;
> + nodemask_t *nodemask;
> + struct page *page;
> + struct hstate *h;
> + gfp_t gfp_mask;
> + int node;
> +
> + h = hstate_vma(vma);
> + gfp_mask = htlb_alloc_mask(h);
> + node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> + page = alloc_huge_page_nodemask(h, node, nodemask);
> + mpol_cond_put(mpol);
> +
> + return page;
> +}
> +
>  /*
>   * Increase the hugetlb pool such that it can accommodate a reservation
>   * of size 'delta'.
> @@ -2077,20 +2096,6 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>   return ERR_PTR(-ENOSPC);
>  }
>  
> -/*
> - * alloc_huge_page()'s wrapper which simply returns the page if allocation
> - * succeeds, otherwise NULL. This function is called from new_vma_page(),
> - * where no ERR_VALUE is expected to be returned.
> - */
> -struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> - unsigned long addr, int avoid_reserve)
> -{
> - struct page *page = alloc_huge_page(vma, addr, avoid_reserve);
> - if (IS_ERR(page))
> - page = NULL;
> - return page;
> -}
> -
>  int alloc_bootmem_huge_page(struct hstate *h)
>   __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
>  int __alloc_bootmem_huge_page(struct hstate *h)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..96823fa07f38 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1121,8 +1121,7 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
>   }
>  
>   if (PageHuge(page)) {
> - BUG_ON(!vma);
> - return alloc_huge_page_noerr(vma, address, 1);
> + return alloc_huge_page_vma(vma, address);

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-20 Thread Michal Hocko
On Wed 20-12-17 05:33:36, Naoya Horiguchi wrote:
> 
> On 12/15/2017 06:33 PM, Michal Hocko wrote:
> > Naoya,
> > this has passed Mike's review (thanks for that!); you mentioned
> > earlier that you can pass this through your testing machinery. While
> > I've done some testing already, I would really appreciate it if you
> > could do that as well. Review would be highly appreciated as well.
> 
> Sorry for my slow response. I reviewed/tested this patchset and it
> looks good to me overall.

No need to feel sorry. This doesn't have an urgent priority. Thanks for
the review and testing. Can I assume your {Reviewed,Acked}-by or
Tested-by?

> I have one comment on the code path from mbind(2).
> The callback passed to migrate_pages() in do_mbind() (i.e. new_page())
> calls alloc_huge_page_noerr() which currently doesn't call 
> SetPageHugeTemporary(),
> so hugetlb migration fails when h->surplus_huge_pages >=
> h->nr_overcommit_huge_pages.

Yes, I am aware of that. I should have been more explicit in the
changelog. Sorry about that and thanks for pointing it out explicitly.
To be honest I wasn't really sure what to do about this. The code path
is really complex and it made my head spin. I fail to see why we have to
call alloc_huge_page and mess with reservations at all.

> I don't think this is a bug, but it would be better if mbind(2) worked
> more like other migration callers such as move_pages(2)/migrate_pages(2).

If the fix is as easy as the following I will add it to the pile.
Otherwise I would prefer to do this separately after I find some more
time to understand the callpath.
---
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e035002d3fb6..08a4af411e25 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -345,10 +345,9 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
-struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
-   unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask);
+struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
 
@@ -526,7 +525,7 @@ struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_huge_page_nodemask(h, preferred_nid, nmask) NULL
-#define alloc_huge_page_noerr(v, a, r) NULL
+#define alloc_huge_page_vma(vma, address) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_sizelog(s) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4426c5b23a20..e00deabe6d17 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1672,6 +1672,25 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
 }
 
+/* mempolicy aware migration callback */
+struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address)
+{
+   struct mempolicy *mpol;
+   nodemask_t *nodemask;
+   struct page *page;
+   struct hstate *h;
+   gfp_t gfp_mask;
+   int node;
+
+   h = hstate_vma(vma);
+   gfp_mask = htlb_alloc_mask(h);
+   node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+   page = alloc_huge_page_nodemask(h, node, nodemask);
+   mpol_cond_put(mpol);
+
+   return page;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2077,20 +2096,6 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOSPC);
 }
 
-/*
- * alloc_huge_page()'s wrapper which simply returns the page if allocation
- * succeeds, otherwise NULL. This function is called from new_vma_page(),
- * where no ERR_VALUE is expected to be returned.
- */
-struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
-   unsigned long addr, int avoid_reserve)
-{
-   struct page *page = alloc_huge_page(vma, addr, avoid_reserve);
-   if (IS_ERR(page))
-   page = NULL;
-   return page;
-}
-
 int alloc_bootmem_huge_page(struct hstate *h)
__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
 int __alloc_bootmem_huge_page(struct hstate *h)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..96823fa07f38 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1121,8 +1121,7 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
}
 
if (PageHuge(page)) {
-   BUG_ON(!vma);
-   return alloc_huge_page_noerr(vma, address, 1);
+   return alloc_huge_page_vma(vma, address);

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-19 Thread Naoya Horiguchi

On 12/15/2017 06:33 PM, Michal Hocko wrote:
> Naoya,
> this has passed Mike's review (thanks for that!); you mentioned
> earlier that you can pass this through your testing machinery. While
> I've done some testing already, I would really appreciate it if you
> could do that as well. Review would be highly appreciated as well.

Sorry for my slow response. I reviewed/tested this patchset and it
looks good to me overall.

I have one comment on the code path from mbind(2).
The callback passed to migrate_pages() in do_mbind() (i.e. new_page())
calls alloc_huge_page_noerr(), which currently doesn't call
SetPageHugeTemporary(), so hugetlb migration fails when
h->surplus_huge_pages >= h->nr_overcommit_huge_pages.
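
(A quick userspace way to check whether a system is in that state; the 2MB
hugepage sysfs directory below is an assumption, and other page sizes have
their own directories:)

/*
 * Compare surplus_hugepages with nr_overcommit_hugepages: when
 * surplus >= nr_overcommit, the alloc_huge_page_noerr() path cannot
 * allocate a migration target page.
 */
#include <stdio.h>

static unsigned long read_ul(const char *path)
{
        FILE *f = fopen(path, "r");
        unsigned long v = 0;

        if (f) {
                if (fscanf(f, "%lu", &v) != 1)
                        v = 0;
                fclose(f);
        }
        return v;
}

int main(void)
{
        /* assumes 2MB huge pages; adjust the directory for other sizes */
        const char *dir = "/sys/kernel/mm/hugepages/hugepages-2048kB";
        char path[128];
        unsigned long surplus, overcommit;

        snprintf(path, sizeof(path), "%s/surplus_hugepages", dir);
        surplus = read_ul(path);
        snprintf(path, sizeof(path), "%s/nr_overcommit_hugepages", dir);
        overcommit = read_ul(path);

        printf("surplus=%lu nr_overcommit=%lu -> mbind migration %s\n",
               surplus, overcommit,
               surplus >= overcommit ? "would fail" : "has headroom");
        return 0;
}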

I don't think this is a bug, but it would be better if mbind(2) worked
more like other migration callers such as move_pages(2)/migrate_pages(2).

Thanks,
Naoya Horiguchi


> 
> Thanks!
> 
> On Mon 04-12-17 15:01:12, Michal Hocko wrote:
>> Hi,
>> this is a follow-up to [1] for the allocation API and [2] for the
>> hugetlb migration. It wasn't really easy to split those into two
>> separate patch series as they share some code.
>>
>> My primary motivation to touch this code is to make gigantic page
>> migration work. The giga pages allocation code is just too fragile
>> and hacked into the hugetlb code now. This series tries to move giga
>> pages closer to being a first-class citizen. We are not there yet, but
>> having 5 patches is quite a lot already, and it will already make the
>> code much easier to follow. I will come with other changes on top after
>> this sees some review.
>>
>> The first two patches should be trivial to review. The third patch
>> changes the way we migrate huge pages. Newly allocated pages are
>> subject to the overcommit check and participate in surplus accounting,
>> which is quite unfortunate, as the changelog explains. This patch
>> doesn't change anything wrt. giga pages.
>> Patch #4 removes the surplus accounting hack from
>> __alloc_surplus_huge_page.  I hope I didn't miss anything there and a
>> deeper review is really due there.
>> Patch #5 finally unifies the allocation paths, and giga pages
>> shouldn't be special anymore. There is also some renaming going on as
>> well.
>>
>> Shortlog
>> Michal Hocko (5):
>>   mm, hugetlb: unify core page allocation accounting and initialization
>>   mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
>>   mm, hugetlb: do not rely on overcommit limit during migration
>>   mm, hugetlb: get rid of surplus page accounting tricks
>>   mm, hugetlb: further simplify hugetlb allocation API
>>
>> Diffstat:
>>  include/linux/hugetlb.h |   3 +
>>  mm/hugetlb.c            | 305 +++-
>>  mm/migrate.c            |   3 +-
>>  3 files changed, 175 insertions(+), 136 deletions(-)
>>
>>
>> [1] http://lkml.kernel.org/r/20170622193034.28972-1-mho...@kernel.org
>> [2] http://lkml.kernel.org/r/20171122152832.iayefrlxbugph...@dhcp22.suse.cz

Re: [RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-15 Thread Michal Hocko
Naoya,
this has passed Mike's review (thanks for that!); you mentioned
earlier that you can pass this through your testing machinery. While
I've done some testing already, I would really appreciate it if you
could do that as well. Review would be highly appreciated as well.

Thanks!

On Mon 04-12-17 15:01:12, Michal Hocko wrote:
> Hi,
> this is a follow-up to [1] for the allocation API and [2] for the
> hugetlb migration. It wasn't really easy to split those into two
> separate patch series as they share some code.
> 
> My primary motivation to touch this code is to make gigantic page
> migration work. The giga pages allocation code is just too fragile
> and hacked into the hugetlb code now. This series tries to move giga
> pages closer to being a first-class citizen. We are not there yet, but
> having 5 patches is quite a lot already, and it will already make the
> code much easier to follow. I will come with other changes on top after
> this sees some review.
> 
> The first two patches should be trivial to review. The third patch
> changes the way we migrate huge pages. Newly allocated pages are
> subject to the overcommit check and participate in surplus accounting,
> which is quite unfortunate, as the changelog explains. This patch
> doesn't change anything wrt. giga pages.
> Patch #4 removes the surplus accounting hack from
> __alloc_surplus_huge_page.  I hope I didn't miss anything there and a
> deeper review is really due there.
> Patch #5 finally unifies the allocation paths, and giga pages
> shouldn't be special anymore. There is also some renaming going on as
> well.
> 
> Shortlog
> Michal Hocko (5):
>   mm, hugetlb: unify core page allocation accounting and initialization
>   mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
>   mm, hugetlb: do not rely on overcommit limit during migration
>   mm, hugetlb: get rid of surplus page accounting tricks
>   mm, hugetlb: further simplify hugetlb allocation API
> 
> Diffstat:
>  include/linux/hugetlb.h |   3 +
>  mm/hugetlb.c            | 305 +++-
>  mm/migrate.c            |   3 +-
>  3 files changed, 175 insertions(+), 136 deletions(-)
> 
> 
> [1] http://lkml.kernel.org/r/20170622193034.28972-1-mho...@kernel.org
> [2] http://lkml.kernel.org/r/20171122152832.iayefrlxbugph...@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs

[RFC PATCH 0/5] mm, hugetlb: allocation API and migration improvements

2017-12-04 Thread Michal Hocko
Hi,
this is a follow-up to [1] for the allocation API and [2] for the
hugetlb migration. It wasn't really easy to split those into two
separate patch series as they share some code.

My primary motivation to touch this code is to make gigantic page
migration work. The giga pages allocation code is just too fragile
and hacked into the hugetlb code now. This series tries to move giga
pages closer to being a first-class citizen. We are not there yet, but
having 5 patches is quite a lot already, and it will already make the
code much easier to follow. I will come with other changes on top after
this sees some review.

The first two patches should be trivial to review. The third patch
changes the way we migrate huge pages. Newly allocated pages are
subject to the overcommit check and participate in surplus accounting,
which is quite unfortunate, as the changelog explains. This patch
doesn't change anything wrt. giga pages.
Patch #4 removes the surplus accounting hack from
__alloc_surplus_huge_page.  I hope I didn't miss anything there and a
deeper review is really due there.
Patch #5 finally unifies the allocation paths, and giga pages shouldn't
be special anymore. There is also some renaming going on as well.

Shortlog
Michal Hocko (5):
  mm, hugetlb: unify core page allocation accounting and initialization
  mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
  mm, hugetlb: do not rely on overcommit limit during migration
  mm, hugetlb: get rid of surplus page accounting tricks
  mm, hugetlb: further simplify hugetlb allocation API

Diffstat:
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            | 305 +++-
 mm/migrate.c            |   3 +-
 3 files changed, 175 insertions(+), 136 deletions(-)


[1] http://lkml.kernel.org/r/20170622193034.28972-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20171122152832.iayefrlxbugph...@dhcp22.suse.cz
