Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-30 Thread Michal Hocko
On Thu 30-11-17 20:57:43, Michal Hocko wrote:
> On Thu 30-11-17 11:35:11, Mike Kravetz wrote:
> > On 11/29/2017 11:57 PM, Michal Hocko wrote:
> > > On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
> > >> On 11/29/2017 01:22 AM, Michal Hocko wrote:
> > >>> What about this on top. I haven't tested this yet though.
> > >>
> > >> Yes, this would work.
> > >>
> > >> However, I think a simple modification to your previous free_huge_page
> > >> changes would make this unnecessary.  I was confused in your previous
> > >> patch because you decremented the per-node surplus page count, but not
> > >> the global count.  I think it would have been correct (and made this
> > >> patch unnecessary) if you decremented the global counter there as well.
> > > 
> > > We cannot really increment the global counter because the overall number
> > > of surplus pages during migration doesn't increase.
> > 
> > I was not suggesting we increment the global surplus count.  Rather,
> > your previous patch should have decremented the global surplus count in
> > free_huge_page.  Something like:
> 
> Sorry, I meant to say decrement. The point is that the overall surplus
> count doesn't change after the migration. The only thing that _might_
> change is the per-node distribution of surplus pages. That is why I
> think we should handle that during the migration.

Let me clarify. The migration context is the only place where we have
both the old and the new page, so it sounds like the only place where we
can know that the per-node surplus state needs to be transferred.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-30 Thread Michal Hocko
On Thu 30-11-17 11:35:11, Mike Kravetz wrote:
> On 11/29/2017 11:57 PM, Michal Hocko wrote:
> > On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
> >> On 11/29/2017 01:22 AM, Michal Hocko wrote:
> >>> What about this on top. I haven't tested this yet though.
> >>
> >> Yes, this would work.
> >>
> >> However, I think a simple modification to your previous free_huge_page
> >> changes would make this unnecessary.  I was confused in your previous
> >> patch because you decremented the per-node surplus page count, but not
> >> the global count.  I think it would have been correct (and made this
> >> patch unnecessary) if you decremented the global counter there as well.
> > 
> > > We cannot really increment the global counter because the overall number
> > > of surplus pages during migration doesn't increase.
> 
> I was not suggesting we increment the global surplus count.  Rather,
> your previous patch should have decremented the global surplus count in
> free_huge_page.  Something like:

Sorry, I meant to say decrement. The point is that the overall surplus
count doesn't change after the migration. The only thing that _might_
change is the per-node distribution of surplus pages. That is why I
think we should handle that during the migration.

> @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>  
> -	if (h->surplus_huge_pages_node[nid]) {
> +	if (PageHugeTemporary(page)) {
> +		list_del(&page->lru);
> +		ClearPageHugeTemporary(page);
> +		update_and_free_page(h, page);
> +		if (h->surplus_huge_pages_node[nid]) {
> +			h->surplus_huge_pages--;
> +			h->surplus_huge_pages_node[nid]--;
> +		}
> +	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		list_del(&page->lru);
>  		update_and_free_page(h, page);
> 
> When we allocate one of these 'PageHugeTemporary' pages, we only increment
> the global and node-specific nr_huge_pages counters.  To me, this makes all
> the huge page counters the same as if there were simply one additional
> pre-allocated huge page.  This 'extra' (PageHugeTemporary) page will go
> away when free_huge_page is called.  So, my thought is that it is not
> necessary to transfer per-node counts from the original to the target node.
> Of course, I may be missing something.

The thing is that we do not know whether the original page is surplus
until the deallocation time.
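
For reference, the mainline free_huge_page of that era makes the surplus
decision only at free time; a condensed sketch, not the exact tree state:

	spin_lock(&hugetlb_lock);
	...
	if (h->surplus_huge_pages_node[nid]) {
		/*
		 * Nothing on the page itself says "surplus"; the per-node
		 * counter consulted here at deallocation time is the only
		 * record, which is why the original page's status is not
		 * known any earlier.
		 */
		list_del(&page->lru);
		update_and_free_page(h, page);
		h->surplus_huge_pages--;
		h->surplus_huge_pages_node[nid]--;
	} else {
		arch_clear_hugepage_flags(page);
		enqueue_huge_page(h, page);
	}
	spin_unlock(&hugetlb_lock);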

> When thinking about transferring per-node counts as is done in your latest
> patch, I took another look at all the per-node counts.  This may show my
> ignorance of huge page migration, but do we need to handle the case where
> the page being migrated is 'free'?  Is that possible?  If so, there will
> be a count for free_huge_pages_node and the page will be on the per-node
> hugepage_freelists that must be handled.

I do not understand. What do you mean by free? Sitting in the pool? I do
not think we ever try to migrate those. They simply do not have any
state to migrate. We could very well just allocate fresh pages on the
remote node and dissolve free ones. I am not sure we do that during
memory hotplug to preserve the pool size, and I am too tired to check
that now. This would be a different topic, I guess.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-30 Thread Mike Kravetz
On 11/29/2017 11:57 PM, Michal Hocko wrote:
> On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
>> On 11/29/2017 01:22 AM, Michal Hocko wrote:
>>> What about this on top. I haven't tested this yet though.
>>
>> Yes, this would work.
>>
>> However, I think a simple modification to your previous free_huge_page
>> changes would make this unnecessary.  I was confused in your previous
>> patch because you decremented the per-node surplus page count, but not
>> the global count.  I think it would have been correct (and made this
>> patch unnecessary) if you decremented the global counter there as well.
> 
> We cannot really increment the global counter because the overall number
> of surplus pages during migration doesn't increase.

I was not suggesting we increment the global surplus count.  Rather,
your previous patch should have decremented the global surplus count in
free_huge_page.  Something like:

@@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
-	if (h->surplus_huge_pages_node[nid]) {
+	if (PageHugeTemporary(page)) {
+		list_del(&page->lru);
+		ClearPageHugeTemporary(page);
+		update_and_free_page(h, page);
+		if (h->surplus_huge_pages_node[nid]) {
+			h->surplus_huge_pages--;
+			h->surplus_huge_pages_node[nid]--;
+		}
+	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		list_del(&page->lru);
 		update_and_free_page(h, page);

When we allocate one of these 'PageHugeTemporary' pages, we only increment
the global and node-specific nr_huge_pages counters.  To me, this makes all
the huge page counters the same as if there were simply one additional
pre-allocated huge page.  This 'extra' (PageHugeTemporary) page will go
away when free_huge_page is called.  So, my thought is that it is not
necessary to transfer per-node counts from the original to the target node.
Of course, I may be missing something.
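
A hypothetical worked trace of this scheme, with one in-use pool page on
node 1 and the migration target on node 0 (illustrative numbers only):

	/*
	 *                                  nr_huge_pages  nr[0]  nr[1]  surplus
	 * initial state (page in use)            1          0      1       0
	 * alloc PageHugeTemporary on node 0      2          1      1       0
	 * old page freed back to the pool        2          1      1       0
	 * task exits, Temporary page freed       1          0      1       0
	 *
	 * The Temporary page behaves like one extra pre-allocated page for
	 * the duration of the migration, and every counter balances again
	 * once free_huge_page drops it.
	 */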

When thinking about transferring per-node counts as is done in your latest
patch, I took another look at all the per-node counts.  This may show my
ignorance of huge page migration, but do we need to handle the case where
the page being migrated is 'free'?  Is that possible?  If so, there will
be a count for free_huge_pages_node and the page will be on the per-node
hugepage_freelists that must be handled.

-- 
Mike Kravetz


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Michal Hocko
On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
> On 11/29/2017 01:22 AM, Michal Hocko wrote:
> > What about this on top. I haven't tested this yet though.
> 
> Yes, this would work.
> 
> However, I think a simple modification to your previous free_huge_page
> changes would make this unnecessary.  I was confused in your previous
> patch because you decremented the per-node surplus page count, but not
> the global count.  I think it would have been correct (and made this
> patch unnecessary) if you decremented the global counter there as well.

We cannot really increment the global counter because the overall number
of surplus pages during migration doesn't increase.

> Of course, this patch makes the surplus accounting more explicit.
> 
> If we move forward with this patch, one issue below.
> 
> > ---
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 1b6d7783c717..f5fcd4e355dc 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> >  						long freed);
> >  bool isolate_huge_page(struct page *page, struct list_head *list);
> >  void putback_active_hugepage(struct page *page);
> > +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
> >  void free_huge_page(struct page *page);
> >  void hugetlb_fix_reserve_counts(struct inode *inode);
> >  extern struct mutex *hugetlb_fault_mutex_table;
> > @@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> >  	return false;
> >  }
> >  #define putback_active_hugepage(p)	do {} while (0)
> > +#define move_hugetlb_state(old, new, reason)	do {} while (0)
> >  
> >  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  		unsigned long address, unsigned long end, pgprot_t newprot)
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 037bf0f89463..30601c1c62f3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -34,6 +34,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <linux/page_owner.h>
> >  #include "internal.h"
> >  
> >  int hugetlb_max_hstate __read_mostly;
> > @@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page)
> >  	spin_unlock(&hugetlb_lock);
> >  	put_page(page);
> >  }
> > +
> > +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
> > +{
> > +	struct hstate *h = page_hstate(oldpage);
> > +
> > +	hugetlb_cgroup_migrate(oldpage, newpage);
> > +	set_page_owner_migrate_reason(newpage, reason);
> > +
> > +	/*
> > +	 * transfer temporary state of the new huge page. This is
> > +	 * reverse to other transitions because the newpage is going to
> > +	 * be final while the old one will be freed so it takes over
> > +	 * the temporary status.
> > +	 *
> > +	 * Also note that we have to transfer the per-node surplus state
> > +	 * here as well otherwise the global surplus count will not match
> > +	 * the per-node's.
> > +	 */
> > +	if (PageHugeTemporary(newpage)) {
> > +		int old_nid = page_to_nid(oldpage);
> > +		int new_nid = page_to_nid(newpage);
> > +
> > +		SetPageHugeTemporary(oldpage);
> > +		ClearPageHugeTemporary(newpage);
> > +
> > +		if (h->surplus_huge_pages_node[old_nid]) {
> > +			h->surplus_huge_pages_node[old_nid]--;
> > +			h->surplus_huge_pages_node[new_nid]++;
> > +		}
> 
> You need to take hugetlb_lock before adjusting the surplus counts.

You are right. Actually, moving the code to hugetlb.c was exactly so
that I didn't have to take the lock outside of hugetlb proper. I just
forgot to add it here. Thanks for spotting it.
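
A sketch of how the fixed hunk would presumably look, with the counter
update moved under hugetlb_lock (illustrative; the respin may differ):

	if (PageHugeTemporary(newpage)) {
		int old_nid = page_to_nid(oldpage);
		int new_nid = page_to_nid(newpage);

		SetPageHugeTemporary(oldpage);
		ClearPageHugeTemporary(newpage);

		/* the surplus counters are protected by hugetlb_lock */
		spin_lock(&hugetlb_lock);
		if (h->surplus_huge_pages_node[old_nid]) {
			h->surplus_huge_pages_node[old_nid]--;
			h->surplus_huge_pages_node[new_nid]++;
		}
		spin_unlock(&hugetlb_lock);
	}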
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Mike Kravetz
On 11/29/2017 01:22 AM, Michal Hocko wrote:
> What about this on top. I haven't tested this yet though.

Yes, this would work.

However, I think a simple modification to your previous free_huge_page
changes would make this unnecessary.  I was confused in your previous
patch because you decremented the per-node surplus page count, but not
the global count.  I think it would have been correct (and made this
patch unnecessary) if you decremented the global counter there as well.

Of course, this patch makes the surplus accounting more explicit.

If we move forward with this patch, one issue below.

> ---
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1b6d7783c717..f5fcd4e355dc 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  						long freed);
>  bool isolate_huge_page(struct page *page, struct list_head *list);
>  void putback_active_hugepage(struct page *page);
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
>  void free_huge_page(struct page *page);
>  void hugetlb_fix_reserve_counts(struct inode *inode);
>  extern struct mutex *hugetlb_fault_mutex_table;
> @@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
>  	return false;
>  }
>  #define putback_active_hugepage(p)	do {} while (0)
> +#define move_hugetlb_state(old, new, reason)	do {} while (0)
>  
>  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end, pgprot_t newprot)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 037bf0f89463..30601c1c62f3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/page_owner.h>
>  #include "internal.h"
>  
>  int hugetlb_max_hstate __read_mostly;
> @@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  	put_page(page);
>  }
> +
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
> +{
> +	struct hstate *h = page_hstate(oldpage);
> +
> +	hugetlb_cgroup_migrate(oldpage, newpage);
> +	set_page_owner_migrate_reason(newpage, reason);
> +
> +	/*
> +	 * transfer temporary state of the new huge page. This is
> +	 * reverse to other transitions because the newpage is going to
> +	 * be final while the old one will be freed so it takes over
> +	 * the temporary status.
> +	 *
> +	 * Also note that we have to transfer the per-node surplus state
> +	 * here as well otherwise the global surplus count will not match
> +	 * the per-node's.
> +	 */
> +	if (PageHugeTemporary(newpage)) {
> +		int old_nid = page_to_nid(oldpage);
> +		int new_nid = page_to_nid(newpage);
> +
> +		SetPageHugeTemporary(oldpage);
> +		ClearPageHugeTemporary(newpage);
> +
> +		if (h->surplus_huge_pages_node[old_nid]) {
> +			h->surplus_huge_pages_node[old_nid]--;
> +			h->surplus_huge_pages_node[new_nid]++;
> +		}

You need to take hugetlb_lock before adjusting the surplus counts.

-- 
Mike Kravetz


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Michal Hocko
On Wed 29-11-17 10:22:34, Michal Hocko wrote:
> What about this on top. I haven't tested this yet though.

OK, it seems to work:
root@test1:~# echo 1 > /proc/sys/vm/nr_hugepages
root@test1:~# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages

root@test1:~# grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/*
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

# mmap 2 huge pages

root@test1:~# grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/*
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:2
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:1

root@test1:~# migratepages $(pidof map_hugetlb) 1 0

root@test1:~# grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/*
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:2
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

and exit the mmap

root@test1:~# grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/*
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:1
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Michal Hocko
On Tue 28-11-17 15:12:11, Michal Hocko wrote:
[...]
> +/*
> + * Internal hugetlb specific page flag. Do not use outside of the hugetlb
> + * code
> + */
> +static inline bool PageHugeTemporary(struct page *page)
> +{
> + if (!PageHuge(page))
> + return false;
> +
> + return page[2].flags == -1U;
> +}
> +
> +static inline void SetPageHugeTemporary(struct page *page)
> +{
> + page[2].flags = -1U;
> +}
> +
> +static inline void ClearPageHugeTemporary(struct page *page)
> +{
> + page[2].flags = 0;
> +}

Oops, this is obviously not OK. I was just lucky not to hit BUG_ONs,
because I am clearly overwriting the node/zone data in flags. I will
have to find something else to abuse. I will go with my favorite, the
mapping pointer, which is not used at all.
---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1be43563e226..db7544a0b7b6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1259,17 +1259,17 @@ static inline bool PageHugeTemporary(struct page *page)
 	if (!PageHuge(page))
 		return false;
 
-	return page[2].flags == -1U;
+	return (unsigned long)page[2].mapping == -1U;
 }
 
 static inline void SetPageHugeTemporary(struct page *page)
 {
-	page[2].flags = -1U;
+	page[2].mapping = (void *)-1U;
 }
 
 static inline void ClearPageHugeTemporary(struct page *page)
 {
-	page[2].flags = 0;
+	page[2].mapping = NULL;
 }
 
 void free_huge_page(struct page *page)
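
For background, a rough sketch of why the flags variant was bound to hit
the node/zone checks (layout from memory; see page-flags-layout.h for
the authoritative version):

	/*
	 * The top bits of page->flags encode placement information,
	 * roughly:
	 *
	 *	| SECTION | NODE | ZONE | ... | flag bits |
	 *
	 * e.g. page_to_nid() is (page->flags >> NODES_PGSHIFT) & NODES_MASK.
	 * Assigning -1U rewrites the whole word, so it clobbers the
	 * node/zone encoding of the tail page (and sets a pile of unrelated
	 * flag bits). The mapping pointer of a tail page is genuinely
	 * unused, hence a safer field to abuse.
	 */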
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Michal Hocko
On Wed 29-11-17 10:22:34, Michal Hocko wrote:
> What about this on top. I haven't tested this yet though.
> ---

Obviously, we will also need to drop the surplus_huge_pages_node
handling from the free path:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1be43563e226..756833f9ef8b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1312,8 +1312,6 @@ void free_huge_page(struct page *page)
 		list_del(&page->lru);
 		ClearPageHugeTemporary(page);
 		update_and_free_page(h, page);
-		if (h->surplus_huge_pages_node[nid])
-			h->surplus_huge_pages_node[nid]--;
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		list_del(&page->lru);
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Michal Hocko
What about this on top. I haven't tested this yet though.
---
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1b6d7783c717..f5fcd4e355dc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
+void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
@@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
 	return false;
 }
 #define putback_active_hugepage(p)	do {} while (0)
+#define move_hugetlb_state(old, new, reason)	do {} while (0)
 
 static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 037bf0f89463..30601c1c62f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <linux/page_owner.h>
 #include "internal.h"
 
 int hugetlb_max_hstate __read_mostly;
@@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page)
 	spin_unlock(&hugetlb_lock);
 	put_page(page);
 }
+
+void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
+{
+	struct hstate *h = page_hstate(oldpage);
+
+	hugetlb_cgroup_migrate(oldpage, newpage);
+	set_page_owner_migrate_reason(newpage, reason);
+
+	/*
+	 * transfer temporary state of the new huge page. This is
+	 * reverse to other transitions because the newpage is going to
+	 * be final while the old one will be freed so it takes over
+	 * the temporary status.
+	 *
+	 * Also note that we have to transfer the per-node surplus state
+	 * here as well otherwise the global surplus count will not match
+	 * the per-node's.
+	 */
+	if (PageHugeTemporary(newpage)) {
+		int old_nid = page_to_nid(oldpage);
+		int new_nid = page_to_nid(newpage);
+
+		SetPageHugeTemporary(oldpage);
+		ClearPageHugeTemporary(newpage);
+
+		if (h->surplus_huge_pages_node[old_nid]) {
+			h->surplus_huge_pages_node[old_nid]--;
+			h->surplus_huge_pages_node[new_nid]++;
+		}
+	}
+}
diff --git a/mm/migrate.c b/mm/migrate.c
index b3345f8174a9..1e5525a25691 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1323,22 +1323,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	put_anon_vma(anon_vma);
 
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		hugetlb_cgroup_migrate(hpage, new_hpage);
+		move_hugetlb_state(hpage, new_hpage, reason);
 		put_new_page = NULL;
-		set_page_owner_migrate_reason(new_hpage, reason);
-
-		/*
-		 * transfer temporary state of the new huge page. This is
-		 * reverse to other transitions because the newpage is going to
-		 * be final while the old one will be freed so it takes over
-		 * the temporary status.
-		 * No need for any locking here because destructor cannot race
-		 * with us.
-		 */
-		if (PageHugeTemporary(new_hpage)) {
-			SetPageHugeTemporary(hpage);
-			ClearPageHugeTemporary(new_hpage);
-		}
 	}
 
 	unlock_page(hpage);
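
The invariant the comment above is protecting could be spelled out as a
hypothetical debug check (it would have to run under hugetlb_lock):

static void check_surplus_invariant(struct hstate *h)
{
	unsigned long sum = 0;
	int nid;

	/* caller must hold hugetlb_lock */
	for_each_node_mask(nid, node_states[N_MEMORY])
		sum += h->surplus_huge_pages_node[nid];
	VM_WARN_ON(sum != h->surplus_huge_pages);
}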
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-28 Thread Michal Hocko
On Tue 28-11-17 17:39:50, Mike Kravetz wrote:
> On 11/28/2017 06:12 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > hugepage migration relies on __alloc_buddy_huge_page to get a new page.
> > This has 2 main disadvantages.
> > 1) it doesn't allow migrating any huge page if the pool is completely
> > used, which is not an exceptional case as the pool is static and
> > unused memory is just wasted.
> > 2) it leads to a weird semantic when migration between two NUMA nodes
> > might increase the pool size of the destination NUMA node while the page
> > is in use. The issue is caused by the per-NUMA-node surplus pages
> > tracking (see free_huge_page).
> > 
> > Address both issues by changing the way we allocate and account pages
> > allocated for migration. Those should be temporary by definition.
> > So we mark them that way (we will abuse page flags in the 3rd page)
> > and update free_huge_page to free such pages to the page allocator.
> > The page migration path then just transfers the temporary status from
> > the new page to the old one, which will be freed on the last reference.
> 
> In general, I think this will work.  Some questions below.
> 
> > The global surplus count will never change during this path but we still
> > have to be careful when freeing a page from a node with surplus pages
> > on the node.
> 
> Not sure about the "freeing page from a node with surplus pages" comment.
> If allocating PageHugeTemporary pages does not adjust surplus counts, then
> there should be no concern at the time of freeing.
> 
> Could this comment be a hold over from a previous implementation attempt?
> 

Not really. You have to realize that the original page could be surplus
on its node. More on that below.

[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8189c92fac82..037bf0f89463 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
> >  	if (restore_reserve)
> >  		h->resv_huge_pages++;
> >  
> > -	if (h->surplus_huge_pages_node[nid]) {
> > +	if (PageHugeTemporary(page)) {
> > +		list_del(&page->lru);
> > +		ClearPageHugeTemporary(page);
> > +		update_and_free_page(h, page);
> > +		if (h->surplus_huge_pages_node[nid])
> > +			h->surplus_huge_pages_node[nid]--;
> 
> I think this is not correct.  Should the lines dealing with per-node
> surplus counts even be here?  If the lines above are correct, then it
> implies that the sum of per node surplus counts could exceed (or get out
> of sync with) the global surplus count.

You are right, I guess. This per-node accounting makes the whole thing a
real pain. I am worried that we will free the next page from the same
node and reduce the overall pool size. I will think about it some more.
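
To make the problem with the hunk above concrete, a hypothetical
sequence where node 0 holds the only surplus page and the Temporary page
is also freed on node 0:

	/*
	 *                                      surplus  surplus[0]
	 * before free_huge_page()                  1         1
	 * free Temporary page (hunk above)         1         0
	 *
	 * The per-node counter is decremented but the global one is not,
	 * so sum(surplus_huge_pages_node[]) no longer matches
	 * surplus_huge_pages.
	 */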

> > +	} else if (h->surplus_huge_pages_node[nid]) {
> >  		/* remove the page from active list */
> >  		list_del(&page->lru);
> >  		update_and_free_page(h, page);
> > @@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> >  	return rc;
> >  }
> >  
> > -static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
> > +/*
> > + * Allocates a fresh surplus page from the page allocator. Temporary
> > + * requests (e.g. page migration) can pass enforce_overcommit == false
> 
> 'enforce_overcommit == false' perhaps part of an earlier implementation
> attempt?

yeah.

[...]

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 4d0be47a322a..b3345f8174a9 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >  		hugetlb_cgroup_migrate(hpage, new_hpage);
> >  		put_new_page = NULL;
> >  		set_page_owner_migrate_reason(new_hpage, reason);
> > +
> > +		/*
> > +		 * transfer temporary state of the new huge page. This is
> > +		 * reverse to other transitions because the newpage is going to
> > +		 * be final while the old one will be freed so it takes over
> > +		 * the temporary status.
> > +		 * No need for any locking here because destructor cannot race
> > +		 * with us.
> > +		 */
> > +		if (PageHugeTemporary(new_hpage)) {
> > +			SetPageHugeTemporary(hpage);
> > +			ClearPageHugeTemporary(new_hpage);
> > +		}
> >  	}
> >  
> >  	unlock_page(hpage);
> > 
> 
> I'm still trying to wrap my head around all the different scenarios.
> In general, this new code only 'kicks in' if there is not a free
> pre-allocated huge page for migration.  Right?

yes

> So, if there are free huge pages they are 'consumed' during migration
> and the number of available pre-allocated huge pages is reduced?  Or,
> is that not exactly how it works?  Or does it depend in the purpose
> of the migration?

Well, if we have pre-allocated 

Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-28 Thread Mike Kravetz
On 11/28/2017 06:12 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> hugepage migration relies on __alloc_buddy_huge_page to get a new page.
> This has 2 main disadvantages.
> 1) it doesn't allow migrating any huge page if the pool is completely
> used, which is not an exceptional case as the pool is static and
> unused memory is just wasted.
> 2) it leads to a weird semantic when migration between two NUMA nodes
> might increase the pool size of the destination NUMA node while the page
> is in use. The issue is caused by the per-NUMA-node surplus pages
> tracking (see free_huge_page).
> 
> Address both issues by changing the way we allocate and account pages
> allocated for migration. Those should be temporary by definition.
> So we mark them that way (we will abuse page flags in the 3rd page)
> and update free_huge_page to free such pages to the page allocator.
> The page migration path then just transfers the temporary status from
> the new page to the old one, which will be freed on the last reference.

In general, I think this will work.  Some questions below.

> The global surplus count will never change during this path but we still
> have to be careful when freeing a page from a node with surplus pages
> on the node.

Not sure about the "freeing page from a node with surplus pages" comment.
If allocating PageHugeTemporary pages does not adjust surplus counts, then
there should be no concern at the time of freeing.

Could this comment be a hold over from a previous implementation attempt?

> 
> Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
> reflect its purpose. The new allocation routine for the migration path
> is __alloc_migrate_huge_page.
> 
> The user visible effect of this patch is that migrated pages are really
> temporal and they travel between NUMA nodes as per the migration
> request:
> Before migration
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
> 
> After
> 
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
> 
> with the previous implementation, both nodes would have nr_hugepages:1
> until the page is freed.
> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/hugetlb.h | 35 +
>  mm/hugetlb.c| 58 
> +++--
>  mm/migrate.c| 13 +++
>  3 files changed, 90 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6e3696c7b35a..1b6d7783c717 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -157,8 +157,43 @@ unsigned long hugetlb_change_protection(struct 
> vm_area_struct *vma,
>   unsigned long address, unsigned long end, pgprot_t newprot);
>  
>  bool is_hugetlb_entry_migration(pte_t pte);
> +
> +/*
> + * Internal hugetlb specific page flag. Do not use outside of the hugetlb
> + * code
> + */
> +static inline bool PageHugeTemporary(struct page *page)
> +{
> + if (!PageHuge(page))
> + return false;
> +
> + return page[2].flags == -1U;
> +}
> +
> +static inline void SetPageHugeTemporary(struct page *page)
> +{
> + page[2].flags = -1U;
> +}
> +
> +static inline void ClearPageHugeTemporary(struct page *page)
> +{
> + page[2].flags = 0;
> +}
>  #else /* !CONFIG_HUGETLB_PAGE */
>  
> +static inline bool PageHugeTemporary(struct page *page)
> +{
> + return false;
> +}
> +
> +static inline void SetPageHugeTemporary(struct page *page)
> +{
> +}
> +
> +static inline void ClearPageHugeTemporary(struct page *page)
> +{
> +}
> +
>  static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
>  {
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8189c92fac82..037bf0f89463 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
>   if (restore_reserve)
>   h->resv_huge_pages++;
>  
> - if (h->surplus_huge_pages_node[nid]) {
> + if (PageHugeTemporary(page)) {
> + list_del(>lru);
> + ClearPageHugeTemporary(page);
> + update_and_free_page(h, page);
> + if 

Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-28 Thread Mike Kravetz
On 11/28/2017 06:12 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> hugepage migration relies on __alloc_buddy_huge_page to get a new page.
> This has 2 main disadvantages.
> 1) it doesn't allow migrating any huge page if the pool is completely
> used, which is not an exceptional case as the pool is static and unused
> memory is just wasted.
> 2) it leads to weird semantics when migration between two NUMA nodes
> might increase the pool size of the destination NUMA node while the page
> is in use. The issue is caused by per-NUMA-node surplus page tracking
> (see free_huge_page).
> 
> Address both issues by changing the way we allocate and account pages
> used for migration. Those should be temporary by definition. So we mark
> them that way (we will abuse page flags in the 3rd page) and update
> free_huge_page to free such pages to the page allocator. The page
> migration path then just transfers the temporary status from the new
> page to the old one, which will be freed on the last reference.

In general, I think this will work.  Some questions below.

> The global surplus count will never change during this path, but we
> still have to be careful when freeing a page from a node with surplus
> pages.

Not sure about the "freeing a page from a node with surplus pages" comment.
If allocating PageHugeTemporary pages does not adjust surplus counts, then
there should be no concern at the time of freeing.

Could this comment be a holdover from a previous implementation attempt?

> 
> Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
> reflect its purpose. The new allocation routine for the migration path
> is __alloc_migrate_huge_page.
> 
> The user-visible effect of this patch is that migrated pages are really
> temporary and they travel between NUMA nodes as per the migration
> request:
> Before migration
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
> 
> After
> 
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
> 
> with the previous implementation, both nodes would have nr_hugepages:1
> until the page is freed.
> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/hugetlb.h | 35 +
>  mm/hugetlb.c            | 58 +++--
>  mm/migrate.c            | 13 +++
>  3 files changed, 90 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6e3696c7b35a..1b6d7783c717 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -157,8 +157,43 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>   unsigned long address, unsigned long end, pgprot_t newprot);
>  
>  bool is_hugetlb_entry_migration(pte_t pte);
> +
> +/*
> + * Internal hugetlb specific page flag. Do not use outside of the hugetlb
> + * code
> + */
> +static inline bool PageHugeTemporary(struct page *page)
> +{
> + if (!PageHuge(page))
> + return false;
> +
> + return page[2].flags == -1U;
> +}
> +
> +static inline void SetPageHugeTemporary(struct page *page)
> +{
> + page[2].flags = -1U;
> +}
> +
> +static inline void ClearPageHugeTemporary(struct page *page)
> +{
> + page[2].flags = 0;
> +}
>  #else /* !CONFIG_HUGETLB_PAGE */
>  
> +static inline bool PageHugeTemporary(struct page *page)
> +{
> + return false;
> +}
> +
> +static inline void SetPageHugeTemporary(struct page *page)
> +{
> +}
> +
> +static inline void ClearPageHugeTemporary(struct page *page)
> +{
> +}
> +
>  static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
>  {
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8189c92fac82..037bf0f89463 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
>   if (restore_reserve)
>   h->resv_huge_pages++;
>  
> - if (h->surplus_huge_pages_node[nid]) {
> + if (PageHugeTemporary(page)) {
> + list_del(&page->lru);
> + ClearPageHugeTemporary(page);
> + update_and_free_page(h, page);
> + if (h->surplus_huge_pages_node[nid])
> + 

[PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-28 Thread Michal Hocko
From: Michal Hocko 

hugepage migration relies on __alloc_buddy_huge_page to get a new page.
This has 2 main disadvantages.
1) it doesn't allow migrating any huge page if the pool is completely
used, which is not an exceptional case as the pool is static and unused
memory is just wasted.
2) it leads to weird semantics when migration between two NUMA nodes
might increase the pool size of the destination NUMA node while the page
is in use. The issue is caused by per-NUMA-node surplus page tracking
(see free_huge_page).

Address both issues by changing the way we allocate and account pages
used for migration. Those should be temporary by definition. So we mark
them that way (we will abuse page flags in the 3rd page) and update
free_huge_page to free such pages to the page allocator. The page
migration path then just transfers the temporary status from the new
page to the old one, which will be freed on the last reference.
The global surplus count will never change during this path, but we
still have to be careful when freeing a page from a node with surplus
pages.

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
reflect its purpose. The new allocation routine for the migration path
is __alloc_migrate_huge_page.

The user-visible effect of this patch is that migrated pages are really
temporary and they travel between NUMA nodes as per the migration
request:
Before migration
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After

/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

with the previous implementation, both nodes would have nr_hugepages:1
until the page is freed.

Signed-off-by: Michal Hocko 
---
 include/linux/hugetlb.h | 35 +
 mm/hugetlb.c            | 58 +++--
 mm/migrate.c            | 13 +++
 3 files changed, 90 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6e3696c7b35a..1b6d7783c717 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -157,8 +157,43 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
 
 bool is_hugetlb_entry_migration(pte_t pte);
+
+/*
+ * Internal hugetlb specific page flag. Do not use outside of the hugetlb
+ * code
+ */
+static inline bool PageHugeTemporary(struct page *page)
+{
+   if (!PageHuge(page))
+   return false;
+
+   return page[2].flags == -1U;
+}
+
+static inline void SetPageHugeTemporary(struct page *page)
+{
+   page[2].flags = -1U;
+}
+
+static inline void ClearPageHugeTemporary(struct page *page)
+{
+   page[2].flags = 0;
+}
 #else /* !CONFIG_HUGETLB_PAGE */
 
+static inline bool PageHugeTemporary(struct page *page)
+{
+   return false;
+}
+
+static inline void SetPageHugeTemporary(struct page *page)
+{
+}
+
+static inline void ClearPageHugeTemporary(struct page *page)
+{
+}
+
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8189c92fac82..037bf0f89463 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;
 
-   if (h->surplus_huge_pages_node[nid]) {
+   if (PageHugeTemporary(page)) {
+   list_del(&page->lru);
+   ClearPageHugeTemporary(page);
+   update_and_free_page(h, page);
+   if (h->surplus_huge_pages_node[nid])
+   h->surplus_huge_pages_node[nid]--;
+   } else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
list_del(>lru);
update_and_free_page(h, page);
@@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
return rc;
 }
 
-static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
+/*
+ * Allocates a fresh surplus page from the page allocator. Temporary
+ * requests (e.g. page migration) can pass enforce_overcommit == false
+ */
+static
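
The archive truncates the patch here, just as the renamed allocator
begins. For completeness, a sketch of the new migration allocator the
changelog describes: a hypothetical reconstruction, not the literal
remainder of the diff (the low-level buddy allocation helper name is
assumed):

static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
		int nid, nodemask_t *nmask)
{
	struct page *page;

	if (hstate_is_gigantic(h))
		return NULL;

	page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
	if (!page)
		return NULL;

	/*
	 * Do not account this page as surplus. It is only temporary and
	 * will be freed back to the page allocator on the last reference
	 * (see the free_huge_page change above).
	 */
	SetPageHugeTemporary(page);

	return page;
}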