Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-10 Thread Liang Li
> > > Please don't use this email address for me anymore. Either use
> > > alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
> > > bounces when I reply to this thread because of the old address.
> >
> > No problem.
> >
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index eb533995cb49..0fccd5f96954 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct 
> > > > vm_area_struct *vma,
> > > > goto out_uncharge_cgroup_reservation;
> > > >
> > > > spin_lock(_lock);
> > > > +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> > > > +   spin_unlock(_lock);
> > > > +   mutex_lock(>mtx_prezero);
> > > > +   mutex_unlock(>mtx_prezero);
> > > > +   spin_lock(_lock);
> > > > +   }
> > >
> > > This seems like a bad idea. It kind of defeats the whole point of
> > > doing the page zeroing outside of the hugetlb_lock. Also it is
> > > operating on the assumption that the only way you might get a page is
> > > from the page zeroing logic.
> > >
> > > With the page reporting code we wouldn't drop the count to zero. We
> > > had checks that were going through and monitoring the watermarks and
> > > if we started to hit the low watermark we would stop page reporting
> > > and just assume there aren't enough pages to report. You might need to
> > > look at doing something similar here so that you can avoid colliding
> > > with the allocator.
> >
> > For hugetlb, things are a little different, Just like Mike points out:
> >  "On some systems, hugetlb pages are a precious resource and
> >   the sysadmin carefully configures the number needed by
> >   applications.  Removing a hugetlb page (even for a very short
> >   period of time) could cause serious application failure."
> >
> > Just keeping some pages in the freelist is not enough to prevent that from
> > happening, because these pages may be allocated while zero out is on
> > going, and application may still run into a situation for not available free
> > pages.
>
> I get what you are saying. However I don't know if it is acceptable
> for the allocating thread to be put to sleep in this situation. There
> are two scenarios where I can see this being problematic.
>
> One is a setup where you put the page allocator to sleep and while it
> is sleeping another thread is then freeing a page and your thread
> cannot respond to that newly freed page and is stuck waiting on the
> zeroed page.
>
> The second issue is that users may want a different option of just
> breaking up the request into smaller pages rather than waiting on the
> page zeroing, or to do something else while waiting on the page. So
> instead of sitting on the request and waiting it might make more sense
> to return an error pointer like EAGAIN or EBUSY to indicate that there
> is a page there, but it is momentarily tied up.

It seems returning EAGAIN or EBUSY will still change the application's
behavior,  I am not sure if it's acceptable.

Thanks
Liang


Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-07 Thread Alexander Duyck
On Wed, Jan 6, 2021 at 7:57 PM Liang Li  wrote:
>
> > > Page reporting isolates free pages temporarily when reporting
> > > free pages information. It will reduce the actual free pages
> > > and may cause application failed for no enough available memory.
> > > This patch try to solve this issue, when there is no free page
> > > and page repoting is on going, wait until it is done.
> > >
> > > Cc: Alexander Duyck 
> >
> > Please don't use this email address for me anymore. Either use
> > alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
> > bounces when I reply to this thread because of the old address.
>
> No problem.
>
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index eb533995cb49..0fccd5f96954 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct 
> > > *vma,
> > > goto out_uncharge_cgroup_reservation;
> > >
> > > spin_lock(_lock);
> > > +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> > > +   spin_unlock(_lock);
> > > +   mutex_lock(>mtx_prezero);
> > > +   mutex_unlock(>mtx_prezero);
> > > +   spin_lock(_lock);
> > > +   }
> >
> > This seems like a bad idea. It kind of defeats the whole point of
> > doing the page zeroing outside of the hugetlb_lock. Also it is
> > operating on the assumption that the only way you might get a page is
> > from the page zeroing logic.
> >
> > With the page reporting code we wouldn't drop the count to zero. We
> > had checks that were going through and monitoring the watermarks and
> > if we started to hit the low watermark we would stop page reporting
> > and just assume there aren't enough pages to report. You might need to
> > look at doing something similar here so that you can avoid colliding
> > with the allocator.
>
> For hugetlb, things are a little different, Just like Mike points out:
>  "On some systems, hugetlb pages are a precious resource and
>   the sysadmin carefully configures the number needed by
>   applications.  Removing a hugetlb page (even for a very short
>   period of time) could cause serious application failure."
>
> Just keeping some pages in the freelist is not enough to prevent that from
> happening, because these pages may be allocated while zero out is on
> going, and application may still run into a situation for not available free
> pages.

I get what you are saying. However I don't know if it is acceptable
for the allocating thread to be put to sleep in this situation. There
are two scenarios where I can see this being problematic.

One is a setup where you put the page allocator to sleep and while it
is sleeping another thread is then freeing a page and your thread
cannot respond to that newly freed page and is stuck waiting on the
zeroed page.

The second issue is that users may want a different option of just
breaking up the request into smaller pages rather than waiting on the
page zeroing, or to do something else while waiting on the page. So
instead of sitting on the request and waiting it might make more sense
to return an error pointer like EAGAIN or EBUSY to indicate that there
is a page there, but it is momentarily tied up.


Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-06 Thread Liang Li
> > Page reporting isolates free pages temporarily when reporting
> > free pages information. It will reduce the actual free pages
> > and may cause application failed for no enough available memory.
> > This patch try to solve this issue, when there is no free page
> > and page repoting is on going, wait until it is done.
> >
> > Cc: Alexander Duyck 
>
> Please don't use this email address for me anymore. Either use
> alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
> bounces when I reply to this thread because of the old address.

No problem.

> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index eb533995cb49..0fccd5f96954 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct 
> > *vma,
> > goto out_uncharge_cgroup_reservation;
> >
> > spin_lock(_lock);
> > +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> > +   spin_unlock(_lock);
> > +   mutex_lock(>mtx_prezero);
> > +   mutex_unlock(>mtx_prezero);
> > +   spin_lock(_lock);
> > +   }
>
> This seems like a bad idea. It kind of defeats the whole point of
> doing the page zeroing outside of the hugetlb_lock. Also it is
> operating on the assumption that the only way you might get a page is
> from the page zeroing logic.
>
> With the page reporting code we wouldn't drop the count to zero. We
> had checks that were going through and monitoring the watermarks and
> if we started to hit the low watermark we would stop page reporting
> and just assume there aren't enough pages to report. You might need to
> look at doing something similar here so that you can avoid colliding
> with the allocator.

For hugetlb, things are a little different, Just like Mike points out:
 "On some systems, hugetlb pages are a precious resource and
  the sysadmin carefully configures the number needed by
  applications.  Removing a hugetlb page (even for a very short
  period of time) could cause serious application failure."

Just keeping some pages in the freelist is not enough to prevent that from
happening, because these pages may be allocated while zero out is on
going, and application may still run into a situation for not available free
pages.

Thanks
Liang


Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-06 Thread Alexander Duyck
On Tue, Jan 5, 2021 at 7:50 PM Liang Li  wrote:
>
> Page reporting isolates free pages temporarily when reporting
> free pages information. It will reduce the actual free pages
> and may cause application failed for no enough available memory.
> This patch try to solve this issue, when there is no free page
> and page repoting is on going, wait until it is done.
>
> Cc: Alexander Duyck 

Please don't use this email address for me anymore. Either use
alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
bounces when I reply to this thread because of the old address.

> Cc: Mel Gorman 
> Cc: Andrea Arcangeli 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Alex Williamson 
> Cc: Michael S. Tsirkin 
> Cc: Liang Li 
> Signed-off-by: Liang Li 
> ---
>  include/linux/hugetlb.h | 2 ++
>  mm/hugetlb.c| 9 +
>  mm/page_reporting.c | 6 +-
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d55e6a00b3dc..73b2934ba91c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -490,6 +490,7 @@ struct hstate {
> unsigned long resv_huge_pages;
> unsigned long surplus_huge_pages;
> unsigned long nr_overcommit_huge_pages;
> +   unsigned long isolated_huge_pages;
> struct list_head hugepage_activelist;
> struct list_head hugepage_freelists[MAX_NUMNODES];
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> @@ -500,6 +501,7 @@ struct hstate {
> struct cftype cgroup_files_dfl[7];
> struct cftype cgroup_files_legacy[9];
>  #endif
> +   struct mutex mtx_prezero;
> char name[HSTATE_NAME_LEN];
>  };
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index eb533995cb49..0fccd5f96954 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct 
> *vma,
> goto out_uncharge_cgroup_reservation;
>
> spin_lock(_lock);
> +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> +   spin_unlock(_lock);
> +   mutex_lock(>mtx_prezero);
> +   mutex_unlock(>mtx_prezero);
> +   spin_lock(_lock);
> +   }

This seems like a bad idea. It kind of defeats the whole point of
doing the page zeroing outside of the hugetlb_lock. Also it is
operating on the assumption that the only way you might get a page is
from the page zeroing logic.

With the page reporting code we wouldn't drop the count to zero. We
had checks that were going through and monitoring the watermarks and
if we started to hit the low watermark we would stop page reporting
and just assume there aren't enough pages to report. You might need to
look at doing something similar here so that you can avoid colliding
with the allocator.


> /*
>  * glb_chg is passed to indicate whether or not a page must be taken
>  * from the global free pool (global change).  gbl_chg == 0 indicates
> @@ -3208,6 +3214,7 @@ void __init hugetlb_add_hstate(unsigned int order)
> INIT_LIST_HEAD(>hugepage_activelist);
> h->next_nid_to_alloc = first_memory_node;
> h->next_nid_to_free = first_memory_node;
> +   mutex_init(>mtx_prezero);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
>
> @@ -5541,6 +5548,7 @@ void isolate_free_huge_page(struct page *page, struct 
> hstate *h, int nid)
>
> list_move(>lru, >hugepage_activelist);
> set_page_refcounted(page);
> +   h->isolated_huge_pages++;
>  }
>
>  void putback_isolate_huge_page(struct hstate *h, struct page *page)
> @@ -5548,6 +5556,7 @@ void putback_isolate_huge_page(struct hstate *h, struct 
> page *page)
> int nid = page_to_nid(page);
>
> list_move(>lru, >hugepage_freelists[nid]);
> +   h->isolated_huge_pages--;
>  }
>
>  bool isolate_huge_page(struct page *page, struct list_head *list)
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index cc31696225bb..99e1e688d7c1 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -272,12 +272,15 @@ hugepage_reporting_process_hstate(struct 
> page_reporting_dev_info *prdev,
> int ret = 0, nid;
>
> offset = max_items;
> +   mutex_lock(>mtx_prezero);
> for (nid = 0; nid < MAX_NUMNODES; nid++) {
> ret = hugepage_reporting_cycle(prdev, h, nid, sgl, ,
>max_items);
>
> -   if (ret < 0)
> +   if (ret < 0) {
> +   mutex_unlock(>mtx_prezero);
> return ret;
> +   }
> }
>
> /* report the leftover pages before going idle */
> @@ -291,6 +294,7 @@ hugepage_reporting_process_hstate(struct 
> page_reporting_dev_info *prdev,
> 

[PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-05 Thread Liang Li
Page reporting isolates free pages temporarily when reporting
free pages information. It will reduce the actual free pages
and may cause application failed for no enough available memory.
This patch try to solve this issue, when there is no free page
and page repoting is on going, wait until it is done.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 include/linux/hugetlb.h | 2 ++
 mm/hugetlb.c| 9 +
 mm/page_reporting.c | 6 +-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d55e6a00b3dc..73b2934ba91c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -490,6 +490,7 @@ struct hstate {
unsigned long resv_huge_pages;
unsigned long surplus_huge_pages;
unsigned long nr_overcommit_huge_pages;
+   unsigned long isolated_huge_pages;
struct list_head hugepage_activelist;
struct list_head hugepage_freelists[MAX_NUMNODES];
unsigned int nr_huge_pages_node[MAX_NUMNODES];
@@ -500,6 +501,7 @@ struct hstate {
struct cftype cgroup_files_dfl[7];
struct cftype cgroup_files_legacy[9];
 #endif
+   struct mutex mtx_prezero;
char name[HSTATE_NAME_LEN];
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb533995cb49..0fccd5f96954 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
goto out_uncharge_cgroup_reservation;
 
spin_lock(_lock);
+   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
+   spin_unlock(_lock);
+   mutex_lock(>mtx_prezero);
+   mutex_unlock(>mtx_prezero);
+   spin_lock(_lock);
+   }
/*
 * glb_chg is passed to indicate whether or not a page must be taken
 * from the global free pool (global change).  gbl_chg == 0 indicates
@@ -3208,6 +3214,7 @@ void __init hugetlb_add_hstate(unsigned int order)
INIT_LIST_HEAD(>hugepage_activelist);
h->next_nid_to_alloc = first_memory_node;
h->next_nid_to_free = first_memory_node;
+   mutex_init(>mtx_prezero);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
 
@@ -5541,6 +5548,7 @@ void isolate_free_huge_page(struct page *page, struct 
hstate *h, int nid)
 
list_move(>lru, >hugepage_activelist);
set_page_refcounted(page);
+   h->isolated_huge_pages++;
 }
 
 void putback_isolate_huge_page(struct hstate *h, struct page *page)
@@ -5548,6 +5556,7 @@ void putback_isolate_huge_page(struct hstate *h, struct 
page *page)
int nid = page_to_nid(page);
 
list_move(>lru, >hugepage_freelists[nid]);
+   h->isolated_huge_pages--;
 }
 
 bool isolate_huge_page(struct page *page, struct list_head *list)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index cc31696225bb..99e1e688d7c1 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -272,12 +272,15 @@ hugepage_reporting_process_hstate(struct 
page_reporting_dev_info *prdev,
int ret = 0, nid;
 
offset = max_items;
+   mutex_lock(>mtx_prezero);
for (nid = 0; nid < MAX_NUMNODES; nid++) {
ret = hugepage_reporting_cycle(prdev, h, nid, sgl, ,
   max_items);
 
-   if (ret < 0)
+   if (ret < 0) {
+   mutex_unlock(>mtx_prezero);
return ret;
+   }
}
 
/* report the leftover pages before going idle */
@@ -291,6 +294,7 @@ hugepage_reporting_process_hstate(struct 
page_reporting_dev_info *prdev,
hugepage_reporting_drain(prdev, h, sgl, leftover, !ret);
spin_unlock(_lock);
}
+   mutex_unlock(>mtx_prezero);
 
return ret;
 }
-- 
2.18.2