Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-21 Thread Michal Hocko
On Wed 21-09-16 11:27:48, Dave Hansen wrote:
> On 09/21/2016 11:20 AM, Michal Hocko wrote:
> > I would even question the per page block offlining itself. Why would
> > anybody want to offline a few blocks rather than the whole node? What is
> > the use case here?
> 
> The original reason was so that you could remove a DIMM or a riser card
> full of DIMMs, which are certainly a subset of a node.

OK, I see, thanks for the clarification! I was always thinking more in
terms of node hot-remove rather than physical memory range hot-remove. I
do agree that it makes sense to free the whole gigantic huge page if we
encounter a tail page for the above use case, because losing the
gigantic page is justified when the whole DIMM goes away.

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-21 Thread Dave Hansen
On 09/21/2016 11:20 AM, Michal Hocko wrote:
> I would even question the per page block offlining itself. Why would
> anybody want to offline a few blocks rather than the whole node? What is
> the use case here?

The original reason was so that you could remove a DIMM or a riser card
full of DIMMs, which are certainly a subset of a node.

With virtual machines, perhaps you only want to make a small adjustment
to the memory that a VM has.  Or, perhaps the VM only _has_ one node.
Granted, ballooning takes care of a lot of these cases, but memmap[]
starts to get annoying at some point if you balloon too much memory away.


Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-21 Thread Michal Hocko
On Tue 20-09-16 10:37:04, Mike Kravetz wrote:
> On 09/20/2016 08:53 AM, Gerald Schaefer wrote:
> > dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
> > list corruption and addressing exception when trying to set a memory
> > block offline that is part (but not the first part) of a gigantic
> > hugetlb page with a size > memory block size.
> > 
> > When no other smaller hugepage sizes are present, the VM_BUG_ON() will
> > trigger directly. In the other case we will run into an addressing
> > exception later, because dissolve_free_huge_page() will not use the head
> > page of the compound hugetlb page, which will result in a NULL hstate
> > from page_hstate(). list_del() would also not work well on a tail page.
> > 
> > To fix this, first remove the VM_BUG_ON() because it is wrong, and then
> > use the compound head page in dissolve_free_huge_page().
> > 
> > However, this all assumes that it is the desired behaviour to remove
> > a (gigantic) unused hugetlb page from the pool, just because a small
> > (in relation to the hugepage size) memory block is going offline. Not
> > sure if this is the right thing, and it doesn't look very consistent
> > given that in this scenario it is _not_ possible to migrate
> > such a (gigantic) hugepage if it is in use. OTOH, has_unmovable_pages()
> > will return false in both cases, i.e. the memory block will be reported
> > as removable, no matter if the hugepage that it is part of is unused or
> > in use.
> > 
> > This patch assumes that it would be OK to remove the hugepage,
> > i.e. memory offline beats pre-allocated unused (gigantic) hugepages.
> > 
> > Any thoughts?
> 
> Cc'ed Rui Teng and Dave Hansen as they were discussing the issue in
> this thread:
> https://lkml.org/lkml/2016/9/13/146
> 
> Their approach (I believe) would be to fail the offline operation in
> this case.  However, I could argue that failing the operation, or
> dissolving the unused huge page containing the area to be offlined is
> the right thing to do.

I am sorry, I noticed this thread only now. I was arguing about this in
the original thread. I would be rather reluctant to free a gigantic page
just because somebody wants to offline a small part of it, because its
setup is really expensive and a lost gigantic page would be really hard
to get back.

I would even question the per page block offlining itself. Why would
anybody want to offline a few blocks rather than the whole node? What is
the use case here?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-21 Thread Gerald Schaefer
On Tue, 20 Sep 2016 10:45:23 -0700
Dave Hansen wrote:

> On 09/20/2016 10:37 AM, Mike Kravetz wrote:
> > 
> > Their approach (I believe) would be to fail the offline operation in
> > this case.  However, I could argue that failing the operation, or
> > dissolving the unused huge page containing the area to be offlined is
> > the right thing to do.
> 
> I think the right thing to do is dissolve the whole huge page if even a
> part of it is offlined.  The only question is what to do with the
> gigantic remnants.
> 

Hmm, not sure if I got this right, but I thought that by calling
update_and_free_page() on the head page (even if it is not part of the
memory block to be removed) all parts of the gigantic hugepage should be
properly freed and there should not be any remnants left.
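
For illustration, a minimal sketch of that idea (untested, and only an
approximation of the actual patch; the helper names are the 2016-era
mm/hugetlb.c ones):

static void dissolve_free_huge_page(struct page *page)
{
        spin_lock(&hugetlb_lock);
        if (PageHuge(page) && !page_count(page)) {
                /*
                 * Operate on the head page, even if the pfn we were
                 * given points into the middle of a gigantic page that
                 * straddles several memory blocks.
                 */
                struct page *head = compound_head(page);
                struct hstate *h = page_hstate(head);
                int nid = page_to_nid(head);

                list_del(&head->lru);
                h->free_huge_pages--;
                h->free_huge_pages_node[nid]--;
                /* Frees the whole compound page, tail parts included. */
                update_and_free_page(h, head);
        }
        spin_unlock(&hugetlb_lock);
}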



Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-21 Thread Gerald Schaefer
On Tue, 20 Sep 2016 10:37:04 -0700
Mike Kravetz wrote:

> 
> Cc'ed Rui Teng and Dave Hansen as they were discussing the issue in
> this thread:
> https://lkml.org/lkml/2016/9/13/146

Ah, thanks, I didn't see that.

> 
> Their approach (I believe) would be to fail the offline operation in
> this case.  However, I could argue that failing the operation, or
> dissolving the unused huge page containing the area to be offlined is
> the right thing to do.
> 
> I never thought too much about the VM_BUG_ON(), but you are correct in
> that it should be removed in either case.
> 
> The other thing that needs to be changed is the locking in
> dissolve_free_huge_page().  I believe the lock only needs to be held if
> we are removing the huge page from the pool.  It is not a correctness
> issue but a performance issue.
> 

Yes, that looks odd, which is why my patch moves the PageHuge() check
out of dissolve_free_huge_page() and up into the loop in
dissolve_free_huge_pages(). This way dissolve_free_huge_page(), with its
locking, is only called once per memory block in the case where that
memory block is part of a gigantic hugepage.
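
Roughly like this (untested sketch; minimum_order and the pfn stepping
are as in the existing dissolve_free_huge_pages() loop):

void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
{
        unsigned long pfn;

        if (!hugepages_supported())
                return;

        /*
         * Only call into the locked dissolve path for pfns that look
         * like hugetlb pages. Once the first call has dissolved a
         * gigantic page, the remaining pfns of that page no longer
         * pass the PageHuge() check, so the lock is taken at most
         * once per memory block in that case.
         */
        for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
                if (PageHuge(pfn_to_page(pfn)))
                        dissolve_free_huge_page(pfn_to_page(pfn));
}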



Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-21 Thread Vlastimil Babka

On 09/20/2016 07:45 PM, Dave Hansen wrote:
> On 09/20/2016 10:37 AM, Mike Kravetz wrote:
>> Their approach (I believe) would be to fail the offline operation in
>> this case.  However, I could argue that failing the operation, or
>> dissolving the unused huge page containing the area to be offlined is
>> the right thing to do.
>
> I think the right thing to do is dissolve the whole huge page if even a
> part of it is offlined.  The only question is what to do with the
> gigantic remnants.

Just free them into the buddy system? Or what are the alternatives?
Creating smaller huge pages (if supported)? That doesn't make much
sense. Offline it completely? That's probably not what the user requested.


Vlastimil



Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-20 Thread Dave Hansen
On 09/20/2016 10:37 AM, Mike Kravetz wrote:
> 
> Their approach (I believe) would be to fail the offline operation in
> this case.  However, I could argue that failing the operation, or
> dissolving the unused huge page containing the area to be offlined is
> the right thing to do.

I think the right thing to do is dissolve the whole huge page if even a
part of it is offlined.  The only question is what to do with the
gigantic remnants.


Re: [PATCH 0/1] memory offline issues with hugepage size > memory block size

2016-09-20 Thread Mike Kravetz
On 09/20/2016 08:53 AM, Gerald Schaefer wrote:
> dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
> list corruption and addressing exception when trying to set a memory
> block offline that is part (but not the first part) of a gigantic
> hugetlb page with a size > memory block size.
> 
> When no other smaller hugepage sizes are present, the VM_BUG_ON() will
> trigger directly. In the other case we will run into an addressing
> exception later, because dissolve_free_huge_page() will not use the head
> page of the compound hugetlb page, which will result in a NULL hstate
> from page_hstate(). list_del() would also not work well on a tail page.
> 
> To fix this, first remove the VM_BUG_ON() because it is wrong, and then
> use the compound head page in dissolve_free_huge_page().
> 
> However, this all assumes that it is the desired behaviour to remove
> a (gigantic) unused hugetlb page from the pool, just because a small
> (in relation to the hugepage size) memory block is going offline. Not
> sure if this is the right thing, and it doesn't look very consistent
> given that in this scenario it is _not_ possible to migrate
> such a (gigantic) hugepage if it is in use. OTOH, has_unmovable_pages()
> will return false in both cases, i.e. the memory block will be reported
> as removable, no matter if the hugepage that it is part of is unused or
> in use.
> 
> This patch assumes that it would be OK to remove the hugepage,
> i.e. memory offline beats pre-allocated unused (gigantic) hugepages.
> 
> Any thoughts?

Cc'ed Rui Teng and Dave Hansen as they were discussing the issue in
this thread:
https://lkml.org/lkml/2016/9/13/146

Their approach (I believe) would be to fail the offline operation in
this case.  However, I could argue that failing the operation, or
dissolving the unused huge page containing the area to be offlined is
the right thing to do.

I never thought too much about the VM_BUG_ON(), but you are correct in
that it should be removed in either case.

The other thing that needs to be changed is the locking in
dissolve_free_huge_page().  I believe the lock only needs to be held if
we are removing the huge page from the pool.  It is not a correctness
issue but a performance issue.
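
For example, something like this untested sketch, which keeps the
unlocked check cheap and rechecks under hugetlb_lock before touching
the free lists (the head-page handling discussed elsewhere in the
thread is omitted here for brevity):

static void dissolve_free_huge_page(struct page *page)
{
        /* Cheap, racy check without hugetlb_lock. */
        if (!PageHuge(page) || page_count(page))
                return;

        spin_lock(&hugetlb_lock);
        /* Recheck under the lock; the page may have changed meanwhile. */
        if (PageHuge(page) && !page_count(page)) {
                struct hstate *h = page_hstate(page);
                int nid = page_to_nid(page);

                list_del(&page->lru);
                h->free_huge_pages--;
                h->free_huge_pages_node[nid]--;
                update_and_free_page(h, page);
        }
        spin_unlock(&hugetlb_lock);
}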

-- 
Mike Kravetz

> 
> 
> Gerald Schaefer (1):
>   mm/hugetlb: fix memory offline with hugepage size > memory block size
> 
>  mm/hugetlb.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
>