Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-21 Thread Michal Hocko
On Wed 21-09-16 09:32:10, Dave Hansen wrote:
> On 09/21/2016 09:27 AM, Michal Hocko wrote:
> > That was not my point; I probably wasn't very clear. Offlining can fail,
> > which shouldn't really be surprising. There might be a kernel allocation
> > in the particular block which cannot be migrated, so failures are to be
> > expected. I just do not see how offlining in the middle of a gigantic
> > page is any different from having any other unmovable allocation in a
> > block. That being said, why don't we simply refuse to offline a block
> > which is in the middle of a gigantic page?
> 
> Don't we want to minimize the things that can cause an offline to fail?
> The code to fix it here doesn't seem too bad.

I am not really sure. So say somebody wants to offline a few blocks (does
offlining anything but whole nodes make any sense, btw?) and that
happens to be in the middle of a gigantic huge page, which is not really
that easy to allocate; do we want to free it in order to do the offline?
To me it sounds like keeping the gigantic page should be preferred, but
I have to admit I do not really understand the per-block offlining
though.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-21 Thread Dave Hansen
On 09/21/2016 09:27 AM, Michal Hocko wrote:
> That was not my point; I probably wasn't very clear. Offlining can fail,
> which shouldn't really be surprising. There might be a kernel allocation
> in the particular block which cannot be migrated, so failures are to be
> expected. I just do not see how offlining in the middle of a gigantic
> page is any different from having any other unmovable allocation in a
> block. That being said, why don't we simply refuse to offline a block
> which is in the middle of a gigantic page?

Don't we want to minimize the things that can cause an offline to fail?
 The code to fix it here doesn't seem too bad.



Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-21 Thread Michal Hocko
On Wed 21-09-16 09:04:31, Dave Hansen wrote:
> On 09/21/2016 05:05 AM, Michal Hocko wrote:
> > On Tue 20-09-16 10:43:13, Dave Hansen wrote:
> >> On 09/20/2016 08:52 AM, Rui Teng wrote:
> >>> On 9/20/16 10:53 PM, Dave Hansen wrote:
> >> ...
> >>>> That's good, but aren't we still left with a situation where we've
> >>>> offlined and dissolved the _middle_ of a gigantic huge page while the
> >>>> head page is still in place and online?
> >>>>
> >>>> That seems bad.
> >>>>
> >>> What about refusing to change the status of such a memory block if it
> >>> contains a huge page which is larger than the block itself? (function
> >>> memory_block_action())
> >>
> >> How will this be visible to users, though?  That sounds like you simply
> >> won't be able to offline memory with gigantic huge pages.
> > 
> > I might be missing something, but is this any different from a regular
> > failure when the memory cannot be freed? I mean, the
> > /sys/devices/system/memory/memory API doesn't give you any hint whether
> > the memory in the particular block is used and unmigratable.
> 
> It's OK to have free hugetlbfs pages in an area that's being offlined.
> If we did that, it would not be OK to have a free gigantic hugetlbfs
> page that's larger than the area being offlined.

That was not my point; I probably wasn't very clear. Offlining can fail,
which shouldn't really be surprising. There might be a kernel allocation
in the particular block which cannot be migrated, so failures are to be
expected. I just do not see how offlining in the middle of a gigantic
page is any different from having any other unmovable allocation in a
block. That being said, why don't we simply refuse to offline a block
which is in the middle of a gigantic page?
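
(For illustration only — not an actual patch, and the helper name is
invented — such a check could look roughly like the sketch below,
ignoring locking; the offline path would bail out if it returns true:)

static bool range_splits_huge_page(unsigned long start_pfn,
				   unsigned long end_pfn)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page, *head;
		unsigned long head_pfn, nr;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		if (!PageHuge(page))
			continue;
		head = compound_head(page);
		head_pfn = page_to_pfn(head);
		nr = 1UL << compound_order(head);
		/* does the huge page stick out of the range being offlined? */
		if (head_pfn < start_pfn || head_pfn + nr > end_pfn)
			return true;
		/* skip the rest of this huge page */
		pfn = head_pfn + nr - 1;
	}
	return false;
}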
 
> It would be a wee bit goofy to have the requirement that userspace go
> find all the gigantic pages and make them non-gigantic before trying to
> offline something.

yes

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-21 Thread Dave Hansen
On 09/21/2016 05:05 AM, Michal Hocko wrote:
> On Tue 20-09-16 10:43:13, Dave Hansen wrote:
>> On 09/20/2016 08:52 AM, Rui Teng wrote:
>>> On 9/20/16 10:53 PM, Dave Hansen wrote:
>> ...
>>>> That's good, but aren't we still left with a situation where we've
>>>> offlined and dissolved the _middle_ of a gigantic huge page while the
>>>> head page is still in place and online?
>>>>
>>>> That seems bad.
>>>>
>>> What about refusing to change the status of such a memory block if it
>>> contains a huge page which is larger than the block itself? (function
>>> memory_block_action())
>>
>> How will this be visible to users, though?  That sounds like you simply
>> won't be able to offline memory with gigantic huge pages.
> 
> I might be missing something, but is this any different from a regular
> failure when the memory cannot be freed? I mean, the
> /sys/devices/system/memory/memory API doesn't give you any hint whether
> the memory in the particular block is used and unmigratable.

It's OK to have free hugetlbfs pages in an area that's being offlined.
If we did that, it would not be OK to have a free gigantic hugetlbfs
page that's larger than the area being offlined.

It would be a wee bit goofy to have the requirement that userspace go
find all the gigantic pages and make them non-gigantic before trying to
offline something.



Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-21 Thread Michal Hocko
On Tue 20-09-16 10:43:13, Dave Hansen wrote:
> On 09/20/2016 08:52 AM, Rui Teng wrote:
> > On 9/20/16 10:53 PM, Dave Hansen wrote:
> ...
> >> That's good, but aren't we still left with a situation where we've
> >> offlined and dissolved the _middle_ of a gigantic huge page while the
> >> head page is still in place and online?
> >>
> >> That seems bad.
> >>
> > What about refusing to change the status of such a memory block if it
> > contains a huge page which is larger than the block itself? (function
> > memory_block_action())
> 
> How will this be visible to users, though?  That sounds like you simply
> won't be able to offline memory with gigantic huge pages.

I might be missing something, but is this any different from a regular
failure when the memory cannot be freed? I mean, the
/sys/devices/system/memory/memory API doesn't give you any hint whether
the memory in the particular block is used and unmigratable.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-20 Thread Dave Hansen
On 09/20/2016 08:52 AM, Rui Teng wrote:
> On 9/20/16 10:53 PM, Dave Hansen wrote:
...
>> That's good, but aren't we still left with a situation where we've
>> offlined and dissolved the _middle_ of a gigantic huge page while the
>> head page is still in place and online?
>>
>> That seems bad.
>>
> What about refusing to change the status of such a memory block if it
> contains a huge page which is larger than the block itself? (function
> memory_block_action())

How will this be visible to users, though?  That sounds like you simply
won't be able to offline memory with gigantic huge pages.

> I think it will not affect the hot-plug function too much. We can set
> nr_hugepages to zero first, if we really want to hot-plug memory.

Is that really feasible?  Suggest that folks stop using hugetlbfs before
offlining any memory?  Isn't the entire point of hotplug to keep the
system running while you change the memory present?  Doing this would
require that you stop your applications that are using huge pages.

With gigantic pages, you may also never get them back if you do this.

> And I also found that the __test_page_isolated_in_pageblock() function
> cannot handle a gigantic page well; it will cause a device busy error
> later. I am still investigating that.
> 
> Any suggestion?

It sounds like the _first_ offline operation needs to dissolve an
_entire_ page if that page has any portion in the section being
offlined.  I'm not quite sure where the page should live after that, but
I don't see any other way to do this sanely.
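
A rough sketch of that approach (hypothetical, written in the style of
mm/hugetlb.c and relying on its existing helpers; it ignores locking and
the accounting question of where the dissolved memory "lives"
afterwards): whenever a pfn in the range belongs to a free huge page,
dissolve the whole page via its head, even if the head sits outside the
range, and then skip past it.

static void dissolve_free_huge_pages_whole(unsigned long start_pfn,
					   unsigned long end_pfn)
{
	unsigned long pfn = start_pfn;

	while (pfn < end_pfn) {
		struct page *page = pfn_to_page(pfn);

		if (PageHuge(page) && !page_count(page)) {
			struct page *head = compound_head(page);
			/* remember the size before the page is torn down */
			unsigned long nr = 1UL << compound_order(head);

			/* dissolve the whole page, head included */
			dissolve_free_huge_page(head);
			/* continue after the page we just dissolved */
			pfn = page_to_pfn(head) + nr;
		} else {
			pfn += 1UL << minimum_order;
		}
	}
}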



Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-20 Thread Rui Teng

On 9/20/16 10:53 PM, Dave Hansen wrote:

> On 09/20/2016 07:45 AM, Rui Teng wrote:
>> On 9/17/16 12:25 AM, Dave Hansen wrote:
>>>
>>> That's an interesting data point, but it still doesn't quite explain
>>> what is going on.
>>>
>>> It seems like there might be parts of gigantic pages that have
>>> PageHuge() set on tail pages, while other parts don't.  If that's true,
>>> we have another bug and your patch just papers over the issue.
>>>
>>> I think you really need to find the root cause before we apply this
>>> patch.
>>>
>> The root cause is that the test script
>> (tools/testing/selftests/memory-hotplug/mem-on-off-test.sh) changes the
>> online/offline status of memory blocks other than the one containing
>> the head page. It will *randomly* select 10% of the memory blocks from
>> /sys/devices/system/memory/memory* and change their online/offline
>> status.
>
> Ahh, that does explain it!  Thanks for digging into that!


>> That's why we need a PageHead() check now, and why this problem does
>> not happen on systems with smaller huge pages such as 16M.
>>
>> As for PageHuge(), I think it will return true for all tail pages,
>> because it looks up the compound head of a tail page and then checks
>> the huge page flag there:
>> page = compound_head(page);
>>
>> And as for the failure message, if one memory block is in use, the
>> offline will return a failure.
>
> That's good, but aren't we still left with a situation where we've
> offlined and dissolved the _middle_ of a gigantic huge page while the
> head page is still in place and online?
>
> That seems bad.


What about refusing to change the status of such a memory block if it
contains a huge page which is larger than the block itself? (function
memory_block_action())

I think it will not affect the hot-plug function too much. We can set
nr_hugepages to zero first, if we really want to hot-plug memory.

And I also found that the __test_page_isolated_in_pageblock() function
cannot handle a gigantic page well; it will cause a device busy error
later. I am still investigating that.

Any suggestion?

Thanks!



Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-20 Thread Dave Hansen
On 09/20/2016 07:45 AM, Rui Teng wrote:
> On 9/17/16 12:25 AM, Dave Hansen wrote:
>>
>> That's an interesting data point, but it still doesn't quite explain
>> what is going on.
>>
>> It seems like there might be parts of gigantic pages that have
>> PageHuge() set on tail pages, while other parts don't.  If that's true,
>> we have another bug and your patch just papers over the issue.
>>
>> I think you really need to find the root cause before we apply this
>> patch.
>>
> The root cause is that the test script
> (tools/testing/selftests/memory-hotplug/mem-on-off-test.sh) changes the
> online/offline status of memory blocks other than the one containing
> the head page. It will *randomly* select 10% of the memory blocks from
> /sys/devices/system/memory/memory* and change their online/offline
> status.

Ahh, that does explain it!  Thanks for digging into that!

> That's why we need a PageHead() check now, and why this problem does
> not happen on systems with smaller huge pages such as 16M.
> 
> As for PageHuge(), I think it will return true for all tail pages,
> because it looks up the compound head of a tail page and then checks
> the huge page flag there:
> page = compound_head(page);
> 
> And as for the failure message, if one memory block is in use, the
> offline will return a failure.

That's good, but aren't we still left with a situation where we've
offlined and dissolved the _middle_ of a gigantic huge page while the
head page is still in place and online?

That seems bad.


Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-20 Thread Rui Teng

On 9/17/16 12:25 AM, Dave Hansen wrote:


> That's an interesting data point, but it still doesn't quite explain
> what is going on.
>
> It seems like there might be parts of gigantic pages that have
> PageHuge() set on tail pages, while other parts don't.  If that's true,
> we have another bug and your patch just papers over the issue.
>
> I think you really need to find the root cause before we apply this patch.


The root cause is that the test script
(tools/testing/selftests/memory-hotplug/mem-on-off-test.sh) changes the
online/offline status of memory blocks other than the one containing
the head page. It will *randomly* select 10% of the memory blocks from
/sys/devices/system/memory/memory* and change their online/offline
status.

On my system, the memory block size is 0x1000:
[root@elvis-n01-kvm memory]# cat block_size_bytes
1000

But the huge page size (16G) is larger than this memory block size, so
one huge page is composed of several memory blocks, for example
memory704, memory705, memory706 and so on. Then memory704 will contain
the head page, but memory705 will *only* contain tail pages. So the
problem will happen on memory705 if we call:
#echo offline > memory705/state

That's why we need a PageHead() check now, and why this problem does
not happen on systems with smaller huge pages such as 16M.

As for PageHuge(), I think it will return true for all tail pages,
because it looks up the compound head of a tail page and then checks
the huge page flag there:
page = compound_head(page);
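
To make that concrete, PageHuge() in mm/hugetlb.c of roughly that era
looks like the sketch below (paraphrased from memory, so treat the exact
destructor check as approximate). The compound_head() redirection is why
a tail page of a still-assembled gigantic page reports true:

int PageHuge(struct page *page)
{
	if (!PageCompound(page))
		return 0;

	/* tail pages are redirected to their head before the check */
	page = compound_head(page);
	return page[1].compound_dtor == HUGETLB_PAGE_DTOR;
}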

And as for the failure message, if one memory block is in use, the
offline will return a failure.



Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-16 Thread Dave Hansen
On 09/16/2016 06:58 AM, Rui Teng wrote:
> On 9/15/16 12:37 AM, Dave Hansen wrote:
>> On 09/14/2016 09:33 AM, Rui Teng wrote:
>> But, as far as describing the initial problem, can you explain how the
>> tail pages still ended up being PageHuge()?  Seems like dissolving the
>> huge page should have cleared that.
>>
> I use the script tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
> to test and reproduce this bug, and I printed the pfn ranges in
> dissolve_free_huge_pages(). The size of each pfn range is always 4096,
> and the ranges are not contiguous.
> [   72.362427] start_pfn: 204800, end_pfn: 208896
> [   72.371677] start_pfn: 2162688, end_pfn: 2166784
> [   72.373945] start_pfn: 217088, end_pfn: 221184
> [   72.383218] start_pfn: 2170880, end_pfn: 2174976
> [   72.385918] start_pfn: 2306048, end_pfn: 2310144
> [   72.388254] start_pfn: 2326528, end_pfn: 2330624
> 
> Sometimes, it will report a failure:
> [   72.371690] memory offlining [mem 0x21-0x210fff] failed
> 
> And sometimes, it will report following:
> [   72.373956] Offlined Pages 4096
> 
> Could the start_pfn and end_pfn of dissolve_free_huge_pages() be
> *random*? If so, a range may not include any head page and may start
> from a tail page, right?

That's an interesting data point, but it still doesn't quite explain
what is going on.

It seems like there might be parts of gigantic pages that have
PageHuge() set on tail pages, while other parts don't.  If that's true,
we have another bug and your patch just papers over the issue.

I think you really need to find the root cause before we apply this patch.


Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-16 Thread Rui Teng

On 9/15/16 12:37 AM, Dave Hansen wrote:

> On 09/14/2016 09:33 AM, Rui Teng wrote:
>>
>> How about returning the size of the page freed from
>> dissolve_free_huge_page(), and advancing pfn by that amount?
>
> That would be a nice improvement.
>
> But, as far as describing the initial problem, can you explain how the
> tail pages still ended up being PageHuge()?  Seems like dissolving the
> huge page should have cleared that.


I use the script tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
to test and reproduce this bug, and I printed the pfn ranges in
dissolve_free_huge_pages(). The size of each pfn range is always 4096,
and the ranges are not contiguous.
[   72.362427] start_pfn: 204800, end_pfn: 208896
[   72.371677] start_pfn: 2162688, end_pfn: 2166784
[   72.373945] start_pfn: 217088, end_pfn: 221184
[   72.383218] start_pfn: 2170880, end_pfn: 2174976
[   72.385918] start_pfn: 2306048, end_pfn: 2310144
[   72.388254] start_pfn: 2326528, end_pfn: 2330624

Sometimes, it will report a failure:
[   72.371690] memory offlining [mem 0x21-0x210fff] failed

And sometimes, it will report following:
[   72.373956] Offlined Pages 4096

Could the start_pfn and end_pfn of dissolve_free_huge_pages() be
*random*? If so, a range may not include any head page and may start
from a tail page, right?



Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-14 Thread Dave Hansen
On 09/14/2016 09:33 AM, Rui Teng wrote:
> 
> How about returning the size of the page freed from
> dissolve_free_huge_page(), and advancing pfn by that amount?

That would be a nice improvement.

But, as far as describing the initial problem, can you explain how the
tail pages still ended up being PageHuge()?  Seems like dissolving the
huge page should have cleared that.


Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-14 Thread Rui Teng

On 9/14/16 1:32 AM, Dave Hansen wrote:

> On 09/13/2016 01:39 AM, Rui Teng wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 87e11d8..64b5f81 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1442,7 +1442,7 @@ static int free_pool_huge_page(struct hstate *h, 
>> nodemask_t *nodes_allowed,
>>  static void dissolve_free_huge_page(struct page *page)
>>  {
>>   spin_lock(&hugetlb_lock);
>> - if (PageHuge(page) && !page_count(page)) {
>> + if (PageHuge(page) && !page_count(page) && PageHead(page)) {
>>   struct hstate *h = page_hstate(page);
>>   int nid = page_to_nid(page);
>>   list_del(&page->lru);
>
> This is goofy.  What is calling dissolve_free_huge_page() on a tail page?
>
> Hmm:
>
>> for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
>> 	dissolve_free_huge_page(pfn_to_page(pfn));
>
> So, skip through the area being offlined at the smallest huge page size,
> and try to dissolve a huge page in each place one might appear.  But,
> after we dissolve a 16GB huge page, we continue looking through the
> remaining 15.98GB tail area for huge pages in the area we just
> dissolved.  The tail pages are still PageHuge() (how??), and we call
> page_hstate() on the tail page whose head was just dissolved.
>
> Note that even with the fix, this takes a (global) spinlock 1023 more
> times than it has to.
>
> This seems inefficient, and fails to fully explain what is going on, and
> how tail pages still _look_ like PageHuge(), which seems pretty wrong.
>
> I guess the patch _works_.  But, sheesh, it leaves a lot of room for
> improvement.


Thanks for your suggestion!
How about returning the size of the page freed from
dissolve_free_huge_page(), and advancing pfn by that amount?
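
A hedged sketch of that idea (hypothetical — the upstream
dissolve_free_huge_page() returns void, and the accounting below is
abbreviated to the fields the existing function already touches): let
the dissolve report how many base pages it freed so the caller can step
past the whole gigantic page instead of probing every minimum_order
chunk under hugetlb_lock.

static unsigned long dissolve_free_huge_page(struct page *page)
{
	unsigned long nr_freed = 0;

	spin_lock(&hugetlb_lock);
	if (PageHuge(page) && !page_count(page) && PageHead(page)) {
		struct hstate *h = page_hstate(page);
		int nid = page_to_nid(page);

		list_del(&page->lru);
		h->free_huge_pages--;
		h->free_huge_pages_node[nid]--;
		update_and_free_page(h, page);
		/* report the number of base pages this huge page covered */
		nr_freed = 1UL << huge_page_order(h);
	}
	spin_unlock(&hugetlb_lock);
	return nr_freed;
}

void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn = start_pfn;

	if (!hugepages_supported())
		return;

	/* advance by whatever was actually dissolved, else by minimum_order */
	while (pfn < end_pfn) {
		unsigned long freed = dissolve_free_huge_page(pfn_to_page(pfn));

		pfn += freed ? freed : (1UL << minimum_order);
	}
}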




Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

2016-09-13 Thread Dave Hansen
On 09/13/2016 01:39 AM, Rui Teng wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 87e11d8..64b5f81 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1442,7 +1442,7 @@ static int free_pool_huge_page(struct hstate *h, 
> nodemask_t *nodes_allowed,
>  static void dissolve_free_huge_page(struct page *page)
>  {
>   spin_lock(&hugetlb_lock);
> - if (PageHuge(page) && !page_count(page)) {
> + if (PageHuge(page) && !page_count(page) && PageHead(page)) {
>   struct hstate *h = page_hstate(page);
>   int nid = page_to_nid(page);
>   list_del(&page->lru);

This is goofy.  What is calling dissolve_free_huge_page() on a tail page?

Hmm:

> for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
> dissolve_free_huge_page(pfn_to_page(pfn));

So, skip through the area being offlined at the smallest huge page size,
and try to dissolve a huge page in each place one might appear.  But,
after we dissolve a 16GB huge page, we continue looking through the
remaining 15.98GB tail area for huge pages in the area we just
dissolved.  The tail pages are still PageHuge() (how??), and we call
page_hstate() on the tail page whose head was just dissolved.

Note that even with the fix, this takes a (global) spinlock 1023 more
times than it has to.

This seems inefficient, and fails to fully explain what is going on, and
how tail pages still _look_ like PageHuge(), which seems pretty wrong.

I guess the patch _works_.  But, sheesh, it leaves a lot of room for
improvement.