Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On Wed 21-09-16 09:32:10, Dave Hansen wrote:
> On 09/21/2016 09:27 AM, Michal Hocko wrote:
> > That was not my point. I wasn't very clear probably. Offlining can fail
> > which shouldn't be really surprising. There might be a kernel allocation
> > in the particular block which cannot be migrated so failures are to be
> > expected. I just do not see how offlining in the middle of a gigantic
> > page is any different from having any other unmovable allocation in a
> > block. That being said, why don't we simply refuse to offline a block
> > which is in the middle of a gigantic page.
>
> Don't we want to minimize the things that can cause an offline to fail?
> The code to fix it here doesn't seem too bad.

I am not really sure. Say somebody wants to offline a few blocks (does
offlining anything but whole nodes make any sense, btw.?) and that
happens to be in the middle of a gigantic huge page, which is not really
that easy to allocate. Do we want to free it in order to do the offline?
To me it sounds like keeping the gigantic page should be preferred, but
I have to admit I do not really understand the per-block offlining.
--
Michal Hocko
SUSE Labs
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/21/2016 09:27 AM, Michal Hocko wrote:
> That was not my point. I wasn't very clear probably. Offlining can fail
> which shouldn't be really surprising. There might be a kernel allocation
> in the particular block which cannot be migrated so failures are to be
> expected. I just do not see how offlining in the middle of a gigantic
> page is any different from having any other unmovable allocation in a
> block. That being said, why don't we simply refuse to offline a block
> which is in the middle of a gigantic page.

Don't we want to minimize the things that can cause an offline to fail?
The code to fix it here doesn't seem too bad.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On Wed 21-09-16 09:04:31, Dave Hansen wrote:
> On 09/21/2016 05:05 AM, Michal Hocko wrote:
> > On Tue 20-09-16 10:43:13, Dave Hansen wrote:
> >> On 09/20/2016 08:52 AM, Rui Teng wrote:
> >>> On 9/20/16 10:53 PM, Dave Hansen wrote:
> >> ...
>
> That's good, but aren't we still left with a situation where we've
> offlined and dissolved the _middle_ of a gigantic huge page while the
> head page is still in place and online?
>
> That seems bad.
>
> >>> What about refusing to change the status for such a memory block, if
> >>> it contains a huge page which is larger than itself? (function
> >>> memory_block_action())
> >>
> >> How will this be visible to users, though? That sounds like you simply
> >> won't be able to offline memory with gigantic huge pages.
> >
> > I might be missing something, but is this any different from a regular
> > failure when the memory cannot be freed? I mean the
> > /sys/devices/system/memory/memory API doesn't give you any hint whether
> > the memory in the particular block is used and
> > unmigratable.
>
> It's OK to have free hugetlbfs pages in an area that's being offlined.
> If we did that, it would not be OK to have a free gigantic hugetlbfs
> page that's larger than the area being offlined.

That was not my point. I wasn't very clear probably. Offlining can fail
which shouldn't be really surprising. There might be a kernel allocation
in the particular block which cannot be migrated so failures are to be
expected. I just do not see how offlining in the middle of a gigantic
page is any different from having any other unmovable allocation in a
block. That being said, why don't we simply refuse to offline a block
which is in the middle of a gigantic page.

> It would be a wee bit goofy to have the requirement that userspace go
> find all the gigantic pages and make them non-gigantic before trying to
> offline something.

Yes.
--
Michal Hocko
SUSE Labs
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/21/2016 05:05 AM, Michal Hocko wrote:
> On Tue 20-09-16 10:43:13, Dave Hansen wrote:
>> On 09/20/2016 08:52 AM, Rui Teng wrote:
>>> On 9/20/16 10:53 PM, Dave Hansen wrote:
>> ...

That's good, but aren't we still left with a situation where we've
offlined and dissolved the _middle_ of a gigantic huge page while the
head page is still in place and online?

That seems bad.

>>> What about refusing to change the status for such a memory block, if
>>> it contains a huge page which is larger than itself? (function
>>> memory_block_action())
>>
>> How will this be visible to users, though? That sounds like you simply
>> won't be able to offline memory with gigantic huge pages.
>
> I might be missing something, but is this any different from a regular
> failure when the memory cannot be freed? I mean the
> /sys/devices/system/memory/memory API doesn't give you any hint whether
> the memory in the particular block is used and
> unmigratable.

It's OK to have free hugetlbfs pages in an area that's being offlined.
If we did that, it would not be OK to have a free gigantic hugetlbfs
page that's larger than the area being offlined.

It would be a wee bit goofy to have the requirement that userspace go
find all the gigantic pages and make them non-gigantic before trying to
offline something.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On Tue 20-09-16 10:43:13, Dave Hansen wrote:
> On 09/20/2016 08:52 AM, Rui Teng wrote:
> > On 9/20/16 10:53 PM, Dave Hansen wrote:
> ...
> >> That's good, but aren't we still left with a situation where we've
> >> offlined and dissolved the _middle_ of a gigantic huge page while the
> >> head page is still in place and online?
> >>
> >> That seems bad.
> >>
> > What about refusing to change the status for such a memory block, if
> > it contains a huge page which is larger than itself? (function
> > memory_block_action())
>
> How will this be visible to users, though? That sounds like you simply
> won't be able to offline memory with gigantic huge pages.

I might be missing something, but is this any different from a regular
failure when the memory cannot be freed? I mean the
/sys/devices/system/memory/memory API doesn't give you any hint whether
the memory in the particular block is used and unmigratable.
--
Michal Hocko
SUSE Labs
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/20/2016 08:52 AM, Rui Teng wrote:
> On 9/20/16 10:53 PM, Dave Hansen wrote:
...
>> That's good, but aren't we still left with a situation where we've
>> offlined and dissolved the _middle_ of a gigantic huge page while the
>> head page is still in place and online?
>>
>> That seems bad.
>>
> What about refusing to change the status for such a memory block, if
> it contains a huge page which is larger than itself? (function
> memory_block_action())

How will this be visible to users, though? That sounds like you simply
won't be able to offline memory with gigantic huge pages.

> I think it will not affect the hot-plug function too much. We can
> change nr_hugepages to zero first, if we really want to hot-plug
> memory.

Is that really feasible? Suggest that folks stop using hugetlbfs before
offlining any memory? Isn't the entire point of hotplug to keep the
system running while you change the memory present? Doing this would
require that you stop your applications that are using huge pages. With
gigantic pages, you may also never get them back if you do this.

> And I also found that the __test_page_isolated_in_pageblock() function
> can not handle a gigantic page well. It will cause a device busy error
> later. I am still investigating that.
>
> Any suggestion?

It sounds like the _first_ offline operation needs to dissolve an
_entire_ page if that page has any portion in the section being
offlined. I'm not quite sure where the page should live after that, but
I'm not sure of any other way to do this sanely.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 9/20/16 10:53 PM, Dave Hansen wrote:
> On 09/20/2016 07:45 AM, Rui Teng wrote:
>> On 9/17/16 12:25 AM, Dave Hansen wrote:
>>> That's an interesting data point, but it still doesn't quite explain
>>> what is going on.
>>>
>>> It seems like there might be parts of gigantic pages that have
>>> PageHuge() set on tail pages, while other parts don't. If that's true,
>>> we have another bug and your patch just papers over the issue.
>>>
>>> I think you really need to find the root cause before we apply this
>>> patch.
>>>
>> The root cause is that the test script (tools/testing/selftests/memory-
>> hotplug/mem-on-off-test.sh) changes online/offline status on memory
>> blocks other than the one containing the page head. It will *randomly*
>> select 10% of the memory blocks from /sys/devices/system/memory/memory*
>> and change their online/offline status.
>
> Ahh, that does explain it! Thanks for digging into that!
>
>> That's why we need a PageHead() check now, and why this problem does
>> not happen on systems with smaller huge pages such as 16M.
>>
>> As far as PageHuge() is concerned, I think PageHuge() will return true
>> for all tail pages, because it will get the compound_head for a tail
>> page and then check its huge page flag:
>> 	page = compound_head(page);
>>
>> And as for the failure message, if a memory block is in use, offlining
>> it will return a failure.
>
> That's good, but aren't we still left with a situation where we've
> offlined and dissolved the _middle_ of a gigantic huge page while the
> head page is still in place and online?
>
> That seems bad.

What about refusing to change the status for such a memory block, if it
contains a huge page which is larger than itself? (function
memory_block_action())

I think it will not affect the hot-plug function too much. We can
change nr_hugepages to zero first, if we really want to hot-plug
memory.

And I also found that the __test_page_isolated_in_pageblock() function
can not handle a gigantic page well. It will cause a device busy error
later. I am still investigating that.

Any suggestion? Thanks!
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/20/2016 07:45 AM, Rui Teng wrote:
> On 9/17/16 12:25 AM, Dave Hansen wrote:
>>
>> That's an interesting data point, but it still doesn't quite explain
>> what is going on.
>>
>> It seems like there might be parts of gigantic pages that have
>> PageHuge() set on tail pages, while other parts don't. If that's true,
>> we have another bug and your patch just papers over the issue.
>>
>> I think you really need to find the root cause before we apply this
>> patch.
>>
> The root cause is that the test script (tools/testing/selftests/memory-
> hotplug/mem-on-off-test.sh) changes online/offline status on memory
> blocks other than the one containing the page head. It will *randomly*
> select 10% of the memory blocks from /sys/devices/system/memory/memory*
> and change their online/offline status.

Ahh, that does explain it! Thanks for digging into that!

> That's why we need a PageHead() check now, and why this problem does
> not happen on systems with smaller huge pages such as 16M.
>
> As far as PageHuge() is concerned, I think PageHuge() will return true
> for all tail pages, because it will get the compound_head for a tail
> page and then check its huge page flag:
> 	page = compound_head(page);
>
> And as for the failure message, if a memory block is in use, offlining
> it will return a failure.

That's good, but aren't we still left with a situation where we've
offlined and dissolved the _middle_ of a gigantic huge page while the
head page is still in place and online?

That seems bad.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 9/17/16 12:25 AM, Dave Hansen wrote:
> That's an interesting data point, but it still doesn't quite explain
> what is going on.
>
> It seems like there might be parts of gigantic pages that have
> PageHuge() set on tail pages, while other parts don't. If that's true,
> we have another bug and your patch just papers over the issue.
>
> I think you really need to find the root cause before we apply this
> patch.

The root cause is that the test script (tools/testing/selftests/memory-
hotplug/mem-on-off-test.sh) changes online/offline status on memory
blocks other than the one containing the page head. It will *randomly*
select 10% of the memory blocks from /sys/devices/system/memory/memory*
and change their online/offline status.

On my system, the memory block size is 0x1000:
[root@elvis-n01-kvm memory]# cat block_size_bytes
1000

But the huge page size (16G) is larger than this memory block size, so
one huge page is composed of several memory blocks, for example
memory704, memory705, memory706 and so on. Then memory704 will contain
the head page, but memory705 will *only* contain tail pages. So the
problem will happen on it, if we call:
#echo offline > memory705/state

That's why we need a PageHead() check now, and why this problem does
not happen on systems with smaller huge pages such as 16M.

As far as PageHuge() is concerned, I think PageHuge() will return true
for all tail pages, because it will get the compound_head for a tail
page and then check its huge page flag:
	page = compound_head(page);

And as for the failure message, if a memory block is in use, offlining
it will return a failure.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/16/2016 06:58 AM, Rui Teng wrote:
> On 9/15/16 12:37 AM, Dave Hansen wrote:
>> On 09/14/2016 09:33 AM, Rui Teng wrote:
>> But, as far as describing the initial problem, can you explain how the
>> tail pages still ended up being PageHuge()? Seems like dissolving the
>> huge page should have cleared that.
>>
> I use the script tools/testing/selftests/memory-hotplug/mem-on-
> off-test.sh to test and reproduce this bug, and I printed the pfn
> range in dissolve_free_huge_pages(). The sizes of the pfn ranges are
> always 4096, and the ranges are separate:
> [ 72.362427] start_pfn: 204800, end_pfn: 208896
> [ 72.371677] start_pfn: 2162688, end_pfn: 2166784
> [ 72.373945] start_pfn: 217088, end_pfn: 221184
> [ 72.383218] start_pfn: 2170880, end_pfn: 2174976
> [ 72.385918] start_pfn: 2306048, end_pfn: 2310144
> [ 72.388254] start_pfn: 2326528, end_pfn: 2330624
>
> Sometimes it will report a failure:
> [ 72.371690] memory offlining [mem 0x21-0x210fff] failed
>
> And sometimes it will report the following:
> [ 72.373956] Offlined Pages 4096
>
> Could the start_pfn and end_pfn of dissolve_free_huge_pages() be
> *random*? If so, the range may not include any page head and may start
> from a tail page, right?

That's an interesting data point, but it still doesn't quite explain
what is going on.

It seems like there might be parts of gigantic pages that have
PageHuge() set on tail pages, while other parts don't. If that's true,
we have another bug and your patch just papers over the issue.

I think you really need to find the root cause before we apply this
patch.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 9/15/16 12:37 AM, Dave Hansen wrote:
> On 09/14/2016 09:33 AM, Rui Teng wrote:
>> How about returning the size of the page freed from
>> dissolve_free_huge_page(), and skipping pfn ahead by that step?
>
> That would be a nice improvement.
>
> But, as far as describing the initial problem, can you explain how the
> tail pages still ended up being PageHuge()? Seems like dissolving the
> huge page should have cleared that.

I use the script tools/testing/selftests/memory-hotplug/mem-on-
off-test.sh to test and reproduce this bug, and I printed the pfn range
in dissolve_free_huge_pages(). The sizes of the pfn ranges are always
4096, and the ranges are separate:
[ 72.362427] start_pfn: 204800, end_pfn: 208896
[ 72.371677] start_pfn: 2162688, end_pfn: 2166784
[ 72.373945] start_pfn: 217088, end_pfn: 221184
[ 72.383218] start_pfn: 2170880, end_pfn: 2174976
[ 72.385918] start_pfn: 2306048, end_pfn: 2310144
[ 72.388254] start_pfn: 2326528, end_pfn: 2330624

Sometimes it will report a failure:
[ 72.371690] memory offlining [mem 0x21-0x210fff] failed

And sometimes it will report the following:
[ 72.373956] Offlined Pages 4096

Could the start_pfn and end_pfn of dissolve_free_huge_pages() be
*random*? If so, the range may not include any page head and may start
from a tail page, right?
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/14/2016 09:33 AM, Rui Teng wrote:
>
> How about returning the size of the page freed from
> dissolve_free_huge_page(), and skipping pfn ahead by that step?

That would be a nice improvement.

But, as far as describing the initial problem, can you explain how the
tail pages still ended up being PageHuge()? Seems like dissolving the
huge page should have cleared that.
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 9/14/16 1:32 AM, Dave Hansen wrote:
> On 09/13/2016 01:39 AM, Rui Teng wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 87e11d8..64b5f81 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1442,7 +1442,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>>  static void dissolve_free_huge_page(struct page *page)
>>  {
>>  	spin_lock(&hugetlb_lock);
>> -	if (PageHuge(page) && !page_count(page)) {
>> +	if (PageHuge(page) && !page_count(page) && PageHead(page)) {
>>  		struct hstate *h = page_hstate(page);
>>  		int nid = page_to_nid(page);
>>  		list_del(&page->lru);
>
> This is goofy. What is calling dissolve_free_huge_page() on a tail
> page? Hmm:
>
>> 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
>> 		dissolve_free_huge_page(pfn_to_page(pfn));
>
> So, skip through the area being offlined at the smallest huge page
> size, and try to dissolve a huge page in each place one might appear.
>
> But, after we dissolve a 16GB huge page, we continue looking through
> the remaining 15.98GB tail area for huge pages in the area we just
> dissolved. The tail pages are still PageHuge() (how??), and we call
> page_hstate() on the tail page whose head was just dissolved.
>
> Note that even with the fix, this takes a (global) spinlock 1023 more
> times than it has to.
>
> This seems inefficient, and fails to fully explain what is going on or
> how tail pages still _look_ like PageHuge(), which seems pretty wrong.
>
> I guess the patch _works_. But, sheesh, it leaves a lot of room for
> improvement.

Thanks for your suggestion! How about returning the size of the page
freed from dissolve_free_huge_page(), and skipping pfn ahead by that
step?
Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()
On 09/13/2016 01:39 AM, Rui Teng wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 87e11d8..64b5f81 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1442,7 +1442,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  static void dissolve_free_huge_page(struct page *page)
>  {
>  	spin_lock(&hugetlb_lock);
> -	if (PageHuge(page) && !page_count(page)) {
> +	if (PageHuge(page) && !page_count(page) && PageHead(page)) {
>  		struct hstate *h = page_hstate(page);
>  		int nid = page_to_nid(page);
>  		list_del(&page->lru);

This is goofy. What is calling dissolve_free_huge_page() on a tail
page? Hmm:

> 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
> 		dissolve_free_huge_page(pfn_to_page(pfn));

So, skip through the area being offlined at the smallest huge page
size, and try to dissolve a huge page in each place one might appear.

But, after we dissolve a 16GB huge page, we continue looking through
the remaining 15.98GB tail area for huge pages in the area we just
dissolved. The tail pages are still PageHuge() (how??), and we call
page_hstate() on the tail page whose head was just dissolved.

Note that even with the fix, this takes a (global) spinlock 1023 more
times than it has to.

This seems inefficient, and fails to fully explain what is going on or
how tail pages still _look_ like PageHuge(), which seems pretty wrong.

I guess the patch _works_. But, sheesh, it leaves a lot of room for
improvement.