Re: [PATCH v2 1/1] mm: buddy page accessed before initialized

2017-11-03 Thread Michal Hocko
On Fri 03-11-17 09:47:30, Pavel Tatashin wrote:
> Hi Michal,
> 
> There is a small regression, on the largest x86 machine I have access to:
> Before:
> node 1 initialised, 32471632 pages in 901ms
> After:
> node 1 initialised, 32471632 pages in 1128ms
> 
> One node contains 128G of memory (overal 1T in 8 nodes). This
> regression is going to be solved by this work:
> https://patchwork.kernel.org/patch/9920953/, other than that I do not
> know a better solution. The overall performance is still much better
> compared to before this project.

OK, I think that is completely acceptable for now. We can always
optimize for a better result later.

> Also, thinking about this problem some more, it is safer to split the
> initialization, and freeing parts into two functions:
> 
> In deferred_init_memmap()
> 1574 for_each_free_mem_range(i, nid, MEMBLOCK_NONE, , , NULL) 
> {
> 1575 spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> 1576 epfn = min_t(unsigned long, zone_end_pfn(zone),
> PFN_DOWN(epa));
> 1577 nr_pages += deferred_init_range(nid, zid, spfn, epfn);
> 1578 }
> 
> Replace with two loops:
> First loop, calls a function that initializes the given range, the 2nd
> loop calls a function that frees it. This way we won't get a potential
> problem where buddy page is computed from the next range that has not
> yet been initialized. And it is also going to be easier to multithread
> later: multi-thread the first loop, wait for it to finish,
> multi-thread the 2nd loop wait for it to finish.

OK, but let's do that as a separate patch. What you have here is good
for now IMHO. My ack applies. Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/1] mm: buddy page accessed before initialized

2017-11-03 Thread Michal Hocko
On Fri 03-11-17 09:47:30, Pavel Tatashin wrote:
> Hi Michal,
> 
> There is a small regression, on the largest x86 machine I have access to:
> Before:
> node 1 initialised, 32471632 pages in 901ms
> After:
> node 1 initialised, 32471632 pages in 1128ms
> 
> One node contains 128G of memory (overal 1T in 8 nodes). This
> regression is going to be solved by this work:
> https://patchwork.kernel.org/patch/9920953/, other than that I do not
> know a better solution. The overall performance is still much better
> compared to before this project.

OK, I think that is completely acceptable for now. We can always
optimize for a better result later.

> Also, thinking about this problem some more, it is safer to split the
> initialization, and freeing parts into two functions:
> 
> In deferred_init_memmap()
> 1574 for_each_free_mem_range(i, nid, MEMBLOCK_NONE, , , NULL) 
> {
> 1575 spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> 1576 epfn = min_t(unsigned long, zone_end_pfn(zone),
> PFN_DOWN(epa));
> 1577 nr_pages += deferred_init_range(nid, zid, spfn, epfn);
> 1578 }
> 
> Replace with two loops:
> First loop, calls a function that initializes the given range, the 2nd
> loop calls a function that frees it. This way we won't get a potential
> problem where buddy page is computed from the next range that has not
> yet been initialized. And it is also going to be easier to multithread
> later: multi-thread the first loop, wait for it to finish,
> multi-thread the 2nd loop wait for it to finish.

OK, but let's do that as a separate patch. What you have here is good
for now IMHO. My ack applies. Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/1] mm: buddy page accessed before initialized

2017-11-03 Thread Pavel Tatashin
Hi Michal,

There is a small regression, on the largest x86 machine I have access to:
Before:
node 1 initialised, 32471632 pages in 901ms
After:
node 1 initialised, 32471632 pages in 1128ms

One node contains 128G of memory (overal 1T in 8 nodes). This
regression is going to be solved by this work:
https://patchwork.kernel.org/patch/9920953/, other than that I do not
know a better solution. The overall performance is still much better
compared to before this project.

Also, thinking about this problem some more, it is safer to split the
initialization, and freeing parts into two functions:

In deferred_init_memmap()
1574 for_each_free_mem_range(i, nid, MEMBLOCK_NONE, , , NULL) {
1575 spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
1576 epfn = min_t(unsigned long, zone_end_pfn(zone),
PFN_DOWN(epa));
1577 nr_pages += deferred_init_range(nid, zid, spfn, epfn);
1578 }

Replace with two loops:
First loop, calls a function that initializes the given range, the 2nd
loop calls a function that frees it. This way we won't get a potential
problem where buddy page is computed from the next range that has not
yet been initialized. And it is also going to be easier to multithread
later: multi-thread the first loop, wait for it to finish,
multi-thread the 2nd loop wait for it to finish.

Pasha


On Fri, Nov 3, 2017 at 5:27 AM, Michal Hocko  wrote:
> On Thu 02-11-17 13:02:21, Pavel Tatashin wrote:
>> This problem is seen when machine is rebooted after kexec:
>> A message like this is printed:
>> ==
>> WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0
>> Modules linked in:
>> CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90
>> Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U,
>> BIOS 3016
>> node 1 initialised, 32444607 pages in 1679ms
>> task: 880180e75a00 task.stack: c9000cdb
>> RIP: 0010:__list_del_entry_valid+0x83/0xa0
>> RSP: :c9000cdb3d18 EFLAGS: 00010046
>> RAX: 0054 RBX: 0009 RCX: 81c5f3e8
>> RDX:  RSI: 0086 RDI: 0046
>> RBP: c9000cdb3d18 R08: fffe R09: 0154
>> R10: 0005 R11: 0153 R12: 01fcdc00
>> R13: 01fcde00 R14: 88207ffded00 R15: ea007f37
>> FS:  () GS:881fffac() knlGS:0
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2:  CR3: 00407ec09001 CR4: 003606e0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> Call Trace:
>>  free_one_page+0x103/0x390
>>  __free_pages_ok+0x1cf/0x2d0
>>  __free_pages+0x19/0x30
>>  __free_pages_boot_core+0xae/0xba
>>  deferred_free_range+0x60/0x94
>>  deferred_init_memmap+0x324/0x372
>>  kthread+0x109/0x140
>>  ? __free_pages_bootmem+0x2e/0x2e
>>  ? kthread_park+0x60/0x60
>>  ret_from_fork+0x25/0x30
>>
>> list_del corruption. next->prev should be ea007f428020, but was
>> ea007f1d8020
>> ==
>>
>> The problem happens in this path:
>>
>> page_alloc_init_late
>>   deferred_init_memmap
>> deferred_init_range
>>   __def_free
>> deferred_free_range
>>   __free_pages_boot_core(page, order)
>> __free_pages()
>>   __free_pages_ok()
>> free_one_page()
>>   __free_one_page(page, pfn, zone, order, migratetype);
>>
>> deferred_init_range() initializes one page at a time by calling
>> __init_single_page(), once it initializes pageblock_nr_pages pages, it
>> calls deferred_free_range() to free the initialized pages to the buddy
>> allocator. Eventually, we reach __free_one_page(), where we compute buddy
>> page:
>>   buddy_pfn = __find_buddy_pfn(pfn, order);
>>   buddy = page + (buddy_pfn - pfn);
>>
>> buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages.
>> Thefore, buddy page becomes a page one after the range that currently was
>> initialized, and we access this page in this function. Also, later when we
>> return back to deferred_init_range(), the buddy page is initialized again.
>>
>> So, in order to avoid this issue, we must initialize the buddy page prior
>> to calling deferred_free_range().
>
> Have you measured any negative performance impact with this change?
>
>> Signed-off-by: Pavel Tatashin 
>
> The patch looks good to me otherwise. So if this doesn't introduce a
> noticeable overhead, which I whope it doesn't then feel free to add
> Acked-by: Michal Hocko 
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more 

Re: [PATCH v2 1/1] mm: buddy page accessed before initialized

2017-11-03 Thread Pavel Tatashin
Hi Michal,

There is a small regression, on the largest x86 machine I have access to:
Before:
node 1 initialised, 32471632 pages in 901ms
After:
node 1 initialised, 32471632 pages in 1128ms

One node contains 128G of memory (overal 1T in 8 nodes). This
regression is going to be solved by this work:
https://patchwork.kernel.org/patch/9920953/, other than that I do not
know a better solution. The overall performance is still much better
compared to before this project.

Also, thinking about this problem some more, it is safer to split the
initialization, and freeing parts into two functions:

In deferred_init_memmap()
1574 for_each_free_mem_range(i, nid, MEMBLOCK_NONE, , , NULL) {
1575 spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
1576 epfn = min_t(unsigned long, zone_end_pfn(zone),
PFN_DOWN(epa));
1577 nr_pages += deferred_init_range(nid, zid, spfn, epfn);
1578 }

Replace with two loops:
First loop, calls a function that initializes the given range, the 2nd
loop calls a function that frees it. This way we won't get a potential
problem where buddy page is computed from the next range that has not
yet been initialized. And it is also going to be easier to multithread
later: multi-thread the first loop, wait for it to finish,
multi-thread the 2nd loop wait for it to finish.

Pasha


On Fri, Nov 3, 2017 at 5:27 AM, Michal Hocko  wrote:
> On Thu 02-11-17 13:02:21, Pavel Tatashin wrote:
>> This problem is seen when machine is rebooted after kexec:
>> A message like this is printed:
>> ==
>> WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0
>> Modules linked in:
>> CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90
>> Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U,
>> BIOS 3016
>> node 1 initialised, 32444607 pages in 1679ms
>> task: 880180e75a00 task.stack: c9000cdb
>> RIP: 0010:__list_del_entry_valid+0x83/0xa0
>> RSP: :c9000cdb3d18 EFLAGS: 00010046
>> RAX: 0054 RBX: 0009 RCX: 81c5f3e8
>> RDX:  RSI: 0086 RDI: 0046
>> RBP: c9000cdb3d18 R08: fffe R09: 0154
>> R10: 0005 R11: 0153 R12: 01fcdc00
>> R13: 01fcde00 R14: 88207ffded00 R15: ea007f37
>> FS:  () GS:881fffac() knlGS:0
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2:  CR3: 00407ec09001 CR4: 003606e0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> Call Trace:
>>  free_one_page+0x103/0x390
>>  __free_pages_ok+0x1cf/0x2d0
>>  __free_pages+0x19/0x30
>>  __free_pages_boot_core+0xae/0xba
>>  deferred_free_range+0x60/0x94
>>  deferred_init_memmap+0x324/0x372
>>  kthread+0x109/0x140
>>  ? __free_pages_bootmem+0x2e/0x2e
>>  ? kthread_park+0x60/0x60
>>  ret_from_fork+0x25/0x30
>>
>> list_del corruption. next->prev should be ea007f428020, but was
>> ea007f1d8020
>> ==
>>
>> The problem happens in this path:
>>
>> page_alloc_init_late
>>   deferred_init_memmap
>> deferred_init_range
>>   __def_free
>> deferred_free_range
>>   __free_pages_boot_core(page, order)
>> __free_pages()
>>   __free_pages_ok()
>> free_one_page()
>>   __free_one_page(page, pfn, zone, order, migratetype);
>>
>> deferred_init_range() initializes one page at a time by calling
>> __init_single_page(), once it initializes pageblock_nr_pages pages, it
>> calls deferred_free_range() to free the initialized pages to the buddy
>> allocator. Eventually, we reach __free_one_page(), where we compute buddy
>> page:
>>   buddy_pfn = __find_buddy_pfn(pfn, order);
>>   buddy = page + (buddy_pfn - pfn);
>>
>> buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages.
>> Thefore, buddy page becomes a page one after the range that currently was
>> initialized, and we access this page in this function. Also, later when we
>> return back to deferred_init_range(), the buddy page is initialized again.
>>
>> So, in order to avoid this issue, we must initialize the buddy page prior
>> to calling deferred_free_range().
>
> Have you measured any negative performance impact with this change?
>
>> Signed-off-by: Pavel Tatashin 
>
> The patch looks good to me otherwise. So if this doesn't introduce a
> noticeable overhead, which I whope it doesn't then feel free to add
> Acked-by: Michal Hocko 
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: 

Re: [PATCH v2 1/1] mm: buddy page accessed before initialized

2017-11-03 Thread Michal Hocko
On Thu 02-11-17 13:02:21, Pavel Tatashin wrote:
> This problem is seen when machine is rebooted after kexec:
> A message like this is printed:
> ==
> WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0
> Modules linked in:
> CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90
> Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U,
> BIOS 3016
> node 1 initialised, 32444607 pages in 1679ms
> task: 880180e75a00 task.stack: c9000cdb
> RIP: 0010:__list_del_entry_valid+0x83/0xa0
> RSP: :c9000cdb3d18 EFLAGS: 00010046
> RAX: 0054 RBX: 0009 RCX: 81c5f3e8
> RDX:  RSI: 0086 RDI: 0046
> RBP: c9000cdb3d18 R08: fffe R09: 0154
> R10: 0005 R11: 0153 R12: 01fcdc00
> R13: 01fcde00 R14: 88207ffded00 R15: ea007f37
> FS:  () GS:881fffac() knlGS:0
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 00407ec09001 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  free_one_page+0x103/0x390
>  __free_pages_ok+0x1cf/0x2d0
>  __free_pages+0x19/0x30
>  __free_pages_boot_core+0xae/0xba
>  deferred_free_range+0x60/0x94
>  deferred_init_memmap+0x324/0x372
>  kthread+0x109/0x140
>  ? __free_pages_bootmem+0x2e/0x2e
>  ? kthread_park+0x60/0x60
>  ret_from_fork+0x25/0x30
> 
> list_del corruption. next->prev should be ea007f428020, but was
> ea007f1d8020
> ==
> 
> The problem happens in this path:
> 
> page_alloc_init_late
>   deferred_init_memmap
> deferred_init_range
>   __def_free
> deferred_free_range
>   __free_pages_boot_core(page, order)
> __free_pages()
>   __free_pages_ok()
> free_one_page()
>   __free_one_page(page, pfn, zone, order, migratetype);
> 
> deferred_init_range() initializes one page at a time by calling
> __init_single_page(), once it initializes pageblock_nr_pages pages, it
> calls deferred_free_range() to free the initialized pages to the buddy
> allocator. Eventually, we reach __free_one_page(), where we compute buddy
> page:
>   buddy_pfn = __find_buddy_pfn(pfn, order);
>   buddy = page + (buddy_pfn - pfn);
> 
> buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages.
> Thefore, buddy page becomes a page one after the range that currently was
> initialized, and we access this page in this function. Also, later when we
> return back to deferred_init_range(), the buddy page is initialized again.
> 
> So, in order to avoid this issue, we must initialize the buddy page prior
> to calling deferred_free_range().

Have you measured any negative performance impact with this change?

> Signed-off-by: Pavel Tatashin 

The patch looks good to me otherwise. So if this doesn't introduce a
noticeable overhead, which I whope it doesn't then feel free to add
Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/1] mm: buddy page accessed before initialized

2017-11-03 Thread Michal Hocko
On Thu 02-11-17 13:02:21, Pavel Tatashin wrote:
> This problem is seen when machine is rebooted after kexec:
> A message like this is printed:
> ==
> WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0
> Modules linked in:
> CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90
> Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U,
> BIOS 3016
> node 1 initialised, 32444607 pages in 1679ms
> task: 880180e75a00 task.stack: c9000cdb
> RIP: 0010:__list_del_entry_valid+0x83/0xa0
> RSP: :c9000cdb3d18 EFLAGS: 00010046
> RAX: 0054 RBX: 0009 RCX: 81c5f3e8
> RDX:  RSI: 0086 RDI: 0046
> RBP: c9000cdb3d18 R08: fffe R09: 0154
> R10: 0005 R11: 0153 R12: 01fcdc00
> R13: 01fcde00 R14: 88207ffded00 R15: ea007f37
> FS:  () GS:881fffac() knlGS:0
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 00407ec09001 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  free_one_page+0x103/0x390
>  __free_pages_ok+0x1cf/0x2d0
>  __free_pages+0x19/0x30
>  __free_pages_boot_core+0xae/0xba
>  deferred_free_range+0x60/0x94
>  deferred_init_memmap+0x324/0x372
>  kthread+0x109/0x140
>  ? __free_pages_bootmem+0x2e/0x2e
>  ? kthread_park+0x60/0x60
>  ret_from_fork+0x25/0x30
> 
> list_del corruption. next->prev should be ea007f428020, but was
> ea007f1d8020
> ==
> 
> The problem happens in this path:
> 
> page_alloc_init_late
>   deferred_init_memmap
> deferred_init_range
>   __def_free
> deferred_free_range
>   __free_pages_boot_core(page, order)
> __free_pages()
>   __free_pages_ok()
> free_one_page()
>   __free_one_page(page, pfn, zone, order, migratetype);
> 
> deferred_init_range() initializes one page at a time by calling
> __init_single_page(), once it initializes pageblock_nr_pages pages, it
> calls deferred_free_range() to free the initialized pages to the buddy
> allocator. Eventually, we reach __free_one_page(), where we compute buddy
> page:
>   buddy_pfn = __find_buddy_pfn(pfn, order);
>   buddy = page + (buddy_pfn - pfn);
> 
> buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages.
> Thefore, buddy page becomes a page one after the range that currently was
> initialized, and we access this page in this function. Also, later when we
> return back to deferred_init_range(), the buddy page is initialized again.
> 
> So, in order to avoid this issue, we must initialize the buddy page prior
> to calling deferred_free_range().

Have you measured any negative performance impact with this change?

> Signed-off-by: Pavel Tatashin 

The patch looks good to me otherwise. So if this doesn't introduce a
noticeable overhead, which I whope it doesn't then feel free to add
Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs