Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Fri 14-07-17 11:34:31, Vlastimil Babka wrote: > On 07/14/2017 11:13 AM, Michal Hocko wrote: > > On Fri 07-07-17 14:00:03, Vlastimil Babka wrote: > >> On 07/04/2017 07:17 AM, Joonsoo Kim wrote: > > Still, backporting b8f1a75d61d8 fixes this: > > [1.538379] allocated 738197504 bytes of page_ext > [1.539340] Node 0, zone DMA: page owner found early allocated 0 > pages > [1.540179] Node 0, zoneDMA32: page owner found early allocated > 33 pages > [1.611173] Node 0, zone Normal: page owner found early allocated > 96755 pages > [1.683167] Node 1, zone Normal: page owner found early allocated > 96575 pages > > No panic, notice how it allocated more for page_ext, and found smaller > number of > early allocated pages. > > Now backporting fe53ca54270a on top: > > [0.00] allocated 738197504 bytes of page_ext > [0.00] Node 0, zone DMA: page owner found early allocated 0 > pages > [0.00] Node 0, zoneDMA32: page owner found early allocated > 33 pages > [0.00] Node 0, zone Normal: page owner found early allocated > 2842622 pages > [0.00] Node 1, zone Normal: page owner found early allocated > 3694362 pages > > Again no panic, and same amount of page_ext usage. But the "early > allocated" numbers > seem bogus to me. I think it's because init_pages_in_zone() is running > and inspecting > struct pages that have not been yet initialized. It doesn't end up > crashing, but > still doesn't seem correct? > >>> > >>> Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone() > >>> called before page_alloc_init_late(). So, there would be many > >>> uninitialized pages with PageReserved(). Page owner regarded these > >>> PageReserved() page as allocated page. > >> > >> That seems incorrect for two reasons: > >> - init_pages_in_zone() actually skips PageReserved() pages > >> - the pages don't have PageReserved() flag, until the deferred struct page > >> init > >> thread processes them via deferred_init_memmap() -> __init_single_page() > >> AFAICS > >> > >> Now I've found out why upstream reports much less early allocated pages > >> than our > >> kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range > >> overlapping > >> check") which adds a "page_zone(page) != zone" check. I think this only > >> works > >> because the pages are not initialized and thus have no nid/zone links. > >> Probably > >> page_zone() only doesn't break because it's all zeroed. I don't think it's > >> safe > >> to rely on this? > > > > Yes, if anything PageReserved should be checked before the zone check. > > That wouldn't change anything, because we skip PageReserved and it's not > set. I thought they were still marked reserved from the bootmem allocator I would have to go through the initialization code again to be sure. > Perhaps we could skip pages that have the raw page flags value > zero, but then a) we should make sure that the allocation of the struct > page array zeroes the range, and b) the first modification of struct > page in the initialization is setting the PageReserved flag. I would rather not depend on the page state. There are plans to not initialize the struct page (even to 0 during memmap init) until __init_single_page. Either the page is fully initialized or we are touching invalid pfn range. end_pfn = pfn + zone->spanned_pages but I guess we should in fact consider first_deferred_pfn as well (calculate_node_totalpages is not deffered initialization aware). -- Michal Hocko SUSE Labs
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On 07/14/2017 11:13 AM, Michal Hocko wrote: > On Fri 07-07-17 14:00:03, Vlastimil Babka wrote: >> On 07/04/2017 07:17 AM, Joonsoo Kim wrote: Still, backporting b8f1a75d61d8 fixes this: [1.538379] allocated 738197504 bytes of page_ext [1.539340] Node 0, zone DMA: page owner found early allocated 0 pages [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 pages [1.611173] Node 0, zone Normal: page owner found early allocated 96755 pages [1.683167] Node 1, zone Normal: page owner found early allocated 96575 pages No panic, notice how it allocated more for page_ext, and found smaller number of early allocated pages. Now backporting fe53ca54270a on top: [0.00] allocated 738197504 bytes of page_ext [0.00] Node 0, zone DMA: page owner found early allocated 0 pages [0.00] Node 0, zoneDMA32: page owner found early allocated 33 pages [0.00] Node 0, zone Normal: page owner found early allocated 2842622 pages [0.00] Node 1, zone Normal: page owner found early allocated 3694362 pages Again no panic, and same amount of page_ext usage. But the "early allocated" numbers seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting struct pages that have not been yet initialized. It doesn't end up crashing, but still doesn't seem correct? >>> >>> Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone() >>> called before page_alloc_init_late(). So, there would be many >>> uninitialized pages with PageReserved(). Page owner regarded these >>> PageReserved() page as allocated page. >> >> That seems incorrect for two reasons: >> - init_pages_in_zone() actually skips PageReserved() pages >> - the pages don't have PageReserved() flag, until the deferred struct page >> init >> thread processes them via deferred_init_memmap() -> __init_single_page() >> AFAICS >> >> Now I've found out why upstream reports much less early allocated pages than >> our >> kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range >> overlapping >> check") which adds a "page_zone(page) != zone" check. I think this only works >> because the pages are not initialized and thus have no nid/zone links. >> Probably >> page_zone() only doesn't break because it's all zeroed. I don't think it's >> safe >> to rely on this? > > Yes, if anything PageReserved should be checked before the zone check. That wouldn't change anything, because we skip PageReserved and it's not set. Perhaps we could skip pages that have the raw page flags value zero, but then a) we should make sure that the allocation of the struct page array zeroes the range, and b) the first modification of struct page in the initialization is setting the PageReserved flag.
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Fri 07-07-17 14:00:03, Vlastimil Babka wrote: > On 07/04/2017 07:17 AM, Joonsoo Kim wrote: > >> > >> Still, backporting b8f1a75d61d8 fixes this: > >> > >> [1.538379] allocated 738197504 bytes of page_ext > >> [1.539340] Node 0, zone DMA: page owner found early allocated 0 > >> pages > >> [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 > >> pages > >> [1.611173] Node 0, zone Normal: page owner found early allocated > >> 96755 pages > >> [1.683167] Node 1, zone Normal: page owner found early allocated > >> 96575 pages > >> > >> No panic, notice how it allocated more for page_ext, and found smaller > >> number of > >> early allocated pages. > >> > >> Now backporting fe53ca54270a on top: > >> > >> [0.00] allocated 738197504 bytes of page_ext > >> [0.00] Node 0, zone DMA: page owner found early allocated 0 > >> pages > >> [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > >> pages > >> [0.00] Node 0, zone Normal: page owner found early allocated > >> 2842622 pages > >> [0.00] Node 1, zone Normal: page owner found early allocated > >> 3694362 pages > >> > >> Again no panic, and same amount of page_ext usage. But the "early > >> allocated" numbers > >> seem bogus to me. I think it's because init_pages_in_zone() is running and > >> inspecting > >> struct pages that have not been yet initialized. It doesn't end up > >> crashing, but > >> still doesn't seem correct? > > > > Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone() > > called before page_alloc_init_late(). So, there would be many > > uninitialized pages with PageReserved(). Page owner regarded these > > PageReserved() page as allocated page. > > That seems incorrect for two reasons: > - init_pages_in_zone() actually skips PageReserved() pages > - the pages don't have PageReserved() flag, until the deferred struct page > init > thread processes them via deferred_init_memmap() -> __init_single_page() > AFAICS > > Now I've found out why upstream reports much less early allocated pages than > our > kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range overlapping > check") which adds a "page_zone(page) != zone" check. I think this only works > because the pages are not initialized and thus have no nid/zone links. > Probably > page_zone() only doesn't break because it's all zeroed. I don't think it's > safe > to rely on this? Yes, if anything PageReserved should be checked before the zone check. -- Michal Hocko SUSE Labs
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On 07/04/2017 07:17 AM, Joonsoo Kim wrote: >> >> Still, backporting b8f1a75d61d8 fixes this: >> >> [1.538379] allocated 738197504 bytes of page_ext >> [1.539340] Node 0, zone DMA: page owner found early allocated 0 >> pages >> [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 >> pages >> [1.611173] Node 0, zone Normal: page owner found early allocated 96755 >> pages >> [1.683167] Node 1, zone Normal: page owner found early allocated 96575 >> pages >> >> No panic, notice how it allocated more for page_ext, and found smaller >> number of >> early allocated pages. >> >> Now backporting fe53ca54270a on top: >> >> [0.00] allocated 738197504 bytes of page_ext >> [0.00] Node 0, zone DMA: page owner found early allocated 0 >> pages >> [0.00] Node 0, zoneDMA32: page owner found early allocated 33 >> pages >> [0.00] Node 0, zone Normal: page owner found early allocated >> 2842622 pages >> [0.00] Node 1, zone Normal: page owner found early allocated >> 3694362 pages >> >> Again no panic, and same amount of page_ext usage. But the "early allocated" >> numbers >> seem bogus to me. I think it's because init_pages_in_zone() is running and >> inspecting >> struct pages that have not been yet initialized. It doesn't end up crashing, >> but >> still doesn't seem correct? > > Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone() > called before page_alloc_init_late(). So, there would be many > uninitialized pages with PageReserved(). Page owner regarded these > PageReserved() page as allocated page. That seems incorrect for two reasons: - init_pages_in_zone() actually skips PageReserved() pages - the pages don't have PageReserved() flag, until the deferred struct page init thread processes them via deferred_init_memmap() -> __init_single_page() AFAICS Now I've found out why upstream reports much less early allocated pages than our kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range overlapping check") which adds a "page_zone(page) != zone" check. I think this only works because the pages are not initialized and thus have no nid/zone links. Probably page_zone() only doesn't break because it's all zeroed. I don't think it's safe to rely on this? > We can change the message to "page owner found early reserved N pages" > > Thanks. >
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Tue 04-07-17 11:39:57, Vlastimil Babka wrote: > On 07/04/2017 07:23 AM, Joonsoo Kim wrote: > > On Mon, Jul 03, 2017 at 04:18:01PM +0200, Vlastimil Babka wrote: > >> allocated" looks much more sane there. But there's a warning nevertheless. > > > > Warning would comes from the fact that drain_all_pages() is called > > before mm_percpu_wq is initialised. We could remove WARN_ON_ONCE() and add > > drain_local_page(zone) to fix the problem. > > Wouldn't that still leave some period during boot where kernel already > runs on multiple CPU's, but mm_percpu_wq is not yet initialized and > somebody tries to use it? We want to catch such cases, right? I haven't checked the boot sequence but if we know that we need mm_percpu_wq initialized earlier than moving it should be not a big deal I guess. -- Michal Hocko SUSE Labs
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On 07/04/2017 07:23 AM, Joonsoo Kim wrote: > On Mon, Jul 03, 2017 at 04:18:01PM +0200, Vlastimil Babka wrote: >> allocated" looks much more sane there. But there's a warning nevertheless. > > Warning would comes from the fact that drain_all_pages() is called > before mm_percpu_wq is initialised. We could remove WARN_ON_ONCE() and add > drain_local_page(zone) to fix the problem. Wouldn't that still leave some period during boot where kernel already runs on multiple CPU's, but mm_percpu_wq is not yet initialized and somebody tries to use it? We want to catch such cases, right? > Thanks. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org";> em...@kvack.org >
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Tue 04-07-17 14:11:41, Joonsoo Kim wrote: > On Fri, Jun 30, 2017 at 05:44:16PM +0200, Michal Hocko wrote: > > On Fri 30-06-17 17:42:24, Michal Hocko wrote: > > [...] > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > index 16532fa0bb64..894697c1e6f5 100644 > > > --- a/include/linux/mmzone.h > > > +++ b/include/linux/mmzone.h > > > @@ -1055,6 +1055,7 @@ static inline struct zoneref > > > *first_zones_zonelist(struct zonelist *zonelist, > > > !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > > > static inline unsigned long early_pfn_to_nid(unsigned long pfn) > > > { > > > + BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA)); > > > > Err, this should read BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA)) of course > > Agreed. > > However, AFAIK, ARM can set CONFIG_NUMA but it doesn't have > CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID and CONFIG_HAVE_MEMBLOCK_NODE_MAP. $ git grep "config NUMA\|select NUMA" arch/arm $ Did you mean arch64? If yes this one looks ok $ git grep "HAVE_MEMBLOCK_NODE_MAP\|HAVE_ARCH_EARLY_PFN_TO_NID" arch/arm64/ arch/arm64/Kconfig: select HAVE_MEMBLOCK_NODE_MAP if NUMA > If page_ext uses early_pfn_to_nid(), it will cause build error in ARM. Which would be intentional if it doesn't provide a proper implementation of the function. > Therefore, I suggest following change. > CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on proper early_pfn_to_nid(). > So, following code will always work as long as > CONFIG_DEFERRED_STRUCT_PAGE_INIT works. I haven't checked all other callers of early_pfn_to_nid yet but I have run my original patch (with !IS_ENABLED...) just to see whether anybody actually uses this function from an innvalid context and it hasn't blown up. So I suspect that all current users simply use the function from the proper context. So if nobody objects I would just post the patch for inclusion. If the compilation breaks we can think of a proper implementation. > > Thanks. > > --->8--- > diff --git a/mm/page_ext.c b/mm/page_ext.c > index 88ccc044..e3db259 100644 > --- a/mm/page_ext.c > +++ b/mm/page_ext.c > @@ -384,6 +384,7 @@ void __init page_ext_init(void) > > for_each_node_state(nid, N_MEMORY) { > unsigned long start_pfn, end_pfn; > + int page_nid; > > start_pfn = node_start_pfn(nid); > end_pfn = node_end_pfn(nid); > @@ -405,8 +406,15 @@ void __init page_ext_init(void) > * > * Take into account DEFERRED_STRUCT_PAGE_INIT. > */ > - if (early_pfn_to_nid(pfn) != nid) > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > + page_nid = early_pfn_to_nid(pfn); > +#else > + page_nid = pfn_to_nid(pfn); > +#endif I cannot say I would be happy about this ifdefery. Especially when there is no existing user which would need it. > + > + if (page_nid != nid) > continue; > + > if (init_section_page_ext(pfn, nid)) > goto oom; > } -- Michal Hocko SUSE Labs
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Mon, Jul 03, 2017 at 04:18:01PM +0200, Vlastimil Babka wrote: > On 07/03/2017 01:48 PM, Vlastimil Babka wrote: > > On 06/30/2017 04:18 PM, Michal Hocko wrote: > >> fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem > >> to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp. > >> CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with > >> !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && > >> !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > >> I am not sure how widely is this used but such a code is tricky. I see > >> how catching early allocations during defered initialization might be > >> useful but a subtly broken code sounds like a problem to me. So is > >> fe53ca54270a worth this or we should revert it? > > > > There might be more issues with fe53ca54270a, I think. This I've > > observed on our 4.4-based kernel, which has deferred page struct init, > > but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all > > struct pages are initialized") nor aforementioned fe53ca54270a: > > > > [0.00] allocated 421003264 bytes of page_ext > > [0.00] Node 0, zone DMA: page owner found early allocated 0 > > pages > > [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > > pages > > [0.00] Node 0, zone Normal: page owner found early allocated > > 2842622 pages > > [0.00] BUG: unable to handle kernel NULL pointer dereference at > > (null) > > [0.00] IP: [] init_page_owner+0x12a/0x240 > > [0.00] PGD 0 > > [0.00] Oops: [#1] SMP > > [0.00] Modules linked in: > > [0.00] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7 > > [0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > 1.0.0-prebuilt.qemu-project.org 04/01/2014 > > [0.00] task: 81e104c0 ti: 81e0 task.ti: > > 81e0 > > [0.00] RIP: 0010:[] [] > > init_page_owner+0x12a/0x240 > > [0.00] RSP: :81e03ed0 EFLAGS: 00010046 > > [0.00] RAX: RBX: 88083ffe0210 RCX: > > ea001300 > > [0.00] RDX: 0300 RSI: 81f57437 RDI: > > 004c > > [0.00] RBP: 81e03f20 R08: 81e03e90 R09: > > > > [0.00] R10: 004c0200 R11: R12: > > ea00 > > [0.00] R13: 004c0200 R14: 004c R15: > > 0084 > > [0.00] FS: () GS:88042fc0() > > knlGS: > > [0.00] CS: 0010 DS: ES: CR0: 80050033 > > [0.00] CR2: CR3: 01e0b000 CR4: > > 000406b0 > > [0.00] Stack: > > [0.00] 0206 88083ffe0f90 88083ffdf000 > > 3181 > > [0.00] ea001300 0040 ea00 > > 0084 > > [0.00] 0084 8e10 81e03f50 > > 81f84145 > > [0.00] Call Trace: > > [0.00] [] page_ext_init+0x15e/0x167 > > [0.00] [] start_kernel+0x351/0x418 > > [0.00] [] ? early_idt_handler_array+0x120/0x120 > > [0.00] [] x86_64_start_reservations+0x2a/0x2c > > [0.00] [] x86_64_start_kernel+0x12c/0x13b > > [0.00] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 > > d5 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 > > <48> 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 > > [0.00] RIP [] init_page_owner+0x12a/0x240 > > [0.00] RSP > > [0.00] CR2: > > [0.00] ---[ end trace 19e05592f03a690f ]--- > > > > Note that this is different backtrace than in b8f1a75d61d8 log. > > > > Still, backporting b8f1a75d61d8 fixes this: > > > > [1.538379] allocated 738197504 bytes of page_ext > > [1.539340] Node 0, zone DMA: page owner found early allocated 0 > > pages > > [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 > > pages > > [1.611173] Node 0, zone Normal: page owner found early allocated > > 96755 pages > > [1.683167] Node 1, zone Normal: page owner found early allocated > > 96575 pages > > > > No panic, notice how it allocated more for page_ext, and found smaller > > number of > > early allocated pages. > > > > Now backporting fe53ca54270a on top: > > > > [0.00] allocated 738197504 bytes of page_ext > > [0.00] Node 0, zone DMA: page owner found early allocated 0 > > pages > > [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > > pages > > [0.00] Node 0, zone Normal: page owner found early allocated > > 2842622 pages > > [0.00] Node 1, zone Normal: page owner found early allocated > > 3694362 pages > > > > Again no panic, and same amount of page_ext usage. But the "early >
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Mon, Jul 03, 2017 at 01:48:05PM +0200, Vlastimil Babka wrote: > On 06/30/2017 04:18 PM, Michal Hocko wrote: > > fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem > > to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp. > > CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with > > !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && > > !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > > I am not sure how widely is this used but such a code is tricky. I see > > how catching early allocations during defered initialization might be > > useful but a subtly broken code sounds like a problem to me. So is > > fe53ca54270a worth this or we should revert it? > > There might be more issues with fe53ca54270a, I think. This I've > observed on our 4.4-based kernel, which has deferred page struct init, > but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all > struct pages are initialized") nor aforementioned fe53ca54270a: > > [0.00] allocated 421003264 bytes of page_ext > [0.00] Node 0, zone DMA: page owner found early allocated 0 pages > [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > pages > [0.00] Node 0, zone Normal: page owner found early allocated > 2842622 pages > [0.00] BUG: unable to handle kernel NULL pointer dereference at > (null) > [0.00] IP: [] init_page_owner+0x12a/0x240 > [0.00] PGD 0 > [0.00] Oops: [#1] SMP > [0.00] Modules linked in: > [0.00] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7 > [0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.0.0-prebuilt.qemu-project.org 04/01/2014 > [0.00] task: 81e104c0 ti: 81e0 task.ti: > 81e0 > [0.00] RIP: 0010:[] [] > init_page_owner+0x12a/0x240 > [0.00] RSP: :81e03ed0 EFLAGS: 00010046 > [0.00] RAX: RBX: 88083ffe0210 RCX: > ea001300 > [0.00] RDX: 0300 RSI: 81f57437 RDI: > 004c > [0.00] RBP: 81e03f20 R08: 81e03e90 R09: > > [0.00] R10: 004c0200 R11: R12: > ea00 > [0.00] R13: 004c0200 R14: 004c R15: > 0084 > [0.00] FS: () GS:88042fc0() > knlGS: > [0.00] CS: 0010 DS: ES: CR0: 80050033 > [0.00] CR2: CR3: 01e0b000 CR4: > 000406b0 > [0.00] Stack: > [0.00] 0206 88083ffe0f90 88083ffdf000 > 3181 > [0.00] ea001300 0040 ea00 > 0084 > [0.00] 0084 8e10 81e03f50 > 81f84145 > [0.00] Call Trace: > [0.00] [] page_ext_init+0x15e/0x167 > [0.00] [] start_kernel+0x351/0x418 > [0.00] [] ? early_idt_handler_array+0x120/0x120 > [0.00] [] x86_64_start_reservations+0x2a/0x2c > [0.00] [] x86_64_start_kernel+0x12c/0x13b > [0.00] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 > 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> > 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 > [0.00] RIP [] init_page_owner+0x12a/0x240 > [0.00] RSP > [0.00] CR2: > [0.00] ---[ end trace 19e05592f03a690f ]--- > > Note that this is different backtrace than in b8f1a75d61d8 log. > > Still, backporting b8f1a75d61d8 fixes this: > > [1.538379] allocated 738197504 bytes of page_ext > [1.539340] Node 0, zone DMA: page owner found early allocated 0 pages > [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 > pages > [1.611173] Node 0, zone Normal: page owner found early allocated 96755 > pages > [1.683167] Node 1, zone Normal: page owner found early allocated 96575 > pages > > No panic, notice how it allocated more for page_ext, and found smaller number > of > early allocated pages. > > Now backporting fe53ca54270a on top: > > [0.00] allocated 738197504 bytes of page_ext > [0.00] Node 0, zone DMA: page owner found early allocated 0 pages > [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > pages > [0.00] Node 0, zone Normal: page owner found early allocated > 2842622 pages > [0.00] Node 1, zone Normal: page owner found early allocated > 3694362 pages > > Again no panic, and same amount of page_ext usage. But the "early allocated" > numbers > seem bogus to me. I think it's because init_pages_in_zone() is running and > inspecting > struct pages that have not been yet initialized. It doesn't end up crashing, > but > still doesn't seem correct? Numbers looks sane to me. fe53
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Fri, Jun 30, 2017 at 05:44:16PM +0200, Michal Hocko wrote: > On Fri 30-06-17 17:42:24, Michal Hocko wrote: > [...] > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 16532fa0bb64..894697c1e6f5 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -1055,6 +1055,7 @@ static inline struct zoneref > > *first_zones_zonelist(struct zonelist *zonelist, > > !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > > static inline unsigned long early_pfn_to_nid(unsigned long pfn) > > { > > + BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA)); > > Err, this should read BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA)) of course Agreed. However, AFAIK, ARM can set CONFIG_NUMA but it doesn't have CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID and CONFIG_HAVE_MEMBLOCK_NODE_MAP. If page_ext uses early_pfn_to_nid(), it will cause build error in ARM. Therefore, I suggest following change. CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on proper early_pfn_to_nid(). So, following code will always work as long as CONFIG_DEFERRED_STRUCT_PAGE_INIT works. Thanks. --->8--- diff --git a/mm/page_ext.c b/mm/page_ext.c index 88ccc044..e3db259 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -384,6 +384,7 @@ void __init page_ext_init(void) for_each_node_state(nid, N_MEMORY) { unsigned long start_pfn, end_pfn; + int page_nid; start_pfn = node_start_pfn(nid); end_pfn = node_end_pfn(nid); @@ -405,8 +406,15 @@ void __init page_ext_init(void) * * Take into account DEFERRED_STRUCT_PAGE_INIT. */ - if (early_pfn_to_nid(pfn) != nid) +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + page_nid = early_pfn_to_nid(pfn); +#else + page_nid = pfn_to_nid(pfn); +#endif + + if (page_nid != nid) continue; + if (init_section_page_ext(pfn, nid)) goto oom; }
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On 07/03/2017 01:48 PM, Vlastimil Babka wrote: > On 06/30/2017 04:18 PM, Michal Hocko wrote: >> fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem >> to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp. >> CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with >> !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && >> !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) >> I am not sure how widely is this used but such a code is tricky. I see >> how catching early allocations during defered initialization might be >> useful but a subtly broken code sounds like a problem to me. So is >> fe53ca54270a worth this or we should revert it? > > There might be more issues with fe53ca54270a, I think. This I've > observed on our 4.4-based kernel, which has deferred page struct init, > but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all > struct pages are initialized") nor aforementioned fe53ca54270a: > > [0.00] allocated 421003264 bytes of page_ext > [0.00] Node 0, zone DMA: page owner found early allocated 0 pages > [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > pages > [0.00] Node 0, zone Normal: page owner found early allocated > 2842622 pages > [0.00] BUG: unable to handle kernel NULL pointer dereference at > (null) > [0.00] IP: [] init_page_owner+0x12a/0x240 > [0.00] PGD 0 > [0.00] Oops: [#1] SMP > [0.00] Modules linked in: > [0.00] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7 > [0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.0.0-prebuilt.qemu-project.org 04/01/2014 > [0.00] task: 81e104c0 ti: 81e0 task.ti: > 81e0 > [0.00] RIP: 0010:[] [] > init_page_owner+0x12a/0x240 > [0.00] RSP: :81e03ed0 EFLAGS: 00010046 > [0.00] RAX: RBX: 88083ffe0210 RCX: > ea001300 > [0.00] RDX: 0300 RSI: 81f57437 RDI: > 004c > [0.00] RBP: 81e03f20 R08: 81e03e90 R09: > > [0.00] R10: 004c0200 R11: R12: > ea00 > [0.00] R13: 004c0200 R14: 004c R15: > 0084 > [0.00] FS: () GS:88042fc0() > knlGS: > [0.00] CS: 0010 DS: ES: CR0: 80050033 > [0.00] CR2: CR3: 01e0b000 CR4: > 000406b0 > [0.00] Stack: > [0.00] 0206 88083ffe0f90 88083ffdf000 > 3181 > [0.00] ea001300 0040 ea00 > 0084 > [0.00] 0084 8e10 81e03f50 > 81f84145 > [0.00] Call Trace: > [0.00] [] page_ext_init+0x15e/0x167 > [0.00] [] start_kernel+0x351/0x418 > [0.00] [] ? early_idt_handler_array+0x120/0x120 > [0.00] [] x86_64_start_reservations+0x2a/0x2c > [0.00] [] x86_64_start_kernel+0x12c/0x13b > [0.00] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 > 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> > 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 > [0.00] RIP [] init_page_owner+0x12a/0x240 > [0.00] RSP > [0.00] CR2: > [0.00] ---[ end trace 19e05592f03a690f ]--- > > Note that this is different backtrace than in b8f1a75d61d8 log. > > Still, backporting b8f1a75d61d8 fixes this: > > [1.538379] allocated 738197504 bytes of page_ext > [1.539340] Node 0, zone DMA: page owner found early allocated 0 pages > [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 > pages > [1.611173] Node 0, zone Normal: page owner found early allocated 96755 > pages > [1.683167] Node 1, zone Normal: page owner found early allocated 96575 > pages > > No panic, notice how it allocated more for page_ext, and found smaller number > of > early allocated pages. > > Now backporting fe53ca54270a on top: > > [0.00] allocated 738197504 bytes of page_ext > [0.00] Node 0, zone DMA: page owner found early allocated 0 pages > [0.00] Node 0, zoneDMA32: page owner found early allocated 33 > pages > [0.00] Node 0, zone Normal: page owner found early allocated > 2842622 pages > [0.00] Node 1, zone Normal: page owner found early allocated > 3694362 pages > > Again no panic, and same amount of page_ext usage. But the "early allocated" > numbers > seem bogus to me. I think it's because init_pages_in_zone() is running and > inspecting > struct pages that have not been yet initialized. It doesn't end up crashing, > but > still doesn't seem correct? Hmm, and for the record, this is recent mainline. Wonder
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On 06/30/2017 04:18 PM, Michal Hocko wrote: > fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem > to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp. > CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with > !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && > !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > I am not sure how widely is this used but such a code is tricky. I see > how catching early allocations during defered initialization might be > useful but a subtly broken code sounds like a problem to me. So is > fe53ca54270a worth this or we should revert it? There might be more issues with fe53ca54270a, I think. This I've observed on our 4.4-based kernel, which has deferred page struct init, but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all struct pages are initialized") nor aforementioned fe53ca54270a: [0.00] allocated 421003264 bytes of page_ext [0.00] Node 0, zone DMA: page owner found early allocated 0 pages [0.00] Node 0, zoneDMA32: page owner found early allocated 33 pages [0.00] Node 0, zone Normal: page owner found early allocated 2842622 pages [0.00] BUG: unable to handle kernel NULL pointer dereference at (null) [0.00] IP: [] init_page_owner+0x12a/0x240 [0.00] PGD 0 [0.00] Oops: [#1] SMP [0.00] Modules linked in: [0.00] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7 [0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 [0.00] task: 81e104c0 ti: 81e0 task.ti: 81e0 [0.00] RIP: 0010:[] [] init_page_owner+0x12a/0x240 [0.00] RSP: :81e03ed0 EFLAGS: 00010046 [0.00] RAX: RBX: 88083ffe0210 RCX: ea001300 [0.00] RDX: 0300 RSI: 81f57437 RDI: 004c [0.00] RBP: 81e03f20 R08: 81e03e90 R09: [0.00] R10: 004c0200 R11: R12: ea00 [0.00] R13: 004c0200 R14: 004c R15: 0084 [0.00] FS: () GS:88042fc0() knlGS: [0.00] CS: 0010 DS: ES: CR0: 80050033 [0.00] CR2: CR3: 01e0b000 CR4: 000406b0 [0.00] Stack: [0.00] 0206 88083ffe0f90 88083ffdf000 3181 [0.00] ea001300 0040 ea00 0084 [0.00] 0084 8e10 81e03f50 81f84145 [0.00] Call Trace: [0.00] [] page_ext_init+0x15e/0x167 [0.00] [] start_kernel+0x351/0x418 [0.00] [] ? early_idt_handler_array+0x120/0x120 [0.00] [] x86_64_start_reservations+0x2a/0x2c [0.00] [] x86_64_start_kernel+0x12c/0x13b [0.00] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 [0.00] RIP [] init_page_owner+0x12a/0x240 [0.00] RSP [0.00] CR2: [0.00] ---[ end trace 19e05592f03a690f ]--- Note that this is different backtrace than in b8f1a75d61d8 log. Still, backporting b8f1a75d61d8 fixes this: [1.538379] allocated 738197504 bytes of page_ext [1.539340] Node 0, zone DMA: page owner found early allocated 0 pages [1.540179] Node 0, zoneDMA32: page owner found early allocated 33 pages [1.611173] Node 0, zone Normal: page owner found early allocated 96755 pages [1.683167] Node 1, zone Normal: page owner found early allocated 96575 pages No panic, notice how it allocated more for page_ext, and found smaller number of early allocated pages. Now backporting fe53ca54270a on top: [0.00] allocated 738197504 bytes of page_ext [0.00] Node 0, zone DMA: page owner found early allocated 0 pages [0.00] Node 0, zoneDMA32: page owner found early allocated 33 pages [0.00] Node 0, zone Normal: page owner found early allocated 2842622 pages [0.00] Node 1, zone Normal: page owner found early allocated 3694362 pages Again no panic, and same amount of page_ext usage. But the "early allocated" numbers seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting struct pages that have not been yet initialized. It doesn't end up crashing, but still doesn't seem correct?
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Fri 30-06-17 17:42:24, Michal Hocko wrote: [...] > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 16532fa0bb64..894697c1e6f5 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1055,6 +1055,7 @@ static inline struct zoneref > *first_zones_zonelist(struct zonelist *zonelist, > !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > static inline unsigned long early_pfn_to_nid(unsigned long pfn) > { > + BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA)); Err, this should read BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA)) of course > return 0; > } > #endif -- Michal Hocko SUSE Labs
Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
On Fri 30-06-17 16:18:47, Michal Hocko wrote: > fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem > to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp. > CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with > !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && > !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) > I am not sure how widely is this used but such a code is tricky. I see > how catching early allocations during defered initialization might be > useful but a subtly broken code sounds like a problem to me. So is > fe53ca54270a worth this or we should revert it? I've dug little bit further. It seems that only s390 and ia64 select HAVE_ARCH_EARLY_PFN_TO_NID. Much more architectures enabled HAVE_MEMBLOCK_NODE_MAP though but still alpha, arc, arm, avr32, blackfin, c6x, cris, frv, h8300, hexagon, Kconfig, m32r, m68k, mn10300, nios2, openrisc, parisc, tile, um, unicore32, xtensa do not. I can only see alpha having NUMA and even that is marked BROKEN. So it seems that this is not a real problem after all. Still subtle, so I guess we want to have the following. What do you think? diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 16532fa0bb64..894697c1e6f5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1055,6 +1055,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) static inline unsigned long early_pfn_to_nid(unsigned long pfn) { + BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA)); return 0; } #endif -- Michal Hocko SUSE Labs