Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-18 Thread David Hildenbrand
On 09.09.19 09:46, David Hildenbrand wrote:
> On 09.09.19 07:48, Toshiki Fukasawa wrote:
>> On 2019/09/06 19:35, David Hildenbrand wrote:
>>> On 06.09.19 12:02, Toshiki Fukasawa wrote:
 Thank you for your feedback.

 On 2019/09/06 17:45, David Hildenbrand wrote:
> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>> A kernel panic is observed during reading
>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>> pmem namespace:
>>
>> BUG: unable to handle page fault for address: fffe
>> [  114.495280] #PF: supervisor read access in kernel mode
>> [  114.495738] #PF: error_code(0x) - not-present page
>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>> [  114.496713] Oops:  [#1] SMP PTI
>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
>> BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 
>> 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 
>> c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>> [  114.501373] RAX: fffe RBX:  RCX: 
>> 
>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>> d0748900
>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>> 
>> [  114.503270] R10:  R11:  R12: 
>> 0024
>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>> a5e601a0ff08
>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>> knlGS:
>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>> 06e0
>> [  114.506401] Call Trace:
>> [  114.506660]  kpageflags_read+0xb1/0x130
>> [  114.507051]  proc_reg_read+0x39/0x60
>> [  114.507387]  vfs_read+0x8a/0x140
>> [  114.507686]  ksys_pread64+0x61/0xa0
>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>
>> The first few pages of ZONE_DEVICE expressed as the range
>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>> skipped by struct page initialization. Some pfn walkers like
>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>> struct pages, which causes the error.
>>
>> In previous discussion, Dan seemed to have concern that the struct
>> page area of some pages indicated by vmem_altmap->reserve may not
>> be allocated. (See 
>> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
>> However, arch_add_memory() called by devm_memremap_pages() allocates
>> struct page area for pages containing addresses in the range
>> (res.start) to (res.start + resource_size(res)), which include the
>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>> initializes struct pages in the same range.
>> So I think the struct pages should be initialized.>
>
> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
> for the whole section. For ZONE_DEVICE memory we have no such
> indication. In any section that is !SECTION_IS_ONLINE and
> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
> The only indication I am aware of is pfn_zone_device_reserved() - which
> seems to check exactly what you are trying to skip here.
>
> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
> that already, why did you decide against it?

 No, in current approach this function is no longer needed.
 The reason why we change the approach is that all pfn walkers
 have to be aware of the uninitialized struct pages.
>>>
>>> We should use the same strategy for all pfn walkers then (effectively
>>> get rid of pfn_zone_device_reserved() if that's what we want).
>>
>> True, but this patch replaces "/proc/kpageflags: do not use uninitialized
>> struct pages". If we initialize the uninitialized struct pages, no pfn walker
>> will need to be aware of them.
> 
> So the function should go.
> 
>>
>>>

 As for SECTION_IS_ONLINE, I'm not sure now.
 I will look into it next week.
>>>
>>> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
>>> sub-section support for ZONE_DEVICE, it cannot easily be reused.
>>>
>> It 

Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Toshiki Fukasawa
On 2019/09/18 2:04, Waiman Long wrote:
> On 9/17/19 12:21 PM, Qian Cai wrote:
>> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
 On 17.09.19 04:34, Toshiki Fukasawa wrote:
> On 2019/09/09 16:46, David Hildenbrand wrote:
>> Let's take a step back here to understand the issues I am aware of. I
>> think we should solve this for good now:
>>
>> A PFN walker takes a look at a random PFN at a random point in time. It
>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>> options are:
>>
>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>> memmap contains garbage. Don't access.
>>
>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>
>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>> is only partially present: E.g., device starts at offset 64MB within a
>> section or the device ends at offset 64MB within a section. Don't access 
>> it.
> I don't agree with case #3. In the case, struct page area is not 
> allocated on
> ZONE_DEVICE, but is allocated on system memory. So I think we can access 
> the
> struct pages. What do you mean "invalid memmap"?
 No, that's not the case. There is no memory, especially not system
 memory. We only allow partially present sections (sub-section memory
 hotplug) for ZONE_DEVICE.

 invalid memmap == memmap was not initialized == struct pages contains
 garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
 trigger a BUG.

>>> As long as the page structures exist, they should be initialized to some
>>> known state. We could set PagePoison for those invalid memmap. It is the
>> Sounds like you want to run page_init_poison() by default.
> 
> Yes for those pages that are not initialized otherwise. I don't want to
> run page_init_poison() for the whole ZONE_DEVICE memory range as it can
> take a while if we are talking about TBs of persistent memory. Also most
> of the pages will be reinitialized anyway in the init process. So it is
> mostly a wasted effort. However, for those reserved pages that are not
> being exported to the memory management layer, having them initialized
> to a known state will cause less problem down the road.
> 
The same can be said about altmap->reserve. I think it should be initialized
since the struct page exists.

Thanks,
Toshiki Fukasawa


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Toshiki Fukasawa
On 2019/09/17 19:20, David Hildenbrand wrote:
> On 17.09.19 11:32, Toshiki Fukasawa wrote:
>> On 2019/09/17 16:13, David Hildenbrand wrote:
>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
 On 2019/09/09 16:46, David Hildenbrand wrote:
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
>
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
>
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
>
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access 
> it.

 I don't agree with case #3. In the case, struct page area is not allocated 
 on
 ZONE_DEVICE, but is allocated on system memory. So I think we can access 
 the
 struct pages. What do you mean "invalid memmap"?
>>> No, that's not the case. There is no memory, especially not system
>>> memory. We only allow partially present sections (sub-section memory
>>> hotplug) for ZONE_DEVICE.
>>
>> Let me clear my thoughts. If I read correctly, the struct pages for sections
>> (including partially present sections) on ZONE_DEVICE are allocated by
>> vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
>> (altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
>> and memmap_init_zone_device().
>>
>> Do struct pages for partially present sections go through a different 
>> process?
> 
> No. However, the memmap is initialized via move_pfn_range_to_zone(). So
> partially present sections will have partially uninitialized memmaps.
> 
Thank you for explanation.
To my understanding, depending on architecture, the situation is possible
that the struct pages for entire section is allocated, but only pages
in the zone are initialized.

Thanks,
Toshiki Fukasawa


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread David Hildenbrand
On 17.09.19 19:04, Waiman Long wrote:
> On 9/17/19 12:21 PM, Qian Cai wrote:
>> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
 On 17.09.19 04:34, Toshiki Fukasawa wrote:
> On 2019/09/09 16:46, David Hildenbrand wrote:
>> Let's take a step back here to understand the issues I am aware of. I
>> think we should solve this for good now:
>>
>> A PFN walker takes a look at a random PFN at a random point in time. It
>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>> options are:
>>
>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>> memmap contains garbage. Don't access.
>>
>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>
>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>> is only partially present: E.g., device starts at offset 64MB within a
>> section or the device ends at offset 64MB within a section. Don't access 
>> it.
> I don't agree with case #3. In the case, struct page area is not 
> allocated on
> ZONE_DEVICE, but is allocated on system memory. So I think we can access 
> the
> struct pages. What do you mean "invalid memmap"?
 No, that's not the case. There is no memory, especially not system
 memory. We only allow partially present sections (sub-section memory
 hotplug) for ZONE_DEVICE.

 invalid memmap == memmap was not initialized == struct pages contains
 garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
 trigger a BUG.

>>> As long as the page structures exist, they should be initialized to some
>>> known state. We could set PagePoison for those invalid memmap. It is the
>> Sounds like you want to run page_init_poison() by default.
> 
> Yes for those pages that are not initialized otherwise. I don't want to
> run page_init_poison() for the whole ZONE_DEVICE memory range as it can
> take a while if we are talking about TBs of persistent memory. Also most
> of the pages will be reinitialized anyway in the init process. So it is
> mostly a wasted effort. However, for those reserved pages that are not
> being exported to the memory management layer, having them initialized
> to a known state will cause less problem down the road.
> 

No hacks please. There has to be a proper way to identify if a memmap
was initialized or not. Fake-initializing a memmap is *not* the answer.

-- 

Thanks,

David / dhildenb


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Waiman Long
On 9/17/19 12:21 PM, Qian Cai wrote:
> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
 On 2019/09/09 16:46, David Hildenbrand wrote:
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
>
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
>
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
>
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access 
> it.
 I don't agree with case #3. In the case, struct page area is not allocated 
 on
 ZONE_DEVICE, but is allocated on system memory. So I think we can access 
 the
 struct pages. What do you mean "invalid memmap"?
>>> No, that's not the case. There is no memory, especially not system
>>> memory. We only allow partially present sections (sub-section memory
>>> hotplug) for ZONE_DEVICE.
>>>
>>> invalid memmap == memmap was not initialized == struct pages contains
>>> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
>>> trigger a BUG.
>>>
>> As long as the page structures exist, they should be initialized to some
>> known state. We could set PagePoison for those invalid memmap. It is the
> Sounds like you want to run page_init_poison() by default.

Yes for those pages that are not initialized otherwise. I don't want to
run page_init_poison() for the whole ZONE_DEVICE memory range as it can
take a while if we are talking about TBs of persistent memory. Also most
of the pages will be reinitialized anyway in the init process. So it is
mostly a wasted effort. However, for those reserved pages that are not
being exported to the memory management layer, having them initialized
to a known state will cause less problem down the road.

Cheers,
Longman



Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Qian Cai
On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
> On 9/17/19 3:13 AM, David Hildenbrand wrote:
> > On 17.09.19 04:34, Toshiki Fukasawa wrote:
> > > On 2019/09/09 16:46, David Hildenbrand wrote:
> > > > Let's take a step back here to understand the issues I am aware of. I
> > > > think we should solve this for good now:
> > > > 
> > > > A PFN walker takes a look at a random PFN at a random point in time. It
> > > > finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> > > > options are:
> > > > 
> > > > 1. It is buddy memory (add_memory()) that has not been online yet. The
> > > > memmap contains garbage. Don't access.
> > > > 
> > > > 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> > > > 
> > > > 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> > > > is only partially present: E.g., device starts at offset 64MB within a
> > > > section or the device ends at offset 64MB within a section. Don't 
> > > > access it.
> > > 
> > > I don't agree with case #3. In the case, struct page area is not 
> > > allocated on
> > > ZONE_DEVICE, but is allocated on system memory. So I think we can access 
> > > the
> > > struct pages. What do you mean "invalid memmap"?
> > 
> > No, that's not the case. There is no memory, especially not system
> > memory. We only allow partially present sections (sub-section memory
> > hotplug) for ZONE_DEVICE.
> > 
> > invalid memmap == memmap was not initialized == struct pages contains
> > garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> > trigger a BUG.
> > 
> 
> As long as the page structures exist, they should be initialized to some
> known state. We could set PagePoison for those invalid memmap. It is the

Sounds like you want to run page_init_poison() by default.


> garbage that are in those page structures that can cause problem if a
> struct page walker scan those pages and try to make sense of it.



Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Waiman Long
On 9/17/19 3:13 AM, David Hildenbrand wrote:
> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>> I don't agree with case #3. In the case, struct page area is not allocated on
>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>> struct pages. What do you mean "invalid memmap"?
> No, that's not the case. There is no memory, especially not system
> memory. We only allow partially present sections (sub-section memory
> hotplug) for ZONE_DEVICE.
>
> invalid memmap == memmap was not initialized == struct pages contains
> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> trigger a BUG.
>
As long as the page structures exist, they should be initialized to some
known state. We could set PagePoison for those invalid memmap. It is the
garbage that are in those page structures that can cause problem if a
struct page walker scan those pages and try to make sense of it.

Cheers,
Longman



Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread David Hildenbrand
On 17.09.19 11:32, Toshiki Fukasawa wrote:
> On 2019/09/17 16:13, David Hildenbrand wrote:
>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>> On 2019/09/09 16:46, David Hildenbrand wrote:
 Let's take a step back here to understand the issues I am aware of. I
 think we should solve this for good now:

 A PFN walker takes a look at a random PFN at a random point in time. It
 finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
 options are:

 1. It is buddy memory (add_memory()) that has not been online yet. The
 memmap contains garbage. Don't access.

 2. It is ZONE_DEVICE memory with a valid memmap. Access it.

 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
 is only partially present: E.g., device starts at offset 64MB within a
 section or the device ends at offset 64MB within a section. Don't access 
 it.
>>>
>>> I don't agree with case #3. In the case, struct page area is not allocated 
>>> on
>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>>> struct pages. What do you mean "invalid memmap"?
>> No, that's not the case. There is no memory, especially not system
>> memory. We only allow partially present sections (sub-section memory
>> hotplug) for ZONE_DEVICE.
> 
> Let me clear my thoughts. If I read correctly, the struct pages for sections
> (including partially present sections) on ZONE_DEVICE are allocated by
> vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
> (altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
> and memmap_init_zone_device().
> 
> Do struct pages for partially present sections go through a different process?

No. However, the memmap is initialized via move_pfn_range_to_zone(). So
partially present sections will have partially uninitialized memmaps.

But I get your point. I just saw that pfn_valid() does take care of the
subsection map via pfn_section_valid(). - pfn_present() does not, which
is weird, but ok.

So I agree, in case #3 might have a partially uninitialized memmap, but
we can test via pfn_valid() if the memory is at least valid. So there is
some way to test. Then we're back to the race between adding the memory
and initializing the memmap.

Thanks!

> 
> Thanks,
> Toshiki Fukasawa
>>


-- 

Thanks,

David / dhildenb


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Toshiki Fukasawa
On 2019/09/17 16:13, David Hildenbrand wrote:
> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>>
>> I don't agree with case #3. In the case, struct page area is not allocated on
>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>> struct pages. What do you mean "invalid memmap"?
> No, that's not the case. There is no memory, especially not system
> memory. We only allow partially present sections (sub-section memory
> hotplug) for ZONE_DEVICE.

Let me clear my thoughts. If I read correctly, the struct pages for sections
(including partially present sections) on ZONE_DEVICE are allocated by
vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
(altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
and memmap_init_zone_device().

Do struct pages for partially present sections go through a different process?

Thanks,
Toshiki Fukasawa
> 
> invalid memmap == memmap was not initialized == struct pages contains
> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> trigger a BUG.
> 


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread David Hildenbrand
On 17.09.19 04:34, Toshiki Fukasawa wrote:
> On 2019/09/09 16:46, David Hildenbrand wrote:
>> Let's take a step back here to understand the issues I am aware of. I
>> think we should solve this for good now:
>>
>> A PFN walker takes a look at a random PFN at a random point in time. It
>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>> options are:
>>
>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>> memmap contains garbage. Don't access.
>>
>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>
>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>> is only partially present: E.g., device starts at offset 64MB within a
>> section or the device ends at offset 64MB within a section. Don't access it.
> 
> I don't agree with case #3. In the case, struct page area is not allocated on
> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
> struct pages. What do you mean "invalid memmap"?
No, that's not the case. There is no memory, especially not system
memory. We only allow partially present sections (sub-section memory
hotplug) for ZONE_DEVICE.

invalid memmap == memmap was not initialized == struct pages contains
garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
trigger a BUG.

-- 

Thanks,

David / dhildenb


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-16 Thread Toshiki Fukasawa
On 2019/09/09 16:46, David Hildenbrand wrote:
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
> 
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
> 
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
> 
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> 
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access it.

I don't agree with case #3. In the case, struct page area is not allocated on
ZONE_DEVICE, but is allocated on system memory. So I think we can access the
struct pages. What do you mean "invalid memmap"?

> 
> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> was not initialized yet. memmap_init_zone_device() did not yet succeed
> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> 
> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> driver") with an invalid memmap. Don't access it.
> 
> I can see that your patch tries to make #5 vanish by initializing the
> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> still stumble over uninitialized memmaps.
> 


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-10 Thread Michal Hocko
On Tue 10-09-19 07:53:17, Dan Williams wrote:
> On Tue, Sep 10, 2019 at 7:01 AM Michal Hocko  wrote:
> >
> > On Fri 06-09-19 08:09:52, Toshiki Fukasawa wrote:
> > [...]
> > > @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, 
> > > int nid, unsigned long zone,
> > >   if (!altmap)
> > >   return;
> > >
> > > - if (start_pfn == altmap->base_pfn)
> > > - start_pfn += altmap->reserve;
> > >   end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> >
> > Who is actually setting reserve? This is is something really impossible
> > to grep for in the kernle and git grep on altmap->reserve doesn't show
> > anything AFAICS.
> 
> Yes, it's difficult to grep, here is the use in the nvdimm case:
> 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvdimm/pfn_devs.c#n600

Thanks! I am still not sure what the proper fix is but this is a useful
pointer.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-10 Thread Dan Williams
On Tue, Sep 10, 2019 at 7:01 AM Michal Hocko  wrote:
>
> On Fri 06-09-19 08:09:52, Toshiki Fukasawa wrote:
> [...]
> > @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, 
> > int nid, unsigned long zone,
> >   if (!altmap)
> >   return;
> >
> > - if (start_pfn == altmap->base_pfn)
> > - start_pfn += altmap->reserve;
> >   end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>
> Who is actually setting reserve? This is is something really impossible
> to grep for in the kernle and git grep on altmap->reserve doesn't show
> anything AFAICS.

Yes, it's difficult to grep, here is the use in the nvdimm case:


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvdimm/pfn_devs.c#n600

>
> Btw. irrespective to this issue all three callers should be using
> pfn_to_online_page rather than pfn_to_page AFAICS. It doesn't really
> make sense to collect data for offline pfn ranges. They might be
> uninitialized even without zone device.

Agree.


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-10 Thread Michal Hocko
On Fri 06-09-19 08:09:52, Toshiki Fukasawa wrote:
[...]
> @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int 
> nid, unsigned long zone,
>   if (!altmap)
>   return;
>  
> - if (start_pfn == altmap->base_pfn)
> - start_pfn += altmap->reserve;
>   end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);

Who is actually setting reserve? This is is something really impossible
to grep for in the kernle and git grep on altmap->reserve doesn't show
anything AFAICS.

Btw. irrespective to this issue all three callers should be using
pfn_to_online_page rather than pfn_to_page AFAICS. It doesn't really
make sense to collect data for offline pfn ranges. They might be
uninitialized even without zone device.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-10 Thread David Hildenbrand
On 10.09.19 11:21, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 5:06 AM David Hildenbrand  wrote:
>>
>> On 09.09.19 13:53, Dan Williams wrote:
>>> On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand  wrote:
>>> [..]
>> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
>> distinguish uninitialized struct pages if we can apply them to 
>> ZONE_DEVICE,
>> but that is no longer necessary with this approach.
>
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
>
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
>
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
>
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access 
> it.
>
> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> was not initialized yet. memmap_init_zone_device() did not yet succeed
> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
>
> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> driver") with an invalid memmap. Don't access it.
>
> I can see that your patch tries to make #5 vanish by initializing the
> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> still stumble over uninitialized memmaps.
>

 FWIW, I thinkg having something like pfn_zone_device(), similarly
 implemented like pfn_zone_device_reserved() could be one solution to
 most issues.
>>>
>>> I've been thinking of a replacement for PTE_DEVMAP with section-level,
>>> or sub-section level flags. The section-level flag would still require
>>> a call to get_dev_pagemap() to validate that the pfn is not section in
>>> the subsection case which seems to be entirely too much overhead. If
>>> ZONE_DEVICE is to be a first class citizen in pfn walkers I think it
>>> would be worth the cost to double the size of subsection_map and to
>>> identify whether a sub-section is ZONE_DEVICE, or not.
>>>
>>> Thoughts?
>>>
>>
>> I thought about this last week and came up with something like
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION IS_ACTIVE
>>
>> 2. Make pfn_to_online_page() also check that it's not ZONE_DEVICE.
>> Online pfns are limited to !ZONE_DEVICE.
>>
>> 3. Extend subsection_map to an additional active_map
>>
>> 4. Set SECTION IS_ACTIVE *iff* the whole active_map is set. This keeps
>> most accesses of pfn_to_online_page() fast. If !SECTION IS_ACTIVE, check
>> the active_map.
>>
>> 5. Set sub-sections active/unactive in
>> move_pfn_range_to_zone()/remove_pfn_range_from_zone() - see "[PATCH v4
>> 0/8] mm/memory_hotplug: Shrink zones before removing memory" for the
>> latter.
>>
>> 6. Set boot memory properly active (this is a tricky bit :/ ).
>>
>> However, it turned out too complex for my taste (and limited time to
>> spend on this), so I abandoned that idea for now. If somebody wants to
>> pick that up, fine.
>>
> 
> That seems to solve the pfn walk case but it would not address the
> need for PTE_DEVMAP or speed up the other places that want an
> efficient way to determine if it's worthwhile to call
> get_dev_pagemap().
> 

Then I probably didn't get how your suggestion would look like :)

Would you suggest - instead of reusing SECTION_IS_ONLINE - something
like SECTION_IS_DEVMAP / SECTION_IS_DEVICE, to decide whether to call
get_dev_pagemap()?

How to deal with subsections? I would like to avoid storing subsection
information only useful for ZONE_DEVICE for any kind of sections (or
could we then simply not store the information per subsection but
instead check the device as initially suggested by me?)

-- 

Thanks,

David / dhildenb


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-10 Thread Dan Williams
On Mon, Sep 9, 2019 at 5:06 AM David Hildenbrand  wrote:
>
> On 09.09.19 13:53, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand  wrote:
> > [..]
>  It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
>  distinguish uninitialized struct pages if we can apply them to 
>  ZONE_DEVICE,
>  but that is no longer necessary with this approach.
> >>>
> >>> Let's take a step back here to understand the issues I am aware of. I
> >>> think we should solve this for good now:
> >>>
> >>> A PFN walker takes a look at a random PFN at a random point in time. It
> >>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> >>> options are:
> >>>
> >>> 1. It is buddy memory (add_memory()) that has not been online yet. The
> >>> memmap contains garbage. Don't access.
> >>>
> >>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> >>>
> >>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> >>> is only partially present: E.g., device starts at offset 64MB within a
> >>> section or the device ends at offset 64MB within a section. Don't access 
> >>> it.
> >>>
> >>> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> >>> was not initialized yet. memmap_init_zone_device() did not yet succeed
> >>> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> >>>
> >>> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> >>> driver") with an invalid memmap. Don't access it.
> >>>
> >>> I can see that your patch tries to make #5 vanish by initializing the
> >>> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> >>> still stumble over uninitialized memmaps.
> >>>
> >>
> >> FWIW, I thinkg having something like pfn_zone_device(), similarly
> >> implemented like pfn_zone_device_reserved() could be one solution to
> >> most issues.
> >
> > I've been thinking of a replacement for PTE_DEVMAP with section-level,
> > or sub-section level flags. The section-level flag would still require
> > a call to get_dev_pagemap() to validate that the pfn is not section in
> > the subsection case which seems to be entirely too much overhead. If
> > ZONE_DEVICE is to be a first class citizen in pfn walkers I think it
> > would be worth the cost to double the size of subsection_map and to
> > identify whether a sub-section is ZONE_DEVICE, or not.
> >
> > Thoughts?
> >
>
> I thought about this last week and came up with something like
>
> 1. Convert SECTION_IS_ONLINE to SECTION IS_ACTIVE
>
> 2. Make pfn_to_online_page() also check that it's not ZONE_DEVICE.
> Online pfns are limited to !ZONE_DEVICE.
>
> 3. Extend subsection_map to an additional active_map
>
> 4. Set SECTION IS_ACTIVE *iff* the whole active_map is set. This keeps
> most accesses of pfn_to_online_page() fast. If !SECTION IS_ACTIVE, check
> the active_map.
>
> 5. Set sub-sections active/unactive in
> move_pfn_range_to_zone()/remove_pfn_range_from_zone() - see "[PATCH v4
> 0/8] mm/memory_hotplug: Shrink zones before removing memory" for the
> latter.
>
> 6. Set boot memory properly active (this is a tricky bit :/ ).
>
> However, it turned out too complex for my taste (and limited time to
> spend on this), so I abandoned that idea for now. If somebody wants to
> pick that up, fine.
>

That seems to solve the pfn walk case but it would not address the
need for PTE_DEVMAP or speed up the other places that want an
efficient way to determine if it's worthwhile to call
get_dev_pagemap().


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-09 Thread David Hildenbrand
On 09.09.19 13:53, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand  wrote:
> [..]
 It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
 distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
 but that is no longer necessary with this approach.
>>>
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>
>>> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
>>> was not initialized yet. memmap_init_zone_device() did not yet succeed
>>> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
>>>
>>> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
>>> driver") with an invalid memmap. Don't access it.
>>>
>>> I can see that your patch tries to make #5 vanish by initializing the
>>> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
>>> still stumble over uninitialized memmaps.
>>>
>>
>> FWIW, I thinkg having something like pfn_zone_device(), similarly
>> implemented like pfn_zone_device_reserved() could be one solution to
>> most issues.
> 
> I've been thinking of a replacement for PTE_DEVMAP with section-level,
> or sub-section level flags. The section-level flag would still require
> a call to get_dev_pagemap() to validate that the pfn is not section in
> the subsection case which seems to be entirely too much overhead. If
> ZONE_DEVICE is to be a first class citizen in pfn walkers I think it
> would be worth the cost to double the size of subsection_map and to
> identify whether a sub-section is ZONE_DEVICE, or not.
> 
> Thoughts?
> 

I thought about this last week and came up with something like

1. Convert SECTION_IS_ONLINE to SECTION IS_ACTIVE

2. Make pfn_to_online_page() also check that it's not ZONE_DEVICE.
Online pfns are limited to !ZONE_DEVICE.

3. Extend subsection_map to an additional active_map

4. Set SECTION IS_ACTIVE *iff* the whole active_map is set. This keeps
most accesses of pfn_to_online_page() fast. If !SECTION IS_ACTIVE, check
the active_map.

5. Set sub-sections active/unactive in
move_pfn_range_to_zone()/remove_pfn_range_from_zone() - see "[PATCH v4
0/8] mm/memory_hotplug: Shrink zones before removing memory​" for the
latter.

6. Set boot memory properly active (this is a tricky bit :/ ).

However, it turned out too complex for my taste (and limited time to
spend on this), so I abandoned that idea for now. If somebody wants to
pick that up, fine.

-- 

Thanks,

David / dhildenb


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-09 Thread Dan Williams
On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand  wrote:
[..]
> >> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
> >> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
> >> but that is no longer necessary with this approach.
> >
> > Let's take a step back here to understand the issues I am aware of. I
> > think we should solve this for good now:
> >
> > A PFN walker takes a look at a random PFN at a random point in time. It
> > finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> > options are:
> >
> > 1. It is buddy memory (add_memory()) that has not been online yet. The
> > memmap contains garbage. Don't access.
> >
> > 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> >
> > 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> > is only partially present: E.g., device starts at offset 64MB within a
> > section or the device ends at offset 64MB within a section. Don't access it.
> >
> > 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> > was not initialized yet. memmap_init_zone_device() did not yet succeed
> > after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> >
> > 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> > driver") with an invalid memmap. Don't access it.
> >
> > I can see that your patch tries to make #5 vanish by initializing the
> > memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> > still stumble over uninitialized memmaps.
> >
>
> FWIW, I thinkg having something like pfn_zone_device(), similarly
> implemented like pfn_zone_device_reserved() could be one solution to
> most issues.

I've been thinking of a replacement for PTE_DEVMAP with section-level,
or sub-section level flags. The section-level flag would still require
a call to get_dev_pagemap() to validate that the pfn is not section in
the subsection case which seems to be entirely too much overhead. If
ZONE_DEVICE is to be a first class citizen in pfn walkers I think it
would be worth the cost to double the size of subsection_map and to
identify whether a sub-section is ZONE_DEVICE, or not.

Thoughts?


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-09 Thread David Hildenbrand
On 09.09.19 09:46, David Hildenbrand wrote:
> On 09.09.19 07:48, Toshiki Fukasawa wrote:
>> On 2019/09/06 19:35, David Hildenbrand wrote:
>>> On 06.09.19 12:02, Toshiki Fukasawa wrote:
 Thank you for your feedback.

 On 2019/09/06 17:45, David Hildenbrand wrote:
> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>> A kernel panic is observed during reading
>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>> pmem namespace:
>>
>> BUG: unable to handle page fault for address: fffe
>> [  114.495280] #PF: supervisor read access in kernel mode
>> [  114.495738] #PF: error_code(0x) - not-present page
>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>> [  114.496713] Oops:  [#1] SMP PTI
>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
>> BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 
>> 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 
>> c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>> [  114.501373] RAX: fffe RBX:  RCX: 
>> 
>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>> d0748900
>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>> 
>> [  114.503270] R10:  R11:  R12: 
>> 0024
>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>> a5e601a0ff08
>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>> knlGS:
>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>> 06e0
>> [  114.506401] Call Trace:
>> [  114.506660]  kpageflags_read+0xb1/0x130
>> [  114.507051]  proc_reg_read+0x39/0x60
>> [  114.507387]  vfs_read+0x8a/0x140
>> [  114.507686]  ksys_pread64+0x61/0xa0
>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>
>> The first few pages of ZONE_DEVICE expressed as the range
>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>> skipped by struct page initialization. Some pfn walkers like
>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>> struct pages, which causes the error.
>>
>> In previous discussion, Dan seemed to have concern that the struct
>> page area of some pages indicated by vmem_altmap->reserve may not
>> be allocated. (See 
>> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
>> However, arch_add_memory() called by devm_memremap_pages() allocates
>> struct page area for pages containing addresses in the range
>> (res.start) to (res.start + resource_size(res)), which include the
>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>> initializes struct pages in the same range.
>> So I think the struct pages should be initialized.>
>
> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
> for the whole section. For ZONE_DEVICE memory we have no such
> indication. In any section that is !SECTION_IS_ONLINE and
> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
> The only indication I am aware of is pfn_zone_device_reserved() - which
> seems to check exactly what you are trying to skip here.
>
> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
> that already, why did you decide against it?

 No, in current approach this function is no longer needed.
 The reason why we change the approach is that all pfn walkers
 have to be aware of the uninitialized struct pages.
>>>
>>> We should use the same strategy for all pfn walkers then (effectively
>>> get rid of pfn_zone_device_reserved() if that's what we want).
>>
>> True, but this patch replaces "/proc/kpageflags: do not use uninitialized
>> struct pages". If we initialize the uninitialized struct pages, no pfn walker
>> will need to be aware of them.
> 
> So the function should go.
> 
>>
>>>

 As for SECTION_IS_ONLINE, I'm not sure now.
 I will look into it next week.
>>>
>>> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
>>> sub-section support for ZONE_DEVICE, it cannot easily be reused.
>>>
>> It 

Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-09 Thread David Hildenbrand
On 09.09.19 07:48, Toshiki Fukasawa wrote:
> On 2019/09/06 19:35, David Hildenbrand wrote:
>> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>>> Thank you for your feedback.
>>>
>>> On 2019/09/06 17:45, David Hildenbrand wrote:
 On 06.09.19 10:09, Toshiki Fukasawa wrote:
> A kernel panic is observed during reading
> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
> pmem namespace:
>
> BUG: unable to handle page fault for address: fffe
> [  114.495280] #PF: supervisor read access in kernel mode
> [  114.495738] #PF: error_code(0x) - not-present page
> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
> [  114.496713] Oops:  [#1] SMP PTI
> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 
> 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 
> c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
> [  114.501373] RAX: fffe RBX:  RCX: 
> 
> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
> d0748900
> [  114.502637] RBP: d0748900 R08: 0001 R09: 
> 
> [  114.503270] R10:  R11:  R12: 
> 0024
> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
> a5e601a0ff08
> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
> knlGS:
> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
> 06e0
> [  114.506401] Call Trace:
> [  114.506660]  kpageflags_read+0xb1/0x130
> [  114.507051]  proc_reg_read+0x39/0x60
> [  114.507387]  vfs_read+0x8a/0x140
> [  114.507686]  ksys_pread64+0x61/0xa0
> [  114.508021]  do_syscall_64+0x5f/0x1a0
> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  114.508844] RIP: 0033:0x7f0266ba426b
>
> The first few pages of ZONE_DEVICE expressed as the range
> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
> skipped by struct page initialization. Some pfn walkers like
> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
> struct pages, which causes the error.
>
> In previous discussion, Dan seemed to have concern that the struct
> page area of some pages indicated by vmem_altmap->reserve may not
> be allocated. (See 
> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
> However, arch_add_memory() called by devm_memremap_pages() allocates
> struct page area for pages containing addresses in the range
> (res.start) to (res.start + resource_size(res)), which include the
> pages indicated by vmem_altmap->reserve. If I read correctly, it is
> allocated as requested at least on x86_64. Also, memmap_init_zone()
> initializes struct pages in the same range.
> So I think the struct pages should be initialized.>

 For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
 for the whole section. For ZONE_DEVICE memory we have no such
 indication. In any section that is !SECTION_IS_ONLINE and
 SECTION_MARKED_PRESENT, we could have any subsections initialized. >
 The only indication I am aware of is pfn_zone_device_reserved() - which
 seems to check exactly what you are trying to skip here.

 Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
 that already, why did you decide against it?
>>>
>>> No, in current approach this function is no longer needed.
>>> The reason why we change the approach is that all pfn walkers
>>> have to be aware of the uninitialized struct pages.
>>
>> We should use the same strategy for all pfn walkers then (effectively
>> get rid of pfn_zone_device_reserved() if that's what we want).
> 
> True, but this patch replaces "/proc/kpageflags: do not use uninitialized
> struct pages". If we initialize the uninitialized struct pages, no pfn walker
> will need to be aware of them.

So the function should go.

> 
>>
>>>
>>> As for SECTION_IS_ONLINE, I'm not sure now.
>>> I will look into it next week.
>>
>> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
>> sub-section support for ZONE_DEVICE, it cannot easily be reused.
>>
> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
> distinguish uninitialized struct pages if we can apply them to 

Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-09 Thread Toshiki Fukasawa
On 2019/09/06 19:35, David Hildenbrand wrote:
> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>> Thank you for your feedback.
>>
>> On 2019/09/06 17:45, David Hildenbrand wrote:
>>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
 A kernel panic is observed during reading
 /proc/kpage{cgroup,count,flags} for first few pfns allocated by
 pmem namespace:

 BUG: unable to handle page fault for address: fffe
 [  114.495280] #PF: supervisor read access in kernel mode
 [  114.495738] #PF: error_code(0x) - not-present page
 [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
 [  114.496713] Oops:  [#1] SMP PTI
 [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
 [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
 rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
 [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
 [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 
 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
 [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
 [  114.501373] RAX: fffe RBX:  RCX: 
 
 [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
 d0748900
 [  114.502637] RBP: d0748900 R08: 0001 R09: 
 
 [  114.503270] R10:  R11:  R12: 
 0024
 [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
 a5e601a0ff08
 [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
 knlGS:
 [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
 [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
 06e0
 [  114.506401] Call Trace:
 [  114.506660]  kpageflags_read+0xb1/0x130
 [  114.507051]  proc_reg_read+0x39/0x60
 [  114.507387]  vfs_read+0x8a/0x140
 [  114.507686]  ksys_pread64+0x61/0xa0
 [  114.508021]  do_syscall_64+0x5f/0x1a0
 [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 [  114.508844] RIP: 0033:0x7f0266ba426b

 The first few pages of ZONE_DEVICE expressed as the range
 (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
 skipped by struct page initialization. Some pfn walkers like
 /proc/kpage{cgroup, count, flags} can't handle these uninitialized
 struct pages, which causes the error.

 In previous discussion, Dan seemed to have concern that the struct
 page area of some pages indicated by vmem_altmap->reserve may not
 be allocated. (See 
 https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
 However, arch_add_memory() called by devm_memremap_pages() allocates
 struct page area for pages containing addresses in the range
 (res.start) to (res.start + resource_size(res)), which include the
 pages indicated by vmem_altmap->reserve. If I read correctly, it is
 allocated as requested at least on x86_64. Also, memmap_init_zone()
 initializes struct pages in the same range.
 So I think the struct pages should be initialized.>
>>>
>>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>>> for the whole section. For ZONE_DEVICE memory we have no such
>>> indication. In any section that is !SECTION_IS_ONLINE and
>>> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
>>> The only indication I am aware of is pfn_zone_device_reserved() - which
>>> seems to check exactly what you are trying to skip here.
>>>
>>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>>> that already, why did you decide against it?
>>
>> No, in current approach this function is no longer needed.
>> The reason why we change the approach is that all pfn walkers
>> have to be aware of the uninitialized struct pages.
> 
> We should use the same strategy for all pfn walkers then (effectively
> get rid of pfn_zone_device_reserved() if that's what we want).

True, but this patch replaces "/proc/kpageflags: do not use uninitialized
struct pages". If we initialize the uninitialized struct pages, no pfn walker
will need to be aware of them.

> 
>>
>> As for SECTION_IS_ONLINE, I'm not sure now.
>> I will look into it next week.
> 
> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
> sub-section support for ZONE_DEVICE, it cannot easily be reused.
> 
It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
but that is no longer necessary with this approach.

Thanks,
Toshiki Fukasawa


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-06 Thread David Hildenbrand
On 06.09.19 12:02, Toshiki Fukasawa wrote:
> Thank you for your feedback.
> 
> On 2019/09/06 17:45, David Hildenbrand wrote:
>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>> A kernel panic is observed during reading
>>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>>> pmem namespace:
>>>
>>> BUG: unable to handle page fault for address: fffe
>>> [  114.495280] #PF: supervisor read access in kernel mode
>>> [  114.495738] #PF: error_code(0x) - not-present page
>>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>>> [  114.496713] Oops:  [#1] SMP PTI
>>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
>>> 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 
>>> <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>>> [  114.501373] RAX: fffe RBX:  RCX: 
>>> 
>>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>>> d0748900
>>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>>> 
>>> [  114.503270] R10:  R11:  R12: 
>>> 0024
>>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>>> a5e601a0ff08
>>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>>> knlGS:
>>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>>> 06e0
>>> [  114.506401] Call Trace:
>>> [  114.506660]  kpageflags_read+0xb1/0x130
>>> [  114.507051]  proc_reg_read+0x39/0x60
>>> [  114.507387]  vfs_read+0x8a/0x140
>>> [  114.507686]  ksys_pread64+0x61/0xa0
>>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>>
>>> The first few pages of ZONE_DEVICE expressed as the range
>>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>>> skipped by struct page initialization. Some pfn walkers like
>>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>>> struct pages, which causes the error.
>>>
>>> In previous discussion, Dan seemed to have concern that the struct
>>> page area of some pages indicated by vmem_altmap->reserve may not
>>> be allocated. (See 
>>> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
>>> However, arch_add_memory() called by devm_memremap_pages() allocates
>>> struct page area for pages containing addresses in the range
>>> (res.start) to (res.start + resource_size(res)), which include the
>>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>>> initializes struct pages in the same range.
>>> So I think the struct pages should be initialized.>
>>
>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>> for the whole section. For ZONE_DEVICE memory we have no such
>> indication. In any section that is !SECTION_IS_ONLINE and
>> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
>> The only indication I am aware of is pfn_zone_device_reserved() - which
>> seems to check exactly what you are trying to skip here.
>>
>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>> that already, why did you decide against it?
> 
> No, in current approach this function is no longer needed.
> The reason why we change the approach is that all pfn walkers
> have to be aware of the uninitialized struct pages.

We should use the same strategy for all pfn walkers then (effectively
get rid of pfn_zone_device_reserved() if that's what we want).

> 
> As for SECTION_IS_ONLINE, I'm not sure now.
> I will look into it next week.

SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
sub-section support for ZONE_DEVICE, it cannot easily be reused.

> 
> Thanks,
> Toshiki Fukasawa
> 
>>
>>> Signed-off-by: Toshiki Fukasawa 
>>> Cc: sta...@vger.kernel.org
>>> ---
>>> Changes since rev 1:
>>>   Instead of avoiding uninitialized pages on the pfn walker side,
>>>   we initialize struct pages.
>>>
>>> mm/page_alloc.c | 5 +
>>>   1 file changed, 1 insertion(+), 4 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 9c91949..6d180ae 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, 
>>> int nid, unsigned long zone,
>>>   
>>>   #ifdef CONFIG_ZONE_DEVICE
>>> /*
>>> -* 

Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-06 Thread Toshiki Fukasawa
Thank you for your feedback.

On 2019/09/06 17:45, David Hildenbrand wrote:
> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>> A kernel panic is observed during reading
>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>> pmem namespace:
>>
>> BUG: unable to handle page fault for address: fffe
>> [  114.495280] #PF: supervisor read access in kernel mode
>> [  114.495738] #PF: error_code(0x) - not-present page
>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>> [  114.496713] Oops:  [#1] SMP PTI
>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
>> 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 
>> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>> [  114.501373] RAX: fffe RBX:  RCX: 
>> 
>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>> d0748900
>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>> 
>> [  114.503270] R10:  R11:  R12: 
>> 0024
>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>> a5e601a0ff08
>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>> knlGS:
>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>> 06e0
>> [  114.506401] Call Trace:
>> [  114.506660]  kpageflags_read+0xb1/0x130
>> [  114.507051]  proc_reg_read+0x39/0x60
>> [  114.507387]  vfs_read+0x8a/0x140
>> [  114.507686]  ksys_pread64+0x61/0xa0
>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>
>> The first few pages of ZONE_DEVICE expressed as the range
>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>> skipped by struct page initialization. Some pfn walkers like
>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>> struct pages, which causes the error.
>>
>> In previous discussion, Dan seemed to have concern that the struct
>> page area of some pages indicated by vmem_altmap->reserve may not
>> be allocated. (See 
>> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
>> However, arch_add_memory() called by devm_memremap_pages() allocates
>> struct page area for pages containing addresses in the range
>> (res.start) to (res.start + resource_size(res)), which include the
>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>> initializes struct pages in the same range.
>> So I think the struct pages should be initialized.>
> 
> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
> for the whole section. For ZONE_DEVICE memory we have no such
> indication. In any section that is !SECTION_IS_ONLINE and
> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
> The only indication I am aware of is pfn_zone_device_reserved() - which
> seems to check exactly what you are trying to skip here.
> 
> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
> that already, why did you decide against it?

No, in current approach this function is no longer needed.
The reason why we change the approach is that all pfn walkers
have to be aware of the uninitialized struct pages.

As for SECTION_IS_ONLINE, I'm not sure now.
I will look into it next week.

Thanks,
Toshiki Fukasawa

> 
>> Signed-off-by: Toshiki Fukasawa 
>> Cc: sta...@vger.kernel.org
>> ---
>> Changes since rev 1:
>>   Instead of avoiding uninitialized pages on the pfn walker side,
>>   we initialize struct pages.
>>
>> mm/page_alloc.c | 5 +
>>   1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 9c91949..6d180ae 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, 
>> int nid, unsigned long zone,
>>   
>>   #ifdef CONFIG_ZONE_DEVICE
>>  /*
>> - * Honor reservation requested by the driver for this ZONE_DEVICE
>> - * memory. We limit the total number of pages to initialize to just
>> + * We limit the total number of pages to initialize to just
>>   * those that might contain the memory mapping. We will defer the
>>   * ZONE_DEVICE page initialization until after we have released
>>   * the hotplug lock.
>> @@ -5856,8 +5855,6 @@ void __meminit 

Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-06 Thread David Hildenbrand
On 06.09.19 10:09, Toshiki Fukasawa wrote:
> A kernel panic is observed during reading
> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
> pmem namespace:
> 
> BUG: unable to handle page fault for address: fffe
> [  114.495280] #PF: supervisor read access in kernel mode
> [  114.495738] #PF: error_code(0x) - not-present page
> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
> [  114.496713] Oops:  [#1] SMP PTI
> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 
> 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 
> 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
> [  114.501373] RAX: fffe RBX:  RCX: 
> 
> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
> d0748900
> [  114.502637] RBP: d0748900 R08: 0001 R09: 
> 
> [  114.503270] R10:  R11:  R12: 
> 0024
> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
> a5e601a0ff08
> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
> knlGS:
> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
> 06e0
> [  114.506401] Call Trace:
> [  114.506660]  kpageflags_read+0xb1/0x130
> [  114.507051]  proc_reg_read+0x39/0x60
> [  114.507387]  vfs_read+0x8a/0x140
> [  114.507686]  ksys_pread64+0x61/0xa0
> [  114.508021]  do_syscall_64+0x5f/0x1a0
> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  114.508844] RIP: 0033:0x7f0266ba426b
> 
> The first few pages of ZONE_DEVICE expressed as the range
> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
> skipped by struct page initialization. Some pfn walkers like
> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
> struct pages, which causes the error.
> 
> In previous discussion, Dan seemed to have concern that the struct
> page area of some pages indicated by vmem_altmap->reserve may not
> be allocated. (See 
> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
> However, arch_add_memory() called by devm_memremap_pages() allocates
> struct page area for pages containing addresses in the range
> (res.start) to (res.start + resource_size(res)), which include the
> pages indicated by vmem_altmap->reserve. If I read correctly, it is
> allocated as requested at least on x86_64. Also, memmap_init_zone()
> initializes struct pages in the same range.
> So I think the struct pages should be initialized.>

For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
for the whole section. For ZONE_DEVICE memory we have no such
indication. In any section that is !SECTION_IS_ONLINE and
SECTION_MARKED_PRESENT, we could have any subsections initialized.

The only indication I am aware of is pfn_zone_device_reserved() - which
seems to check exactly what you are trying to skip here.

Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
that already, why did you decide against it?

> Signed-off-by: Toshiki Fukasawa 
> Cc: sta...@vger.kernel.org
> ---
> Changes since rev 1:
>  Instead of avoiding uninitialized pages on the pfn walker side,
>  we initialize struct pages.
> 
> mm/page_alloc.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9c91949..6d180ae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, int 
> nid, unsigned long zone,
>  
>  #ifdef CONFIG_ZONE_DEVICE
>   /*
> -  * Honor reservation requested by the driver for this ZONE_DEVICE
> -  * memory. We limit the total number of pages to initialize to just
> +  * We limit the total number of pages to initialize to just
>* those that might contain the memory mapping. We will defer the
>* ZONE_DEVICE page initialization until after we have released
>* the hotplug lock.
> @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int 
> nid, unsigned long zone,
>   if (!altmap)
>   return;
>  
> - if (start_pfn == altmap->base_pfn)
> - start_pfn += altmap->reserve;
>   end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>   }
>  #endif
> 


-- 

Thanks,

David / dhildenb


[RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-06 Thread Toshiki Fukasawa
A kernel panic is observed during reading
/proc/kpage{cgroup,count,flags} for first few pfns allocated by
pmem namespace:

BUG: unable to handle page fault for address: fffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops:  [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 
55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 
f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffe RBX:  RCX: 
[  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: d0748900
[  114.502637] RBP: d0748900 R08: 0001 R09: 
[  114.503270] R10:  R11:  R12: 0024
[  114.503896] R13: 0008 R14: 7ffca13a7310 R15: a5e601a0ff08
[  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
knlGS:
[  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
[  114.505754] CR2: fffe CR3: 00023a204000 CR4: 06e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The first few pages of ZONE_DEVICE expressed as the range
(altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
skipped by struct page initialization. Some pfn walkers like
/proc/kpage{cgroup, count, flags} can't handle these uninitialized
struct pages, which causes the error.

In previous discussion, Dan seemed to have concern that the struct
page area of some pages indicated by vmem_altmap->reserve may not
be allocated. (See 
https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
However, arch_add_memory() called by devm_memremap_pages() allocates
struct page area for pages containing addresses in the range
(res.start) to (res.start + resource_size(res)), which include the
pages indicated by vmem_altmap->reserve. If I read correctly, it is
allocated as requested at least on x86_64. Also, memmap_init_zone()
initializes struct pages in the same range.
So I think the struct pages should be initialized.

Signed-off-by: Toshiki Fukasawa 
Cc: sta...@vger.kernel.org
---
Changes since rev 1:
 Instead of avoiding uninitialized pages on the pfn walker side,
 we initialize struct pages.

mm/page_alloc.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c91949..6d180ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
 
 #ifdef CONFIG_ZONE_DEVICE
/*
-* Honor reservation requested by the driver for this ZONE_DEVICE
-* memory. We limit the total number of pages to initialize to just
+* We limit the total number of pages to initialize to just
 * those that might contain the memory mapping. We will defer the
 * ZONE_DEVICE page initialization until after we have released
 * the hotplug lock.
@@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
if (!altmap)
return;
 
-   if (start_pfn == altmap->base_pfn)
-   start_pfn += altmap->reserve;
end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
}
 #endif
-- 
1.8.3.1