Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Toshiki Fukasawa
On 2019/09/18 2:04, Waiman Long wrote:
> On 9/17/19 12:21 PM, Qian Cai wrote:
>> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
>>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>>> think we should solve this for good now:
>>>>>>
>>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>>> options are:
>>>>>>
>>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>>> memmap contains garbage. Don't access.
>>>>>>
>>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>>
>>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>>> section or the device ends at offset 64MB within a section. Don't access 
>>>>>> it.
>>>>> I don't agree with case #3. In the case, struct page area is not 
>>>>> allocated on
>>>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access 
>>>>> the
>>>>> struct pages. What do you mean "invalid memmap"?
>>>> No, that's not the case. There is no memory, especially not system
>>>> memory. We only allow partially present sections (sub-section memory
>>>> hotplug) for ZONE_DEVICE.
>>>>
>>>> invalid memmap == memmap was not initialized == struct pages contains
>>>> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
>>>> trigger a BUG.
>>>>
>>> As long as the page structures exist, they should be initialized to some
>>> known state. We could set PagePoison for those invalid memmap. It is the
>> Sounds like you want to run page_init_poison() by default.
> 
> Yes for those pages that are not initialized otherwise. I don't want to
> run page_init_poison() for the whole ZONE_DEVICE memory range as it can
> take a while if we are talking about TBs of persistent memory. Also most
> of the pages will be reinitialized anyway in the init process. So it is
> mostly a wasted effort. However, for those reserved pages that are not
> being exported to the memory management layer, having them initialized
> to a known state will cause less problem down the road.
> 
The same can be said about altmap->reserve. I think it should be initialized
since the struct page exists.

Thanks,
Toshiki Fukasawa


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Toshiki Fukasawa
On 2019/09/17 19:20, David Hildenbrand wrote:
> On 17.09.19 11:32, Toshiki Fukasawa wrote:
>> On 2019/09/17 16:13, David Hildenbrand wrote:
>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>> think we should solve this for good now:
>>>>>
>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>> options are:
>>>>>
>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>> memmap contains garbage. Don't access.
>>>>>
>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>
>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>> section or the device ends at offset 64MB within a section. Don't access 
>>>>> it.
>>>>
>>>> I don't agree with case #3. In the case, struct page area is not allocated 
>>>> on
>>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access 
>>>> the
>>>> struct pages. What do you mean "invalid memmap"?
>>> No, that's not the case. There is no memory, especially not system
>>> memory. We only allow partially present sections (sub-section memory
>>> hotplug) for ZONE_DEVICE.
>>
>> Let me clear my thoughts. If I read correctly, the struct pages for sections
>> (including partially present sections) on ZONE_DEVICE are allocated by
>> vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
>> (altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
>> and memmap_init_zone_device().
>>
>> Do struct pages for partially present sections go through a different 
>> process?
> 
> No. However, the memmap is initialized via move_pfn_range_to_zone(). So
> partially present sections will have partially uninitialized memmaps.
> 
Thank you for explanation.
To my understanding, depending on architecture, the situation is possible
that the struct pages for entire section is allocated, but only pages
in the zone are initialized.

Thanks,
Toshiki Fukasawa


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-17 Thread Toshiki Fukasawa
On 2019/09/17 16:13, David Hildenbrand wrote:
> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>>
>> I don't agree with case #3. In the case, struct page area is not allocated on
>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>> struct pages. What do you mean "invalid memmap"?
> No, that's not the case. There is no memory, especially not system
> memory. We only allow partially present sections (sub-section memory
> hotplug) for ZONE_DEVICE.

Let me clear my thoughts. If I read correctly, the struct pages for sections
(including partially present sections) on ZONE_DEVICE are allocated by
vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
(altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
and memmap_init_zone_device().

Do struct pages for partially present sections go through a different process?

Thanks,
Toshiki Fukasawa
> 
> invalid memmap == memmap was not initialized == struct pages contains
> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> trigger a BUG.
> 


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-16 Thread Toshiki Fukasawa
On 2019/09/09 16:46, David Hildenbrand wrote:
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
> 
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
> 
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
> 
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> 
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access it.

I don't agree with case #3. In the case, struct page area is not allocated on
ZONE_DEVICE, but is allocated on system memory. So I think we can access the
struct pages. What do you mean "invalid memmap"?

> 
> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> was not initialized yet. memmap_init_zone_device() did not yet succeed
> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> 
> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> driver") with an invalid memmap. Don't access it.
> 
> I can see that your patch tries to make #5 vanish by initializing the
> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> still stumble over uninitialized memmaps.
> 


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-09 Thread Toshiki Fukasawa
On 2019/09/06 19:35, David Hildenbrand wrote:
> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>> Thank you for your feedback.
>>
>> On 2019/09/06 17:45, David Hildenbrand wrote:
>>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>>> A kernel panic is observed during reading
>>>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>>>> pmem namespace:
>>>>
>>>> BUG: unable to handle page fault for address: fffe
>>>> [  114.495280] #PF: supervisor read access in kernel mode
>>>> [  114.495738] #PF: error_code(0x) - not-present page
>>>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>>>> [  114.496713] Oops:  [#1] SMP PTI
>>>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>>>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>>> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>>>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>>>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
>>>> 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 
>>>> <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>>>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>>>> [  114.501373] RAX: fffe RBX:  RCX: 
>>>> 
>>>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>>>> d0748900
>>>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>>>> 
>>>> [  114.503270] R10:  R11:  R12: 
>>>> 0024
>>>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>>>> a5e601a0ff08
>>>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>>>> knlGS:
>>>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>>>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>>>> 06e0
>>>> [  114.506401] Call Trace:
>>>> [  114.506660]  kpageflags_read+0xb1/0x130
>>>> [  114.507051]  proc_reg_read+0x39/0x60
>>>> [  114.507387]  vfs_read+0x8a/0x140
>>>> [  114.507686]  ksys_pread64+0x61/0xa0
>>>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>>>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>>>
>>>> The first few pages of ZONE_DEVICE expressed as the range
>>>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>>>> skipped by struct page initialization. Some pfn walkers like
>>>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>>>> struct pages, which causes the error.
>>>>
>>>> In previous discussion, Dan seemed to have concern that the struct
>>>> page area of some pages indicated by vmem_altmap->reserve may not
>>>> be allocated. (See 
>>>> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
>>>> However, arch_add_memory() called by devm_memremap_pages() allocates
>>>> struct page area for pages containing addresses in the range
>>>> (res.start) to (res.start + resource_size(res)), which include the
>>>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>>>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>>>> initializes struct pages in the same range.
>>>> So I think the struct pages should be initialized.>
>>>
>>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>>> for the whole section. For ZONE_DEVICE memory we have no such
>>> indication. In any section that is !SECTION_IS_ONLINE and
>>> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
>>> The only indication I am aware of is pfn_zone_device_reserved() - which
>>> seems to check exactly what you are trying to skip here.
>>>
>>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>>> that already, why did you decide against it?
>>
>> No, in current approach this function is no longer needed.
>> The reason why we change the approach is that all pfn walkers
>> have to be aware of the uninitialized struct pages.
> 
> We should use the same strategy for all pfn walkers then (effectively
> get rid of pfn_zone_device_reserved() if that's what we want).

True, but this patch replaces "/proc/kpageflags: do not use uninitialized
struct pages". If we initialize the uninitialized struct pages, no pfn walker
will need to be aware of them.

> 
>>
>> As for SECTION_IS_ONLINE, I'm not sure now.
>> I will look into it next week.
> 
> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
> sub-section support for ZONE_DEVICE, it cannot easily be reused.
> 
It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
but that is no longer necessary with this approach.

Thanks,
Toshiki Fukasawa


Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-06 Thread Toshiki Fukasawa
Thank you for your feedback.

On 2019/09/06 17:45, David Hildenbrand wrote:
> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>> A kernel panic is observed during reading
>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>> pmem namespace:
>>
>> BUG: unable to handle page fault for address: fffe
>> [  114.495280] #PF: supervisor read access in kernel mode
>> [  114.495738] #PF: error_code(0x) - not-present page
>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>> [  114.496713] Oops:  [#1] SMP PTI
>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
>> 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 
>> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>> [  114.501373] RAX: fffe RBX:  RCX: 
>> 
>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>> d0748900
>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>> 
>> [  114.503270] R10:  R11:  R12: 
>> 0024
>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>> a5e601a0ff08
>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>> knlGS:
>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>> 06e0
>> [  114.506401] Call Trace:
>> [  114.506660]  kpageflags_read+0xb1/0x130
>> [  114.507051]  proc_reg_read+0x39/0x60
>> [  114.507387]  vfs_read+0x8a/0x140
>> [  114.507686]  ksys_pread64+0x61/0xa0
>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>
>> The first few pages of ZONE_DEVICE expressed as the range
>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>> skipped by struct page initialization. Some pfn walkers like
>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>> struct pages, which causes the error.
>>
>> In previous discussion, Dan seemed to have concern that the struct
>> page area of some pages indicated by vmem_altmap->reserve may not
>> be allocated. (See 
>> https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
>> However, arch_add_memory() called by devm_memremap_pages() allocates
>> struct page area for pages containing addresses in the range
>> (res.start) to (res.start + resource_size(res)), which include the
>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>> initializes struct pages in the same range.
>> So I think the struct pages should be initialized.>
> 
> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
> for the whole section. For ZONE_DEVICE memory we have no such
> indication. In any section that is !SECTION_IS_ONLINE and
> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
> The only indication I am aware of is pfn_zone_device_reserved() - which
> seems to check exactly what you are trying to skip here.
> 
> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
> that already, why did you decide against it?

No, in current approach this function is no longer needed.
The reason why we change the approach is that all pfn walkers
have to be aware of the uninitialized struct pages.

As for SECTION_IS_ONLINE, I'm not sure now.
I will look into it next week.

Thanks,
Toshiki Fukasawa

> 
>> Signed-off-by: Toshiki Fukasawa 
>> Cc: sta...@vger.kernel.org
>> ---
>> Changes since rev 1:
>>   Instead of avoiding uninitialized pages on the pfn walker side,
>>   we initialize struct pages.
>>
>> mm/page_alloc.c | 5 +
>>   1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 9c91949..6d180ae 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, 
>> in

[RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

2019-09-06 Thread Toshiki Fukasawa
A kernel panic is observed during reading
/proc/kpage{cgroup,count,flags} for first few pfns allocated by
pmem namespace:

BUG: unable to handle page fault for address: fffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops:  [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 
55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 
f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffe RBX:  RCX: 
[  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: d0748900
[  114.502637] RBP: d0748900 R08: 0001 R09: 
[  114.503270] R10:  R11:  R12: 0024
[  114.503896] R13: 0008 R14: 7ffca13a7310 R15: a5e601a0ff08
[  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
knlGS:
[  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
[  114.505754] CR2: fffe CR3: 00023a204000 CR4: 06e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The first few pages of ZONE_DEVICE expressed as the range
(altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
skipped by struct page initialization. Some pfn walkers like
/proc/kpage{cgroup, count, flags} can't handle these uninitialized
struct pages, which causes the error.

In previous discussion, Dan seemed to have concern that the struct
page area of some pages indicated by vmem_altmap->reserve may not
be allocated. (See 
https://lore.kernel.org/lkml/capcyv4i5fjtonpbxnctzvt+e6rqyow0jrqwsfuxaa62lsuv...@mail.gmail.com/)
However, arch_add_memory() called by devm_memremap_pages() allocates
struct page area for pages containing addresses in the range
(res.start) to (res.start + resource_size(res)), which include the
pages indicated by vmem_altmap->reserve. If I read correctly, it is
allocated as requested at least on x86_64. Also, memmap_init_zone()
initializes struct pages in the same range.
So I think the struct pages should be initialized.

Signed-off-by: Toshiki Fukasawa 
Cc: sta...@vger.kernel.org
---
Changes since rev 1:
 Instead of avoiding uninitialized pages on the pfn walker side,
 we initialize struct pages.

mm/page_alloc.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c91949..6d180ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
 
 #ifdef CONFIG_ZONE_DEVICE
/*
-* Honor reservation requested by the driver for this ZONE_DEVICE
-* memory. We limit the total number of pages to initialize to just
+* We limit the total number of pages to initialize to just
 * those that might contain the memory mapping. We will defer the
 * ZONE_DEVICE page initialization until after we have released
 * the hotplug lock.
@@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
if (!altmap)
return;
 
-   if (start_pfn == altmap->base_pfn)
-   start_pfn += altmap->reserve;
end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
}
 #endif
-- 
1.8.3.1



Re: [PATCH v2] fs/proc/page: Skip uninitialized page when iterating page structures

2019-08-28 Thread Toshiki Fukasawa
On 2019/08/28 23:18, Waiman Long wrote:
> On 8/28/19 10:09 AM, Michal Hocko wrote:
>> On Wed 28-08-19 09:46:21, Waiman Long wrote:
>>> On 8/28/19 4:00 AM, Michal Hocko wrote:
>>>> On Tue 27-08-19 16:22:38, Michal Hocko wrote:
>>>>> Dan, isn't this something we have discussed recently?
>>>> This was 
>>>> http://lkml.kernel.org/r/20190725023100.31141-3-t-fukas...@vx.jp.nec.com
>>>> and talked about /proc/kpageflags but this is essentially the same thing
>>>> AFAIU. I hope we get a consistent solution for both issues.
>>>>
>>> Yes, it is the same problem. The uninitialized page structure problem
>>> affects all the 3 /proc/kpage{cgroup,count,flags) files.
>>>
>>> Toshiki's patch seems to fix it just for /proc/kpageflags, though.
>> Yup. I was arguing that whacking a mole kinda fix is far from good. Dan
>> had some arguments on why initializing those struct pages is a problem.
>> The discussion had a half open end though. I hoped that Dan would try
>> out the initialization side but I migh have misunderstood.
> 
> If the page structures of the reserved PFNs are always initialized, that
> will fix the problem too. I am not familiar with the zone device code.
> So I didn't attempt to do that.
> 
> Cheers,
> Longman
> 
> 
I also think that the struct pages should be initialized.
I'm preparing to post the patch.

Toshiki Fukasawa


Re: [PATCH 2/2] /proc/kpageflags: do not use uninitialized struct pages

2019-08-04 Thread Toshiki Fukasawa
On 2019/07/26 16:06, Michal Hocko wrote:
> On Fri 26-07-19 06:25:49, Toshiki Fukasawa wrote:
>>
>>
>> On 2019/07/25 18:03, Michal Hocko wrote:
>>> On Thu 25-07-19 02:31:18, Toshiki Fukasawa wrote:
>>>> A kernel panic was observed during reading /proc/kpageflags for
>>>> first few pfns allocated by pmem namespace:
>>>>
>>>> BUG: unable to handle page fault for address: fffe
>>>> [  114.495280] #PF: supervisor read access in kernel mode
>>>> [  114.495738] #PF: error_code(0x) - not-present page
>>>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>>>> [  114.496713] Oops:  [#1] SMP PTI
>>>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
>>>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>>> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>>>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>>>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
>>>> 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 
>>>> <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>>>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>>>> [  114.501373] RAX: fffe RBX:  RCX: 
>>>> 
>>>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>>>> d0748900
>>>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>>>> 
>>>> [  114.503270] R10:  R11:  R12: 
>>>> 0024
>>>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>>>> a5e601a0ff08
>>>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>>>> knlGS:
>>>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>>>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>>>> 06e0
>>>> [  114.506401] Call Trace:
>>>> [  114.506660]  kpageflags_read+0xb1/0x130
>>>> [  114.507051]  proc_reg_read+0x39/0x60
>>>> [  114.507387]  vfs_read+0x8a/0x140
>>>> [  114.507686]  ksys_pread64+0x61/0xa0
>>>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>>>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>>>
>>>> The reason for the panic is that stable_page_flags() which parses
>>>> the page flags uses uninitialized struct pages reserved by the
>>>> ZONE_DEVICE driver.
>>>
>>> Why pmem hasn't initialized struct pages?
>>
>> We proposed to initialize in previous approach but that wasn't merged.
>> (See https://marc.info/?l=linux-mm=152964792500739=2)
>>
>>> Isn't that a bug that should be addressed rather than paper over it like 
>>> this?
>>
>> I'm not sure. What do you think, Dan?
> 
> Yeah, I am really curious about details. Why do we keep uninitialized
> struct pages at all? What is a random pfn walker supposed to do? What
> kind of metadata would be clobbered? In other words much more details
> please.
> 
I also want to know. I do not think that initializing struct pages will
clobber any metadata.

Best regards,
Toshiki Fukasawa


Re: [PATCH 2/2] /proc/kpageflags: do not use uninitialized struct pages

2019-07-26 Thread Toshiki Fukasawa



On 2019/07/25 18:03, Michal Hocko wrote:
> On Thu 25-07-19 02:31:18, Toshiki Fukasawa wrote:
>> A kernel panic was observed during reading /proc/kpageflags for
>> first few pfns allocated by pmem namespace:
>>
>> BUG: unable to handle page fault for address: fffe
>> [  114.495280] #PF: supervisor read access in kernel mode
>> [  114.495738] #PF: error_code(0x) - not-present page
>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>> [  114.496713] Oops:  [#1] SMP PTI
>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 
>> 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 
>> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>> [  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
>> [  114.501373] RAX: fffe RBX:  RCX: 
>> 
>> [  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: 
>> d0748900
>> [  114.502637] RBP: d0748900 R08: 0001 R09: 
>> 
>> [  114.503270] R10:  R11:  R12: 
>> 0024
>> [  114.503896] R13: 0008 R14: 7ffca13a7310 R15: 
>> a5e601a0ff08
>> [  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
>> knlGS:
>> [  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
>> [  114.505754] CR2: fffe CR3: 00023a204000 CR4: 
>> 06e0
>> [  114.506401] Call Trace:
>> [  114.506660]  kpageflags_read+0xb1/0x130
>> [  114.507051]  proc_reg_read+0x39/0x60
>> [  114.507387]  vfs_read+0x8a/0x140
>> [  114.507686]  ksys_pread64+0x61/0xa0
>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>
>> The reason for the panic is that stable_page_flags() which parses
>> the page flags uses uninitialized struct pages reserved by the
>> ZONE_DEVICE driver.
> 
> Why pmem hasn't initialized struct pages? 

We proposed to initialize in previous approach but that wasn't merged.
(See https://marc.info/?l=linux-mm=152964792500739=2)

> Isn't that a bug that should be addressed rather than paper over it like this?

I'm not sure. What do you think, Dan?

Best regards,
Toshiki Fukasawa


[PATCH 2/2] /proc/kpageflags: do not use uninitialized struct pages

2019-07-24 Thread Toshiki Fukasawa
A kernel panic was observed during reading /proc/kpageflags for
first few pfns allocated by pmem namespace:

BUG: unable to handle page fault for address: fffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops:  [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 
55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 
f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffe RBX:  RCX: 
[  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: d0748900
[  114.502637] RBP: d0748900 R08: 0001 R09: 
[  114.503270] R10:  R11:  R12: 0024
[  114.503896] R13: 0008 R14: 7ffca13a7310 R15: a5e601a0ff08
[  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
knlGS:
[  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
[  114.505754] CR2: fffe CR3: 00023a204000 CR4: 06e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The reason for the panic is that stable_page_flags() which parses
the page flags uses uninitialized struct pages reserved by the
ZONE_DEVICE driver.

Earlier approach to fix this was discussed here:
https://marc.info/?l=linux-mm=15296477672=2

This is another approach. To avoid using the uninitialized struct page,
immediately return with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by ZONE_DEVICE driver.

Cc: sta...@vger.kernel.org
Signed-off-by: Toshiki Fukasawa 
---
 fs/proc/page.c   |  3 +++
 include/linux/memremap.h |  6 ++
 kernel/memremap.c| 20 
 3 files changed, 29 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 69064ad..decd3fe 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -97,6 +97,9 @@ u64 stable_page_flags(struct page *page)
if (!page)
return BIT_ULL(KPF_NOPAGE);
 
+   if (pfn_zone_device_reserved(page_to_pfn(page)))
+   return BIT_ULL(KPF_RESERVED);
+
k = page->flags;
u = 0;
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f8a5b2a..2cfc3c2 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -124,6 +124,7 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
dev_pagemap *pgmap)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+bool pfn_zone_device_reserved(unsigned long pfn);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
@@ -132,6 +133,11 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
 #else
+static inline bool pfn_zone_device_reserved(unsigned long pfn)
+{
+   return false;
+}
+
 static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 6ee03a8..bc3471c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -72,6 +72,26 @@ static unsigned long pfn_next(unsigned long pfn)
return pfn + 1;
 }
 
+/*
+ * This returns true if the page is reserved by ZONE_DEVICE driver.
+ */
+bool pfn_zone_device_reserved(unsigned long pfn)
+{
+   struct dev_pagemap *pgmap;
+   struct vmem_altmap *altmap;
+   bool ret = false;
+
+   pgmap = get_dev_pagemap(pfn, NULL);
+   if (!pgmap)
+   return ret;
+   altmap = pgmap_altmap(pgmap);
+   if (altmap && pfn < (altmap->base_pfn + altmap->reserve))
+   ret = true;
+   put_dev_pagemap(pgmap);
+
+   return ret;
+}
+
 #define for_each_device_pfn(pfn, map) \
for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn))
 
-- 
1.8.3.1



[PATCH 1/2] /proc/kpageflags: prevent an integer overflow in stable_page_flags()

2019-07-24 Thread Toshiki Fukasawa
stable_page_flags() returns kpageflags info in u64, but it uses
"1 << KPF_*" internally which is considered as int. This type mismatch
causes no visible problem now, but it will if you set bit 32 or more as
done in a subsequent patch. So use BIT_ULL in order to avoid future
overflow issues.

Signed-off-by: Toshiki Fukasawa 
---
 fs/proc/page.c | 37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 544d1ee..69064ad 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -95,7 +95,7 @@ u64 stable_page_flags(struct page *page)
 * it differentiates a memory hole from a page with no flags
 */
if (!page)
-   return 1 << KPF_NOPAGE;
+   return BIT_ULL(KPF_NOPAGE);
 
k = page->flags;
u = 0;
@@ -107,22 +107,22 @@ u64 stable_page_flags(struct page *page)
 * simple test in page_mapped() is not enough.
 */
if (!PageSlab(page) && page_mapped(page))
-   u |= 1 << KPF_MMAP;
+   u |= BIT_ULL(KPF_MMAP);
if (PageAnon(page))
-   u |= 1 << KPF_ANON;
+   u |= BIT_ULL(KPF_ANON);
if (PageKsm(page))
-   u |= 1 << KPF_KSM;
+   u |= BIT_ULL(KPF_KSM);
 
/*
 * compound pages: export both head/tail info
 * they together define a compound page's start/end pos and order
 */
if (PageHead(page))
-   u |= 1 << KPF_COMPOUND_HEAD;
+   u |= BIT_ULL(KPF_COMPOUND_HEAD);
if (PageTail(page))
-   u |= 1 << KPF_COMPOUND_TAIL;
+   u |= BIT_ULL(KPF_COMPOUND_TAIL);
if (PageHuge(page))
-   u |= 1 << KPF_HUGE;
+   u |= BIT_ULL(KPF_HUGE);
/*
 * PageTransCompound can be true for non-huge compound pages (slab
 * pages or pages allocated by drivers with __GFP_COMP) because it
@@ -133,14 +133,13 @@ u64 stable_page_flags(struct page *page)
struct page *head = compound_head(page);
 
if (PageLRU(head) || PageAnon(head))
-   u |= 1 << KPF_THP;
+   u |= BIT_ULL(KPF_THP);
else if (is_huge_zero_page(head)) {
-   u |= 1 << KPF_ZERO_PAGE;
-   u |= 1 << KPF_THP;
+   u |= BIT_ULL(KPF_ZERO_PAGE);
+   u |= BIT_ULL(KPF_THP);
}
} else if (is_zero_pfn(page_to_pfn(page)))
-   u |= 1 << KPF_ZERO_PAGE;
-
+   u |= BIT_ULL(KPF_ZERO_PAGE);
 
/*
 * Caveats on high order pages: page->_refcount will only be set
@@ -148,23 +147,23 @@ u64 stable_page_flags(struct page *page)
 * SLOB won't set PG_slab at all on compound pages.
 */
if (PageBuddy(page))
-   u |= 1 << KPF_BUDDY;
+   u |= BIT_ULL(KPF_BUDDY);
else if (page_count(page) == 0 && is_free_buddy_page(page))
-   u |= 1 << KPF_BUDDY;
+   u |= BIT_ULL(KPF_BUDDY);
 
if (PageOffline(page))
-   u |= 1 << KPF_OFFLINE;
+   u |= BIT_ULL(KPF_OFFLINE);
if (PageTable(page))
-   u |= 1 << KPF_PGTABLE;
+   u |= BIT_ULL(KPF_PGTABLE);
 
if (page_is_idle(page))
-   u |= 1 << KPF_IDLE;
+   u |= BIT_ULL(KPF_IDLE);
 
u |= kpf_copy_bit(k, KPF_LOCKED,PG_locked);
 
u |= kpf_copy_bit(k, KPF_SLAB,  PG_slab);
if (PageTail(page) && PageSlab(compound_head(page)))
-   u |= 1 << KPF_SLAB;
+   u |= BIT_ULL(KPF_SLAB);
 
u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty);
@@ -177,7 +176,7 @@ u64 stable_page_flags(struct page *page)
u |= kpf_copy_bit(k, KPF_RECLAIM,   PG_reclaim);
 
if (PageSwapCache(page))
-   u |= 1 << KPF_SWAPCACHE;
+   u |= BIT_ULL(KPF_SWAPCACHE);
u |= kpf_copy_bit(k, KPF_SWAPBACKED,PG_swapbacked);
 
u |= kpf_copy_bit(k, KPF_UNEVICTABLE,   PG_unevictable);
-- 
1.8.3.1



[PATCH 0/2] fix kernel panic due to use uninitialized struct pages

2019-07-24 Thread Toshiki Fukasawa
A kernel panic was observed during reading /proc/kpageflags for
first few pfns allocated by pmem namespace:

BUG: unable to handle page fault for address: fffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops:  [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1 #1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 
55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 
f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:a5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffe RBX:  RCX: 
[  114.502009] RDX: 0001 RSI: 7ffca13a7310 RDI: d0748900
[  114.502637] RBP: d0748900 R08: 0001 R09: 
[  114.503270] R10:  R11:  R12: 0024
[  114.503896] R13: 0008 R14: 7ffca13a7310 R15: a5e601a0ff08
[  114.504530] FS:  7f0266c7f540() GS:962dbbac() 
knlGS:
[  114.505245] CS:  0010 DS:  ES:  CR0: 80050033
[  114.505754] CR2: fffe CR3: 00023a204000 CR4: 06e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

Earlier approach to fix this was discussed here:
https://marc.info/?l=linux-mm=15296477672=2

This patchset is another approach to fix it and also provide
a fix for potential future bugs discovered in the process.

Toshiki Fukasawa (2):
  /proc/kpageflags: prevent an integer overflow in stable_page_flags()
  /proc/kpageflags: do not use uninitialized struct pages

 fs/proc/page.c   | 40 +---
 include/linux/memremap.h |  6 ++
 kernel/memremap.c| 20 
 3 files changed, 47 insertions(+), 19 deletions(-)

-- 
1.8.3.1