Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-09 Thread Zhang Yanfei
Hello guys,

On 10/10/2013 07:26 AM, Zhang Yanfei wrote:
> Hello Peter,
> 
> On 10/10/2013 07:10 AM, H. Peter Anvin wrote:
>> On 10/09/2013 02:45 PM, Zhang Yanfei wrote:

>>>> I would also argue that in the VM scenario -- and arguably even in the
>>>> hardware scenario -- the right thing is to not expose the flexible
>>>> memory in the e820/EFI tables, and instead have it hot-added (possibly
>>>> *immediately* so) on boot.  This avoids both the boot-time funnies as
>>>> well as the scaling issues with metadata.

>>>
>>> So in this kind of scenario, hotpluggable memory will not be detected
>>> at boot time, the admin should not use the movable_node boot option,
>>> and the kernel will act as before, always using top-down allocation.
>>>
>>
>> Yes.  The idea is that the kernel will boot up without the hotplug
>> memory, but if desired, will immediately see a hotplug-add event for the
>> movable memory.
> 
> Yeah, this is good.
> 
> But in the scenario where we boot with hotplug memory present, we need the
> movable_node option. Since Tejun has already explained a lot about this
> patchset, do you still object to it, or could I ask Andrew to merge it
> into the -mm tree for more testing?
> 

Since Tejun has explained this approach at length, could we come to
an agreement on it?

Peter? If you have no objection, I'll post a new v7 version which will fix
the __pa_symbol problem you pointed out.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-08 Thread Zhang Yanfei
Hello Tejun,
CC: Peter

On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen 
>>
>> The Linux kernel cannot migrate pages used by the kernel itself. As a
>> result, kernel pages cannot be hot-removed, so we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any NUMA node the kernel resides in
>> should not be hotpluggable. And on a modern server, each node could
>> have at least 16GB of memory, so the memory around the kernel image
>> is very likely not hotpluggable.
>>
>> The ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel, so we need to prevent
>> memblock from doing this.
>>
>> The setup of the direct-mapping page tables is one such case:
>> init_mem_mapping() is called before SRAT is parsed. To prevent page
>> tables from being allocated within hotpluggable memory, we allocate
>> them in the bottom-up direction, from the end of the kernel image
>> toward higher memory.
>>
>> Acked-by: Tejun Heo 
>> Signed-off-by: Tang Chen 
>> Signed-off-by: Zhang Yanfei 
> 
> I'm still seriously concerned about this.  This unconditionally
> introduces new behavior which may very well break some classes of
> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.
> 

After thinking it over for a while, the issue Peter pointed out does seem
real. And looking back at what you suggested about allocating close to the
kernel,

> so if we allocate memory close to the kernel image,
>   it's likely that we don't contaminate hotpluggable node.  We're
>   talking about few megs at most right after the kernel image.  I
>   can't see how that would make any noticeable difference.

You meant that the memory involved is only a few megs. But page tables can
be large enough on big-memory machines that they will consume the precious
lower memory. So I think we may really need to reorder the page table
setup until after we get the hotplug info in some way, just like we did in
patch 5, where we reordered reserve_crashkernel() to be called after
initmem_init().

So do you still have any objection to reordering the page table setup?

-- 
Thanks.
Zhang Yanfei



[PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-03 Thread Zhang Yanfei
From: Tang Chen 

The Linux kernel cannot migrate pages used by the kernel itself. As a
result, kernel pages cannot be hot-removed, so we cannot allocate
hotpluggable memory for the kernel.

In a memory hotplug system, any NUMA node the kernel resides in
should not be hotpluggable. And on a modern server, each node could
have at least 16GB of memory, so the memory around the kernel image
is very likely not hotpluggable.

The ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already
started to allocate memory for the kernel, so we need to prevent
memblock from doing this.

The setup of the direct-mapping page tables is one such case:
init_mem_mapping() is called before SRAT is parsed. To prevent page
tables from being allocated within hotpluggable memory, we allocate
them in the bottom-up direction, from the end of the kernel image
toward higher memory.

Acked-by: Tejun Heo 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/mm/init.c |   71 ++-
 1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ea2be79..5cea9ed 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -458,6 +458,51 @@ static void __init memory_map_top_down(unsigned long map_start,
init_range_memory_mapping(real_end, map_end);
 }
 
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in bottom-up. Since we have limited the
+ * bottom-up allocation above the kernel, the page tables will
+ * be allocated just above the kernel and we map the memory
+ * in [map_start, map_end) in bottom-up.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+   unsigned long map_end)
+{
+   unsigned long next, new_mapped_ram_size, start;
+   unsigned long mapped_ram_size = 0;
+   /* step_size needs to be small so pgt_buf from BRK could cover it */
+   unsigned long step_size = PMD_SIZE;
+
+   start = map_start;
+   min_pfn_mapped = start >> PAGE_SHIFT;
+
+   /*
+* We start from the bottom (@map_start) and go to the top (@map_end).
+* The memblock_find_in_range() gets us a block of RAM from the
+* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+* for page table.
+*/
+   while (start < map_end) {
+   if (map_end - start > step_size) {
+   next = round_up(start + 1, step_size);
+   if (next > map_end)
+   next = map_end;
+   } else
+   next = map_end;
+
+   new_mapped_ram_size = init_range_memory_mapping(start, next);
+   start = next;
+
+   if (new_mapped_ram_size > mapped_ram_size)
+   step_size <<= STEP_SIZE_SHIFT;
+   mapped_ram_size += new_mapped_ram_size;
+   }
+}
+
 void __init init_mem_mapping(void)
 {
unsigned long end;
@@ -473,8 +518,30 @@ void __init init_mem_mapping(void)
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
 
-   /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
-   memory_map_top_down(ISA_END_ADDRESS, end);
+   /*
+* If the allocation is in bottom-up direction, we setup direct mapping
+* in bottom-up, otherwise we setup direct mapping in top-down.
+*/
+   if (memblock_bottom_up()) {
+   unsigned long kernel_end;
+
+#ifdef CONFIG_X86
+   kernel_end = __pa_symbol(_end);
+#else
+   kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+   /*
+* we need two separate calls here. This is because we want to
+* allocate page tables above the kernel. So we first map
+* [kernel_end, end) to make memory above the kernel be mapped
+* as soon as possible. And then use page tables allocated above
+* the kernel to map [ISA_END_ADDRESS, kernel_end).
+*/
+   memory_map_bottom_up(kernel_end, end);
+   memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+   } else {
+   memory_map_top_down(ISA_END_ADDRESS, end);
+   }
 
 #ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
-- 
1.7.1
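To see how the step_size doubling in memory_map_bottom_up() walks the
range, here is a small standalone simulation (plain Python, not kernel
code; the example start/end addresses are mine) that mirrors the chunking
logic of the loop in the patch above:

```python
# Standalone simulation of the chunking in memory_map_bottom_up():
# walk [map_start, map_end) bottom-up, growing step_size by
# STEP_SIZE_SHIFT each round, mirroring the x86 defaults.
PMD_SIZE = 2 << 20        # initial step: 2MiB, coverable by the BRK pgt_buf
STEP_SIZE_SHIFT = 5

def round_up(x, align):
    return (x + align - 1) // align * align

def bottom_up_chunks(map_start, map_end):
    """Yield the (start, next) ranges the loop would map, in order."""
    step_size = PMD_SIZE
    mapped_ram_size = 0
    start = map_start
    while start < map_end:
        if map_end - start > step_size:
            nxt = min(round_up(start + 1, step_size), map_end)
        else:
            nxt = map_end
        yield (start, nxt)
        new_mapped_ram_size = nxt - start
        start = nxt
        # grow the step once a round maps more than everything so far,
        # since the freshly mapped RAM can now hold bigger page tables
        if new_mapped_ram_size > mapped_ram_size:
            step_size <<= STEP_SIZE_SHIFT
        mapped_ram_size += new_mapped_ram_size

# Example: map from a kernel end at 16MiB up to 1GiB.
for a, b in bottom_up_chunks(16 << 20, 1 << 30):
    print(f"map [{a:#x}, {b:#x})  ({(b - a) >> 20} MiB)")
```

The chunks grow quickly (2MiB, then tens of MiB, then the rest), so only
the first rounds rely on the small BRK-provided page-table buffer.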


