Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
Hi Tang,
On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
 On 01/31/2013 02:19 PM, Simon Jeons wrote:
  Hi Tang,
  On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
  Hi Simon,
 
  Please see below. :)
 
  On 01/31/2013 09:22 AM, Simon Jeons wrote:
 
   Sorry, I'm still confused. :(
   Do we update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY], or does
   node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE?

   Which one is node_states here: node_states[N_NORMAL_MEMORY] or
   node_states[N_MEMORY]?
 
  Are you asking what node_states[] is ?
 
  node_states[] is an array of nodemask,
 
extern nodemask_t node_states[NR_NODE_STATES];
 
  For example, node_states[N_NORMAL_MEMORY] represents which nodes have
  normal memory.
  If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
  node_states[N_NORMAL_MEMORY]. So it represents which nodes have zones
  0 ... ZONE_MOVABLE.
 
 
  Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have zones
  0 ... *ZONE_MOVABLE*? The comment on enum node_states says that
  N_NORMAL_MEMORY just means the node has regular memory.
 
 
 Hi Simon,
 
 Let me put it this way.
 
 If we don't have CONFIG_HIGHMEM, then N_HIGH_MEMORY == N_NORMAL_MEMORY.
 We don't have a separate macro to represent highmem because we don't have
 highmem. This is easy to understand, right?
 
 Now, think of it just like the above:
 if we don't have CONFIG_MOVABLE_NODE, then N_MEMORY == N_HIGH_MEMORY ==
 N_NORMAL_MEMORY.
 This means we don't allow a node to have only movable memory, not that we
 don't have movable memory at all. A node can still have normal memory and
 movable memory. So node_states[N_NORMAL_MEMORY] represents nodes that have
 zones 0 ... *ZONE_MOVABLE*.
 
 I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have
 only movable memory. So without CONFIG_MOVABLE_NODE, it doesn't mean a node
 cannot have movable memory. It means a node cannot have only movable
 memory; it can have both normal memory and movable memory.
 
 1) With CONFIG_MOVABLE_NODE:
 N_NORMAL_MEMORY: nodes which have normal memory
     normal memory only
     normal and highmem
     normal and highmem and movablemem
     normal and movablemem
 N_MEMORY: nodes which have memory (any memory)
     normal memory only
     normal and highmem
     normal and highmem and movablemem
     normal and movablemem        <-- we can have movablemem
     highmem only
     highmem and movablemem
     movablemem only              <-- we can have movablemem only ***
 
 2) Without CONFIG_MOVABLE_NODE:
 N_MEMORY == N_NORMAL_MEMORY: (here I omit N_HIGH_MEMORY)
     normal memory only
     normal and highmem
     normal and highmem and movablemem
     normal and movablemem        <-- we can have movablemem
     (no "movablemem only")       <-- we cannot have movablemem only ***
 
 The semantics are not that clear here, so we can only try to understand it
 from the code where we use N_MEMORY. :)
 
 That is my understanding of it.
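 
 For reference, the aliasing above comes from include/linux/nodemask.h; the
 3.8-era definitions look roughly like this (a from-memory sketch, not a
 verbatim copy):
 
     enum node_states {
             N_POSSIBLE,              /* the node could become online at some point */
             N_ONLINE,                /* the node is online */
             N_NORMAL_MEMORY,         /* the node has regular memory */
     #ifdef CONFIG_HIGHMEM
             N_HIGH_MEMORY,           /* the node has regular or high memory */
     #else
             N_HIGH_MEMORY = N_NORMAL_MEMORY,
     #endif
     #ifdef CONFIG_MOVABLE_NODE
             N_MEMORY,                /* the node has memory (regular, high, movable) */
     #else
             N_MEMORY = N_HIGH_MEMORY,
     #endif
             N_CPU,                   /* the node has one or more cpus */
             NR_NODE_STATES
     };
 
     extern nodemask_t node_states[NR_NODE_STATES];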

Thanks for your clarification, very clear now. :)

 
 Thanks. :)
 
 
 
 




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
Hi Tang,
On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

1. IIUC, there is a button on machines which support hot-removing memory;
what's the difference between pressing that button and echoing to /sys?
2. Since kernel memory is linearly mapped (I mean the direct mapping part),
why can't we put the kernel's direct-mapped memory into one memory device,
and other memory into the other devices? As you know, x86_64 doesn't need
highmem, so IIUC all kernel memory is linearly mapped in that case. Is my
idea feasible? If it is, x86_32 can't be handled in the same way, since
highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's hard to
confine kernel memory to a single memory device.
3. In the current implementation, does memory hotplug only need support from
the memory subsystem and the ACPI code, or does the firmware also need to
take part? Hope you can explain in detail, thanks in advance. :)
4. What's the status of memory hotplug? Apart from not being able to remove
kernel memory, is everything else fully implemented?




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Tang Chen

Hi Simon,

On 01/31/2013 04:48 PM, Simon Jeons wrote:

Hi Tang,
On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

1. IIUC, there is a button on machines which support hot-removing memory;
what's the difference between pressing that button and echoing to /sys?


No important difference, I think. Since I don't have the machine you are
describing, I can't answer with certainty. :)
AFAIK, pressing the button triggers the hotplug from the hardware side, and
sysfs is just another entry point. In the end, they run into the same code.


2. Since kernel memory is linearly mapped (I mean the direct mapping part),
why can't we put the kernel's direct-mapped memory into one memory device,
and other memory into the other devices?


We cannot do that, because that way we would lose NUMA performance.

If you know NUMA, you will understand the following example:

    node0:                  node1:
    cpu0~cpu15              cpu16~cpu31
    memory0~memory511       memory512~memory1023

cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
If we put the direct mapping area on node0 and the movable area on node1,
then kernel code running on cpu16~cpu31 will always have to access
memory0~memory511.

This is a terrible performance hit.
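
(To make the locality point concrete: kernel code normally allocates from the
node it is running on. A minimal illustrative sketch using the standard page
allocator helpers, not code taken from the patch set:)

    /* Allocate one page from the local node, i.e. the node of the CPU
     * this code is currently running on. */
    int nid = numa_node_id();
    struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);

    if (page)
        __free_pages(page, 0);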


As you know, x86_64 doesn't need
highmem, so IIUC all kernel memory is linearly mapped in that case. Is my
idea feasible? If it is, x86_32 can't be handled in the same way, since
highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's hard to
confine kernel memory to a single memory device.


Sorry, I'm not that familiar with x86_32 boxes.


3. In the current implementation, does memory hotplug only need support from
the memory subsystem and the ACPI code, or does the firmware also need to
take part? Hope you can explain in detail, thanks in advance. :)


We do need the firmware to take part, for example the SRAT in the ACPI BIOS,
or the firmware-based memory migration mentioned by Liu Jiang.

So far, that is all I know of. :)


4. What's the status of memory hotplug? Apart from not being able to remove
kernel memory, is everything else fully implemented?


I think the main job is done for now, but there are still bugs to fix, and
this functionality is not yet stable.

Thanks. :)


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
Hi Tang,
On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
 Hi Simon,
 
 On 01/31/2013 04:48 PM, Simon Jeons wrote:
  Hi Tang,
  On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
 
  1. IIUC, there is a button on machine which supports hot-remove memory,
  then what's the difference between press button and echo to /sys?
 
 No important difference, I think. Since I don't have the machine you are
 saying, I cannot surely answer you. :)
 AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
 is just another entrance. At last, they will run into the same code.
 
  2. Since kernel memory is linear mapping(I mean direct mapping part),
  why can't put kernel direct mapping memory into one memory device, and
  other memory into the other devices?
 
 We cannot do that because in that way, we will lose NUMA performance.
 
 If you know NUMA, you will understand the following example:
 
     node0:                  node1:
     cpu0~cpu15              cpu16~cpu31
     memory0~memory511       memory512~memory1023
 
 cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
 If we put the direct mapping area on node0 and the movable area on node1,
 then kernel code running on cpu16~cpu31 will always have to access
 memory0~memory511.
 This is a terrible performance hit.

So with CONFIG_NUMA, kernel memory is not linearly mapped any more? For
example:

    Node 0          Node 1

    0 ~ 10G         11G ~ 14G

Is kernel memory only on Node 0? Can part of the kernel memory also be on
Node 1?

How big can the kernel direct mapping be on x86_64? Is there a max limit?
It seems it is only around 896MB on x86_32.

 
 As you know x86_64 don't need
  highmem, IIUC, all kernel memory will linear mapping in this case. Is my
  idea available? If is correct, x86_32 can't implement in the same way
  since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
  hard to focus kernel memory on single memory device.
 
 Sorry, I'm not quite familiar with x86_32 box.
 
  3. In current implementation, if memory hotplug just need memory
  subsystem and ACPI codes support? Or also needs firmware take part in?
  Hope you can explain in details, thanks in advance. :)
 
 We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
 based memory migration mentioned by Liu Jiang.

Is there any material about the firmware-based memory migration?

 
 So far, I only know this. :)
 
  4. What's the status of memory hotplug? Apart from can't remove kernel
  memory, other things are fully implementation?
 
 I think the main job is done for now. And there are still bugs to fix.
 And this functionality is not stable.
 
 Thanks. :)




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Jianguo Wu
On 2013/1/31 18:38, Simon Jeons wrote:

 Hi Tang,
 On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
 Hi Simon,

 On 01/31/2013 04:48 PM, Simon Jeons wrote:
 Hi Tang,
 On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

 1. IIUC, there is a button on machine which supports hot-remove memory,
 then what's the difference between press button and echo to /sys?

 No important difference, I think. Since I don't have the machine you are
 saying, I cannot surely answer you. :)
 AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
 is just another entrance. At last, they will run into the same code.

 2. Since kernel memory is linear mapping(I mean direct mapping part),
 why can't put kernel direct mapping memory into one memory device, and
 other memory into the other devices?

 We cannot do that because in that way, we will lose NUMA performance.

 If you know NUMA, you will understand the following example:

     node0:                  node1:
     cpu0~cpu15              cpu16~cpu31
     memory0~memory511       memory512~memory1023
 
 cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
 If we put the direct mapping area on node0 and the movable area on node1,
 then kernel code running on cpu16~cpu31 will always have to access
 memory0~memory511.
 This is a terrible performance hit.
 
 So if config NUMA, kernel memory will not be linear mapping anymore? For
 example, 
 
 Node 0  Node 1 
 
 0 ~ 10G 11G~14G
 
 kernel memory only at Node 0? Can part of kernel memory also at Node 1?
 
 How big is kernel direct mapping memory in x86_64? Is there max limit?


Max kernel direct mapping memory in x86_64 is 64TB.
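
(For reference, the 3.8-era Documentation/x86/x86_64/mm.txt describes this
region roughly as follows; quoted from memory, so treat it as a sketch:

    ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory

i.e. the whole of physical memory is mapped starting at PAGE_OFFSET.)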

 It seems that only around 896MB on x86_32. 
 

 As you know x86_64 don't need
 highmem, IIUC, all kernel memory will linear mapping in this case. Is my
 idea available? If is correct, x86_32 can't implement in the same way
 since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
 hard to focus kernel memory on single memory device.

 Sorry, I'm not quite familiar with x86_32 box.

 3. In current implementation, if memory hotplug just need memory
 subsystem and ACPI codes support? Or also needs firmware take part in?
 Hope you can explain in details, thanks in advance. :)

 We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
 based memory migration mentioned by Liu Jiang.
 
 Is there any material about firmware based memory migration?
 

 So far, I only know this. :)

 4. What's the status of memory hotplug? Apart from can't remove kernel
 memory, other things are fully implementation?

 I think the main job is done for now. And there are still bugs to fix.
 And this functionality is not stable.

 Thanks. :)
 
 
 





Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Tang Chen

On 02/01/2013 09:36 AM, Simon Jeons wrote:

On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:


So with CONFIG_NUMA, kernel memory is not linearly mapped any more? For
example:

    Node 0          Node 1

    0 ~ 10G         11G ~ 14G


It has nothing to do with linear mapping, I think.



Is kernel memory only on Node 0? Can part of the kernel memory also be on Node 1?


Please refer to find_zone_movable_pfns_for_nodes().
The kernel is not only on node0. It uses all the online nodes evenly. :)



How big is kernel direct mapping memory in x86_64? Is there max limit?



Max kernel direct mapping memory in x86_64 is 64TB.


For example, if I have 8G of memory, will all of it be direct-mapped for the
kernel? Then where is user-space memory allocated from?


I think you misunderstood what Wu tried to say. :)

The kernel maps that large space, but that doesn't mean it is using that
large space. The mapping exists so that the kernel is able to access all the
memory; it is not for the kernel's use only. User space can also use the
memory, but each process has its own mapping.


For example:

                           <-- 64TB, whatever --><-- xxxTB, whatever -->
    logical address space:  |______kernel________|________user_________|
                                \    \                /    /
                                 \    \              /    /
    physical address space:      |____\/____________\/___|   4GB or 8GB, whatever
                                       *

The physical memory at * is mapped into user space in the process' own page
table. It is also direct-mapped in the kernel's page table, so the kernel can
also access it. :)







It seems that only around 896MB on x86_32.



We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
based memory migration mentioned by Liu Jiang.


Is there any material about firmware based memory migration?


No, I don't have any, because this is a feature of a machine from HUAWEI.
I think you can ask Liu Jiang or Wu Jianguo to share some with you. :)

Thanks. :)


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Jianguo Wu
On 2013/2/1 9:36, Simon Jeons wrote:

 On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
 On 2013/1/31 18:38, Simon Jeons wrote:

 Hi Tang,
 On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
 Hi Simon,

 On 01/31/2013 04:48 PM, Simon Jeons wrote:
 Hi Tang,
 On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

 1. IIUC, there is a button on machine which supports hot-remove memory,
 then what's the difference between press button and echo to /sys?

 No important difference, I think. Since I don't have the machine you are
 saying, I cannot surely answer you. :)
 AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
 is just another entrance. At last, they will run into the same code.

 2. Since kernel memory is linear mapping(I mean direct mapping part),
 why can't put kernel direct mapping memory into one memory device, and
 other memory into the other devices?

 We cannot do that because in that way, we will lose NUMA performance.

 If you know NUMA, you will understand the following example:

     node0:                  node1:
     cpu0~cpu15              cpu16~cpu31
     memory0~memory511       memory512~memory1023
 
 cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
 If we put the direct mapping area on node0 and the movable area on node1,
 then kernel code running on cpu16~cpu31 will always have to access
 memory0~memory511.
 This is a terrible performance hit.

 So if config NUMA, kernel memory will not be linear mapping anymore? For
 example, 

 Node 0  Node 1 

 0 ~ 10G 11G~14G

 kernel memory only at Node 0? Can part of kernel memory also at Node 1?

 How big is kernel direct mapping memory in x86_64? Is there max limit?


 Max kernel direct mapping memory in x86_64 is 64TB.
 
 For example, I have 8G memory, all of them will be direct mapping for
 kernel? then userspace memory allocated from where?

Direct-mapped memory means you can use __va() and __pa() on it, but it
doesn't mean it can only be used by the kernel; it can be used by user space
too, as long as it is free.
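
(As a rough sketch of what the direct mapping buys you on x86_64 -- this is a
simplification of the real arch/x86 macros, ignoring the special handling of
the kernel text mapping, and the PAGE_OFFSET value is the 3.8-era one:)

    #define PAGE_OFFSET    0xffff880000000000UL

    /* physical -> kernel virtual: just add the fixed offset */
    #define __va(x)        ((void *)((unsigned long)(x) + PAGE_OFFSET))
    /* kernel virtual -> physical: just subtract it */
    #define __pa(x)        ((unsigned long)(x) - PAGE_OFFSET)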

 

 It seems that only around 896MB on x86_32. 


 As you know x86_64 don't need
 highmem, IIUC, all kernel memory will linear mapping in this case. Is my
 idea available? If is correct, x86_32 can't implement in the same way
 since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
 hard to focus kernel memory on single memory device.

 Sorry, I'm not quite familiar with x86_32 box.

 3. In current implementation, if memory hotplug just need memory
 subsystem and ACPI codes support? Or also needs firmware take part in?
 Hope you can explain in details, thanks in advance. :)

 We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
 based memory migration mentioned by Liu Jiang.

 Is there any material about firmware based memory migration?


 So far, I only know this. :)

 4. What's the status of memory hotplug? Apart from can't remove kernel
 memory, other things are fully implementation?

 I think the main job is done for now. And there are still bugs to fix.
 And this functionality is not stable.

 Thanks. :)


 





Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
Hi Jianguo,
On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote:
 On 2013/2/1 9:36, Simon Jeons wrote:
 
  On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
  On 2013/1/31 18:38, Simon Jeons wrote:
 
  Hi Tang,
  On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
  Hi Simon,
 
  On 01/31/2013 04:48 PM, Simon Jeons wrote:
  Hi Tang,
  On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
 
  1. IIUC, there is a button on machine which supports hot-remove memory,
  then what's the difference between press button and echo to /sys?
 
  No important difference, I think. Since I don't have the machine you are
  saying, I cannot surely answer you. :)
  AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
  is just another entrance. At last, they will run into the same code.
 
  2. Since kernel memory is linear mapping(I mean direct mapping part),
  why can't put kernel direct mapping memory into one memory device, and
  other memory into the other devices?
 
  We cannot do that because in that way, we will lose NUMA performance.
 
  If you know NUMA, you will understand the following example:
 
     node0:                  node1:
     cpu0~cpu15              cpu16~cpu31
     memory0~memory511       memory512~memory1023
 
 cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
 If we put the direct mapping area on node0 and the movable area on node1,
 then kernel code running on cpu16~cpu31 will always have to access
 memory0~memory511.
 This is a terrible performance hit.
 
  So if config NUMA, kernel memory will not be linear mapping anymore? For
  example, 
 
  Node 0  Node 1 
 
  0 ~ 10G 11G~14G
 
  kernel memory only at Node 0? Can part of kernel memory also at Node 1?
 
  How big is kernel direct mapping memory in x86_64? Is there max limit?
 
 
  Max kernel direct mapping memory in x86_64 is 64TB.
  
  For example, I have 8G memory, all of them will be direct mapping for
  kernel? then userspace memory allocated from where?
 
 Direct-mapped memory means you can use __va() and __pa() on it, but it
 doesn't mean it can only be used by the kernel; it can be used by user space
 too, as long as it is free.

IIUC, the benefit of __va() and __pa() is just getting the virtual/physical
address quickly by taking advantage of the linear mapping. But the MMU still
needs to walk pgd/pud/pmd/pte, correct?

 
  
 
  It seems that only around 896MB on x86_32. 
 
 
  As you know x86_64 don't need
  highmem, IIUC, all kernel memory will linear mapping in this case. Is my
  idea available? If is correct, x86_32 can't implement in the same way
  since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
  hard to focus kernel memory on single memory device.
 
  Sorry, I'm not quite familiar with x86_32 box.
 
  3. In current implementation, if memory hotplug just need memory
  subsystem and ACPI codes support? Or also needs firmware take part in?
  Hope you can explain in details, thanks in advance. :)
 
  We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
  based memory migration mentioned by Liu Jiang.
 
  Is there any material about firmware based memory migration?
 
 
  So far, I only know this. :)
 
  4. What's the status of memory hotplug? Apart from can't remove kernel
  memory, other things are fully implementation?
 
  I think the main job is done for now. And there are still bugs to fix.
  And this functionality is not stable.
 
  Thanks. :)
 
 
 
 
 
 
  
  
  
  .
  
 
 
 




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Jianguo Wu
On 2013/2/1 10:06, Simon Jeons wrote:

 Hi Jianguo,
 On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote:
 On 2013/2/1 9:36, Simon Jeons wrote:

 On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
 On 2013/1/31 18:38, Simon Jeons wrote:

 Hi Tang,
 On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
 Hi Simon,

 On 01/31/2013 04:48 PM, Simon Jeons wrote:
 Hi Tang,
 On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

 1. IIUC, there is a button on machine which supports hot-remove memory,
 then what's the difference between press button and echo to /sys?

 No important difference, I think. Since I don't have the machine you are
 saying, I cannot surely answer you. :)
 AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
 is just another entrance. At last, they will run into the same code.

 2. Since kernel memory is linear mapping(I mean direct mapping part),
 why can't put kernel direct mapping memory into one memory device, and
 other memory into the other devices?

 We cannot do that because in that way, we will lose NUMA performance.

 If you know NUMA, you will understand the following example:

     node0:                  node1:
     cpu0~cpu15              cpu16~cpu31
     memory0~memory511       memory512~memory1023
 
 cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
 If we put the direct mapping area on node0 and the movable area on node1,
 then kernel code running on cpu16~cpu31 will always have to access
 memory0~memory511.
 This is a terrible performance hit.

 So if config NUMA, kernel memory will not be linear mapping anymore? For
 example, 

 Node 0  Node 1 

 0 ~ 10G 11G~14G

 kernel memory only at Node 0? Can part of kernel memory also at Node 1?

 How big is kernel direct mapping memory in x86_64? Is there max limit?


 Max kernel direct mapping memory in x86_64 is 64TB.

 For example, I have 8G memory, all of them will be direct mapping for
 kernel? then userspace memory allocated from where?

 Direct-mapped memory means you can use __va() and __pa() on it, but it
 doesn't mean it can only be used by the kernel; it can be used by user space
 too, as long as it is free.
 
 IIUC, the benefit of __va() and __pa() is just getting the virtual/physical
 address quickly by taking advantage of the linear mapping. But the MMU still
 needs to walk pgd/pud/pmd/pte, correct?

Yes.
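
(For reference, a minimal sketch of how the kernel itself walks the page
tables for an address, using the 3.8-era pgd/pud/pmd/pte helpers and omitting
the *_none()/*_bad() error checks; the hardware MMU does the equivalent walk
on a TLB miss:)

    pgd_t *pgd = pgd_offset(mm, addr);      /* top level, from mm->pgd   */
    pud_t *pud = pud_offset(pgd, addr);
    pmd_t *pmd = pmd_offset(pud, addr);
    pte_t *pte = pte_offset_map(pmd, addr); /* finally, the pte itself   */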

 




 It seems that only around 896MB on x86_32. 


 As you know x86_64 don't need
 highmem, IIUC, all kernel memory will linear mapping in this case. Is my
 idea available? If is correct, x86_32 can't implement in the same way
 since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
 hard to focus kernel memory on single memory device.

 Sorry, I'm not quite familiar with x86_32 box.

 3. In current implementation, if memory hotplug just need memory
 subsystem and ACPI codes support? Or also needs firmware take part in?
 Hope you can explain in details, thanks in advance. :)

 We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
 based memory migration mentioned by Liu Jiang.

 Is there any material about firmware based memory migration?


 So far, I only know this. :)

 4. What's the status of memory hotplug? Apart from can't remove kernel
 memory, other things are fully implementation?

 I think the main job is done for now. And there are still bugs to fix.
 And this functionality is not stable.

 Thanks. :)









 .




 
 
 
 .
 





Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Tang Chen

Hi Simon,

On 02/01/2013 10:17 AM, Simon Jeons wrote:

For example:

                           <-- 64TB, whatever --><-- xxxTB, whatever -->
    logical address space:  |______kernel________|________user_________|
                                \    \                /    /
                                 \    \              /    /
    physical address space:      |____\/____________\/___|   4GB or 8GB, whatever
                                       *

How much address space can a user process have on x86_64? Also 8GB?


Usually we don't put it that way.

8GB is your physical memory, right? But kernel space and user space are
logical concepts in the OS; they live in the logical (virtual) address space.

So both the kernel space and the user space can use all of the physical
memory. But if a page is already in use by either of them, the other one
cannot use it. For example, if some pages are direct-mapped to the kernel and
are in use by the kernel, user space cannot map them.





The physical memory at * is mapped into user space in the process' own page
table. It is also direct-mapped in the kernel's page table, so the kernel can
also access it. :)


But then how do we protect kernel memory from being modified by user processes?


This is the job of the CPU. On Intel CPUs, user-space code runs at privilege
level 3 and kernel-space code runs at privilege level 0, so code running at
level 3 cannot access data that belongs to level 0.

Thanks. :)


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
Hi Tang,
On Fri, 2013-02-01 at 09:57 +0800, Tang Chen wrote:
 On 02/01/2013 09:36 AM, Simon Jeons wrote:
  On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
 
  So if config NUMA, kernel memory will not be linear mapping anymore? For
  example,
 
  Node 0  Node 1
 
  0 ~ 10G 11G~14G
 
 It has nothing to do with linear mapping, I think.
 
 
  kernel memory only at Node 0? Can part of kernel memory also at Node 1?
 
 Please refer to find_zone_movable_pfns_for_nodes().

I see, thanks. :)

 The kernel is not only on node0. It uses all the online nodes evenly. :)
 
 
  How big is kernel direct mapping memory in x86_64? Is there max limit?
 
 
  Max kernel direct mapping memory in x86_64 is 64TB.
 
  For example, I have 8G memory, all of them will be direct mapping for
  kernel? then userspace memory allocated from where?
 
 I think you misunderstood what Wu tried to say. :)
 
 The kernel mapped that large space, it doesn't mean it is using that 
 large space.
 The mapping is to make kernel be able to access all the memory, not for 
 the kernel
 to use only. User space can also use the memory, but each process has 
 its own mapping.
 
 For example:
 
                            <-- 64TB, whatever --><-- xxxTB, whatever -->
     logical address space:  |______kernel________|________user_________|
                                 \    \                /    /
                                  \    \              /    /
     physical address space:      |____\/____________\/___|   4GB or 8GB, whatever
                                        *

How much address space can a user process have on x86_64? Also 8GB?

 
 The * part physical is mapped to user space in the process' own 
 pagetable.
 It is also direct mapped in kernel's pagetable. So the kernel can also 
 access it. :)

But then how do we protect kernel memory from being modified by user processes?

 
 
 
  It seems that only around 896MB on x86_32.
 
 
  We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
  based memory migration mentioned by Liu Jiang.
 
  Is there any material about firmware based memory migration?
 
 No, I don't have any because this is a functionality of machine from HUAWEI.
 I think you can ask Liu Jiang or Wu Jianguo to share some with you. :)
 
 Thanks. :)




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
Hi Tang,
On Fri, 2013-02-01 at 10:42 +0800, Tang Chen wrote:

I'm confused!

 Hi Simon,
 
 On 02/01/2013 10:17 AM, Simon Jeons wrote:
  For example:
 
                             <-- 64TB, whatever --><-- xxxTB, whatever -->
      logical address space:  |______kernel________|________user_________|
                                  \    \                /    /
                                   \    \              /    /
      physical address space:      |____\/____________\/___|   4GB or 8GB, whatever
                                         *
 
  How much address space can a user process have on x86_64? Also 8GB?
 
 Usually, we don't say that.
 
 8GB is your physical memory, right ?
 But kernel space and user space is the logic conception in OS. They are 
 in logic
 address space.
 
 So both the kernel space and the user space can use all the physical memory.
 But if the page is already in use by either of them, the other one 
 cannot use it.
 For example, some pages are direct mapped to kernel, and is in use by 
 kernel, the
 user space cannot map it.

How can we distinguish "mapped" from "used"? I mean, how can we confirm that
memory is actually used by the kernel rather than merely mapped by it?

 
 
 
  The * part physical is mapped to user space in the process' own
  pagetable.
  It is also direct mapped in kernel's pagetable. So the kernel can also
  access it. :)
 
  But how to protect user process not modify kernel memory?
 
 This is the job of CPU. On intel cpus, user space code is running in 
 level 3, and
 kernel space code is running in level 0. So the code in level 3 cannot 
 access the data
 segment in level 0.

1) If a user process and the kernel both map the same physical memory, the
user process gets a SIGSEGV via #PF when it accesses that memory. But why
would a user process be mapped to the same memory the kernel maps if it
cannot access it?
2) If two user processes map the same physical memory, what happens when one
of the processes accesses that memory?

 
 Thanks. :)




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Simon Jeons
On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
 On 2013/1/31 18:38, Simon Jeons wrote:
 
  Hi Tang,
  On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
  Hi Simon,
 
  On 01/31/2013 04:48 PM, Simon Jeons wrote:
  Hi Tang,
  On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
 
  1. IIUC, there is a button on machine which supports hot-remove memory,
  then what's the difference between press button and echo to /sys?
 
  No important difference, I think. Since I don't have the machine you are
  saying, I cannot surely answer you. :)
  AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
  is just another entrance. At last, they will run into the same code.
 
  2. Since kernel memory is linear mapping(I mean direct mapping part),
  why can't put kernel direct mapping memory into one memory device, and
  other memory into the other devices?
 
  We cannot do that because in that way, we will lose NUMA performance.
 
  If you know NUMA, you will understand the following example:
 
     node0:                  node1:
     cpu0~cpu15              cpu16~cpu31
     memory0~memory511       memory512~memory1023
 
 cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
 If we put the direct mapping area on node0 and the movable area on node1,
 then kernel code running on cpu16~cpu31 will always have to access
 memory0~memory511.
 This is a terrible performance hit.
  
  So if config NUMA, kernel memory will not be linear mapping anymore? For
  example, 
  
  Node 0  Node 1 
  
  0 ~ 10G 11G~14G
  
  kernel memory only at Node 0? Can part of kernel memory also at Node 1?
  
  How big is kernel direct mapping memory in x86_64? Is there max limit?
 
 
 Max kernel direct mapping memory in x86_64 is 64TB.

For example, if I have 8G of memory, will all of it be direct-mapped for the
kernel? Then where is user-space memory allocated from?

 
  It seems that only around 896MB on x86_32. 
  
 
  As you know x86_64 don't need
  highmem, IIUC, all kernel memory will linear mapping in this case. Is my
  idea available? If is correct, x86_32 can't implement in the same way
  since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
  hard to focus kernel memory on single memory device.
 
  Sorry, I'm not quite familiar with x86_32 box.
 
  3. In current implementation, if memory hotplug just need memory
  subsystem and ACPI codes support? Or also needs firmware take part in?
  Hope you can explain in details, thanks in advance. :)
 
  We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
  based memory migration mentioned by Liu Jiang.
  
  Is there any material about firmware based memory migration?
  
 
  So far, I only know this. :)
 
  4. What's the status of memory hotplug? Apart from can't remove kernel
  memory, other things are fully implementation?
 
  I think the main job is done for now. And there are still bugs to fix.
  And this functionality is not stable.
 
  Thanks. :)
  
  
  
 
 
 




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-31 Thread Tang Chen

Hi Simon,

On 02/01/2013 11:06 AM, Simon Jeons wrote:


How can we distinguish "mapped" from "used"? I mean, how can we confirm that
memory is actually used by the kernel rather than merely mapped by it?


If the page is free, for example it is sitting in the buddy system, it is not
in use. Even if it is direct-mapped by the kernel, the kernel logic should
not access it, because you didn't allocate it. That is the kernel's own
convention; of course the hardware and the user do not know about it.

If you want to access some memory, you first need a logical address, right?

So how do you get a logical address? You call an alloc API.

For example, when you are coding, you naturally write:

    p = alloc_xxx();  /* allocate memory; now it is in use, and alloc_xxx()
                         lets the kernel know that */
    *p = ...;         /* use the memory */

You won't write:

    p = 0x8745;       /* if you do, the kernel doesn't know it is in use */
    *p = ...;         /* wrong... */

right?

The kernel having mapped a page doesn't mean it is using the page; you still
have to allocate it. That is just the kernel's allocation logic.

Well, I think that is the best answer I can give you for now. If you want
something deeper, I think you need to read how the kernel manages physical
pages. :)
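
(The same idea with real allocator calls, as a minimal sketch; error handling
trimmed for brevity:)

    /* Ask the buddy allocator for one free page; only now is the page
     * "in use" as far as the kernel is concerned. */
    struct page *page = alloc_pages(GFP_KERNEL, 0);

    if (page) {
        void *p = page_address(page); /* its direct-mapped kernel address */
        memset(p, 0, PAGE_SIZE);      /* safe: we own this page now */
        __free_pages(page, 0);        /* hand it back to the buddy system */
    }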



1) If a user process and the kernel both map the same physical memory, the
user process gets a SIGSEGV via #PF when it accesses that memory. But why
would a user process be mapped to the same memory the kernel maps if it
cannot access it?


When you call malloc() to allocate memory in user space, the OS logic will
make sure you don't map a page that is already in use by the kernel.

A page that is mapped by the kernel but not used by the kernel (not
allocated, as above) can be allocated by malloc() and mapped into user space.
This is the situation you are talking about, right?

Now it is mapped by both the kernel and the user, but it is only allocated by
the user, so the kernel will not use it. When the kernel wants some memory,
it will allocate some other memory. That is just the kernel's logic; it is
what the memory management subsystem does.

I don't think I can answer in more depth, because I'm also a student of
memory management. This is just my understanding, and I hope it is
helpful. :)


2) If two user processes map the same physical memory, what happens when one
of the processes accesses that memory?


Obviously you don't need to worry about this situation. We can swap out the
page used by process 1, and process 2 can then use the same physical page.
When process 1 wants to access it again, we swap it back in. This only
happens when there is not enough physical memory to go around. :)

And also, if you are using shared memory in user space, like

    shmget(), shmat() ...

then it is shared memory, and both processes can use it at the same time.
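
(A minimal user-space sketch of that shared-memory case; the key 0x1234 is
just an arbitrary example value, and error checking is omitted:)

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Two cooperating processes doing the same shmget()/shmat() end up
         * with mappings backed by the same physical pages. */
        int id = shmget(0x1234, 4096, IPC_CREAT | 0600);
        char *p = shmat(id, NULL, 0);

        strcpy(p, "hello");     /* visible to the other process too */
        shmdt(p);
        return 0;
    }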

Thanks. :)


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-30 Thread Tang Chen

Hi Simon,

Please see below. :)

On 01/29/2013 08:52 PM, Simon Jeons wrote:

Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:

Here is the physical memory hot-remove patch-set based on 3.8rc-2.


Some questions for you; they are not related to this patch set, but they are
memory hotplug stuff.

1. In function node_states_check_changes_online, the comment says:

    * If we don't have HIGHMEM nor movable node,
    * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
    * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How should I understand this? Why does node_states[N_NORMAL_MEMORY] contain
0...ZONE_MOVABLE when we have neither HIGHMEM nor a movable node? IIUC,
N_NORMAL_MEMORY only means the node has regular memory.



First of all, I think we need to understand why we need N_MEMORY.

In order to support a movable node, which has only ZONE_MOVABLE (the last
zone), we introduce N_MEMORY to represent nodes that have normal, highmem or
movable memory.

Here, "we have a movable node" means you configured CONFIG_MOVABLE_NODE.
This config option does not mean we don't have movable pages (NO);
it means we don't have a node which has only movable pages, i.e. only
ZONE_MOVABLE (YES).

So, if we don't have CONFIG_MOVABLE_NODE (we don't have a movable node), we
don't need a separate node_states[] element for it, because we will never
have a node which has only ZONE_MOVABLE.

So,
1) If we have neither highmem nor a movable node, then N_MEMORY ==
   N_HIGH_MEMORY == N_NORMAL_MEMORY, which means N_NORMAL_MEMORY acts as
   N_MEMORY. If we online pages as movable, we need to update
   node_states[N_NORMAL_MEMORY].

Please refer to the definition of enum zone_type: if we don't have
CONFIG_HIGHMEM, we won't have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE
are always there. So we can still have movable pages, and zone_last should be
ZONE_MOVABLE.

Again, because we will never have a node with only ZONE_MOVABLE, we just need
to update node_states[N_NORMAL_MEMORY].


* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How to understand?


2) This code is inside #ifdef CONFIG_HIGHMEM, which means we do have highmem.
   So if we don't have a movable node, N_MEMORY == N_HIGH_MEMORY, and
   N_HIGH_MEMORY acts as N_MEMORY. If we online pages as movable, we need to
   update node_states[N_NORMAL_MEMORY].
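
(A rough sketch of the relevant part of node_states_check_changes_online() in
the 3.8-era mm/memory_hotplug.c, paraphrased from memory rather than quoted
verbatim:)

    static void node_states_check_changes_online(unsigned long nr_pages,
            struct zone *zone, struct memory_notify *arg)
    {
        int nid = zone_to_nid(zone);
        enum zone_type zone_last = ZONE_NORMAL;

        /* Without HIGHMEM and without a movable node, N_NORMAL_MEMORY
         * covers everything up to ZONE_MOVABLE. */
        if (N_MEMORY == N_NORMAL_MEMORY)
            zone_last = ZONE_MOVABLE;

        if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
            arg->status_change_nid_normal = nid;
        else
            arg->status_change_nid_normal = -1;
        /* ... similar handling for N_HIGH_MEMORY and N_MEMORY ... */
    }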



2. In function move_pfn_range_left, why is end_pfn <= z2->zone_start_pfn not
allowed? The comment says "must include/overlap" -- why?



This one is easy, if I understand you correctly.
move_pfn_range_left() is used to move the left-most part [start_pfn, end_pfn)
of z2 into z1. So if end_pfn <= z2->zone_start_pfn, the range
[start_pfn, end_pfn) is not part of z2 at all.

Then it fails.
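
(The range checks look roughly like this -- paraphrased from memory of the
3.8-era mm/memory_hotplug.c, so treat the details as a sketch:)

    /* move the left-most [start_pfn, end_pfn) of @z2 into @z1 */

    /* can't move pfns which are higher than @z2 */
    if (end_pfn > zone_end_pfn(z2))
        goto out_fail;
    /* the moved-out part must be at the left-most end of @z2 */
    if (start_pfn > z2->zone_start_pfn)
        goto out_fail;
    /* must include/overlap: otherwise the range isn't part of @z2 at all */
    if (end_pfn <= z2->zone_start_pfn)
        goto out_fail;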


3. In function online_pages, in the normal case (without online_kernel or
online_movable), why don't we check whether the new zone overlaps with the
adjacent zones?



Can a zone overlap with the others? I don't think so.

One pfn can only be in one zone:

    zone = page_zone(pfn_to_page(pfn));

so it couldn't overlap with the others, I think. :)

But maybe I misunderstand you. :)


4. Could you summarize the differences in implementation between hot-add and
logical add, and between hot-remove and logical remove?


Sorry, I don't quite understand what you mean by logical add/remove.
Would you please explain a bit more?

If you meant the sysfs interfaces, I think they are just another set of
entry points into memory hotplug.

Thanks.  :)






This patch-set aims to implement physical memory hot-removing.

The patches can free/remove the following things:

   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
   - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
   - page table of removed memory  : [RFC PATCH 7,8,10/15]
   - node and related sysfs files  : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we will allocate memory to store the page cgroup
when we online pages.

For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory storing its
page cgroup is not provided by this memory device. But when we online
memory9, the memory storing its page cgroup may be provided by memory8. So we
can't offline memory8 now; we should offline the memory in reverse order.

When the memory device is hot-removed, we will automatically offline the
memory provided by this memory device. But we don't know which memory was
onlined first, so offlining memory may fail.

In patch 1, we provide a solution which is not good enough:
iterate twice to offline the memory.
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

And a new idea from Wen Congyang <we...@cn.fujitsu.com> is:
allocate the memory from the memory block they are 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-30 Thread Tang Chen

On 01/30/2013 06:15 PM, Tang Chen wrote:

Hi Simon,

Please see below. :)

On 01/29/2013 08:52 PM, Simon Jeons wrote:

Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:

Here is the physical memory hot-remove patch-set based on 3.8rc-2.


Some questions ask you, not has relationship with this patchset, but is
memory hotplug stuff.

1. In function node_states_check_changes_online:

comments:
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How to understand it? Why we don't have HIGHMEM nor movable node and
node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
N_NORMAL_MEMORY only means the node has regular memory.



First of all, I think we need to understand why we need N_MEMORY.

In order to support movable node, which has only ZONE_MOVABLE (the last
zone),
we introduce N_MEMORY to represent the node has normal, highmem and
movable memory.

Here, we have movable node means you configured CONFIG_MOVABLE_NODE.


Sorry, that should be: "we don't have a movable node" means you didn't
configure CONFIG_MOVABLE_NODE.



This config option doesn't mean we don't have movable pages, (NO)
it means we don't have a node which has only movable pages (only have
ZONE_MOVABLE). (YES)

Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node),
we don't need a
separate node_states[] element to represent a particular node because we
won't have a node
which has only ZONE_MOVABLE.

So,
1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY
== N_NORMAL_MEMORY,
which means N_NORMAL_MEMORY effects as N_MEMORY. If we online pages as
movable, we need
to update node_states[N_NORMAL_MEMORY].

Please refer to the definition of enum zone_type, if we don't have
CONFIG_HIGHMEM, we won't
have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always there.
So we can have movable
pages, and the zone_last should be ZONE_MOVABLE.

Again, because we won't have a node only having ZONE_MOVABLE, so we just
need to update
node_states[N_NORMAL_MEMORY].


* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How to understand?


2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem,
so if we don't have
movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY effects as
N_MEMORY. If we
online pages as movable, we need to update node_states[N_NORMAL_MEMORY].



2. In function move_pfn_range_left, why end= z2-zone_start_pfn is not
correct? The comments said that must include/overlap, why?



This one is easy, if I understand you correctly.
move_pfn_range_left() is used to move the left most part [start_pfn,
end_pfn) of z2 to z1.
So if end_pfn= z2-zone_start_pfn, it means [start_pfn, end_pfn) is not
part of z2.
Then it fails.


3. In function online_pages, the normal case(w/o online_kenrel,
online_movable), why not check if the new zone is overlap with adjacent
zones?



Can a zone overlap with the others ? I don't think so.

One pfn could only be in one zone,
zone = page_zone(pfn_to_page(pfn));

it could overlap with others, I think. :)

But maybe I misunderstand you. :)


4. Could you summarize the difference implementation between hot-add and
logic-add, hot-remove and logic-remove?


Sorry, I don't quite understand what do you mean by logic-add/remove.
Would you please explain more ?

If you meant the sys fs interfaces, I think they are just another set of
entrances
of memory hotplug.

Thanks. :)






This patch-set aims to implement physical memory hot-removing.

The patches can free/remove the following things:

- /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
- memmap of sparse-vmemmap : [PATCH 6,7,8,10/15]
- page table of removed memory : [RFC PATCH 7,8,10/15]
- node and related sysfs files : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we will allocate memory to store page
cgroup
when we online pages.

For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9,
memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory
stored page
cgroup is not provided by this memory device. But when we online
memory9, the
memory stored page cgroup may be provided by memory8. So we can't
offline
memory8 now. We should offline the memory in the reversed order.

When the memory device is hotremoved, we will auto offline memory
provided
by this memory device. But we don't know which memory is onlined
first, so
offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.

And a new idea from Wen Congyangwe...@cn.fujitsu.com is:
allocate 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-30 Thread Simon Jeons
Hi Tang,
On Wed, 2013-01-30 at 18:15 +0800, Tang Chen wrote:
 Hi Simon,
 
 Please see below. :)
 
 On 01/29/2013 08:52 PM, Simon Jeons wrote:
  Hi Tang,
 
  On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
  Here is the physical memory hot-remove patch-set based on 3.8rc-2.
 
  Some questions ask you, not has relationship with this patchset, but is
  memory hotplug stuff.
 
  1. In function node_states_check_changes_online:
 
  comments:
  * If we don't have HIGHMEM nor movable node,
  * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
  * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
 
  How to understand it? Why we don't have HIGHMEM nor movable node and
  node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
  N_NORMAL_MEMORY only means the node has regular memory.
 
 
 First of all, I think we need to understand why we need N_MEMORY.
 
 In order to support movable node, which has only ZONE_MOVABLE (the last 
 zone),
 we introduce N_MEMORY to represent the node has normal, highmem and 
 movable memory.
 
 Here, we have movable node means you configured CONFIG_MOVABLE_NODE.
 This config option doesn't mean we don't have movable pages, (NO)
 it means we don't have a node which has only movable pages (only have 
 ZONE_MOVABLE). (YES)
 
 Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node), 
 we don't need a
 separate node_states[] element to represent a particular node because we 
 won't have a node
 which has only ZONE_MOVABLE.
 
 So,
 1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY 
 == N_NORMAL_MEMORY,
 which means N_NORMAL_MEMORY effects as N_MEMORY. If we online pages 
 as movable, we need
 to update node_states[N_NORMAL_MEMORY].

Sorry, I'm still confused. :(
Do we update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY], or does
node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE?

 
 Please refer to the definition of enum zone_type, if we don't have 
 CONFIG_HIGHMEM, we won't
 have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always there. 
 So we can have movable
 pages, and the zone_last should be ZONE_MOVABLE.

Which one is node_states here: node_states[N_NORMAL_MEMORY] or
node_states[N_MEMORY]?

 
 Again, because we won't have a node only having ZONE_MOVABLE, so we just 
 need to update
 node_states[N_NORMAL_MEMORY].
 
  * If we don't have movable node, node_states[N_NORMAL_MEMORY]
  * contains nodes which have zones of 0...ZONE_MOVABLE,
  * set zone_last to ZONE_MOVABLE.
 
  How to understand?
 
 2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem, 
 so if we don't have
 movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY effects 
 as N_MEMORY. If we
 online pages as movable, we need to update node_states[N_NORMAL_MEMORY].
 
 
  2. In function move_pfn_range_left, why end= z2-zone_start_pfn is not
  correct? The comments said that must include/overlap, why?
 
 
 This one is easy, if I understand you correctly.
 move_pfn_range_left() is used to move the left most part [start_pfn, 
 end_pfn) of z2 to z1.
 So if end_pfn= z2-zone_start_pfn, it means [start_pfn, end_pfn) is not 
 part of z2.
 Then it fails.

Yup, very clear now. :)
Why do we check !z1->wait_table in move_pfn_range_left() and __add_zone()? I
think zone->wait_table is initialized in free_area_init_core(), which is
called during system initialization and on the hotadd_new_pgdat() path.

 
  3. In function online_pages, the normal case(w/o online_kenrel,
  online_movable), why not check if the new zone is overlap with adjacent
  zones?
 
 
 Can a zone overlap with the others ? I don't think so.
 
 One pfn could only be in one zone,
 zone = page_zone(pfn_to_page(pfn));

thanks. :)

There is a populated_zone() check in online_pages(). But the zone is
populated in free_area_init_core(), which is called during system
initialization and on the hotadd_new_pgdat() path. Why is this check still
needed?

 
 it could overlap with others, I think. :)
 
 But maybe I misunderstand you. :)
 
  4. Could you summarize the difference implementation between hot-add and
  logic-add, hot-remove and logic-remove?
 
 Sorry, I don't quite understand what do you mean by logic-add/remove.
 Would you please explain more ?
 
 If you meant the sys fs interfaces, I think they are just another set of 
 entrances
 of memory hotplug.

Please ignore this silly question. :(

 
 Thanks.  :)
 
 
 
 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-30 Thread Tang Chen

Hi Simon,

Please see below. :)

On 01/31/2013 09:22 AM, Simon Jeons wrote:


 Sorry, I'm still confused. :(
 Do we update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY], or does
 node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE?
 
 Which one is node_states here: node_states[N_NORMAL_MEMORY] or
 node_states[N_MEMORY]?


Are you asking what node_states[] is ?

node_states[] is an array of nodemasks:

    extern nodemask_t node_states[NR_NODE_STATES];

For example, node_states[N_NORMAL_MEMORY] represents which nodes have
normal memory.

If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, then node_states[N_MEMORY]
is node_states[N_NORMAL_MEMORY], so it represents which nodes have zones
0 ... ZONE_MOVABLE.




 Why do we check !z1->wait_table in move_pfn_range_left() and __add_zone()? I
 think zone->wait_table is initialized in free_area_init_core(), which is
 called during system initialization and on the hotadd_new_pgdat() path.


I think,

free_area_init_core(), in the for loop:
 |-- size = zone_spanned_pages_in_node();
 |-- if (!size)
         continue;   <---- if the zone is empty, we skip this zone,
                           so init_currently_empty_zone() is never run for it.
 |-- init_currently_empty_zone()

So, if the zone is empty, wait_table is not initialized.

In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1
could be empty. So we need to check it and initialize z1->wait_table
because we are moving pages into it.
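
To illustrate the pattern Tang describes, a tiny user-space sketch; here
ensure_wait_table() is a hypothetical stand-in for the wait-table setup that
init_currently_empty_zone() normally performs:

#include <stdbool.h>
#include <stdlib.h>

struct zone {
	void *wait_table;		/* stays NULL while the zone is empty */
	unsigned long present_pages;
};

/* hypothetical helper: set up the wait table the first time it is needed */
static bool ensure_wait_table(struct zone *z)
{
	if (z->wait_table)		/* already set up at boot or earlier hot-add */
		return true;
	z->wait_table = calloc(256, sizeof(void *));
	return z->wait_table != NULL;
}

/* before moving pages into z1, make sure its wait table exists */
static bool prepare_dest_zone(struct zone *z1)
{
	return ensure_wait_table(z1);
}

int main(void)
{
	struct zone z1 = { 0 };		/* an empty zone skipped at boot */

	return prepare_dest_zone(&z1) ? 0 : 1;
}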




There is a zone populated check in function online_pages. But zone is
populated in free_area_init_core which will be called during system
initialization and hotadd_new_pgdat path. Why still need this check?



Because we could also rebuild the zonelists when we offline pages.

__offline_pages()
 |-- zone->present_pages -= offlined_pages;
 |-- if (!populated_zone(zone)) {
         build_all_zonelists(NULL, NULL);
     }

If the zone becomes empty, but other zones on the same node are not empty, the
node won't be offlined, and next time we online pages of this zone the pgdat
won't be initialized again. So we need to check populated_zone(zone) when
onlining pages.
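
A compressed user-space model of that life cycle (illustration only; the
in_zonelist flag stands in for whether the allocator's zonelists currently
include the zone):

#include <stdbool.h>
#include <stdio.h>

struct zone {
	unsigned long present_pages;
	bool in_zonelist;	/* does the allocator currently see this zone? */
};

static bool populated_zone(const struct zone *z)
{
	return z->present_pages != 0;
}

/* offline path: the pgdat survives, but an emptied zone drops out of zonelists */
static void model_offline_pages(struct zone *z, unsigned long nr)
{
	z->present_pages -= nr;
	if (!populated_zone(z))
		z->in_zonelist = false;	/* the build_all_zonelists() step */
}

/* online path: the pgdat is NOT re-created, so the zone itself is re-checked */
static void model_online_pages(struct zone *z, unsigned long nr)
{
	bool was_empty = !populated_zone(z);

	z->present_pages += nr;
	if (was_empty)
		z->in_zonelist = true;	/* rebuild zonelists to include it again */
}

int main(void)
{
	struct zone z = { .present_pages = 512, .in_zonelist = true };

	model_offline_pages(&z, 512);	/* zone goes empty, node stays online */
	model_online_pages(&z, 512);	/* must notice the zone was empty */
	printf("zone back in zonelist: %d\n", z.in_zonelist);
	return 0;
}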

Thanks. :)



Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-30 Thread Simon Jeons
Hi Tang,
On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
 Hi Simon,
 
 Please see below. :)
 
 On 01/31/2013 09:22 AM, Simon Jeons wrote:
 
  Sorry, I still confuse. :(
  update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
  node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
 
  node_states is what? node_states[N_NORMAL_MEMOR] or
  node_states[N_MEMORY]?
 
 Are you asking what node_states[] is ?
 
 node_states[] is an array of nodemask,
 
  extern nodemask_t node_states[NR_NODE_STATES];
 
 For example, node_states[N_NORMAL_MEMORY] represents which nodes have
 normal memory.
 If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
 node_states[N_NORMAL_MEMORY]. So it represents which nodes have 0 ...
 ZONE_MOVABLE.
 

Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have 0 ...
*ZONE_MOVABLE*? The comment of enum node_states says that
N_NORMAL_MEMORY just means the node has regular memory.

 
  Why check !z1->wait_table in function move_pfn_range_left and function
  __add_zone? I think zone->wait_table is initialized in
  free_area_init_core, which will be called during system initialization
  and hotadd_new_pgdat path.
 
 I think,
 
 free_area_init_core(), in the for loop,
   |-- size = zone_spanned_pages_in_node();
   |-- if (!size)
           continue;   <---- if the zone is empty, we skip this zone,
                             so init_currently_empty_zone() is never run for it.
   |-- init_currently_empty_zone()
 
 So, if the zone is empty, wait_table is not initialized.
 
 In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1
 could be empty. So we need to check it and initialize z1->wait_table
 because we are moving pages into it.

thanks.

 
 
  There is a zone populated check in function online_pages. But zone is
  populated in free_area_init_core which will be called during system
  initialization and hotadd_new_pgdat path. Why still need this check?
 
 
 Because we could also rebuild zone list when we offline pages.
 
 __offline_pages()
   |-- zone->present_pages -= offlined_pages;
   |-- if (!populated_zone(zone)) {
           build_all_zonelists(NULL, NULL);
       }
 
 If the zone becomes empty, but other zones on the same node are not empty, the
 node won't be offlined, and next time we online pages of this zone the pgdat
 won't be initialized again. So we need to check populated_zone(zone) when
 onlining pages.

thanks.

 
 Thanks. :)
 




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-30 Thread Tang Chen

On 01/31/2013 02:19 PM, Simon Jeons wrote:

Hi Tang,
On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:

Hi Simon,

Please see below. :)

On 01/31/2013 09:22 AM, Simon Jeons wrote:


Sorry, I still confuse. :(
update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?

node_states is what? node_states[N_NORMAL_MEMOR] or
node_states[N_MEMORY]?


Are you asking what node_states[] is ?

node_states[] is an array of nodemask,

  extern nodemask_t node_states[NR_NODE_STATES];

For example, node_states[N_NORMAL_MEMORY] represents which nodes have
normal memory.
If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
node_states[N_NORMAL_MEMORY]. So it represents which nodes have 0 ...
ZONE_MOVABLE.



Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have 0 ...
*ZONE_MOVABLE*? The comment of enum node_states says that
N_NORMAL_MEMORY just means the node has regular memory.



Hi Simon,

Let's say it in this way.

If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We don't
have a separate macro to represent highmem because we don't have highmem.
This is easy to understand, right ?

Now, think of it just like the above:
If we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY.
This means we don't allow a node to have only movable memory, not that we don't
have movable memory at all. A node could have both normal memory and movable
memory. So node_states[N_NORMAL_MEMORY] represents nodes that have
0 ... *ZONE_MOVABLE*.

I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have only
movable memory. So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot
have movable memory. It means the node cannot have only movable memory. It can
have normal memory and movable memory.


1) With CONFIG_MOVABLE_NODE:
   N_NORMAL_MEMORY: nodes which have normal memory.
        normal memory only
        normal and highmem
        normal and highmem and movablemem
        normal and movablemem
   N_MEMORY: nodes which have memory (any memory).
        normal memory only
        normal and highmem
        normal and highmem and movablemem
        normal and movablemem       <---- We can have movablemem.
        highmem only                <----
        highmem and movablemem      <----
        movablemem only             <---- We can have movablemem only. ***

2) Without CONFIG_MOVABLE_NODE:
   N_MEMORY == N_NORMAL_MEMORY: (here, I omit N_HIGH_MEMORY)
        normal memory only
        normal and highmem
        normal and highmem and movablemem
        normal and movablemem       <---- We can have movablemem.
        No movablemem only          <---- We cannot have movablemem only. ***


The semantics are not that clear here. So we can only try to understand
it from the code where we use N_MEMORY. :)

That is my understanding of this.
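
To make that set relationship concrete, a tiny user-space model. This is an
illustration only; in the kernel these are nodemask_t bitmaps in node_states[],
not plain integers:

#include <stdio.h>

/* model each node state as a bitmask of node ids: node i == bit i */
enum { NODE0 = 1 << 0, NODE1 = 1 << 1, NODE2 = 1 << 2 };

int main(void)
{
	/* node0: normal only, node1: normal + movable, node2: movable only */
	unsigned int n_normal_memory = NODE0 | NODE1;		/* has ZONE_NORMAL */
	unsigned int n_memory	     = NODE0 | NODE1 | NODE2;	/* has any memory  */

	/* with CONFIG_MOVABLE_NODE the two masks can differ (node2 here) */
	printf("movable-only nodes: 0x%x\n", n_memory & ~n_normal_memory);

	/*
	 * Without CONFIG_MOVABLE_NODE, N_MEMORY is simply defined to
	 * N_NORMAL_MEMORY: a node like node2 is not allowed to exist, so
	 * every node with memory also has some normal memory and the two
	 * masks are always identical.
	 */
	return 0;
}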

Thanks. :)






Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-29 Thread Simon Jeons
Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
 Here is the physical memory hot-remove patch-set based on 3.8rc-2.

Some questions to ask you; they are not related to this patchset, but are
general memory hotplug stuff.

1. In function node_states_check_changes_online:

comments:
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How to understand it? Why, when we don't have HIGHMEM nor movable node, does
node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC,
N_NORMAL_MEMORY only means the node has regular memory.

* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How to understand?

2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
correct? The comment says it must include/overlap, why?

3. In function online_pages, in the normal case (w/o online_kernel,
online_movable), why not check whether the new zone overlaps with adjacent
zones?

4. Could you summarize the difference in implementation between hot-add and
logic-add, hot-remove and logic-remove?


 
 This patch-set aims to implement physical memory hot-removing.
 
 The patches can free/remove the following things:
 
   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
   - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
   - page table of removed memory  : [RFC PATCH 7,8,10/15]
   - node and related sysfs files  : [RFC PATCH 13-15/15]
 
 
 Existing problem:
 If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
 when we online pages.
 
 For example: there is a memory device on node 1. The address range
 is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
 and memory11 under the directory /sys/devices/system/memory/.
 
 If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
 cgroup is not provided by this memory device. But when we online memory9, the
 memory stored page cgroup may be provided by memory8. So we can't offline
 memory8 now. We should offline the memory in the reversed order.
 
 When the memory device is hotremoved, we will auto offline memory provided
 by this memory device. But we don't know which memory is onlined first, so
 offlining memory may fail.
 
 In patch1, we provide a solution which is not good enough:
 Iterate twice to offline the memory.
 1st iterate: offline every non primary memory block.
 2nd iterate: offline primary (i.e. first added) memory block.
 
 And a new idea from Wen Congyang we...@cn.fujitsu.com is:
 allocate the memory from the memory block they are describing.
 
 But we are not sure if it is OK to do so because there is not existing API
 to do so, and we need to move page_cgroup memory allocation from 
 MEM_GOING_ONLINE
 to MEM_ONLINE. And also, it may interfere the hugepage.
 
 
 
 How to test this patchset?
 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
 2. load the module acpi_memhotplug
 3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
 4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
 5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.

Is there a similar sysfs knob to hot-add the memory device?

 
 
 Note: if the memory provided by the memory device is used by the kernel, it
 can't be offlined. It is not a bug.
 
 
 Changelogs from v5 to v6:
  Patch3: Add some more comments to explain memory hot-remove.
  Patch4: Remove bootmem member in struct firmware_map_entry.
  Patch6: Repeatedly register bootmem pages when using hugepage.
  Patch8: Repeatedly free bootmem pages when using hugepage.
  Patch14: Don't free pgdat when offlining a node, just reset it to 0.
  Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
   one when online a node.
 
 Changelogs from v4 to v5:
  Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
  avoid disabling irq because we need flush tlb when free pagetables.
  Patch8: new patch, pick up some common APIs that are used to free direct 
 mapping
  and vmemmap pagetables.
  Patch9: free direct mapping pagetables on x86_64 arch.
  Patch10: free vmemmap pagetables.
  Patch11: since freeing memmap with vmemmap has been implemented, the config
   macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
   no longer needed.
  Patch13: no need to modify acpi_memory_disable_device() since it was removed,
   and add nid 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-29 Thread Tang Chen

On 01/29/2013 08:52 PM, Simon Jeons wrote:

Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:

Here is the physical memory hot-remove patch-set based on 3.8rc-2.


Hi Simon,

I'll summarize all the info and answer you later. :)

Thanks for asking. :)



Some questions ask you, not has relationship with this patchset, but is
memory hotplug stuff.

1. In function node_states_check_changes_online:

comments:
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How to understand it? Why we don't have HIGHMEM nor movable node and
node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
N_NORMAL_MEMORY only means the node has regular memory.

* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How to understand?

2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
correct? The comment says it must include/overlap, why?

3. In function online_pages, the normal case(w/o online_kenrel,
online_movable), why not check if the new zone is overlap with adjacent
zones?

4. Could you summarize the difference implementation between hot-add and
logic-add, hot-remove and logic-remove?




This patch-set aims to implement physical memory hot-removing.

The patches can free/remove the following things:

   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
   - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
   - page table of removed memory  : [RFC PATCH 7,8,10/15]
   - node and related sysfs files  : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
when we online pages.

For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
cgroup is not provided by this memory device. But when we online memory9, the
memory stored page cgroup may be provided by memory8. So we can't offline
memory8 now. We should offline the memory in the reversed order.

When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.

And a new idea from Wen Congyangwe...@cn.fujitsu.com  is:
allocate the memory from the memory block they are describing.

But we are not sure if it is OK to do so because there is not existing API
to do so, and we need to move page_cgroup memory allocation from 
MEM_GOING_ONLINE
to MEM_ONLINE. And also, it may interfere the hugepage.



How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.


Is there a similar knode to hot-add the memory device?




Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.


Changelogs from v5 to v6:
  Patch3: Add some more comments to explain memory hot-remove.
  Patch4: Remove bootmem member in struct firmware_map_entry.
  Patch6: Repeatedly register bootmem pages when using hugepage.
  Patch8: Repeatedly free bootmem pages when using hugepage.
  Patch14: Don't free pgdat when offlining a node, just reset it to 0.
  Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
   one when online a node.

Changelogs from v4 to v5:
  Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
  avoid disabling irq because we need flush tlb when free pagetables.
  Patch8: new patch, pick up some common APIs that are used to free direct 
mapping
  and vmemmap pagetables.
  Patch9: free direct mapping pagetables on x86_64 arch.
  Patch10: free vmemmap pagetables.
  Patch11: since freeing memmap with vmemmap has been implemented, the config
   macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
   no longer needed.
  Patch13: no need to modify 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-29 Thread Simon Jeons
On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote:
 On 01/29/2013 08:52 PM, Simon Jeons wrote:
  Hi Tang,
 
  On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
  Here is the physical memory hot-remove patch-set based on 3.8rc-2.
 
 Hi Simon,
 
 I'll summarize all the info and answer you later. :)
 
 Thanks for asking. :)

Thanks Tang. IIRC, there's a qemu feature that supports memory hot-add/remove
emulation if we don't have a machine which supports memory hot-add/remove
to test. Is that qemu feature merged? Otherwise, where can I get that
patchset?

 
 
  Some questions ask you, not has relationship with this patchset, but is
  memory hotplug stuff.
 
  1. In function node_states_check_changes_online:
 
  comments:
  * If we don't have HIGHMEM nor movable node,
  * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
  * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
 
  How to understand it? Why we don't have HIGHMEM nor movable node and
  node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
  N_NORMAL_MEMORY only means the node has regular memory.
 
  * If we don't have movable node, node_states[N_NORMAL_MEMORY]
  * contains nodes which have zones of 0...ZONE_MOVABLE,
  * set zone_last to ZONE_MOVABLE.
 
  How to understand?
 
  2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
  correct? The comment says it must include/overlap, why?
 
  3. In function online_pages, the normal case(w/o online_kenrel,
  online_movable), why not check if the new zone is overlap with adjacent
  zones?
 
  4. Could you summarize the difference implementation between hot-add and
  logic-add, hot-remove and logic-remove?
 
 
 
  This patch-set aims to implement physical memory hot-removing.
 
  The patches can free/remove the following things:
 
 - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
 - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
 - page table of removed memory  : [RFC PATCH 7,8,10/15]
 - node and related sysfs files  : [RFC PATCH 13-15/15]
 
 
  Existing problem:
  If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
  when we online pages.
 
  For example: there is a memory device on node 1. The address range
  is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
  and memory11 under the directory /sys/devices/system/memory/.
 
  If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
  cgroup is not provided by this memory device. But when we online memory9, 
  the
  memory stored page cgroup may be provided by memory8. So we can't offline
  memory8 now. We should offline the memory in the reversed order.
 
  When the memory device is hotremoved, we will auto offline memory provided
  by this memory device. But we don't know which memory is onlined first, so
  offlining memory may fail.
 
  In patch1, we provide a solution which is not good enough:
  Iterate twice to offline the memory.
  1st iterate: offline every non primary memory block.
  2nd iterate: offline primary (i.e. first added) memory block.
 
  And a new idea from Wen Congyangwe...@cn.fujitsu.com  is:
  allocate the memory from the memory block they are describing.
 
  But we are not sure if it is OK to do so because there is not existing API
  to do so, and we need to move page_cgroup memory allocation from 
  MEM_GOING_ONLINE
  to MEM_ONLINE. And also, it may interfere the hugepage.
 
 
 
  How to test this patchset?
  1. apply this patchset and build the kernel. MEMORY_HOTPLUG, 
  MEMORY_HOTREMOVE,
  ACPI_HOTPLUG_MEMORY must be selected.
  2. load the module acpi_memhotplug
  3. hotplug the memory device(it depends on your hardware)
  You will see the memory device under the directory 
  /sys/bus/acpi/devices/.
  Its name is PNP0C80:XX.
  4. online/offline pages provided by this memory device
  You can write online/offline to 
  /sys/devices/system/memory/memoryX/state to
  online/offline pages provided by this memory device
  5. hotremove the memory device
  You can hotremove the memory device by the hardware, or writing 1 to
  /sys/bus/acpi/devices/PNP0C80:XX/eject.
 
  Is there a similar knode to hot-add the memory device?
 
 
 
  Note: if the memory provided by the memory device is used by the kernel, it
  can't be offlined. It is not a bug.
 
 
  Changelogs from v5 to v6:
Patch3: Add some more comments to explain memory hot-remove.
Patch4: Remove bootmem member in struct firmware_map_entry.
Patch6: Repeatedly register bootmem pages when using hugepage.
Patch8: Repeatedly free bootmem pages when using hugepage.
Patch14: Don't free pgdat when offlining a node, just reset it to 0.
Patch15: New patch, pgdat is not freed in patch14, so don't allocate a 
  new
 one when online a node.
 
  Changelogs from v4 to v5:
Patch7: new patch, move pgdat_resize_lock into 
  sparse_remove_one_section() to
avoid 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-29 Thread Tang Chen

On 01/30/2013 10:48 AM, Simon Jeons wrote:

On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote:

On 01/29/2013 08:52 PM, Simon Jeons wrote:

Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:

Here is the physical memory hot-remove patch-set based on 3.8rc-2.


Hi Simon,

I'll summarize all the info and answer you later. :)

Thanks for asking. :)


Thanks Tang, IIRC, there's qemu feature support memory hot-add/remove
emulation if we don't have machine which supports memory hot-add/remove
to test. Is that qemu feature merged? Otherwise where can I get that
patchset?


Hi Simon,

There are patches to support hot-add/remove in qemu, but they are not 
merged yet.

You can get the latest patches here:
http://lists.nongnu.org/archive/html/qemu-devel/2012-12/msg02693.html

BTW, it is unstable and full of problems, and you need to compile your
own seabios too.


Thanks. :)







Some questions ask you, not has relationship with this patchset, but is
memory hotplug stuff.

1. In function node_states_check_changes_online:

comments:
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How to understand it? Why we don't have HIGHMEM nor movable node and
node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
N_NORMAL_MEMORY only means the node has regular memory.

* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How to understand?

2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
correct? The comment says it must include/overlap, why?

3. In function online_pages, the normal case(w/o online_kenrel,
online_movable), why not check if the new zone is overlap with adjacent
zones?

4. Could you summarize the difference implementation between hot-add and
logic-add, hot-remove and logic-remove?




This patch-set aims to implement physical memory hot-removing.

The patches can free/remove the following things:

- /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
- memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
- page table of removed memory  : [RFC PATCH 7,8,10/15]
- node and related sysfs files  : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
when we online pages.

For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
cgroup is not provided by this memory device. But when we online memory9, the
memory stored page cgroup may be provided by memory8. So we can't offline
memory8 now. We should offline the memory in the reversed order.

When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.

And a new idea from Wen Congyangwe...@cn.fujitsu.com   is:
allocate the memory from the memory block they are describing.

But we are not sure if it is OK to do so because there is not existing API
to do so, and we need to move page_cgroup memory allocation from 
MEM_GOING_ONLINE
to MEM_ONLINE. And also, it may interfere the hugepage.



How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
 ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
 You will see the memory device under the directory /sys/bus/acpi/devices/.
 Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
 You can write online/offline to /sys/devices/system/memory/memoryX/state to
 online/offline pages provided by this memory device
5. hotremove the memory device
 You can hotremove the memory device by the hardware, or writing 1 to
 /sys/bus/acpi/devices/PNP0C80:XX/eject.


Is there a similar knode to hot-add the memory device?




Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.


Changelogs from v5 to v6:
   Patch3: Add some more comments to explain memory hot-remove.
   Patch4: Remove bootmem member in struct firmware_map_entry.
   Patch6: Repeatedly register bootmem pages when using hugepage.
   Patch8: Repeatedly free bootmem pages when using hugepage.
   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-10 Thread Kamezawa Hiroyuki

(2013/01/10 16:55), Glauber Costa wrote:

On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:

(2013/01/10 16:14), Glauber Costa wrote:

On 01/10/2013 06:17 AM, Tang Chen wrote:

Note: if the memory provided by the memory device is used by the
kernel, it
can't be offlined. It is not a bug.


Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?


We test the hot-remove functionality mostly with movable_online used.
And the memory used by kernel is not allowed to be removed.


Can you try doing this using cpusets configured to hardwall ?
It is my understanding that the object allocators will try hard not to
allocate anything outside the walls defined by cpuset. Which means that
if you have one process per node, and they are hardwalled, your kernel
memory will be spread evenly among the machine. With a big enough load,
they should eventually be present in all blocks.



I'm sorry I couldn't catch your point.
Do you want to confirm whether cpuset can work enough instead of
ZONE_MOVABLE ?
Or Do you want to confirm whether ZONE_MOVABLE will not work if it's
used with cpuset ?



No, I am not proposing to use cpuset do tackle the problem. I am just
wondering if you would still have high success rates with cpusets in use
with hardwalls. This is just one example of a workload that would spread
kernel memory around quite heavily.

So this is just me trying to understand the limitations of the mechanism.



Hm, okay. In my understanding, if the whole memory of a node is configured as
MOVABLE, no kernel memory will be allocated on that node because the zonelist
will not match. So, if cpuset is used with hardwalls, the user will see -ENOMEM
or OOM, I guess. Even fork() will fail if fallback-to-other-node is not allowed.

If it's configured as ZONE_NORMAL, you need to pray for offlining memory.

AFAIK, IBM's ppc? has a 16MB section size. So, some of the sections can be offlined
even if they are configured as ZONE_NORMAL. For them, the placement of offlined
memory is not important because it's virtualized by LPAR; they don't try
to remove a DIMM, they just want to increase/decrease the amount of memory.
It's another approach.

But here, we (Fujitsu) try to remove a system board/DIMM.
So, we configure the whole memory of a node as ZONE_MOVABLE and try to
guarantee that the DIMM is removable.


IMHO, I don't think shrink_slab() can kill all objects in a node even
if they are some caches. We need more study for doing that.



Indeed, shrink_slab can only kill cached objects. They, however, are
usually a very big part of kernel memory. I wonder though if in case of
failure, it is worth it to try at least one shrink pass before you give up.



Yeah, for now, his (our) approach never allows kernel memory on a node that is to
be hot-removed, by using ZONE_MOVABLE. So shrink_slab()'s effect will not be seen.

If other brave guys try to use ZONE_NORMAL for hot-pluggable DIMMs, I see,
it's worth trying.

How about checking whether the target memory section is in ZONE_NORMAL or in
ZONE_MOVABLE at hot-remove time? If it is in ZONE_NORMAL, shrink_slab() will be
worth calling.
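
In code, the suggested policy is just a one-line predicate at offline time. A
toy sketch follows; nothing like this exists in the patch set, and the enum is
purely illustrative:

#include <stdbool.h>
#include <stdio.h>

/* illustrative classification of the section being offlined */
enum zone_kind { KIND_NORMAL, KIND_MOVABLE };

/* only bother shrinking slab for ZONE_NORMAL sections: ZONE_MOVABLE holds
 * no slab objects by construction, so shrinking buys nothing there */
static bool worth_calling_shrink_slab(enum zone_kind kind)
{
	return kind == KIND_NORMAL;
}

int main(void)
{
	printf("normal:  %d\n", worth_calling_shrink_slab(KIND_NORMAL));	/* 1 */
	printf("movable: %d\n", worth_calling_shrink_slab(KIND_MOVABLE));	/* 0 */
	return 0;
}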

BTW, is shrink_slab() node/zone aware now? If not, fixing that first will
be a better direction, I guess.

Thanks,
-Kame




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-10 Thread Glauber Costa

 If it's configure as ZONE_NORMAL, you need to pray for offlining memory.
 
 AFAIK, IBM's ppc? has 16MB section size. So, some of sections can be
 offlined
 even if they are configured as ZONE_NORMAL. For them, placement of offlined
 memory is not important because it's virtualized by LPAR, they don't try
 to remove DIMM, they just want to increase/decrease amount of memory.
 It's an another approach.
 
 But here, we(fujitsu) tries to remove a system board/DIMM.
 So, configuring the whole memory of a node as ZONE_MOVABLE and tries to
 guarantee
 DIMM as removable.
 
 IMHO, I don't think shrink_slab() can kill all objects in a node even
 if they are some caches. We need more study for doing that.


 Indeed, shrink_slab can only kill cached objects. They, however, are
 usually a very big part of kernel memory. I wonder though if in case of
 failure, it is worth it to try at least one shrink pass before you
 give up.

 
 Yeah, now, his (our) approach is never allowing kernel memory on a node
 to be
 hot-removed by ZONE_MOVABLE. So, shrink_slab()'s effect will not be seen.

Ok, that clarifies it to me.
 
 If other brave guys tries to use ZONE_NORMAL for hot-pluggable DIMM, I see,
 it's worth triying.
 
I was under the impression that this was being done in here.

 How about checking the target memsection is in NORMAL or in MOVABLE at
 hot-removing ? If NORMAL, shrink_slab() will be worth to be called.
 
Yes, this is what I meant. I think there is value in investigating this,
since for a lot of workloads, a lot of the kernel memory will consist of
shrinkable cached memory. It would provide you with the same level of
guarantees (zero), but can improve the success rate (this is, of
course, a guess).


 BTW, shrink_slab() is now node/zone aware ? If not, fixing that first will
 be better direction I guess.
 
It is not upstream, but there are patches for this that I am already
using in my private tree.




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-10 Thread Kamezawa Hiroyuki

(2013/01/10 17:36), Glauber Costa wrote:
 

BTW, shrink_slab() is now node/zone aware ? If not, fixing that first will
be better direction I guess.


It is not upstream, but there are patches for this that I am already
using in my private tree.



Oh, I see. If it's merged, it's worth adding shrink_slab() in the ZONE_NORMAL
code path.

Thanks,
-Kame



[PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Tang Chen
Here is the physical memory hot-remove patch-set based on 3.8rc-2.

This patch-set aims to implement physical memory hot-removing.

The patches can free/remove the following things:

  - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
  - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
  - page table of removed memory  : [RFC PATCH 7,8,10/15]
  - node and related sysfs files  : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
when we online pages.

For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory that stores its
page cgroup is not provided by this memory device. But when we online memory9, the
memory that stores its page cgroup may be provided by memory8. So we can't offline
memory8 now. We should offline the memory in the reverse order.

When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.
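
A rough user-space sketch of what those two iterations amount to (illustration
only; offline_block() is a hypothetical wrapper around writing "offline" to a
block's sysfs state file, and index 0 stands for the primary, i.e. first added,
block):

#include <stdbool.h>
#include <stdio.h>

/* hypothetical: returns true if memoryN was offlined successfully */
static bool offline_block(int block)
{
	printf("offlining memory%d\n", block);
	return true;
}

/* 1st iterate: every non-primary block; 2nd iterate: the primary block */
static bool offline_device_blocks(const int *blocks, int nr)
{
	for (int i = 1; i < nr; i++)
		if (!offline_block(blocks[i]))
			return false;
	return offline_block(blocks[0]);
}

int main(void)
{
	int blocks[] = { 8, 9, 10, 11 };	/* the example device's blocks */

	return offline_device_blocks(blocks, 4) ? 0 : 1;
}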

And a new idea from Wen Congyang we...@cn.fujitsu.com is:
allocate the memory from the memory block they are describing.

But we are not sure if it is OK to do so because there is not existing API
to do so, and we need to move page_cgroup memory allocation from 
MEM_GOING_ONLINE
to MEM_ONLINE. And also, it may interfere with hugepages.



How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
   ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
   You will see the memory device under the directory /sys/bus/acpi/devices/.
   Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
   You can write online/offline to /sys/devices/system/memory/memoryX/state to
   online/offline pages provided by this memory device
5. hotremove the memory device
   You can hotremove the memory device by the hardware, or writing 1 to
   /sys/bus/acpi/devices/PNP0C80:XX/eject.


Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.
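
For step 4 above, a minimal user-space helper that just writes the memory
block's state file is enough; a sketch with minimal error handling, using the
sysfs layout described above:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write "online" or "offline" to /sys/devices/system/memory/memoryX/state */
static int set_memory_block_state(int block, const char *state)
{
	char path[128];
	int fd, ret = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%d/state", block);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, state, strlen(state)) < 0)
		ret = -1;
	close(fd);
	return ret;
}

int main(void)
{
	/* e.g. online the four blocks of the example device, memory8..memory11 */
	for (int block = 8; block <= 11; block++)
		if (set_memory_block_state(block, "online"))
			perror("online");
	return 0;
}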


Changelogs from v5 to v6:
 Patch3: Add some more comments to explain memory hot-remove.
 Patch4: Remove bootmem member in struct firmware_map_entry.
 Patch6: Repeatedly register bootmem pages when using hugepage.
 Patch8: Repeatedly free bootmem pages when using hugepage.
 Patch14: Don't free pgdat when offlining a node, just reset it to 0.
 Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
  one when online a node.

Changelogs from v4 to v5:
 Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
 avoid disabling irq because we need flush tlb when free pagetables.
 Patch8: new patch, pick up some common APIs that are used to free direct 
mapping
 and vmemmap pagetables.
 Patch9: free direct mapping pagetables on x86_64 arch.
 Patch10: free vmemmap pagetables.
 Patch11: since freeing memmap with vmemmap has been implemented, the config
  macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
  no longer needed.
 Patch13: no need to modify acpi_memory_disable_device() since it was removed,
  and add nid parameter when calling remove_memory().

Changelogs from v3 to v4:
 Patch7: remove unused codes.
 Patch8: fix nr_pages that is passed to free_map_bootmem()

Changelogs from v2 to v3:
 Patch9: call sync_global_pgds() if pgd is changed
 Patch10: fix a problem int the patch

Changelogs from v1 to v2:
 Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
 memory block. 2nd iterate: offline primary (i.e. first added) memory
 block.

 Patch3: new patch, no logical change, just remove redundant code.

 Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
 after the pagetable is changed.

 Patch12: new patch, free node_data when a node is offlined.


Tang Chen (6):
  memory-hotplug: move pgdat_resize_lock into
sparse_remove_one_section()
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: Integrated __remove_section() of
CONFIG_SPARSEMEM_VMEMMAP.
  memory-hotplug: remove sysfs file of node
  memory-hotplug: Do not allocate pdgat if it was not freed when
offline.

Wen Congyang (5):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug: 

Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 Here is the physical memory hot-remove patch-set based on 3.8rc-2.
 
 This patch-set aims to implement physical memory hot-removing.
 
 The patches can free/remove the following things:
 
   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
   - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
   - page table of removed memory  : [RFC PATCH 7,8,10/15]
   - node and related sysfs files  : [RFC PATCH 13-15/15]
 
 
 Existing problem:
 If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
 when we online pages.
 
 For example: there is a memory device on node 1. The address range
 is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
 and memory11 under the directory /sys/devices/system/memory/.
 
 If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
 cgroup is not provided by this memory device. But when we online memory9, the
 memory stored page cgroup may be provided by memory8. So we can't offline
 memory8 now. We should offline the memory in the reversed order.
 
 When the memory device is hotremoved, we will auto offline memory provided
 by this memory device. But we don't know which memory is onlined first, so
 offlining memory may fail.

This does sound like a significant problem.  We should assume that
memcg is available and in use.

 In patch1, we provide a solution which is not good enough:
 Iterate twice to offline the memory.
 1st iterate: offline every non primary memory block.
 2nd iterate: offline primary (i.e. first added) memory block.

Let's flesh this out a bit.

If we online memory8, memory9, memory10 and memory11 then I'd have
thought that they would need to offlined in reverse order, which will
require four iterations, not two.  Is this wrong and if so, why?

Also, what happens if we wish to offline only memory9?  Do we offline
memory11 then memory10 then memory9 and then re-online memory10 and
memory11?

 And a new idea from Wen Congyang we...@cn.fujitsu.com is:
 allocate the memory from the memory block they are describing.

Yes.

 But we are not sure if it is OK to do so because there is not existing API
 to do so, and we need to move page_cgroup memory allocation from 
 MEM_GOING_ONLINE
 to MEM_ONLINE.

This all sounds solvable - can we proceed in this fashion?

 And also, it may interfere the hugepage.

Please provide full details on this problem.

 Note: if the memory provided by the memory device is used by the kernel, it
 can't be offlined. It is not a bug.

Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?  Are there precautions which the administrator
can take to improve the success rate?  What are the remaining problems
and are there plans to address them?




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 This patch-set aims to implement physical memory hot-removing.

As you were on the patch delivery path, all of these patches should have
your Signed-off-by:.  But some were missing it.  I fixed this in my
copy of the patches.


I suspect this patchset adds a significant amount of code which will
not be used if CONFIG_MEMORY_HOTPLUG=n.  [PATCH v6 06/15]
memory-hotplug: implement register_page_bootmem_info_section of
sparse-vmemmap, for example.  This is not a good thing, so please go
through the patchset (in fact, go through all the memhotplug code) and
let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
kernels.

This needn't be done immediately - it would be OK by me if you were to
defer this exercise until all the new memhotplug code is largely in
place.  But please, let's do it.




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Tang Chen

Hi Andrew,

Thank you very much for your pushing. :)

On 01/10/2013 06:23 AM, Andrew Morton wrote:


This does sound like a significant problem.  We should assume that
mmecg is available and in use.


In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.


Let's flesh this out a bit.

If we online memory8, memory9, memory10 and memory11 then I'd have
thought that they would need to offlined in reverse order, which will
require four iterations, not two.  Is this wrong and if so, why?


Well, we may need more than two iterations if all memory8, memory9,
memory10 are in use by kernel, and 10 depends on 9, 9 depends on 8.

So, as you see here, the iteration method is not good enough.

But this only happens when the memory is used by the kernel, which cannot
be migrated. So if we use a boot option such as movablecore_map, or the
movable_online functionality, to limit the memory as movable, the kernel will
not use this memory. So it is safe when we are doing node hot-remove.



Also, what happens if we wish to offline only memory9?  Do we offline
memory11 then memory10 then memory9 and then re-online memory10 and
memory11?


In this case, offlining memory9 could fail if the user does this by himself,
for example using sysfs.

But the path you mention is the memory hot-remove path. So when we remove a
memory device, it will automatically offline all pages, and it does so in
reverse order by itself.

And again, this is not good enough. We will figure out a reasonable way
to solve it soon.




And a new idea from Wen Congyangwe...@cn.fujitsu.com  is:
allocate the memory from the memory block they are describing.


Yes.


But we are not sure if it is OK to do so because there is not existing API
to do so, and we need to move page_cgroup memory allocation from 
MEM_GOING_ONLINE
to MEM_ONLINE.


This all sounds solvable - can we proceed in this fashion?


Yes, we are in progress now.




And also, it may interfere the hugepage.


Please provide full details on this problem.


It is not very clear now, and if I find something, I'll share it out.




Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.


Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?


We test the hot-remove functionality mostly with movable_online used.
And the memory used by kernel is not allowed to be removed.

We will do some tests in the kernel memory offline cases, and tell you
the test results soon.

And since we are trying out some other ways, I think the problem will
be solved soon.


Are there precautions which the administrator
can take to improve the success rate?


The administrator could use the movablecore_map boot option or the movable_online
functionality (which is now in the kernel) to limit memory as movable and
avoid this problem.


What are the remaining problems
and are there plans to address them?


For now, we will try to allocate the page_cgroup data on the memory block which
it is itself describing.

And we are still testing. If we have any problem, we will share.

Thanks. :)









Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Tang Chen

Hi Andrew,

On 01/10/2013 07:33 AM, Andrew Morton wrote:

On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chentangc...@cn.fujitsu.com  wrote:


This patch-set aims to implement physical memory hot-removing.


As you were on th patch delivery path, all of these patches should have
your Signed-off-by:.  But some were missing it.  I fixed this in my
copy of the patches.


Thank you very much for the help. Next time I'll add it myself.




I suspect this patchset adds a significant amount of code which will
not be used if CONFIG_MEMORY_HOTPLUG=n.  [PATCH v6 06/15]
memory-hotplug: implement register_page_bootmem_info_section of
sparse-vmemmap, for example.  This is not a good thing, so please go
through the patchset (in fact, go through all the memhotplug code) and
let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
kernels.

This needn't be done immediately - it would be OK by me if you were to
defer this exercise until all the new memhotplug code is largely in
place.  But please, let's do it.


OK, I'll have a check on it when the page_cgroup problem is solved.

Thanks. :)









Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Glauber Costa
On 01/10/2013 06:17 AM, Tang Chen wrote:
 Note: if the memory provided by the memory device is used by the
 kernel, it
 can't be offlined. It is not a bug.

 Right.  But how often does this happen in testing?  In other words,
 please provide an overall description of how well memory hot-remove is
 presently operating.  Is it reliable?  What is the success rate in
 real-world situations?
 
 We test the hot-remove functionality mostly with movable_online used.
 And the memory used by kernel is not allowed to be removed.

Can you try doing this using cpusets configured to hardwall ?
It is my understanding that the object allocators will try hard not to
allocate anything outside the walls defined by cpuset. Which means that
if you have one process per node, and they are hardwalled, your kernel
memory will be spread evenly among the machine. With a big enough load,
they should eventually be present in all blocks.

Another question I have for you: Have you considered calling
shrink_slab to try to deplete the caches and therefore free at least the
slab memory in the nodes that can't be offlined? Is it relevant?


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Kamezawa Hiroyuki

(2013/01/10 16:14), Glauber Costa wrote:

On 01/10/2013 06:17 AM, Tang Chen wrote:

Note: if the memory provided by the memory device is used by the
kernel, it
can't be offlined. It is not a bug.


Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?


We test the hot-remove functionality mostly with movable_online used.
And the memory used by kernel is not allowed to be removed.


Can you try doing this using cpusets configured to hardwall ?
It is my understanding that the object allocators will try hard not to
allocate anything outside the walls defined by cpuset. Which means that
if you have one process per node, and they are hardwalled, your kernel
memory will be spread evenly among the machine. With a big enough load,
they should eventually be present in all blocks.



I'm sorry I couldn't catch your point.
Do you want to confirm whether cpuset can work enough instead of ZONE_MOVABLE ?
Or Do you want to confirm whether ZONE_MOVABLE will not work if it's used with 
cpuset ?



Another question I have for you: Have you considering calling
shrink_slab to try to deplete the caches and therefore free at least
slab memory in the nodes that can't be offlined? Is it relevant?



At this stage, we don't consider calling shrink_slab(). We require
nearly 100% success at offlining memory for removing a DIMM.
That's my understanding.

IMHO, I don't think shrink_slab() can kill all objects in a node even
if they are some caches. We need more study for doing that.

Thanks,
-Kame




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Glauber Costa
On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
 (2013/01/10 16:14), Glauber Costa wrote:
 On 01/10/2013 06:17 AM, Tang Chen wrote:
 Note: if the memory provided by the memory device is used by the
 kernel, it
 can't be offlined. It is not a bug.

 Right.  But how often does this happen in testing?  In other words,
 please provide an overall description of how well memory hot-remove is
 presently operating.  Is it reliable?  What is the success rate in
 real-world situations?

 We test the hot-remove functionality mostly with movable_online used.
 And the memory used by kernel is not allowed to be removed.

 Can you try doing this using cpusets configured to hardwall ?
 It is my understanding that the object allocators will try hard not to
 allocate anything outside the walls defined by cpuset. Which means that
 if you have one process per node, and they are hardwalled, your kernel
 memory will be spread evenly among the machine. With a big enough load,
 they should eventually be present in all blocks.

 
 I'm sorry I couldn't catch your point.
 Do you want to confirm whether cpuset can work enough instead of
 ZONE_MOVABLE ?
 Or Do you want to confirm whether ZONE_MOVABLE will not work if it's
 used with cpuset ?
 
 
No, I am not proposing to use cpuset do tackle the problem. I am just
wondering if you would still have high success rates with cpusets in use
with hardwalls. This is just one example of a workload that would spread
kernel memory around quite heavily.

So this is just me trying to understand the limitations of the mechanism.

 Another question I have for you: Have you considering calling
 shrink_slab to try to deplete the caches and therefore free at least
 slab memory in the nodes that can't be offlined? Is it relevant?

 
 At this stage, we don't consider to call shrink_slab(). We require
 nearly 100% success at offlining memory for removing DIMM.
 It's my understanding.
 
Of course, this is indisputable.

 IMHO, I don't think shrink_slab() can kill all objects in a node even
 if they are some caches. We need more study for doing that.
 

Indeed, shrink_slab can only kill cached objects. They, however, are
usually a very big part of kernel memory. I wonder though if in case of
failure, it is worth it to try at least one shrink pass before you give up.

It is not very different from what is in memory-failure.c, except that
we could do better and do a more targeted shrinking (support for that
is being worked on).

