Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
On 01/31/2013 02:19 PM, Simon Jeons wrote:
Hi Tang, On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote: Hi Simon, please see below. :)
On 01/31/2013 09:22 AM, Simon Jeons wrote:

Sorry, I'm still confused. :( Does updating node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] mean node_states[N_MEMORY] represents zones 0...ZONE_MOVABLE? What is node_states? node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY]?

Are you asking what node_states[] is? node_states[] is an array of nodemasks:

extern nodemask_t node_states[NR_NODE_STATES];

For example, node_states[N_NORMAL_MEMORY] represents which nodes have normal memory. If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is node_states[N_NORMAL_MEMORY]. So it represents which nodes have memory in zones 0 ... ZONE_MOVABLE.

Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have zones 0 ... *ZONE_MOVABLE*? The comment on enum node_states says that N_NORMAL_MEMORY just means the node has regular memory.

Hi Simon, let's put it this way. If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We don't have a separate macro to represent highmem because we don't have highmem. This is easy to understand, right?

Now, think of it the same way: if we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY. This means we don't allow a node to have only movable memory, not that we don't have movable memory. A node could have both normal memory and movable memory. So node_states[N_NORMAL_MEMORY] represents nodes that have zones 0 ... *ZONE_MOVABLE*.

I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have only movable memory. So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot have movable memory; it means the node cannot have *only* movable memory. It can have normal memory plus movable memory.

1) With CONFIG_MOVABLE_NODE:

N_NORMAL_MEMORY: nodes which have normal memory:
  normal memory only
  normal and highmem
  normal and highmem and movablemem
  normal and movablemem
  => We can have movablemem.

N_MEMORY: nodes which have memory (any memory):
  normal memory only
  normal and highmem
  normal and highmem and movablemem
  normal and movablemem
  highmem only
  highmem and movablemem
  movablemem only
  => We can have movablemem only. ***

2) Without CONFIG_MOVABLE_NODE:

N_MEMORY == N_NORMAL_MEMORY (here, I omit N_HIGH_MEMORY):
  normal memory only
  normal and highmem
  normal and highmem and movablemem
  normal and movablemem
  => We can have movablemem, but no "movablemem only". We cannot have movablemem only. ***

The semantics are not that clear here, so we can only try to understand it from the code where N_MEMORY is used. :) That is my understanding of it.

Thanks for your clarification, very clear now. :)

Thanks. :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

1. IIUC, there is a button on machines which support hot-remove memory; what's the difference between pressing the button and echoing to /sys?

2. Since kernel memory is linearly mapped (I mean the direct mapping part), why can't we put the kernel's direct-mapped memory into one memory device, and other memory into the other devices? As you know, x86_64 doesn't need highmem; IIUC, all kernel memory is linearly mapped in this case. Is my idea workable? If so, x86_32 can't be implemented the same way, since highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's hard to confine kernel memory to a single memory device.

3. In the current implementation, does memory hotplug just need memory subsystem and ACPI code support? Or does it also need the firmware to take part? Hope you can explain in detail, thanks in advance. :)

4. What's the status of memory hotplug? Apart from not being able to remove kernel memory, is everything else fully implemented?

[snip]
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon,

On 01/31/2013 04:48 PM, Simon Jeons wrote:

1. IIUC, there is a button on machines which support hot-remove memory; what's the difference between pressing the button and echoing to /sys?

No important difference, I think. Since I don't have the machine you mention, I cannot answer with certainty. :) AFAIK, pressing the button triggers the hotplug from hardware; sysfs is just another entrance. In the end, they run into the same code.

2. Since kernel memory is linearly mapped (I mean the direct mapping part), why can't we put the kernel's direct-mapped memory into one memory device, and other memory into the other devices?

We cannot do that because that way we would lose NUMA performance. If you know NUMA, you will understand the following example:

node0:               node1:
cpu0~cpu15           cpu16~cpu31
memory0~memory511    memory512~memory1023

cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511. If we put the direct mapping area on node0 and the movable area on node1, then kernel code running on cpu16~cpu31 will have to access memory0~memory511. That is a terrible performance hit.

As you know, x86_64 doesn't need highmem; IIUC, all kernel memory is linearly mapped in this case. Is my idea workable? If so, x86_32 can't be implemented the same way, since highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's hard to confine kernel memory to a single memory device.

Sorry, I'm not very familiar with x86_32 boxes.

3. In the current implementation, does memory hotplug just need memory subsystem and ACPI code support? Or does it also need the firmware to take part? Hope you can explain in detail, thanks in advance. :)

We need the firmware to take part, such as the SRAT in the ACPI BIOS, or the firmware-based memory migration mentioned by Liu Jiang. So far, that is all I know. :)

4. What's the status of memory hotplug? Apart from not being able to remove kernel memory, is everything else fully implemented?

I think the main job is done for now. There are still bugs to fix, and this functionality is not yet stable.

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:

[snip]

If we put the direct mapping area on node0 and the movable area on node1, then kernel code running on cpu16~cpu31 will have to access memory0~memory511. That is a terrible performance hit.

So if NUMA is configured, kernel memory will not be linearly mapped anymore? For example:

Node 0     Node 1
0 ~ 10G    11G~14G

Is kernel memory only on Node 0? Can part of kernel memory also be on Node 1? How big is the kernel direct mapping on x86_64? Is there a max limit? It seems to be only around 896MB on x86_32.

[snip]

We need the firmware to take part, such as the SRAT in the ACPI BIOS, or the firmware-based memory migration mentioned by Liu Jiang.

Is there any material about firmware-based memory migration?

[snip]

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 2013/1/31 18:38, Simon Jeons wrote:

[snip]

How big is the kernel direct mapping on x86_64? Is there a max limit?

Max kernel direct mapping memory on x86_64 is 64TB.

It seems to be only around 896MB on x86_32.

[snip]

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 02/01/2013 09:36 AM, Simon Jeons wrote:
On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:

So if NUMA is configured, kernel memory will not be linearly mapped anymore? For example:

Node 0     Node 1
0 ~ 10G    11G~14G

It has nothing to do with linear mapping, I think.

Is kernel memory only on Node 0? Can part of kernel memory also be on Node 1?

Please refer to find_zone_movable_pfns_for_nodes(). The kernel is not only on node0. It uses all the online nodes evenly. :)

How big is the kernel direct mapping on x86_64? Is there a max limit?

Max kernel direct mapping memory on x86_64 is 64TB.

For example, I have 8G memory; will all of it be direct-mapped for the kernel? Then where is userspace memory allocated from?

I think you misunderstood what Wu tried to say. :) The kernel mapped that large space, but that doesn't mean it is using that large space. The mapping makes the kernel able to access all the memory; it is not for the kernel's use only. User space can also use the memory, but each process has its own mapping. For example (64TB, xxxTB, whatever):

logical address space:   |___kernel___|___user___|
                               \   \     /   /
                                \   \   /   /
physical address space:  |_______\___\_/___/____|   4GB or 8GB, whatever
                                       *

The * part of physical memory is mapped to user space in the process' own pagetable. It is also direct-mapped in the kernel's pagetable, so the kernel can also access it. :)

It seems to be only around 896MB on x86_32.

[snip]

Is there any material about firmware-based memory migration?

No, I don't have any, because this is a functionality of a machine from HUAWEI. I think you can ask Liu Jiang or Wu Jianguo to share some with you. :)

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 2013/2/1 9:36, Simon Jeons wrote:
On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:

[snip]

Max kernel direct mapping memory on x86_64 is 64TB.

For example, I have 8G memory; will all of it be direct-mapped for the kernel? Then where is userspace memory allocated from?

Direct-mapped memory means you can use __va() and __pa() on it; it does not mean it can only be used by the kernel. It can be used by user space too, as long as it is free.

It seems to be only around 896MB on x86_32.

[snip]

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Jianguo,

On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote:

[snip]

Direct-mapped memory means you can use __va() and __pa() on it; it does not mean it can only be used by the kernel. It can be used by user space too, as long as it is free.

IIUC, the benefit of __va() and __pa() is just quickly converting between virtual and physical addresses; it takes advantage of the linear mapping. But the MMU still needs to walk pgd/pud/pmd/pte, correct?

[snip]

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 2013/2/1 10:06, Simon Jeons wrote:

[snip]

IIUC, the benefit of __va() and __pa() is just quickly converting between virtual and physical addresses; it takes advantage of the linear mapping. But the MMU still needs to walk pgd/pud/pmd/pte, correct?

Yes.

[snip]
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon,

On 02/01/2013 10:17 AM, Simon Jeons wrote:

[snip]

How much address space can a user process have on x86_64? Also 8GB?

Usually, we don't put it that way. 8GB is your physical memory, right? But kernel space and user space are logical concepts in the OS; they live in the logical (virtual) address space. So both the kernel space and the user space can use all of the physical memory, but if a page is already in use by either of them, the other one cannot use it. For example, some pages are direct-mapped and in use by the kernel; user space cannot map them.

But how is the user process prevented from modifying kernel memory?

This is the job of the CPU. On Intel CPUs, user-space code runs in ring 3 and kernel-space code runs in ring 0, so code in ring 3 cannot access data segments in ring 0.

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Fri, 2013-02-01 at 09:57 +0800, Tang Chen wrote:

[snip]

Please refer to find_zone_movable_pfns_for_nodes().

I see, thanks. :)

[snip]

User space can also use the memory, but each process has its own mapping.

How much address space can a user process have on x86_64? Also 8GB?

The * part of physical memory is mapped to user space in the process' own pagetable. It is also direct-mapped in the kernel's pagetable, so the kernel can also access it. :)

But how is the user process prevented from modifying kernel memory?

[snip]

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Fri, 2013-02-01 at 10:42 +0800, Tang Chen wrote:

I'm confused!

[snip]

For example, some pages are direct-mapped and in use by the kernel; user space cannot map them.

How can we distinguish "mapped" from "in use"? I mean, how can we confirm that memory is used by the kernel, rather than merely mapped by it?

[snip]

This is the job of the CPU. On Intel CPUs, user-space code runs in ring 3 and kernel-space code runs in ring 0.

1) If a user process and the kernel map the same physical memory, the user process will get a SIGSEGV on the #PF when it accesses that memory. But why would a user process map the same memory the kernel maps, if it can't access it?

2) If two user processes map the same physical memory, what happens when one of them accesses it?

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote: On 2013/1/31 18:38, Simon Jeons wrote: Hi Tang, On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote: Hi Simon, On 01/31/2013 04:48 PM, Simon Jeons wrote: Hi Tang, On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote: 1. IIUC, there is a button on machines which support memory hot-remove; what's the difference between pressing the button and echoing to /sys? No important difference, I think. Since I don't have the machine you mention, I cannot answer you with certainty. :) AFAIK, pressing the button triggers the hotplug from hardware; sysfs is just another entrance. In the end, they run into the same code. 2. Since kernel memory is linearly mapped (I mean the direct mapping part), why can't we put the kernel direct-mapped memory into one memory device, and the other memory into the other devices? We cannot do that because that way we would lose NUMA performance. If you know NUMA, you will understand the following example: node0: cpu0~cpu15, memory0~memory511; node1: cpu16~cpu31, memory512~memory1023. cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511. If we set the direct mapping area on node0 and the movable area on node1, then kernel code running on cpu16~cpu31 will have to access memory0~memory511. This would be a terrible performance hit. So if we config NUMA, kernel memory will not be linearly mapped anymore? For example: Node 0: 0 ~ 10G, Node 1: 11G ~ 14G. Is kernel memory only on Node 0? Can part of kernel memory also be on Node 1? How big is the kernel direct mapping in x86_64? Is there a max limit? The max kernel direct mapping in x86_64 is 64TB. For example, I have 8G memory; will all of it be direct-mapped for the kernel? Then where is userspace memory allocated from? It seems that it is only around 896MB on x86_32. As you know, x86_64 doesn't need highmem; IIUC, all kernel memory will be linearly mapped in this case. Is my idea workable?
If that is correct, x86_32 can't be implemented in the same way, since highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's hard to confine kernel memory to a single memory device. Sorry, I'm not quite familiar with x86_32 boxes. 3. In the current implementation, does memory hotplug just need support from the memory subsystem and the ACPI code? Or does it also need the firmware to take part? Hope you can explain in detail; thanks in advance. :) We need the firmware to take part, such as SRAT in the ACPI BIOS, or the firmware-based memory migration mentioned by Liu Jiang. Is there any material about firmware-based memory migration? So far, I only know this. :) 4. What's the status of memory hotplug? Apart from not being able to remove kernel memory, is everything else fully implemented? I think the main job is done for now. There are still bugs to fix, and this functionality is not yet stable. Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon, On 02/01/2013 11:06 AM, Simon Jeons wrote: How can we distinguish mapped from used? I mean, how can we confirm that memory is used by the kernel rather than merely mapped? If the page is free, for example it is in the buddy system, it is not in use. Even if it is direct-mapped by the kernel, the kernel logic should not access it, because you didn't allocate it. This is the kernel's logic. Of course the hardware and the user will not know this. You want to access some memory, so you should first have a logical address, right? So how can you get a logical address? You call an alloc API. For example, when you are coding, of course you write: p = alloc_xxx(); /* allocate memory; now it is in use, and alloc_xxx() lets the kernel know it */ *p = ...; /* use the memory */ You won't write: p = 0x8745; /* if so, the kernel doesn't know it is in use */ *p = ...; /* wrong... */ right? The kernel mapped a page; that doesn't mean it is using the page. You should allocate it. That is just the kernel's allocation logic. Well, I think I can only give you this answer now. If you want something deeper, I think you need to read how the kernel manages physical pages. :) 1) If a user process and the kernel map the same physical memory, the user process will get SIGSEGV during #PF if it accesses this memory; but why would a user process map the same memory the kernel maps, if it can't access it? When you call malloc() to allocate memory in user space, the OS logic will ensure that you won't map a page that is already in use by the kernel. A page that is mapped by the kernel but not used by it (not allocated, as above) can be allocated by malloc() and mapped to user space. This is the situation you are talking about, right? Now it is mapped by kernel and user, but it is only allocated by the user. So the kernel will not use it. When the kernel wants some memory, it will allocate some other memory. This is just the kernel's logic. This is what the memory management subsystem does. I think I cannot answer more because I'm also a student in memory management.
This is just my understanding, and I hope it is helpful. :) 2) If two user processes map the same physical memory, what will happen if one process accesses the memory? Obviously you don't need to worry about this situation. We can swap the page used by process 1 out, and process 2 can use the same page. When process 1 wants to access it again, we swap it in. This only happens when physical memory is not enough. :) And also, if you are using shared memory in user space, like shmget(), shmat()... it is shared memory; both processes can use it at the same time. Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon, Please see below. :) On 01/29/2013 08:52 PM, Simon Jeons wrote: Hi Tang, On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2. Some questions for you; they don't relate to this patchset, but are about memory hotplug in general. 1. In function node_states_check_changes_online, the comment says: * If we don't have HIGHMEM nor movable node, * node_states[N_NORMAL_MEMORY] contains nodes which have zones of * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE. How should I understand it? Why, when we have neither HIGHMEM nor movable node, does node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular memory. First of all, I think we need to understand why we need N_MEMORY. In order to support movable node, which has only ZONE_MOVABLE (the last zone), we introduce N_MEMORY to represent nodes that have normal, highmem or movable memory. Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE. This config option doesn't mean we don't have movable pages (NO); it means we don't have a node which has only movable pages (only ZONE_MOVABLE) (YES). Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node), we don't need a separate node_states[] element to represent such a node, because we won't have a node which has only ZONE_MOVABLE. So, 1) if we have neither highmem nor movable node, N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, which means N_NORMAL_MEMORY acts as N_MEMORY. If we online pages as movable, we need to update node_states[N_NORMAL_MEMORY]. Please refer to the definition of enum zone_type: if we don't have CONFIG_HIGHMEM, we won't have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always be there. So we can have movable pages, and zone_last should be ZONE_MOVABLE. Again, because we won't have a node having only ZONE_MOVABLE, we just need to update node_states[N_NORMAL_MEMORY].
* If we don't have movable node, node_states[N_NORMAL_MEMORY] * contains nodes which have zones of 0...ZONE_MOVABLE, * set zone_last to ZONE_MOVABLE. How should I understand this? 2) This code is in #ifdef CONFIG_HIGHMEM, which means we have highmem; so if we don't have movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY acts as N_MEMORY. If we online pages as movable, we need to update node_states[N_NORMAL_MEMORY]. 2. In function move_pfn_range_left, why is end_pfn <= z2->zone_start_pfn not correct? The comment says "must include/overlap"; why? This one is easy, if I understand you correctly. move_pfn_range_left() is used to move the leftmost part [start_pfn, end_pfn) of z2 to z1. So if end_pfn <= z2->zone_start_pfn, it means [start_pfn, end_pfn) is not part of z2. Then it fails. 3. In function online_pages, in the normal case (without online_kernel or online_movable), why not check whether the new zone overlaps with adjacent zones? Can a zone overlap with the others? I don't think so. One pfn can only be in one zone: zone = page_zone(pfn_to_page(pfn)); It could overlap with others, I think. :) But maybe I misunderstand you. :) 4. Could you summarize the different implementations of hot-add vs. logical-add, and hot-remove vs. logical-remove? Sorry, I don't quite understand what you mean by logical-add/remove. Would you please explain more? If you meant the sysfs interfaces, I think they are just another set of entrances to memory hotplug. Thanks. :) This patch-set aims to implement physical memory hot-removing. The patches can free/remove the following things: - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15] - memmap of sparse-vmemmap : [PATCH 6,7,8,10/15] - page table of removed memory : [RFC PATCH 7,8,10/15] - node and related sysfs files : [RFC PATCH 13-15/15] Existing problem: If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup when we online pages. For example: there is a memory device on node 1. The address range is [1G, 1.5G).
You will find 4 new directories, memory8, memory9, memory10, and memory11, under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, when we online memory8, the memory that stores its page cgroup is not provided by this memory device. But when we online memory9, the memory that stores its page cgroup may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reverse order. When the memory device is hot-removed, we will automatically offline the memory provided by this memory device. But we don't know which memory was onlined first, so offlining memory may fail. In patch1, we provide a solution which is not good enough: iterate twice to offline the memory. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block. And a new idea from Wen Congyang we...@cn.fujitsu.com is: allocate the memory from the memory block they are
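The example's numbers, [1G, 1.5G) appearing as memory8 through memory11, follow from the x86_64 memory section size of 128MB (SECTION_SIZE_BITS = 27) when, as on this example's machine, one sysfs memory block is one section: the block index is just the physical address shifted down by 27 bits. A sketch:

```c
#include <assert.h>

#define SECTION_SIZE_BITS 27   /* x86_64: 128MB per memory section */

/* /sys/devices/system/memory/memoryN for the block containing phys,
 * assuming one memory block per section (larger machines may group
 * several sections into one block). */
unsigned long phys_to_memory_block(unsigned long phys)
{
    return phys >> SECTION_SIZE_BITS;
}
```

So 1G >> 27 = 8 and (1.5G - 1) >> 27 = 11, giving exactly the four directories named above.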
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/30/2013 06:15 PM, Tang Chen wrote: Hi Simon, Please see below. :) [...] Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE. Sorry, that should be: "we don't have movable node" means you didn't configure CONFIG_MOVABLE_NODE. [rest of the quoted message trimmed]
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang, On Wed, 2013-01-30 at 18:15 +0800, Tang Chen wrote: Hi Simon, Please see below. :) On 01/29/2013 08:52 PM, Simon Jeons wrote: Hi Tang, On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2. Some questions for you; they don't relate to this patchset, but are about memory hotplug in general. 1. In function node_states_check_changes_online, the comment says: * If we don't have HIGHMEM nor movable node, * node_states[N_NORMAL_MEMORY] contains nodes which have zones of * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE. How should I understand it? Why, when we have neither HIGHMEM nor movable node, does node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular memory. First of all, I think we need to understand why we need N_MEMORY. In order to support movable node, which has only ZONE_MOVABLE (the last zone), we introduce N_MEMORY to represent nodes that have normal, highmem or movable memory. Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE. This config option doesn't mean we don't have movable pages (NO); it means we don't have a node which has only movable pages (only ZONE_MOVABLE) (YES). Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node), we don't need a separate node_states[] element to represent such a node, because we won't have a node which has only ZONE_MOVABLE. So, 1) if we have neither highmem nor movable node, N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, which means N_NORMAL_MEMORY acts as N_MEMORY. If we online pages as movable, we need to update node_states[N_NORMAL_MEMORY]. Sorry, I'm still confused. :( Do we update node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY], and does node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE? Please refer to the definition of enum zone_type: if we don't have CONFIG_HIGHMEM, we won't have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always be there.
So we can have movable pages, and zone_last should be ZONE_MOVABLE. Which node_states is it? node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY]? Again, because we won't have a node having only ZONE_MOVABLE, we just need to update node_states[N_NORMAL_MEMORY]. * If we don't have movable node, node_states[N_NORMAL_MEMORY] * contains nodes which have zones of 0...ZONE_MOVABLE, * set zone_last to ZONE_MOVABLE. How should I understand this? 2) This code is in #ifdef CONFIG_HIGHMEM, which means we have highmem; so if we don't have movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY acts as N_MEMORY. If we online pages as movable, we need to update node_states[N_NORMAL_MEMORY]. 2. In function move_pfn_range_left, why is end_pfn <= z2->zone_start_pfn not correct? The comment says "must include/overlap"; why? This one is easy, if I understand you correctly. move_pfn_range_left() is used to move the leftmost part [start_pfn, end_pfn) of z2 to z1. So if end_pfn <= z2->zone_start_pfn, it means [start_pfn, end_pfn) is not part of z2. Then it fails. Yup, very clear now. :) Why check !z1->wait_table in function move_pfn_range_left and function __add_zone? I think zone->wait_table is initialized in free_area_init_core, which is called during system initialization and in the hotadd_new_pgdat path. 3. In function online_pages, in the normal case (without online_kernel or online_movable), why not check whether the new zone overlaps with adjacent zones? Can a zone overlap with the others? I don't think so. One pfn can only be in one zone: zone = page_zone(pfn_to_page(pfn)); Thanks. :) There is a populated_zone check in function online_pages. But a zone is populated in free_area_init_core, which is called during system initialization and in the hotadd_new_pgdat path. Why is this check still needed? It could overlap with others, I think. :) But maybe I misunderstand you. :) 4. Could you summarize the different implementations of hot-add vs. logical-add, and hot-remove vs. logical-remove?
Sorry, I don't quite understand what you mean by logical-add/remove. Would you please explain more? If you meant the sysfs interfaces, I think they are just another set of entrances to memory hotplug. Please ignore this silly question. :( Thanks. :) This patch-set aims to implement physical memory hot-removing. The patches can free/remove the following things: - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15] - memmap of sparse-vmemmap : [PATCH 6,7,8,10/15] - page table of removed memory : [RFC PATCH 7,8,10/15] - node and related sysfs files : [RFC PATCH 13-15/15] Existing problem: If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup when we online pages. For example: there is a memory device on node 1. The address
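The include/overlap check discussed above can be written out explicitly. A simplified sketch of the 3.8-era test in move_pfn_range_left(), with the zone reduced to a plain starting pfn:

```c
#include <assert.h>

/* Moving [start_pfn, end_pfn) from z2 to z1 only makes sense if the
 * range actually overlaps z2; the kernel's check is on z2's left edge. */
int can_move_range_left(unsigned long start_pfn, unsigned long end_pfn,
                        unsigned long z2_start_pfn)
{
    /* must include/overlap: if end_pfn <= z2->zone_start_pfn, the
     * range lies entirely before z2, so it is not part of z2 -> fail */
    if (end_pfn <= z2_start_pfn)
        return 0;
    return 1;
}
```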
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon, Please see below. :) On 01/31/2013 09:22 AM, Simon Jeons wrote: Sorry, I'm still confused. :( Do we update node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY], and does node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE? Which node_states is it? node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY]? Are you asking what node_states[] is? node_states[] is an array of nodemasks: extern nodemask_t node_states[NR_NODE_STATES]; For example, node_states[N_NORMAL_MEMORY] represents which nodes have normal memory. If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is node_states[N_NORMAL_MEMORY]. So it represents which nodes have zones 0 ... ZONE_MOVABLE. Why check !z1->wait_table in function move_pfn_range_left and function __add_zone? I think zone->wait_table is initialized in free_area_init_core, which is called during system initialization and in the hotadd_new_pgdat path. I think in free_area_init_core(), in the for loop:
    |-- size = zone_spanned_pages_in_node();
    |-- if (!size) continue;    /* if the zone is empty, we skip the rest of this iteration */
    |-- init_currently_empty_zone()
So, if the zone is empty, wait_table is not initialized. In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1 could be empty, so we need to check it and initialize z1->wait_table, because we are moving pages into it. There is a populated_zone check in function online_pages. But a zone is populated in free_area_init_core, which is called during system initialization and in the hotadd_new_pgdat path. Why is this check still needed? Because we can also rebuild the zonelists when we offline pages:
    __offline_pages()
    |-- zone->present_pages -= offlined_pages;
    |-- if (!populated_zone(zone)) {
            build_all_zonelists(NULL, NULL);
        }
If the zone is empty, but other zones on the same node are not empty, the node won't be offlined; and the next time we online pages of this zone, the pgdat won't be initialized again, so we need to check populated_zone(zone) when onlining pages. Thanks.
:)
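The two answers above can be condensed into one sketch: an empty zone skips init_currently_empty_zone(), so its wait_table stays NULL until pages are first moved in; and offlining can empty a zone whose node stays online, so onlining must re-check populated_zone(). The structures and helpers below are simplified stand-ins, not the kernel's actual ones:

```c
#include <assert.h>
#include <stddef.h>

struct zone_sketch {
    unsigned long present_pages;
    void *wait_table;            /* NULL until the zone first gets pages */
};

int populated_zone(struct zone_sketch *z)
{
    return z->present_pages != 0;
}

/* Mirrors the !z1->wait_table check in move_pfn_range_left()/__add_zone():
 * before moving pages into z1, initialize it if it was empty so far. */
void ensure_zone_init(struct zone_sketch *z)
{
    static int dummy_table;      /* stand-in for a real wait table */
    if (!z->wait_table)
        z->wait_table = &dummy_table;
}

void online_pages_sketch(struct zone_sketch *z, unsigned long n)
{
    ensure_zone_init(z);
    z->present_pages += n;
}

void offline_pages_sketch(struct zone_sketch *z, unsigned long n)
{
    z->present_pages -= n;
    /* in the kernel: if (!populated_zone(zone)) build_all_zonelists(...); */
}
```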
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang, On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote: Hi Simon, Please see below. :) On 01/31/2013 09:22 AM, Simon Jeons wrote: Sorry, I'm still confused. :( Do we update node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY], and does node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE? Which node_states is it? node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY]? Are you asking what node_states[] is? node_states[] is an array of nodemasks: extern nodemask_t node_states[NR_NODE_STATES]; For example, node_states[N_NORMAL_MEMORY] represents which nodes have normal memory. If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is node_states[N_NORMAL_MEMORY]. So it represents which nodes have zones 0 ... ZONE_MOVABLE. Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have 0 ... *ZONE_MOVABLE*? The comment on enum node_states says that N_NORMAL_MEMORY just means the node has regular memory. Why check !z1->wait_table in function move_pfn_range_left and function __add_zone? I think zone->wait_table is initialized in free_area_init_core, which is called during system initialization and in the hotadd_new_pgdat path. I think in free_area_init_core(), in the for loop:
    |-- size = zone_spanned_pages_in_node();
    |-- if (!size) continue;    /* if the zone is empty, we skip the rest of this iteration */
    |-- init_currently_empty_zone()
So, if the zone is empty, wait_table is not initialized. In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1 could be empty, so we need to check it and initialize z1->wait_table, because we are moving pages into it. Thanks. There is a populated_zone check in function online_pages. But a zone is populated in free_area_init_core, which is called during system initialization and in the hotadd_new_pgdat path. Why is this check still needed? Because we can also rebuild the zonelists when we offline pages:
    __offline_pages()
    |-- zone->present_pages -= offlined_pages;
    |-- if (!populated_zone(zone)) {
            build_all_zonelists(NULL, NULL);
        }
If the zone is empty, but other zones on the same node are not empty, the node won't be offlined; and the next time we online pages of this zone, the pgdat won't be initialized again, so we need to check populated_zone(zone) when onlining pages. Thanks. Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/31/2013 02:19 PM, Simon Jeons wrote: Hi Tang, On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote: Hi Simon, Please see below. :) On 01/31/2013 09:22 AM, Simon Jeons wrote: Sorry, I'm still confused. :( Do we update node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY], and does node_states[N_NORMAL_MEMORY] represent 0...ZONE_MOVABLE? Which node_states is it? node_states[N_NORMAL_MEMORY] or node_states[N_MEMORY]? Are you asking what node_states[] is? node_states[] is an array of nodemasks: extern nodemask_t node_states[NR_NODE_STATES]; For example, node_states[N_NORMAL_MEMORY] represents which nodes have normal memory. If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is node_states[N_NORMAL_MEMORY]. So it represents which nodes have zones 0 ... ZONE_MOVABLE. Sorry, how can node_states[N_NORMAL_MEMORY] represent nodes that have 0 ... *ZONE_MOVABLE*? The comment on enum node_states says that N_NORMAL_MEMORY just means the node has regular memory. Hi Simon, let's put it this way. If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We don't have a separate macro to represent highmem because we don't have highmem. This is easy to understand, right? Now think of it just like the above: if we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY. This means we don't allow a node to have only movable memory, not that we don't have movable memory. A node could have both normal memory and movable memory. So node_states[N_NORMAL_MEMORY] can represent nodes that have 0 ... *ZONE_MOVABLE*. I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have only movable memory. So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot have movable memory; it means a node cannot have only movable memory. It can have normal memory and movable memory. 1) With CONFIG_MOVABLE_NODE: N_NORMAL_MEMORY: nodes which have normal memory.
    - normal memory only
    - normal and highmem
    - normal, highmem and movablemem
    - normal and movablemem
N_MEMORY: nodes which have memory (any memory):
    - normal memory only
    - normal and highmem
    - normal, highmem and movablemem
    - normal and movablemem          <- we can have movablemem
    - highmem only
    - highmem and movablemem
    - movablemem only                <- we can have movablemem only ***
2) Without CONFIG_MOVABLE_NODE: N_MEMORY == N_NORMAL_MEMORY (here I omit N_HIGH_MEMORY):
    - normal memory only
    - normal and highmem
    - normal, highmem and movablemem
    - normal and movablemem          <- we can have movablemem
    - no "movablemem only"           <- we cannot have movablemem only ***
The semantics are not that clear here, so we can only try to understand it from the code where N_MEMORY is used. :) That is my understanding of it. Thanks. :)
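The aliasing Tang describes, node_states[N_NORMAL_MEMORY] doubling as node_states[N_MEMORY] when the config options are off, is visible in the enum itself. A trimmed sketch of the include/linux/nodemask.h pattern, compiled here without CONFIG_HIGHMEM or CONFIG_MOVABLE_NODE defined:

```c
#include <assert.h>

/* Trimmed-down version of the kernel's enum node_states: when a
 * config option is off, the corresponding index aliases the one
 * below it instead of getting its own array slot. */
enum node_states_sketch {
    N_POSSIBLE,
    N_ONLINE,
    N_NORMAL_MEMORY,
#ifdef CONFIG_HIGHMEM
    N_HIGH_MEMORY,                    /* own slot only with highmem */
#else
    N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
#ifdef CONFIG_MOVABLE_NODE
    N_MEMORY,                         /* own slot only with movable node */
#else
    N_MEMORY = N_HIGH_MEMORY,
#endif
    N_CPU,
    NR_NODE_STATES
};
```

With both options off, updating node_states[N_NORMAL_MEMORY] and updating node_states[N_MEMORY] touch the same nodemask, which is exactly why the hotplug code only updates the former in that configuration.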
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang, On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2. Some questions for you; they don't relate to this patchset, but are about memory hotplug in general. 1. In function node_states_check_changes_online, the comment says: * If we don't have HIGHMEM nor movable node, * node_states[N_NORMAL_MEMORY] contains nodes which have zones of * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE. How should I understand it? Why, when we have neither HIGHMEM nor movable node, does node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular memory. * If we don't have movable node, node_states[N_NORMAL_MEMORY] * contains nodes which have zones of 0...ZONE_MOVABLE, * set zone_last to ZONE_MOVABLE. How should I understand this? 2. In function move_pfn_range_left, why is end_pfn <= z2->zone_start_pfn not correct? The comment says "must include/overlap"; why? 3. In function online_pages, in the normal case (without online_kernel or online_movable), why not check whether the new zone overlaps with adjacent zones? 4. Could you summarize the different implementations of hot-add vs. logical-add, and hot-remove vs. logical-remove? This patch-set aims to implement physical memory hot-removing. The patches can free/remove the following things: - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15] - memmap of sparse-vmemmap : [PATCH 6,7,8,10/15] - page table of removed memory : [RFC PATCH 7,8,10/15] - node and related sysfs files : [RFC PATCH 13-15/15] Existing problem: If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup when we online pages. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10, and memory11, under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, when we online memory8, the memory that stores its page cgroup is not provided by this memory device.
But when we online memory9, the memory that stores its page cgroup may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reverse order. When the memory device is hot-removed, we will automatically offline the memory provided by this memory device. But we don't know which memory was onlined first, so offlining memory may fail. In patch1, we provide a solution which is not good enough: iterate twice to offline the memory. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block. And a new idea from Wen Congyang we...@cn.fujitsu.com is: allocate the memory from the memory block they are describing. But we are not sure if it is OK to do so, because there is no existing API for it, and we would need to move the page_cgroup memory allocation from MEM_GOING_ONLINE to MEM_ONLINE. It may also interfere with hugepages. How to test this patchset? 1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE, and ACPI_HOTPLUG_MEMORY must be selected. 2. Load the module acpi_memhotplug. 3. Hotplug the memory device (it depends on your hardware). You will see the memory device under the directory /sys/bus/acpi/devices/. Its name is PNP0C80:XX. 4. Online/offline pages provided by this memory device. You can write online/offline to /sys/devices/system/memory/memoryX/state to online/offline the pages provided by this memory device. 5. Hot-remove the memory device. You can hot-remove the memory device through the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject. Is there a similar node to hot-add the memory device? Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug. Changelogs from v5 to v6: Patch3: Add some more comments to explain memory hot-remove. Patch4: Remove the bootmem member in struct firmware_map_entry. Patch6: Repeatedly register bootmem pages when using hugepage. Patch8: Repeatedly free bootmem pages when using hugepage.
Patch14: Don't free pgdat when offlining a node, just reset it to 0. Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new one when online a node. Changelogs from v4 to v5: Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to avoid disabling irq because we need flush tlb when free pagetables. Patch8: new patch, pick up some common APIs that are used to free direct mapping and vmemmap pagetables. Patch9: free direct mapping pagetables on x86_64 arch. Patch10: free vmemmap pagetables. Patch11: since freeing memmap with vmemmap has been implemented, the config macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is no longer needed. Patch13: no need to modify acpi_memory_disable_device() since it was removed, and add nid
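On the node_states[] question above: a tiny user-space model may help. This is not the kernel code (in the kernel these are real nodemask_t bitmaps from include/linux/nodemask.h, and the helpers have more states); it only sketches why node_states[N_NORMAL_MEMORY] and node_states[N_MEMORY] describe the same set of nodes when HIGHMEM and movable-node support are absent.

```c
#include <stdbool.h>

/* Simplified model of the kernel's node_states[] bookkeeping.
 * One bit per node id in a plain unsigned long is enough here. */
enum node_state_model {
    N_NORMAL_MEMORY,   /* node has regular (lowmem) memory            */
    N_HIGH_MEMORY,     /* node has normal or high memory              */
    N_MEMORY,          /* node has any memory (normal/high/movable)   */
    NR_STATES
};

static unsigned long node_states[NR_STATES];

static void node_set_state(int nid, enum node_state_model st)
{
    node_states[st] |= 1UL << nid;
}

static bool node_state(int nid, enum node_state_model st)
{
    return node_states[st] & (1UL << nid);
}
```

Without CONFIG_HIGHMEM and CONFIG_MOVABLE_NODE, any node that has memory at all must have normal memory, so every node_set_state() for N_MEMORY is paired with one for N_NORMAL_MEMORY and the two masks never diverge: N_NORMAL_MEMORY effectively covers zones 0...ZONE_MOVABLE.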
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/29/2013 08:52 PM, Simon Jeons wrote: Hi Tang, On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2.

Hi Simon,

I'll summarize all the info and answer you later. :)

Thanks for asking. :)

[...]
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote: On 01/29/2013 08:52 PM, Simon Jeons wrote: Hi Tang, On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2. Hi Simon, I'll summarize all the info and answer you later. :) Thanks for asking. :)

Thanks Tang.

IIRC, there is a qemu feature that emulates memory hot-add/remove, for testing when we don't have a machine that supports it. Has that qemu feature been merged? If not, where can I get the patchset?

[...]
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/30/2013 10:48 AM, Simon Jeons wrote: Thanks Tang. IIRC, there is a qemu feature that emulates memory hot-add/remove, for testing when we don't have a machine that supports it. Has that qemu feature been merged? If not, where can I get the patchset?

Hi Simon,

There are patches to support hot-add/remove in qemu, but they are not merged yet. You can get the latest patches here:
http://lists.nongnu.org/archive/html/qemu-devel/2012-12/msg02693.html

BTW, it is unstable and full of problems, and you need to compile your own seabios too.

Thanks. :)

[...]
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 16:55), Glauber Costa wrote: On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote: (2013/01/10 16:14), Glauber Costa wrote: On 01/10/2013 06:17 AM, Tang Chen wrote: Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug.

Right. But how often does this happen in testing? In other words, please provide an overall description of how well memory hot-remove is presently operating. Is it reliable? What is the success rate in real-world situations?

We test the hot-remove functionality mostly with movable_online used. And the memory used by the kernel is not allowed to be removed.

Can you try doing this using cpusets configured to hardwall? It is my understanding that the object allocators will try hard not to allocate anything outside the walls defined by cpuset. Which means that if you have one process per node, and they are hardwalled, your kernel memory will be spread evenly among the machine. With a big enough load, it should eventually be present in all blocks.

I'm sorry, I couldn't catch your point. Do you want to confirm whether cpuset can work well enough instead of ZONE_MOVABLE? Or do you want to confirm whether ZONE_MOVABLE will not work if it's used with cpuset?

No, I am not proposing to use cpuset to tackle the problem. I am just wondering if you would still have high success rates with cpusets in use with hardwalls. This is just one example of a workload that would spread kernel memory around quite heavily. So this is just me trying to understand the limitations of the mechanism.

Hm, okay. In my understanding, if the whole memory of a node is configured as MOVABLE, no kernel memory will be allocated on that node, because the zonelist will not match. So if cpuset is used with hardwalls, the user will see -ENOMEM or OOM, I guess; even fork() will fail if fallback to another node is not allowed.

If it's configured as ZONE_NORMAL, you need to pray for offlining memory. AFAIK, IBM's ppc? has a 16MB section size, so some sections can be offlined even if they are configured as ZONE_NORMAL. For them, the placement of offlined memory is not important because it's virtualized by LPAR; they don't try to remove a DIMM, they just want to increase/decrease the amount of memory. It's another approach. But here, we (Fujitsu) try to remove a system board/DIMM. So we configure the whole memory of a node as ZONE_MOVABLE and try to guarantee the DIMM is removable.

IMHO, I don't think shrink_slab() can kill all objects in a node even if some of them are caches. We need more study for doing that.

Indeed, shrink_slab can only kill cached objects. They, however, are usually a very big part of kernel memory. I wonder though if, in case of failure, it is worth it to try at least one shrink pass before you give up.

Yeah, for now, his (our) approach never allows kernel memory on a node that is to be hot-removed, by using ZONE_MOVABLE. So shrink_slab()'s effect will not be seen. If other brave guys try to use ZONE_NORMAL for a hot-pluggable DIMM, I see, it's worth trying. How about checking whether the target memsection is in NORMAL or in MOVABLE at hot-remove time? If NORMAL, shrink_slab() will be worth calling.

BTW, is shrink_slab() now node/zone aware? If not, fixing that first will be the better direction, I guess.

Thanks,
-Kame

___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
If it's configured as ZONE_NORMAL, you need to pray for offlining memory. AFAIK, IBM's ppc? has a 16MB section size, so some sections can be offlined even if they are configured as ZONE_NORMAL. For them, the placement of offlined memory is not important because it's virtualized by LPAR; they don't try to remove a DIMM, they just want to increase/decrease the amount of memory. It's another approach. But here, we (Fujitsu) try to remove a system board/DIMM. So we configure the whole memory of a node as ZONE_MOVABLE and try to guarantee the DIMM is removable.

IMHO, I don't think shrink_slab() can kill all objects in a node even if some of them are caches. We need more study for doing that.

Indeed, shrink_slab can only kill cached objects. They, however, are usually a very big part of kernel memory. I wonder though if, in case of failure, it is worth it to try at least one shrink pass before you give up.

Yeah, for now, his (our) approach never allows kernel memory on a node that is to be hot-removed, by using ZONE_MOVABLE. So shrink_slab()'s effect will not be seen.

Ok, that clarifies it for me.

If other brave guys try to use ZONE_NORMAL for a hot-pluggable DIMM, I see, it's worth trying.

I was under the impression that this was being done here.

How about checking whether the target memsection is in NORMAL or in MOVABLE at hot-remove time? If NORMAL, shrink_slab() will be worth calling.

Yes, this is what I meant. I think there is value in investigating this, since for a lot of workloads, a lot of the kernel memory will consist of shrinkable cached memory. It would provide you with the same level of guarantees (zero), but it could improve the success rate (this is, of course, a guess).

BTW, is shrink_slab() now node/zone aware? If not, fixing that first will be the better direction, I guess.

It is not upstream, but there are patches for this that I am already using in my private tree.
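The flow Kamezawa proposes (check the zone of the target section and, for ZONE_NORMAL, try one shrink pass before giving up) can be sketched as a toy model. Everything below is hypothetical user-space code with made-up helper names; it is not the kernel's shrink_slab() or offline path, just the decision logic under discussion.

```c
#include <stdbool.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE };

/* Toy stand-in for reclaimable slab objects living in the section.
 * ZONE_MOVABLE sections hold no kernel objects by construction. */
static int reclaimable_objects = 100;

/* Stand-in for one shrink_slab() pass: drop all cached objects. */
static void shrink_pass(void)
{
    reclaimable_objects = 0;
}

/* Hypothetical offline attempt: for a ZONE_NORMAL section, spend one
 * shrink pass before failing; for ZONE_MOVABLE, pages are migratable
 * by definition, so no shrink pass is needed. */
static bool try_offline_section(enum zone_type zt, bool *did_shrink)
{
    *did_shrink = false;
    if (zt == ZONE_NORMAL && reclaimable_objects > 0) {
        shrink_pass();
        *did_shrink = true;
    }
    return zt == ZONE_MOVABLE || reclaimable_objects == 0;
}
```

As the thread notes, this offers no guarantee (unshrinkable kernel objects still pin the section); it only raises the success rate for cache-heavy workloads.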
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 17:36), Glauber Costa wrote: BTW, is shrink_slab() now node/zone aware? If not, fixing that first will be the better direction, I guess. It is not upstream, but there are patches for this that I am already using in my private tree.

Oh, I see. Once that is merged, it's worth adding a shrink_slab() call to the ZONE_NORMAL path.

Thanks,
-Kame
[PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Here is the physical memory hot-remove patch-set based on 3.8rc-2.

This patch-set aims to implement physical memory hot-removing. The patches can free/remove the following things:

- /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
- memmap of sparse-vmemmap : [PATCH 6,7,8,10/15]
- page table of removed memory : [RFC PATCH 7,8,10/15]
- node and related sysfs files : [RFC PATCH 13-15/15]

Existing problem: if CONFIG_MEMCG is selected, we allocate memory to store page cgroups when we online pages. For example: there is a memory device on node 1, with address range [1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10, and memory11, under /sys/devices/system/memory/. If CONFIG_MEMCG is selected, when we online memory8, the memory storing its page cgroups is not provided by this memory device. But when we online memory9, the memory storing its page cgroups may be provided by memory8, so we can't offline memory8 then. We should offline the memory in reverse order.

When the memory device is hot-removed, we automatically offline the memory provided by this device. But we don't know which memory was onlined first, so offlining may fail. In patch1, we provide a solution which is not good enough: iterate twice to offline the memory. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block.

A new idea from Wen Congyang we...@cn.fujitsu.com is: allocate the memory from the memory block it is describing. But we are not sure it is OK to do so, because there is no existing API for it, and we would need to move the page_cgroup memory allocation from MEM_GOING_ONLINE to MEM_ONLINE. It may also interfere with hugepages.

How to test this patchset?
1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE, and ACPI_HOTPLUG_MEMORY must be selected.
2. Load the module acpi_memhotplug.
3. Hotplug the memory device (this depends on your hardware). You will see the memory device under /sys/bus/acpi/devices/. Its name is PNP0C80:XX.
4. Online/offline pages provided by this memory device. You can write online/offline to /sys/devices/system/memory/memoryX/state.
5. Hot-remove the memory device. You can do this via the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug.

Changelogs from v5 to v6:
Patch3: Add some more comments to explain memory hot-remove.
Patch4: Remove the bootmem member in struct firmware_map_entry.
Patch6: Repeatedly register bootmem pages when using hugepage.
Patch8: Repeatedly free bootmem pages when using hugepage.
Patch14: Don't free pgdat when offlining a node, just reset it to 0.
Patch15: New patch; pgdat is not freed in patch14, so don't allocate a new one when onlining a node.

Changelogs from v4 to v5:
Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to avoid disabling irq, because we need to flush the tlb when freeing pagetables.
Patch8: new patch, pick up some common APIs that are used to free direct mapping and vmemmap pagetables.
Patch9: free direct mapping pagetables on the x86_64 arch.
Patch10: free vmemmap pagetables.
Patch11: since freeing memmap with vmemmap has been implemented, the config macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is no longer needed.
Patch13: no need to modify acpi_memory_disable_device() since it was removed, and add the nid parameter when calling remove_memory().

Changelogs from v3 to v4:
Patch7: remove unused code.
Patch8: fix the nr_pages that is passed to free_map_bootmem().

Changelogs from v2 to v3:
Patch9: call sync_global_pgds() if the pgd is changed.
Patch10: fix a problem in the patch.

Changelogs from v1 to v2:
Patch1: new patch, offline memory twice. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block.
Patch3: new patch, no logical change, just remove redundant code.
Patch9: merge the patch from wujianguo into this patch; flush the tlb on all cpus after the pagetable is changed.
Patch12: new patch, free node_data when a node is offlined.

Tang Chen (6):
  memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.
  memory-hotplug: remove sysfs file of node
  memory-hotplug: Do not allocate pgdat if it was not freed when offline.

Wen Congyang (5):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug:
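As an aside, the memory8..memory11 directory names in the cover letter's example follow directly from the memory block size. A quick sketch, assuming 128MB blocks (the common x86_64 section size; a real system reports its value in /sys/devices/system/memory/block_size_bytes):

```c
#include <stdint.h>

/* Why a device covering [1G, 1.5G) shows up as memory8..memory11:
 * block X covers physical addresses [X * BLOCK_SIZE, (X+1) * BLOCK_SIZE).
 * Assumes a 128MB memory block size. */
#define BLOCK_SIZE (128ULL << 20)   /* 128MB */

static unsigned first_block(uint64_t start)
{
    return start / BLOCK_SIZE;
}

static unsigned last_block(uint64_t end)   /* end is exclusive */
{
    return (end - 1) / BLOCK_SIZE;
}
```

1G / 128MB = 8 and (1.5G - 1) / 128MB = 11, giving the four sysfs directories memory8 through memory11.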
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Wed, 9 Jan 2013 17:32:24 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2. This patch-set aims to implement physical memory hot-removing. The patches can free/remove the following things:

- /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
- memmap of sparse-vmemmap : [PATCH 6,7,8,10/15]
- page table of removed memory : [RFC PATCH 7,8,10/15]
- node and related sysfs files : [RFC PATCH 13-15/15]

Existing problem: if CONFIG_MEMCG is selected, we allocate memory to store page cgroups when we online pages. For example: there is a memory device on node 1, with address range [1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10, and memory11, under /sys/devices/system/memory/. If CONFIG_MEMCG is selected, when we online memory8, the memory storing its page cgroups is not provided by this memory device. But when we online memory9, the memory storing its page cgroups may be provided by memory8, so we can't offline memory8 then. We should offline the memory in reverse order. When the memory device is hot-removed, we automatically offline the memory provided by this device. But we don't know which memory was onlined first, so offlining memory may fail.

This does sound like a significant problem. We should assume that memcg is available and in use.

In patch1, we provide a solution which is not good enough: iterate twice to offline the memory. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block.

Let's flesh this out a bit. If we online memory8, memory9, memory10 and memory11 then I'd have thought that they would need to be offlined in reverse order, which will require four iterations, not two. Is this wrong and if so, why?

Also, what happens if we wish to offline only memory9? Do we offline memory11, then memory10, then memory9, and then re-online memory10 and memory11?

And a new idea from Wen Congyang we...@cn.fujitsu.com is: allocate the memory from the memory block it is describing.

Yes.

But we are not sure it is OK to do so, because there is no existing API for it, and we would need to move the page_cgroup memory allocation from MEM_GOING_ONLINE to MEM_ONLINE.

This all sounds solvable; can we proceed in this fashion?

It may also interfere with hugepages.

Please provide full details on this problem.

Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug.

Right. But how often does this happen in testing? In other words, please provide an overall description of how well memory hot-remove is presently operating. Is it reliable? What is the success rate in real-world situations? Are there precautions which the administrator can take to improve the success rate? What are the remaining problems, and are there plans to address them?
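The two-iteration scheme Andrew questions here can be modeled in a few lines. In this toy model only the primary (first-added) block holds metadata for the other blocks of the device, so two passes suffice; with the chained dependencies Tang describes in his reply (block 10 depending on 9, 9 on 8), more iterations would indeed be needed, which is exactly Andrew's point.

```c
#include <stdbool.h>

/* Toy model of patch1's two-pass offline. Block 0 stands in for the
 * primary (first-added) block, which may hold the page_cgroup data
 * of every later block in the same device. */
#define NBLOCKS 4

static bool online[NBLOCKS];

/* A non-primary block can always be offlined; the primary block can
 * only go once no other block in the device still depends on it. */
static bool can_offline(int i)
{
    if (i != 0)
        return true;
    for (int j = 1; j < NBLOCKS; j++)
        if (online[j])
            return false;
    return true;
}

/* 1st iteration: offline every non-primary block.
 * 2nd iteration: offline the primary block. Returns pass count. */
static int offline_device(void)
{
    int passes = 0;
    for (int pass = 0; pass < 2; pass++) {
        passes++;
        for (int i = 0; i < NBLOCKS; i++)
            if (online[i] && can_offline(i))
                online[i] = false;
    }
    return passes;
}
```

Under the single-dependency assumption every block is offline after the second pass; the scheme breaks down only when dependencies form a chain across non-primary blocks.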
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Wed, 9 Jan 2013 17:32:24 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: This patch-set aims to implement physical memory hot-removing.

As you were on the patch delivery path, all of these patches should have your Signed-off-by:. But some were missing it. I fixed this in my copy of the patches.

I suspect this patchset adds a significant amount of code which will not be used if CONFIG_MEMORY_HOTPLUG=n. [PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap, for example. This is not a good thing, so please go through the patchset (in fact, go through all the memhotplug code) and let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n kernels.

This needn't be done immediately; it would be OK by me if you were to defer this exercise until all the new memhotplug code is largely in place. But please, let's do it.
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Andrew, Thank you very much for your pushing. :) On 01/10/2013 06:23 AM, Andrew Morton wrote: This does sound like a significant problem. We should assume that mmecg is available and in use. In patch1, we provide a solution which is not good enough: Iterate twice to offline the memory. 1st iterate: offline every non primary memory block. 2nd iterate: offline primary (i.e. first added) memory block. Let's flesh this out a bit. If we online memory8, memory9, memory10 and memory11 then I'd have thought that they would need to offlined in reverse order, which will require four iterations, not two. Is this wrong and if so, why? Well, we may need more than two iterations if all memory8, memory9, memory10 are in use by kernel, and 10 depends on 9, 9 depends on 8. So, as you see here, the iteration method is not good enough. But this only happens when the memory is used by kernel, which will not be able to be migrated. So if we can use a boot option, such as movablecore_map, or movable_online functionality to limit the memory as movable, the kernel will not use this memory. So it is safe when we are doing node hot-remove. Also, what happens if we wish to offline only memory9? Do we offline memory11 then memory10 then memory9 and then re-online memory10 and memory11? In this case, offlining memory9 could fail if user do this by himself, for example using sysfs. In this path, it is in memory hot-remove path. So when we remove a memory device, it will automatically offline all pages, and it is in reverse order by itself. And again, this is not good enough. We will figure out a reasonable way to solve it soon. And a new idea from Wen Congyangwe...@cn.fujitsu.com is: allocate the memory from the memory block they are describing. Yes. But we are not sure if it is OK to do so because there is not existing API to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE to MEM_ONLINE. This all sounds solvable - can we proceed in this fashion? 
Yes, we are in progress now.

>> And also, it may interfere with hugepages.
>
> Please provide full details on this problem.

It is not very clear now; if I find something, I'll share it.

>> Note: if the memory provided by the memory device is used by the
>> kernel, it can't be offlined. It is not a bug.
>
> Right. But how often does this happen in testing? In other words,
> please provide an overall description of how well memory hot-remove is
> presently operating. Is it reliable? What is the success rate in
> real-world situations?

We test the hot-remove functionality mostly with movable_online used, and the memory used by the kernel is not allowed to be removed. We will do some tests on the kernel-memory offline cases and send you the results soon. And since we are trying out some other approaches, I think the problem will be solved soon.

> Are there precautions which the administrator can take to improve the
> success rate?

The administrator could use the movablecore_map boot option or the movable_online functionality (which is already in the kernel) to limit memory to movable and avoid this problem.

> What are the remaining problems and are there plans to address them?

For now, we will try to allocate page_cgroup on the memory block it is itself describing. All the other parts seem to work well now, and we are still testing; if we hit any problem, we will share it.

Thanks. :)

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
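[Editorial note: the reverse-order offlining discussed in the message above is driven through the memory sysfs interface. The sketch below is illustrative, not from the patch set: the block numbers are hypothetical, and the DRY_RUN guard is added so the ordering logic can be exercised without root or hot-pluggable hardware. On a real machine the sysfs write fails with EBUSY if a block still holds unmovable kernel pages.]

```shell
# Offline memory blocks in reverse order of onlining (later-added blocks
# first, the primary block last), as discussed above.
DRY_RUN=${DRY_RUN:-1}
ORDER=""

offline_block() {
    state="/sys/devices/system/memory/memory$1/state"
    if [ "$DRY_RUN" = "1" ]; then
        ORDER="$ORDER memory$1"     # record the action instead of touching sysfs
    else
        echo offline > "$state"     # needs root; EBUSY if unmovable pages remain
    fi
}

# memory11 down to the primary block memory8.
for n in 11 10 9 8; do
    offline_block "$n"
done
echo "offline order:$ORDER"
```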
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Andrew,

On 01/10/2013 07:33 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:24 +0800 Tang Chen <tangc...@cn.fujitsu.com> wrote:
>> This patch-set aims to implement physical memory hot-removing.
>
> As you were on the patch delivery path, all of these patches should
> have your Signed-off-by:. But some were missing it. I fixed this in my
> copy of the patches.

Thank you very much for the help. Next time I'll add it myself.

> I suspect this patchset adds a significant amount of code which will
> not be used if CONFIG_MEMORY_HOTPLUG=n. [PATCH v6 06/15]
> memory-hotplug: implement register_page_bootmem_info_section of
> sparse-vmemmap, for example. This is not a good thing, so please go
> through the patchset (in fact, go through all the memhotplug code) and
> let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
> kernels.
>
> This needn't be done immediately - it would be OK by me if you were to
> defer this exercise until all the new memhotplug code is largely in
> place. But please, let's do it.

OK, I'll have a check on it when the page_cgroup problem is solved.

Thanks. :)
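[Editorial note: one way to quantify the CONFIG_MEMORY_HOTPLUG=n bloat Andrew mentions is the kernel tree's own scripts/bloat-o-meter, which diffs per-symbol sizes between two vmlinux builds. A sketch under assumptions: the config symbols and helper scripts are real, but the output-directory names are illustrative, and it must be run from a kernel source tree.]

```shell
# Build a vmlinux with and without memory hotplug, then compare sizes.
make O=build-hotplug defconfig
./scripts/config --file build-hotplug/.config -e MEMORY_HOTPLUG -e MEMORY_HOTREMOVE
make O=build-hotplug olddefconfig && make O=build-hotplug vmlinux

make O=build-nohotplug defconfig
./scripts/config --file build-nohotplug/.config -d MEMORY_HOTPLUG -d MEMORY_HOTREMOVE
make O=build-nohotplug olddefconfig && make O=build-nohotplug vmlinux

# Per-symbol size delta; code still present in the =n build is a
# candidate for #ifdef CONFIG_MEMORY_HOTPLUG / __meminit annotations.
./scripts/bloat-o-meter build-nohotplug/vmlinux build-hotplug/vmlinux
```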
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/10/2013 06:17 AM, Tang Chen wrote:
>>> Note: if the memory provided by the memory device is used by the
>>> kernel, it can't be offlined. It is not a bug.
>>
>> Right. But how often does this happen in testing? In other words,
>> please provide an overall description of how well memory hot-remove
>> is presently operating. Is it reliable? What is the success rate in
>> real-world situations?
>
> We test the hot-remove functionality mostly with movable_online used.
> And the memory used by the kernel is not allowed to be removed.

Can you try doing this using cpusets configured to hardwall? It is my understanding that the object allocators will try hard not to allocate anything outside the walls defined by a cpuset. Which means that if you have one process per node, and they are hardwalled, your kernel memory will be spread evenly across the machine. With a big enough load, kernel objects should eventually be present in all memory blocks.

Another question I have for you: have you considered calling shrink_slab to try to deplete the caches and therefore free at least the slab memory in the nodes that can't be offlined? Is it relevant?
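[Editorial note: for anyone wanting to reproduce the hardwalled setup Glauber describes, a sketch using the v1 cpuset cgroup interface. The group name and node/CPU numbers are illustrative; this needs root on a NUMA machine. `cpuset.mem_hardwall` is the real control file for the hardwall behavior discussed above.]

```shell
# Create a hardwalled cpuset confined to node 0. Kernel allocations made
# on behalf of tasks in the group then stay within that node's memory.
mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset 2>/dev/null
mkdir -p /sys/fs/cgroup/cpuset/wall0
echo 0 > /sys/fs/cgroup/cpuset/wall0/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/wall0/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/wall0/cpuset.mem_hardwall
echo $$ > /sys/fs/cgroup/cpuset/wall0/tasks   # move this shell inside the wall
# Repeat with wall1 (cpuset.mems=1), wall2, ... one load per node, to
# spread kernel memory across all nodes as described above.
```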
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 16:14), Glauber Costa wrote:
> Can you try doing this using cpusets configured to hardwall? It is my
> understanding that the object allocators will try hard not to allocate
> anything outside the walls defined by cpuset. Which means that if you
> have one process per node, and they are hardwalled, your kernel memory
> will be spread evenly among the machine. With a big enough load, they
> should eventually be present in all blocks.

I'm sorry, I couldn't catch your point. Do you want to confirm whether cpuset can work well enough instead of ZONE_MOVABLE? Or do you want to confirm whether ZONE_MOVABLE will not work if it's used together with cpuset?

> Another question I have for you: Have you considered calling
> shrink_slab to try to deplete the caches and therefore free at least
> slab memory in the nodes that can't be offlined? Is it relevant?

At this stage, we don't consider calling shrink_slab(). We require nearly 100% success at offlining memory when removing a DIMM; that is my understanding. IMHO, I don't think shrink_slab() can kill all objects in a node even if some of them are caches. We need more study before doing that.

Thanks,
-Kame
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
> I'm sorry I couldn't catch your point. Do you want to confirm whether
> cpuset can work well enough instead of ZONE_MOVABLE? Or do you want to
> confirm whether ZONE_MOVABLE will not work if it's used with cpuset?

No, I am not proposing to use cpusets to tackle the problem. I am just wondering whether you would still have high success rates with cpusets in use with hardwalls. This is just one example of a workload that would spread kernel memory around quite heavily. So this is just me trying to understand the limitations of the mechanism.

> At this stage, we don't consider calling shrink_slab(). We require
> nearly 100% success at offlining memory for removing a DIMM. That is
> my understanding.

Of course; this is indisputable.

> IMHO, I don't think shrink_slab() can kill all objects in a node even
> if some of them are caches. We need more study for doing that.

Indeed, shrink_slab can only kill cached objects. They, however, are usually a very big part of kernel memory. I wonder, though, whether in case of failure it is worth trying at least one shrink pass before giving up. It is not very different from what is done in memory-failure.c, except that we could do better and do more targeted shrinking (support for that is being worked on).
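[Editorial note: for context on the shrink_slab discussion, userspace can already force a global shrink of the reclaimable slab caches (dentries and inodes) through /proc/sys/vm/drop_caches before retrying an offline. This is a blunt, system-wide pass, not the targeted per-node shrinking Glauber mentions; a sketch, requires root, and the block number is the one from the thread:]

```shell
sync                                 # write back dirty data first
echo 2 > /proc/sys/vm/drop_caches    # 2 = reclaimable slab; 3 = slab + page cache
# ...then retry the offline that previously failed with EBUSY:
echo offline > /sys/devices/system/memory/memory9/state
```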