Re: [PATCH] sysctl_panic_on_oom broken
On Tue, 17 Apr 2007, Larry Woodman wrote:

> > out_of_memory() does not panic when sysctl_panic_on_oom is set if
> > constrained_alloc() does not return CONSTRAINT_NONE. Instead,
> > out_of_memory() kills the current process whenever constrained_alloc()
> > returns either CONSTRAINT_MEMORY_POLICY or CONSTRAINT_CPUSET. This
> > patch fixes this problem:
>
> It recreates the old problem that we OOM while we still have memory in
> other parts of the system.

Hmm. The user's expectation is failover of clustering ASAP by panic.
Even if free memory remains due to a cpuset/mempolicy setting, some
people may want failover soon. Of course, some other people don't want
a panic if free memory remains. I think it depends on the user.

If panic_on_oom is 1, only panic if mempolicy/cpuset is not used.
And if panic_on_oom is 2, panic in all cases. This might be desirable.

Bye.

-- 
Yasunori Goto

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
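The two-mode proposal above can be sketched in plain C. This is a toy model, not kernel code: the helper name `should_panic()` is made up, and the enum values merely mirror the kernel's `constrained_alloc()` results.

```c
#include <assert.h>

/* Toy model of the proposed sysctl semantics. CONSTRAINT_NONE means the
 * whole system is out of memory; the other values mean only a
 * cpuset/mempolicy-restricted subset of nodes is exhausted. */
enum oom_constraint {
	CONSTRAINT_NONE,
	CONSTRAINT_CPUSET,
	CONSTRAINT_MEMORY_POLICY,
};

/* panic_on_oom == 2: always panic (failover ASAP);
 * panic_on_oom == 1: panic only on a system-wide OOM;
 * panic_on_oom == 0: never panic, let the OOM killer pick a victim. */
static int should_panic(int panic_on_oom, enum oom_constraint c)
{
	if (panic_on_oom == 2)
		return 1;
	return panic_on_oom == 1 && c == CONSTRAINT_NONE;
}
```

Mode 1 thus keeps the machine alive when other nodes still have free memory, while mode 2 gives the unconditional failover behaviour requested here.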
Re: [PATCH] Make new setting of panic_on_oom
> > 	read_lock(&tasklist_lock);
> > +	if (sysctl_panic_on_oom == 2)
> > +		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> > +
> >
> > Wouldn't it be safer to put the panic before the read_lock()?
>
> I agree. Otherwise the patch seems to be okay.

Ok. This is take 2. Thanks for your comment.

-------
The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain. But some
people want failover by panic ASAP even if they are used.
This patch makes a new setting for that request.

This is not tested yet. But it should work. Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/sysctl/vm.txt |   23 +++++++++++++++++------
 mm/oom_kill.c               |    3 +++
 2 files changed, 20 insertions(+), 6 deletions(-)

Index: panic_on_oom2/Documentation/sysctl/vm.txt
===================================================================
--- panic_on_oom2.orig/Documentation/sysctl/vm.txt	2007-04-21 12:39:09.0 +0900
+++ panic_on_oom2/Documentation/sysctl/vm.txt	2007-04-21 12:39:58.0 +0900
@@ -197,11 +197,22 @@
 panic_on_oom
 
-This enables or disables panic on out-of-memory feature. If this is set to 1,
-the kernel panics when out-of-memory happens. If this is set to 0, the kernel
-will kill some rogue process, called oom_killer. Usually, oom_killer can kill
-rogue processes and system will survive. If you want to panic the system
-rather than killing rogue processes, set this to 1.
+This enables or disables panic on out-of-memory feature.
 
-The default value is 0.
+If this is set to 0, the kernel will kill some rogue process,
+called oom_killer. Usually, oom_killer can kill rogue processes and
+system will survive.
+
+If this is set to 1, the kernel panics when out-of-memory happens.
+However, if a process limits using nodes by mempolicy/cpusets,
+and those nodes become memory exhaustion status, one process
+may be killed by oom-killer. No panic occurs in this case.
+Because other nodes' memory may be free. This means system total status
+may be not fatal yet.
+If this is set to 2, the kernel panics compulsorily even on the
+above-mentioned.
+
+The default value is 0.
+1 and 2 are for failover of clustering. Please select either
+according to your policy of failover.

Index: panic_on_oom2/mm/oom_kill.c
===================================================================
--- panic_on_oom2.orig/mm/oom_kill.c	2007-04-21 12:39:09.0 +0900
+++ panic_on_oom2/mm/oom_kill.c	2007-04-21 12:40:31.0 +0900
@@ -409,6 +409,9 @@
 		show_mem();
 	}
 
+	if (sysctl_panic_on_oom == 2)
+		panic("out of memory. Compulsory panic_on_oom is selected.\n");
+
 	cpuset_lock();
 	read_lock(&tasklist_lock);

-- 
Yasunori Goto
Re: [PATCH]Fix parsing kernelcore boot option for ia64
> On Fri, 13 Apr 2007 14:26:22 +0900 Yasunori Goto [EMAIL PROTECTED] wrote:
> > Hello.
> >
> > cmdline_parse_kernelcore() should return the next pointer of the boot
> > option like memparse() does. If not, it causes an eternal loop on an
> > ia64 box.
> >
> > This patch is for 2.6.21-rc6-mm1.
> >
> > Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
> >
> >  arch/ia64/kernel/efi.c |    2 +-
> >  include/linux/mm.h     |    2 +-
> >  mm/page_alloc.c        |    4 ++--
> >  3 files changed, 4 insertions(+), 4 deletions(-)
> >
> > Index: current_test/arch/ia64/kernel/efi.c
> > ===================================================================
> > --- current_test.orig/arch/ia64/kernel/efi.c	2007-04-12 17:33:28.0 +0900
> > +++ current_test/arch/ia64/kernel/efi.c	2007-04-13 12:13:21.0 +0900
> > @@ -424,7 +424,7 @@ efi_init (void)
> >  		} else if (memcmp(cp, "max_addr=", 9) == 0) {
> >  			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
> >  		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
> > -			cmdline_parse_kernelcore(cp+11);
> > +			cmdline_parse_kernelcore(cp+11, &cp);
> >  		} else if (memcmp(cp, "min_addr=", 9) == 0) {
> >  			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
> >  		} else {
> > Index: current_test/mm/page_alloc.c
> > ===================================================================
> > --- current_test.orig/mm/page_alloc.c	2007-04-12 18:25:37.0 +0900
> > +++ current_test/mm/page_alloc.c	2007-04-13 12:12:58.0 +0900
> > @@ -3736,13 +3736,13 @@ void __init free_area_init_nodes(unsigne
> >   * kernelcore=size sets the amount of memory for use for allocations that
> >   * cannot be reclaimed or migrated.
> >   */
> > -int __init cmdline_parse_kernelcore(char *p)
> > +int __init cmdline_parse_kernelcore(char *p, char **retp)
> >  {
> >  	unsigned long long coremem;
> >  	if (!p)
> >  		return -EINVAL;
> >
> > -	coremem = memparse(p, &p);
> > +	coremem = memparse(p, retp);
> >  	required_kernelcore = coremem >> PAGE_SHIFT;
> >
> >  	/* Paranoid check that UL is enough for required_kernelcore */
> > Index: current_test/include/linux/mm.h
> > ===================================================================
> > --- current_test.orig/include/linux/mm.h	2007-04-11 14:15:33.0 +0900
> > +++ current_test/include/linux/mm.h	2007-04-13 12:12:20.0 +0900
> > @@ -1051,7 +1051,7 @@ extern unsigned long find_max_pfn_with_a
> >  extern void free_bootmem_with_active_regions(int nid,
> >  						unsigned long max_low_pfn);
> >  extern void sparse_memory_present_with_active_regions(int nid);
> > -extern int cmdline_parse_kernelcore(char *p);
> > +extern int cmdline_parse_kernelcore(char *p, char **retp);
> >  #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
> >  extern int early_pfn_to_nid(unsigned long pfn);
> >  #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
>
> This will cause all other architectures to crash when kernelcore= is
> used.
>
> I wasn't even aware of this kernelcore thing. It's pretty nasty-looking.
> Yet another reminder that this code hasn't been properly reviewed in the
> past year or three.

Just now, I'm making memory-unplug patches with the current MOVABLE_ZONE
code. So, I might be the first user of it on ia64.

Anyway, I'll try to fix it.

-- 
Yasunori Goto
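The eternal loop comes from the parsing contract the patch restores: like the kernel's memparse(), the parser must hand back the position just past the consumed token so the caller's option-scanning loop can advance. A simplified userspace stand-in (parse_size() is a made-up name, not the kernel function) illustrates the contract:

```c
#include <stdlib.h>
#include <assert.h>

/* Simplified stand-in for the kernel's memparse(): parse a size with an
 * optional K/M/G suffix and store the position just past the parsed
 * token in *retp. A caller scanning "opt1=... opt2=..." relies on *retp
 * moving forward; if the cursor never advances (the bug fixed above),
 * the scan loop never terminates. */
static unsigned long long parse_size(const char *p, char **retp)
{
	unsigned long long v = strtoull(p, retp, 0);

	switch (**retp) {
	case 'G': case 'g': v <<= 30; (*retp)++; break;
	case 'M': case 'm': v <<= 20; (*retp)++; break;
	case 'K': case 'k': v <<= 10; (*retp)++; break;
	}
	return v;
}
```

The fix makes cmdline_parse_kernelcore() follow the same convention by forwarding memparse()'s end pointer to the caller via `retp`.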
Re: [PATCH] Make new setting of panic_on_oom
I tested this patch. It worked well. So, I fixed its description.
Please apply.

-------
The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain. But some
people want failover by panic ASAP even if they are used.
This patch makes a new setting for that request.

This is tested on my ia64 box which has 3 nodes.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-off-by: Benjamin LaHaise [EMAIL PROTECTED]

---
 Documentation/sysctl/vm.txt |   23 +++++++++++++++++------
 mm/oom_kill.c               |    3 +++
 2 files changed, 20 insertions(+), 6 deletions(-)

Index: panic_on_oom2/Documentation/sysctl/vm.txt
===================================================================
--- panic_on_oom2.orig/Documentation/sysctl/vm.txt	2007-04-21 12:39:09.0 +0900
+++ panic_on_oom2/Documentation/sysctl/vm.txt	2007-04-21 12:39:58.0 +0900
@@ -197,11 +197,22 @@
 panic_on_oom
 
-This enables or disables panic on out-of-memory feature. If this is set to 1,
-the kernel panics when out-of-memory happens. If this is set to 0, the kernel
-will kill some rogue process, called oom_killer. Usually, oom_killer can kill
-rogue processes and system will survive. If you want to panic the system
-rather than killing rogue processes, set this to 1.
+This enables or disables panic on out-of-memory feature.
 
-The default value is 0.
+If this is set to 0, the kernel will kill some rogue process,
+called oom_killer. Usually, oom_killer can kill rogue processes and
+system will survive.
+
+If this is set to 1, the kernel panics when out-of-memory happens.
+However, if a process limits using nodes by mempolicy/cpusets,
+and those nodes become memory exhaustion status, one process
+may be killed by oom-killer. No panic occurs in this case.
+Because other nodes' memory may be free. This means system total status
+may be not fatal yet.
+If this is set to 2, the kernel panics compulsorily even on the
+above-mentioned.
+
+The default value is 0.
+1 and 2 are for failover of clustering. Please select either
+according to your policy of failover.
Index: panic_on_oom2/mm/oom_kill.c
===================================================================
--- panic_on_oom2.orig/mm/oom_kill.c	2007-04-21 12:39:09.0 +0900
+++ panic_on_oom2/mm/oom_kill.c	2007-04-21 12:40:31.0 +0900
@@ -409,6 +409,9 @@
 		show_mem();
 	}
 
+	if (sysctl_panic_on_oom == 2)
+		panic("out of memory. Compulsory panic_on_oom is selected.\n");
+
 	cpuset_lock();
 	read_lock(&tasklist_lock);

-- 
Yasunori Goto
Re: [PATCH]Fix parsing kernelcore boot option for ia64
Mel-san.

I tested your patch (Thanks!). It worked. But..

In my understanding, the reason why ia64 doesn't use the early_param()
macro for mem= et al. is that it has to use the mem= option in EFI
handling, which is called before parse_early_param().

The current ia64 boot path is:

  setup_arch()
    -> efi handling
    -> parse_early_param()
    -> numa handling
    -> pgdat/zone init

The kernelcore= option is just used at pgdat/zone initialization.
(no arch dependent part...)

So I think just adding

==
early_param("kernelcore", cmdline_parse_kernelcore);
==

to ia64 is ok. Then, it can be common code. How is this patch?
I confirmed this can work well too.

-------
When the kernelcore boot option is specified, the kernel can't boot up
on ia64. It is the cause of an eternal loop. In addition, its code can
be common code. This is the fix for it.

I tested this patch on my ia64 box.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
-------
 arch/i386/kernel/setup.c   |    1 -
 arch/ia64/kernel/efi.c     |    2 --
 arch/powerpc/kernel/prom.c |    1 -
 arch/ppc/mm/init.c         |    2 --
 arch/x86_64/kernel/e820.c  |    1 -
 include/linux/mm.h         |    1 -
 mm/page_alloc.c            |    3 +++
 7 files changed, 3 insertions(+), 8 deletions(-)

Index: kernelcore/arch/ia64/kernel/efi.c
===================================================================
--- kernelcore.orig/arch/ia64/kernel/efi.c	2007-04-24 15:09:37.0 +0900
+++ kernelcore/arch/ia64/kernel/efi.c	2007-04-24 15:25:22.0 +0900
@@ -423,8 +423,6 @@ efi_init (void)
 			mem_limit = memparse(cp + 4, &cp);
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
-		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
-			cmdline_parse_kernelcore(cp+11);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else {
Index: kernelcore/arch/i386/kernel/setup.c
===================================================================
--- kernelcore.orig/arch/i386/kernel/setup.c	2007-04-24 15:29:20.0 +0900
+++ kernelcore/arch/i386/kernel/setup.c	2007-04-24 15:29:39.0 +0900
@@ -195,7 +195,6 @@ static int __init parse_mem(char *arg)
 	return 0;
 }
 early_param("mem", parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 #ifdef CONFIG_PROC_VMCORE
 /* elfcorehdr= specifies the location of elf core header
Index: kernelcore/arch/powerpc/kernel/prom.c
===================================================================
--- kernelcore.orig/arch/powerpc/kernel/prom.c	2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/powerpc/kernel/prom.c	2007-04-24 15:30:25.0 +0900
@@ -431,7 +431,6 @@ static int __init early_parse_mem(char *
 	return 0;
 }
 early_param("mem", early_parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 /*
  * The device tree may be allocated below our memory limit, or inside the
Index: kernelcore/arch/ppc/mm/init.c
===================================================================
--- kernelcore.orig/arch/ppc/mm/init.c	2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/ppc/mm/init.c	2007-04-24 15:30:56.0 +0900
@@ -214,8 +214,6 @@ void MMU_setup(void)
 	}
 }
 
-early_param("kernelcore", cmdline_parse_kernelcore);
-
 /*
  * MMU_init sets up the basic memory mappings for the kernel,
  * including both RAM and possibly some I/O regions,
Index: kernelcore/arch/x86_64/kernel/e820.c
===================================================================
--- kernelcore.orig/arch/x86_64/kernel/e820.c	2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/x86_64/kernel/e820.c	2007-04-24 15:34:02.0 +0900
@@ -604,7 +604,6 @@ static int __init parse_memopt(char *p)
 	return 0;
 }
 early_param("mem", parse_memopt);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 static int userdef __initdata;
Index: kernelcore/include/linux/mm.h
===================================================================
--- kernelcore.orig/include/linux/mm.h	2007-04-24 15:09:37.0 +0900
+++ kernelcore/include/linux/mm.h	2007-04-24 15:35:52.0 +0900
@@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
Index: kernelcore/mm/page_alloc.c
===================================================================
--- kernelcore.orig/mm/page_alloc.c	2007-04-24 15:09:37.0 +0900
+++ kernelcore/mm/page_alloc.c	2007-04-24 16:00:21.0 +0900
Re: [PATCH]Fix parsing kernelcore boot option for ia64
> Subject: Check zone boundaries when freeing bootmem
>
> Zone boundaries do not have to be aligned to MAX_ORDER_NR_PAGES.

Hmm. I don't understand here yet... Could you explain more?

This issue occurs only when ZONE_MOVABLE is specified. If its boundary
is aligned to MAX_ORDER automatically, I guess users will not mind it.
From the memory hotplug view, I prefer section size alignment to make
the code simple. :-P

> However, during boot, there is an implicit assumption that they are
> aligned to a BITS_PER_LONG boundary when freeing pages as quickly as
> possible. This patch checks the zone boundaries when freeing pages
> from the bootmem allocator.

Anyway, the patch works well.

Bye.

-- 
Yasunori Goto
Re: [PATCH 2/2] Align ZONE_MOVABLE to a MAX_ORDER_NR_PAGES boundary
Looks good. :-) Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]

> The boot memory allocator makes assumptions on the alignment of zone
> boundaries even though the buddy allocator has no requirements on the
> alignment of zones. This may cause boot problems in situations where
> ZONE_MOVABLE is populated because the bootmem allocator assumes zones
> are at least order-log2(BITS_PER_LONG) aligned. As the two potential
> users (huge pages and memory hot-remove) of ZONE_MOVABLE would prefer
> a higher alignment, this patch aligns the start of the zone instead of
> fixing the different assumptions made by the bootmem allocator.
>
> This patch rounds the start of ZONE_MOVABLE in each node to a
> MAX_ORDER_NR_PAGES boundary. If the rounding pushes the start of
> ZONE_MOVABLE above the end of the node then the zone will contain no
> memory and will not be used at runtime. The value is rounded up
> instead of down as it is better to have the kernel-portion of memory
> larger than requested instead of smaller. The impact is that the
> kernel-usable portion of memory becomes a minimum guarantee instead of
> the exact size requested by the user.
> Signed-off-by: Mel Gorman [EMAIL PROTECTED]
> Acked-by: Andy Whitcroft [EMAIL PROTECTED]
> ---
>  page_alloc.c |    5 +++++
>  1 files changed, 5 insertions(+)
>
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c
> --- linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c	2007-04-24 09:38:30.0 +0100
> +++ linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c	2007-04-24 11:15:40.0 +0100
> @@ -3642,6 +3642,11 @@ restart:
>  		usable_nodes--;
>  	if (usable_nodes && required_kernelcore > usable_nodes)
>  		goto restart;
> +
> +	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
> +	for (nid = 0; nid < MAX_NUMNODES; nid++)
> +		zone_movable_pfn[nid] =
> +			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>  }
>
>  /**

-- 
Yasunori Goto
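The roundup() used in the hunk above is the usual round-up-to-multiple helper. A minimal C sketch of the alignment step (pfn_roundup() is a stand-in name, and 1024 below merely stands in for MAX_ORDER_NR_PAGES, whose real value is configuration dependent):

```c
#include <assert.h>

/* roundup(x, y): the smallest multiple of y that is >= x. Rounding the
 * ZONE_MOVABLE start pfn *up* means the zone can only shrink, so the
 * kernel-usable portion becomes a minimum guarantee, as the changelog
 * says. */
static unsigned long pfn_roundup(unsigned long x, unsigned long y)
{
	return ((x + y - 1) / y) * y;
}
```

If the rounded start lands past the node's last pfn, the zone ends up empty, which matches the "contain no memory and will not be used at runtime" case described above.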
Re: [PATCH] FRV: Fix unannotated variable declarations
> From: David Howells [EMAIL PROTECTED]
>
> Fix unannotated variable declarations. Variables that have allocation
> section annotations (such as __meminitdata) on their definitions must
> also have them on their declarations as not doing so may affect the
> addressing mode used by the compiler and may result in a linker error.

Right. Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]

-- 
Yasunori Goto
Re: 2.6.21-rc4-mm1 + 3 hot-fixes -- WARNING: could not find versions for .tmp_versions/built-in.mod
Hello.

> > WARNING: mm/built-in.o - Section mismatch: reference to
> > .init.text:__alloc_bootmem_node from .text between 'sparse_init'
> > (at offset 0x15c8f) and '__section_nr'
>
> I took a look at this one. You have SPARSEMEM enabled in your config.
> And then I see that in sparse.c we call alloc_bootmem_node() from a
> function I thought should be marked __devinit (it is used by
> memory_hotplug.c). But I am not familiar enough to judge if
> __alloc_bootmem_node is marked correctly with __init, or if __devinit
> (to say this is used in the HOTPLUG case) is more correct. Anyone?
>
> > WARNING: mm/built-in.o - Section mismatch: reference to
> > .init.text:__alloc_bootmem_node from .text between 'sparse_init'
> > (at offset 0x15d02) and '__section_nr'
>
> Same as above

Memory hotplug code has __meminit for its purpose. But, I suspect that
many other places in the memory hotplug code may have the same issue.
I will chase them.

BTW, does the -mm tree check more strictly than the stock kernel?
I can't see these warnings in 2.6.21-rc4.

Bye.

-- 
Yasunori Goto
[PATCH]Fix parsing kernelcore boot option for ia64
Hello.

cmdline_parse_kernelcore() should return the next pointer of the boot
option like memparse() does. If not, it causes an eternal loop on an
ia64 box.

This patch is for 2.6.21-rc6-mm1.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 arch/ia64/kernel/efi.c |    2 +-
 include/linux/mm.h     |    2 +-
 mm/page_alloc.c        |    4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: current_test/arch/ia64/kernel/efi.c
===================================================================
--- current_test.orig/arch/ia64/kernel/efi.c	2007-04-12 17:33:28.0 +0900
+++ current_test/arch/ia64/kernel/efi.c	2007-04-13 12:13:21.0 +0900
@@ -424,7 +424,7 @@ efi_init (void)
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
-			cmdline_parse_kernelcore(cp+11);
+			cmdline_parse_kernelcore(cp+11, &cp);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else {
Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-04-12 18:25:37.0 +0900
+++ current_test/mm/page_alloc.c	2007-04-13 12:12:58.0 +0900
@@ -3736,13 +3736,13 @@ void __init free_area_init_nodes(unsigne
  * kernelcore=size sets the amount of memory for use for allocations that
  * cannot be reclaimed or migrated.
  */
-int __init cmdline_parse_kernelcore(char *p)
+int __init cmdline_parse_kernelcore(char *p, char **retp)
 {
 	unsigned long long coremem;
 	if (!p)
 		return -EINVAL;
 
-	coremem = memparse(p, &p);
+	coremem = memparse(p, retp);
 	required_kernelcore = coremem >> PAGE_SHIFT;
 
 	/* Paranoid check that UL is enough for required_kernelcore */
Index: current_test/include/linux/mm.h
===================================================================
--- current_test.orig/include/linux/mm.h	2007-04-11 14:15:33.0 +0900
+++ current_test/include/linux/mm.h	2007-04-13 12:12:20.0 +0900
@@ -1051,7 +1051,7 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
+extern int cmdline_parse_kernelcore(char *p, char **retp);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

-- 
Yasunori Goto
[PATCH] fix BUG_ON check at move_freepages() (Re: 2.6.21-rc3-mm2)
Hello.

The BUG_ON() check at move_freepages() is wrong. Its end_page is
start_page + MAX_ORDER_NR_PAGES, so it can be in the next zone.
BUG_ON() should check end_page - 1.

This is the fix of 2.6.21-rc3-mm2 for it.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-03-08 15:44:10.0 +0900
+++ current_test/mm/page_alloc.c	2007-03-08 16:17:29.0 +0900
@@ -707,7 +707,7 @@ int move_freepages(struct zone *zone,
 	unsigned long order;
 	int blocks_moved = 0;
 
-	BUG_ON(page_zone(start_page) != page_zone(end_page));
+	BUG_ON(page_zone(start_page) != page_zone(end_page - 1));
 
 	for (page = start_page; page < end_page;) {
 		if (!PageBuddy(page)) {

-- 
Yasunori Goto
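The bug is the classic half-open-range pitfall: the loop walks [start_page, end_page), so end_page itself is one past the range and may already belong to the next zone. A toy model of the corrected check (zone_of() is a made-up stand-in for page_zone(), with hypothetical 100-page zones):

```c
#include <assert.h>

/* Stand-in for page_zone(): pretend every 100 consecutive pfns form
 * one zone, so pfn 99 and pfn 100 sit in different zones. */
static int zone_of(unsigned long pfn)
{
	return (int)(pfn / 100);
}

/* The corrected sanity check: the last pfn the loop actually touches
 * is end - 1, so that is what must share a zone with start; checking
 * end itself can trip on a range that legally ends at a zone border. */
static int range_in_one_zone(unsigned long start, unsigned long end)
{
	return zone_of(start) == zone_of(end - 1);
}
```

With the old check, a range ending exactly at a zone boundary (end == first pfn of the next zone) would trigger the BUG_ON even though every page moved is inside one zone.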
[RFC:PATCH] Register memory init functions into white list of section mismatch.
> > > WARNING: mm/built-in.o - Section mismatch: reference to
> > > .init.text:__alloc_bootmem_node from .text between 'sparse_init'
> > > (at offset 0x15c8f) and '__section_nr'
> >
> > I took a look at this one. You have SPARSEMEM enabled in your config.
> > And then I see that in sparse.c we call alloc_bootmem_node() from a
> > function I thought should be marked __devinit (it is used by
> > memory_hotplug.c). But I am not familiar enough to judge if
> > __alloc_bootmem_node is marked correctly with __init, or if __devinit
> > (to say this is used in the HOTPLUG case) is more correct. Anyone?
> >
> > > WARNING: mm/built-in.o - Section mismatch: reference to
> > > .init.text:__alloc_bootmem_node from .text between 'sparse_init'
> > > (at offset 0x15d02) and '__section_nr'
> >
> > Same as above
>
> Memory hotplug code has __meminit for its purpose. But, I suspect that
> many other places in the memory hotplug code may have the same issue.
> I will chase them.

Hello.

I chased the section mismatch code in the memory hotplug code. Many of
them should be defined as __meminit. (This check was a great help for
finding them. Thanks!)

But, I would like to add a new pattern to the white list for some of
them. (I'll post another patch for the others.)

sparse.c (sparse_index_alloc()) calls alloc_bootmem_node() as you
mentioned. And zone_wait_table_init() calls it too. These functions
call it only at boot time, and call vmalloc()/kmalloc() at hotplug
time. This is distinguished by the system_state value or
slab_is_available(). Just references to them remain after boot.

Bootmem allocation functions are called by many functions and must be
used only at boot time. I think their __init should be kept for the
section mismatch check. So, I would like to register
sparse_index_alloc() and zone_wait_table_init() into the white list.

Please comment. If there is a better way, please let me know...

Thanks.

P.S.
Pattern 10 is for ia64 (not for memory hotplug). ia64's .machvec
section is a mixture table of .init functions and normal text. It is
defined for platform dependent functions.
This is also a cause of warnings. I think this should be registered too.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c       |    2 +-
 mm/sparse.c           |    2 +-
 scripts/mod/modpost.c |   29 +++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

Index: current_test/scripts/mod/modpost.c
===================================================================
--- current_test.orig/scripts/mod/modpost.c	2007-03-27 20:21:20.0 +0900
+++ current_test/scripts/mod/modpost.c	2007-03-29 14:16:05.0 +0900
@@ -643,6 +643,17 @@ static int strrcmp(const char *s, const
  *  The pattern is:
  *  tosec   = .init.text
  *  fromsec = __ksymtab*
+ *
+ * Pattern 9:
+ *  Some functions are common code between boot time and hotplug time.
+ *  The bootmem allocator is called only at boot time in these
+ *  functions. So it's ok to reference.
+ *  tosec   = .init.text
+ *
+ * Pattern 10:
+ *  ia64 has a machvec table for each platform. It is a mixture of
+ *  function pointers of .init.text and .text.
+ *  fromsec = .machvec
 **/
 static int secref_whitelist(const char *modname, const char *tosec,
 			    const char *fromsec, const char *atsym,
@@ -669,6 +680,12 @@ static int secref_whitelist(const char *
 		NULL
 	};
 
+	const char *pat4sym[] = {
+		"sparse_index_alloc",
+		"zone_wait_table_init",
+		NULL
+	};
+
 	/* Check for pattern 1 */
 	if (strcmp(tosec, ".init.data") != 0)
 		f1 = 0;
@@ -725,6 +742,18 @@ static int secref_whitelist(const char *
 	if ((strcmp(tosec, ".init.text") == 0) &&
 	    (strncmp(fromsec, "__ksymtab", strlen("__ksymtab")) == 0))
 		return 1;
+
+	/* Check for pattern 9 */
+	if ((strcmp(tosec, ".init.text") == 0) &&
+	    (strcmp(fromsec, ".text") == 0))
+		for (s = pat4sym; *s; s++)
+			if (strcmp(atsym, *s) == 0)
+				return 1;
+
+	/* Check for pattern 10 */
+	if (strcmp(fromsec, ".machvec") == 0)
+		return 1;
+
 	return 0;
 }

Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-03-27 16:04:41.0 +0900
+++ current_test/mm/page_alloc.c	2007-03-29 14:14:42.0 +0900
@@ -2673,7 +2673,7 @@ void __init setup_per_cpu_pageset(void)
 
 #endif
 
-static __meminit
+static __meminit noinline
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
 	int i;

Index: current_test/mm/sparse.c
===================================================================
--- current_test.orig/mm/sparse.c	2007-03-27 16:04:41.0 +0900
+++ current_test/mm/sparse.c	2007-03-29 14:15:00.0 +0900
@@ -44,7 +44,7
[Patch] Fix section mismatch of memory hotplug related code.
Hello.

This is to fix many section mismatches in code related to memory
hotplug. I checked compilation with memory hotplug on/off on ia64 and
x86-64 boxes.

This patch is for 2.6.21-rc5-mm4. Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 arch/ia64/mm/discontig.c |    2 ++
 arch/x86_64/mm/init.c    |    6 +++---
 drivers/acpi/numa.c      |    4 ++--
 mm/page_alloc.c          |   30 +++++++++++++++---------------
 mm/sparse.c              |   12 +++++++-----
 5 files changed, 29 insertions(+), 25 deletions(-)

Index: meminit/mm/sparse.c
===================================================================
--- meminit.orig/mm/sparse.c	2007-04-04 20:15:58.0 +0900
+++ meminit/mm/sparse.c	2007-04-04 20:55:44.0 +0900
@@ -61,7 +61,7 @@ static struct mem_section *sparse_index_
 	return section;
 }
 
-static int sparse_index_init(unsigned long section_nr, int nid)
+static int __meminit sparse_index_init(unsigned long section_nr, int nid)
 {
 	static DEFINE_SPINLOCK(index_init_lock);
 	unsigned long root = SECTION_NR_TO_ROOT(section_nr);
@@ -138,7 +138,7 @@ static inline int sparse_early_nid(struc
 }
 
 /* Record a memory area against a node. */
-void memory_present(int nid, unsigned long start, unsigned long end)
+void __init memory_present(int nid, unsigned long start, unsigned long end)
 {
 	unsigned long pfn;
 
@@ -197,7 +197,7 @@ struct page *sparse_decode_mem_map(unsig
 	return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum);
 }
 
-static int sparse_init_one_section(struct mem_section *ms,
+static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
 		unsigned long *pageblock_bitmap)
 {
@@ -211,7 +211,7 @@ static int sparse_init_one_section(struc
 	return 1;
 }
 
-static struct page *sparse_early_mem_map_alloc(unsigned long pnum)
+static struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 {
 	struct page *map;
 	struct mem_section *ms = __nr_to_section(pnum);
@@ -301,7 +301,7 @@ static unsigned long *sparse_early_usema
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
  */
-void sparse_init(void)
+void __init sparse_init(void)
 {
 	unsigned long pnum;
 	struct page *map;
@@ -324,6 +324,7 @@ void sparse_init(void)
 	}
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * returns the number of sections whose mem_maps were properly
  * set. If this is <=0, then that means that the passed-in
@@ -370,3 +371,4 @@ out:
 		__kfree_section_memmap(memmap, nr_pages);
 	return ret;
 }
+#endif

Index: meminit/arch/ia64/mm/discontig.c
===================================================================
--- meminit.orig/arch/ia64/mm/discontig.c	2007-04-04 20:15:58.0 +0900
+++ meminit/arch/ia64/mm/discontig.c	2007-04-04 20:16:02.0 +0900
@@ -696,6 +696,7 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 pg_data_t *arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
@@ -713,3 +714,4 @@ void arch_refresh_nodedata(int update_no
 	pgdat_list[update_node] = update_pgdat;
 	scatter_node_data();
 }
+#endif

Index: meminit/mm/page_alloc.c
===================================================================
--- meminit.orig/mm/page_alloc.c	2007-04-04 20:15:58.0 +0900
+++ meminit/mm/page_alloc.c	2007-04-04 20:55:44.0 +0900
@@ -105,7 +105,7 @@ int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
-static unsigned long __initdata dma_reserve;
+static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
 /*
@@ -128,16 +128,16 @@ static unsigned long __initdata dma_rese
 #endif
 #endif
 
-  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
-  int __initdata nr_nodemap_entries;
-  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
-  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+  struct node_active_region __meminitdata early_node_map[MAX_ACTIVE_REGIONS];
+  int __meminitdata nr_nodemap_entries;
+  unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+  unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
   unsigned long __initdata node_boundary_start_pfn[MAX_NUMNODES];
   unsigned long __initdata node_boundary_end_pfn[MAX_NUMNODES];
 #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
   unsigned long __initdata required_kernelcore;
-  unsigned long __initdata zone_movable_pfn[MAX_NUMNODES];
+  unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
 
   /* movable_zone is the real zone pages in ZONE_MOVABLE are taken from
[Patch] Add white list into modpost.c for memory hotplug code and ia64's machvec section
This patch adds a white list into modpost.c for some functions and
ia64's section to fix section mismatches.

sparse_index_alloc() and zone_wait_table_init() call the bootmem
allocator at boot time, and kmalloc/vmalloc at hotplug time. If memory
hotplug is configured on, there are references to the bootmem allocator
(init text) from them (normal text). This is the cause of the section
mismatch.

Bootmem is called by many functions and it must be used only at boot
time. I think their __init should be kept for the section mismatch
check. So, I would like to register sparse_index_alloc() and
zone_wait_table_init() into the white list.

In addition, ia64's .machvec section is a function table of some
platform dependent code. It is a mixture of .init.text and normal
text. These references to __init functions are valid too.

This is for 2.6.21-rc5-mm4. Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c       |    2 +-
 mm/sparse.c           |    2 +-
 scripts/mod/modpost.c |   28 ++++++++++++++++++++++++++++
 3 files changed, 30 insertions(+), 2 deletions(-)

Index: current_test/scripts/mod/modpost.c
===================================================================
--- current_test.orig/scripts/mod/modpost.c	2007-04-03 16:04:57.0 +0900
+++ current_test/scripts/mod/modpost.c	2007-04-03 16:09:59.0 +0900
@@ -649,6 +649,17 @@ static int strrcmp(const char *s, const
  *  The pattern is:
  *  tosec   = .init.text
  *  fromsec = .paravirtprobe
+ *
+ * Pattern 10:
+ *  Some functions are common code between boot time and hotplug time.
+ *  The bootmem allocator is called only at boot time in these
+ *  functions. So it's ok to reference.
+ *  tosec   = .init.text
+ *
+ * Pattern 11:
+ *  ia64 has a machvec table for each platform. It is a mixture of
+ *  function pointers of .init.text and .text.
+ * fromsec = .machvec **/ static int secref_whitelist(const char *modname, const char *tosec, const char *fromsec, const char *atsym, @@ -675,6 +686,12 @@ static int secref_whitelist(const char * NULL }; + const char *pat4sym[] = { + sparse_index_alloc, + zone_wait_table_init, + NULL + }; + /* Check for pattern 1 */ if (strcmp(tosec, .init.data) != 0) f1 = 0; @@ -738,6 +755,17 @@ static int secref_whitelist(const char * (strcmp(fromsec, .paravirtprobe) == 0)) return 1; + /* Check for pattern 10 */ + if ((strcmp(tosec, .init.text) == 0) + (strcmp(fromsec, .text) == 0)) + for (s = pat4sym; *s; s++) + if (strcmp(atsym, *s) == 0) + return 1; + + /* Check for pattern 11 */ + if (strcmp(fromsec, .machvec) == 0) + return 1; + return 0; } Index: current_test/mm/page_alloc.c === --- current_test.orig/mm/page_alloc.c 2007-04-03 16:04:57.0 +0900 +++ current_test/mm/page_alloc.c2007-04-03 16:05:26.0 +0900 @@ -2667,7 +2667,7 @@ void __init setup_per_cpu_pageset(void) #endif -static __meminit +static __meminit noinline int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages) { int i; Index: current_test/mm/sparse.c === --- current_test.orig/mm/sparse.c 2007-04-03 16:04:57.0 +0900 +++ current_test/mm/sparse.c2007-04-03 16:05:26.0 +0900 @@ -44,7 +44,7 @@ EXPORT_SYMBOL(page_to_nid); #endif #ifdef CONFIG_SPARSEMEM_EXTREME -static struct mem_section *sparse_index_alloc(int nid) +static struct mem_section noinline *sparse_index_alloc(int nid) { struct mem_section *section = NULL; unsigned long array_size = SECTIONS_PER_ROOT * -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
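The whitelist check the patch adds boils down to plain string matching on the (tosec, fromsec, symbol) triple. A minimal userspace sketch of that logic (names are illustrative; the real modpost code is structured differently):

```c
#include <string.h>
#include <assert.h>

/* Pattern 10: these .text functions may legitimately reference .init.text,
   because their bootmem path is only taken at boot time. */
static const char *safe_init_callers[] = {
    "sparse_index_alloc",
    "zone_wait_table_init",
    NULL
};

/* Returns 1 if the cross-section reference should be ignored. */
int secref_whitelisted(const char *tosec, const char *fromsec, const char *atsym)
{
    const char **s;

    /* Pattern 10: whitelisted symbols referencing init text from .text. */
    if (strcmp(tosec, ".init.text") == 0 && strcmp(fromsec, ".text") == 0)
        for (s = safe_init_callers; *s; s++)
            if (strcmp(atsym, *s) == 0)
                return 1;

    /* Pattern 11: ia64's .machvec tables legitimately mix .init.text
       and .text function pointers. */
    if (strcmp(fromsec, ".machvec") == 0)
        return 1;

    return 0;
}
```

Anything not matching a pattern still produces a section-mismatch warning, which is the point of keeping the `__init` annotations in place.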
Re: [RFC PATCH 4/4] [RESEND] Recomputing msgmni on memory add / remove
Hello Nadia-san. @@ -118,6 +122,10 @@ struct ipc_namespace { size_t shm_ctlall; int shm_ctlmni; int shm_tot; + +#ifdef CONFIG_MEMORY_HOTPLUG + struct notifier_block ipc_memory_hotplug; +#endif }; I'm sorry, but I don't see why each ipc namespace must have its own memory hotplug callback. I would prefer one callback per subsystem, not one per namespace. In addition, the recompute_msgmni() calculation looks very similar for all ipc namespaces. Or do you want each ipc namespace to have a different callback in the future? BTW, have you ever tested this patch? If you don't have a test environment for the memory hotplug code, then I'll check it. :-) Bye. -- Yasunori Goto

Re: [RFC PATCH 4/4] [RESEND] Recomputing msgmni on memory add / remove
Yasunori Goto wrote: Hello Nadia-san. @@ -118,6 +122,10 @@ struct ipc_namespace { size_t shm_ctlall; int shm_ctlmni; int shm_tot; + +#ifdef CONFIG_MEMORY_HOTPLUG + struct notifier_block ipc_memory_hotplug; +#endif }; I'm sorry, but I don't see why each ipc namespace must have its own memory hotplug callback. I would prefer one callback per subsystem, not one per namespace. In addition, the recompute_msgmni() calculation looks very similar for all ipc namespaces. Or do you want each ipc namespace to have a different callback in the future? Actually, this is what I wanted to do at the very beginning: have a single callback that would recompute the msgmni for each ipc namespace. But the issue here is that the namespaces are not linked to each other, so I had no simple way to go through all the namespaces. I solved the issue by having a callback for each single ipc namespace and making it recompute the msgmni value for itself. The recompute_msgmni() must be called when a new ipc_namespace is created/removed, as you mentioned. I think the namespaces should be linked to each other for this in the end. BTW, have you ever tested this patch? If you don't have a test environment for the memory hotplug code, then I'll check it. :-) Well, I tested it, but not in a real configuration: what I did is change the status by hand under sysfs to offline. I also changed remove_memory() in mm/memory_hotplug.c in the following way (instead of returning -EINVAL): 1) decrease the total_ram pages, 2) call memory_notify(MEM_OFFLINE, NULL), and checked that the msgmni was recomputed. You can also online the memory again after offlining it by writing to sysfs. But sure, if you are a candidate to test it, that would be great! Ok. I'll check it too. Bye. -- Yasunori Goto
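The design question above — one notifier per subsystem walking all namespaces, versus one notifier embedded in each namespace — can be sketched in userspace C. Everything here (the list link, the msgmni formula) is a simplified stand-in for illustration, not the kernel implementation:

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical model: keep ipc namespaces on a shared list so ONE memory
   hotplug callback can recompute msgmni for all of them. */
struct ipc_namespace {
    int msgmni;
    struct ipc_namespace *next;   /* stand-in for a kernel list_head */
};

static struct ipc_namespace *ns_list;

/* Toy recomputation: derive msgmni from the amount of low memory (pages).
   The real formula lives in ipc/msg.c; this only shows the iteration. */
static void recompute_msgmni(struct ipc_namespace *ns, unsigned long lowmem_pages)
{
    ns->msgmni = (int)(lowmem_pages / 32);   /* placeholder formula */
}

/* Single notifier body: walk every namespace on memory add/remove. */
void ipc_memory_callback(unsigned long lowmem_pages)
{
    struct ipc_namespace *ns;
    for (ns = ns_list; ns; ns = ns->next)
        recompute_msgmni(ns, lowmem_pages);
}

/* Demo helper: two namespaces, one notification, summed result. */
int demo_total_msgmni(unsigned long lowmem_pages)
{
    struct ipc_namespace a = { 0, NULL }, b = { 0, &a };
    ns_list = &b;
    ipc_memory_callback(lowmem_pages);
    return a.msgmni + b.msgmni;
}
```

This is exactly the "namespaces should be linked to each other" direction discussed above; the per-namespace notifier_block becomes unnecessary once the list exists.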
Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.
On Tue, 11 Sep 2007 18:37:12 +0900 Yasunori Goto [EMAIL PROTECTED] wrote: + if (onlined_pages){ Nit, needs a space there before the '{'. Ah, Ok. I attached fixed patch in this mail. The problem as I see it is that when we boot the system we start a kswapd on all nodes with memory. If the hot-add adds memory to a pre-existing node with no memory we will not start one and we end up with a node with memory and no kswapd. Bad. As kswapd_run is a no-op when a kswapd already exists this seems a safe way to fix that. Paul's -zone conversion is obviously correct also. Acked-by: Andy Whitcroft [EMAIL PROTECTED] Thanks for your explanation. You mentioned all of my intention correctly. :-) Fix kswapd doesn't run when memory is added on memory-less-node. Fix compile error of zone-node when CONFIG_NUMA is off. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] Signed-off-by: Paul Mundt [EMAIL PROTECTED] Acked-by: Andy Whitcroft [EMAIL PROTECTED] --- mm/memory_hotplug.c |9 - 1 file changed, 4 insertions(+), 5 deletions(-) Index: current/mm/memory_hotplug.c === --- current.orig/mm/memory_hotplug.c2007-09-07 18:08:07.0 +0900 +++ current/mm/memory_hotplug.c 2007-09-11 17:29:19.0 +0900 @@ -211,10 +211,12 @@ int online_pages(unsigned long pfn, unsi online_pages_range); zone-present_pages += onlined_pages; zone-zone_pgdat-node_present_pages += onlined_pages; - if (onlined_pages) - node_set_state(zone-node, N_HIGH_MEMORY); setup_per_zone_pages_min(); + if (onlined_pages) { + kswapd_run(zone_to_nid(zone)); + node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); + } if (need_zonelists_rebuild) build_all_zonelists(); @@ -269,9 +271,6 @@ int add_memory(int nid, u64 start, u64 s if (!pgdat) return -ENOMEM; new_pgdat = 1; - ret = kswapd_run(nid); - if (ret) - goto error; } /* call arch's memory hotadd */ OK, we're getting into a mess here. This patch fixes update-n_high_memory-node-state-for-memory-hotadd.patch, but which patch does update-n_high_memory-node-state-for-memory-hotadd.patch fix? 
At present I just whacked update-n_high_memory-node-state-for-memory-hotadd.patch at the end of everything, but that was lazy of me and it ends up making a mess. It is enough. No more patches are necessary for these issues. I have already addressed Andy-san's comment. :-) Thanks. -- Yasunori Goto
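The fix in this thread relies on kswapd_run() being a no-op when the node's kswapd already exists, so calling it from online_pages() is safe both for nodes that booted with memory and for formerly memoryless nodes. A toy model of that invariant (hypothetical simplified names, not the kernel implementation):

```c
#include <assert.h>

#define MAX_NODES 8

static int kswapd_running[MAX_NODES];
static int kswapd_starts[MAX_NODES];   /* how many real starts happened */

void kswapd_run(int nid)
{
    if (kswapd_running[nid])   /* already running: no-op, as in the kernel */
        return;
    kswapd_running[nid] = 1;
    kswapd_starts[nid]++;
}

/* Called after pages were successfully onlined on node nid. */
void online_pages_tail(int nid, unsigned long onlined_pages)
{
    if (onlined_pages)
        kswapd_run(nid);   /* covers hot-add to a formerly memoryless node */
}

int kswapd_start_count(int nid) { return kswapd_starts[nid]; }

/* Demo: no start when nothing was onlined; exactly one start afterwards. */
int demo_kswapd(void)
{
    online_pages_tail(3, 0);          /* nothing onlined: no kswapd */
    int before = kswapd_start_count(3);
    online_pages_tail(3, 512);        /* first memory: start kswapd */
    online_pages_tail(3, 512);        /* more memory: no second start */
    return before * 10 + kswapd_start_count(3);
}
```

The patch moves the call out of add_memory() precisely so the start happens when memory actually becomes usable, not when the pgdat is created.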
Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.
On Fri, 14 Sep 2007 11:02:43 +0900 Yasunori Goto [EMAIL PROTECTED] wrote: /* call arch's memory hotadd */ OK, we're getting into a mess here. This patch fixes update-n_high_memory-node-state-for-memory-hotadd.patch, but which patch does update-n_high_memory-node-state-for-memory-hotadd.patch fix? At present I just whacked update-n_high_memory-node-state-for-memory-hotadd.patch at the end of everything, but that was lazy of me and it ends up making a mess. It is enough. No more patches are necessary for these issues. I have already addressed Andy-san's comment. :-) Now I'm more confused. I have two separate questions: a) Is the just-added update-n_high_memory-node-state-for-memory-hotadd-fix.patch still needed? I'm not sure of the exact meaning of "just-added". But update-n_high_memory-node-state-for-memory-hotadd-fix.patch is necessary for 2.6.23-rc4-mm1. b) Which patch in 2.6.23-rc4-mm1 does update-n_high_memory-node-state-for-memory-hotadd.patch fix? In other words, into which patch should I fold update-n_high_memory-node-state-for-memory-hotadd.patch prior to sending to Linus? In my understanding, update-n_high_memory-node-state-for-memory-hotadd.patch should be folded into the memoryless-nodes-*.patch series. It sets N_HIGH_MEMORY for a new node-with-memory. But if you need a more specific patch: because N_HIGH_MEMORY is set in memoryless-nodes-introduce-mask-of-nodes-with-memory.patch, I suppose update-n_high_memory-node-state-for-memory-hotadd.patch should be folded into it. update-n_high_memory-node-state-for-memory-hotadd-fix.patch fixes both update-n_high_memory-node-state-for-memory-hotadd.patch and memoryless-nodes-no-need-for-kswapd.patch. Is this enough to answer your question? Or is it more confusing? (I (usually) get to work this out for myself. Sometimes it is painful). Generally, if people tell me which patch-in-mm their patch is fixing, it really helps. Adrian does this all the time. Sorry for the confusion...
-- Yasunori Goto
Re: EIP is at device_shutdown+0x32/0x60
On Thu, 15 Nov 2007 12:11:58 +0300 Alexey Dobriyan [EMAIL PROTECTED] wrote: Three boxes rarely oops during reboot or poweroff with 2.6.24-rc2-mm1 (1) and during the 2.6.24 cycle (2): kernel_restart sys_reboot [garbage] Code: 8b 88 a8 00 00 00 85 c9 74 04 89 EIP is at device_shutdown+0x32/0x60 Yes, all my test boxes did that - it's what I referred to in the release notes. Greg is pondering the problem - it seems he's the only person who cannot reproduce it ;) Fortunately, my ia64 box reproduces this oops every time, so I could chase it. The device_shutdown() function in drivers/base/power/shutdown.c is as follows:

---
/**
 * device_shutdown - call ->shutdown() on each device to shutdown.
 */
void device_shutdown(void)
{
	struct device *dev, *devn;

	list_for_each_entry_safe_reverse(dev, devn, &devices_kset->list,
					 kobj.entry) {
		if (dev->bus && dev->bus->shutdown) {
			dev_dbg(dev, "shutdown\n");
			dev->bus->shutdown(dev);
		} else if (dev->driver && dev->driver->shutdown) {
			dev_dbg(dev, "shutdown\n");
			dev->driver->shutdown(dev);
		}
	}
}
---

When the oops occurred, dev->driver pointed at kset_ktype's address, and dev->driver->shutdown was the address of bus_type_list. So the oops was caused by an illegal operation fault. kset_ktype is pointed to by system_kset. If my understanding is correct, this loop can't distinguish between struct device and struct kset, but both are connected in this list, right? It may be the cause of this. Bye. -- Yasunori Goto
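The suspected bug — a shutdown loop that treats every entry on a shared kobject list as a struct device — can be modeled in userspace. The explicit type tag below is purely illustrative (the actual fix was to keep the kset off the devices list entirely), but it shows the discrimination the loop lacked:

```c
#include <stddef.h>
#include <assert.h>

/* Userspace model: devices and ksets linked on one list, but the walker
   reads device fields through every entry. With the wrong type, those
   reads return garbage, as in the reported oops. */
struct kobj_entry {
    enum { IS_DEVICE, IS_KSET } type;
    void (*shutdown)(void);          /* meaningful only for IS_DEVICE */
    struct kobj_entry *next;
};

static int shutdown_calls;
static void dummy_shutdown(void) { shutdown_calls++; }

/* Safe walk: skip entries that are not devices before using device fields. */
void device_shutdown_safe(struct kobj_entry *head)
{
    struct kobj_entry *e;
    for (e = head; e; e = e->next) {
        if (e->type != IS_DEVICE)   /* the missing discrimination */
            continue;
        if (e->shutdown)
            e->shutdown();
    }
}

/* Demo: one kset followed by one device; only the device is shut down. */
int demo_shutdown(void)
{
    struct kobj_entry d = { IS_DEVICE, dummy_shutdown, NULL };
    struct kobj_entry k = { IS_KSET, NULL, &d };
    shutdown_calls = 0;
    device_shutdown_safe(&k);
    return shutdown_calls;
}
```

The kernel has no such tag on kobjects, which is why the eventual fix (next message) made the kset a parent rather than a list member.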
Re: EIP is at device_shutdown+0x32/0x60
Care to try this: + system_kset = kset_create_and_register("system", NULL, + &devices_kset->kobj, NULL); We should not join the kset, only use it as a parent. Yes, that fixes the problem for me! Can anyone else verify this? I confirmed it fixed the problem. :-) Thanks. -- Yasunori Goto
Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec
I'll try Milton's suggestion to pre-allocate the memory early. It seems that should work as long as nothing else before the hot-plug mem is added needs a large chunk. Hello, Geoff-san. Sorry for the late response. Could you tell me the value of the following page_size calculation in vmemmap_populate()? I think this page_size may be too big a value. -- int __meminit vmemmap_populate(struct page *start_page, unsigned long nr_pages, int node) : : unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift; : --- In addition, I remember that the current add_memory() is designed for only one section's addition. (See memory_probe_store() and sparse_mem_map_populate(): they request a mem_map for only one section by specifying PAGES_PER_SECTION.) The section size for a normal powerpc box is only 16MB (IA64 - 1GB, x86-64 - 128MB). But, if my understanding is correct, PS3's add_memory() requires all of the total memory. I'm afraid some other problems might be hidden in this issue yet. (However, I think Milton-san's suggestion is very desirable. If preallocation for hot-add works on ia64 too, I'd be very glad.) Thanks. -- Yasunori Goto
Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec
On Thu, 6 Dec 2007, Geert Uytterhoeven wrote: On Thu, 6 Dec 2007, Yasunori Goto wrote: I'll try Milton's suggestion to pre-allocate the memory early. It seems that should work as long as nothing else before the hot-plug mem is added needs a large chunk. Hello, Geoff-san. Sorry for the late response. Could you tell me the value of the following page_size calculation in vmemmap_populate()? I think this page_size may be too big a value. -- int __meminit vmemmap_populate(struct page *start_page, unsigned long nr_pages, int node) : : unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift; : --- 24 MiB Bummer, messing up bits and MiB. 16 MiB of course. 16 MiB is not the page size; it is the section size. IIRC, powerpc's page size must be 4K (or 64K). If the page size is 4K, vmemmap_alloc_block() will request an order-12 allocation. Is that really the correct value for vmemmap population? PS3 initially starts with 128 MiB. Later hotplug is used to add the remaining memory (96 or 112 MiB, IIRC). Ok. Then add_memory() must be called 6 or 7 times, once for each section. Thanks. -- Yasunori Goto
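The numbers in this exchange are easy to check: 16 MiB linear pages over 4 KiB base pages means order-12 allocations (2^12 pages), and onlining 96 or 112 MiB in 16 MiB sections takes 6 or 7 add_memory() calls. A small sketch of the arithmetic:

```c
#include <assert.h>

/* Allocation order for a block of block_size bytes made of page_size pages:
   the smallest order such that (1 << order) pages covers the block. */
unsigned int alloc_order(unsigned long block_size, unsigned long page_size)
{
    unsigned long pages = block_size / page_size;
    unsigned int order = 0;
    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Number of one-section add_memory() calls needed for a hotplug region.
   Rounds up, although hotplug sizes are section-aligned in practice. */
unsigned long sections_needed(unsigned long hotplug_bytes,
                              unsigned long section_bytes)
{
    return (hotplug_bytes + section_bytes - 1) / section_bytes;
}
```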
Re: [Patch] mm/sparse.c: Improve the error handling for sparse_add_one_section()
Hi, Cong-san. ms->section_mem_map |= SECTION_MARKED_PRESENT; ret = sparse_init_one_section(ms, section_nr, memmap, usemap); out: pgdat_resize_unlock(pgdat, flags); - if (ret <= 0) - __kfree_section_memmap(memmap, nr_pages); + return ret; } #endif Hmm. When sparse_init_one_section() returns an error, memmap and usemap should be freed. Thanks for fixing this. -- Yasunori Goto
Re: [Patch](Resend) mm/sparse.c: Improve the error handling for sparse_add_one_section()
ret = sparse_init_one_section(ms, section_nr, memmap, usemap); @@ -414,7 +418,7 @@ int sparse_add_one_section(struct zone * out: pgdat_resize_unlock(pgdat, flags); if (ret <= 0) - __kfree_section_memmap(memmap, nr_pages); + kfree(usemap); return ret; } #endif I guess you think __kfree_section_memmap() is unnecessary because it has no implementation. But it is still available when CONFIG_SPARSEMEM_VMEMMAP is off, so it should not be removed. Bye. -- Yasunori Goto
Re: [Patch](Resend) mm/sparse.c: Improve the error handling for sparse_add_one_section()
Looks good to me. Thanks. Acked-by: Yasunori Goto [EMAIL PROTECTED] On Tue, Nov 27, 2007 at 10:53:45AM -0800, Dave Hansen wrote: On Tue, 2007-11-27 at 10:26 +0800, WANG Cong wrote: @@ -414,7 +418,7 @@ int sparse_add_one_section(struct zone * out: pgdat_resize_unlock(pgdat, flags); if (ret = 0) - __kfree_section_memmap(memmap, nr_pages); + kfree(usemap); return ret; } #endif Why did you get rid of the memmap free here? A bad return from sparse_init_one_section() indicates that we didn't use the memmap, so it will leak otherwise. Sorry, I was confused by the recursion. This one should be OK. Thanks. Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I see no reason to check 'usemap' until holding the 'pgdat_resize_lock'. Cc: Christoph Lameter [EMAIL PROTECTED] Cc: Dave Hansen [EMAIL PROTECTED] Cc: Rik van Riel [EMAIL PROTECTED] Cc: Yasunori Goto [EMAIL PROTECTED] Cc: Andy Whitcroft [EMAIL PROTECTED] Signed-off-by: WANG Cong [EMAIL PROTECTED] --- Index: linux-2.6/mm/sparse.c === --- linux-2.6.orig/mm/sparse.c +++ linux-2.6/mm/sparse.c @@ -391,9 +391,17 @@ int sparse_add_one_section(struct zone * * no locking for this, because it does its own * plus, it does a kmalloc */ - sparse_index_init(section_nr, pgdat-node_id); + ret = sparse_index_init(section_nr, pgdat-node_id); + if (ret 0) + return ret; memmap = kmalloc_section_memmap(section_nr, pgdat-node_id, nr_pages); + if (!memmap) + return -ENOMEM; usemap = __kmalloc_section_usemap(); + if (!usemap) { + __kfree_section_memmap(memmap, nr_pages); + return -ENOMEM; + } pgdat_resize_lock(pgdat, flags); @@ -403,18 +411,16 @@ int sparse_add_one_section(struct zone * goto out; } - if (!usemap) { - ret = -ENOMEM; - goto out; - } ms-section_mem_map |= SECTION_MARKED_PRESENT; ret = sparse_init_one_section(ms, section_nr, memmap, usemap); out: pgdat_resize_unlock(pgdat, flags); - if (ret = 0) + if (ret = 0) { + kfree(usemap); __kfree_section_memmap(memmap, nr_pages); + } return ret; } #endif -- Yasunori 
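The error handling this review converged on follows a common pattern: allocate everything up front, and on any later failure free exactly what was allocated so far. A generic userspace sketch of that shape (stand-in allocators and return codes, not the kernel functions):

```c
#include <stdlib.h>
#include <assert.h>

struct section_state { void *memmap; void *usemap; int present; };

/* Modeled on sparse_add_one_section(): two allocations, then an init step
   that can fail; every failure path releases what it owns. */
int add_one_section(struct section_state *ms, int init_should_fail)
{
    void *memmap = malloc(64);
    if (!memmap)
        return -1;                      /* -ENOMEM */
    void *usemap = malloc(16);
    if (!usemap) {
        free(memmap);                   /* roll back the first allocation */
        return -1;
    }
    if (init_should_fail || ms->present) {
        /* init failed or section already present: nothing holds a
           reference to our buffers, so both must be freed here. */
        free(usemap);
        free(memmap);
        return -1;
    }
    ms->memmap = memmap;
    ms->usemap = usemap;
    ms->present = 1;
    return 0;
}

/* Demo: a second add of the same section must fail cleanly. */
int demo_add_twice(void)
{
    struct section_state s = { NULL, NULL, 0 };
    int first = add_one_section(&s, 0);
    int second = add_one_section(&s, 0);
    return first == 0 && second == -1;
}

int demo_init_failure(void)
{
    struct section_state s = { NULL, NULL, 0 };
    return add_one_section(&s, 1);
}
```

The kernel version additionally holds pgdat_resize_lock around the "present" check, which is why the allocations were hoisted out in front of it.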
Goto
Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec
Yasunori Goto wrote: On Thu, 6 Dec 2007, Geert Uytterhoeven wrote: On Thu, 6 Dec 2007, Yasunori Goto wrote: I'll try Milton's suggestion to pre-allocate the memory early. It seems that should work as long as nothing else before the hot-plug mem is added needs a large chunk. Hello, Geoff-san. Sorry for the late response. Could you tell me the value of the following page_size calculation in vmemmap_populate()? I think this page_size may be too big a value. -- int __meminit vmemmap_populate(struct page *start_page, unsigned long nr_pages, int node) : : unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift; : --- 16 MiB of course. 16 MiB is not the page size; it is the section size. IIRC, powerpc's page size must be 4K (or 64K). If the page size is 4K, vmemmap_alloc_block() will request an order-12 allocation. By default PS3 uses 4K virtual pages, and 16M linear pages. Is that really the correct value for vmemmap population? It seems vmemmap needs linear pages, so I think it is ok. Oh, I see. Sorry for the noise. Bye. -- Yasunori Goto
Re: sparsemem: Make SPARSEMEM_VMEMMAP selectable
Looks good to me. Thanks. Acked-by: Yasunori Goto [EMAIL PROTECTED] From: Geoff Levand [EMAIL PROTECTED] SPARSEMEM_VMEMMAP needs to be a selectable config option to support building the kernel both with and without sparsemem vmemmap support. This selection is desirable for platforms which could be configured one way for platform specific builds and the other for multi-platform builds. Signed-off-by: Miguel Boton [EMAIL PROTECTED] Signed-off-by: Geoff Levand [EMAIL PROTECTED] --- Andrew, Please consider for 2.6.24. -Geoff mm/Kconfig | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) --- a/mm/Kconfig +++ b/mm/Kconfig @@ -112,18 +112,17 @@ config SPARSEMEM_EXTREME def_bool y depends on SPARSEMEM !SPARSEMEM_STATIC -# -# SPARSEMEM_VMEMMAP uses a virtually mapped mem_map to optimise pfn_to_page -# and page_to_pfn. The most efficient option where kernel virtual space is -# not under pressure. -# config SPARSEMEM_VMEMMAP_ENABLE def_bool n config SPARSEMEM_VMEMMAP - bool - depends on SPARSEMEM - default y if (SPARSEMEM_VMEMMAP_ENABLE) + bool Sparse Memory virtual memmap + depends on SPARSEMEM SPARSEMEM_VMEMMAP_ENABLE + default y + help + SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise + pfn_to_page and page_to_pfn operations. This is the most + efficient option when sufficient kernel resources are available. # eventually, we can have this option just 'select SPARSEMEM' config MEMORY_HOTPLUG -- Yasunori Goto -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Add IORESOUCE_BUSY flag for System RAM (Re: [Question] How to represent SYSTEM_RAM in kerenel/resouce.c)
Hello. I was asked from Kame-san to write this patch. Please apply. - i386 and x86-64 registers System RAM as IORESOURCE_MEM | IORESOURCE_BUSY. But ia64 registers it as IORESOURCE_MEM only. In addition, memory hotplug code registers new memory as IORESOURCE_MEM too. This patch adds IORESOURCE_BUSY for them to avoid potential overlap mapping by PCI device. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- arch/ia64/kernel/efi.c |6 ++ mm/memory_hotplug.c|2 +- 2 files changed, 3 insertions(+), 5 deletions(-) Index: current/arch/ia64/kernel/efi.c === --- current.orig/arch/ia64/kernel/efi.c 2007-11-01 15:24:05.0 +0900 +++ current/arch/ia64/kernel/efi.c 2007-11-01 15:24:18.0 +0900 @@ -,7 +,7 @@ efi_initialize_iomem_resources(struct re if (md-num_pages == 0) /* should not happen */ continue; - flags = IORESOURCE_MEM; + flags = IORESOURCE_MEM | IORESOURCE_BUSY; switch (md-type) { case EFI_MEMORY_MAPPED_IO: @@ -1133,12 +1133,11 @@ efi_initialize_iomem_resources(struct re case EFI_ACPI_MEMORY_NVS: name = ACPI Non-volatile Storage; - flags |= IORESOURCE_BUSY; break; case EFI_UNUSABLE_MEMORY: name = reserved; - flags |= IORESOURCE_BUSY | IORESOURCE_DISABLED; + flags |= IORESOURCE_DISABLED; break; case EFI_RESERVED_TYPE: @@ -1147,7 +1146,6 @@ efi_initialize_iomem_resources(struct re case EFI_ACPI_RECLAIM_MEMORY: default: name = reserved; - flags |= IORESOURCE_BUSY; break; } Index: current/mm/memory_hotplug.c === --- current.orig/mm/memory_hotplug.c2007-11-01 15:24:16.0 +0900 +++ current/mm/memory_hotplug.c 2007-11-01 15:41:27.0 +0900 @@ -39,7 +39,7 @@ static struct resource *register_memory_ res-name = System RAM; res-start = start; res-end = start + size - 1; - res-flags = IORESOURCE_MEM; + res-flags = IORESOURCE_MEM | IORESOURCE_BUSY; if (request_resource(iomem_resource, res) 0) { printk(System RAM resource %llx - %llx cannot be added\n, (unsigned long long)res-start, (unsigned long long)res-end); -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe 
linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
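The effect of the patch above is a flags change: System RAM resources gain IORESOURCE_BUSY so the PCI layer will not consider the range free for device mappings. A userspace illustration (the flag values mirror include/linux/ioport.h conventions, reproduced here only for the example):

```c
#include <assert.h>

#define IORESOURCE_MEM      0x00000200
#define IORESOURCE_DISABLED 0x10000000
#define IORESOURCE_BUSY     0x80000000

/* Flags a System RAM resource should carry after the patch. */
unsigned long system_ram_flags(void)
{
    return IORESOURCE_MEM | IORESOURCE_BUSY;
}

/* A mapper must not claim a memory range whose resource is marked busy. */
int range_claimable(unsigned long flags)
{
    return (flags & IORESOURCE_MEM) && !(flags & IORESOURCE_BUSY);
}
```

With only IORESOURCE_MEM (as ia64 and the hotplug path registered before the patch), `range_claimable()` would report the RAM range as available, which is exactly the overlap hazard being closed.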
[Patch 000/002](memory hotplug) Rearrange notifier of memory hotplug (take 2)
Hello. This patch set rearranges the event notifier for memory hotplug, because the old notifier has some defects. For example, it passes no information such as the new memory's pfn and number of pages to callback functions. Fortunately, nothing uses this notifier so far, so there is no impact from this change. (SLUB will use it after this patch set to create its kmem_cache_node structures.) In addition, a description of the notifier is added to the memory hotplug document. This patch was part of a patch set to create kmem_cache_node structures for SLUB, to avoid a panic at memory online. But I think this change is useful not only for SLUB but also for others, so I extracted it. This patch set is for 2.6.23-mm1. I tested this patch on my ia64 box. Please apply. Bye. -- Yasunori Goto
[Patch 001/002](memory hotplug) Make description of memory hotplug notifier in document
Add description about event notification callback routine to the document. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- Documentation/memory-hotplug.txt | 58 --- 1 file changed, 55 insertions(+), 3 deletions(-) Index: current/Documentation/memory-hotplug.txt === --- current.orig/Documentation/memory-hotplug.txt 2007-10-17 15:57:50.0 +0900 +++ current/Documentation/memory-hotplug.txt2007-10-17 21:26:30.0 +0900 @@ -2,7 +2,8 @@ Memory Hotplug == -Last Updated: Jul 28 2007 +Created: Jul 28 2007 +Add description of notifier of memory hotplug Oct 11 2007 This document is about memory hotplug including how-to-use and current status. Because Memory Hotplug is still under development, contents of this text will @@ -24,7 +25,8 @@ be changed often. 6.1 Memory offline and ZONE_MOVABLE 6.2. How to offline memory 7. Physical memory remove -8. Future Work List +8. Memory hotplug event notifier +9. Future Work List Note(1): x86_64's has special implementation for memory hotplug. This text does not describe it. @@ -307,8 +309,58 @@ Need more implementation yet - Notification completion of remove works by OS to firmware. - Guard from remove if not yet. + +8. Memory hotplug event notifier + +Memory hotplug has event notifer. There are 6 types of notification. + +MEMORY_GOING_ONLINE + Generated before new memory becomes available in order to be able to + prepare subsystems to handle memory. The page allocator is still unable + to allocate from the new memory. + +MEMORY_CANCEL_ONLINE + Generated if MEMORY_GOING_ONLINE fails. + +MEMORY_ONLINE + Generated when memory has succesfully brought online. The callback may + allocate pages from the new memory. + +MEMORY_GOING_OFFLINE + Generated to begin the process of offlining memory. Allocations are no + longer possible from the memory but some of the memory to be offlined + is still in use. The callback can be used to free memory known to a + subsystem from the indicated memory section. 
+ +MEMORY_CANCEL_OFFLINE + Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from + the section that we attempted to offline. + +MEMORY_OFFLINE + Generated after offlining memory is complete. + +A callback routine can be registered by + hotplug_memory_notifier(callback_func, priority) + +The second argument of the callback function (action) is one of the event +types above. The third argument is passed as a pointer to struct memory_notify. + +struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; + int status_change_nid; +} + +start_pfn is the start_pfn of the online/offline memory. +nr_pages is the number of pages of the online/offline memory. +status_change_nid is set to a node id when the node's N_HIGH_MEMORY nodemask +state is (or will be) set/cleared, i.e. when a new (memoryless) node gains +memory by online or a node loses all of its memory. If this is -1, the +nodemask status is not changed. +If status_change_nid >= 0, the callback should create/discard structures for +the node if necessary. + -- -8. Future Work +9. Future Work -- - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like sysctl or new control file. -- Yasunori Goto
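A subsystem using the notifier documented above registers a callback and reacts to the action code and the struct memory_notify fields. The sketch below models that contract in userspace — the constants and struct mirror the patch, but the registration and dispatch are simulated, so treat it as illustrative rather than kernel code:

```c
#include <assert.h>

#define MEM_ONLINE         (1 << 0)
#define MEM_GOING_OFFLINE  (1 << 1)
#define MEM_OFFLINE        (1 << 2)
#define MEM_GOING_ONLINE   (1 << 3)
#define MEM_CANCEL_ONLINE  (1 << 4)
#define MEM_CANCEL_OFFLINE (1 << 5)

struct memory_notify {
    unsigned long start_pfn;
    unsigned long nr_pages;
    int status_change_nid;
};

static int nodes_prepared;

/* Example callback: build per-node structures when a formerly memoryless
   node is about to gain memory, as a subsystem like SLUB would. */
int example_callback(unsigned long action, void *arg)
{
    struct memory_notify *m = arg;

    switch (action) {
    case MEM_GOING_ONLINE:
        if (m->status_change_nid >= 0)   /* node gains its first memory */
            nodes_prepared++;
        break;
    case MEM_CANCEL_ONLINE:
        if (m->status_change_nid >= 0)
            nodes_prepared--;            /* roll back the preparation */
        break;
    }
    return 0;   /* NOTIFY_OK in the kernel */
}

/* Demo: first online changes node state, second one does not. */
int demo_prepared_nodes(void)
{
    struct memory_notify m = { 0, 4096, 1 };
    nodes_prepared = 0;
    example_callback(MEM_GOING_ONLINE, &m);  /* prepare node 1 */
    m.status_change_nid = -1;                /* later online: no node change */
    example_callback(MEM_GOING_ONLINE, &m);
    return nodes_prepared;
}
```

The status_change_nid test is the key point of the documentation: per-node structures are created or discarded only when the nodemask actually changes, not on every online/offline event.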
[Patch 002/002](memory hotplug) rearrange patch for notifier of memory hotplug
Current memory notifier has some defects yet. (Fortunately, nothing uses it.) This patch is to fix and rearrange for them. - Add information of start_pfn, nr_pages, and node id if node status is changes from/to memoryless node for callback functions. Callbacks can't do anything without those information. - Add notification going-online status. It is necessary for creating per node structure before the node's pages are available. - Move GOING_OFFLINE status notification after page isolation. It is good place for return memory like cache for callback, because returned page is not used again. - Make CANCEL events for rollingback when error occurs. - Delete MEM_MAPPING_INVALID notification. It will be not used. - Fix compile error of (un)register_memory_notifier(). Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- drivers/base/memory.c |9 + include/linux/memory.h | 27 +++ mm/memory_hotplug.c| 48 +--- 3 files changed, 61 insertions(+), 23 deletions(-) Index: current/drivers/base/memory.c === --- current.orig/drivers/base/memory.c 2007-10-17 21:17:54.0 +0900 +++ current/drivers/base/memory.c 2007-10-17 21:21:30.0 +0900 @@ -137,7 +137,7 @@ static ssize_t show_mem_state(struct sys return len; } -static inline int memory_notify(unsigned long val, void *v) +int memory_notify(unsigned long val, void *v) { return blocking_notifier_call_chain(memory_chain, val, v); } @@ -183,7 +183,6 @@ memory_block_action(struct memory_block break; case MEM_OFFLINE: mem-state = MEM_GOING_OFFLINE; - memory_notify(MEM_GOING_OFFLINE, NULL); start_paddr = page_to_pfn(first_page) PAGE_SHIFT; ret = remove_memory(start_paddr, PAGES_PER_SECTION PAGE_SHIFT); @@ -191,7 +190,6 @@ memory_block_action(struct memory_block mem-state = old_state; break; } - memory_notify(MEM_MAPPING_INVALID, NULL); break; default: printk(KERN_WARNING %s(%p, %ld) unknown action: %ld\n, @@ -199,11 +197,6 @@ memory_block_action(struct memory_block WARN_ON(1); ret = -EINVAL; } - /* -* For now, only notify on successful memory 
operations -*/ - if (!ret) - memory_notify(action, NULL); return ret; } Index: current/include/linux/memory.h === --- current.orig/include/linux/memory.h 2007-10-17 21:17:54.0 +0900 +++ current/include/linux/memory.h 2007-10-17 21:21:30.0 +0900 @@ -41,18 +41,15 @@ struct memory_block { #defineMEM_ONLINE (10) /* exposed to userspace */ #defineMEM_GOING_OFFLINE (11) /* exposed to userspace */ #defineMEM_OFFLINE (12) /* exposed to userspace */ +#defineMEM_GOING_ONLINE(13) +#defineMEM_CANCEL_ONLINE (14) +#defineMEM_CANCEL_OFFLINE (15) -/* - * All of these states are currently kernel-internal for notifying - * kernel components and architectures. - * - * For MEM_MAPPING_INVALID, all notifier chains with priority 0 - * are called before pfn_to_page() becomes invalid. The priority=0 - * entry is reserved for the function that actually makes - * pfn_to_page() stop working. Any notifiers that want to be called - * after that should have priority 0. - */ -#defineMEM_MAPPING_INVALID (13) +struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; + int status_change_nid; +}; struct notifier_block; struct mem_section; @@ -69,12 +66,18 @@ static inline int register_memory_notifi static inline void unregister_memory_notifier(struct notifier_block *nb) { } +static inline int memory_notify(unsigned long val, void *v) +{ + return 0; +} #else +extern int register_memory_notifier(struct notifier_block *nb); +extern void unregister_memory_notifier(struct notifier_block *nb); extern int register_new_memory(struct mem_section *); extern int unregister_memory_section(struct mem_section *); extern int memory_dev_init(void); extern int remove_memory_block(unsigned long, struct mem_section *, int); - +extern int memory_notify(unsigned long val, void *v); #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTIONPAGE_SHIFT) Index: current/mm/memory_hotplug.c === --- current.orig/mm/memory_hotplug.c2007-10-17 21:17:54.0 +0900 +++ current/mm/memory_hotplug.c
[Patch](memory hotplug) Make kmem_cache_node for SLUB on memory online to avoid panic (take 3)
This patch fixes a panic due to a NULL pointer access of kmem_cache_node at discard_slab() after memory online. When memory online is called, kmem_cache_nodes are created for all SLUBs for the new node whose memory becomes available. slab_mem_going_online_callback() is called to make the kmem_cache_node in the callback of the memory online event. If it (or another callback) fails, then slab_mem_offline_callback() is called for rollback. In memory offline, slab_mem_going_offline_callback() is called to shrink all slub caches, then slab_mem_offline_callback() is called later. This patch is tested on my ia64 box. Please apply. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- mm/slub.c | 115 ++ 1 file changed, 115 insertions(+) Index: current/mm/slub.c === --- current.orig/mm/slub.c 2007-10-17 21:17:53.0 +0900 +++ current/mm/slub.c 2007-10-17 22:23:08.0 +0900 @@ -20,6 +20,7 @@ #include <linux/mempolicy.h> #include <linux/ctype.h> #include <linux/kallsyms.h> +#include <linux/memory.h> /* * Lock order: @@ -2688,6 +2689,118 @@ int kmem_cache_shrink(struct kmem_cache } EXPORT_SYMBOL(kmem_cache_shrink); +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) +static int slab_mem_going_offline_callback(void *arg) +{ + struct kmem_cache *s; + + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) + kmem_cache_shrink(s); + up_read(&slub_lock); + + return 0; +} + +static void slab_mem_offline_callback(void *arg) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int offline_node; + + offline_node = marg->status_change_nid; + + /* +* If the node still has available memory, we need kmem_cache_node +* for it yet. +*/ + if (offline_node < 0) + return; + + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) { + n = get_node(s, offline_node); + if (n) { + /* +* if n->nr_slabs > 0, slabs still exist on the node +* that is going down. We were unable to free them, +* and the offline_pages() function shouldn't call this +* callback. So, we must fail. +*/ + BUG_ON(atomic_read(&n->nr_slabs)); + + s->node[offline_node] = NULL; + kmem_cache_free(kmalloc_caches, n); + } + } + up_read(&slub_lock); +} + +static int slab_mem_going_online_callback(void *arg) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int nid = marg->status_change_nid; + + /* +* If the node's memory is already available, then kmem_cache_node is +* already created. Nothing to do. +*/ + if (nid < 0) + return 0; + + /* +* We are bringing a node online. No memory is available yet. We must +* allocate a kmem_cache_node structure in order to bring the node +* online. +*/ + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) { + /* +* XXX: kmem_cache_alloc_node will fallback to other nodes +* since memory is not yet available from the node that +* is brought up. +*/ + n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL); + if (!n) + return -ENOMEM; + init_kmem_cache_node(n); + s->node[nid] = n; + } + up_read(&slub_lock); + + return 0; +} + +static int slab_memory_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + int ret = 0; + + switch (action) { + case MEM_GOING_ONLINE: + ret = slab_mem_going_online_callback(arg); + break; + case MEM_GOING_OFFLINE: + ret = slab_mem_going_offline_callback(arg); + break; + case MEM_OFFLINE: + case MEM_CANCEL_ONLINE: + slab_mem_offline_callback(arg); + break; + case MEM_ONLINE: + case MEM_CANCEL_OFFLINE: + break; + } + + ret = notifier_from_errno(ret); + return ret; +} + +#endif /* CONFIG_MEMORY_HOTPLUG */ + /* Basic setup of slabs */ @@ -2709,6 +2822,8 @@ void __init kmem_cache_init(void) sizeof(struct kmem_cache_node), GFP_KERNEL); kmalloc_caches[0].refcount = -1; caches
Re: [Patch](memory hotplug) Make kmem_cache_node for SLUB on memory online to avoid panic(take 3)
On Wed, 17 Oct 2007 23:25:58 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote: So that's slub. Does slab already have this functionality or are you not bothering to maintain slab in this area? Slab brings up a per node structure when the corresponding cpu is brought up. That was sufficient as long as we did not have any memoryless nodes. Right. At least, I don't have any experience of a panic with SLAB so far. (If a panic had occurred, I would already have made a patch.) Now we may have to fix some things over there as well. Though fixing SLAB too may be better, its priority is very low for me now. -- Yasunori Goto
Re: [Patch](memory hotplug) Make kmem_cache_node for SLUB on memory online to avoid panic(take 3)
On Thu, 18 Oct 2007 12:25:37 +0900 Yasunori Goto [EMAIL PROTECTED] wrote: This patch fixes panic due to access NULL pointer of kmem_cache_node at discard_slab() after memory online. When memory online is called, kmem_cache_nodes are created for all SLUBs for new node whose memory are available. slab_mem_going_online_callback() is called to make kmem_cache_node() in callback of memory online event. If it (or other callbacks) fails, then slab_mem_offline_callback() is called for rollback. In memory offline, slab_mem_going_offline_callback() is called to shrink all slub cache, then slab_mem_offline_callback() is called later. This patch is tested on my ia64 box. ... +#if defined(CONFIG_NUMA) defined(CONFIG_MEMORY_HOTPLUG) hm. There should be no linkage between memory hotpluggability and NUMA, surely? Sure. IBM's powerpc boxes have to support memory hotplug even if it is non-numa machine. They have the Dynamic Logical Partitioning feature. + down_read(slub_lock); + list_for_each_entry(s, slab_caches, list) { + n = get_node(s, offline_node); + if (n) { + /* +* if n-nr_slabs 0, slabs still exist on the node +* that is going down. We were unable to free them, +* and offline_pages() function shoudn't call this +* callback. So, we must fail. +*/ + BUG_ON(atomic_read(n-nr_slabs)); Expereince tells us that WARN_ON is preferred for newly added code ;) Oh... Ok! + s-node[offline_node] = NULL; + kmem_cache_free(kmalloc_caches, n); + } + } + up_read(slub_lock); +} + +static int slab_mem_going_online_callback(void *arg) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int nid = marg-status_change_nid; + + /* +* If the node's memory is already available, then kmem_cache_node is +* already created. Nothing to do. +*/ + if (nid 0) + return 0; + + /* +* We are bringing a node online. No memory is availabe yet. We must +* allocate a kmem_cache_node structure in order to bring the node +* online. 
+*/ + down_read(slub_lock); + list_for_each_entry(s, slab_caches, list) { + /* +* XXX: kmem_cache_alloc_node will fallback to other nodes +* since memory is not yet available from the node that +* is brought up. +*/ + n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL); + if (!n) + return -ENOMEM; err, we forgot slub_lock. I'll fix that. Oops. Indeed. Thanks for your check. Bye. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix warning in mm/slub.c
"Make kmem_cache_node for SLUB on memory online to avoid panic" introduced the following: mm/slub.c:2737: warning: passing argument 1 of 'atomic_read' from incompatible pointer type Signed-off-by: Olof Johansson [EMAIL PROTECTED] diff --git a/mm/slub.c b/mm/slub.c index aac1dd3..bcdb2c8 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2734,7 +2734,7 @@ static void slab_mem_offline_callback(void *arg) * and the offline_pages() function shouldn't call this * callback. So, we must fail. */ - BUG_ON(atomic_read(&n->nr_slabs)); + BUG_ON(atomic_long_read(&n->nr_slabs)); s->node[offline_node] = NULL; kmem_cache_free(kmalloc_caches, n); Oops, yes. Thanks. Acked-by: Yasunori Goto [EMAIL PROTECTED] -- Yasunori Goto
Re: [-mm PATCH] register_memory/unregister_memory clean ups
On Mon, 2008-02-11 at 11:48 -0800, Andrew Morton wrote: On Mon, 11 Feb 2008 09:23:18 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi Andrew, While testing hotplug memory remove against -mm, I noticed that unregister_memory() is not cleaning up /sysfs entries correctly. It also de-references structures after destroying them (luckily in the code which never gets used). So, I cleaned up the code and fixed the extra reference issue. Could you please include it in -mm ? Thanks, Badari register_memory()/unregister_memory() never gets called with root. unregister_memory() is accessing kobject_name of the object just freed up. Since no one uses the code, lets take the code out. And also, make register_memory() static. Another bug fix - before calling unregister_memory() remove_memory_block() gets a ref on kobject. unregister_memory() need to drop that ref before calling sysdev_unregister(). I'd say this: Subject: [-mm PATCH] register_memory/unregister_memory clean ups is rather tame. These are more than cleanups! These sound like machine-crashing bugs. Do they crash machines? How come nobody noticed it? No they don't crash machine - mainly because, they never get called with root argument (where we have the bug). They were never tested before, since we don't have memory remove work yet. All it does is, it leave /sysfs directory laying around and causing next memory add failure. Badari-san. Which function does call unregister_memory() or unregister_memory_section()? I can't find its caller in current 2.6.24-mm1. ???() | |nothing calls? | +--unregister_memory_section() | |call | +--- remove_memory_block() | |call | + unregister_memory() unregister_memory_section() is only externed in linux/memory.h. Do you have any another patch to call it? I think it is necessary for physical memory removing. If you have not posted it or it is not merged to -mm, I can understand why this bug remains. If you posted it, could you point it to me? Or do I misunderstand something? Thanks. 
-- Yasunori Goto
Re: [-mm PATCH] register_memory/unregister_memory clean ups
Thanks Badari-san. I understand what occurred. :-) On Tue, 2008-02-12 at 13:56 -0800, Badari Pulavarty wrote: + /* +* It's ugly, but this is the best I can do - HELP !! +* We don't know where the allocations for section memmap and usemap +* came from. If they are allocated at the boot time, they would come +* from bootmem. If they are added through hot-memory-add they could be +* from slab or vmalloc. If they are allocated as part of hot-mem-add +* free them up properly. If they are allocated at boot, no easy way +* to correctly free them :( +*/ + if (usemap) { + if (PageSlab(virt_to_page(usemap))) { + kfree(usemap); + if (memmap) + __kfree_section_memmap(memmap, nr_pages); + } + } +} Do what we did with the memmap and store some of its origination information in the low bits. Hmm, my understanding of memmap is limited. Can you help me out here? Never mind. That was a bad suggestion. I do think it would be a good idea to mark the 'struct page' of every page we use as bootmem in some way. Perhaps page->private? I agree. page->private is not used by the bootmem allocator. I would like to mark not only memmap but also pgdat (and so on) as a next step. It will be necessary for removing a whole node. :-) Otherwise, you can simply try all of the possibilities and consider the remainder bootmem. Did you ever find out if we properly initialize the bootmem 'struct page's? Please have mercy and put this in a helper, first of all. static void free_usemap(unsigned long *usemap) { if (!usemap) return; if (PageSlab(virt_to_page(usemap))) { kfree(usemap); } else if (is_vmalloc_addr(usemap)) { vfree(usemap); } else { int nid = page_to_nid(virt_to_page(usemap)); bootmem_fun_here(NODE_DATA(nid), usemap); } } right? It may work. But, to be honest, I feel there are TOO MANY allocation/free paths for memmap (usemap and so on). If possible, I would like to unify some of them. I would like to try it. Bye.
-- Yasunori Goto
Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
Hi Ingo-san. Does anyone even use memory hotplug currently? I don't know. IBM's powerpc boxes can hot-add/remove memory via dynamic partitioning. And our Fujitsu servers have a memory hot-add feature (ia64). So, they are concrete users of memory hotplug. In x86, the E8500 chipset has a memory-hotplug feature. (I found a data-sheet on Intel's site.) http://download.intel.com/design/chipsets/e8500/datashts/30674501.pdf (6.3.8 IMI Hot-Plug) So, it depends on how many servers use it, I think. Thanks. -- Yasunori Goto
[Patch / 000](memory hotplug) Fix NULL pointer access of kmem_cache_node when hot-add.
Hello. This patch set fixes a panic due to a NULL pointer access in SLUB. When new memory is hot-added on a new node (or a memory-less node), the kmem_cache_node for the new node is not prepared, and a panic occurs because of it. So, new kmem_cache_node structures should be created before the new memory becomes available on the node. This is the first user of the memory notifier callback, so the first patch fixes some defects in it. This patch set is for 2.6.23-rc8-mm2. I tested these patches on my ia64 box. Please apply. Bye. -- Yasunori Goto
[Patch / 001](memory hotplug) fix some defects of memory notifer callback interface.
Current memory notifier has some defects yet. (Nothing uses it.) This patch is to fix for them. - Add information of start_pfn and nr_pages for callback functions. They can't do anything without those information. - Add notification going-online status. It is necessary for creating per node structure before the node's pages are available. - Fix compile error of (un)register_memory_notifier(). Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- drivers/base/memory.c | 10 +++--- include/linux/memory.h | 16 2 files changed, 19 insertions(+), 7 deletions(-) Index: current/drivers/base/memory.c === --- current.orig/drivers/base/memory.c 2007-09-28 11:21:00.0 +0900 +++ current/drivers/base/memory.c 2007-09-28 11:23:46.0 +0900 @@ -155,10 +155,13 @@ memory_block_action(struct memory_block struct page *first_page; int ret; int old_state = mem-state; + struct memory_notify arg; psection = mem-phys_index; first_page = pfn_to_page(psection PFN_SECTION_SHIFT); + arg.start_pfn = page_to_pfn(first_page); + arg.nr_pages = PAGES_PER_SECTION; /* * The probe routines leave the pages reserved, just * as the bootmem code does. 
Make sure they're still @@ -178,12 +181,13 @@ memory_block_action(struct memory_block switch (action) { case MEM_ONLINE: + memory_notify(MEM_GOING_ONLINE, arg); start_pfn = page_to_pfn(first_page); ret = online_pages(start_pfn, PAGES_PER_SECTION); break; case MEM_OFFLINE: mem-state = MEM_GOING_OFFLINE; - memory_notify(MEM_GOING_OFFLINE, NULL); + memory_notify(MEM_GOING_OFFLINE, arg); start_paddr = page_to_pfn(first_page) PAGE_SHIFT; ret = remove_memory(start_paddr, PAGES_PER_SECTION PAGE_SHIFT); @@ -191,7 +195,7 @@ memory_block_action(struct memory_block mem-state = old_state; break; } - memory_notify(MEM_MAPPING_INVALID, NULL); + memory_notify(MEM_MAPPING_INVALID, arg); break; default: printk(KERN_WARNING %s(%p, %ld) unknown action: %ld\n, @@ -203,7 +207,7 @@ memory_block_action(struct memory_block * For now, only notify on successful memory operations */ if (!ret) - memory_notify(action, NULL); + memory_notify(action, arg); return ret; } Index: current/include/linux/memory.h === --- current.orig/include/linux/memory.h 2007-09-28 11:18:25.0 +0900 +++ current/include/linux/memory.h 2007-09-28 11:23:46.0 +0900 @@ -37,10 +37,16 @@ struct memory_block { struct sys_device sysdev; }; +struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; +}; + /* These states are exposed to userspace as text strings in sysfs */ -#defineMEM_ONLINE (10) /* exposed to userspace */ -#defineMEM_GOING_OFFLINE (11) /* exposed to userspace */ -#defineMEM_OFFLINE (12) /* exposed to userspace */ +#define MEM_GOING_ONLINE (10) /* exposed to userspace */ +#defineMEM_ONLINE (11) /* exposed to userspace */ +#defineMEM_GOING_OFFLINE (12) /* exposed to userspace */ +#defineMEM_OFFLINE (13) /* exposed to userspace */ /* * All of these states are currently kernel-internal for notifying @@ -52,7 +58,7 @@ struct memory_block { * pfn_to_page() stop working. Any notifiers that want to be called * after that should have priority 0. 
*/ -#defineMEM_MAPPING_INVALID (13) +#defineMEM_MAPPING_INVALID (14) struct notifier_block; struct mem_section; @@ -70,6 +76,8 @@ static inline void unregister_memory_not { } #else +extern int register_memory_notifier(struct notifier_block *nb); +extern void unregister_memory_notifier(struct notifier_block *nb); extern int register_new_memory(struct mem_section *); extern int unregister_memory_section(struct mem_section *); extern int memory_dev_init(void); -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Patch / 002](memory hotplug) Callback function to create kmem_cache_node.
This is to make kmem_cache_nodes of all SLUBs for new node when memory-hotadd is called. This fixes panic due to access NULL pointer at discard_slab() after memory hot-add. If pages on the new node available, slub can use it before making new kmem_cache_nodes. So, this callback should be called BEFORE pages on the node are available. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- mm/slub.c | 79 ++ 1 file changed, 79 insertions(+) Index: current/mm/slub.c === --- current.orig/mm/slub.c 2007-09-28 11:23:50.0 +0900 +++ current/mm/slub.c 2007-09-28 11:23:59.0 +0900 @@ -20,6 +20,7 @@ #include linux/mempolicy.h #include linux/ctype.h #include linux/kallsyms.h +#include linux/memory.h /* * Lock order: @@ -2097,6 +2098,82 @@ static int init_kmem_cache_nodes(struct } return 1; } + +#ifdef CONFIG_MEMORY_HOTPLUG +static void __slab_callback_offline(int nid) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + + list_for_each_entry(s, slab_caches, list) { + if (s-node[nid]) { + n = get_node(s, nid); + s-node[nid] = NULL; + kmem_cache_free(kmalloc_caches, n); + } + } +} + +static int slab_callback_going_online(void *arg) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int nid; + + nid = page_to_nid(pfn_to_page(marg-start_pfn)); + + /* If the node already has memory, then nothing is necessary. */ + if (node_state(nid, N_HIGH_MEMORY)) + return 0; + + /* +* New memory will be onlined on the node which has no memory so far. +* New kmem_cache_node is necssary for it. +*/ + down_read(slub_lock); + list_for_each_entry(s, slab_caches, list) { + /* +* XXX: The new node's memory can't be allocated yet, +* kmem_cache_node will be allocated other node. 
+*/ + n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL); + if (!n) + goto error; + init_kmem_cache_node(n); + s-node[nid] = n; + } + up_read(slub_lock); + + return 0; + +error: + __slab_callback_offline(nid); + up_read(slub_lock); + + return -ENOMEM; +} + +static int slab_callback(struct notifier_block *self, unsigned long action, +void *arg) +{ + int ret = 0; + + switch (action) { + case MEM_GOING_ONLINE: + ret = slab_callback_going_online(arg); + break; + case MEM_ONLINE: + case MEM_GOING_OFFLINE: + case MEM_MAPPING_INVALID: + break; + } + + ret = notifier_from_errno(ret); + return ret; +} + +#endif /* CONFIG_MEMORY_HOTPLUG */ #else static void free_kmem_cache_nodes(struct kmem_cache *s) { @@ -2730,6 +2807,8 @@ void __init kmem_cache_init(void) sizeof(struct kmem_cache_node), GFP_KERNEL); kmalloc_caches[0].refcount = -1; caches++; + + hotplug_memory_notifier(slab_callback, 1); #endif /* Able to allocate the per node structures */ -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch 000/002](memory hotplug) Fix NULL pointer access of kmem_cache_node when hot-add.
I'm sorry. There are 2 patches for this fix. The subject line should be [Patch 000/002]. :-( -- Yasunori Goto
Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.
On Mon, 1 Oct 2007, Yasunori Goto wrote: +#ifdef CONFIG_MEMORY_HOTPLUG +static void __slab_callback_offline(int nid) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + + list_for_each_entry(s, &slab_caches, list) { + if (s->node[nid]) { + n = get_node(s, nid); + s->node[nid] = NULL; + kmem_cache_free(kmalloc_caches, n); + } + } +} I think we need to bug here if there are still objects on the node that are in use. This will silently discard the objects. This is just the rollback code for an allocation failure of kmem_cache_node partway through. So, there is a case where some of them are not allocated yet. No slab uses the new kmem_cache_node before the new node's pages are available (so far). But, in the future, this will be useful for the node hot-unplug code, and that check will be necessary. Ok. I'll add the check. Do you mean that just nr_slabs should be checked like the following? I'm not sure this is enough. : if (s->node[nid]) { n = get_node(s, nid); if (!atomic_read(&n->nr_slabs)) { s->node[nid] = NULL; kmem_cache_free(kmalloc_caches, n); } } : : Thanks. -- Yasunori Goto
Re: x86 patches was Re: -mm merge plans for 2.6.24
On Tue, 2 Oct 2007 00:43:24 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Tue, 2 Oct 2007 16:36:24 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] Don't think so. A node is a lump of circuitry which can have zero or more CPUs, IO and memory. It may initially have been conceived as a memory-only concept in the Linux kernel, but that doesn't fully map onto reality (does it?) There was a real-world need for this, I think from the Fujitsu guys. That should be spelled out in the changelog but isn't. Yes, Fujitsu and HP guys really need this memory-less-node support. For what reason, please? For Fujitsu, the problem is called the "empty node". When ACPI's SRAT table includes possible nodes, the ia64 bootstrap (acpi_numa_init) creates nodes which include no memory and no cpu. I tried to remove empty nodes in the past, but that was denied, because we can hot-add a cpu to an empty node. (node-hotplug triggered by cpu is not implemented now, and it would be ugly.) For HP (Lee can comment on this later), they have memory-less nodes. As far as I hear, HP's machines can have the following configuration. (example) Node0: CPU0 memory AAA MB Node1: CPU1 memory AAA MB Node2: CPU2 memory AAA MB Node3: CPU3 memory AAA MB Node4: Memory XXX GB AAA is a very small value (below 16MB) and will be omitted by the ia64 bootstrap. After boot, only Node 4 has valid memory (but no cpu). Maybe this is memory interleaving by firmware config. From the memory-hotplug view, memory-less nodes are very helpful. They can define and arrange some halfway conditions of node hot-plug. I guess the node unplugging code will be simpler because of it. Bye. -- Yasunori Goto
Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.
On Tue, 2 Oct 2007, Yasunori Goto wrote: Do you mean that just nr_slabs should be checked like the following? I'm not sure this is enough. : if (s->node[nid]) { n = get_node(s, nid); if (!atomic_read(&n->nr_slabs)) { s->node[nid] = NULL; kmem_cache_free(kmalloc_caches, n); } } : : That would work. But it would be better to shrink the cache first. The first 2 slabs on a node may be empty and the shrinking will remove those. If you do not shrink then the code may falsely assume that there are objects on the node. I'm sorry, but I don't think I understand what you mean... :-( Could you explain more? Which slabs should be shrunk? kmem_cache_node and kmem_cache_cpu? I think kmem_cache_cpu should be disabled by cpu hotplug, not memory/node hotplug. Basically, cpus should be offlined before memory is offlined on the node. Sorry, I'm confused now... -- Yasunori Goto
Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.
On Wed, 3 Oct 2007, Yasunori Goto wrote: That would work. But it would be better to shrink the cache first. The first 2 slabs on a node may be empty and the shrinking will remove those. If you do not shrink then the code may falsely assume that there are objects on the node. I'm sorry, but I don't think I understand what you mean... :-( Could you explain more? Which slabs should be shrunk? kmem_cache_node and kmem_cache_cpu? The slab for which you are trying to set the kmem_cache_node pointer to NULL needs to be shrunk. I think kmem_cache_cpu should be disabled by cpu hotplug, not memory/node hotplug. Basically, cpus should be offlined before memory is offlined on the node. Hmmm.. Ok, for cpu hotplug you could simply disregard the per cpu structure if the per cpu slab was flushed first. However, the per node structure may hold slabs with no objects even after all objects were removed on a node. These need to be flushed by calling kmem_cache_shrink() on the slab cache. On the other hand: if you can guarantee that they will not be used, that no objects are in them, and that you can recover the pages used in different ways, then zapping the per node pointer like that is okay. Thanks for your advice. I'll reconsider and fix my patches. Bye. -- Yasunori Goto
Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.
if (onlined_pages) - node_set_state(zone->node, N_HIGH_MEMORY); + node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); setup_per_zone_pages_min(); Thanks Paul-san. I also have another issue around here. (Kswapd doesn't run on memory-less nodes now. It should run when the node has memory.) I would like to merge them like the following if you don't mind. Bye. --- Fix kswapd not running when memory is added on a memory-less node. Fix compile error of zone->node when CONFIG_NUMA is off. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] Signed-off-by: Paul Mundt [EMAIL PROTECTED] --- mm/memory_hotplug.c |9 - 1 file changed, 4 insertions(+), 5 deletions(-) Index: current/mm/memory_hotplug.c === --- current.orig/mm/memory_hotplug.c2007-09-07 18:08:07.0 +0900 +++ current/mm/memory_hotplug.c 2007-09-11 17:29:19.0 +0900 @@ -211,10 +211,12 @@ int online_pages(unsigned long pfn, unsi online_pages_range); zone->present_pages += onlined_pages; zone->zone_pgdat->node_present_pages += onlined_pages; - if (onlined_pages) - node_set_state(zone->node, N_HIGH_MEMORY); setup_per_zone_pages_min(); + if (onlined_pages){ + kswapd_run(zone_to_nid(zone)); + node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); + } if (need_zonelists_rebuild) build_all_zonelists(); @@ -269,9 +271,6 @@ int add_memory(int nid, u64 start, u64 s if (!pgdat) return -ENOMEM; new_pgdat = 1; - ret = kswapd_run(nid); - if (ret) - goto error; } /* call arch's memory hotadd */ -- Yasunori Goto
Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.
+ if (onlined_pages){ Nit, needs a space there before the '{'. Ah, Ok. I attached the fixed patch in this mail. The problem as I see it is that when we boot the system we start a kswapd on all nodes with memory. If the hot-add adds memory to a pre-existing node with no memory we will not start one, and we end up with a node with memory and no kswapd. Bad. As kswapd_run is a no-op when a kswapd already exists, this seems a safe way to fix that. Paul's zone->node conversion is obviously correct also. Acked-by: Andy Whitcroft [EMAIL PROTECTED] Thanks for your explanation. You captured my intention exactly. :-) Fix kswapd not running when memory is added on a memory-less node. Fix compile error of zone->node when CONFIG_NUMA is off. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] Signed-off-by: Paul Mundt [EMAIL PROTECTED] Acked-by: Andy Whitcroft [EMAIL PROTECTED] --- mm/memory_hotplug.c |9 - 1 file changed, 4 insertions(+), 5 deletions(-) Index: current/mm/memory_hotplug.c === --- current.orig/mm/memory_hotplug.c2007-09-07 18:08:07.0 +0900 +++ current/mm/memory_hotplug.c 2007-09-11 17:29:19.0 +0900 @@ -211,10 +211,12 @@ int online_pages(unsigned long pfn, unsi online_pages_range); zone->present_pages += onlined_pages; zone->zone_pgdat->node_present_pages += onlined_pages; - if (onlined_pages) - node_set_state(zone->node, N_HIGH_MEMORY); setup_per_zone_pages_min(); + if (onlined_pages) { + kswapd_run(zone_to_nid(zone)); + node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); + } if (need_zonelists_rebuild) build_all_zonelists(); @@ -269,9 +271,6 @@ int add_memory(int nid, u64 start, u64 s if (!pgdat) return -ENOMEM; new_pgdat = 1; - ret = kswapd_run(nid); - if (ret) - goto error; } /* call arch's memory hotadd */ -- Yasunori Goto
[Patch] Fix panic of cpu online with memory less node
When a cpu is onlined on a memory-less-node box, the kernel panics due to touching a NULL pointer, pgdat->kswapd. Currently kswapd runs only on nodes which have memory. So, calling set_cpus_allowed() is not necessary for memory-less nodes. This is the fix for it. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- mm/vmscan.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: current/mm/vmscan.c === --- current.orig/mm/vmscan.c2007-09-03 16:36:18.0 +0900 +++ current/mm/vmscan.c 2007-09-11 13:02:20.0 +0900 @@ -1843,9 +1843,11 @@ static int __devinit cpu_callback(struct { pg_data_t *pgdat; cpumask_t mask; + int nid; if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) { - for_each_online_pgdat(pgdat) { + for_each_node_state(nid, N_HIGH_MEMORY) { + pgdat = NODE_DATA(nid); mask = node_to_cpumask(pgdat->node_id); if (any_online_cpu(mask) != NR_CPUS) /* One of our CPUs online: restore mask */ -- Yasunori Goto
Re: [PATCH][22/37] Clean up duplicate includes in include/linux/memory_hotplug.h
Oops. This should be Thanks!

Acked-by: Yasunori Goto [EMAIL PROTECTED]

Hi,
This patch cleans up duplicate includes in include/linux/memory_hotplug.h

Signed-off-by: Jesper Juhl [EMAIL PROTECTED]
---
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7b54666..b573d1e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -3,7 +3,6 @@
 
 #include <linux/mmzone.h>
 #include <linux/spinlock.h>
-#include <linux/mmzone.h>
 #include <linux/notifier.h>
 
 struct page;

-- 
Yasunori Goto
[RFC][Doc] memory hotplug documentation take 2.
Hello. This is new version of document of memory hotplug. At first, I was asked from Kame-san to review his new version which was only updated against previous comments. But, I became to want to change/add many description after reviewing. So, I'll post this. :-) Please comment. Change log from take 1. - updates against comments from Randy-san (Thanks a lot!) - mention about physical/logical phase of hotplug. change sections for it. - add description of kernel config option. - add description of relationship against ACPI node-hotplug. - make patch style. - etc. --- This is add a document for memory hotplug to describe How to use and Current status. --- Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED] Signed-off-by: Yasunori Goto [EMAIL PROTECTED] Documentation/memory-hotplug.txt | 322 +++ 1 files changed, 322 insertions(+) Index: makedocument/Documentation/memory-hotplug.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ makedocument/Documentation/memory-hotplug.txt 2007-07-27 22:31:11.0 +0900 @@ -0,0 +1,322 @@ +== +Memory Hotplug +== + +Last Updated: Jul 27 2007 + +This document is about memory hotplug including how-to-use and current status. +Because Memory Hotplug is still under development, contents of this text will +be changed often. + +1. Introduction + 1.1 purpose of memory hotplug + 1.2. Phases of memory hotplug + 1.3. Unit of Memory online/offline operation +2. Kernel Configuration +3. sysfs files for memory hotplug +4. Physical memory hot-add phase + 4.1 Hardware(Firmware) Support + 4.2 Notify memory hot-add event by hand +5. Logical Memory hot-add phase + 5.1. State of memory + 5.2. How to online memory +6. Logical memory remove + 6.1 Memory offline and ZONE_MOVABLE + 6.2. How to offline memory +7. Physical memory remove +8. Future Work List + +Note(1): x86_64's has special implementation for memory hotplug. + This test does not describe it. +Note(2): This text assumes that sysfs is mounted at /sys. + + +--- +1. 
Introduction +--- + +1.1 purpose of memory hotplug + +Memory Hotplug allows users to increase/decrease the amount of memory. +Generally, there are two purposes. + +(A) For changing the amount of memory. +This is to allow a feature like capacity on demand. +(B) For installing/removing DIMMs or NUMA-nodes physically. +This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc. + +(A) is required by highly virtualized environments and (B) is required by +hardware which supports memory power management. + +Linux memory hotplug is designed for both purpose. + + +1.2. Phases of memory hotplug +--- +There are 2 phases in Memory Hotplug. + 1) Physical Memory Hotplug phase + 2) Logical Memory Hotplug phase. + +The First phase is to communicate hardware/firmware and make/erase +environment for hotplugged memory. Basically, this phase is necessary +for the purpose (B), but this is good phase for communication between +highly virtulaized environments too. + +When memory is hotplugged, the kernel recognizes new memory, makes new memory +management tables, and makes sysfs files for new memory's operation. + +If firmware supports notification of connection of new memory to OS, +this phase is triggered automatically. ACPI can notify this event. If not, +probe operation by system administration works instead of it. +(see Section 4.). + +Logical Memory Hotplug phase is to change memory state into +avaiable/unavailable for users. Amount of memory from user's view is +changed by this phase. The kernel makes all memory in it as free pages +when a memory range is into available. + +In this document, this phase is described online/offline. + +Logical Memory Hotplug phase is trigged by write of sysfs file by system +administrator. When hot-add case, it must be executed after Physical Hotplug +phase by hand. +(However, if you writes udev's hotplug scripts for memory hotplug, these + phases can be execute in seamless way.) + + +1.3. 
Unit of Memory online/offline operation + +Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory +into chunks of the same size. The chunk is called a section. The size of +a section is architecture dependent. For example, power uses 16MiB, ia64 uses +1GiB. The unit of online/offline operation is one section. (see Section 3.) + +To know the size of sections, please read this file: + +/sys/devices/system/memory/block_size_bytes + +This file shows the size of sections in byte. + +--- +2. Kernel Configuration +--- +To use memory hotplug feature, kernel must be compiled with following +config options. + +- For all memory hotplug +Memory model - Sparse Memory (CONFIG_SPARSEMEM) +Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG) + +- For using
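Section 1.3 above says the section size can be read from sysfs. A minimal sketch of that check, assuming the file reports the size in bytes as a hex value (the `1000000` below is a hypothetical reading from a box with 16 MiB sections; on a real system you would read it with `cat /sys/devices/system/memory/block_size_bytes`):

```shell
# Hypothetical value read from block_size_bytes (hex, bytes):
block_size_hex=1000000
# Convert to MiB for readability:
echo $(( 0x$block_size_hex / 1024 / 1024 ))MiB
```

With the 16 MiB (power) section size this prints `16MiB`; an ia64 box with 1 GiB sections would show `40000000` in the file and print `1024MiB`.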
Re: [RFC][Doc] memory hotplug documentation take 2.
Thanks for your comment. Fixed patch is attached at the last of this mail. + +Note(1): x86_64's has special implementation for memory hotplug. + This test does not describe it. text (?) Oops. Yes. +1.2. Phases of memory hotplug +--- +There are 2 phases in Memory Hotplug. + 1) Physical Memory Hotplug phase + 2) Logical Memory Hotplug phase. + +The First phase is to communicate hardware/firmware and make/erase +environment for hotplugged memory. Basically, this phase is necessary +for the purpose (B), but this is good phase for communication between +highly virtulaized environments too. virtualized Yes. fixed... + +When memory is hotplugged, the kernel recognizes new memory, makes new memory +management tables, and makes sysfs files for new memory's operation. + +If firmware supports notification of connection of new memory to OS, +this phase is triggered automatically. ACPI can notify this event. If not, +probe operation by system administration works instead of it. is used instead. Ah, ok. +(see Section 4.). + +Logical Memory Hotplug phase is to change memory state into +avaiable/unavailable for users. Amount of memory from user's view is +changed by this phase. The kernel makes all memory in it as free pages +when a memory range is into available. ?? drop into ? or is a memory range always available? Confusing. Ok. I didn't know it was confusing. Thanks. I dropped it. +In this document, this phase is described online/offline. described as online/offline. OK. + +Logical Memory Hotplug phase is trigged by write of sysfs file by system triggered Oops. yes. +administrator. When hot-add case, it must be executed after Physical Hotplug For the hot-add case, OK. +phase by hand. +(However, if you writes udev's hotplug scripts for memory hotplug, these + phases can be execute in seamless way.) + + +1.3. Unit of Memory online/offline operation + +Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory +into chunks of the same size. 
The chunk is called a section. The size of +a section is architecture dependent. For example, power uses 16MiB, ia64 uses +1GiB. The unit of online/offline operation is one section. (see Section 3.) + +To know the size of sections, please read this file: To determine the size ... I didn't know determine can be used for this sentence. I remembered it means just decide due to my English vocabulary problem. Thanks. I changed it. :-) +- For using remove memory, followings are necessary too To enable memory removal, the following are also necessary Ok. +Allow for memory hot remove(CONFIG_MEMORY_HOTREMOVE) +Page Migration (CONFIG_MIGRATION) + +- For ACPI memory hotplug, followings are necessary too the following are also necessary Ok. +Now, XXX is defined as start_address_of_section / secion_size. section_size. Yes. Thanks. + +For example, assume 1GiB section size. A device for a memory starts from address for memory starting at Ok. + +In general, the firmware (ACPI) which supports memory hotplug defines +memory class object of _HID PNP0C80. When a notify is asserted to PNP0C80, +Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev +script. This will be done in automatically. drop in Ok. +If firmware supports NUMA-node hotplug, and define object of _HID ACPI0004, defines an object Ok. +PNP0A05, or PNP0A06, notification is asserted to it, and ACPI hander handler Ah, yes. Thanks again! --- This is add a document for memory hotplug to describe How to use and Current status. 
--- Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED] Signed-off-by: Yasunori Goto [EMAIL PROTECTED] Documentation/memory-hotplug.txt | 322 +++ 1 files changed, 322 insertions(+) Index: makedocument/Documentation/memory-hotplug.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ makedocument/Documentation/memory-hotplug.txt 2007-07-28 11:47:52.0 +0900 @@ -0,0 +1,322 @@ +== +Memory Hotplug +== + +Last Updated: Jul 28 2007 + +This document is about memory hotplug including how-to-use and current status. +Because Memory Hotplug is still under development
Re: [BUG] 2.6.23-rc3-mm1 Kernel panic - not syncing: DMA: Memory would be corrupted
On (22/08/07 16:27), Luck, Tony didst pronounce:

The more ioc's you have, the more space you will use. Default SW IOTLB allocation is 64MB ... how much should we see used per ioc?

Kamalesh: You could try increasing the amount of sw iotlb space available by booting with a swiotlb=131072 argument (argument value is the number of 2K slabs to allocate ... 131072 would give you four times as much space as the default allocation).

I tried that value and, just in case, swiotlb=262144. The IA-64 machine I have here fails with the same message anyway. i.e.

[ 19.834906] mptbase: Initiating ioc1 bringup
[ 20.317152] ioc1: LSI53C1030 C0: Capabilities={Initiator}
[ 15.474303] scsi1 : ioc1: LSI53C1030 C0, FwRev=01032821h, Ports=1, MaxQ=222, IRQ=72
[ 20.669730] GSI 142 (level, low) -> CPU 5 (0x1200) vector 73
[ 20.675602] ACPI: PCI Interrupt :41:03.0[A] -> GSI 142 (level, low) -> IRQ 73
[ 20.683508] mptbase: Initiating ioc2 bringup
[ 21.166796] ioc2: LSI53C1030 C0: Capabilities={Initiator}
[ 21.180539] DMA: Out of SW-IOMMU space for 263200 bytes at device ?
[ 21.187018] Kernel panic - not syncing: DMA: Memory would be corrupted

I saw the same trouble on my box, and I chased what was wrong. Here is today's progress of mine.

__get_free_pages() of swiotlb_alloc_coherent() fails in rc3-mm1. (See the following patch.) But it doesn't fail on rc2-mm2, and the kernel can boot up. Hmmm.

(2.6.23-rc3-mm1)
---
swiotlb_alloc_coherent flags=21 order=3 ret=
DMA: Out of SW-IOMMU space for 266368 bytes at device ?
Kernel panic - not syncing: DMA: Memory would be corrupted
---
(2.6.23-rc2-mm2)
---
swiotlb_alloc_coherent flags=21 order=3 ret=e0002008
:
(boot up continue...)
---
 lib/swiotlb.c | 2 ++
 1 file changed, 2 insertions(+)

Index: current/lib/swiotlb.c
===
--- current.orig/lib/swiotlb.c	2007-08-23 22:27:01.0 +0900
+++ current/lib/swiotlb.c	2007-08-23 22:29:49.0 +0900
@@ -455,6 +455,8 @@ swiotlb_alloc_coherent(struct device *hwdev
 		flags |= GFP_DMA;
 	ret = (void *)__get_free_pages(flags, order);
+
+	printk("%s flags=%0x order=%d ret=%p\n", __func__, flags, order, ret);
 	if (ret && address_needs_mapping(hwdev, virt_to_bus(ret))) {
 		/*
 		 * The allocated memory isn't reachable by the device.

-- 
Yasunori Goto
[PATCH] Fix find_next_best_node (Re: [BUG] 2.6.23-rc3-mm1 Kernel panic - not syncing: DMA: Memory would be corrupted)
I found that find_next_best_node() was wrong. I confirmed boot-up with the following patch. Mel-san, Kamalesh-san, could you try this?

Bye.

---
Fix the decision of a memoryless node in find_next_best_node(). This can be the cause of the SW-IOMMU allocation failure. This patch is for 2.6.23-rc3-mm1.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: current/mm/page_alloc.c
===
--- current.orig/mm/page_alloc.c	2007-08-24 16:03:17.0 +0900
+++ current/mm/page_alloc.c	2007-08-24 16:04:06.0 +0900
@@ -2136,7 +2136,7 @@ static int find_next_best_node(int node,
 		 * Note: N_HIGH_MEMORY state not guaranteed to be
 		 * populated yet.
 		 */
-		if (pgdat->node_present_pages)
+		if (!pgdat->node_present_pages)
 			continue;
 
 		/* Don't want a node to appear more than once */

-- 
Yasunori Goto
[Patch 000/002] Rearrange notifier of memory hotplug
Hello. This patch set rearranges the event notifier for memory hotplug, because the old notifier has some defects. For example, there is no information like the new memory's pfn and # of pages for callback functions. Fortunately, nothing uses this notifier so far, so there is no impact from this change. (SLUB will use this after this patch set, to make the kmem_cache_node structure.) In addition, descriptions of the notifier are added to the memory hotplug document. This patch was a part of the patch set to make kmem_cache_node of SLUB to avoid panic on memory online. But I think this change is useful not only for SLUB but also for others, so I extracted it. This patch set is for 2.6.23-rc8-mm2. I tested this patch on my ia64 box. Please apply. Bye.

-- 
Yasunori Goto
[Patch 001/002] Make description of memory hotplug notifier in document
Add description about event notification callback routine to the document. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- Documentation/memory-hotplug.txt | 56 --- 1 file changed, 53 insertions(+), 3 deletions(-) Index: current/Documentation/memory-hotplug.txt === --- current.orig/Documentation/memory-hotplug.txt +++ current/Documentation/memory-hotplug.txt @@ -2,7 +2,8 @@ Memory Hotplug == -Last Updated: Jul 28 2007 +Created: Jul 28 2007 +Add description of notifier of memory hotplug Oct 11 2007 This document is about memory hotplug including how-to-use and current status. Because Memory Hotplug is still under development, contents of this text will @@ -24,7 +25,8 @@ be changed often. 6.1 Memory offline and ZONE_MOVABLE 6.2. How to offline memory 7. Physical memory remove -8. Future Work List +8. Memory hotplug event notifier +9. Future Work List Note(1): x86_64's has special implementation for memory hotplug. This text does not describe it. @@ -307,8 +309,68 @@ Need more implementation yet - Notification completion of remove works by OS to firmware. - Guard from remove if not yet. + +8. Memory hotplug event notifier + +Memory hotplug has event notifer. There are 6 types of notification. + +MEMORY_GOING_ONLINE + This is notified before memory online. If some structures must be prepared + for new memory, it should be done at this event's callback. + The new onlining memory can't be used yet. + +MEMORY_CANCEL_ONLINE + If memory online fails, this event is notified for rollback of setting at + MEMORY_GOING_ONLINE. + (Currently, this event is notified only the case which a callback routine + of MEMORY_GOING_ONLINE fails). + +MEMORY_ONLINE + This event is called when memory online is completed. The page allocator uses + new memory area before this notification. In other words, callback routine + use new memory area via page allocator. + The failures of callbacks of this notification will be ignored. 
+ +MEMORY_GOING_OFFLINE + This is notified on halfway of memory offline. The offlining pages are + isolated. In other words, the page allocater doesn't allocate new pages from + offlining memory area at this time. If callback routine freed some pages, + they are not used by the page allocator. + This is good place for shrinking cache. (If possible, it is desirable to + migrate to other area.) + +MEMORY_CANCEL_OFFLINE + If memory offline fails, this event is notified for rollback against + MEMORY_GOING_OFFLINE. The page allocator will use target memory area after + this callback again. + +MEMORY_OFFLINE + This is notified after memory offline completed. The failures of callbacks + of this notification will be ignored. Callback routine can return structures + for offlined memory. + If the node which has offlined memory, + +A callback routine can be registered by + hotplug_memory_notifier(callback_func, priority). + +The second argument of callback function (action) is event types of above. +The third argument is passed by pointer of struct memory_notify. + +struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; + int status_change_nid; +}; +start_pfn is start pfn of online/offline memory. +nr_pages is # of pages of online/offline memory. +status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be) +set/clear. It means a new(memoryless) node gets new memory by online and a +node lose all memory. If this is -1, then nodemask status is not changed. +If status_changed_nid = 0, callback should create/discard structures for the +node if necessary. + -- -8. Future Work +9. Future Work -- - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like sysctl or new control file. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Patch 002/002] rearrange patch for notifier of memory hotplug
The current memory notifier has some defects yet. (Fortunately, nothing uses it.) This patch fixes and rearranges them.

- Add information of start_pfn, nr_pages, and node id if node status changes from/to a memoryless node, for callback functions. Callbacks can't do anything without that information.
- Add notification of going-online status. It is necessary for creating per-node structures before the node's pages are available.
- Move the GOING_OFFLINE status notification to after page isolation. It is a good place for callbacks to return memory like caches, because the returned pages are not used again.
- Make CANCEL events for rolling back when an error occurs.
- Delete the MEM_MAPPING_INVALID notification. It will not be used.
- Fix compile error of (un)register_memory_notifier().

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
---
 drivers/base/memory.c  |  9 +
 include/linux/memory.h | 27 +++
 mm/memory_hotplug.c    | 48 +---
 3 files changed, 61 insertions(+), 23 deletions(-)

Index: current/drivers/base/memory.c
===
--- current.orig/drivers/base/memory.c	2007-10-11 14:33:02.0 +0900
+++ current/drivers/base/memory.c	2007-10-11 14:33:07.0 +0900
@@ -137,7 +137,7 @@ static ssize_t show_mem_state(struct sys
 	return len;
 }
 
-static inline int memory_notify(unsigned long val, void *v)
+int memory_notify(unsigned long val, void *v)
 {
 	return blocking_notifier_call_chain(&memory_chain, val, v);
 }
@@ -183,7 +183,6 @@ memory_block_action(struct memory_block
 			break;
 		case MEM_OFFLINE:
 			mem->state = MEM_GOING_OFFLINE;
-			memory_notify(MEM_GOING_OFFLINE, NULL);
 			start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
 			ret = remove_memory(start_paddr,
					PAGES_PER_SECTION << PAGE_SHIFT);
@@ -191,7 +190,6 @@ memory_block_action(struct memory_block
 				mem->state = old_state;
 				break;
 			}
-			memory_notify(MEM_MAPPING_INVALID, NULL);
 			break;
 		default:
 			printk(KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
@@ -199,11 +197,6 @@ memory_block_action(struct memory_block
 			WARN_ON(1);
 			ret = -EINVAL;
 	}
-	/*
-	 * For now, only notify on successful memory operations
-	 */
-	if (!ret)
-		memory_notify(action, NULL);
 
 	return ret;
 }

Index: current/include/linux/memory.h
===
--- current.orig/include/linux/memory.h	2007-10-11 14:33:02.0 +0900
+++ current/include/linux/memory.h	2007-10-11 15:19:31.0 +0900
@@ -41,18 +41,15 @@ struct memory_block {
 #define	MEM_ONLINE		(1<<0)	/* exposed to userspace */
 #define	MEM_GOING_OFFLINE	(1<<1)	/* exposed to userspace */
 #define	MEM_OFFLINE		(1<<2)	/* exposed to userspace */
+#define	MEM_GOING_ONLINE	(1<<3)
+#define	MEM_CANCEL_ONLINE	(1<<4)
+#define	MEM_CANCEL_OFFLINE	(1<<5)
 
-/*
- * All of these states are currently kernel-internal for notifying
- * kernel components and architectures.
- *
- * For MEM_MAPPING_INVALID, all notifier chains with priority >0
- * are called before pfn_to_page() becomes invalid. The priority=0
- * entry is reserved for the function that actually makes
- * pfn_to_page() stop working. Any notifiers that want to be called
- * after that should have priority <0.
- */
-#define	MEM_MAPPING_INVALID	(1<<3)
+struct memory_notify {
+	unsigned long start_pfn;
+	unsigned long nr_pages;
+	int status_change_nid;
+};
 
 struct notifier_block;
 struct mem_section;
@@ -69,12 +66,18 @@ static inline int register_memory_notifier
 static inline void unregister_memory_notifier(struct notifier_block *nb)
 {
 }
+static inline int memory_notify(unsigned long val, void *v)
+{
+	return 0;
+}
 #else
+extern int register_memory_notifier(struct notifier_block *nb);
+extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_new_memory(struct mem_section *);
 extern int unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int remove_memory_block(unsigned long, struct mem_section *, int);
-
+extern int memory_notify(unsigned long val, void *v);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)

Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c	2007-10-11 14:33:02.0 +0900
+++ current/mm/memory_hotplug.c
[Patch 000/002] Make kmem_cache_node for SLUB on memory online to avoid panic(take 2)
This patch set fixes a panic due to a NULL pointer access in SLUB. When new memory is hot-added on a new node (or a memory-less node), kmem_cache_node for the new node is not prepared, and a panic occurs because of it. So, kmem_cache_node should be created for the node before new memory is available on the node. Incidentally, it is freed on memory offline if it becomes unnecessary. This is the first user of the callback of the memory notifier, and it requires the notifier rearrange patch set. This patch set is for 2.6.23-rc8-mm2. I tested this patch on my ia64 box. Please apply. Bye.

-- 
Yasunori Goto
[Patch 001/002] extract kmem_cache_shrink
Make kmem_cache_shrink_node() a callback routine for the memory hotplug notifier. This just extracts a part of kmem_cache_shrink().

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
---
 mm/slub.c | 111 ++
 1 file changed, 61 insertions(+), 50 deletions(-)

Index: current/mm/slub.c
===
--- current.orig/mm/slub.c	2007-10-11 20:30:45.0 +0900
+++ current/mm/slub.c	2007-10-11 21:58:47.0 +0900
@@ -2626,6 +2626,56 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static inline void __kmem_cache_shrink_node(struct kmem_cache *s, int node,
+					struct list_head *slabs_by_inuse)
+{
+	struct kmem_cache_node *n;
+	int i;
+	struct page *page;
+	struct page *t;
+	unsigned long flags;
+
+	n = get_node(s, node);
+
+	if (!n->nr_partial)
+		return;
+
+	for (i = 0; i < s->objects; i++)
+		INIT_LIST_HEAD(slabs_by_inuse + i);
+
+	spin_lock_irqsave(&n->list_lock, flags);
+
+	/*
+	 * Build lists indexed by the items in use in each slab.
+	 *
+	 * Note that concurrent frees may occur while we hold the
+	 * list_lock. page->inuse here is the upper limit.
+	 */
+	list_for_each_entry_safe(page, t, &n->partial, lru) {
+		if (!page->inuse && slab_trylock(page)) {
+			/*
+			 * Must hold slab lock here because slab_free
+			 * may have freed the last object and be
+			 * waiting to release the slab.
+			 */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+		} else
+			list_move(&page->lru, slabs_by_inuse + page->inuse);
+	}
+
+	/*
+	 * Rebuild the partial list with the slabs filled up most
+	 * first and the least used slabs at the end.
+	 */
+	for (i = s->objects - 1; i >= 0; i--)
+		list_splice(slabs_by_inuse + i, n->partial.prev);
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -2636,68 +2686,29 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int kmem_cache_shrink(struct kmem_cache *s)
+int kmem_cache_shrink_node(struct kmem_cache *s, int node)
 {
-	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
 	struct list_head *slabs_by_inuse =
 		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
-	unsigned long flags;
 
 	if (!slabs_by_inuse)
 		return -ENOMEM;
 
 	flush_all(s);
-	for_each_node_state(node, N_NORMAL_MEMORY) {
-		n = get_node(s, node);
-
-		if (!n->nr_partial)
-			continue;
-
-		for (i = 0; i < s->objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
-
-		spin_lock_irqsave(&n->list_lock, flags);
-
-		/*
-		 * Build lists indexed by the items in use in each slab.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
-				list_del(&page->lru);
-				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
-			} else {
-				list_move(&page->lru,
-					slabs_by_inuse + page->inuse);
-			}
-		}
-
-		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
-		 */
-		for (i = s->objects - 1; i >= 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
-
-		spin_unlock_irqrestore(&n->list_lock, flags);
-	}
+	if (node >= 0)
+		__kmem_cache_shrink_node(s, node, slabs_by_inuse);
+	else
[Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback
This makes kmem_cache_nodes of all SLUBs for a new node when memory hot-add is called. This fixes a panic caused by a NULL pointer access at discard_slab() after memory hot-add. If pages on the new node are available, slub can use them before making new kmem_cache_nodes, so this callback should be called BEFORE pages on the node become available. When memory online is called, slab_mem_going_online_callback() is called to make kmem_cache_node(). If it (or another callback) fails, then slab_mem_offline_callback() is called for rollback. On memory offline, slab_mem_going_offline_callback() is called to shrink the cache, then slab_mem_offline_callback() is called later.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
---
 mm/slub.c | 117 ++
 1 file changed, 117 insertions(+)

Index: current/mm/slub.c
===
--- current.orig/mm/slub.c	2007-10-11 20:31:37.0 +0900
+++ current/mm/slub.c	2007-10-11 21:58:10.0 +0900
@@ -20,6 +20,7 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/memory.h>
 
 /*
  * Lock order:
@@ -2711,6 +2712,120 @@ int kmem_cache_shrink(struct kmem_cache
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static int slab_mem_going_offline_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int local_node, offline_node = marg->status_change_nid;
+
+	if (offline_node < 0)
+		/* node has memory yet. nothing to do. */
+		return 0;
+
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		local_node = page_to_nid(virt_to_page(s));
+		if (local_node == offline_node)
+			/* This slub is on the offline node. */
+			return -EBUSY;
+	}
+	up_read(&slub_lock);
+
+	kmem_cache_shrink_node(s, offline_node);
+
+	return 0;
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int offline_node;
+
+	offline_node = marg->status_change_nid;
+
+	if (offline_node < 0)
+		/* node has memory yet. nothing to do. */
+		return;
+
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		n = get_node(s, offline_node);
+		if (n) {
+			/*
+			 * if n->nr_slabs > 0, offline_pages() must fail,
+			 * because the node is used by slub yet.
+			 */
+			BUG_ON(atomic_read(&n->nr_slabs));
+
+			s->node[offline_node] = NULL;
+			kmem_cache_free(kmalloc_caches, n);
+		}
+	}
+	up_read(&slub_lock);
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+
+	/* If the node already has memory, then nothing is necessary. */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * New memory will be onlined on the node which has no memory so far.
+	 * New kmem_cache_node is necessary for it.
+	 */
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: The new node's memory can't be allocated yet,
+		 * kmem_cache_node will be allocated on another node.
+		 */
+		n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
+		if (!n)
+			return -ENOMEM;
+		init_kmem_cache_node(n);
+		s->node[nid] = n;
+	}
+	up_read(&slub_lock);
+
+	return 0;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		ret = slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
 /
  *			Basic setup of slabs
 ***/
@@ -2741,6 +2856,8 @@ void __init kmem_cache_init(void
Re: [Patch 001/002] Make description of memory hotplug notifier in document
Looks good. Some suggestions on improving the wording. Thanks! I'll fix them. Bye. On Fri, 12 Oct 2007, Yasunori Goto wrote: +MEMORY_GOING_ONLINE + This is notified before memory online. If some structures must be prepared + for new memory, it should be done at this event's callback. + The new onlining memory can't be used yet. Generated before new memory becomes available in order to be able to prepare subsystems to handle memory. The page allocator is still unable to allocate from the new memory. +MEMORY_CANCEL_ONLINE + If memory online fails, this event is notified for rollback of setting at + MEMORY_GOING_ONLINE. + (Currently, this event is notified only the case which a callback routine + of MEMORY_GOING_ONLINE fails). Generated if MEMORY_GOING_ONLINE fails. +MEMORY_ONLINE + This event is called when memory online is completed. The page allocator uses + new memory area before this notification. In other words, callback routine + use new memory area via page allocator. + The failures of callbacks of this notification will be ignored. Generated when memory has succesfully brought online. The callback may allocate from the new memory. +MEMORY_GOING_OFFLINE + This is notified on halfway of memory offline. The offlining pages are + isolated. In other words, the page allocater doesn't allocate new pages from + offlining memory area at this time. If callback routine freed some pages, + they are not used by the page allocator. + This is good place for shrinking cache. (If possible, it is desirable to + migrate to other area.) Generated to begin the process of offlining memory. Allocations are no longer possible from the memory but some of the memory to be offlined is still in use. The callback can be used to free memory known to a subsystem from the indicated node. +MEMORY_CANCEL_OFFLINE + If memory offline fails, this event is notified for rollback against + MEMORY_GOING_OFFLINE. The page allocator will use target memory area after + this callback again. 
Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from the node that we attempted to offline. + +MEMORY_OFFLINE Generated after offlining memory is complete. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch 001/002] extract kmem_cache_shrink
On Fri, 12 Oct 2007, Yasunori Goto wrote: Make kmem_cache_shrink_node() as a callback routine for the memory hotplug notifier. This just extracts a part of kmem_cache_shrink(). Could we just call kmem_cache_shrink? It will do the shrink on every node, but memory hotplug is rare? Yes it is. Memory hotplug is rare. Ok. I'll do it. Thanks. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback
On Fri, 12 Oct 2007, Yasunori Goto wrote: If pages on the new node are available, slub can use them before new kmem_cache_nodes are made. So, this callback should be called BEFORE pages on the node are available. If it's called before pages on the node are available then it must fall back and cannot use the pages. Hmm. My description may be wrong. I would like to just mention that kmem_cache_node should be created before the node's pages can be allocated. If not, it will cause a panic. +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) +static int slab_mem_going_offline_callback(void *arg) +{ + struct kmem_cache *s; + struct memory_notify *marg = arg; + int local_node, offline_node = marg->status_change_nid; + + if (offline_node < 0) + /* node has memory yet. nothing to do. */ Please clarify the comment. This seems to indicate that we should not do anything because the node still has memory? Yes. kmem_cache_node is still necessary for the remaining memory on the node. Doesn't the node always have memory before offlining? If the node doesn't have memory and offline_pages() is called for it, that must be checked and must fail. This callback shouldn't be called. If it is, that is a bug in memory hotplug, I think. + return 0; + + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) { + local_node = page_to_nid(virt_to_page(s)); + if (local_node == offline_node) + /* This slub is on the offline node. */ + return -EBUSY; + } + up_read(&slub_lock); So this checks if any kmem_cache structure is on the offlined node? If so then we cannot offline the node? Right. If slab migration were possible, here would be a good place for doing it. But, it is not possible (at least now). + kmem_cache_shrink_node(s, offline_node); kmem_cache_shrink(s) would be okay here I would think. The function is reasonably fast. Offlining is a rare event. Ok. I'll fix it.
+static void slab_mem_offline_callback(void *arg) We call this after we have established that no kmem_cache structures are on this node and after we have shrunk the slabs. Is there any guarantee that no slab operations have occurred since then? If slabs still exist, they can't be migrated and offline_pages() has to give up the offline. This means the MEM_OFFLINE event is not generated when slabs are on the removing node. In other words, when this event is generated, all pages in this section are isolated and there are no used pages (slabs). +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int offline_node; + + offline_node = marg->status_change_nid; + + if (offline_node < 0) + /* node has memory yet. nothing to do. */ + return; Does this mean that the node still has memory? Yes. + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) { + n = get_node(s, offline_node); + if (n) { + /* +* if n->nr_slabs > 0, offline_pages() must fail, +* because the node is used by slub yet. +*/ It may be clearer to say: If nr_slabs > 0 then slabs still exist on the node that is going down. We were unable to free them so we must fail. Again. If nr_slabs > 0, offline_pages() must fail due to slabs remaining on the node before this point. So, this callback isn't called. +static int slab_mem_going_online_callback(void *arg) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int nid = marg->status_change_nid; + + /* If the node already has memory, then nothing is necessary. */ + if (nid < 0) + return 0; The node must have memory Or we have already brought up the code? kmem_cache_node is created at boot time if the node has memory. (Or, it is created by this callback on the first memory added on the node). When nid == -1, kmem_cache_node was created for this node before, because the node has memory. + /* +* New memory will be onlined on the node which has no memory so far. +* A new kmem_cache_node is necessary for it.
We are bringing a node online. No memory is available yet. We must allocate a kmem_cache_node structure in order to bring the node online. ? Your wording might be ok. But I would prefer to define the status of node hotplug exactly, like the following: A) Node online -- pgdat is created and can be accessed for this node, but there is no guarantee that cpu or memory is onlined. This status is very close to a memory-less node. But this might be a halfway status for node hotplug. The node online bit is set. But N_HIGH_MEMORY (or N_NORMAL_MEMORY) might not be set. B) Node has memory -- one or more sections
Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback
On Fri, 12 Oct 2007, Yasunori Goto wrote: + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) { + local_node = page_to_nid(virt_to_page(s)); + if (local_node == offline_node) + /* This slub is on the offline node. */ + return -EBUSY; + } + up_read(&slub_lock); So this checks if any kmem_cache structure is on the offlined node? If so then we cannot offline the node? Right. If slab migration were possible, here would be a good place for doing it. But, it is not possible (at least now). I think you can avoid this check. The kmem_cache structures are allocated from the kmalloc array. The check if the kmalloc slabs are empty will fail if kmem_cache structures still exist on the node. Ah, Ok. +* because the node is used by slub yet. +*/ It may be clearer to say: If nr_slabs > 0 then slabs still exist on the node that is going down. We were unable to free them so we must fail. Again. If nr_slabs > 0, offline_pages() must fail due to slabs remaining on the node before this point. So, this callback isn't called. Ok then we can remove these checks? Hmm. Yes. I'll remove it. +static int slab_mem_going_online_callback(void *arg) +{ + struct kmem_cache_node *n; + struct kmem_cache *s; + struct memory_notify *marg = arg; + int nid = marg->status_change_nid; + + /* If the node already has memory, then nothing is necessary. */ + if (nid < 0) + return 0; The node must have memory Or we have already brought up the code? kmem_cache_node is created at boot time if the node has memory. (Or, it is created by this callback on the first memory added on the node). When nid == -1, kmem_cache_node was created for this node before, because the node has memory. So the function can be called for a node that is already online? already has node memory available, accurately ;-) +* New memory will be onlined on the node which has no memory so far. +* A new kmem_cache_node is necessary for it. We are bringing a node online. No memory is available yet.
We must allocate a kmem_cache_node structure in order to bring the node online. ? Your wording might be ok. But I would prefer to define the status of node hotplug exactly, like the following: A) Node online -- pgdat is created and can be accessed for this node, but there is no guarantee that cpu or memory is onlined. This status is very close to a memory-less node. But this might be a halfway status for node hotplug. The node online bit is set. But N_HIGH_MEMORY (or N_NORMAL_MEMORY) might not be set. Ahh.. Okay. B) Node has memory -- one or more sections of memory are onlined on the node. N_HIGH_MEMORY (or N_NORMAL_MEMORY) is set. If the first memory is onlined on the node, the node status changes from A) to B). I feel this is very useful to manage the halfway status of node hotplug. (So, the memory-less node patch is very helpful for me.) So, I would like to avoid using the word node online here. But, if the above definition is messy for others, I'll change it. Ok can we talk about this as node online and node memory available? Yes. Thanks. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Doc] Memory hotplug document take 3
Hello. This is the newest version of document of memory hotplug. Please apply. -- Change log from take 2. - updates against comments from Randy-san. (Take 3 is same as http://lkml.org/lkml/2007/7/27/432 ) Change log from take 1. - updates against comments from Randy-san (Thanks a lot!) - mention about physical/logical phase of hotplug. change sections for it. - add description of kernel config option. - add description of relationship against ACPI node-hotplug. - make patch style. - etc. --- This is add a document for memory hotplug to describe How to use and Current status. --- Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED] Signed-off-by: Yasunori Goto [EMAIL PROTECTED] Documentation/memory-hotplug.txt | 322 +++ 1 files changed, 322 insertions(+) Index: makedocument/Documentation/memory-hotplug.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ makedocument/Documentation/memory-hotplug.txt 2007-07-28 11:47:52.0 +0900 @@ -0,0 +1,322 @@ +== +Memory Hotplug +== + +Last Updated: Jul 28 2007 + +This document is about memory hotplug including how-to-use and current status. +Because Memory Hotplug is still under development, contents of this text will +be changed often. + +1. Introduction + 1.1 purpose of memory hotplug + 1.2. Phases of memory hotplug + 1.3. Unit of Memory online/offline operation +2. Kernel Configuration +3. sysfs files for memory hotplug +4. Physical memory hot-add phase + 4.1 Hardware(Firmware) Support + 4.2 Notify memory hot-add event by hand +5. Logical Memory hot-add phase + 5.1. State of memory + 5.2. How to online memory +6. Logical memory remove + 6.1 Memory offline and ZONE_MOVABLE + 6.2. How to offline memory +7. Physical memory remove +8. Future Work List + +Note(1): x86_64's has special implementation for memory hotplug. + This text does not describe it. +Note(2): This text assumes that sysfs is mounted at /sys. + + +--- +1. 
Introduction +--- + +1.1 purpose of memory hotplug + +Memory Hotplug allows users to increase/decrease the amount of memory. +Generally, there are two purposes. + +(A) For changing the amount of memory. +This is to allow a feature like capacity on demand. +(B) For installing/removing DIMMs or NUMA-nodes physically. +This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc. + +(A) is required by highly virtualized environments and (B) is required by +hardware which supports memory power management. + +Linux memory hotplug is designed for both purposes. + + +1.2. Phases of memory hotplug +--- +There are 2 phases in Memory Hotplug. + 1) Physical Memory Hotplug phase + 2) Logical Memory Hotplug phase. + +The first phase is to communicate with hardware/firmware and make/erase +the environment for hotplugged memory. Basically, this phase is necessary +for purpose (B), but this is a good phase for communication between +highly virtualized environments too. + +When memory is hotplugged, the kernel recognizes new memory, makes new memory +management tables, and makes sysfs files for the new memory's operation. + +If firmware supports notification of connection of new memory to the OS, +this phase is triggered automatically. ACPI can notify this event. If not, +a probe operation by the system administrator is used instead. +(see Section 4.). + +The Logical Memory Hotplug phase is to change the memory state into +available/unavailable for users. The amount of memory from the user's view is +changed by this phase. The kernel makes all memory in it into free pages +when a memory range is available. + +In this document, this phase is described as online/offline. + +The Logical Memory Hotplug phase is triggered by a write to a sysfs file by the +system administrator. For the hot-add case, it must be executed after the +Physical Hotplug phase by hand. +(However, if you write udev hotplug scripts for memory hotplug, these + phases can be executed in a seamless way.) + + +1.3.
Unit of Memory online/offline operation + +Memory hotplug uses the SPARSEMEM memory model. SPARSEMEM divides the whole memory +into chunks of the same size. Each chunk is called a section. The size of +a section is architecture dependent. For example, power uses 16MiB, ia64 uses +1GiB. The unit of online/offline operation is one section. (see Section 3.) + +To determine the size of sections, please read this file: + +/sys/devices/system/memory/block_size_bytes + +This file shows the size of sections in bytes. + +--- +2. Kernel Configuration +--- +To use the memory hotplug feature, the kernel must be compiled with the following +config options. + +- For all memory hotplug +Memory model - Sparse Memory (CONFIG_SPARSEMEM) +Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG) + +- To enable memory removal, the following are also necessary +Allow
Re: [2.6 patch] mm/migrate.c: cleanups
Sorry for the late response. But, this patch is the cause of a compile error in the memory unplug code of 2.6.23-rc1-mm2. It uses putback_lru_pages(). Don't make it static please... :-( Bye. CC mm/memory_hotplug.o mm/memory_hotplug.c: In function ‘do_migrate_range’: mm/memory_hotplug.c:402: error: implicit declaration of function ‘putback_lru_pages’ make[1]: *** [mm/memory_hotplug.o] Error 1 This patch contains the following cleanups: - every file should include the headers containing the prototypes for its global functions - make the needlessly global putback_lru_pages() static Signed-off-by: Adrian Bunk [EMAIL PROTECTED] Acked-by: Christoph Lameter [EMAIL PROTECTED] --- This patch has been sent on: - 6 Jul 2007 include/linux/migrate.h |2 -- mm/migrate.c|3 ++- 2 files changed, 2 insertions(+), 3 deletions(-) --- linux-2.6.22-rc6-mm1/include/linux/migrate.h.old 2007-07-05 17:10:01.0 +0200 +++ linux-2.6.22-rc6-mm1/include/linux/migrate.h 2007-07-05 17:10:10.0 +0200 @@ -26,7 +26,6 @@ } extern int isolate_lru_page(struct page *p, struct list_head *pagelist); -extern int putback_lru_pages(struct list_head *l); extern int migrate_page(struct address_space *, struct page *, struct page *); extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long); @@ -44,7 +43,6 @@ static inline int isolate_lru_page(struct page *p, struct list_head *list) { return -ENOSYS; } -static inline int putback_lru_pages(struct list_head *l) { return 0; } static inline int migrate_pages(struct list_head *l, new_page_t x, unsigned long private) { return -ENOSYS; } --- linux-2.6.22-rc6-mm1/mm/migrate.c.old 2007-07-05 17:10:16.0 +0200 +++ linux-2.6.22-rc6-mm1/mm/migrate.c 2007-07-05 17:11:43.0 +0200 @@ -28,6 +28,7 @@ #include <linux/mempolicy.h> #include <linux/vmalloc.h> #include <linux/security.h> +#include <linux/syscalls.h> #include "internal.h" @@ -101,7 +102,7 @@ * * returns the number of pages put back.
*/ -int putback_lru_pages(struct list_head *l) +static int putback_lru_pages(struct list_head *l) { struct page *page; struct page *page2; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Avoiding fragmentation through different allocator
Hello. I'm also very interested in your patches, because I'm working on memory hotplug too. One possibility is that we could say that the UserRclm and KernRclm pools are always eligible for hotplug and have hotplug banks only satisfy those allocations, pushing KernNonRclm allocations to fixed banks. How is it currently known if a bank of memory is hotplug? Is there a node for each hotplug bank? If yes, we could flag those nodes to only satisfy UserRclm and KernRclm allocations and force fallback to other nodes. There are 2 types of memory hotplug. a) SMP machine case: some part of memory will be added and removed. b) NUMA machine case: the whole of a node will be able to be removed and added. However, if a block of memory like a DIMM is broken and disabled, it is close to a). How to know where the hotpluggable banks are is a platform/architecture dependent issue. ex) Asking ACPI. Just node0 is unremovable, and other nodes are removable. etc... In your current patch, the first attribute of all pages is NoRclm. But if your patches had an interface to decide where the Rclm areas will be for each arch/platform, it might be good. The danger is that allocations would fail because non-hotplug banks were already full and pageout would not happen because the watermarks were satisfied. In this case, if the user can change the attribute of a Rclm area to NoRclm, it is better than nothing. In the hotplug patches, there will be a new zone, ZONE_REMOVABLE. But in this patch, this change of attribute is a little bit difficult. (At first remove the pages from the free_area of the removable zone, then add them to the free_area of the un-removable zone.) Probably this change is easier in your patch. (Bear in mind I can't test hotplug-related issues due to lack of suitable hardware) I also don't have a real hotplug machine now. ;-) I just use software emulation. It looks like you left the per_cpu_pages as-is. Did you consider separating those as well to reflect kernel vs. user pools?
I kept the per-cpu caches for UserRclm-style allocations only because otherwise a Kernel-nonreclaimable allocation could easily be taken from a UserRclm pool. I agree that dividing per-cpu caches is not good way. But if Kernel-nonreclaimable allocation use its UserRclm pool, its removable memory bank will be harder to remove suddenly. Is it correct? If so, it is not good for memory hotplug. H. Anyway, thank you for your patch. It is very interesting. Bye. -- Yasunori Goto ygoto at us.fujitsu.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Avoiding fragmentation through different allocator
There are 2 types of memory hotplug. a) SMP machine case: some part of memory will be added and removed. b) NUMA machine case: the whole of a node will be able to be removed and added. However, if a block of memory like a DIMM is broken and disabled, it is close to a). How to know where the hotpluggable banks are is a platform/architecture dependent issue. ex) Asking ACPI. Just node0 is unremovable, and other nodes are removable. etc... Is there an architecture-independent way of finding this out? No. At least, I have no idea. :-( In your current patch, the first attribute of all pages is NoRclm. But if your patches had an interface to decide where the Rclm areas will be for each arch/platform, it might be good. It doesn't have an API as such. In page_alloc.c, there is a function get_pageblock_type() that returns what type of allocation the block of memory is being used for. There is no guarantee there are only those types of allocations there though. OK. I will write a patch with a function to set it for some arch/platforms. What's the current attitude toward adding a new zone? I felt there would be resistance as a new zone would affect a lot of code paths and be yet another zone that needed balancing. For example, is there a HIGHMEM version of ZONE_REMOVABLE or could normal and highmem be in this zone? Yes. In my current patch of memory hotplug, Removable is like Highmem. ( http://sourceforge.net/mailarchive/forum.php?forum_id=223 It is group B of Hot Add patches for NUMA ) I tried to make a new removable zone which could be with normal and dma before it. But, it needed too much work as you said. So, I gave it up. I heard Matt-san has some ideas for it. So, I'm looking forward to seeing it. I agree that dividing per-cpu caches is not a good way. But if a kernel-nonreclaimable allocation uses the UserRclm pool, its removable memory bank will suddenly become harder to remove. Is that correct? If so, it is not good for memory hotplug. Hmm. It is correct. However, this will only happen in low-memory conditions.
For a kernel-nonreclaimable allocation to use the userrclm pool, three conditions have to be met: 1. The kernel-nonreclaimable pool has no pages 2. There are no global 2^MAX_ORDER pages 3. The kern-reclaimable pool has no pages I suppose that if this patch has worked for one year, an unlucky case might occur. Probably, an enterprise system will not allow it. So, I will try disabling fallback for KernNoRclm. Thanks. -- Yasunori Goto ygoto at us.fujitsu.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Patch]compile error of register_memory()
Hello. register_memory() becomes double definition in 2.6.20-rc1. It is defined in arch/i386/kernel/setup.c as static definition in 2.6.19. But it is moved to arch/i386/kernel/e820.c in 2.6.20-rc1. And same name function is defined in driver/base/memory.c too. So, it becomes cause of compile error of duplicate definition if memory hotplug option is on. This patch is to fix it. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- arch/i386/kernel/e820.c |2 +- arch/i386/kernel/setup.c |2 +- include/asm-i386/e820.h |2 +- 3 files changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.20-rc1/arch/i386/kernel/e820.c === --- linux-2.6.20-rc1.orig/arch/i386/kernel/e820.c 2006-12-19 21:52:36.0 +0900 +++ linux-2.6.20-rc1/arch/i386/kernel/e820.c2006-12-19 22:15:59.0 +0900 @@ -668,7 +668,7 @@ } } -void __init register_memory(void) +void __init e820_register_memory(void) { unsigned long gapstart, gapsize, round; unsigned long long last; Index: linux-2.6.20-rc1/arch/i386/kernel/setup.c === --- linux-2.6.20-rc1.orig/arch/i386/kernel/setup.c 2006-12-19 21:52:36.0 +0900 +++ linux-2.6.20-rc1/arch/i386/kernel/setup.c 2006-12-19 22:15:59.0 +0900 @@ -639,7 +639,7 @@ get_smp_config(); #endif - register_memory(); + e820_register_memory(); #ifdef CONFIG_VT #if defined(CONFIG_VGA_CONSOLE) Index: linux-2.6.20-rc1/include/asm-i386/e820.h === --- linux-2.6.20-rc1.orig/include/asm-i386/e820.h 2006-12-19 21:52:36.0 +0900 +++ linux-2.6.20-rc1/include/asm-i386/e820.h2006-12-19 22:16:28.0 +0900 @@ -40,7 +40,7 @@ unsigned type); extern void find_max_pfn(void); extern void register_bootmem_low_pages(unsigned long max_low_pfn); -extern void register_memory(void); +extern void e820_register_memory(void); extern void limit_regions(unsigned long long size); extern void print_memory_map(char *who); -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please 
read the FAQ at http://www.tux.org/lkml/
[Patch](memory hotplug) fix compile error for i386 with NUMA config (take 3).
Hello. This is the take 3 patch to fix a compile error when configuring memory hotplug with NUMA on i386. The cause of the compile error was the missing arch_add_memory(), remove_memory(), and memory_add_physaddr_to_nid(). I fixed some bad points, and tested that it compiles without error. This is for 2.6.20-rc1. Please apply. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- arch/i386/mm/discontig.c | 28 arch/i386/mm/init.c | 10 ++ 2 files changed, 30 insertions(+), 8 deletions(-) Index: linux-2.6.20-rc1/arch/i386/mm/init.c === --- linux-2.6.20-rc1.orig/arch/i386/mm/init.c 2006-12-20 22:12:07.0 +0900 +++ linux-2.6.20-rc1/arch/i386/mm/init.c2006-12-20 22:12:09.0 +0900 @@ -673,16 +673,10 @@ #endif } -/* - * this is for the non-NUMA, single node SMP system case. - * Specifically, in the case of x86, we will always add - * memory to the highmem for now. - */ #ifdef CONFIG_MEMORY_HOTPLUG -#ifndef CONFIG_NEED_MULTIPLE_NODES int arch_add_memory(int nid, u64 start, u64 size) { - struct pglist_data *pgdata = &contig_page_data; + struct pglist_data *pgdata = NODE_DATA(nid); struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM; unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; @@ -694,7 +688,7 @@ { return -EINVAL; } -#endif +EXPORT_SYMBOL_GPL(remove_memory); #endif struct kmem_cache *pgd_cache; Index: linux-2.6.20-rc1/arch/i386/mm/discontig.c === --- linux-2.6.20-rc1.orig/arch/i386/mm/discontig.c 2006-12-20 22:12:07.0 +0900 +++ linux-2.6.20-rc1/arch/i386/mm/discontig.c 2006-12-20 22:37:54.0 +0900 @@ -405,3 +405,31 @@ totalram_pages += totalhigh_pages; #endif } + +#ifdef CONFIG_MEMORY_HOTPLUG +int paddr_to_nid(u64 addr) +{ + int nid; + unsigned long pfn = PFN_DOWN(addr); + + for_each_node(nid) + if (node_start_pfn[nid] <= pfn && pfn < node_end_pfn[nid]) + return nid; + + return -1; +} + +/* + * This function is used to ask the node id BEFORE memmap and mem_section's + * initialization (pfn_to_nid() can't be used yet).
 + * If _PXM is not defined in ACPI's DSDT, the node id must be found by this. + */ +int memory_add_physaddr_to_nid(u64 addr) +{ + int nid = paddr_to_nid(addr); + return (nid >= 0) ? nid : 0; +} + +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); +#endif -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Patch](memory hotplug) Fix compile error for i386 with NUMA config
Hello. This patch fixes a compile error when configuring memory hotplug with NUMA on i386. The cause of the compile error was the missing arch_add_memory(), remove_memory(), and memory_add_physaddr_to_nid() when the NUMA config is on. This is for 2.6.19, and I tested that it compiles without error. Please apply. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- arch/i386/mm/discontig.c | 17 + arch/i386/mm/init.c |4 +--- 2 files changed, 18 insertions(+), 3 deletions(-) Index: linux-2.6.19/arch/i386/mm/init.c === --- linux-2.6.19.orig/arch/i386/mm/init.c 2006-12-04 20:06:32.0 +0900 +++ linux-2.6.19/arch/i386/mm/init.c2006-12-04 21:09:49.0 +0900 @@ -681,10 +681,9 @@ * memory to the highmem for now. */ #ifdef CONFIG_MEMORY_HOTPLUG -#ifndef CONFIG_NEED_MULTIPLE_NODES int arch_add_memory(int nid, u64 start, u64 size) { - struct pglist_data *pgdata = &contig_page_data; + struct pglist_data *pgdata = NODE_DATA(nid); struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM; unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; @@ -697,7 +696,6 @@ return -EINVAL; } #endif -#endif kmem_cache_t *pgd_cache; kmem_cache_t *pmd_cache; Index: linux-2.6.19/arch/i386/mm/discontig.c === --- linux-2.6.19.orig/arch/i386/mm/discontig.c 2006-12-04 20:06:32.0 +0900 +++ linux-2.6.19/arch/i386/mm/discontig.c 2006-12-09 17:30:24.0 +0900 @@ -405,3 +405,20 @@ totalram_pages += totalhigh_pages; #endif } + +#ifdef CONFIG_MEMORY_HOTPLUG +/* This is the case that there is no _PXM in the DSDT for added memory */ +int memory_add_physaddr_to_nid(u64 addr) +{ + int nid; + unsigned long pfn = addr >> PAGE_SHIFT; + + for (nid = 0; nid < MAX_NUMNODES; nid++){ + if (node_start_pfn[nid] <= pfn && pfn < node_end_pfn[nid]) + return nid; + } + + return 0; +} +#endif -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch](memory hotplug) Fix compile error for i386 with NUMA config
Hi David-san. On Sat, 9 Dec 2006, Yasunori Goto wrote: Hello. This patch is to fix compile error when config memory hotplug with numa on i386. The cause of compile error was missing of arch_add_memory(), remove_memory(), and memory_add_physaddr_to_nid() when NUMA config is on. This is for 2.6.19, and I tested no compile error of it. Please apply. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] --- arch/i386/mm/discontig.c | 17 + arch/i386/mm/init.c |4 +--- 2 files changed, 18 insertions(+), 3 deletions(-) Index: linux-2.6.19/arch/i386/mm/init.c === --- linux-2.6.19.orig/arch/i386/mm/init.c 2006-12-04 20:06:32.0 +0900 +++ linux-2.6.19/arch/i386/mm/init.c2006-12-04 21:09:49.0 +0900 @@ -681,10 +681,9 @@ * memory to the highmem for now. */ #ifdef CONFIG_MEMORY_HOTPLUG -#ifndef CONFIG_NEED_MULTIPLE_NODES int arch_add_memory(int nid, u64 start, u64 size) { - struct pglist_data *pgdata = contig_page_data; + struct pglist_data *pgdata = NODE_DATA(nid); struct zone *zone = pgdata-node_zones + ZONE_HIGHMEM; unsigned long start_pfn = start PAGE_SHIFT; unsigned long nr_pages = size PAGE_SHIFT; @@ -697,7 +696,6 @@ return -EINVAL; } #endif -#endif kmem_cache_t *pgd_cache; kmem_cache_t *pmd_cache; The reason for the #ifndef CONFIG_NEED_MULTIPLE_NODES check seems to solely exist for excluding the NUMA case, so it doesn't appear as though this is the correct fix since your changelog indicates a compile problem with a NUMA build. This hypothesis is supported by the comment which conveniently appears just before arch_add_memory which _explicitly_ states that the following is for non-NUMA cases. No. Other arch's arch_add_memory() and remove_memory() have been already used for NUMA case too. But i386 didn't do it because just contig_page_data is used. Current NODE_DATA() macro is defined both case appropriately. So, this #ifdef is redundant now. 
(See: http://marc.theaimsgroup.com/?l=linux-mmm=116494983531221w=2) Index: linux-2.6.19/arch/i386/mm/discontig.c === --- linux-2.6.19.orig/arch/i386/mm/discontig.c 2006-12-04 20:06:32.0 +0900 +++ linux-2.6.19/arch/i386/mm/discontig.c 2006-12-09 17:30:24.0 +0900 @@ -405,3 +405,20 @@ totalram_pages += totalhigh_pages; #endif } + +#ifdef CONFIG_MEMORY_HOTPLUG +/* This is the case that there is no _PXM on DSDT for added memory */ +int memory_add_physaddr_to_nid(u64 addr) +{ + int nid; + unsigned long pfn = addr PAGE_SHIFT; + + for (nid = 0; nid MAX_NUMNODES; nid++){ + if (node_start_pfn[nid] = pfn + pfn node_end_pfn[nid]) + return nid; + } + + return 0; +} +#endif memory_add_physaddr_to_nid is only declared as extern in include/linux/memory_hotplug.h in the CONFIG_NUMA case so this also doesn't appear as the correct fix but probably worked for your compile since you had CONFIG_MEMORY_HOTPLUG enabled. memory_add_physaddr_to_nid() is used for memory hotplug to find node id for new memory when ACPI's DSDT doesn't define _PXM for new memory. So, when CONFIG_MEMORY_HOTPLUG is not set, this function is not used. Bye. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch](memory hotplug) Fix compile error for i386 with NUMA config
No. Other archs' arch_add_memory() and remove_memory() are already used for the NUMA case too. But i386 didn't do so because just contig_page_data is used. The current NODE_DATA() macro is defined appropriately for both cases, so this #ifdef is redundant now.

Then I assume the comment directly above this change is also redundant, since it explicitly states that the following code is for the non-NUMA case.

Ah. Yes indeed. Here is the fixed patch. Thanks for your comment. Bye.

---
This patch fixes a compile error when memory hotplug is configured with NUMA on i386. The cause of the compile error was the missing arch_add_memory(), remove_memory(), and memory_add_physaddr_to_nid(). This is for 2.6.19, and I confirmed it compiles without error. Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 arch/i386/mm/discontig.c | 17 +
 arch/i386/mm/init.c      |  9 +
 2 files changed, 18 insertions(+), 8 deletions(-)

Index: linux-2.6.19/arch/i386/mm/init.c
===
--- linux-2.6.19.orig/arch/i386/mm/init.c	2006-12-09 17:42:06.0 +0900
+++ linux-2.6.19/arch/i386/mm/init.c	2006-12-11 16:58:49.0 +0900
@@ -675,16 +675,10 @@
 #endif
 }
 
-/*
- * this is for the non-NUMA, single node SMP system case.
- * Specifically, in the case of x86, we will always add
- * memory to the highmem for now.
- */
 #ifdef CONFIG_MEMORY_HOTPLUG
-#ifndef CONFIG_NEED_MULTIPLE_NODES
 int arch_add_memory(int nid, u64 start, u64 size)
 {
-	struct pglist_data *pgdata = &contig_page_data;
+	struct pglist_data *pgdata = NODE_DATA(nid);
 	struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -697,7 +691,6 @@
 	return -EINVAL;
 }
 #endif
-#endif
 
 kmem_cache_t *pgd_cache;
 kmem_cache_t *pmd_cache;

Index: linux-2.6.19/arch/i386/mm/discontig.c
===
--- linux-2.6.19.orig/arch/i386/mm/discontig.c	2006-12-09 17:42:06.0 +0900
+++ linux-2.6.19/arch/i386/mm/discontig.c	2006-12-09 17:58:32.0 +0900
@@ -405,3 +405,20 @@
 	totalram_pages += totalhigh_pages;
 #endif
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/* This is the case that there is no _PXM on DSDT for added memory */
+int memory_add_physaddr_to_nid(u64 addr)
+{
+	int nid;
+	unsigned long pfn = addr >> PAGE_SHIFT;
+
+	for (nid = 0; nid < MAX_NUMNODES; nid++) {
+		if (node_start_pfn[nid] <= pfn &&
+		    pfn < node_end_pfn[nid])
+			return nid;
+	}
+
+	return 0;
+}
+#endif

--
Yasunori Goto
Re: memory hotplug function redefinition/confusion
Hello.

include/linux/memory_hotplug.h uses CONFIG_NUMA to decide: (snip) but mm/init.c uses CONFIG_ACPI_NUMA to decide: (snip) (sic: duplicate function above)

Indeed. It is strange. This is a patch for it. Thanks for your report! Bye.

This fixes a compile error of x86-64 memory hotplug without any NUMA option:

  CC      arch/x86_64/mm/init.o
arch/x86_64/mm/init.c:501: error: redefinition of 'memory_add_physaddr_to_nid'
include/linux/memory_hotplug.h:71: error: previous definition of 'memory_add_physaddr_to_nid' was here
arch/x86_64/mm/init.c:509: error: redefinition of 'memory_add_physaddr_to_nid'
arch/x86_64/mm/init.c:501: error: previous definition of 'memory_add_physaddr_to_nid' was here
make[1]: *** [arch/x86_64/mm/init.o] Error 1

I confirmed compile completion with !NUMA, (NUMA && !ACPI_NUMA), and (NUMA && ACPI_NUMA). This patch is for 2.6.19-rc5-mm2.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 arch/x86_64/mm/init.c | 9 +
 1 files changed, 1 insertion(+), 8 deletions(-)

Index: 19-rc5-mm2/arch/x86_64/mm/init.c
===
--- 19-rc5-mm2.orig/arch/x86_64/mm/init.c	2006-11-17 22:31:30.0 +0900
+++ 19-rc5-mm2/arch/x86_64/mm/init.c	2006-11-17 22:31:40.0 +0900
@@ -496,7 +496,7 @@ int remove_memory(u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 
-#ifndef CONFIG_ACPI_NUMA
+#if !defined(CONFIG_ACPI_NUMA) && defined(CONFIG_NUMA)
 int memory_add_physaddr_to_nid(u64 start)
 {
 	return 0;
@@ -504,13 +504,6 @@ int memory_add_physaddr_to_nid(u64 start
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
-#ifndef CONFIG_ACPI_NUMA
-int memory_add_physaddr_to_nid(u64 start)
-{
-	return 0;
-}
-#endif
-
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE

--
Yasunori Goto
Re: [RFC][PATCH v2 7/7] Do not recompute msgmni anymore if explicitly set by user
Thanks, Nadia-san. I tested this patch set on my box. It works well. I have only one comment.

---
 ipc/ipc_sysctl.c | 43 +--
 1 file changed, 41 insertions(+), 2 deletions(-)

Index: linux-2.6.24/ipc/ipc_sysctl.c
===
--- linux-2.6.24.orig/ipc/ipc_sysctl.c	2008-01-29 16:55:04.0 +0100
+++ linux-2.6.24/ipc/ipc_sysctl.c	2008-01-31 13:13:14.0 +0100
@@ -34,6 +34,24 @@ static int proc_ipc_dointvec(ctl_table *
 	return proc_dointvec(ipc_table, write, filp, buffer, lenp, ppos);
 }
 
+static int proc_ipc_callback_dointvec(ctl_table *table, int write,
+	struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	size_t lenp_bef = *lenp;
+	int rc;
+
+	rc = proc_ipc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (write && !rc && lenp_bef == *lenp)
+		/*
+		 * Tunable has successfully been changed from userland:
+		 * disable its automatic recomputing.
+		 */
+		unregister_ipcns_notifier(current->nsproxy->ipc_ns);
+
+	return rc;
+}
+

Hmmm. I suppose this may be a side effect which the user does not wish. I would like to recommend a switch which can turn the automatic recomputing on/off. If the user would like to change this value, the switch should be turned off first; otherwise, his request is rejected with some message. Probably, users can understand that more easily than this side effect.

Bye.
--
Yasunori Goto
[RFC] Document about lowmem_reserve_ratio
Hello.

I found that the documentation about lowmem_reserve_ratio is not written, and the lower_zone_protection description still remains. I fixed it. Something may be wrong due to my misunderstanding, and the sentences are probably not natural (I'm not a native English speaker), so please review it. Thanks.

---
Though lower_zone_protection was changed to lowmem_reserve_ratio, the document has not been changed. The lowmem_reserve_ratio seems quite hard to estimate, but there is no guidance. This patch changes the document for it.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/filesystems/proc.txt | 76 +
 1 file changed, 61 insertions(+), 15 deletions(-)

Index: current/Documentation/filesystems/proc.txt
===
--- current.orig/Documentation/filesystems/proc.txt	2008-01-17 20:01:37.0 +0900
+++ current/Documentation/filesystems/proc.txt	2008-01-18 12:22:10.0 +0900
@@ -1311,7 +1311,7 @@
 If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel
 will use the legacy (2.4) layout for all processes.
 
-lower_zone_protection
+lowmem_reserve_ratio
 -
 
 For some specialised workloads on highmem machines it is dangerous for
@@ -1331,25 +1331,71 @@
 mechanism will also defend that region from allocations which could use
 highmem or lowmem).
 
-The `lower_zone_protection' tunable determines how aggressive the kernel is
-in defending these lower zones. The default value is zero - no
-protection at all.
+The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
+in defending these lower zones.
 
 If you have a machine which uses highmem or ISA DMA and your
 applications are using mlock(), or if you are running with no swap then
-you probably should increase the lower_zone_protection setting.
+you probably should change the lowmem_reserve_ratio setting.
 
-The units of this tunable are fairly vague. It is approximately equal
-to megabytes, so setting lower_zone_protection=100 will protect around 100
-megabytes of the lowmem zone from user allocations. It will also make
-those 100 megabytes unavailable for use by applications and by
-pagecache, so there is a cost.
-
-The effects of this tunable may be observed by monitoring
-/proc/meminfo:LowFree. Write a single huge file and observe the point
-at which LowFree ceases to fall.
+The lowmem_reserve_ratio is an array. You can see it by reading this file:
+-
+% cat /proc/sys/vm/lowmem_reserve_ratio
+256     256     32
+-
+Note: the number of elements is one fewer than the number of zones, because
+      the highest zone's value is not necessary for the following calculation.
+
+These values are not used directly. The kernel calculates the number of
+protection pages for each zone from them. These are shown as an array of
+protection pages in /proc/zoneinfo like the following. (This is an example
+from an x86-64 box.) Each zone has an array of protection pages like this:
+
+-
+Node 0, zone      DMA
+  pages free     1355
+        min      3
+        low      3
+        high     4
+	:
+	:
+    numa_other   0
+        protection: (0, 2004, 2004, 2004)
+  pagesets
+    cpu: 0 pcp: 0
+        :
+-
+These protections are added to the watermark when judging whether this zone
+should be used for page allocation or should be reclaimed instead.
+
+In this example, if normal pages (index=2) are required of this DMA zone and
+pages_high is used as the watermark, the kernel judges this zone should not
+be used because pages_free(1355) is smaller than watermark + protection[2]
+(4 + 2004 = 2008). If this protection value were 0, this zone would be used
+for a normal page request. If the request is for the DMA zone (index=0),
+protection[0] (=0) is used.
+
+zone[i]'s protection[j] is calculated by the following expression:
+
+(i < j):
+  zone[i]->protection[j]
+  = (total sum of present_pages from zone[i+1] to zone[j] on the node)
+    / lowmem_reserve_ratio[i];
+(i = j):
+  (should not be protected; = 0)
+(i > j):
+  (not necessary, but looks 0)
+
+The default values of lowmem_reserve_ratio[i] are
+    256 (if zone[i] means DMA or DMA32 zone)
+    32  (others).
+As the above expression shows, they are the reciprocal of the ratio:
+256 means 1/256. The number of protection pages becomes about 0.39% of
+the total present pages of the higher zones on the node.
 
-A reasonable value for lower_zone_protection is 100.
+If you would like to protect more pages, smaller values are effective.
+The minimum value is 1 (1/1 -> 100%).
 
 page-cluster

--
Yasunori Goto
Re: [RFC] Document about lowmem_reserve_ratio
Oops. I sent to Andrea's old mail address. Sorry for the repost.

[The body and patch are the same as in the previous message.]

--
Yasunori Goto
Re: [RFC][PATCH 1/2]: MM: Make Page Tables Relocatable--Conditional TLB Flush
Hello.

This is a nitpick, but all architectures' code except generic uses MMF_NNED_FLUSH at clear_bit()...
                                                                       ^
Please fix the misspelling.

Bye.

diff -uprwNbB -X 2.6.23/Documentation/dontdiff 2.6.23/arch/alpha/kernel/smp.c 2.6.23a/arch/alpha/kernel/smp.c
--- 2.6.23/arch/alpha/kernel/smp.c	2007-10-09 13:31:38.0 -0700
+++ 2.6.23a/arch/alpha/kernel/smp.c	2007-10-29 13:50:06.0 -0700
@@ -850,6 +850,8 @@ flush_tlb_mm(struct mm_struct *mm)
 {
 	preempt_disable();
 
+	clear_bit(MMF_NNED_FLUSH, &mm->flags);
+
 	if (mm == current->active_mm) {
 		flush_tlb_current(mm);
 		if (atomic_read(&mm->mm_users) <= 1) {

--
Yasunori Goto
[PATCH] Add IORESOURCE_BUSY flag for System RAM take 2.
Hello.

I merged Badari-san's patch and mine. This and Kame-san's following patch are necessary for x86-64 memory unplug.
http://marc.info/?l=linux-mm&m=119399026017901&w=2

I heard Kame-san's patch is already included in -mm, so I'll repost the merged patch now. This patch is tested on 2.6.23-mm1. Please apply.

---
i386 and x86-64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY, but ia64 registers it as IORESOURCE_MEM only. In addition, the memory hotplug code registers new memory as IORESOURCE_MEM too. This difference causes a failure of memory unplug on x86-64. This patch fixes it.

This patch adds IORESOURCE_BUSY to avoid potential overlap mapping by a PCI device.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

---
 arch/ia64/kernel/efi.c | 6 ++
 kernel/resource.c      | 2 +-
 mm/memory_hotplug.c    | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

Index: current/arch/ia64/kernel/efi.c
===
--- current.orig/arch/ia64/kernel/efi.c	2007-11-02 17:17:30.0 +0900
+++ current/arch/ia64/kernel/efi.c	2007-11-02 17:19:10.0 +0900
@@ -,7 +,7 @@ efi_initialize_iomem_resources(struct re
 		if (md->num_pages == 0) /* should not happen */
 			continue;
 
-		flags = IORESOURCE_MEM;
+		flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 		switch (md->type) {
 
 			case EFI_MEMORY_MAPPED_IO:
@@ -1133,12 +1133,11 @@ efi_initialize_iomem_resources(struct re
 			case EFI_ACPI_MEMORY_NVS:
 				name = "ACPI Non-volatile Storage";
-				flags |= IORESOURCE_BUSY;
 				break;
 
 			case EFI_UNUSABLE_MEMORY:
 				name = "reserved";
-				flags |= IORESOURCE_BUSY | IORESOURCE_DISABLED;
+				flags |= IORESOURCE_DISABLED;
 				break;
 
 			case EFI_RESERVED_TYPE:
@@ -1147,7 +1146,6 @@ efi_initialize_iomem_resources(struct re
 			case EFI_ACPI_RECLAIM_MEMORY:
 			default:
 				name = "reserved";
-				flags |= IORESOURCE_BUSY;
 				break;
 			}

Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c	2007-11-02 17:19:09.0 +0900
+++ current/mm/memory_hotplug.c	2007-11-02 17:19:10.0 +0900
@@ -39,7 +39,7 @@ static struct resource *register_memory_
 	res->name = "System RAM";
 	res->start = start;
 	res->end = start + size - 1;
-	res->flags = IORESOURCE_MEM;
+	res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 	if (request_resource(&iomem_resource, res) < 0) {
 		printk("System RAM resource %llx - %llx cannot be added\n",
 			(unsigned long long)res->start, (unsigned long long)res->end);

Index: current/kernel/resource.c
===
--- current.orig/kernel/resource.c	2007-11-02 17:19:15.0 +0900
+++ current/kernel/resource.c	2007-11-02 17:22:39.0 +0900
@@ -287,7 +287,7 @@ walk_memory_resource(unsigned long start
 	int ret = -1;
 	res.start = (u64) start_pfn << PAGE_SHIFT;
 	res.end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
-	res.flags = IORESOURCE_MEM;
+	res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 	orig_end = res.end;
 	while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
 		pfn = (unsigned long)(res.start >> PAGE_SHIFT);

--
Yasunori Goto
Re: mm: memory/cpu hotplug section mismatch.
If CONFIG_MEMORY_HOTPLUG=n, __meminit == __init, and if CONFIG_HOTPLUG_CPU=n, __cpuinit == __init. However, with one set and the other disabled, you end up with a reference between __init and a regular non-init function.

My plan is to define dedicated sections for both __devinit and __meminit. Then we can apply the checks no matter the definition of CONFIG_HOTPLUG*.

I prefer defining a __nodeinit for the __cpuinit-and-__meminit case over __devinit. __devinit is used by many devices like I/O, and it is useful for many desktop users. But a cpu/memory hotpluggable box is very rare, and this code should stay in the init section for most people. This kind of issue is caused by initialization of pgdat/zone. I think __nodeinit is enough and desirable.

Bye.
--
Yasunori Goto
Re: mm: Fix memory/cpu hotplug section mismatch and oops.
Thanks. I tested compiling with cpu/memory hotplug off/on. It was OK.

Acked-by: Yasunori Goto [EMAIL PROTECTED]

(This is a resend of the earlier patch; this issue still needs to be fixed.)

When building with memory hotplug enabled and cpu hotplug disabled, we end up with the following section mismatch:

WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to .init.text: (between 'free_area_init_node' and '__build_all_zonelists')

This happens as a result of:

- free_area_init_node()
  - free_area_init_core()
    - zone_pcp_init() -- all __meminit up to this point
      - zone_batchsize() -- marked as __cpuinit

This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but CONFIG_MEMORY_HOTPLUG=y unsets __meminit.

Changing zone_batchsize() to __devinit fixes this. __devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to another section identifier completely. Without this, memory hot-add of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is not also enabled.

Signed-off-by: Paul Mundt [EMAIL PROTECTED]

--
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd8e335..05ace44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1968,7 +1968,7 @@ void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
 	memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
 #endif
 
-static int __cpuinit zone_batchsize(struct zone *zone)
+static int __devinit zone_batchsize(struct zone *zone)
 {
 	int batch;

--
Yasunori Goto
Re: [PATCH] mm: More __meminit annotations.
Thanks for your checking.

-void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
-				unsigned long size)
+static void __meminit zone_init_free_lists(struct pglist_data *pgdat,
+				struct zone *zone, unsigned long size)
 {
 	int order;
 	for (order = 0; order < MAX_ORDER; order++) {
@@ -2431,7 +2431,7 @@ void __meminit get_pfn_range_for_nid(unsigned int nid,
  * Return the number of pages a zone spans in a node, including holes
  * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
  */
-unsigned long __meminit zone_spanned_pages_in_node(int nid,
+static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long *ignored)
 {
@@ -2519,7 +2519,7 @@ unsigned long __init absent_pages_in_range(unsigned long start_pfn,
 }
 
 /* Return the number of page frames in holes in a zone on a node */
-unsigned long __meminit zone_absent_pages_in_node(int nid,
+static unsigned long __meminit zone_absent_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long *ignored)
 {

Ah, yes. Thanks. That is better.

@@ -2536,14 +2536,14 @@ unsigned long __meminit zone_absent_pages_in_node(int nid,
 }
 
 #else
-static inline unsigned long zone_spanned_pages_in_node(int nid,
+static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long *zones_size)
 {
 	return zones_size[zone_type];
 }
 
-static inline unsigned long zone_absent_pages_in_node(int nid,
+static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long *zholes_size)
 {

I thought __meminit is not effective for these static functions, because they are inlined. So, it depends on the caller's definition. Is that wrong?

Bye.
--
Yasunori Goto
Re: [PATCH] mm: More __meminit annotations.
On Mon, Jun 18, 2007 at 02:49:24PM +0900, Yasunori Goto wrote:

-static inline unsigned long zone_absent_pages_in_node(int nid,
+static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long *zholes_size)
 {

I thought __meminit is not effective for these static functions, because they are inlined. So, it depends on the caller's definition. Is that wrong?

Ah, that's possible, I hadn't considered that. It seems to be a bit more obvious what the intention is if it's annotated, especially as this is the convention that's used by the rest of mm/page_alloc.c. A bit more consistent, if nothing more.

I'm not sure which is intended. I found some functions in the kernel tree that are defined both __init and inline, and probably some that aren't. So, it seems there is no convention. I'm okay if you prefer both defined. :-)

--
Yasunori Goto
Re: [PATCH 5/7] Introduce a means of compacting memory within a zone
Hi Mel-san.

This is a very interesting feature. Now, I'm testing your patches.

+static int isolate_migratepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	unsigned long high_pfn, low_pfn, end_pfn, start_pfn;
(snip)
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; low_pfn < end_pfn; low_pfn++) {
+		if (!pfn_valid_within(low_pfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(low_pfn);

I hit a panic here on my tiger4. I compiled with CONFIG_SPARSEMEM, so CONFIG_HOLES_IN_ZONE is not set, and pfn_valid_within() returns 1 every time in this configuration. (That config is only for virtual memmap.) But my tiger4 box has memory holes in the normal zone. When this is changed to a normal pfn_valid(), no panic occurs. Hmmm.

Bye.
--
Yasunori Goto
Re: [PATCH] sparsemem: Shut up unused symbol compiler warnings.
I think this issue is fixed by move-three-functions-that-are-only-needed-for.patch in the current -mm tree. Is it not enough?

Thanks.

__kmalloc_section_memmap()/__kfree_section_memmap() and friends are only used by the memory hotplug code. Move these in to the existing CONFIG_MEMORY_HOTPLUG block.

Signed-off-by: Paul Mundt [EMAIL PROTECTED]

--
 mm/sparse.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 1302f83..35f739a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -229,6 +229,7 @@ static struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 	return NULL;
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
 {
 	struct page *page, *ret;
@@ -269,27 +270,6 @@ static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 }
 
 /*
- * Allocate the accumulated non-linear sections, allocate a mem_map
- * for each and record the physical to section mapping.
- */
-void __init sparse_init(void)
-{
-	unsigned long pnum;
-	struct page *map;
-
-	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
-		if (!valid_section_nr(pnum))
-			continue;
-
-		map = sparse_early_mem_map_alloc(pnum);
-		if (!map)
-			continue;
-		sparse_init_one_section(__nr_to_section(pnum), pnum, map);
-	}
-}
-
-#ifdef CONFIG_MEMORY_HOTPLUG
-/*
 * returns the number of sections whose mem_maps were properly
 * set. If this is <=0, then that means that the passed-in
 * map was not consumed and must be freed.
@@ -329,3 +309,23 @@ out:
 	return ret;
 }
 #endif
+
+/*
+ * Allocate the accumulated non-linear sections, allocate a mem_map
+ * for each and record the physical to section mapping.
+ */
+void __init sparse_init(void)
+{
+	unsigned long pnum;
+	struct page *map;
+
+	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+		if (!valid_section_nr(pnum))
+			continue;
+
+		map = sparse_early_mem_map_alloc(pnum);
+		if (!map)
+			continue;
+		sparse_init_one_section(__nr_to_section(pnum), pnum, map);
+	}
+}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to [EMAIL PROTECTED]
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: [EMAIL PROTECTED]

--
Yasunori Goto
Re: [PATCH] sparsemem: Shut up unused symbol compiler warnings.
On Fri, Jun 01, 2007 at 02:26:17PM +0900, Yasunori Goto wrote: I think this issue is fixed by move-three-functions-that-are-only-needed-for.patch in current -mm tree. Is it not enough? That's possible, I hadn't checked -mm. This was simply against current git. If there's already a fix in -mm, then this can simply be ignored. Okay. Thanks for your report. -- Yasunori Goto - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64 memory hotplug simulation support?
Hello, Nigel-san.

I'm wondering whether anyone has patches lying around that might be useful for simulating memory hotplug on x86_64. Googling has revealed some old x86 patches, but that's all.

I'm not sure what "simulation" means here. Could you tell me how/what you expect from memory hotplug simulation exactly? Memory hot-add code is included in the kernel, and remove (unplug) code has been developed (and hopefully it will be merged into -mm after some cleanups, I think). I would like to make sure what is necessary.

Thanks.
--
Yasunori Goto
Re: x86_64 memory hotplug simulation support?
Thanks for your reply.

Please, just call me Nigel :).

Haha. Okay, Nigel. (Though "-san" is fine even in friendly/frank situations in Japanese :) )

I saw a patch that Dave Hansen had posted, back around the time of 2.6.11 iirc. It was for x86, and (so far as I understand) allowed a person who doesn't really have hotpluggable memory to make their computer pretend that it does. Just in case I'm not being clear enough, let me get more concrete. I have had some code for a while that uses bitmaps to simulate page flags, without needing to take up those precious bits in page->flags. I've begun to add support for memory hotplugging, in the hope that I can make it general enough that it will be useful for more than just suspend2. To do that, I'd like to be able to test the memory hotplugging paths, without needing to actually have hotpluggable memory. I do have an x86 desktop I could work on, but would prefer to do it on my x86_64 laptop if I can.

Current memory hot-add code expects special hardware which allows physical memory hot-plug. So yes, there is no way to use it on a normal PC without emulation, and I don't have emulation code for x86-64. Usually, I'm using an ia64 box to test it.

There are 2 ideas for using memory hotplug on a normal x86-64 box:

- Make emulation code for x86-64. To add memory, some of the memory has to be ignored at boot time and added after boot up. This way may need fake BIOS information, and once memory is added, a reboot is necessary for the next hot-add test.

- Boot up normally, unplug some memory first, then hot-add it later. You can try the hot-plug code many times after bootup. The unplug code is not merged yet; the following is the newest one:
  http://marc.info/?l=linux-mm&m=118180415304117&w=2
  But the 6th patch of the series is only for ia64:
  http://marc.info/?l=linux-mm&m=118180483715610&w=2
  So, a patch playing the same role for x86-64 is still necessary.

In addition, some paths of the hot-add code can't be tested this way, because current hot-add has 2 phases:

1. Physical hot-add
   - Accept notification from firmware.
   - Make sysfs files for the new memory.
   - Register SPARSEMEM and allocate memmap/pgdat/zone.
2. Logical online
   - Free each page of the new memory to make it usable.
   - Rebuild the zonelist.

But the unplug code just does logical offline, so physical hot-unplug is necessary to test phase 1.

Hmm. I don't know what is necessary for suspend2, but some work looks necessary either way.

Thanks.
--
Yasunori Goto
[PATCH](memory hotplug) Fix unnecessary calling of init_currently_empty_zone()
Hello. This patch fixes unnecessary calls of init_currently_empty_zone(). zone->present_pages is updated in online_pages(), but __add_zone() can be called twice or more before online_pages() runs. So init_currently_empty_zone() can be called more times than necessary, which leaks the zone's wait_table. This patch is tested on my ia64 box with 2.6.22-rc2-mm1.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

mm/memory_hotplug.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

Index: vmemmap/mm/memory_hotplug.c
===
--- vmemmap.orig/mm/memory_hotplug.c	2007-05-29 15:30:28.0 +0900
+++ vmemmap/mm/memory_hotplug.c	2007-05-29 17:31:43.0 +0900
@@ -65,7 +65,7 @@ static int __add_zone(struct zone *zone,
 	int zone_type;

 	zone_type = zone - pgdat->node_zones;
-	if (!populated_zone(zone)) {
+	if (!zone->wait_table) {
 		int ret = 0;
 		ret = init_currently_empty_zone(zone, phys_start_pfn,
 						nr_pages, MEMMAP_HOTPLUG);
-- Yasunori Goto
[Patch] Fix unnecessary __meminit
It doesn't make a lot of sense to export an __init symbol to modules. I guess it's OK in this case, but we get warnings: It seems wrong to me to first tell the linker to discard the code after init and then to export the symbol to make it available for any module at any time. Both functions are relatively small, so better to avoid playing games and drop the __meminit tag. Ok. This is the patch. --- This fixes the unnecessary __meminit definitions. These functions are exported for kernel modules. I compiled on ia64/x86-64 with memory hotplug on/off.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

drivers/acpi/numa.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.21-mm1/drivers/acpi/numa.c
===
--- linux-2.6.21-mm1.orig/drivers/acpi/numa.c	2007-05-08 19:33:05.0 +0900
+++ linux-2.6.21-mm1/drivers/acpi/numa.c	2007-05-08 19:33:12.0 +0900
@@ -228,7 +228,7 @@ int __init acpi_numa_init(void)
 	return 0;
 }

-int __meminit acpi_get_pxm(acpi_handle h)
+int acpi_get_pxm(acpi_handle h)
 {
 	unsigned long pxm;
 	acpi_status status;
@@ -246,7 +246,7 @@ int __meminit acpi_get_pxm(acpi_handle h
 }
 EXPORT_SYMBOL(acpi_get_pxm);

-int __meminit acpi_get_node(acpi_handle *handle)
+int acpi_get_node(acpi_handle *handle)
 {
 	int pxm, node = -1;
-- Yasunori Goto
[RFC] memory hotremove patch take 2 [00/10]
Hello. I rebased and debugged Kame-san's memory hot-remove patches. This work is not finished yet (some pages remain un-removable), but I would like to show the current progress, because it has been a long time since the previous post and some bugs are fixed. If you are interested, please check it. Any comments are welcome. Thanks. ---

These patches are for memory hot-remove.

How to use:
- kernelcore=xx[GMK] must be specified as a boot-time option to create the ZONE_MOVABLE area.
- After bootup, execute the following:
  # echo offline > /sys/devices/system/memory/memoryX/status

Changes from the previous version:
- Rebased to 2.6.21-mm1.
- The old original ZONE_MOVABLE code is removed; Mel-san's ZONE_MOVABLE for anti-fragmentation is used.
- Fixed a wrong return-code check of isolate_lru_page().
- The source page of migration is now isolated ASAP during memory hot-remove. The old code used just put_page(), and we expected the migrated source page to be caught in __free_one_page() as an isolated page. But it was spooled in the per-cpu pages and soon reused as the next destination page of migration. This caused an endless loop in offline_pages().
- There is a page which is not mapped but is added to the swap cache in the swap-in code. It caused a panic in try_to_unmap(); fixed it.
- end_pfn is rounded up at memmap_init. If there is a small hole at the end of a section, those pages were not initialized.

TODO:
- There are some pages which are un-removable under memory stress. (These pages have PG_swapcache or PG_mappedtodisk set without being connected to the LRU.)
- Should make i386/x86-64/powerpc interface code, but not yet (really sorry :-( ).
- If a bootmem parameter or EFI's memory map is stored there, the memory can't be removed even if it is in the removable zone.
- Node hotplug support (this may need some amount of patches).
- Test under heavy workload and more careful race checks.
- Fix where we should allocate the migration target pages from.
And so on.
[1] counters patch -- per-zone counter for ZONE_MOVABLE
==page isolation==
[2] page isolation patch ... basic definitions of page isolation.
[3] drain_all_zone_pages patch ... drain all cpus' pcp pages.
[4] isolate freed page patch ... isolate pages in free_area[]
==memory unplug==
Offline a section of pages: isolate the specified section and migrate the contents of used pages out of the section. (Because the free pages in the section are isolated, they are never returned by alloc_pages().) This patch doesn't care where the new migration pages should be allocated from.
[5] memory unplug core patch --- maybe needs more work.
[6] interface patch --- offline interface support
==migration nocontext==
Fix a race condition of page migration without process context (not taking mm->sem). This patch delays kmem_cache_free() of anon_vma until migration ends.
[7] migration nocontext patch --- support page migration without acquiring mm->sem. Needs careful debugging...
==other fixes==
[8] round up end_pfn at memmap_init
[9] isolate pages ASAP in the memory-hotremove case.
[10] fix swapping-in page panic.
-- Yasunori Goto
[RFC] memory hotremove patch take 2 [02/10] (make page unused)
This patch adds support for making pages unused. Pages are isolated by capturing freed pages before they are inserted into free_area[], the buddy allocator. If you have an idea for avoiding the spin_lock(), please advise me. Isolating pages already in free_area[] is implemented in another patch.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

include/linux/mmzone.h | 8 +
include/linux/page_isolation.h | 52 +++
mm/Kconfig | 7 +
mm/page_alloc.c | 187 +
4 files changed, 254 insertions(+)

Index: current_test/include/linux/mmzone.h
===
--- current_test.orig/include/linux/mmzone.h	2007-05-08 15:06:49.0 +0900
+++ current_test/include/linux/mmzone.h	2007-05-08 15:08:03.0 +0900
@@ -314,6 +314,14 @@ struct zone {

 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
+#ifdef CONFIG_PAGE_ISOLATION
+	/*
+	 * For pages which are not used but not free.
+	 * See include/linux/page_isolation.h
+	 */
+	spinlock_t		isolation_lock;
+	struct list_head	isolation_list;
+#endif
 	/*
 	 * zone_start_pfn, spanned_pages and present_pages are all
 	 * protected by span_seqlock.
It is a seqlock because it has

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c	2007-05-08 15:07:20.0 +0900
+++ current_test/mm/page_alloc.c	2007-05-08 15:08:34.0 +0900
@@ -41,6 +41,7 @@
 #include <linux/pfn.h>
 #include <linux/backing-dev.h>
 #include <linux/fault-inject.h>
+#include <linux/page_isolation.h>

 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -448,6 +449,9 @@ static inline void __free_one_page(struc
 	if (unlikely(PageCompound(page)))
 		destroy_compound_page(page, order);

+	if (page_under_isolation(zone, page, order))
+		return;
+
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

 	VM_BUG_ON(page_idx & (order_size - 1));
@@ -3259,6 +3263,10 @@ static void __meminit free_area_init_cor
 		zone->nr_scan_inactive = 0;
 		zap_zone_vm_stats(zone);
 		atomic_set(&zone->reclaim_in_progress, 0);
+#ifdef CONFIG_PAGE_ISOLATION
+		spin_lock_init(&zone->isolation_lock);
+		INIT_LIST_HEAD(&zone->isolation_list);
+#endif
 		if (!size)
 			continue;
@@ -4214,3 +4222,182 @@ void set_pageblock_flags_group(struct pa
 	else
 		__clear_bit(bitidx + start_bitidx, bitmap);
 }
+
+#ifdef CONFIG_PAGE_ISOLATION
+/*
+ * Page Isolation.
+ *
+ * If a page is removed from the usual free_list and will never be used,
+ * it is linked to struct isolation_info and has the Reserved and Private
+ * bits set. page->mapping points to the isolation_info,
+ * and page_count(page) is 0.
+ *
+ * This can be used for creating a chunk of contiguous *unused* memory.
+ *
+ * The current user is Memory-Hot-Remove.
+ * Maybe moving this to some other file is better.
+ */
+static void
+isolate_page_nolock(struct isolation_info *info, struct page *page, int order)
+{
+	int pagenum;
+	pagenum = 1 << order;
+	while (pagenum > 0) {
+		SetPageReserved(page);
+		SetPagePrivate(page);
+		page->private = (unsigned long)info;
+		list_add(&page->lru, &info->pages);
+		page++;
+		pagenum--;
+	}
+}
+
+/*
+ * This function is called from page_under_isolation().
+ */
+int __page_under_isolation(struct zone *zone, struct page *page, int order)
+{
+	struct isolation_info *info;
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	int found = 0;
+
+	spin_lock_irqsave(&zone->isolation_lock, flags);
+	list_for_each_entry(info, &zone->isolation_list, list) {
+		if (info->start_pfn <= pfn && pfn < info->end_pfn) {
+			found = 1;
+			break;
+		}
+	}
+	if (found) {
+		isolate_page_nolock(info, page, order);
+	}
+	spin_unlock_irqrestore(&zone->isolation_lock, flags);
+	return found;
+}
+
+/*
+ * start and end must be in the same zone.
+ *
+ */
+struct isolation_info *
+register_isolation(unsigned long start, unsigned long end)
+{
+	struct zone *zone;
+	struct isolation_info *info = NULL, *tmp;
+	unsigned long flags;
+	unsigned long last_pfn = end - 1;
+
+	if (!pfn_valid(start) || !pfn_valid(last_pfn) || (start >= end))
+		return ERR_PTR(-EINVAL);
+	/* check start and end is in the same zone */
+	zone = page_zone(pfn_to_page(start));
+
+	if (zone != page_zone(pfn_to_page(last_pfn)))
+		return ERR_PTR
[RFC] memory hotremove patch take 2 [04/10] (isolate all free pages)
Isolate all freed pages (i.e. pages in the buddy lists) in the range. See the page_is_buddy() and free_one_page() functions if unsure.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

include/linux/page_isolation.h | 1
mm/page_alloc.c | 45 +
2 files changed, 46 insertions(+)

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c	2007-05-08 15:08:04.0 +0900
+++ current_test/mm/page_alloc.c	2007-05-08 15:08:26.0 +0900
@@ -4411,6 +4411,51 @@ free_all_isolated_pages(struct isolation
 	}
 }

+/*
+ * Isolate already freed pages.
+ */
+int
+capture_isolate_freed_pages(struct isolation_info *info)
+{
+	struct zone *zone;
+	unsigned long pfn;
+	struct page *page;
+	int order, order_size;
+	int nr_pages = 0;
+	unsigned long last_pfn = info->end_pfn - 1;
+
+	pfn = info->start_pfn;
+	if (!pfn_valid(pfn))
+		return -EINVAL;
+	zone = info->zone;
+	if ((zone != page_zone(pfn_to_page(pfn))) ||
+	    (zone != page_zone(pfn_to_page(last_pfn))))
+		return -EINVAL;
+	drain_all_pages();
+	spin_lock(&zone->lock);
+	while (pfn < info->end_pfn) {
+		if (!pfn_valid(pfn)) {
+			pfn++;
+			continue;
+		}
+		page = pfn_to_page(pfn);
+		/* See page_is_buddy() */
+		if (page_count(page) == 0 && PageBuddy(page)) {
+			order = page_order(page);
+			order_size = 1 << order;
+			zone->free_area[order].nr_free--;
+			__mod_zone_page_state(zone, NR_FREE_PAGES, -order_size);
+			list_del(&page->lru);
+			rmv_page_order(page);
+			isolate_page_nolock(info, page, order);
+			nr_pages += order_size;
+			pfn += order_size;
+		} else {
+			pfn++;
+		}
+	}
+	spin_unlock(&zone->lock);
+	return nr_pages;
+}
 #endif /* CONFIG_PAGE_ISOLATION */
Index: current_test/include/linux/page_isolation.h
===
--- current_test.orig/include/linux/page_isolation.h	2007-05-08 15:08:04.0 +0900
+++ current_test/include/linux/page_isolation.h	2007-05-08 15:08:27.0 +0900
@@ -40,6 +40,7 @@
 extern void free_isolation_info(struct isolation_info *info);
 extern void unuse_all_isolated_pages(struct isolation_info *info);
 extern void free_all_isolated_pages(struct isolation_info *info);
 extern void drain_all_pages(void);
+extern int capture_isolate_freed_pages(struct isolation_info *info);

 #else
-- Yasunori Goto
[RFC] memory hotremove patch take 2 [01/10] (counter of removable page)
Show the number of movable pages in meminfo and vmstat.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

arch/ia64/mm/init.c | 2 ++
drivers/base/node.c | 4
fs/proc/proc_misc.c | 4
include/linux/kernel.h | 2 ++
include/linux/swap.h | 1 +
mm/page_alloc.c | 22 ++
6 files changed, 35 insertions(+)

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c	2007-05-08 15:06:50.0 +0900
+++ current_test/mm/page_alloc.c	2007-05-08 15:08:36.0 +0900
@@ -58,6 +58,7 @@ unsigned long totalram_pages __read_most
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+unsigned long total_movable_pages __read_mostly;

 static void __free_pages_ok(struct page *page, unsigned int order);
@@ -1827,6 +1828,18 @@ static unsigned int nr_free_zone_pages(i
 	return sum;
 }

+unsigned int nr_free_movable_pages(void)
+{
+	unsigned long nr_pages = 0;
+	struct zone *zone;
+	int nid;
+
+	for_each_online_node(nid) {
+		zone = &(NODE_DATA(nid)->node_zones[ZONE_MOVABLE]);
+		nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
+	return nr_pages;
+}
 /*
  * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL
  */
@@ -1889,6 +1902,8 @@ void si_meminfo(struct sysinfo *val)
 	val->totalhigh = totalhigh_pages;
 	val->freehigh = nr_free_highpages();
 	val->mem_unit = PAGE_SIZE;
+	val->movable = total_movable_pages;
+	val->free_movable = nr_free_movable_pages();
 }

 EXPORT_SYMBOL(si_meminfo);
@@ -1908,6 +1923,11 @@ void si_meminfo_node(struct sysinfo *val
 	val->totalhigh = 0;
 	val->freehigh = 0;
 #endif
+
+	val->movable = pgdat->node_zones[ZONE_MOVABLE].present_pages;
+	val->free_movable = zone_page_state(&pgdat->node_zones[ZONE_MOVABLE],
+						NR_FREE_PAGES);
+	val->mem_unit = PAGE_SIZE;
 }
 #endif
@@ -3216,6 +3236,8 @@ static void __meminit free_area_init_cor
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
+		if (j == ZONE_MOVABLE)
+			total_movable_pages += realsize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
Index: current_test/include/linux/kernel.h
===
--- current_test.orig/include/linux/kernel.h	2007-05-08 15:06:49.0 +0900
+++ current_test/include/linux/kernel.h	2007-05-08 15:07:20.0 +0900
@@ -352,6 +352,8 @@ struct sysinfo {
 	unsigned short pad;		/* explicit padding for m68k */
 	unsigned long totalhigh;	/* Total high memory size */
 	unsigned long freehigh;		/* Available high memory size */
+	unsigned long movable;		/* Pages used only for data */
+	unsigned long free_movable;	/* Available pages in movable */
 	unsigned int mem_unit;		/* Memory unit size in bytes */
 	char _f[20-2*sizeof(long)-sizeof(int)];	/* Padding: libc5 uses this.. */
 };
Index: current_test/fs/proc/proc_misc.c
===
--- current_test.orig/fs/proc/proc_misc.c	2007-05-08 15:06:48.0 +0900
+++ current_test/fs/proc/proc_misc.c	2007-05-08 15:07:20.0 +0900
@@ -161,6 +161,8 @@ static int meminfo_read_proc(char *page,
 		"LowTotal:     %8lu kB\n"
 		"LowFree:      %8lu kB\n"
 #endif
+		"MovableTotal: %8lu kB\n"
+		"MovableFree:  %8lu kB\n"
 		"SwapTotal:    %8lu kB\n"
 		"SwapFree:     %8lu kB\n"
 		"Dirty:        %8lu kB\n"
@@ -191,6 +193,8 @@ static int meminfo_read_proc(char *page,
 		K(i.totalram-i.totalhigh),
 		K(i.freeram-i.freehigh),
 #endif
+		K(i.movable),
+		K(i.free_movable),
 		K(i.totalswap),
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
Index: current_test/drivers/base/node.c
===
--- current_test.orig/drivers/base/node.c	2007-05-08 15:06:10.0 +0900
+++ current_test/drivers/base/node.c	2007-05-08 15:07:20.0 +0900
@@ -55,6 +55,8 @@ static ssize_t node_read_meminfo(struct
 		"Node %d LowTotal:     %8lu kB\n"
 		"Node %d LowFree:      %8lu kB\n"
 #endif
+		"Node %d MovableTotal: %8lu kB\n"
+		"Node %d MovableFree:  %8lu kB\n"
 		"Node %d Dirty:        %8lu kB\n"
 		"Node %d Writeback:    %8lu kB\n"
 		"Node %d FilePages:    %8lu kB\n"
@@ -77,6
[RFC] memory hotremove patch take 2 [03/10] (drain all pages)
This patch adds the function drain_all_pages(void) to drain all pages on the per-cpu freelists. Page isolation will then catch them in free_one_page().

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

include/linux/page_isolation.h | 1 +
mm/page_alloc.c | 13 +
2 files changed, 14 insertions(+)

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c	2007-05-08 15:08:03.0 +0900
+++ current_test/mm/page_alloc.c	2007-05-08 15:08:33.0 +0900
@@ -1070,6 +1070,19 @@ void drain_all_local_pages(void)
 	smp_call_function(smp_drain_local_pages, NULL, 0, 1);
 }

+#ifdef CONFIG_PAGE_ISOLATION
+static void drain_local_zone_pages(struct work_struct *work)
+{
+	drain_local_pages();
+}
+
+void drain_all_pages(void)
+{
+	schedule_on_each_cpu(drain_local_zone_pages);
+}
+
+#endif /* CONFIG_PAGE_ISOLATION */
+
 /*
  * Free a 0-order page
  */
Index: current_test/include/linux/page_isolation.h
===
--- current_test.orig/include/linux/page_isolation.h	2007-05-08 15:08:03.0 +0900
+++ current_test/include/linux/page_isolation.h	2007-05-08 15:08:33.0 +0900
@@ -39,6 +39,7 @@ extern void detach_isolation_info_zone(s
 extern void free_isolation_info(struct isolation_info *info);
 extern void unuse_all_isolated_pages(struct isolation_info *info);
 extern void free_all_isolated_pages(struct isolation_info *info);
+extern void drain_all_pages(void);

 #else
-- Yasunori Goto
[RFC] memory hotremove patch take 2 [05/10] (make basic remove code)
Add the MEMORY_HOTREMOVE config and implement the basic algorithm. This config selects ZONE_MOVABLE and PAGE_ISOLATION.

How it works:
1. Register an isolation area over the specified section.
2. Search the mem_map and migrate pages.
3. Detach the isolation and make the pages unused.

This works in my simple tests, but I think more work is needed on the loop algorithm and policy.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

include/linux/memory_hotplug.h | 1
mm/Kconfig | 8 +
mm/memory_hotplug.c | 221 +
3 files changed, 229 insertions(+), 1 deletion(-)

Index: current_test/mm/Kconfig
===
--- current_test.orig/mm/Kconfig	2007-05-08 15:08:03.0 +0900
+++ current_test/mm/Kconfig	2007-05-08 15:08:27.0 +0900
@@ -126,6 +126,12 @@ config MEMORY_HOTPLUG_SPARSE
 	def_bool y
 	depends on SPARSEMEM && MEMORY_HOTPLUG

+config MEMORY_HOTREMOVE
+	bool "Allow for memory hot-remove"
+	depends on MEMORY_HOTPLUG_SPARSE
+	select MIGRATION
+	select PAGE_ISOLATION
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
@@ -145,7 +151,7 @@ config SPLIT_PTLOCK_CPUS
 config MIGRATION
 	bool "Page migration"
 	def_bool y
-	depends on NUMA
+	depends on NUMA || MEMORY_HOTREMOVE
 	help
 	  Allows the migration of the physical location of pages of processes
 	  while the virtual addresses are not changed. This is useful for
Index: current_test/mm/memory_hotplug.c
===
--- current_test.orig/mm/memory_hotplug.c	2007-05-08 15:02:48.0 +0900
+++ current_test/mm/memory_hotplug.c	2007-05-08 15:08:27.0 +0900
@@ -23,6 +23,9 @@
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
 #include <linux/cpuset.h>
+#include <linux/page_isolation.h>
+#include <linux/delay.h>
+#include <linux/migrate.h>

 #include <asm/tlbflush.h>
@@ -308,3 +311,221 @@ error:
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_memory);
+
+
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+/*
+ * Just an easy implementation.
+ */
+static struct page *
+hotremove_migrate_alloc(struct page *page,
+			unsigned long private,
+			int **x)
+{
+	return alloc_page(GFP_HIGH_MOVABLE);
+}
+
+/* scans # of pages per iteration */
+#define HOTREMOVE_UNIT	(1024)
+
+static int do_migrate_and_isolate_pages(struct isolation_info *info,
+					unsigned long start_pfn,
+					unsigned long end_pfn)
+{
+	int move_pages = HOTREMOVE_UNIT;
+	int ret, managed, not_managed;
+	unsigned long pfn;
+	struct page *page;
+	LIST_HEAD(source);
+
+	not_managed = 0;
+	for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
+		if (!pfn_valid(pfn)) /* never happens in sparsemem */
+			continue;
+		page = pfn_to_page(pfn);
+		if (is_page_isolated(info, page))
+			continue;
+		ret = isolate_lru_page(page, &source);
+
+		if (ret == 0) {
+			move_pages--;
+			managed++;
+		} else {
+			if (page_count(page))
+				not_managed++; /* someone uses this */
+		}
+	}
+	ret = -EBUSY;
+	if (not_managed) {
+		if (!list_empty(&source))
+			putback_lru_pages(&source);
+		goto out;
+	}
+	ret = 0;
+	if (list_empty(&source))
+		goto out;
+	/* this function returns # of failed pages */
+	ret = migrate_pages(&source, hotremove_migrate_alloc,
+			    (unsigned long)info);
+out:
+	return ret;
+}
+
+
+/*
+ * Check whether all pages registered as IORESOURCE_RAM are isolated or not.
+ */
+static int check_removal_success(struct isolation_info *info)
+{
+	struct resource res;
+	unsigned long section_end;
+	unsigned long start_pfn, i, nr_pages;
+	struct page *page;
+	int removed = 0;
+
+	res.start = info->start_pfn << PAGE_SHIFT;
+	res.end = (info->end_pfn - 1) << PAGE_SHIFT;
+	res.flags = IORESOURCE_MEM;
+	section_end = res.end;
+	while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
+		start_pfn = (res.start >> PAGE_SHIFT);
+		nr_pages = (res.end + 1UL - res.start) >> PAGE_SHIFT;
+		for (i = 0; i < nr_pages; i++) {
+			page = pfn_to_page(start_pfn + i);
+			if (!is_page_isolated(info, page))
+				return -EBUSY
[RFC] memory hotremove patch take 2 [06/10] (ia64's remove_memory code)
Call offline_pages() from remove_memory().

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]

arch/ia64/mm/init.c | 13 -
1 files changed, 12 insertions(+), 1 deletion(-)

Index: current_test/arch/ia64/mm/init.c
===
--- current_test.orig/arch/ia64/mm/init.c	2007-05-08 15:07:20.0 +0900
+++ current_test/arch/ia64/mm/init.c	2007-05-08 15:08:07.0 +0900
@@ -726,7 +726,18 @@ int arch_add_memory(int nid, u64 start,

 int remove_memory(u64 start, u64 size)
 {
-	return -EINVAL;
+	unsigned long start_pfn, end_pfn;
+	unsigned long timeout = 120 * HZ;
+	int ret;
+
+	start_pfn = start >> PAGE_SHIFT;
+	end_pfn = start_pfn + (size >> PAGE_SHIFT);
+	ret = offline_pages(start_pfn, end_pfn, timeout);
+	if (ret)
+		goto out;
+	/* we can free mem_map at this point */
+out:
+	return ret;
 }
+
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
-- Yasunori Goto
[RFC] memory hotremove patch take 2 [10/10] (retry swap-in page)
There is a race condition between swap-in and unmap_and_move(). When swap-in occurs, page_mapped might not be set yet. So unmap_and_move() gives up at once and tries again later.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

mm/migrate.c | 5 +
1 files changed, 5 insertions(+)

Index: current_test/mm/migrate.c
===
--- current_test.orig/mm/migrate.c	2007-05-08 15:08:09.0 +0900
+++ current_test/mm/migrate.c	2007-05-08 15:08:09.0 +0900
@@ -670,6 +670,11 @@ static int unmap_and_move(new_page_t get
 		/* hold this anon_vma until remove_migration_ptes() finishes */
 		anon_vma_hold(page);
 	}
+
+	if (PageSwapCache(page) && !page_mapped(page))
+		/* swapped in just now; try later */
+		goto unlock;
+
 	/*
 	 * Establish migration ptes or remove ptes
 	 */
-- Yasunori Goto
[RFC] memory hotremove patch take 2 [09/10] (direct isolation for remove)
This patch isolates the source page of migration ASAP in unmap_and_move() during memory hot-remove. The old code used just put_page(), and we expected the migrated source page to be caught in __free_one_page() as an isolated page. But it was spooled in the per-cpu pages and soon reused as the next destination page of migration. This caused an endless loop in offline_pages().

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

include/linux/page_isolation.h | 14
mm/Kconfig | 1
mm/migrate.c | 46 +++--
3 files changed, 59 insertions(+), 2 deletions(-)

Index: current_test/mm/migrate.c
===
--- current_test.orig/mm/migrate.c	2007-05-08 15:08:07.0 +0900
+++ current_test/mm/migrate.c	2007-05-08 15:08:21.0 +0900
@@ -249,6 +249,32 @@ static void remove_migration_ptes(struct
 	remove_file_migration_ptes(old, new);
 }

+
+static int
+is_page_isolated_noinfo(struct page *page)
+{
+	int ret = 0;
+	struct zone *zone;
+	unsigned long flags;
+	struct isolation_info *info;
+
+	if (unlikely(PageReserved(page) && PagePrivate(page) &&
+		     page_count(page) == 1)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->isolation_lock, flags);
+		list_for_each_entry(info, &zone->isolation_list, list) {
+			if (PageReserved(page) && PagePrivate(page) &&
+			    page_count(page) == 1 &&
+			    page->private == (unsigned long)info) {
+				ret = 1;
+				break;
+			}
+		}
+		spin_unlock_irqrestore(&zone->isolation_lock, flags);
+
+	}
+	return ret;
+}
 /*
  * Something used the pte of a page under migration. We need to
  * get to the page and wait until migration is finished.
@@ -278,7 +304,14 @@ void migration_entry_wait(struct mm_stru
 	get_page(page);
 	pte_unmap_unlock(ptep, ptl);
 	wait_on_page_locked(page);
-	put_page(page);
+
+	/*
+	 * The page might be migrated and directly isolated.
+	 * If not, then release the page.
+	 */
+	if (!is_page_isolated_noinfo(page))
+		put_page(page);
+
 	return;
 out:
 	pte_unmap_unlock(ptep, ptl);
@@ -653,6 +686,15 @@ static int unmap_and_move(new_page_t get
 		anon_vma_release(page);
 	}

+	if (rc != -EAGAIN && is_migrate_isolation(flag)) {
+		/* page must be removed sooner. */
+		list_del(&page->lru);
+		page_under_isolation(page_zone(page), page, 0);
+		__put_page(page);
+		unlock_page(page);
+		goto move_newpage;
+	}
+
 unlock:
 	unlock_page(page);
@@ -758,7 +800,7 @@ int migrate_pages_and_remove(struct list
 		new_page_t get_new_page, unsigned long private)
 {
 	return __migrate_pages(from, get_new_page, private,
-				MIGRATE_NOCONTEXT);
+				MIGRATE_NOCONTEXT | MIGRATE_ISOLATION);
 }
 #endif
Index: current_test/include/linux/page_isolation.h
===
--- current_test.orig/include/linux/page_isolation.h	2007-05-08 15:08:07.0 +0900
+++ current_test/include/linux/page_isolation.h	2007-05-08 15:08:09.0 +0900
@@ -33,12 +33,20 @@
 is_page_isolated(struct isolation_info *
 }

 #define MIGRATE_NOCONTEXT	0x1
+#define MIGRATE_ISOLATION	0x2
+
 static inline int
 is_migrate_nocontext(int flag)
 {
 	return (flag & MIGRATE_NOCONTEXT) == MIGRATE_NOCONTEXT;
 }

+static inline int
+is_migrate_isolation(int flag)
+{
+	return (flag & MIGRATE_ISOLATION) == MIGRATE_ISOLATION;
+}
+
 extern struct isolation_info *
 register_isolation(unsigned long start, unsigned long end);
@@ -64,5 +72,11 @@
 is_migrate_nocontext(int flag)
 {
 	return 0;
 }

+static inline int
+is_migrate_isolation(int flag)
+{
+	return 0;
+}
+
 #endif
 #endif
Index: current_test/mm/Kconfig
===
--- current_test.orig/mm/Kconfig	2007-05-08 15:08:07.0 +0900
+++ current_test/mm/Kconfig	2007-05-08 15:08:09.0 +0900
@@ -169,6 +169,7 @@ config MIGRATION_REMOVE
 	  migration target pages. This has a small race condition.
 	  If this config is selected, some workaround to fix it is enabled.
 	  This may add a slight performance influence.
+	  In addition, pages must be isolated sooner for removal.

 config RESOURCES_64BIT
 	bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)
-- Yasunori Goto
Re: [RFC] memory hotremove patch take 2 [07/10] (delay freeing anon_vma)
Delay freeing of the anon_vma until migration finishes. We cannot trust page->mapping (of an ANON page) when page_mapcount(page) == 0, and page migration brings page_mapcount(page) to 0. So we have to guarantee by some hook that the anon_vma pointed to by page->mapping stays valid. Usual page migration guarantees this via mm->sem, but we can't take it here. So, just delay freeing the anon_vma.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

include/linux/migrate.h | 2 ++
include/linux/page_isolation.h | 14 ++
include/linux/rmap.h | 22 ++
mm/Kconfig | 12
mm/memory_hotplug.c | 4 ++--
mm/migrate.c | 37 +++++--
mm/rmap.c | 36 +++-
7 files changed, 118 insertions(+), 9 deletions(-)

Index: current_test/mm/migrate.c
===
--- current_test.orig/mm/migrate.c	2007-05-08 15:06:50.0 +0900
+++ current_test/mm/migrate.c	2007-05-08 15:08:24.0 +0900
@@ -28,6 +28,7 @@
 #include <linux/mempolicy.h>
 #include <linux/vmalloc.h>
 #include <linux/security.h>
+#include <linux/page_isolation.h>

 #include "internal.h"
@@ -607,7 +608,7 @@ static int move_to_new_page(struct page
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force)
+			struct page *page, int force, int flag)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -632,7 +633,10 @@ static int unmap_and_move(new_page_t get
 			goto unlock;
 		wait_on_page_writeback(page);
 	}
-
+	if (PageAnon(page) && is_migrate_nocontext(flag)) {
+		/* hold this anon_vma until remove_migration_ptes() finishes */
+		anon_vma_hold(page);
+	}
 	/*
 	 * Establish migration ptes or remove ptes
 	 */
@@ -640,8 +644,14 @@ static int unmap_and_move(new_page_t get
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

-	if (rc)
+	if (rc) {
 		remove_migration_ptes(page, page);
+		if (PageAnon(page) && is_migrate_nocontext(flag))
+			anon_vma_release(page);
+	} else {
+		if (PageAnon(newpage) && is_migrate_nocontext(flag))
+			anon_vma_release(page);
+	}

 unlock:
 	unlock_page(page);
@@ -686,8 +696,8 @@ move_newpage:
  *
  * Return: Number of pages not migrated or error code.
  */
-int migrate_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private)
+static int __migrate_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private, int flag)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -707,7 +717,7 @@ int migrate_pages(struct list_head *from
 			cond_resched();

 			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2);
+						page, pass > 2, flag);

 			switch(rc) {
 			case -ENOMEM:
@@ -737,6 +747,21 @@ out:
 	return nr_failed + retry;
 }

+int migrate_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private)
+{
+	return __migrate_pages(from, get_new_page, private, 0);
+}
+
+#ifdef CONFIG_MIGRATION_REMOVE
+int migrate_pages_and_remove(struct list_head *from,
+		new_page_t get_new_page, unsigned long private)
+{
+	return __migrate_pages(from, get_new_page, private,
+				MIGRATE_NOCONTEXT);
+}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
Index: current_test/include/linux/rmap.h
===
--- current_test.orig/include/linux/rmap.h	2007-05-08 15:06:49.0 +0900
+++ current_test/include/linux/rmap.h	2007-05-08 15:08:07.0 +0900
@@ -26,6 +26,9 @@ struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+#ifdef CONFIG_MIGRATION_REMOVE
+	atomic_t	hold;	/* == 0 if we can free this immediately */
+#endif
 };

 #ifdef CONFIG_MMU
@@ -37,10 +40,14 @@ static inline struct anon_vma *anon_vma_
 	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 }

+#ifndef CONFIG_MIGRATION_REMOVE
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
+#else
+extern void anon_vma_free(struct anon_vma *anon_vma);
+#endif

 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
@@ -75,6 +82,21 @@
 void page_add_file_rmap(struct page *);
 void