Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 16:55), Glauber Costa wrote:
> On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
>> (2013/01/10 16:14), Glauber Costa wrote:
>>> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>>>> Note: if the memory provided by the memory device is used by the
>>>>>> kernel, it can't be offlined. It is not a bug.
>>>>>
>>>>> Right. But how often does this happen in testing? In other words,
>>>>> please provide an overall description of how well memory hot-remove
>>>>> is presently operating. Is it reliable? What is the success rate in
>>>>> real-world situations?
>>>>
>>>> We test the hot-remove functionality mostly with movable_online used.
>>>> And the memory used by kernel is not allowed to be removed.
>>>
>>> Can you try doing this using cpusets configured to hardwall?
>>> It is my understanding that the object allocators will try hard not to
>>> allocate anything outside the walls defined by cpuset. Which means that
>>> if you have one process per node, and they are hardwalled, your kernel
>>> memory will be spread evenly among the machine. With a big enough load,
>>> they should eventually be present in all blocks.
>>
>> I'm sorry I couldn't catch your point.
>> Do you want to confirm whether cpuset can work well enough instead of
>> ZONE_MOVABLE? Or do you want to confirm whether ZONE_MOVABLE will not
>> work if it's used with cpuset?
>
> No, I am not proposing to use cpuset to tackle the problem. I am just
> wondering if you would still have high success rates with cpusets in use
> with hardwalls. This is just one example of a workload that would spread
> kernel memory around quite heavily.
>
> So this is just me trying to understand the limitations of the mechanism.

Hm, okay. In my understanding, if the whole memory of a node is configured as
MOVABLE, no kernel memory will be allocated in the node because the zonelist
will not match. So, if cpuset is used with hardwalls, the user will see -ENOMEM
or OOM, I guess. Even fork() will fail if fallback-to-other-node is not allowed.

If it's configured as ZONE_NORMAL, you need to pray for offlining memory.

AFAIK, IBM's ppc? has a 16MB section size. So, some sections can be offlined
even if they are configured as ZONE_NORMAL. For them, placement of offlined
memory is not important because it's virtualized by LPAR; they don't try to
remove a DIMM, they just want to increase/decrease the amount of memory.
It's another approach.

But here, we (fujitsu) try to remove a system board/DIMM.
So we configure the whole memory of a node as ZONE_MOVABLE and try to
guarantee that the DIMM is removable.

>> IMHO, I don't think shrink_slab() can kill all objects in a node even
>> if they are some caches. We need more study for doing that.
>
> Indeed, shrink_slab can only kill cached objects. They, however, are
> usually a very big part of kernel memory. I wonder though if in case of
> failure, it is worth it to try at least one shrink pass before you give up.

Yeah, now, his (our) approach is to never allow kernel memory on a node that
is to be hot-removed, by using ZONE_MOVABLE. So shrink_slab()'s effect will
not be seen.

If other brave guys try to use ZONE_NORMAL for a hot-pluggable DIMM, I see,
it's worth trying.

How about checking whether the target memsection is in NORMAL or in MOVABLE
at hot-remove? If NORMAL, shrink_slab() will be worth calling.

BTW, is shrink_slab() node/zone aware now? If not, fixing that first would be
the better direction, I guess.

Thanks,
-Kame
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 17:36), Glauber Costa wrote:
>> BTW, is shrink_slab() node/zone aware now? If not, fixing that first
>> would be the better direction, I guess.
>
> It is not upstream, but there are patches for this that I am already
> using in my private tree.

Oh, I see. If it's merged, it's worth adding a shrink_slab() call to the
ZONE_NORMAL path.

Thanks,
-Kame
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 16:14), Glauber Costa wrote:
> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>> Note: if the memory provided by the memory device is used by the
>>>> kernel, it can't be offlined. It is not a bug.
>>>
>>> Right. But how often does this happen in testing? In other words,
>>> please provide an overall description of how well memory hot-remove
>>> is presently operating. Is it reliable? What is the success rate in
>>> real-world situations?
>>
>> We test the hot-remove functionality mostly with movable_online used.
>> And the memory used by kernel is not allowed to be removed.
>
> Can you try doing this using cpusets configured to hardwall?
> It is my understanding that the object allocators will try hard not to
> allocate anything outside the walls defined by cpuset. Which means that
> if you have one process per node, and they are hardwalled, your kernel
> memory will be spread evenly among the machine. With a big enough load,
> they should eventually be present in all blocks.

I'm sorry I couldn't catch your point.
Do you want to confirm whether cpuset can work well enough instead of
ZONE_MOVABLE? Or do you want to confirm whether ZONE_MOVABLE will not work
if it's used with cpuset?

> Another question I have for you: Have you considered calling shrink_slab
> to try to deplete the caches and therefore free at least slab memory in
> the nodes that can't be offlined? Is it relevant?

At this stage, we don't consider calling shrink_slab(). We require nearly
100% success at offlining memory for removing a DIMM. That's my understanding.

IMHO, I don't think shrink_slab() can kill all objects in a node even
if they are some caches. We need more study for doing that.

Thanks,
-Kame
Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
(2012/12/30 15:02), Wen Congyang wrote:
> At 12/28/2012 08:28 AM, Kamezawa Hiroyuki Wrote:
>> (2012/12/27 21:16), Wen Congyang wrote:
>>> At 12/26/2012 11:55 AM, Kamezawa Hiroyuki Wrote:
>>>> (2012/12/24 21:09), Tang Chen wrote:
>>>>> From: Wen Congyang we...@cn.fujitsu.com
>>>>>
>>>>> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
>>>>> should free it when removing a node.
>>>>>
>>>>> Signed-off-by: Wen Congyang we...@cn.fujitsu.com
>>>>
>>>> I'm sorry, but is it safe to remove pgdat? Are all zone caches and
>>>> zonelists properly cleared/rebuilt in a synchronous way? And are no
>>>> threads visiting the zone in vmscan.c?
>>>
>>> We have rebuilt zonelists when a zone has no memory after offlining
>>> some pages.
>>
>> How do you guarantee that the address of pgdat/zone is not on the stack
>> of any kernel thread or in other kernel objects, without reference
>> counting or another syncing method?
>
> No way to guarantee this. But the kernel should not use the address of
> pgdat/zone when it is offlined.
>
> Hmm, what about this: reuse the memory when the node is onlined again?

That's the only way we can go now. Please don't free it.

Thanks,
-Kame
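The "keep it and reuse it" idea at the end of this exchange can be sketched in userspace. This is only a toy model under stated assumptions: `struct node_data`, `node_table`, `node_online()` and `node_offline()` are hypothetical names invented for illustration, not kernel interfaces, and the real pgdat handling is far more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 4

/* Hypothetical stand-in for pg_data_t; not the kernel structure. */
struct node_data {
	int nid;
	int online;
};

/* One slot per node id; a slot is filled once and never freed. */
static struct node_data *node_table[MAX_NODES];

/* Bring a node online, reusing the descriptor from an earlier offline. */
struct node_data *node_online(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);

	struct node_data *nd = node_table[nid];
	if (!nd) {
		nd = calloc(1, sizeof(*nd));
		if (!nd)
			return NULL;
		nd->nid = nid;
		node_table[nid] = nd;
	}
	nd->online = 1;
	return nd;
}

/*
 * Mark the node offline but keep its descriptor allocated, so a pointer
 * still held elsewhere never refers to freed memory.
 */
void node_offline(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (node_table[nid])
		node_table[nid]->online = 0;
}
```

The trade-off in this sketch is a small amount of memory that stays allocated per node ever onlined, in exchange for not needing reference counting on the descriptor.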
Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
(2012/12/27 21:16), Wen Congyang wrote:
> At 12/26/2012 11:55 AM, Kamezawa Hiroyuki Wrote:
>> (2012/12/24 21:09), Tang Chen wrote:
>>> From: Wen Congyang we...@cn.fujitsu.com
>>>
>>> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
>>> should free it when removing a node.
>>>
>>> Signed-off-by: Wen Congyang we...@cn.fujitsu.com
>>
>> I'm sorry, but is it safe to remove pgdat? Are all zone caches and
>> zonelists properly cleared/rebuilt in a synchronous way? And are no
>> threads visiting the zone in vmscan.c?
>
> We have rebuilt zonelists when a zone has no memory after offlining
> some pages.

How do you guarantee that the address of pgdat/zone is not on the stack of
any kernel thread or in other kernel objects, without reference counting or
another syncing method?

Thanks,
-Kame
Re: [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
(2012/12/24 21:09), Tang Chen wrote:
> From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
>
> We remove the memory like this:
> 1. lock memory hotplug
> 2. offline a memory block
> 3. unlock memory hotplug
> 4. repeat 1-3 to offline all memory blocks
> 5. lock memory hotplug
> 6. remove memory(TODO)
> 7. unlock memory hotplug
>
> All memory blocks must be offlined before removing memory. But we don't hold
> the lock in the whole operation. So we should check whether all memory blocks
> are offlined before step6. Otherwise, the kernel may panic.
>
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com
> Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

a nitpick below.

> ---
>  drivers/base/memory.c          |  6 +
>  include/linux/memory_hotplug.h |  1 +
>  mm/memory_hotplug.c            | 47
>  3 files changed, 54 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 987604d..8300a18 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -693,6 +693,12 @@ int offline_memory_block(struct memory_block *mem)
>  	return ret;
>  }
>
> +/* return true if the memory block is offlined, otherwise, return false */
> +bool is_memblock_offlined(struct memory_block *mem)
> +{
> +	return mem->state == MEM_OFFLINE;
> +}
> +
>  /*
>   * Initialize the sysfs support for memory devices...
>   */
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 4a45c4e..8dd0950 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -247,6 +247,7 @@ extern int add_memory(int nid, u64 start, u64 size);
>  extern int arch_add_memory(int nid, u64 start, u64 size);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern int offline_memory_block(struct memory_block *mem);
> +extern bool is_memblock_offlined(struct memory_block *mem);
>  extern int remove_memory(u64 start, u64 size);
>  extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>  				  int nr_pages);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 62e04c9..d43d97b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1430,6 +1430,53 @@ repeat:
>  		goto repeat;
>  	}
>
> +	lock_memory_hotplug();
> +
> +	/*
> +	 * we have offlined all memory blocks like this:
> +	 *   1. lock memory hotplug
> +	 *   2. offline a memory block
> +	 *   3. unlock memory hotplug
> +	 *
> +	 * repeat step1-3 to offline the memory block. All memory blocks
> +	 * must be offlined before removing memory. But we don't hold the
> +	 * lock in the whole operation. So we should check whether all
> +	 * memory blocks are offlined.
> +	 */
> +
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {

I prefer adding mem = NULL at the start of this for().

> +		section_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(section_nr))
> +			continue;
> +
> +		section = __nr_to_section(section_nr);
> +		/* same memblock? */
> +		if (mem)
> +			if ((section_nr >= mem->start_section_nr) &&
> +			    (section_nr <= mem->end_section_nr))
> +				continue;
> +

Thanks,
-Kame
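The check added before step 6 can be modelled outside the kernel. The sketch below assumes a simplified, hypothetical `struct mem_block` and a linear `find_block()` lookup; neither is the kernel's `struct memory_block` or `find_memory_block_hinted()`. It also resets the cached block pointer before the walk, as the nitpick above suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum block_state { BLOCK_ONLINE, BLOCK_OFFLINE };

/* Hypothetical, simplified memory block: an inclusive range of sections. */
struct mem_block {
	unsigned long start_section_nr;
	unsigned long end_section_nr;
	enum block_state state;
};

/* Return the block that contains section_nr, or NULL for a hole. */
static const struct mem_block *find_block(const struct mem_block *blocks,
					  size_t n, unsigned long section_nr)
{
	for (size_t i = 0; i < n; i++)
		if (section_nr >= blocks[i].start_section_nr &&
		    section_nr <= blocks[i].end_section_nr)
			return &blocks[i];
	return NULL;
}

/* True only if every block overlapping [start_sec, end_sec) is offline. */
bool all_blocks_offlined(const struct mem_block *blocks, size_t n,
			 unsigned long start_sec, unsigned long end_sec)
{
	const struct mem_block *mem = NULL;	/* reset before the walk */

	assert(start_sec <= end_sec);
	for (unsigned long sec = start_sec; sec < end_sec; sec++) {
		/* same block as the previous section? then already checked */
		if (mem && sec >= mem->start_section_nr &&
		    sec <= mem->end_section_nr)
			continue;

		mem = find_block(blocks, n, sec);
		if (!mem)
			continue;	/* hole: nothing to check */

		if (mem->state != BLOCK_OFFLINE)
			return false;
	}
	return true;
}
```

In the real patch this walk runs with the hotplug lock held, which is what makes the answer stable until the removal step; the model has no locking at all.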
Re: [PATCH v5 03/14] memory-hotplug: remove redundant codes
(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang we...@cn.fujitsu.com
>
> Offlining memory blocks and checking whether memory blocks are offlined
> are very similar. This patch introduces a new function to remove
> redundant code.
>
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com
> ---
>  mm/memory_hotplug.c | 101 ---
>  1 files changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d43d97b..dbb04d8 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1381,20 +1381,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
>  }
>
> -int remove_memory(u64 start, u64 size)

Please add an explanation of this function here. If (*func) returns a value
other than 0, this function will fail and return the callback's return
value...right?

> +static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> +		void *arg, int (*func)(struct memory_block *, void *))
>  {
>  	struct memory_block *mem = NULL;
>  	struct mem_section *section;
> -	unsigned long start_pfn, end_pfn;
>  	unsigned long pfn, section_nr;
>  	int ret;
> -	int return_on_error = 0;
> -	int retry = 0;
> -
> -	start_pfn = PFN_DOWN(start);
> -	end_pfn = start_pfn + PFN_DOWN(size);
>
> -repeat:

Shouldn't we check that the lock is held here?
(VM_BUG_ON(!mutex_is_locked(&mem_hotplug_mutex));)

>  	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>  		section_nr = pfn_to_section_nr(pfn);
>  		if (!present_section_nr(section_nr))
> @@ -1411,22 +1405,61 @@ repeat:
>  		if (!mem)
>  			continue;
>
> -		ret = offline_memory_block(mem);
> +		ret = func(mem, arg);
>  		if (ret) {
> -			if (return_on_error) {
> -				kobject_put(&mem->dev.kobj);
> -				return ret;
> -			} else {
> -				retry = 1;
> -			}
> +			kobject_put(&mem->dev.kobj);
> +			return ret;
>  		}
>  	}
>
>  	if (mem)
>  		kobject_put(&mem->dev.kobj);
>
> -	if (retry) {
> -		return_on_error = 1;
> +	return 0;
> +}
> +
> +static int offline_memory_block_cb(struct memory_block *mem, void *arg)
> +{
> +	int *ret = arg;
> +	int error = offline_memory_block(mem);
> +
> +	if (error != 0 && *ret == 0)
> +		*ret = error;
> +
> +	return 0;

This always returns 0 and runs through all mem blocks for scan-and-retry,
right? You need an explanation here!

> +}
> +
> +static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
> +{
> +	int ret = !is_memblock_offlined(mem);
> +
> +	if (unlikely(ret))
> +		pr_warn("removing memory fails, because memory "
> +			"[%#010llx-%#010llx] is onlined\n",
> +			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> +			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
> +
> +	return ret;
> +}
> +
> +int remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int ret = 0;
> +	int retry = 1;
> +
> +	start_pfn = PFN_DOWN(start);
> +	end_pfn = start_pfn + PFN_DOWN(size);
> +
> +repeat:

Please explain why you repeat here.

> +	walk_memory_range(start_pfn, end_pfn, &ret,
> +			  offline_memory_block_cb);
> +	if (ret) {
> +		if (!retry)
> +			return ret;
> +
> +		retry = 0;
> +		ret = 0;
>  		goto repeat;
>  	}
>
> @@ -1444,37 +1477,13 @@ repeat:
>  	 * memory blocks are offlined.
>  	 */
>
> -	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> -		section_nr = pfn_to_section_nr(pfn);
> -		if (!present_section_nr(section_nr))
> -			continue;
> -
> -		section = __nr_to_section(section_nr);
> -		/* same memblock? */
> -		if (mem)
> -			if ((section_nr >= mem->start_section_nr) &&
> -			    (section_nr <= mem->end_section_nr))
> -				continue;
> -
> -		mem = find_memory_block_hinted(section, mem);
> -		if (!mem)
> -			continue;
> -
> -		ret = is_memblock_offlined(mem);
> -		if (!ret) {
> -			pr_warn("removing memory fails, because memory "
> -				"[%#010llx-%#010llx] is onlined\n",
> -				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> -				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
> -
> -			kobject_put(&mem->dev.kobj);
> -			unlock_memory_hotplug();
> -			return ret;
> -		}

please explain
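The callback contract asked about in this review (a non-zero return stops the walk and is passed back to the caller) can be shown with a small userspace model. All names here (`struct mem_block`, `walk_blocks`, `offline_block_cb`, `is_block_online_cb`) are hypothetical stand-ins for the patch's functions, and the `offline_error` field simply fakes the outcome of an offline attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block: offline_error is what an offline attempt would return. */
struct mem_block {
	int id;
	int offline_error;	/* 0 means offlining succeeds */
	int offlined;
};

typedef int (*block_fn)(struct mem_block *, void *);

/*
 * Call func on every block in order. The first non-zero return value
 * stops the walk and is returned to the caller; otherwise return 0.
 */
int walk_blocks(struct mem_block *blocks, size_t n, void *arg, block_fn func)
{
	assert(func != NULL);
	for (size_t i = 0; i < n; i++) {
		int ret = func(&blocks[i], arg);
		if (ret)
			return ret;
	}
	return 0;
}

/*
 * Try to offline one block. The first error is remembered in *arg, but the
 * callback itself returns 0 so that the walk still visits every block.
 */
int offline_block_cb(struct mem_block *mem, void *arg)
{
	int *first_error = arg;

	if (mem->offline_error == 0)
		mem->offlined = 1;
	else if (*first_error == 0)
		*first_error = mem->offline_error;
	return 0;
}

/* Return non-zero, which stops the walk, if the block is still online. */
int is_block_online_cb(struct mem_block *mem, void *arg)
{
	(void)arg;
	return !mem->offlined;
}
```

The two callbacks use the contract in opposite ways: the offline callback deliberately hides errors from the walker so every block gets an attempt, while the check callback uses the early exit to fail fast.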
Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs
(2012/12/24 21:09), Tang Chen wrote:
> From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
>
> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
> sysfs files are created. But there is no code to remove these files. The patch
> implements the function to remove them.
>
> Note: The code does not free firmware_map_entry which is allocated by bootmem.
>       So the patch makes memory leak. But I think the memory leak size is
>       very small. And it does not affect the system.
>
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com
> Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
> ---
>  drivers/firmware/memmap.c    | 98 +-
>  include/linux/firmware-map.h |  6 +++
>  mm/memory_hotplug.c          |  5 ++-
>  3 files changed, 106 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
> index 90723e6..49be12a 100644
> --- a/drivers/firmware/memmap.c
> +++ b/drivers/firmware/memmap.c
> @@ -21,6 +21,7 @@
>  #include <linux/types.h>
>  #include <linux/bootmem.h>
>  #include <linux/slab.h>
> +#include <linux/mm.h>
>
>  /*
>   * Data types --
> @@ -41,6 +42,7 @@ struct firmware_map_entry {
>  	const char		*type;	/* type of the memory range */
>  	struct list_head	list;	/* entry for the linked list */
>  	struct kobject		kobj;   /* kobject for each entry */
> +	unsigned int		bootmem:1; /* allocated from bootmem */
>  };

Can't we detect where the object was allocated from, slab or bootmem?

Hm, for example,

    PageReserved(virt_to_page(address_of_obj)) ?
    PageSlab(virt_to_page(address_of_obj)) ?

Thanks,
-Kame
Re: [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang we...@cn.fujitsu.com
>
> For removing memory, we need to remove page table. But it depends
> on architecture. So the patch introduces arch_remove_memory() for
> removing page table. Now it only calls __remove_pages().
>
> Note: __remove_pages() for some architectures is not implemented
>       (I don't know how to implement it for s390).
>
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com

Then, the remove code will be symmetric to the add code.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang we...@cn.fujitsu.com
>
> memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
>
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages. When we online memory8, the memory stored page cgroup
> is not provided by this memory device. But when we online memory9, the memory
> stored page cgroup may be provided by memory8. So we can't offline memory8
> now. We should offline the memory in the reversed order.

If memory8 is onlined as NORMAL memory ...right?
IIUC, vmalloc() uses __GFP_HIGHMEM but doesn't use __GFP_MOVABLE.

> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail. In such case, iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
>
> This idea is suggested by KOSAKI Motohiro.
>
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com

I'm not sure, but shouldn't the whole DIMM be onlined as MOVABLE mem?

Anyway, I agree this kind of retry is required if memory is onlined as NORMAL mem.
But is retrying once enough?

Thanks,
-Kame

> ---
>  mm/memory_hotplug.c | 16 ++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d04ed87..62e04c9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
>  	unsigned long start_pfn, end_pfn;
>  	unsigned long pfn, section_nr;
>  	int ret;
> +	int return_on_error = 0;
> +	int retry = 0;
>
>  	start_pfn = PFN_DOWN(start);
>  	end_pfn = start_pfn + PFN_DOWN(size);
>
> +repeat:
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>  		section_nr = pfn_to_section_nr(pfn);
>  		if (!present_section_nr(section_nr))
> @@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
>
>  		ret = offline_memory_block(mem);
>  		if (ret) {
> -			kobject_put(&mem->dev.kobj);
> -			return ret;
> +			if (return_on_error) {
> +				kobject_put(&mem->dev.kobj);
> +				return ret;
> +			} else {
> +				retry = 1;
> +			}
>  		}
>  	}
>
>  	if (mem)
>  		kobject_put(&mem->dev.kobj);
>
> +	if (retry) {
> +		return_on_error = 1;
> +		goto repeat;
> +	}
> +
>  	return 0;
>  }
>  #else
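A toy simulation may make the two-pass idea, and the "is retrying once enough?" question, more concrete. It rests on an assumed dependency model that is not taken from the patch: each block's page_cgroup metadata lives either outside the device (`-1`) or in exactly one other block (`meta_home`), and a block cannot be offlined while an online block keeps metadata in it. `struct device_blocks`, `offline_device()` and `ERR_BUSY` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS_MAX 16
#define ERR_BUSY (-16)

/* Hypothetical model of the blocks provided by one memory device. */
struct device_blocks {
	size_t n;
	bool online[NBLOCKS_MAX];
	/* meta_home[i]: index of the block holding block i's metadata, or -1 */
	int meta_home[NBLOCKS_MAX];
};

/* Offline block j unless an online block still keeps its metadata there. */
static int offline_one(struct device_blocks *d, size_t j)
{
	if (!d->online[j])
		return 0;
	for (size_t i = 0; i < d->n; i++)
		if (i != j && d->online[i] && d->meta_home[i] == (int)j)
			return ERR_BUSY;
	d->online[j] = false;
	return 0;
}

/* Try every block in ascending order; report the first error, if any. */
static int offline_pass(struct device_blocks *d)
{
	int first_error = 0;

	for (size_t j = 0; j < d->n; j++) {
		int err = offline_one(d, j);
		if (err && !first_error)
			first_error = err;
	}
	return first_error;
}

/* First pass, then a single retry pass if anything failed. */
int offline_device(struct device_blocks *d)
{
	assert(d->n <= NBLOCKS_MAX);

	int err = offline_pass(d);
	if (err)
		err = offline_pass(d);
	return err;
}
```

With one primary block holding everyone's metadata, the second pass succeeds. With a chain of dependencies (block 2 depends on block 1, which depends on block 0) walked in ascending order, two passes are not enough in this model, which is one way to read the review question above.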
Re: [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
(2012/12/24 21:09), Tang Chen wrote:
> In __remove_section(), we locked pgdat_resize_lock when calling
> sparse_remove_one_section(). This lock will disable irq. But we don't need
> to lock the whole function. If we do some work to free pagetables in
> free_section_usemap(), we need to call flush_tlb_all(), which needs
> irq enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many()
> will be triggered.
>
> Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
> Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com

If this is a bug fix, a call trace in your log and BUGFIX or -fix- in the
patch title would be appreciated, I think.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang we...@cn.fujitsu.com
>
> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
> should free it when removing a node.
>
> Signed-off-by: Wen Congyang we...@cn.fujitsu.com

I'm sorry, but is it safe to remove pgdat? Are all zone caches and zonelists
properly cleared/rebuilt in a synchronous way? And are no threads visiting
the zone in vmscan.c?

Thanks,
-Kame

> ---
>  mm/memory_hotplug.c | 20 +++-
>  1 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8a1d2f..447fa24 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1680,9 +1680,12 @@ static int check_cpu_on_node(void *data)
>  /* offline the node if all memory sections of this node are removed */
>  static void try_offline_node(int nid)
>  {
> +	pg_data_t *pgdat = NODE_DATA(nid);
>  	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> -	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> +	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>  	unsigned long pfn;
> +	struct page *pgdat_page = virt_to_page(pgdat);
> +	int i;
>
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>  		unsigned long section_nr = pfn_to_section_nr(pfn);
> @@ -1709,6 +1712,21 @@ static void try_offline_node(int nid)
>  	 */
>  	node_set_offline(nid);
>  	unregister_one_node(nid);
> +
> +	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
> +		/* node data is allocated from boot memory */
> +		return;
> +
> +	/* free waittable in each zone */
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (zone->wait_table)
> +			vfree(zone->wait_table);
> +	}
> +
> +	arch_refresh_nodedata(nid, NULL);
> +	arch_free_nodedata(pgdat);
>  }
>
>  int __ref remove_memory(int nid, u64 start, u64 size)
Re: [patch 4/4] mm, oom: remove statically defined arch functions of same name
(2012/11/14 18:15), David Rientjes wrote:
> out_of_memory() is a globally defined function to call the oom killer.
> x86, sh, and powerpc all use a function of the same name within file
> scope in their respective fault.c unnecessarily. Inline the functions
> into the pagefault handlers to clean the code up.
>
> Cc: Ingo Molnar mi...@redhat.com
> Cc: H. Peter Anvin h...@zytor.com
> Cc: Thomas Gleixner t...@linutronix.de
> Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
> Cc: Paul Mackerras pau...@samba.org
> Cc: Paul Mundt let...@linux-sh.org
> Signed-off-by: David Rientjes rient...@google.com

I think this is good.

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: linux-next: build failure after merge of the final tree (Linus' tree related)
On Fri, 17 Jun 2011 15:38:09 +1000
Stephen Rothwell s...@canb.auug.org.au wrote:

> Hi all,
>
> After merging the final tree, today's linux-next build (powerpc
> allyesconfig) failed like this:
>
> mm/page_cgroup.c: In function 'page_cgroup_init':
> mm/page_cgroup.c:309:13: error: 'pg_data_t' has no member named 'node_end_pfn'
>
> Caused by commit 37573e8c7182 (memcg: fix init_page_cgroup nid with
> sparsemem). On powerpc, node_end_pfn() is defined to be
> (NODE_DATA(nid)->node_end_pfn) where NODE_DATA(nid) is (node_data[nid])
> and node_data is struct pglist_data *node_data[]. As far as I can see,
> struct pglist_data has never had a member called node_end_pfn.
>
> This commit introduces the only use of node_end_pfn() in the generic
> kernel code. Presumably the powerpc definition needs to be fixed (to
> maybe something like the x86 version). It looks like the sparc version
> is broken as well.

Sorry, here is a fix I posted today, but no ack yet.

==
From 507cc95c5ba2351bff16c5421255d1395a3b555b Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Date: Thu, 16 Jun 2011 17:28:07 +0900
Subject: [PATCH] Fix node_start/end_pfn() definition for mm/page_cgroup.c

commit 21a3c96 uses node_start/end_pfn(nid) to detect the start/end
of nodes. But it's not defined in linux/mmzone.h; it is defined in
/arch/???/include/mmzone.h, which is included only under
CONFIG_NEED_MULTIPLE_NODES=y.

Then, we see
mm/page_cgroup.c: In function 'page_cgroup_init':
mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'

So, fixing page_cgroup.c is an idea...

But node_start_pfn()/node_end_pfn() is a very generic macro and
should be implemented in the same manner for all archs.
(m32r has a different implementation...)

This patch removes the definitions of node_start/end_pfn() in each arch
and defines a unified one in linux/mmzone.h. It's no longer under
CONFIG_NEED_MULTIPLE_NODES.

A result of macro expansion is here (mm/page_cgroup.c)

for !NUMA
 start_pfn = ((&contig_page_data)->node_start_pfn);
 end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data);
  __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

for NUMA (x86-64)
 start_pfn = ((node_data[nid])->node_start_pfn);
 end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]);
  __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Changelog:
 - fixed to avoid using nid twice in node_end_pfn() macro.
---
 arch/alpha/include/asm/mmzone.h   |  1 -
 arch/m32r/include/asm/mmzone.h    |  8 +---
 arch/parisc/include/asm/mmzone.h  |  7 ---
 arch/powerpc/include/asm/mmzone.h |  7 ---
 arch/sh/include/asm/mmzone.h      |  4
 arch/sparc/include/asm/mmzone.h   |  2 --
 arch/tile/include/asm/mmzone.h    | 11 ---
 arch/x86/include/asm/mmzone_32.h  | 11 ---
 arch/x86/include/asm/mmzone_64.h  |  3 ---
 include/linux/mmzone.h            |  7 +++
 10 files changed, 8 insertions(+), 53 deletions(-)

diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h
index 8af56ce..445dc42 100644
--- a/arch/alpha/include/asm/mmzone.h
+++ b/arch/alpha/include/asm/mmzone.h
@@ -56,7 +56,6 @@ PLAT_NODE_DATA_LOCALNR(unsigned long p, int n)
  * Given a kernel address, find the home node of the underlying memory.
  */
 #define kvaddr_to_nid(kaddr)	pa_to_nid(__pa(kaddr))
-#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)

 /*
  * Given a kaddr, LOCAL_BASE_ADDR finds the owning node of the memory
diff --git a/arch/m32r/include/asm/mmzone.h b/arch/m32r/include/asm/mmzone.h
index 9f3b5ac..115ced3 100644
--- a/arch/m32r/include/asm/mmzone.h
+++ b/arch/m32r/include/asm/mmzone.h
@@ -14,12 +14,6 @@ extern struct pglist_data *node_data[];
 #define NODE_DATA(nid)		(node_data[nid])

 #define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
-#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
-#define node_end_pfn(nid)						\
-({									\
-	pg_data_t *__pgdat = NODE_DATA(nid);				\
-	__pgdat->node_start_pfn + __pgdat->node_spanned_pages - 1;	\
-})

 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
 /*
@@ -44,7 +38,7 @@ static __inline__ int pfn_to_nid(unsigned long pfn)
 	int node;

 	for (node = 0 ; node < MAX_NUMNODES ; node++)
-		if (pfn >= node_start_pfn(node) && pfn <= node_end_pfn(node))
+		if (pfn >= node_start_pfn(node) && pfn < node_end_pfn(node))
 			break;

 	return node;
diff --git a/arch/parisc/include/asm/mmzone.h b/arch/parisc/include/asm/mmzone.h
index 9608d2c..e67eb9c 100644
--- a/arch/parisc/include/asm/mmzone.h
+++ b/arch/parisc/include/asm/mmzone.h
@@ -14,13 +14,6 @@ extern struct
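The m32r hunk changes the comparison because the unified node_end_pfn() is an exclusive end (start plus spanned pages), where the old m32r macro subtracted one and compared inclusively. A minimal userspace sketch of that half-open convention follows; `struct pgdat_model` and the `model_*` helpers are hypothetical, and unlike the kernel's pfn_to_nid() shown in the diff, this lookup returns -1 when no node matches.

```c
#include <assert.h>

/* Hypothetical stand-in for the two pg_data_t fields used here. */
struct pgdat_model {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
};

static inline unsigned long model_node_start_pfn(const struct pgdat_model *p)
{
	return p->node_start_pfn;
}

/* Exclusive end: one past the last pfn spanned by the node. */
static inline unsigned long model_node_end_pfn(const struct pgdat_model *p)
{
	return p->node_start_pfn + p->node_spanned_pages;
}

/* Find the node whose half-open range [start, end) contains pfn. */
int model_pfn_to_nid(const struct pgdat_model *nodes, int nr_nodes,
		     unsigned long pfn)
{
	assert(nr_nodes >= 0);
	for (int node = 0; node < nr_nodes; node++)
		if (pfn >= model_node_start_pfn(&nodes[node]) &&
		    pfn < model_node_end_pfn(&nodes[node]))
			return node;
	return -1;	/* assumption of this sketch: -1 means "not found" */
}
```

With an exclusive end, adjacent nodes share a boundary value without overlapping: the end of one node equals the start of the next, and the boundary pfn belongs only to the latter.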
Re: [linux-2.6.36-git7: Power7] LTP Memory CGROUP Controller functional test creates Backtrace, OOMKill rcu_sched_state detected stall jiffies
On Tue, 26 Oct 2010 16:03:56 +0530
Subrata Modak subr...@linux.vnet.ibm.com wrote:

> If you run LTP Memory CGROUP Controller functional test on
> linux-2.6.36-git7, the following Backtrace, OOMKill and
> rcu_sched_state detected stall jiffies are created. The machine is not
> reachable thereafter. Ways to reproduce this problem:
>
> 1) Build and boot kernel 2.6.36-git7 on Power7 machine with attached
>    config file,
> 2) Fetch, build and install LTP:
>    git clone git://ltp.git.sourceforge.net/gitroot/ltp/ltp
>    cd ltp
>    ./configure
>    make
>    make install
> 3) Create a LTP runtest file /opt/ltp/runtest/memcg_function_test with
>    the following entry:
>    memcg_function memcg_function_test.sh
>    EOF
>    cd /opt/ltp
>    ./runltp -f memcg_function_test

IIUC, the memcg test includes an intentional OOM-kill test that sets the
limit to 0. And it has another test that sets the limit to PAGE_SIZE. In your
environment, I think the page size is 64kb...right?

About rcu_sched_state(), I have no idea at this stage. I reviewed
memcontrol.c and oom_kill.c again and couldn't find anything in a quick
review.

Could you try again after -rc1 has shipped? I think Andrew Morton sent some
amount of updates for oom_kill, memcg and vmscan to Linus today.

Thanks,
-Kame
Re: [PATCH 3/9] v3 Add section count to memory_block struct
On Fri, 01 Oct 2010 13:30:40 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> Add a section count property to the memory_block struct to track the number
> of memory sections that have been added/removed from a memory block. This
> allows us to know when the last memory section of a memory block has been
> removed so we can remove the memory block.
>
> Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

a nitpick,

> Index: linux-next/include/linux/memory.h
> ===
> --- linux-next.orig/include/linux/memory.h	2010-09-29 14:56:29.0 -0500
> +++ linux-next/include/linux/memory.h	2010-09-30 14:13:50.0 -0500
> @@ -23,6 +23,8 @@ struct memory_block {
>  	unsigned long phys_index;
>  	unsigned long state;
> +	int section_count;

I prefer
	int section_count; /* updated under mutex */
or something similar for this kind of non-atomic counter. But it's a nitpick.
Re: [PATCH 4/9] v3 Allow memory blocks to span multiple memory sections
On Fri, 01 Oct 2010 14:00:50 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> Update the memory sysfs code such that each sysfs memory directory is now
> considered a memory block that can span multiple memory sections per
> memory block. The default size of each memory block is SECTION_SIZE_BITS
> to maintain the current behavior of having a single memory section per
> memory block (i.e. one sysfs directory per memory section).
>
> For architectures that want to have memory blocks span multiple
> memory sections they need only define their own memory_block_size_bytes()
> routine.

This should be commented in the code before the MEMORY_BLOCK_SIZE declaration.

> Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 5/9] v3 rename phys_index properties of memory block struct
On Fri, 01 Oct 2010 13:33:38 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> Update the 'phys_index' property of the memory_block struct to be
> called start_section_nr, and add an end_section_nr property. The
> data tracked here is the same but the updated naming is more in line
> with what is stored here, namely the first and last section number
> that the memory block spans.
>
> The names presented to userspace remain the same, phys_index for
> start_section_nr and end_phys_index for end_section_nr, to avoid breaking
> anything in userspace.
>
> Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 6/9] v3 Update node sysfs code
On Fri, 01 Oct 2010 13:34:34 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: Update the node sysfs code to be aware of the new capability for a memory block to contain multiple memory sections and be aware of the memory block structure name changes (start_section_nr). This requires an additional parameter to unregister_mem_sect_under_nodes so that we know which memory section of the memory block to unregister. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 9/9] v3 Update memory hotplug documentation
On Fri, 01 Oct 2010 13:37:49 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the memory hotplug documentation to reflect the new behaviors of memory blocks reflected in sysfs.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Thank you for your patient work!

---
 Documentation/memory-hotplug.txt | 47 +--
 1 file changed, 31 insertions(+), 16 deletions(-)

Index: linux-next/Documentation/memory-hotplug.txt
===
--- linux-next.orig/Documentation/memory-hotplug.txt 2010-09-29 14:56:24.0 -0500
+++ linux-next/Documentation/memory-hotplug.txt 2010-09-30 14:59:47.0 -0500
@@ -126,36 +126,51 @@
 4 sysfs files for memory hotplug
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs. Each section is part of
+a memory block under /sys/devices/system/memory as

 /sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)

-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block. The files 'phys_index' and
+'end_phys_index' under each directory report the beginning and end section id's
+for the memory block covered by the sysfs directory. It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.

 For example, assume 1GiB section size. A device for a memory starting at
 0x100000000 is /sys/device/system/memory/memory4
 (0x100000000 / 1Gib = 4)
 This device covers address range [0x100000000 ... 0x140000000)

-Under each section, you can see 4 files.
+Under each section, you can see 4 or 5 files, the end_phys_index file being
+a recent addition and not present on older kernels.

-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable

-'phys_index' : read-only and contains section id, same as XXX.
-'state'      : read-write
-               at read: contains online/offline state of memory.
-               at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
-               This is not well implemented now.
-'removable'  : read-only: contains an integer value indicating
-               whether the memory section is removable or not
-               removable. A value of 1 indicates that the memory
-               section is removable and a value of 0 indicates that
-               it is not removable.
+'phys_index'     : read-only and contains section id of the first section
+                   in the memory block, same as XXX.
+'end_phys_index' : read-only and contains section id of the last section
+                   in the memory block.
+'state'          : read-write
+                   at read: contains online/offline state of memory.
+                   at write: user can specify "online", "offline" command
+                   which will be performed on al sections in the block.
+'phys_device'    : read-only: designed to show the name of physical memory
+                   device. This is not well implemented now.
+'removable'      : read-only: contains an integer value indicating
+                   whether the memory block is removable or not
+                   removable. A value of 1 indicates that the memory
+                   block is removable and a value of 0 indicates that
+                   it is not removable. A memory block is removable only if
+                   every section in the block is removable.

 NOTE: These directories/files appear after physical memory hotplug phase.
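The arithmetic behind the memoryXXX directory name in the documentation's example can be checked with a small userspace sketch. It assumes the 1 GiB section size used in that example and reads the example's start address as 4 GiB (0x100000000), which is what the stated result of 4 implies; the `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 1 GiB sections, as assumed in the documentation's example. */
#define DEMO_SECTION_SIZE (UINT64_C(1) << 30)

/* memoryXXX id for the section that contains phys_addr. */
static uint64_t demo_section_id(uint64_t phys_addr)
{
	return phys_addr / DEMO_SECTION_SIZE;
}

/* First byte address covered by the section with the given id. */
static uint64_t demo_section_start(uint64_t id)
{
	return id * DEMO_SECTION_SIZE;
}
```

The half-open range covered by section 4 then runs from demo_section_start(4) up to, but not including, demo_section_start(5).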
Re: [PATCH 1/9] v3 Move find_memory_block routine
On Fri, 01 Oct 2010 13:28:39 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Move the find_memory_block() routine up to avoid needing a forward declaration in subsequent patches.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 2/9] v3 Add mutex for adding/removing memory blocks
On Fri, 01 Oct 2010 13:29:42 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Add a new mutex for use in adding and removing of memory blocks. This is needed to avoid any race conditions in which the same memory block could be added and removed at the same time.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 1/9] v4 Move the find_memory_block() routine up
On Tue, 03 Aug 2010 08:36:39 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: Move the find_memory_block() routine up to avoid needing a forward declaration in subsequent patches. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 2/9] v4 Add new phys_index properties
On Tue, 03 Aug 2010 08:37:31 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the 'phys_index' properties of a memory block to include a 'start_phys_index' which is the same as the current 'phys_index' property. The property still appears as 'phys_index' in sysfs but the memory_block struct name is updated to indicate the start and end values. This also adds an 'end_phys_index' property to indicate the id of the last section in the memory block.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

A nitpick: after this patch, end_phys_index is added but contains 0. It would be better for it to contain the same value as phys_index. But OK, the following patch will fix it.

Thanks, -Kame

---
 drivers/base/memory.c | 28 include/linux/memory.h |3 ++-
 2 files changed, 22 insertions(+), 9 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-08-02 13:32:21.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-08-02 13:33:27.0 -0500
@@ -109,12 +109,20 @@ unregister_memory(struct memory_block *m
  * uses.
  */

-static ssize_t show_mem_phys_index(struct sys_device *dev,
+static ssize_t show_mem_start_phys_index(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index);
+	return sprintf(buf, "%08lx\n", mem->start_phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+			struct sysdev_attribute *attr, char *buf)
+{
+	struct memory_block *mem =
+		container_of(dev, struct memory_block, sysdev);
+	return sprintf(buf, "%08lx\n", mem->end_phys_index);
 }

 /*
@@ -128,7 +136,7 @@ static ssize_t show_mem_removable(struct
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);

-	start_pfn = section_nr_to_pfn(mem->phys_index);
+	start_pfn = section_nr_to_pfn(mem->start_phys_index);
 	ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
 	return sprintf(buf, "%d\n", ret);
 }
@@ -191,7 +199,7 @@ memory_block_action(struct memory_block
 	int ret;
 	int old_state = mem->state;

-	psection = mem->phys_index;
+	psection = mem->start_phys_index;
 	first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);

 	/*
@@ -264,7 +272,7 @@ store_mem_state(struct sys_device *dev,
 	int ret = -EINVAL;

 	mem = container_of(dev, struct memory_block, sysdev);
-	phys_section_nr = mem->phys_index;
+	phys_section_nr = mem->start_phys_index;

 	if (!present_section_nr(phys_section_nr))
 		goto out;
@@ -296,7 +304,8 @@ static ssize_t show_phys_device(struct s
 	return sprintf(buf, "%d\n", mem->phys_device);
 }

-static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
 static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
 static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
 static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -476,16 +485,18 @@ static int add_memory_block(int nid, str
 	if (!mem)
 		return -ENOMEM;

-	mem->phys_index = __section_nr(section);
+	mem->start_phys_index = __section_nr(section);
 	mem->state = state;
 	mutex_init(&mem->state_mutex);
-	start_pfn = section_nr_to_pfn(mem->phys_index);
+	start_pfn = section_nr_to_pfn(mem->start_phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);

 	ret = register_memory(mem, section);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
+		ret = mem_create_simple_file(mem, end_phys_index);
+	if (!ret)
 		ret = mem_create_simple_file(mem, state);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_device);
@@ -507,6 +518,7 @@ int remove_memory_block(unsigned long no
 	mem = find_memory_block(section);
 	unregister_mem_sect_under_nodes(mem);
 	mem_remove_simple_file(mem, phys_index);
+	mem_remove_simple_file(mem, end_phys_index);
 	mem_remove_simple_file(mem, state);
 	mem_remove_simple_file(mem, phys_device);
 	mem_remove_simple_file(mem, removable);
Index: linux-2.6/include/linux/memory.h
===
--- linux-2.6.orig/include/linux/memory.h 2010-08-02 13:23:49.0 -0500
+++ linux-2.6/include/linux/memory.h 2010-08-02 13:33:27.0 -0500
@@ -21,7 +21,8 @@
 #include linux
Re: [PATCH 3/9] v4 Add section count to memory_block
On Tue, 03 Aug 2010 08:38:37 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: Add a section count property to the memory_block struct to track the number of memory sections that have been added/removed from a memory block. This allows us to know when the last memory section of a memory block has been removed so we can remove the memory block. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 4/9] v4 Add mutex for add/remove of memory blocks
On Tue, 03 Aug 2010 08:39:50 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Add a new mutex for use in adding and removing of memory blocks. This is needed to avoid any race conditions in which the same memory block could be added and removed at the same time.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

But a nitpick (see below).

---
 drivers/base/memory.c |9 +
 1 file changed, 9 insertions(+)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-08-02 13:35:00.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-08-02 13:45:34.0 -0500
@@ -27,6 +27,8 @@
 #include <asm/atomic.h>
 #include <asm/uaccess.h>

+static struct mutex mem_sysfs_mutex;
+

For a static mutex symbol, we usually write

	static DEFINE_MUTEX(mem_sysfs_mutex);

Then no extra call to mutex_init() is required.
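The same idea exists in userspace, which may make the suggestion easier to picture. The sketch below uses PTHREAD_MUTEX_INITIALIZER as a rough analogue of the kernel's DEFINE_MUTEX(); it is not kernel code, and `demo_sysfs_mutex`, `demo_block_count`, and the two helpers are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

/* Statically initialised, so no init call is needed before first use. */
static pthread_mutex_t demo_sysfs_mutex = PTHREAD_MUTEX_INITIALIZER;
static int demo_block_count; /* protected by demo_sysfs_mutex */

/* Register one block and return the new count. */
static int demo_add_block(void)
{
	int count;

	pthread_mutex_lock(&demo_sysfs_mutex);
	count = ++demo_block_count;
	pthread_mutex_unlock(&demo_sysfs_mutex);
	return count;
}

/* Unregister one block and return the new count. */
static int demo_remove_block(void)
{
	int count;

	pthread_mutex_lock(&demo_sysfs_mutex);
	assert(demo_block_count > 0);
	count = --demo_block_count;
	pthread_mutex_unlock(&demo_sysfs_mutex);
	return count;
}
```

Because the lock is valid from program start, there is no ordering requirement between an init routine and the first add or remove, which is the practical benefit the reviewer is pointing at.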
Re: [PATCH 5/9] v4 Allow memory_block to span multiple memory sections
On Tue, 03 Aug 2010 08:40:49 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the memory sysfs code such that each sysfs memory directory is now considered a memory block that can contain multiple memory sections per memory block. The default size of each memory block is SECTION_SIZE_BITS to maintain the current behavior of having a single memory section per memory block (i.e. one sysfs directory per memory section). For architectures that want to have memory blocks span multiple memory sections they need only define their own memory_block_size_bytes() routine.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

(But maybe it's better to get an Ack from a ppc developer as well.)
Re: [PATCH 6/9] v4 Update the find_memory_block declaration
On Tue, 03 Aug 2010 08:41:45 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the find_memory_block declaration to take a struct mem_section * so that it matches the definition.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Hmm... my mmotm-0727 already has this declaration in memory.h:

	extern struct memory_block *find_memory_block(struct mem_section *);

Which patch makes it take an unsigned long?
Re: [PATCH 7/9] v4 Update the node sysfs code
On Tue, 03 Aug 2010 08:42:35 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: Update the node sysfs code to be aware of the new capability for a memory block to contain multiple memory sections. This requires an additional parameter to unregister_mem_sect_under_nodes so that we know which memory section of the memory block to unregister. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 9/9] v4 Update memory-hotplug documentation
On Tue, 03 Aug 2010 08:44:16 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the memory hotplug documentation to reflect the new behaviors of memory blocks reflected in sysfs.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

A request from me: could you clarify in the documentation what happens if there is a memory hole in [start, end)_phys_index? (Or add it to a TODO list.)

Thanks, -Kame

---
 Documentation/memory-hotplug.txt | 40 +++
 1 file changed, 24 insertions(+), 16 deletions(-)

Index: linux-2.6/Documentation/memory-hotplug.txt
===
--- linux-2.6.orig/Documentation/memory-hotplug.txt 2010-08-02 14:09:28.0 -0500
+++ linux-2.6/Documentation/memory-hotplug.txt 2010-08-02 14:10:36.0 -0500
@@ -126,36 +126,44 @@ config options.
 4 sysfs files for memory hotplug
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs. Each section is part of
+a memory block under /sys/devices/system/memory as

 /sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)

-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block.

 For example, assume 1GiB section size. A device for a memory starting at
 0x100000000 is /sys/device/system/memory/memory4
 (0x100000000 / 1Gib = 4)
 This device covers address range [0x100000000 ... 0x140000000)

-Under each section, you can see 4 files.
+Under each section, you can see 5 files.

-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable

-'phys_index' : read-only and contains section id, same as XXX.
-'state'      : read-write
-               at read: contains online/offline state of memory.
-               at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
-               This is not well implemented now.
-'removable'  : read-only: contains an integer value indicating
-               whether the memory section is removable or not
-               removable. A value of 1 indicates that the memory
-               section is removable and a value of 0 indicates that
-               it is not removable.
+'phys_index'     : read-only and contains section id of the first section
+                   in the memory block, same as XXX.
+'end_phys_index' : read-only and contains section id of the last section
+                   in the memory block.
+'state'          : read-write
+                   at read: contains online/offline state of memory.
+                   at write: user can specify "online", "offline" command
+                   which will be performed on al sections in the block.
+'phys_device'    : read-only: designed to show the name of physical memory
+                   device. This is not well implemented now.
+'removable'      : read-only: contains an integer value indicating
+                   whether the memory block is removable or not
+                   removable. A value of 1 indicates that the memory
+                   block is removable and a value of 0 indicates that
+                   it is not removable. A memory block is removable only if
+                   every section in the block is removable.

 NOTE: These directories/files appear after physical memory hotplug phase.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
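The rule stated for 'removable' (a block is removable only if every section in it is removable) can be sketched as a simple all-of check. This is a userspace illustration, not the kernel implementation: the per-section flags are passed in as a hypothetical array, whereas in the kernel the answer for each section would come from is_mem_section_removable().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * removable[i] is a hypothetical per-section flag; in the kernel the
 * answer would come from is_mem_section_removable().
 */
static bool demo_block_removable(const bool *removable, size_t nr_sections)
{
	size_t i;

	assert(nr_sections > 0);
	for (i = 0; i < nr_sections; i++) {
		if (!removable[i])
			return false; /* one pinned section blocks the whole block */
	}
	return true;
}
```

Returning early on the first non-removable section also avoids the pitfall of letting a later section's result overwrite an earlier one.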
Re: [PATCH 1/8] v3 Move the find_memory_block() routine up
On Mon, 19 Jul 2010 22:51:42 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Move the find_memory_block() routine up to avoid needing a forward declaration in subsequent patches.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

---
 drivers/base/memory.c | 62 +-
 1 file changed, 31 insertions(+), 31 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-07-16 12:41:30.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-19 20:42:11.0 -0500
@@ -435,6 +435,37 @@ int __weak arch_get_memory_phys_device(u
 	return 0;
 }

+/*
+ * For now, we have a linear search to go find the appropriate
+ * memory_block corresponding to a particular phys_index. If
+ * this gets to be a real problem, we can always use a radix
+ * tree or something here.
+ *
+ * This could be made generic for all sysdev classes.
+ */
+struct memory_block *find_memory_block(struct mem_section *section)
+{
+	struct kobject *kobj;
+	struct sys_device *sysdev;
+	struct memory_block *mem;
+	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+
+	/*
+	 * This only works because we know that section == sysdev->id
+	 * slightly redundant with sysdev_register()
+	 */
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+
+	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
+	if (!kobj)
+		return NULL;
+
+	sysdev = container_of(kobj, struct sys_device, kobj);
+	mem = container_of(sysdev, struct memory_block, sysdev);
+
+	return mem;
+}
+
 static int add_memory_block(int nid, struct mem_section *section,
 			unsigned long state, enum mem_add_context context)
 {
@@ -468,37 +499,6 @@ static int add_memory_block(int nid, str
 	return ret;
 }

-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all sysdev classes.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
-	struct kobject *kobj;
-	struct sys_device *sysdev;
-	struct memory_block *mem;
-	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
-
-	/*
-	 * This only works because we know that section == sysdev->id
-	 * slightly redundant with sysdev_register()
-	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
-
-	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
-	if (!kobj)
-		return NULL;
-
-	sysdev = container_of(kobj, struct sys_device, kobj);
-	mem = container_of(sysdev, struct memory_block, sysdev);
-
-	return mem;
-}
-
 int remove_memory_block(unsigned long node_id, struct mem_section *section,
 		int phys_device)
 {
Re: [PATCH 2/8] v3 Add new phys_index properties
On Mon, 19 Jul 2010 22:52:50 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the 'phys_index' properties of a memory block to include a 'start_phys_index' which is the same as the current 'phys_index' property. This also adds an 'end_phys_index' property to indicate the id of the last section in the memory block.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

No, please leave phys_index as it is; please don't rename it. IMHO, just adding end_phys_index is better. Please avoid interface changes as far as possible. Would it be a problem for you if phys_index simply means start_phys_index?

Thanks, -Kame
Re: [PATCH 3/8] v3 Add section count to memory_block
On Mon, 19 Jul 2010 22:53:58 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Add a section count property to the memory_block struct to track the number of memory sections that have been added/removed from a memory block.

Signed-off-by: Nathan Fontenot nf...@asutin.ibm.com

---
 drivers/base/memory.c | 19 --- include/linux/memory.h |2 ++
 2 files changed, 14 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-07-19 20:43:49.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-19 20:44:01.0 -0500
@@ -487,6 +487,7 @@ static int add_memory_block(int nid, str

 	mem->start_phys_index = __section_nr(section);
 	mem->state = state;
+	atomic_inc(&mem->section_count);
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->start_phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
@@ -516,13 +517,17 @@ int remove_memory_block(unsigned long no
 	struct memory_block *mem;

 	mem = find_memory_block(section);
-	unregister_mem_sect_under_nodes(mem);
-	mem_remove_simple_file(mem, start_phys_index);
-	mem_remove_simple_file(mem, end_phys_index);
-	mem_remove_simple_file(mem, state);
-	mem_remove_simple_file(mem, phys_device);
-	mem_remove_simple_file(mem, removable);
-	unregister_memory(mem, section);
+	atomic_dec(&mem->section_count);
+
+	if (atomic_read(&mem->section_count) == 0) {

We usually use atomic_dec_and_test() here. Otherwise, I don't see any problems in the other parts. Please fix this nitpick.

Regards, -Kame
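The reason for preferring a combined decrement-and-test is that a separate atomic_dec() followed by atomic_read() leaves a window in which another thread can change the counter between the two steps. A userspace analogue using C11 atomics might look like the sketch below; `demo_dec_and_test` is a hypothetical helper, not the kernel's atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogue of a decrement-and-test operation: decrement the
 * counter and report whether the new value is zero, as one atomic step.
 */
static bool demo_dec_and_test(atomic_int *counter)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	return atomic_fetch_sub(counter, 1) == 1;
}
```

Only the caller that observes the transition to zero gets a true result, so exactly one caller would go on to tear down the block.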
Re: [PATCH 4/8] v3 Allow memory_block to span multiple memory sections
On Mon, 19 Jul 2010 22:55:08 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the memory sysfs code such that each sysfs memory directory is now considered a memory block that can contain multiple memory sections per memory block. The default size of each memory block is SECTION_SIZE_BITS to maintain the current behavior of having a single memory section per memory block (i.e. one sysfs directory per memory section). For architectures that want to have memory blocks span multiple memory sections they need only define their own memory_block_size_bytes() routine.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c | 141 ++
 1 file changed, 98 insertions(+), 43 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-07-19 20:44:01.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-19 21:12:22.0 -0500
@@ -28,6 +28,14 @@
 #include <asm/uaccess.h>

 #define MEMORY_CLASS_NAME	"memory"
+#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+	return (section_nr / sections_per_block) * sections_per_block;
+}

 static struct sysdev_class memory_sysdev_class = {
 	.name = MEMORY_CLASS_NAME,
@@ -82,22 +90,21 @@ EXPORT_SYMBOL(unregister_memory_isolate_
  * register_memory - Setup a sysfs device for a memory block
  */
 static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
 {
 	int error;

 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = __section_nr(section);
+	memory->sysdev.id = memory->start_phys_index;

I'm curious whether memory->start_phys_index can overflow here; sysdev.id is 32-bit.

Thanks, -Kame
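Both points in this message, the rounding done by base_memory_block_id() and the question of whether a section number fits in a 32-bit id, can be explored with a small userspace sketch. The helpers are hypothetical `demo_` variants that take the sections-per-block value as a parameter instead of a global, and the 16 MiB section size in the usage note is an assumption rather than something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round a section number down to the first section of its block. */
static uint64_t demo_base_memory_block_id(uint64_t section_nr,
					  uint64_t sections_per_block)
{
	assert(sections_per_block > 0);
	return (section_nr / sections_per_block) * sections_per_block;
}

/* Would this section number survive being stored in a 32-bit id? */
static bool demo_fits_in_u32(uint64_t section_nr)
{
	return section_nr <= UINT32_MAX;
}
```

Under the assumed 16 MiB section size, a physical address of 2^46 bytes maps to section number 2^22, which fits comfortably; the section number itself would have to reach 2^32 before a 32-bit id could no longer hold it.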
Re: [PATCH 5/8] v3 Update the find_memory_block declaration
On Mon, 19 Jul 2010 22:56:16 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Update the find_memory_block declaration to take a struct mem_section * so that it matches the definition.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 6/8] v3 Update the node sysfs code
On Mon, 19 Jul 2010 22:57:35 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: Update the node sysfs code to be aware of the new capability for a memory block to contain multiple memory sections. This requires an additional parameter to unregister_mem_sect_under_nodes so that we know which memory section of the memory block to unregister. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Re: [PATCH 1/5] v2 Split the memory_block structure
On Thu, 15 Jul 2010 13:37:51 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Split the memory_block struct into a memory_block struct to cover each sysfs directory and a new memory_block_section struct for each memory section covered by the sysfs directory. This change allows for creation of memory sysfs directories that can span multiple memory sections. This can be beneficial in that it can reduce the number of memory sysfs directories created at boot. This also allows different architectures to define how many memory sections are covered by a sysfs directory.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c | 222 ++--- include/linux/memory.h | 11 +-
 2 files changed, 167 insertions(+), 66 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-07-15 08:48:41.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-15 09:55:54.0 -0500
@@ -28,6 +28,14 @@
 #include <asm/uaccess.h>

 #define MEMORY_CLASS_NAME	"memory"
+#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+	return (section_nr / sections_per_block) * sections_per_block;
+}

 static struct sysdev_class memory_sysdev_class = {
 	.name = MEMORY_CLASS_NAME,
@@ -94,10 +102,9 @@
 }

 static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
 {
 	BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-	BUG_ON(memory->sysdev.id != __section_nr(section));

 	/* drop the ref. we got in remove_memory_block() */
 	kobject_put(&memory->sysdev.kobj);
@@ -123,13 +130,20 @@
 static ssize_t show_mem_removable(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
+	struct memory_block *mem;
+	struct memory_block_section *mbs;
 	unsigned long start_pfn;
-	int ret;
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, sysdev);
+	int ret = 1;
+
+	mem = container_of(dev, struct memory_block, sysdev);
+	mutex_lock(&mem->state_mutex);

-	start_pfn = section_nr_to_pfn(mem->phys_index);
-	ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+	list_for_each_entry(mbs, &mem->sections, next) {
+		start_pfn = section_nr_to_pfn(mbs->phys_index);
+		ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+	}
+
+	mutex_unlock(&mem->state_mutex);

Hmm, this means memory can only be offlined as a whole memory block (all of its sections), right? Please state this in the patch description, and in Documentation/memory-hotplug.txt, for example: "From the user's perspective, a memory section is no longer the unit of memory hotplug." And describe the new rule.

 	return sprintf(buf, "%d\n", ret);
 }
@@ -182,16 +196,16 @@
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_block_action(struct memory_block_section *mbs, unsigned long action)
 {
 	int i;
 	unsigned long psection;
 	unsigned long start_pfn, start_paddr;
 	struct page *first_page;
 	int ret;
-	int old_state = mem->state;
+	int old_state = mbs->state;

-	psection = mem->phys_index;
+	psection = mbs->phys_index;
 	first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);

 	/*
@@ -217,18 +231,18 @@
 		ret = online_pages(start_pfn, PAGES_PER_SECTION);
 		break;
 	case MEM_OFFLINE:
-		mem->state = MEM_GOING_OFFLINE;
+		mbs->state = MEM_GOING_OFFLINE;
 		start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
 		ret = remove_memory(start_paddr,
 				PAGES_PER_SECTION << PAGE_SHIFT);
 		if (ret) {
-			mem->state = old_state;
+			mbs->state = old_state;
 			break;
 		}
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
-				__func__, mem, action, action);
+				__func__, mbs, action, action);
 		ret = -EINVAL;
 	}

@@ -238,19 +252,34 @@

Also, please check your quilt diff options. Patches on the ML usually show the function name in every hunk header, as in

	@@ -241,6 +293,8 @@ static int memory_block_change_state(str

Maybe the -p option is missing.

 static int memory_block_change_state(struct memory_block *mem, unsigned
Re: [PATCH 2/5] v2 Create new 'end_phys_index' file
On Thu, 15 Jul 2010 13:38:52 -0500 Nathan Fontenot nf...@austin.ibm.com wrote:

Add a new 'end_phys_index' file to each memory sysfs directory to report the physical index of the last memory section covered by the sysfs directory.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Does a memory_block have to be contiguous over [phys_index, end_phys_index]? Should we also provide the number of sections or the amount of memory under a block?

No objections to end_phys_index, but please fix the diff style.

Thanks, -Kame

---
 drivers/base/memory.c | 14 +- include/linux/memory.h |3 +++
 2 files changed, 16 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/base/memory.c
===
--- linux-2.6.orig/drivers/base/memory.c 2010-07-15 09:55:54.0 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-15 09:56:05.0 -0500
@@ -121,7 +121,15 @@
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index);
+	return sprintf(buf, "%08lx\n", mem->start_phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+			struct sysdev_attribute *attr, char *buf)
+{
+	struct memory_block *mem =
+		container_of(dev, struct memory_block, sysdev);
+	return sprintf(buf, "%08lx\n", mem->end_phys_index);
 }

 /*
@@ -321,6 +329,7 @@
 }

 static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
 static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
 static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
 static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -533,6 +542,8 @@
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
+		ret = mem_create_simple_file(mem, end_phys_index);
+	if (!ret)
 		ret = mem_create_simple_file(mem, state);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_device);
@@ -577,6 +588,7 @@
 	if (list_empty(&mem->sections)) {
 		unregister_mem_sect_under_nodes(mem);
 		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, end_phys_index);
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
Index: linux-2.6/include/linux/memory.h
===
--- linux-2.6.orig/include/linux/memory.h 2010-07-15 09:54:06.0 -0500
+++ linux-2.6/include/linux/memory.h 2010-07-15 09:56:05.0 -0500
@@ -29,6 +29,9 @@
 struct memory_block {
 	unsigned long state;
+	unsigned long start_phys_index;
+	unsigned long end_phys_index;
+
 	/*
 	 * This serializes all state change requests. It isn't
 	 * held during creation because the control files are
Re: [PATCH 4/5] v2 Update sysfs node routines for new sysfs memory directories
On Thu, 15 Jul 2010 13:40:40 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> Update the node sysfs directory routines that create
> links to the memory sysfs directories under each node.
> This update makes the node code aware that a memory sysfs
> directory can cover multiple memory sections.
>
> Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

Shouldn't
	static int link_mem_sections(int nid)
be updated too? It does
	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
		register..

Thanks,
-Kame

> ---
>  drivers/base/node.c |   12
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/drivers/base/node.c
> ===
> --- linux-2.6.orig/drivers/base/node.c	2010-07-15 09:54:06.0 -0500
> +++ linux-2.6/drivers/base/node.c	2010-07-15 09:56:16.0 -0500
> @@ -346,8 +346,10 @@
>  		return -EFAULT;
>  	if (!node_online(nid))
>  		return 0;
> -	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>  		int page_nid;
>
> @@ -383,8 +385,10 @@
>  	if (!unlinked_nodes)
>  		return -ENOMEM;
>  	nodes_clear(*unlinked_nodes);
> -	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>  		int nid;
>
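The pfn range arithmetic in the node.c hunks can be tried out in user space. What follows is a minimal sketch, not the kernel code: `PFN_SECTION_SHIFT` is given an assumed value of 12 (the real value depends on the architecture), and `struct memory_block_range`, `block_start_pfn()` and `block_end_pfn()` are hypothetical names standing in for the two index fields of `struct memory_block` and the open-coded computation in the patch.

```c
#include <assert.h>

/* Assumed value for illustration only; the kernel derives it per arch. */
#define PFN_SECTION_SHIFT 12
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)

/* Hypothetical stand-in for the index fields of struct memory_block. */
struct memory_block_range {
	unsigned long start_phys_index;	/* first section covered */
	unsigned long end_phys_index;	/* last section covered, inclusive */
};

/* First pfn of a section. */
static unsigned long section_nr_to_pfn(unsigned long sec)
{
	return sec << PFN_SECTION_SHIFT;
}

/* First pfn covered by the block. */
static unsigned long block_start_pfn(const struct memory_block_range *blk)
{
	return section_nr_to_pfn(blk->start_phys_index);
}

/* Last pfn covered by the block, inclusive, as in the patched loops. */
static unsigned long block_end_pfn(const struct memory_block_range *blk)
{
	assert(blk->end_phys_index >= blk->start_phys_index);
	return section_nr_to_pfn(blk->end_phys_index) + PAGES_PER_SECTION - 1;
}
```

With start and end index equal, the range is exactly one section, which matches the pre-patch behaviour; a block spanning sections 8..11 covers four sections' worth of pfns.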
Re: [PATCH 4/7] Allow sysfs memory directories to be split
On Wed, 14 Jul 2010 12:25:03 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

> On Tue, 13 Jul 2010 22:18:03 -0500
> Nathan Fontenot nf...@austin.ibm.com wrote:
>
> > On 07/13/2010 07:35 PM, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 13 Jul 2010 10:51:58 -0500
> > > Nathan Fontenot nf...@austin.ibm.com wrote:
> > >
> > > > > And what is the purpose of this interface? Does this split a memory
> > > > > block into 2 pieces of the same size?? It sounds like a __very__
> > > > > strange interface to me.
> > > >
> > > > Yes, this splits the memory_block into two blocks of the same size. This
> > > > was suggested as something we may want to do. From a ppc perspective I am
> > > > not sure we would use this.
> > > >
> > > > The split functionality is not required. The main goal of the patch set is
> > > > to reduce the number of memory sysfs directories created. From a ppc
> > > > perspective the split functionality is not really needed.
> > >
> > > Okay, this is an offer from me.
> > >
> > > 1. I think you can add a boot option like "don't create memory sysfs".
> > >    Please do.
> >
> > I posted a patch to do that a week or so ago, it didn't go over very well.
> >
> > > 2. I'd like to write a configfs module for handling memory hotplug even when
> > >    the sysfs directory is not created.
> > >    Because configfs supports rmdir/mkdir, the user (ppc's daemon?) has to do
> > >    the following when offlining section X:
> > >
> > >    # insmod configfs_memory.ko
> > >    # mount -t configfs none /configfs
> > >    # mkdir /configfs/memoryX
> > >    # echo offline > /configfs/memoryX/state
> > >    # rmdir /configfs/memoryX
> > >
> > >    And making this operation the default behavior for all archs' memory
> > >    hotplug may be better...
> > >
> > > Dave, what do you think? Because the ppc guys use the probe interface
> > > already, this can be handled... no?
> >
> > ppc would still require the existence of the 'probe' interface.
> >
> > Are you objecting to the 'split' functionality?
>
> yes.
>
> > If so I do not see any reason from ppc perspective that it is needed. This
> > was something Dave suggested, unless I am missing something.
> >
> > Since ppc needs the 'probe' interface in sysfs, and for ppc having multiple
> > memory_block_sections reside under a single memory_block makes memory hotplug
> > simpler. On ppc we do memory hotplug operations on an LMB size basis. With my
> > patches this now lets us set each memory_block to span an LMB's worth of
> > memory. Now we could do memory hotplug in a single operation instead of
> > multiple operations to offline/online all of the memory sections in an LMB.
>
> Per-section memory offlining is provided to allow a good success rate for
> memory offlining. Because memory hotplug has to migrate or free all used pages
> under a section, whether memory can be unplugged depends on how that memory is
> used. If a section contains an unmovable page (a kernel page), we can't
> offline the section.
>
> For example, comparing
>   1. offlining 128MB of memory at once
>   2. offlining 8 chunks of 16MB memory
>
> option 2 has a much better chance of success, and system-busy time can be much
> reduced.
>
> IIUC, ppc's first requirement is resizing, not hot-removing some memory device,
> so option 2 is very welcome. So some fine-grained interface at section_size
> granularity is appreciated, and multiple operations are much better than a
> single operation.
>
> As I posted the show/hide patch, I'm writing it in configfs. I think it meets
> IBM's requirements. _But_, it's IBM's issue, not Fujitsu's. So the final
> decision will depend on you guys.
>
> Anyway, I don't like a too fancy interface such as split.

This is a sample configfs implementation for handling memory hotplug.
I wrote it just for fun and study. Code duplication was not as big as
expected... most of the code is for configfs management.

You can ignore this, but please avoid changing the existing interface in a fancy way.
==
[r...@bluextal kamezawa]# mount -t configfs none /configfs/
[r...@bluextal kamezawa]# mkdir /configfs/memory/72
[r...@bluextal kamezawa]# cat /configfs/memory/72/phys_index
0048
[r...@bluextal kamezawa]# cat /sys/devices/system/memory/memory72/phys_index
0048
[r...@bluextal kamezawa]# echo offline > /configfs/memory/72/state
[r...@bluextal kamezawa]# cat /configfs/memory/72/state
offline
[r...@bluextal kamezawa]# cat /sys/devices/system/memory/memory72/state
offline
[r...@bluextal kamezawa]# echo online > /configfs/memory/72/state
[r...@bluextal kamezawa]# cat /sys/devices/system/memory/memory72/state
online

No sign-off.
---
 drivers/base/Makefile        |    2
 drivers/base/memory.c        |   87 +--
 drivers/base/memory_config.c |  192 +++
 include/linux/memory.h       |   10 ++
 mm/Kconfig                   |    1
 5 files changed, 280 insertions(+), 12 deletions(-)

Index: mmotm-2.6.35-0701/drivers/base/memory.c
===
--- mmotm-2.6.35-0701.orig/drivers/base/memory.c
+++ mmotm-2.6.35-0701/drivers/base/memory.c
@@ -23,12 +23,15 @@
 #include <linux/mutex.h>
Re: [PATCH 1/7] Split the memory_block structure
plz cc linux-mm in the next time... And please incudes updates for Documentation/memory-hotplug.txt. On Mon, 12 Jul 2010 10:42:06 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: This patch splits the memory_block struct into a memory_block struct to cover each sysfs directory and a new memory_block_section struct for each memory section covered by the sysfs directory. This also updates the routine handling memory_block creation and manipulation to use these updated structures. Could you clarify the number of memory_block_section per memory_block ? Signed -off-by: Nathan Fontenot nf...@austin.ibm.com --- drivers/base/memory.c | 228 +++-- include/linux/memory.h | 11 +- 2 files changed, 172 insertions(+), 67 deletions(-) Index: linux-2.6/drivers/base/memory.c === --- linux-2.6.orig/drivers/base/memory.c 2010-07-08 11:27:21.0 -0500 +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:09.0 -0500 @@ -28,6 +28,14 @@ #include asm/uaccess.h #define MEMORY_CLASS_NAMEmemory +#define MIN_MEMORY_BLOCK_SIZE(1 SECTION_SIZE_BITS) + +static int sections_per_block; + some default value, plz. Does this can be determined only by .config ? +static inline int base_memory_block_id(int section_nr) +{ + return (section_nr / sections_per_block) * sections_per_block; +} static struct sysdev_class memory_sysdev_class = { .name = MEMORY_CLASS_NAME, @@ -94,10 +102,9 @@ } static void -unregister_memory(struct memory_block *memory, struct mem_section *section) +unregister_memory(struct memory_block *memory) { BUG_ON(memory-sysdev.cls != memory_sysdev_class); - BUG_ON(memory-sysdev.id != __section_nr(section)); /* drop the ref. 
we got in remove_memory_block() */ kobject_put(memory-sysdev.kobj); @@ -123,13 +130,20 @@ static ssize_t show_mem_removable(struct sys_device *dev, struct sysdev_attribute *attr, char *buf) { - unsigned long start_pfn; - int ret; - struct memory_block *mem = - container_of(dev, struct memory_block, sysdev); + struct list_head *pos, *tmp; + struct memory_block *mem; + int ret = 1; + + mem = container_of(dev, struct memory_block, sysdev); + list_for_each_safe(pos, tmp, mem-sections) { + struct memory_block_section *mbs; + unsigned long start_pfn; + + mbs = list_entry(pos, struct memory_block_section, next); list_for_each_entry ? + start_pfn = section_nr_to_pfn(mbs-phys_index); + ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION); + } Hmm, them, only when the whole memory block is removable, it's shown as removable. Right ? Does it meets ppc guy's requirements ? - start_pfn = section_nr_to_pfn(mem-phys_index); - ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION); return sprintf(buf, %d\n, ret); } Hmm...can't you print removable information as bitmap, here ? overkill ? @@ -182,16 +196,16 @@ * OK to have direct references to sparsemem variables in here. */ static int -memory_block_action(struct memory_block *mem, unsigned long action) +memory_block_action(struct memory_block_section *mbs, unsigned long action) { int i; unsigned long psection; unsigned long start_pfn, start_paddr; struct page *first_page; int ret; - int old_state = mem-state; ot-option-to-disable-memory-hotplug.patch + int old_state = mbs-state; Where is this noise from ? 
- psection = mem-phys_index; + psection = mbs-phys_index; first_page = pfn_to_page(psection PFN_SECTION_SHIFT); /* @@ -217,18 +231,18 @@ ret = online_pages(start_pfn, PAGES_PER_SECTION); break; case MEM_OFFLINE: - mem-state = MEM_GOING_OFFLINE; + mbs-state = MEM_GOING_OFFLINE; start_paddr = page_to_pfn(first_page) PAGE_SHIFT; ret = remove_memory(start_paddr, PAGES_PER_SECTION PAGE_SHIFT); if (ret) { - mem-state = old_state; + mbs-state = old_state; break; } break; default: WARN(1, KERN_WARNING %s(%p, %ld) unknown action: %ld\n, - __func__, mem, action, action); + __func__, mbs, action, action); ret = -EINVAL; } @@ -238,19 +252,40 @@ static int memory_block_change_state(struct memory_block *mem, unsigned long to_state, unsigned
Re: [PATCH 3/7] Update the [register,unregister]_memory routines
On Mon, 12 Jul 2010 10:44:10 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> This patch moves the register/unregister_memory routines to
> avoid a forward declaration. It also moves the sysfs file
> creation and deletion for each directory into the register/
> unregister routines to avoid duplicating it with these updates.
>
> Signed-off-by: Nathan Fontenot nf...@austin.ibm.com
> ---
>  drivers/base/memory.c |   93 +-
>  1 file changed, 48 insertions(+), 45 deletions(-)
>
> Index: linux-2.6/drivers/base/memory.c
> ===
> --- linux-2.6.orig/drivers/base/memory.c	2010-07-09 14:23:17.0 -0500
> +++ linux-2.6/drivers/base/memory.c	2010-07-09 14:23:20.0 -0500
> @@ -87,31 +87,6 @@
>  EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>
>  /*
> - * register_memory - Setup a sysfs device for a memory block
> - */
> -static
> -int register_memory(struct memory_block *memory, struct mem_section *section)
> -{
> -	int error;
> -
> -	memory->sysdev.cls = &memory_sysdev_class;
> -	memory->sysdev.id = __section_nr(section);
> -
> -	error = sysdev_register(&memory->sysdev);
> -	return error;
> -}
> -
> -static void
> -unregister_memory(struct memory_block *memory)
> -{
> -	BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
> -
> -	/* drop the ref. we got in remove_memory_block() */
> -	kobject_put(&memory->sysdev.kobj);
> -	sysdev_unregister(&memory->sysdev);
> -}
> -
> -/*
>   * use this as the physical section index that this memsection
>   * uses.
>   */
> @@ -346,6 +321,53 @@
>  	sysdev_remove_file(&mem->sysdev, &attr_##attr_name)
>
>  /*
> + * register_memory - Setup a sysfs device for a memory block
> + */
> +static
> +int register_memory(struct memory_block *memory, struct mem_section *section,
> +		    int nid, enum mem_add_context context)
> +{
> +	int ret;
> +
> +	memory->sysdev.cls = &memory_sysdev_class;
> +	memory->sysdev.id = __section_nr(section);
> +

Why not block-ID but section-ID ?

-Kame
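For reference, the block-ID question can be made concrete with the `base_memory_block_id()` helper quoted in patch 1/7 of this series. The sketch below is a user-space copy of that arithmetic; passing `sections_per_block` as a parameter (instead of reading the static variable from the patch) is an assumption made here so the function can be exercised on its own.

```c
#include <assert.h>

/*
 * Map a section number to the number of the first section in its block.
 * Same arithmetic as the helper in the quoted patch; sections_per_block
 * is a parameter here only for the sake of a self-contained sketch.
 */
static int base_memory_block_id(int section_nr, int sections_per_block)
{
	assert(section_nr >= 0);
	assert(sections_per_block > 0);
	return (section_nr / sections_per_block) * sections_per_block;
}
```

With four sections per block, sections 12 to 15 all map to 12, so a sysdev id derived this way would name the block rather than an individual section. With one section per block the mapping is the identity.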
Re: [PATCH 4/7] Allow sysfs memory directories to be split
On Mon, 12 Jul 2010 10:45:25 -0500 Nathan Fontenot nf...@austin.ibm.com wrote: This patch introduces the new 'split' file in each memory sysfs directory and the associated routines needed to handle splitting a directory. Signed-off-by; Nathan Fontenot nf...@austin.ibm.com --- pleae check diff option... drivers/base/memory.c | 99 +- 1 file changed, 98 insertions(+), 1 deletion(-) Index: linux-2.6/drivers/base/memory.c === --- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:20.0 -0500 +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:38:09.0 -0500 @@ -32,6 +32,9 @@ static int sections_per_block; +static int register_memory(struct memory_block *, struct mem_section *, +int, enum mem_add_context); + static inline int base_memory_block_id(int section_nr) { return (section_nr / sections_per_block) * sections_per_block; @@ -309,11 +312,100 @@ return sprintf(buf, %d\n, mem-phys_device); } +static void update_memory_block_phys_indexes(struct memory_block *mem) +{ + struct list_head *pos; + struct memory_block_section *mbs; + unsigned long min_index = 0x; + unsigned long max_index = 0; + + list_for_each(pos, mem-sections) { + mbs = list_entry(pos, struct memory_block_section, next); + + if (mbs-phys_index min_index) + min_index = mbs-phys_index; + + if (mbs-phys_index max_index) + max_index = mbs-phys_index; + } + + mem-start_phys_index = min_index; + mem-end_phys_index = max_index; +} + +static ssize_t +store_mem_split_block(struct sys_device *dev, struct sysdev_attribute *attr, + const char *buf, size_t count) +{ + struct memory_block *mem, *new_mem_blk; + struct memory_block_section *mbs; + struct list_head *pos, *tmp; + struct mem_section *section; + int min_scn_nr = 0; + int max_scn_nr = 0; + int total_scns = 0; + int new_blk_min, new_blk_total; + int ret = -EINVAL; + + mem = container_of(dev, struct memory_block, sysdev); + + if (list_is_singular(mem-sections)) + return -EINVAL; What this means ? 
+ + mutex_lock(mem-state_mutex); + + list_for_each(pos, mem-sections) { + mbs = list_entry(pos, struct memory_block_section, next); + + total_scns++; + + if (min_scn_nr mbs-phys_index) + min_scn_nr = mbs-phys_index; + + if (max_scn_nr mbs-phys_index) + max_scn_nr = mbs-phys_index; + } + + new_mem_blk = kzalloc(sizeof(*new_mem_blk), GFP_KERNEL); + if (!new_mem_blk) + return -ENOMEM; + + mutex_init(new_mem_blk-state_mutex); + INIT_LIST_HEAD(new_mem_blk-sections); + new_mem_blk-state = mem-state; + + mutex_lock(new_mem_blk-state_mutex); + + new_blk_total = total_scns / 2; + new_blk_min = max_scn_nr - new_blk_total + 1; + + section = __nr_to_section(new_blk_min); + ret = register_memory(new_mem_blk, section, 0, HOTPLUG); + 'nid' is always 0 ? And for what purpose this interface is ? Does this split memory block into 2 pieces of the same size ?? sounds __very__ strange interface to me. If this is necessary, I hope move the whole things to configfs rather than something tricky. Bye. -Kame ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
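The split arithmetic in the quoted patch (`new_blk_total = total_scns / 2`, `new_blk_min = max_scn_nr - new_blk_total + 1`) can be sketched in user space as follows. This is an illustration under assumptions, not the patch itself: the comparison operators are not legible in the archived diff, so the usual min/max scan is assumed, the minimum is seeded with `ULONG_MAX`, and `struct split_plan` and `plan_split()` are hypothetical names. The sections of a block are assumed to be contiguous.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical result type: where the second half of a block would start. */
struct split_plan {
	unsigned long min_index;	/* lowest section index in the block */
	unsigned long max_index;	/* highest section index in the block */
	unsigned long new_blk_min;	/* first section of the new block */
	size_t new_blk_total;		/* sections moved to the new block */
};

/* Scan the section indexes of one block and compute the split point. */
static struct split_plan plan_split(const unsigned long *phys_index, size_t n)
{
	struct split_plan p = { ULONG_MAX, 0, 0, 0 };

	assert(n >= 2);	/* a single-section block cannot be split */
	for (size_t i = 0; i < n; i++) {
		if (phys_index[i] < p.min_index)
			p.min_index = phys_index[i];
		if (phys_index[i] > p.max_index)
			p.max_index = phys_index[i];
	}

	p.new_blk_total = n / 2;
	p.new_blk_min = p.max_index - p.new_blk_total + 1;
	return p;
}
```

For a block covering sections 16..19 this gives a new block of two sections starting at section 18. Note that the minimum has to start from a large value for the scan to work; a minimum initialised to 0, as `min_scn_nr` appears to be in the quoted code, would never be lowered by such a comparison.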
Re: [PATCH 4/7] Allow sysfs memory directories to be split
On Tue, 13 Jul 2010 10:51:58 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> > And what is the purpose of this interface? Does this split a memory
> > block into 2 pieces of the same size?? It sounds like a __very__
> > strange interface to me.
>
> Yes, this splits the memory_block into two blocks of the same size. This was
> suggested as something we may want to do. From a ppc perspective I am not sure
> we would use this.
>
> The split functionality is not required. The main goal of the patch set is to
> reduce the number of memory sysfs directories created. From a ppc perspective
> the split functionality is not really needed.

Okay, this is an offer from me.

1. I think you can add a boot option like "don't create memory sysfs".
   Please do.

2. I'd like to write a configfs module for handling memory hotplug even when
   the sysfs directory is not created.
   Because configfs supports rmdir/mkdir, the user (ppc's daemon?) has to do
   the following when offlining section X:

   # insmod configfs_memory.ko
   # mount -t configfs none /configfs
   # mkdir /configfs/memoryX
   # echo offline > /configfs/memoryX/state
   # rmdir /configfs/memoryX

   And making this operation the default behavior for all archs' memory
   hotplug may be better...

Dave, what do you think? Because the ppc guys use the probe interface already,
this can be handled... no?

One problem is that I don't have enough knowledge about configfs... it seems complex.

Thanks,
-Kame
Re: [PATCH 4/7] Allow sysfs memory directories to be split
On Tue, 13 Jul 2010 22:18:03 -0500
Nathan Fontenot nf...@austin.ibm.com wrote:

> On 07/13/2010 07:35 PM, KAMEZAWA Hiroyuki wrote:
> > On Tue, 13 Jul 2010 10:51:58 -0500
> > Nathan Fontenot nf...@austin.ibm.com wrote:
> >
> > > > And what is the purpose of this interface? Does this split a memory
> > > > block into 2 pieces of the same size?? It sounds like a __very__
> > > > strange interface to me.
> > >
> > > Yes, this splits the memory_block into two blocks of the same size. This
> > > was suggested as something we may want to do. From a ppc perspective I am
> > > not sure we would use this.
> > >
> > > The split functionality is not required. The main goal of the patch set is
> > > to reduce the number of memory sysfs directories created. From a ppc
> > > perspective the split functionality is not really needed.
> >
> > Okay, this is an offer from me.
> >
> > 1. I think you can add a boot option like "don't create memory sysfs".
> >    Please do.
>
> I posted a patch to do that a week or so ago, it didn't go over very well.
>
> > 2. I'd like to write a configfs module for handling memory hotplug even when
> >    the sysfs directory is not created.
> >    Because configfs supports rmdir/mkdir, the user (ppc's daemon?) has to do
> >    the following when offlining section X:
> >
> >    # insmod configfs_memory.ko
> >    # mount -t configfs none /configfs
> >    # mkdir /configfs/memoryX
> >    # echo offline > /configfs/memoryX/state
> >    # rmdir /configfs/memoryX
> >
> >    And making this operation the default behavior for all archs' memory
> >    hotplug may be better...
> >
> > Dave, what do you think? Because the ppc guys use the probe interface
> > already, this can be handled... no?
>
> ppc would still require the existence of the 'probe' interface.
>
> Are you objecting to the 'split' functionality?

yes.

> If so I do not see any reason from ppc perspective that it is needed. This
> was something Dave suggested, unless I am missing something.
>
> Since ppc needs the 'probe' interface in sysfs, and for ppc having multiple
> memory_block_sections reside under a single memory_block makes memory hotplug
> simpler. On ppc we do memory hotplug operations on an LMB size basis. With my
> patches this now lets us set each memory_block to span an LMB's worth of
> memory. Now we could do memory hotplug in a single operation instead of
> multiple operations to offline/online all of the memory sections in an LMB.

Per-section memory offlining is provided to allow a good success rate for
memory offlining. Because memory hotplug has to migrate or free all used pages
under a section, whether memory can be unplugged depends on how that memory is
used. If a section contains an unmovable page (a kernel page), we can't
offline the section.

For example, comparing
  1. offlining 128MB of memory at once
  2. offlining 8 chunks of 16MB memory

option 2 has a much better chance of success, and system-busy time can be much
reduced.

IIUC, ppc's first requirement is resizing, not hot-removing some memory device,
so option 2 is very welcome. So some fine-grained interface at section_size
granularity is appreciated, and multiple operations are much better than a
single operation.

As I posted the show/hide patch, I'm writing it in configfs. I think it meets
IBM's requirements. _But_, it's IBM's issue, not Fujitsu's. So the final
decision will depend on you guys.

Anyway, I don't like a too fancy interface such as split.

Thanks,
-Kame
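The granularity argument in this message (128MB at once versus 8 chunks of 16MB) can be illustrated with a small user-space model. It is only a sketch under stated assumptions: `can_offline_block()`, `sections_offlinable()` and the `unmovable[]` array are hypothetical names, and a section is treated as offlinable exactly when it holds no unmovable (kernel) page, which ignores migration failures and timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Whole-block offlining: the request succeeds only if no section in
 * the block contains an unmovable page.
 */
static bool can_offline_block(const bool *unmovable, size_t nr_sections)
{
	assert(unmovable != NULL);
	for (size_t i = 0; i < nr_sections; i++)
		if (unmovable[i])
			return false;
	return true;
}

/*
 * Per-section offlining: every section without an unmovable page can
 * still be taken offline, so count those.
 */
static size_t sections_offlinable(const bool *unmovable, size_t nr_sections)
{
	size_t n = 0;

	assert(unmovable != NULL);
	for (size_t i = 0; i < nr_sections; i++)
		if (!unmovable[i])
			n++;
	return n;
}
```

In this model, a 128MB block made of eight 16MB sections with a single kernel page in one of them cannot be offlined as a whole, while per-section requests still release seven sections (112MB).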
Re: 2.6.35-rc2 : OOPS with LTP memcg regression test run.
On Thu, 10 Jun 2010 22:00:57 +0200 Maciej Rutecki maciej.rute...@gmail.com wrote: I created a Bugzilla entry at https://bugzilla.kernel.org/show_bug.cgi?id=16178 for your bug report, please add your address to the CC list in there, thanks! Hmm... It seems a panic in SLUB or SLAB. Is .config available ? -Kame On niedziela, 6 czerwca 2010 o 17:06:54 Sachin Sant wrote: While executing LTP Controller tests(memcg regression) on a POWER6 box came across this following OOPS. Memory cgroup out of memory: kill process 9139 (memcg_test_1) score 3 or a child Killed process 9139 (memcg_test_1) vsz:3456kB, anon-rss:448kB, file-rss:1088kB Memory cgroup out of memory: kill process 9140 (memcg_test_1) score 3 or a child Killed process 9140 (memcg_test_1) vsz:3456kB, anon-rss:448kB, file-rss:1088kB Unable to handle kernel paging request for data at address 0x720072007200720 Faulting instruction address: 0xc015b778 Oops: Kernel access of bad area, sig: 11 [#2] SMP NR_CPUS=1024 NUMA pSeries last sysfs file: /sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map Modules linked in: quota_v2 quota_tree ipv6 fuse loop dm_mod sr_mod cdrom sg sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt scsi_mod NIP: c015b778 LR: c015b740 CTR: REGS: c9812ff0 TRAP: 0300 Tainted: G D (2.6.35-rc2-autotest) MSR: 80009032 EE,ME,IR,DR CR: 44004424 XER: 0001 DAR: 0720072007200720, DSISR: 4000 TASK = c5fb1100[9155] 'umount' THREAD: c981 CPU: 0 GPR00: c9813270 c0d3d7a0 GPR04: 8050 0016 0027 cf2c6870 GPR08: 06a5 c0b16870 c0cf0140 0e7b GPR12: 24004428 c744 8000 f000 GPR16: c98138f0 002d 0027 GPR20: 0027 c7063138 GPR24: c019bafc ce02e000 GPR28: 0001 8050 c0ca6b00 0720072007200720 NIP [c015b778] .kmem_cache_alloc+0xb0/0x13c LR [c015b740] .kmem_cache_alloc+0x78/0x13c Call Trace: [c9813270] [c015b740] .kmem_cache_alloc+0x78/0x13c (unreliable) [c9813310] [c019bafc] .alloc_buffer_head+0x2c/0x78 [c9813390] [c019c99c] .alloc_page_buffers+0x60/0x114 [c9813450] [c019ca78] .create_empty_buffers+0x28/0x140 
[c98134e0] [c019f2ec] .__block_prepare_write+0xe4/0x4f0 [c9813610] [c019f94c] .block_write_begin_newtrunc+0xa8/0x120 [c98136d0] [c019fea0] .block_write_begin+0x34/0x8c [c9813770] [c022b458] .ext3_write_begin+0x13c/0x298 [c9813880] [c0117500] .generic_file_buffered_write+0x13c/0x320 [c98139b0] [c0119c80] .__generic_file_aio_write+0x378/0x3dc [c9813ab0] [c0119d68] .generic_file_aio_write+0x84/0xfc [c9813b60] [c016e460] .do_sync_write+0xac/0x10c [c9813ce0] [c016f204] .vfs_write+0xd0/0x1dc [c9813d80] [c016f418] .SyS_write+0x58/0xa0 [c9813e30] [c00085b4] syscall_exit+0x0/0x40 Instruction dump: 3860 409e0090 3800 8b8d0212 980d0212 e96d0040 e93b 7ce95a14 7fe9582a 2fbf 419e0014 e81b001a 7c1f002a 7c09592a 481c 7f46d378 ---[ end trace f24cb0cb5729d2bb ]--- And few more of these. Previous snapshot release 2.6.35-rc1-git5(6c5de280b6...) was good. Thanks -Sachin -- Maciej Rutecki http://www.maciek.unixy.pl -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[BUGFIX][PATCH] memcg: avoid use cmpxchg in swap cgroup maintainance (Was Re: 34-rc1-git3 build failure with CGROUP_MEM_RES_CTLR_SWAP=y
On Sun, 14 Mar 2010 16:18:06 +0530
Sachin Sant sach...@in.ibm.com wrote:

> On a PowerPC box, latest 34-rc1 git(d89b218b8...) fails to build with
> CGROUP_MEM_RES_CTLR_SWAP=y.
>
>   LD      init/built-in.o
>   LD      .tmp_vmlinux1
> mm/built-in.o: In function __xchg:
> arch/powerpc/include/asm/system.h:331: undefined reference to .__xchg_called_with_bad_pointer
> mm/built-in.o: In function __cmpxchg:
> arch/powerpc/include/asm/system.h:474: undefined reference to .__cmpxchg_called_with_bad_pointer
> make: *** [.tmp_vmlinux1] Error 1
>
> The code in question was added via commit 024914477e...
> memcg: move charges of anonymous swap

Oh.. ok, powerpc (and other archs?) can't do 2-byte cmpxchg and xchg.
Then we should use a spinlock instead.

How about this? Nishimura-san, could you consider something better?
We need a quick fix.
==
swap_cgroup uses 2-byte data and uses cmpxchg in a new operation.
2-byte cmpxchg/xchg is not available on some archs.
This patch replaces cmpxchg/xchg with operations under a lock.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/page_cgroup.c |   20
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: mmotm-2.6.34-Mar11/mm/page_cgroup.c
===
--- mmotm-2.6.34-Mar11.orig/mm/page_cgroup.c
+++ mmotm-2.6.34-Mar11/mm/page_cgroup.c
@@ -284,6 +284,7 @@ static DEFINE_MUTEX(swap_cgroup_mutex);
 struct swap_cgroup_ctrl {
 	struct page **map;
 	unsigned long length;
+	spinlock_t	lock;
 };

 struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
@@ -353,16 +354,22 @@ unsigned short swap_cgroup_cmpxchg(swp_e
 	struct swap_cgroup_ctrl *ctrl;
 	struct page *mappage;
 	struct swap_cgroup *sc;
+	unsigned long flags;
+	unsigned short retval;

 	ctrl = &swap_cgroup_ctrl[type];

 	mappage = ctrl->map[idx];
 	sc = page_address(mappage);
 	sc += pos;
-	if (cmpxchg(&sc->id, old, new) == old)
-		return old;
+	spin_lock_irqsave(&ctrl->lock, flags);
+	retval = sc->id;
+	if (retval == old)
+		sc->id = new;
 	else
-		return 0;
+		retval = 0;
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+	return retval;
 }

 /**
@@ -383,13 +390,17 @@ unsigned short swap_cgroup_record(swp_en
 	struct page *mappage;
 	struct swap_cgroup *sc;
 	unsigned short old;
+	unsigned long flags;

 	ctrl = &swap_cgroup_ctrl[type];

 	mappage = ctrl->map[idx];
 	sc = page_address(mappage);
 	sc += pos;
-	old = xchg(&sc->id, id);
+	spin_lock_irqsave(&ctrl->lock, flags);
+	old = sc->id;
+	sc->id = id;
+	spin_unlock_irqrestore(&ctrl->lock, flags);

 	return old;
 }
@@ -441,6 +452,7 @@ int swap_cgroup_swapon(int type, unsigne
 	mutex_lock(&swap_cgroup_mutex);
 	ctrl->length = length;
 	ctrl->map = array;
+	spin_lock_init(&ctrl->lock);
 	if (swap_cgroup_prepare(type)) {
 		/* memory shortage */
 		ctrl->map = NULL;
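The idea of the patch, emulating a 2-byte cmpxchg/xchg with plain loads and stores under a lock, can be sketched in user space. The following is an analogy only, under stated assumptions: a pthread mutex stands in for the kernel spinlock (there is no interrupt context to worry about here), and `struct id_table`, `id_table_cmpxchg()` and `id_table_xchg()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_IDS 8	/* arbitrary table size for this sketch */

struct id_table {
	pthread_mutex_t lock;		/* plays the role of ctrl->lock */
	unsigned short ids[NR_IDS];	/* 2-byte values, like swap_cgroup ids */
};

static void id_table_init(struct id_table *t)
{
	pthread_mutex_init(&t->lock, NULL);
	for (size_t i = 0; i < NR_IDS; i++)
		t->ids[i] = 0;
}

/*
 * Store new_id if the slot currently holds old. Returns old on success
 * and 0 otherwise, following the return convention of the patch.
 */
static unsigned short id_table_cmpxchg(struct id_table *t, size_t pos,
				       unsigned short old,
				       unsigned short new_id)
{
	unsigned short retval;

	assert(pos < NR_IDS);
	pthread_mutex_lock(&t->lock);
	retval = t->ids[pos];
	if (retval == old)
		t->ids[pos] = new_id;
	else
		retval = 0;
	pthread_mutex_unlock(&t->lock);
	return retval;
}

/* Unconditionally store id and return the previous value. */
static unsigned short id_table_xchg(struct id_table *t, size_t pos,
				    unsigned short id)
{
	unsigned short old;

	assert(pos < NR_IDS);
	pthread_mutex_lock(&t->lock);
	old = t->ids[pos];
	t->ids[pos] = id;
	pthread_mutex_unlock(&t->lock);
	return old;
}
```

As in the patch, one lock covers the whole table rather than each slot, which keeps the 2-byte entries compact at the cost of serialising updates to different slots.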
Re: [PATCH 1/2][v2] mm: add notifier in pageblock isolation for balloon drivers
On Fri, 2 Oct 2009 13:44:58 -0500 Robert Jennings r...@linux.vnet.ibm.com wrote: Memory balloon drivers can allocate a large amount of memory which is not movable but could be freed to accomodate memory hotplug remove. Prior to calling the memory hotplug notifier chain the memory in the pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal for that page range to fail. Rather than failing pageblock isolation if the the migrateteype is not MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock are owned by a registered balloon driver (or other entity) using a notifier chain. If all of the non-movable pages are owned by a balloon, they can be freed later through the memory notifier chain and the range can still be isolated in set_migratetype_isolate(). Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com --- drivers/base/memory.c | 19 +++ include/linux/memory.h | 26 ++ mm/page_alloc.c| 45 ++--- 3 files changed, 83 insertions(+), 7 deletions(-) Index: b/drivers/base/memory.c === --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n } EXPORT_SYMBOL(unregister_memory_notifier); +static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain); + IIUC, this notifier is called under zone-lock. please ATOMIC_NOTIFIER_HEAD(). 
+int register_memory_isolate_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(memory_isolate_chain, nb); +} +EXPORT_SYMBOL(register_memory_isolate_notifier); + +void unregister_memory_isolate_notifier(struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(memory_isolate_chain, nb); +} +EXPORT_SYMBOL(unregister_memory_isolate_notifier); + /* * register_memory - Setup a sysfs device for a memory block */ @@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi return blocking_notifier_call_chain(memory_chain, val, v); } +int memory_isolate_notify(unsigned long val, void *v) +{ + return blocking_notifier_call_chain(memory_isolate_chain, val, v); +} + /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. Index: b/include/linux/memory.h === --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -50,6 +50,18 @@ struct memory_notify { int status_change_nid; }; +/* + * During pageblock isolation, count the number of pages in the + * range [start_pfn, start_pfn + nr_pages) + */ +#define MEM_ISOLATE_COUNT(10) + +struct memory_isolate_notify { + unsigned long start_pfn; + unsigned int nr_pages; + unsigned int pages_found; +}; Could you add commentary for each field ? 
+ struct notifier_block; struct mem_section; @@ -76,14 +88,28 @@ static inline int memory_notify(unsigned { return 0; } +static inline int register_memory_isolate_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_memory_isolate_notifier(struct notifier_block *nb) +{ +} +static inline int memory_isolate_notify(unsigned long val, void *v) +{ + return 0; +} #else extern int register_memory_notifier(struct notifier_block *nb); extern void unregister_memory_notifier(struct notifier_block *nb); +extern int register_memory_isolate_notifier(struct notifier_block *nb); +extern void unregister_memory_isolate_notifier(struct notifier_block *nb); extern int register_new_memory(int, struct mem_section *); extern int unregister_memory_section(struct mem_section *); extern int memory_dev_init(void); extern int remove_memory_block(unsigned long, struct mem_section *, int); extern int memory_notify(unsigned long val, void *v); +extern int memory_isolate_notify(unsigned long val, void *v); extern struct memory_block *find_memory_block(struct mem_section *); #define CONFIG_MEM_BLOCK_SIZE(PAGES_PER_SECTIONPAGE_SHIFT) enum mem_add_context { BOOT, HOTPLUG }; Index: b/mm/page_alloc.c === --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -48,6 +48,7 @@ #include linux/page_cgroup.h #include linux/debugobjects.h #include linux/kmemleak.h +#include linux/memory.h #include trace/events/kmem.h #include asm/tlbflush.h @@ -4985,23 +4986,53 @@ void set_pageblock_flags_group(struct pa int set_migratetype_isolate(struct page *page) { struct zone *zone; - unsigned long flags; + unsigned long flags, pfn, iter; + unsigned long immobile = 0; + struct memory_isolate_notify arg; + int notifier_ret; int ret = -EBUSY;
Re: [BUG] Linux 2.6.25-rc2 - Regression from 2.6.24-rc1-git1 softlockup while bootup on powerpc
On Sun, 17 Feb 2008 20:29:13 +0100 Jens Axboe [EMAIL PROTECTED] wrote:

> It's odd stuff. Could you perhaps try and add some printks to
> block/cfq-iosched.c:call_for_each_cic(), like dumping the 'nr' return
> from radix_tree_gang_lookup() and the pointer value of cics[i] in the
> for() loop after the lookup?

I hit the same issue on an ia64/NUMA box. It seems cics[]->key is NULL
and the index passed to radix_tree_gang_lookup() was always '1'.

The attached patch works well for me, but I don't know much about cfq.
Please confirm.

Regards,
-Kame
==
cics[]->key can be NULL. In that case, cics[]->dead_key holds the key value.

Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]

Index: linux-2.6.25-rc2/block/cfq-iosched.c
===
--- linux-2.6.25-rc2.orig/block/cfq-iosched.c
+++ linux-2.6.25-rc2/block/cfq-iosched.c
@@ -1171,7 +1171,11 @@ call_for_each_cic(struct io_context *ioc
 			break;
 
 		called += nr;
-		index = 1 + (unsigned long) cics[nr - 1]->key;
+
+		if (!cics[nr - 1]->key)
+			index = 1 + (unsigned long) cics[nr - 1]->dead_key;
+		else
+			index = 1 + (unsigned long) cics[nr - 1]->key;
 
 		for (i = 0; i < nr; i++)
 			func(ioc, cics[i]);
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
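
The failure mode described above can be reproduced with a small userspace model. This is a sketch under stated assumptions, not the cfq code: struct cic keeps only the two fields the patch touches, gang_lookup() is a hypothetical linear-scan stand-in for radix_tree_gang_lookup(), and walk() mirrors only the shape of the loop in call_for_each_cic(). An entry's tree index is assumed to be its original key, which survives in dead_key once key is cleared.

```c
#include <stddef.h>

#define GANG_MAX 16	/* hypothetical batch limit for this model */

/* Reduced stand-in for the cfq io context: only the fields the patch uses. */
struct cic {
	unsigned long key;	/* 0 models a NULL ->key */
	unsigned long dead_key;	/* holds the old key once ->key is cleared */
};

/* Assumed tree index of an entry: its original key value. */
static unsigned long tree_index(const struct cic *c)
{
	return c->key ? c->key : c->dead_key;
}

/* Model of a gang lookup over entries sorted by tree index: copy up to
 * max entries whose index is >= first, return how many were copied. */
static size_t gang_lookup(const struct cic *tree, size_t n, unsigned long first,
			  const struct cic **out, size_t max)
{
	size_t i, found = 0;

	for (i = 0; i < n && found < max; i++)
		if (tree_index(&tree[i]) >= first)
			out[found++] = &tree[i];
	return found;
}

/* Next start index as computed before the patch. */
static unsigned long next_index_old(const struct cic *last)
{
	return 1 + last->key;
}

/* Next start index as computed by the patch. */
static unsigned long next_index_fixed(const struct cic *last)
{
	if (!last->key)
		return 1 + last->dead_key;
	return 1 + last->key;
}

/* Walk the tree in batches. Returns the number of lookups performed,
 * or -1 if more than limit lookups would be needed (a stuck walk). */
static int walk(const struct cic *tree, size_t n, size_t batch,
		unsigned long (*next)(const struct cic *), int limit)
{
	const struct cic *out[GANG_MAX];
	unsigned long index = 0;
	int rounds = 0;
	size_t nr;

	if (batch > GANG_MAX)
		batch = GANG_MAX;
	do {
		if (rounds++ >= limit)
			return -1;
		nr = gang_lookup(tree, n, index, out, batch);
		if (!nr)
			break;
		index = next(out[nr - 1]);
	} while (nr == batch);
	return rounds;
}
```

With three entries at indices 10, 20 and 30, where the entry at 20 has a cleared key, and a batch size of 2, the old computation resets index to 1 after every full batch, so the walk never advances. The patched computation moves on to index 21 and finishes in two lookups.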
Re: [BUG] Linux 2.6.25-rc2 - Regression from 2.6.24-rc1-git1 softlockup while bootup on powerpc
On Tue, 19 Feb 2008 09:36:34 +0100 Jens Axboe [EMAIL PROTECTED] wrote:

> On Tue, Feb 19 2008, KAMEZAWA Hiroyuki wrote:
> > On Sun, 17 Feb 2008 20:29:13 +0100 Jens Axboe [EMAIL PROTECTED] wrote:
> > > It's odd stuff. Could you perhaps try and add some printks to
> > > block/cfq-iosched.c:call_for_each_cic(), like dumping the 'nr'
> > > return from radix_tree_gang_lookup() and the pointer value of
> > > cics[i] in the for() loop after the lookup?
> >
> > I met the same issue on ia64/NUMA box. seems cics[]->key is NULL and
> > index for radix_tree_gang_lookup() was always '1'.
>
> Why does it keep repeating then? If ->key is NULL, the next lookup
> index should be 1UL.

When I inserted a printk here
==
	for (i = 0; i < nr; i++)
		func(ioc, cics[i]);
	printk("%d %lx\n", nr, index);
==
index was always 1 and nr was always 32. So, cics[31]->key was always
NULL when index=1 is passed to radix_tree_gang_lookup().

> But I think the radix 'scan over entire tree' is a bit fragile. This
> patch adds a parallel hlist for ease of properly browsing the members,
> does that work for you? It compiles, but I haven't booted it here
> yet...

Will try. Please wait a bit.

Thanks,
-Kame
Re: [BUG] Linux 2.6.25-rc2 - Regression from 2.6.24-rc1-git1 softlockup while bootup on powerpc
On Tue, 19 Feb 2008 09:58:38 +0100 Jens Axboe [EMAIL PROTECTED] wrote:

> > when I inserted printk here
> > ==
> > 	for (i = 0; i < nr; i++)
> > 		func(ioc, cics[i]);
> > 	printk("%d %lx\n", nr, index);
> > ==
> > index was always 1 and nr was always 32. So, cics[31]->key was
> > always NULL when index=1 is passed to radix_tree_gang_lookup().
>
> Hang on, it returned 32? It should not return more than 16, since that
> is what we have room for and asked for.

Sorry. Of course, it was 16 ;(

Your patch works well. Thank you.
-Kame
Re: [BUG] Linux 2.6.25-rc2 - Regression from 2.6.24-rc1-git1 softlockup while bootup on powerpc
On Tue, 19 Feb 2008 09:36:34 +0100 Jens Axboe [EMAIL PROTECTED] wrote:

> On Tue, Feb 19 2008, KAMEZAWA Hiroyuki wrote:
> > On Sun, 17 Feb 2008 20:29:13 +0100 Jens Axboe [EMAIL PROTECTED] wrote:
> > > It's odd stuff. Could you perhaps try and add some printks to
> > > block/cfq-iosched.c:call_for_each_cic(), like dumping the 'nr'
> > > return from radix_tree_gang_lookup() and the pointer value of
> > > cics[i] in the for() loop after the lookup?
> >
> > I met the same issue on ia64/NUMA box. seems cics[]->key is NULL and
> > index for radix_tree_gang_lookup() was always '1'.
>
> Why does it keep repeating then? If ->key is NULL, the next lookup
> index should be 1UL.
>
> But I think the radix 'scan over entire tree' is a bit fragile. This
> patch adds a parallel hlist for ease of properly browsing the members,
> does that work for you? It compiles, but I haven't booted it here
> yet...

Works well for me, and my box booted!

Thanks,
-Kame
Re: [RFC] hotplug memory remove - walk_memory_resource for ppc64
On Wed, 31 Oct 2007 08:02:40 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote:

> Paul's concern is, since we didn't need it so far - why we need this
> for hotplug memory remove to work ? It might break API for *unknown*
> applications. Its unfortunate that, hotplug memory add updates
> /proc/iomem. We can deal with it later, as a separate patch.

I have no objection to skipping the /proc/iomem-related routine when the
arch doesn't need it. My only advice is: please take care of both
hot-add and hot-remove. If the ppc64 people agree to use an
arch-specific routine to detect conventional memory, there is no
problem, I think.

Thanks,
-Kame
Re: [PATCH 1/3] Add remove_memory() for ppc64
On Wed, 31 Oct 2007 14:55:03 -0700 Dave Hansen [EMAIL PROTECTED] wrote:

> On Wed, 2007-10-31 at 14:11 -0800, Badari Pulavarty wrote:
> > Well, We don't need arch-specific remove_memory() for ia64 and
> > ppc64. x86_64, I don't know. We will know, only when some one does
> > the verification. I don't need arch_remove_memory() hook also at
> > this time.
>
> I wasn't being very clear. I say, add the arch hook only if you need
> it. But, for now, just take the ia64 code and make it generic.

remove_memory() has been arch-specific since the time there was no
unplug code at all, and I didn't make it generic when I implemented the
ia64 version.

Hmm... I have no objection to merging them. But let's see how memory
hot-remove for ppc64 works for a while. We can merge them later.

I'm glad to have new testers :)

Thanks,
-Kame
Re: [RFC] hotplug memory remove - walk_memory_resource for ppc64
On Wed, 31 Oct 2007 14:28:46 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

ioresource was a good structure for remembering which memory is conventional memory, and i386/x86_64/ia64 registered conventional memory as System RAM when I posted the patch. (That is to say, System RAM is not there for memory hotplug.) If I remember correctly, System RAM is for kdump (to know which memory should be dumped). So memory hot-add/remove has to modify it anyway.

Thanks,
-Kame
Re: [RFC] PPC64 Exporting memory information through /proc/iomem
On Wed, 03 Oct 2007 08:35:35 -0700 Badari Pulavarty [EMAIL PROTECTED] wrote:

> On Wed, 2007-10-03 at 10:19 +0900, KAMEZAWA Hiroyuki wrote:
> CONFIG_ARCH_HAS_VALID_MEMORY_RANGE. Then define own
> find_next_system_ram() (rename to is_valid_memory_range()) - which
> checks the given range is a valid memory range for memory-remove or
> not. What do you think ?

My concern is this: memory hot *add* currently uses the resource
(/proc/iomem) information for onlining memory. (See
add_memory()->register_memory_resource() in mm/memory_hotplug.c.) So
we'll have to consider changing that too, if needed.

Does PPC64 memory hot-add register new memory information in an
arch-dependent information list? It seems ppc64 takes hot-added memory
information from the *probe* file and registers it via
add_memory()->register_memory_resource().

If you handle all of add/remove/walk of system RAM information in a sane
way, I have no objection.

I like find_next_system_ram() because I spent some amount of time
debugging it ;)

Thanks,
-Kame
Re: [RFC] PPC64 Exporting memory information through /proc/iomem
On Tue, 02 Oct 2007 16:10:53 -0700 Badari Pulavarty [EMAIL PROTECTED] wrote:

> Otherwise, we need to add arch-specific hooks in hotplug-remove code
> to be able to do this.
>
> Isn't it just a matter of abstracting the test for a valid range of
> memory? If it's really hard to abstract that, then I guess we can put
> RAM in iomem_resource, but I'd rather not.
>
> Sure. I will work on it and see how ugly it looks. KAME, are you okay
> with abstracting the find_next_system_ram() and let arch provide
> whatever implementation they want ? (since current code doesn't work
> for x86-64 also ?).

Hmm, is registering in /proc/iomem complicated? If it is too
complicated, adding a config option like
CONFIG_ARCH_SUPPORT_IORESOURCE_RAM or something similar would do the
job. You can then define your own check_pages_isolated() (you can rename
it to arch_check_pages_isolated()).

BTW, I should ask people how to describe conventional memory:

A. #define IORESOURCE_RAM IORESOURCE_MEM (ia64)
B. #define IORESOURCE_RAM IORESOURCE_MEM | IORESOURCE_BUSY (i386, x86_64)

Sad to say, memory hot-add registers new memory just as IORESOURCE_MEM.

Thanks,
-Kame
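
The practical consequence of the A/B difference can be shown with a small userspace sketch. Everything here is an assumption made for illustration: struct ram_res, find_ram() and demo_tbl are hypothetical, the flag values are illustrative rather than taken from the thread, and the exact-match rule is a simplified model of a find_next_system_ram()-style walk.

```c
#include <stddef.h>

/* Illustrative flag values; the kernel defines its own in ioport.h. */
#define IORESOURCE_MEM  0x00000200UL
#define IORESOURCE_BUSY 0x80000000UL

/* Hypothetical reduced resource entry: an inclusive range plus flags. */
struct ram_res {
	unsigned long start, end;
	unsigned long flags;
};

/* Model of a system-RAM walk: return the first entry that overlaps
 * [start, end] and whose flags equal want exactly, or NULL if none. */
static const struct ram_res *find_ram(const struct ram_res *tbl, size_t n,
				      unsigned long start, unsigned long end,
				      unsigned long want)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].flags != want)
			continue;
		if (tbl[i].end < start || tbl[i].start > end)
			continue;
		return &tbl[i];
	}
	return NULL;
}

/* Boot memory registered as in B, hot-added memory as plain IORESOURCE_MEM. */
static const struct ram_res demo_tbl[] = {
	{ 0x0000, 0x0fff, IORESOURCE_MEM | IORESOURCE_BUSY },	/* boot RAM */
	{ 0x1000, 0x1fff, IORESOURCE_MEM },			/* hot-added RAM */
};
```

In this model a walk that looks for IORESOURCE_MEM | IORESOURCE_BUSY finds the boot range but returns NULL for the hot-added range 0x1000 to 0x1fff, so a remove path relying on definition B would not see memory that hot-add registered as IORESOURCE_MEM only.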