[PATCH v6 07/15] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
In __remove_section(), we lock pgdat_resize_lock when calling
sparse_remove_one_section(). This lock disables irqs. But we don't need to
lock the whole function. If we do some work to free pagetables in
free_section_usemap(), we need to call flush_tlb_all(), which needs irqs
enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many() will be
triggered.

If we lock the whole sparse_remove_one_section(), we come to this call trace:

[ 454.796248] ------------[ cut here ]------------
[ 454.851408] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
[ 454.935620] Hardware name: PRIMEQUEST 1800E
..
[ 455.652201] Call Trace:
[ 455.681391] [8106e73f] warn_slowpath_common+0x7f/0xc0
[ 455.753151] [810560a0] ? leave_mm+0x50/0x50
[ 455.814527] [8106e79a] warn_slowpath_null+0x1a/0x20
[ 455.884208] [810e7a9d] smp_call_function_many+0xbd/0x260
[ 455.959082] [810e7ecb] smp_call_function+0x3b/0x50
[ 456.027722] [810560a0] ? leave_mm+0x50/0x50
[ 456.089098] [810e7f4b] on_each_cpu+0x3b/0xc0
[ 456.151512] [81055f0c] flush_tlb_all+0x1c/0x20
[ 456.216004] [8104f8de] remove_pagetable+0x14e/0x1d0
[ 456.285683] [8104f978] vmemmap_free+0x18/0x20
[ 456.349139] [811b8797] sparse_remove_one_section+0xf7/0x100
[ 456.427126] [811c5fc2] __remove_section+0xa2/0xb0
[ 456.494726] [811c6070] __remove_pages+0xa0/0xd0
[ 456.560258] [81669c7b] arch_remove_memory+0x6b/0xc0
[ 456.629937] [8166ad28] remove_memory+0xb8/0xf0
[ 456.694431] [813e686f] acpi_memory_device_remove+0x53/0x96
[ 456.771379] [813b33c4] acpi_device_remove+0x90/0xb2
[ 456.841059] [8144b02c] __device_release_driver+0x7c/0xf0
[ 456.915928] [8144b1af] device_release_driver+0x2f/0x50
[ 456.988719] [813b4476] acpi_bus_remove+0x32/0x6d
[ 457.055285] [813b4542] acpi_bus_trim+0x91/0x102
[ 457.120814] [813b463b] acpi_bus_hot_remove_device+0x88/0x16b
[ 457.199840] [813afda7] acpi_os_execute_deferred+0x27/0x34
[ 457.275756] [81091ece] process_one_work+0x20e/0x5c0
[ 457.345434] [81091e5f] ? process_one_work+0x19f/0x5c0
[ 457.417190] [813afd80] ? acpi_os_wait_events_complete+0x23/0x23
[ 457.499332] [81093f6e] worker_thread+0x12e/0x370
[ 457.565896] [81093e40] ? manage_workers+0x180/0x180
[ 457.635574] [8109a09e] kthread+0xee/0x100
[ 457.694871] [810dfaf9] ? __lock_release+0x129/0x190
[ 457.764552] [81099fb0] ? __init_kthread_worker+0x70/0x70
[ 457.839427] [81690aac] ret_from_fork+0x7c/0xb0
[ 457.903914] [81099fb0] ? __init_kthread_worker+0x70/0x70
[ 457.978784] ---[ end trace 25e85300f542aa01 ]---

Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <la...@cn.fujitsu.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
---
 mm/memory_hotplug.c |    4 ----
 mm/sparse.c         |    5 ++++-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0682d2a..674e791 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -442,8 +442,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 #else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
-    unsigned long flags;
-    struct pglist_data *pgdat = zone->zone_pgdat;
     int ret = -EINVAL;

     if (!valid_section(ms))
@@ -453,9 +451,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
     if (ret)
         return ret;

-    pgdat_resize_lock(pgdat, &flags);
     sparse_remove_one_section(zone, ms);
-    pgdat_resize_unlock(pgdat, &flags);

     return 0;
 }
 #endif
diff --git a/mm/sparse.c b/mm/sparse.c
index aadbb2a..05ca73a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -796,8 +796,10 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
     struct page *memmap = NULL;
-    unsigned long *usemap = NULL;
+    unsigned long *usemap = NULL, flags;
+    struct pglist_data *pgdat = zone->zone_pgdat;

+    pgdat_resize_lock(pgdat, &flags);
     if (ms->section_mem_map) {
         usemap = ms->pageblock_flags;
         memmap = sparse_decode_mem_map(ms->section_mem_map,
@@ -805,6 +807,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
         ms->section_mem_map = 0;
         ms->pageblock_flags = NULL;
     }
+    pgdat_resize_unlock(pgdat, &flags);

     clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
     free_section_usemap(memmap, usemap);
-- 
1.7.1
___
Linuxppc-dev mailing list
[PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
From: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>

When (hot)adding memory into the system, /sys/firmware/memmap/X/{end, start,
type} sysfs files are created. But there is no code to remove these files.
This patch implements the function to remove them.

Note: The code does not free firmware_map_entry structures that were
allocated by bootmem, so the patch introduces a memory leak. But the leaked
size is very small and it does not affect the system.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
---
 drivers/firmware/memmap.c    |   96 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/firmware-map.h |    6 +++
 mm/memory_hotplug.c          |    5 ++-
 3 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 90723e6..4211da5 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm.h>

 /*
  * Data types ------------------------------------------------------------------
@@ -79,7 +80,26 @@ static const struct sysfs_ops memmap_attr_ops = {
     .show = memmap_attr_show,
 };

+static inline struct firmware_map_entry *
+to_memmap_entry(struct kobject *kobj)
+{
+    return container_of(kobj, struct firmware_map_entry, kobj);
+}
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+    struct firmware_map_entry *entry = to_memmap_entry(kobj);
+
+    if (PageReserved(virt_to_page(entry)))
+        /* There is no way to free memory allocated from bootmem */
+        return;
+
+    kfree(entry);
+}
+
 static struct kobj_type memmap_ktype = {
+    .release    = release_firmware_map_entry,
     .sysfs_ops    = &memmap_attr_ops,
     .default_attrs    = def_attrs,
 };
@@ -94,6 +114,7 @@ static struct kobj_type memmap_ktype = {
  * in firmware initialisation code in one single thread of execution.
  */
 static LIST_HEAD(map_entries);
+static DEFINE_SPINLOCK(map_entries_lock);

 /**
  * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
@@ -118,11 +139,25 @@ static int firmware_map_add_entry(u64 start, u64 end,
     INIT_LIST_HEAD(&entry->list);
     kobject_init(&entry->kobj, &memmap_ktype);
+    spin_lock(&map_entries_lock);
     list_add_tail(&entry->list, &map_entries);
+    spin_unlock(&map_entries_lock);

     return 0;
 }

+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+    spin_lock(&map_entries_lock);
+    list_del(&entry->list);
+    spin_unlock(&map_entries_lock);
+}
+
 /*
  * Add memmap entry on sysfs
  */
@@ -144,6 +179,35 @@ static int add_sysfs_fw_map_entry(struct firmware_map_entry *entry)
     return 0;
 }

+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+    kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+    struct firmware_map_entry *entry;
+
+    spin_lock(&map_entries_lock);
+    list_for_each_entry(entry, &map_entries, list)
+        if ((entry->start == start) && (entry->end == end) &&
+            (!strcmp(entry->type, type))) {
+            spin_unlock(&map_entries_lock);
+            return entry;
+        }
+
+    spin_unlock(&map_entries_lock);
+    return NULL;
+}
+
 /**
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
@@ -196,6 +260,32 @@ int __init firmware_map_add_early(u64 start, u64 end, const char *type)
     return firmware_map_add_entry(start, end, type, entry);
 }

+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end:   End of the memory range.
+ * @type:  Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+    struct firmware_map_entry *entry;
+
+    entry = firmware_map_find_entry(start, end - 1, type);
+    if (!entry)
+        return -EINVAL;
+
+    firmware_map_remove_entry(entry);
+
+    /* remove the memmap entry */
+    remove_sysfs_fw_map_entry(entry);
+
+    return 0;
+}
+
 /*
  * Sysfs functions -------------------------------------------------------------
  */
@@ -217,8 +307,10 @@ static ssize_t type_show(struct firmware_map_entry *entry, char *buf)
     return snprintf(buf,
[PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence
From: Wen Congyang <we...@cn.fujitsu.com>

Memory can't be offlined when CONFIG_MEMCG is selected. For example: there is
a memory device on node 1, with address range [1G, 1.5G). You will find 4 new
directories, memory8, memory9, memory10 and memory11, under the directory
/sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store page cgroups when we
online pages. When we online memory8, the memory storing its page cgroups is
not provided by this memory device. But when we online memory9, the memory
storing its page cgroups may be provided by memory8. So we can't offline
memory8 now. We should offline the memory in the reverse order.

When the memory device is hot-removed, we automatically offline the memory
provided by this memory device. But we don't know which memory was onlined
first, so offlining memory may fail. In such a case, iterate twice to offline
the memory:
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

This idea was suggested by KOSAKI Motohiro.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d04ed87..62e04c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
     unsigned long start_pfn, end_pfn;
     unsigned long pfn, section_nr;
     int ret;
+    int return_on_error = 0;
+    int retry = 0;

     start_pfn = PFN_DOWN(start);
     end_pfn = start_pfn + PFN_DOWN(size);

+repeat:
     for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
         section_nr = pfn_to_section_nr(pfn);
         if (!present_section_nr(section_nr))
@@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)

         ret = offline_memory_block(mem);
         if (ret) {
-            kobject_put(&mem->dev.kobj);
-            return ret;
+            if (return_on_error) {
+                kobject_put(&mem->dev.kobj);
+                return ret;
+            } else {
+                retry = 1;
+            }
         }
     }

     if (mem)
         kobject_put(&mem->dev.kobj);

+    if (retry) {
+        return_on_error = 1;
+        goto repeat;
+    }
+
     return 0;
 }
 #else
-- 
1.7.1
[PATCH v6 03/15] memory-hotplug: remove redundant codes
From: Wen Congyang <we...@cn.fujitsu.com>

Offlining memory blocks and checking whether memory blocks are offlined are
very similar. This patch introduces a new function to remove the redundant
code.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
---
 mm/memory_hotplug.c |  129 ++++++++++++++++++++++++++++++---------------------
 1 files changed, 82 insertions(+), 47 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5808045..69d62eb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1381,20 +1381,26 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
     return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
 }

-int remove_memory(u64 start, u64 size)
+/**
+ * walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
+ * @start_pfn: start pfn of the memory range
+ * @end_pfn: end pfn of the memory range
+ * @arg: argument passed to func
+ * @func: callback for each memory section walked
+ *
+ * This function walks through all present mem sections in range
+ * [start_pfn, end_pfn) and calls func on each mem section.
+ *
+ * Returns the return value of func.
+ */
+static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
+        void *arg, int (*func)(struct memory_block *, void *))
 {
     struct memory_block *mem = NULL;
     struct mem_section *section;
-    unsigned long start_pfn, end_pfn;
     unsigned long pfn, section_nr;
     int ret;
-    int return_on_error = 0;
-    int retry = 0;
-
-    start_pfn = PFN_DOWN(start);
-    end_pfn = start_pfn + PFN_DOWN(size);

-repeat:
     for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
         section_nr = pfn_to_section_nr(pfn);
         if (!present_section_nr(section_nr))
@@ -1411,22 +1417,76 @@ repeat:
         if (!mem)
             continue;

-        ret = offline_memory_block(mem);
+        ret = func(mem, arg);
         if (ret) {
-            if (return_on_error) {
-                kobject_put(&mem->dev.kobj);
-                return ret;
-            } else {
-                retry = 1;
-            }
+            kobject_put(&mem->dev.kobj);
+            return ret;
         }
     }

     if (mem)
         kobject_put(&mem->dev.kobj);

-    if (retry) {
-        return_on_error = 1;
+    return 0;
+}
+
+/**
+ * offline_memory_block_cb - callback function for offlining memory block
+ * @mem: the memory block to be offlined
+ * @arg: buffer to hold error msg
+ *
+ * Always return 0, and put the error msg in arg if any.
+ */
+static int offline_memory_block_cb(struct memory_block *mem, void *arg)
+{
+    int *ret = arg;
+    int error = offline_memory_block(mem);
+
+    if (error != 0 && *ret == 0)
+        *ret = error;
+
+    return 0;
+}
+
+static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
+{
+    int ret = !is_memblock_offlined(mem);
+
+    if (unlikely(ret))
+        pr_warn("removing memory fails, because memory "
+            "[%#010llx-%#010llx] is onlined\n",
+            PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+            PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
+
+    return ret;
+}
+
+int remove_memory(u64 start, u64 size)
+{
+    unsigned long start_pfn, end_pfn;
+    int ret = 0;
+    int retry = 1;
+
+    start_pfn = PFN_DOWN(start);
+    end_pfn = start_pfn + PFN_DOWN(size);
+
+    /*
+     * When CONFIG_MEMCG is on, one memory block may be used by other
+     * blocks to store page cgroup when onlining pages. But we don't know
+     * in what order pages are onlined. So we iterate twice to offline
+     * memory:
+     * 1st iterate: offline every non primary memory block.
+     * 2nd iterate: offline primary (i.e. first added) memory block.
+     */
+repeat:
+    walk_memory_range(start_pfn, end_pfn, &ret,
+              offline_memory_block_cb);
+    if (ret) {
+        if (!retry)
+            return ret;
+
+        retry = 0;
+        ret = 0;
         goto repeat;
     }

@@ -1444,38 +1504,13 @@ repeat:
      * memory blocks are offlined.
      */

-    mem = NULL;
-    for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-        section_nr = pfn_to_section_nr(pfn);
-        if (!present_section_nr(section_nr))
-            continue;
-
-        section = __nr_to_section(section_nr);
-        /* same memblock? */
-        if (mem)
-            if ((section_nr >= mem->start_section_nr) &&
[PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Here is the physical memory hot-remove patch-set based on 3.8-rc2.

This patch-set aims to implement physical memory hot-removing. The patches
can free/remove the following things:

  - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
  - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
  - page table of removed memory              : [RFC PATCH 7,8,10/15]
  - node and related sysfs files              : [RFC PATCH 13-15/15]

Existing problem:
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroups
when we online pages.

For example: there is a memory device on node 1, with address range
[1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10 and
memory11, under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory storing its
page cgroups is not provided by this memory device. But when we online
memory9, the memory storing its page cgroups may be provided by memory8. So
we can't offline memory8 now. We should offline the memory in the reverse
order.

When the memory device is hot-removed, we will automatically offline the
memory provided by this memory device. But we don't know which memory was
onlined first, so offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

And a new idea from Wen Congyang <we...@cn.fujitsu.com> is: allocate the
memory from the memory block it describes. But we are not sure whether it is
OK to do so, because there is no existing API to do it, and we would need to
move the page_cgroup memory allocation from MEM_GOING_ONLINE to MEM_ONLINE.
It may also interfere with hugepages.

How to test this patchset?
1. Apply this patchset and build the kernel. MEMORY_HOTPLUG,
   MEMORY_HOTREMOVE and ACPI_HOTPLUG_MEMORY must be selected.
2. Load the module acpi_memhotplug.
3. Hotplug the memory device (it depends on your hardware). You will see
   the memory device under the directory /sys/bus/acpi/devices/. Its name
   is PNP0C80:XX.
4. Online/offline pages provided by this memory device. You can write
   online/offline to /sys/devices/system/memory/memoryX/state to
   online/offline the pages provided by this memory device.
5. Hot-remove the memory device. You can hot-remove the memory device by
   the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.

Changelogs from v5 to v6:
 Patch3: Add some more comments to explain memory hot-remove.
 Patch4: Remove bootmem member in struct firmware_map_entry.
 Patch6: Repeatedly register bootmem pages when using hugepage.
 Patch8: Repeatedly free bootmem pages when using hugepage.
 Patch14: Don't free pgdat when offlining a node, just reset it to 0.
 Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
          one when onlining a node.

Changelogs from v4 to v5:
 Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section()
         to avoid disabling irqs, because we need to flush the TLB when
         freeing pagetables.
 Patch8: new patch, pick up some common APIs that are used to free
         direct-mapping and vmemmap pagetables.
 Patch9: free direct-mapping pagetables on the x86_64 arch.
 Patch10: free vmemmap pagetables.
 Patch11: since freeing memmap with vmemmap has been implemented, the config
          macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section()
          is no longer needed.
 Patch13: no need to modify acpi_memory_disable_device() since it was
          removed, and add a nid parameter when calling remove_memory().

Changelogs from v3 to v4:
 Patch7: remove unused code.
 Patch8: fix nr_pages that is passed to free_map_bootmem().

Changelogs from v2 to v3:
 Patch9: call sync_global_pgds() if the pgd is changed.
 Patch10: fix a problem in the patch.

Changelogs from v1 to v2:
 Patch1: new patch, offline memory twice.
         1st iteration: offline every non-primary memory block.
         2nd iteration: offline the primary (i.e. first added) memory block.
 Patch3: new patch, no logical change, just remove redundant code.
 Patch9: merge the patch from wujianguo into this patch. Flush the TLB on
         all cpus after the pagetable is changed.
 Patch12: new patch, free node_data when a node is offlined.

Tang Chen (6):
  memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.
  memory-hotplug: remove sysfs file of node
  memory-hotplug: Do not allocate pdgat if it was not freed when offline.

Wen Congyang (5):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug:
[PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
From: Wen Congyang <we...@cn.fujitsu.com>

For removing memory, we need to remove the page table. But this depends on
the architecture, so the patch introduces arch_remove_memory() for removing
the page table. Now it only calls __remove_pages().

Note: __remove_pages() is not implemented for some architectures (I don't
know how to implement it for s390).

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
---
 arch/ia64/mm/init.c            |   18 ++++++++++++++++++
 arch/powerpc/mm/mem.c          |   12 ++++++++++++
 arch/s390/mm/init.c            |   12 ++++++++++++
 arch/sh/mm/init.c              |   17 +++++++++++++++++
 arch/tile/mm/init.c            |    8 ++++++++
 arch/x86/mm/init_32.c          |   12 ++++++++++++
 arch/x86/mm/init_64.c          |   15 +++++++++++++++
 include/linux/memory_hotplug.h |    1 +
 mm/memory_hotplug.c            |    2 ++
 9 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b755ea9..20bc967 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -688,6 +688,24 @@ int arch_add_memory(int nid, u64 start, u64 size)

     return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+    unsigned long start_pfn = start >> PAGE_SHIFT;
+    unsigned long nr_pages = size >> PAGE_SHIFT;
+    struct zone *zone;
+    int ret;
+
+    zone = page_zone(pfn_to_page(start_pfn));
+    ret = __remove_pages(zone, start_pfn, nr_pages);
+    if (ret)
+        pr_warn("%s: Problem encountered in __remove_pages() as "
+            "ret=%d\n", __func__, ret);
+
+    return ret;
+}
+#endif
 #endif

 /*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..09c6451 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start, u64 size)

     return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+    unsigned long start_pfn = start >> PAGE_SHIFT;
+    unsigned long nr_pages = size >> PAGE_SHIFT;
+    struct zone *zone;
+
+    zone = page_zone(pfn_to_page(start_pfn));
+    return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */

 /*
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ae672f4..49ce6bb 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -228,4 +228,16 @@ int arch_add_memory(int nid, u64 start, u64 size)
         vmem_remove_mapping(start, size);
     return rc;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+    /*
+     * There is no hardware or firmware interface which could trigger a
+     * hot memory remove on s390. So there is nothing that needs to be
+     * implemented.
+     */
+    return -EBUSY;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 82cc576..1057940 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+    unsigned long start_pfn = start >> PAGE_SHIFT;
+    unsigned long nr_pages = size >> PAGE_SHIFT;
+    struct zone *zone;
+    int ret;
+
+    zone = page_zone(pfn_to_page(start_pfn));
+    ret = __remove_pages(zone, start_pfn, nr_pages);
+    if (unlikely(ret))
+        pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
+            ret);
+
+    return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..2749515 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
 {
     return -EINVAL;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+    /* TODO */
+    return -EBUSY;
+}
+#endif
 #endif

 struct kmem_cache *pgd_cache;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 745d66b..3166e78 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -836,6 +836,18 @@ int arch_add_memory(int nid, u64 start, u64 size)

     return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+    unsigned long start_pfn = start >> PAGE_SHIFT;
+    unsigned long nr_pages = size >> PAGE_SHIFT;
+    struct zone *zone;
+
+    zone = page_zone(pfn_to_page(start_pfn));
+    return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif

 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e779e0b..f78509c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,21 @@ int
[PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Wen Congyang <we...@cn.fujitsu.com>

When memory is removed, the corresponding pagetables should also be removed.
This patch introduces some common APIs to support vmemmap pagetable and
x86_64 architecture pagetable removing.

All pages of virtual mapping in removed memory cannot be freed if some pages
used as PGD/PUD include not only removed memory but also other memory. So the
patch uses the following way to check whether a page can be freed or not:
1. When removing memory, the page structs of the removed memory are filled
   with 0xFD.
2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
   can be cleared. In this case, the page used as PT/PMD can be freed.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujian...@huawei.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
---
 arch/x86/include/asm/pgtable_types.h |    1 +
 arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   47 +++---
 include/linux/bootmem.h              |    1 +
 4 files changed, 326 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3c32db8..4b6fd2a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);

 #endif	/* !__ASSEMBLY__ */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9ac1723..fe01116 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);

+#define PAGE_INUSE 0xFD
+
+static void __meminit free_pagetable(struct page *page, int order)
+{
+    struct zone *zone;
+    bool bootmem = false;
+    unsigned long magic;
+    unsigned int nr_pages = 1 << order;
+
+    /* bootmem page has reserved flag */
+    if (PageReserved(page)) {
+        __ClearPageReserved(page);
+        bootmem = true;
+
+        magic = (unsigned long)page->lru.next;
+        if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+            while (nr_pages--)
+                put_page_bootmem(page++);
+        } else
+            __free_pages_bootmem(page, order);
+    } else
+        free_pages((unsigned long)page_address(page), order);
+
+    /*
+     * SECTION_INFO pages and MIX_SECTION_INFO pages
+     * are all allocated by bootmem.
+     */
+    if (bootmem) {
+        zone = page_zone(page);
+        zone_span_writelock(zone);
+        zone->present_pages += nr_pages;
+        zone_span_writeunlock(zone);
+        totalram_pages += nr_pages;
+    }
+}
+
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+    pte_t *pte;
+    int i;
+
+    for (i = 0; i < PTRS_PER_PTE; i++) {
+        pte = pte_start + i;
+        if (pte_val(*pte))
+            return;
+    }
+
+    /* free a pte table */
+    free_pagetable(pmd_page(*pmd), 0);
+    spin_lock(&init_mm.page_table_lock);
+    pmd_clear(pmd);
+    spin_unlock(&init_mm.page_table_lock);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+    pmd_t *pmd;
+    int i;
+
+    for (i = 0; i < PTRS_PER_PMD; i++) {
+        pmd = pmd_start + i;
+        if (pmd_val(*pmd))
+            return;
+    }
+
+    /* free a pmd table */
+    free_pagetable(pud_page(*pud), 0);
+    spin_lock(&init_mm.page_table_lock);
+    pud_clear(pud);
+    spin_unlock(&init_mm.page_table_lock);
+}
+
+/* Return true if pgd is changed, otherwise return false. */
+static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+    pud_t *pud;
+    int i;
+
+    for (i = 0; i < PTRS_PER_PUD; i++) {
+        pud = pud_start + i;
+        if (pud_val(*pud))
+            return false;
+    }
+
+    /* free a pud table */
+    free_pagetable(pgd_page(*pgd), 0);
+    spin_lock(&init_mm.page_table_lock);
+    pgd_clear(pgd);
+    spin_unlock(&init_mm.page_table_lock);
+
+    return true;
+}
+
+static void __meminit
+remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
+         bool direct)
+{
+    unsigned long next, pages = 0;
+    pte_t *pte;
+    void *page_addr;
+    phys_addr_t phys_addr;
+
+    pte = pte_start + pte_index(addr);
+    for (; addr < end; addr = next, pte++) {
+        next = (addr + PAGE_SIZE) & PAGE_MASK;
+        if
[PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
From: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>

For removing the memmap region of sparse-vmemmap which was allocated from
bootmem, the memmap region needs to be registered by get_page_bootmem(). So
this patch searches the pages of the virtual mapping and registers them by
get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390
and sparc.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>
Reviewed-by: Wu Jianguo <wujian...@huawei.com>
---
 arch/ia64/mm/discontig.c       |    6 ++++
 arch/powerpc/mm/init_64.c      |    6 ++++
 arch/s390/mm/vmem.c            |    6 ++++
 arch/sparc/mm/init_64.c        |    6 ++++
 arch/x86/mm/init_64.c          |   58 ++++++++++++++++++++++++++++++++++++++++
 include/linux/memory_hotplug.h |   11 +++++--
 include/linux/mm.h             |    3 +-
 mm/memory_hotplug.c            |   33 +++++++++++++++------
 8 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index c641333..33943db 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 {
     return vmemmap_populate_basepages(start_page, size, node);
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+                  struct page *start_page, unsigned long size)
+{
+    /* TODO */
+}
 #endif
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..6466440 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,11 @@ int __meminit vmemmap_populate(struct page *start_page,

     return 0;
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+                  struct page *start_page, unsigned long size)
+{
+    /* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..2c14bc2 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,12 @@ out:
     return ret;
 }

+void register_page_bootmem_memmap(unsigned long section_nr,
+                  struct page *start_page, unsigned long size)
+{
+    /* TODO */
+}
+
 /*
  * Add memory segment to the segment list if it doesn't overlap with
  * an already present segment.
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index c3b7242..1f30db3 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2231,6 +2231,12 @@ void __meminit vmemmap_populate_print_last(void)
         node_start = 0;
     }
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+                  struct page *start_page, unsigned long size)
+{
+    /* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */

 static void prot_init_common(unsigned long page_none,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f78509c..9ac1723 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1000,6 +1000,64 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
     return 0;
 }

+void register_page_bootmem_memmap(unsigned long section_nr,
+                  struct page *start_page, unsigned long size)
+{
+    unsigned long addr = (unsigned long)start_page;
+    unsigned long end = (unsigned long)(start_page + size);
+    unsigned long next;
+    pgd_t *pgd;
+    pud_t *pud;
+    pmd_t *pmd;
+    unsigned int nr_pages;
+    struct page *page;
+
+    for (; addr < end; addr = next) {
+        pte_t *pte = NULL;
+
+        pgd = pgd_offset_k(addr);
+        if (pgd_none(*pgd)) {
+            next = (addr + PAGE_SIZE) & PAGE_MASK;
+            continue;
+        }
+        get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
+
+        pud = pud_offset(pgd, addr);
+        if (pud_none(*pud)) {
+            next = (addr + PAGE_SIZE) & PAGE_MASK;
+            continue;
+        }
+        get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
+
+        if (!cpu_has_pse) {
+            next = (addr + PAGE_SIZE) & PAGE_MASK;
+            pmd = pmd_offset(pud, addr);
+            if (pmd_none(*pmd))
+                continue;
+            get_page_bootmem(section_nr, pmd_page(*pmd),
+                     MIX_SECTION_INFO);
+
+            pte = pte_offset_kernel(pmd, addr);
+            if (pte_none(*pte))
+                continue;
+            get_page_bootmem(section_nr, pte_page(*pte),
+                     SECTION_INFO);
+        } else {
+            next = pmd_addr_end(addr, end);
+
[PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com We remove the memory like this: 1. lock memory hotplug 2. offline a memory block 3. unlock memory hotplug 4. repeat 1-3 to offline all memory blocks 5. lock memory hotplug 6. remove memory(TODO) 7. unlock memory hotplug All memory blocks must be offlined before removing memory. But we don't hold the lock across the whole operation. So we should check whether all memory blocks are offlined before step 6. Otherwise, the kernel may panic. Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Signed-off-by: Tang Chen tangc...@cn.fujitsu.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com --- drivers/base/memory.c | 6 + include/linux/memory_hotplug.h | 1 + mm/memory_hotplug.c | 48 3 files changed, 55 insertions(+), 0 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 987604d..8300a18 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -693,6 +693,12 @@ int offline_memory_block(struct memory_block *mem) return ret; } +/* return true if the memory block is offlined, otherwise, return false */ +bool is_memblock_offlined(struct memory_block *mem) +{ + return mem->state == MEM_OFFLINE; +} + /* * Initialize the sysfs support for memory devices...
*/ diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 4a45c4e..8dd0950 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -247,6 +247,7 @@ extern int add_memory(int nid, u64 start, u64 size); extern int arch_add_memory(int nid, u64 start, u64 size); extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern int offline_memory_block(struct memory_block *mem); +extern bool is_memblock_offlined(struct memory_block *mem); extern int remove_memory(u64 start, u64 size); extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn, int nr_pages); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 62e04c9..5808045 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1430,6 +1430,54 @@ repeat: goto repeat; } + lock_memory_hotplug(); + + /* +* we have offlined all memory blocks like this: +* 1. lock memory hotplug +* 2. offline a memory block +* 3. unlock memory hotplug +* +* repeat step 1-3 to offline the memory block. All memory blocks +* must be offlined before removing memory. But we don't hold the +* lock in the whole operation. So we should check whether all +* memory blocks are offlined. +*/ + + mem = NULL; + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + section_nr = pfn_to_section_nr(pfn); + if (!present_section_nr(section_nr)) + continue; + + section = __nr_to_section(section_nr); + /* same memblock? */ + if (mem) + if ((section_nr >= mem->start_section_nr) && + (section_nr <= mem->end_section_nr)) + continue; + + mem = find_memory_block_hinted(section, mem); + if (!mem) + continue; + + ret = is_memblock_offlined(mem); + if (!ret) { + pr_warn("removing memory fails, because memory " + "[%#010llx-%#010llx] is onlined\n", + PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)), + PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1); + + kobject_put(&mem->dev.kobj); + unlock_memory_hotplug(); + return ret; + } + } + + if (mem) + kobject_put(&mem->dev.kobj); + unlock_memory_hotplug(); + return 0; } #else -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
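The scan in the hunk above avoids a memory-block lookup for every section by remembering the block that covered the previous section. A minimal userspace sketch of that hinting logic (toy data; two sections per block is an assumption for the example, not the kernel's sections_per_block):

```c
#include <assert.h>

#define SECS_PER_BLOCK 2   /* toy value chosen for the example */

/* Count how many find_memory_block_hinted()-style lookups a scan needs:
 * sections already covered by the block found last time are skipped. */
static int count_lookups(const int present[], int nsec)
{
    int lookups = 0, have_block = 0, blk_start = 0, blk_end = -1;

    for (int s = 0; s < nsec; s++) {
        if (!present[s])
            continue;                      /* like !present_section_nr() */
        if (have_block && s >= blk_start && s <= blk_end)
            continue;                      /* same memblock as the last hit */
        lookups++;                         /* would call the lookup helper */
        have_block = 1;
        blk_start = (s / SECS_PER_BLOCK) * SECS_PER_BLOCK;
        blk_end = blk_start + SECS_PER_BLOCK - 1;
    }
    return lookups;
}

static int demo_lookups(void)
{
    /* 8 sections, two absent in the middle: three distinct blocks touched */
    int present[8] = { 1, 1, 1, 1, 0, 0, 1, 1 };
    return count_lookups(present, 8);
}
```

With eight present-or-absent sections grouped in pairs, only one lookup per touched block is performed instead of one per section.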
[PATCH v6 09/15] memory-hotplug: remove page table of x86_64 architecture
This patch searches the page tables for the removed memory range and clears them, for the x86_64 architecture. Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Jianguo Wu wujian...@huawei.com Signed-off-by: Jiang Liu jiang@huawei.com Signed-off-by: Tang Chen tangc...@cn.fujitsu.com --- arch/x86/mm/init_64.c | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index fe01116..d950f9b 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -981,6 +981,15 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct) flush_tlb_all(); } +void __meminit +kernel_physical_mapping_remove(unsigned long start, unsigned long end) +{ + start = (unsigned long)__va(start); + end = (unsigned long)__va(end); + + remove_pagetable(start, end, true); +} + #ifdef CONFIG_MEMORY_HOTREMOVE int __ref arch_remove_memory(u64 start, u64 size) { @@ -990,6 +999,7 @@ int __ref arch_remove_memory(u64 start, u64 size) int ret; zone = page_zone(pfn_to_page(start_pfn)); + kernel_physical_mapping_remove(start, start + size); ret = __remove_pages(zone, start_pfn, nr_pages); WARN_ON_ONCE(ret); -- 1.7.1
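kernel_physical_mapping_remove() converts the physical range to direct-mapping virtual addresses with __va() before tearing the page tables down. A toy model of that linear phys/virt relationship (FAKE_PAGE_OFFSET is an illustrative constant; the real x86_64 direct-map base varies by configuration):

```c
#include <assert.h>

/* Illustrative direct-map base, not necessarily the kernel's value. */
#define FAKE_PAGE_OFFSET 0xffff880000000000UL

/* __va()/__pa() analogues: the direct map is a fixed linear offset */
static unsigned long fake_va(unsigned long phys) { return phys + FAKE_PAGE_OFFSET; }
static unsigned long fake_pa(unsigned long virt) { return virt - FAKE_PAGE_OFFSET; }

/* kernel_physical_mapping_remove()-style wrapper: takes a physical range
 * and operates on the corresponding direct-map virtual range */
static unsigned long remove_range_length(unsigned long start, unsigned long end)
{
    return fake_va(end) - fake_va(start);  /* same length in either space */
}
```

Because the mapping is a pure offset, the [start, end) range keeps its length after conversion, which is why the function can simply rewrite its arguments in place.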
[PATCH v6 13/15] memory-hotplug: remove sysfs file of node
This patch introduces a new function try_offline_node() to remove the sysfs file of a node when all memory sections of this node are removed. If some memory sections of this node are not removed, this function does nothing. Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Tang Chen tangc...@cn.fujitsu.com --- drivers/acpi/acpi_memhotplug.c | 8 + include/linux/memory_hotplug.h | 2 +- mm/memory_hotplug.c | 58 ++- 3 files changed, 63 insertions(+), 5 deletions(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index eb30e5a..9c53cc6 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -295,9 +295,11 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device) { - int result = 0; + int result = 0, nid; struct acpi_memory_info *info, *n; + nid = acpi_get_node(mem_device->device->handle); + list_for_each_entry_safe(info, n, &mem_device->res_list, list) { if (info->failed) /* The kernel does not use this memory block */ @@ -310,7 +312,9 @@ static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device) */ return -EBUSY; - result = remove_memory(info->start_addr, info->length); + if (nid < 0) + nid = memory_add_physaddr_to_nid(info->start_addr); + result = remove_memory(nid, info->start_addr, info->length); if (result) return result; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 2441f36..f60e728 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -242,7 +242,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size); extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern int offline_memory_block(struct memory_block *mem); extern bool is_memblock_offlined(struct memory_block *mem); -extern int remove_memory(u64 start, u64 size); +extern int remove_memory(int nid, u64 start, u64 size); extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn, int nr_pages); extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index da20c14..a8703f7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -29,6 +29,7 @@ #include <linux/suspend.h> #include <linux/mm_inline.h> #include <linux/firmware-map.h> +#include <linux/stop_machine.h> #include <asm/tlbflush.h> @@ -1678,7 +1679,58 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg) return ret; } -int __ref remove_memory(u64 start, u64 size) +static int check_cpu_on_node(void *data) +{ + struct pglist_data *pgdat = data; + int cpu; + + for_each_present_cpu(cpu) { + if (cpu_to_node(cpu) == pgdat->node_id) + /* +* the cpu on this node isn't removed, and we can't +* offline this node. +*/ + return -EBUSY; + } + + return 0; +} + +/* offline the node if all memory sections of this node are removed */ +static void try_offline_node(int nid) +{ + unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn; + unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages; + unsigned long pfn; + + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + unsigned long section_nr = pfn_to_section_nr(pfn); + + if (!present_section_nr(section_nr)) + continue; + + if (pfn_to_nid(pfn) != nid) + continue; + + /* +* some memory sections of this node are not removed, and we +* can't offline node now. +*/ + return; + } + + if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL)) + return; + + /* +* all memory/cpu of this node are removed, we can offline this +* node now.
+*/ + node_set_offline(nid); + unregister_one_node(nid); +} + +int __ref remove_memory(int nid, u64 start, u64 size) { unsigned long start_pfn, end_pfn; int ret = 0; @@ -1733,6 +1785,8 @@ repeat: arch_remove_memory(start, size); + try_offline_node(nid); + unlock_memory_hotplug(); return 0; @@ -1742,7 +1796,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages) { return -EINVAL; } -int remove_memory(u64 start, u64 size) +int remove_memory(int nid, u64 start, u64 size) {
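check_cpu_on_node() refuses to offline a node while any present CPU still maps to it. A userspace sketch of the same check (toy cpu-to-node table; TOY_EBUSY stands in for the kernel's EBUSY):

```c
#include <assert.h>

#define TOY_EBUSY 16   /* stand-in for the kernel's EBUSY errno value */

/* check_cpu_on_node() analogue: a node cannot go away while a present
 * CPU still belongs to it. */
static int node_has_cpu(const int cpu_to_node[], int ncpu, int nid)
{
    for (int cpu = 0; cpu < ncpu; cpu++)
        if (cpu_to_node[cpu] == nid)
            return -TOY_EBUSY;   /* a CPU still lives on this node */
    return 0;
}

static int demo_busy(void)
{
    int map[4] = { 0, 0, 1, 1 };     /* CPUs 2 and 3 are on node 1 */
    return node_has_cpu(map, 4, 1);
}

static int demo_free(void)
{
    int map[4] = { 0, 0, 0, 0 };     /* no CPU on node 1 */
    return node_has_cpu(map, 4, 1);
}
```

The kernel runs this check under stop_machine() so the cpu-to-node relation cannot change mid-scan; the sketch omits that synchronization.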
[PATCH v6 10/15] memory-hotplug: remove memmap of sparse-vmemmap
This patch introduces a new API vmemmap_free() to free and remove vmemmap pagetables. Since page table implementations differ, each architecture has to provide its own version of vmemmap_free(), just like vmemmap_populate(). Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc. Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Signed-off-by: Jianguo Wu wujian...@huawei.com Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Tang Chen tangc...@cn.fujitsu.com --- arch/arm64/mm/mmu.c | 3 +++ arch/ia64/mm/discontig.c | 4 arch/powerpc/mm/init_64.c | 4 arch/s390/mm/vmem.c | 4 arch/sparc/mm/init_64.c | 4 arch/x86/mm/init_64.c | 8 include/linux/mm.h | 1 + mm/sparse.c | 3 ++- 8 files changed, 30 insertions(+), 1 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index a6885d8..9834886 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -392,4 +392,7 @@ int __meminit vmemmap_populate(struct page *start_page, return 0; } #endif /* CONFIG_ARM64_64K_PAGES */ +void vmemmap_free(struct page *memmap, unsigned long nr_pages) +{ +} #endif /* CONFIG_SPARSEMEM_VMEMMAP */ diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 33943db..882a0fd 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -823,6 +823,10 @@ int __meminit vmemmap_populate(struct page *start_page, return vmemmap_populate_basepages(start_page, size, node); } +void vmemmap_free(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 6466440..2969591 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -298,6 +298,10 @@ int __meminit vmemmap_populate(struct page *start_page, return 0; } +void vmemmap_free(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page
*start_page, unsigned long size) { diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c index 2c14bc2..81e6ba3 100644 --- a/arch/s390/mm/vmem.c +++ b/arch/s390/mm/vmem.c @@ -272,6 +272,10 @@ out: return ret; } +void vmemmap_free(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 1f30db3..5afe21a 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2232,6 +2232,10 @@ void __meminit vmemmap_populate_print_last(void) } } +void vmemmap_free(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index d950f9b..e829113 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1309,6 +1309,14 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node) return 0; } +void __ref vmemmap_free(struct page *memmap, unsigned long nr_pages) +{ + unsigned long start = (unsigned long)memmap; + unsigned long end = (unsigned long)(memmap + nr_pages); + + remove_pagetable(start, end, false); +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 1eca498..31d5e5d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1709,6 +1709,7 @@ int vmemmap_populate_basepages(struct page *start_page, unsigned long pages, int node); int vmemmap_populate(struct page *start_page, unsigned long pages, int node); void vmemmap_populate_print_last(void); +void vmemmap_free(struct page *memmap, unsigned long nr_pages); void register_page_bootmem_memmap(unsigned long section_nr, struct page *map, unsigned long size); diff --git a/mm/sparse.c b/mm/sparse.c index 05ca73a..cff9796 100644 --- a/mm/sparse.c +++ 
b/mm/sparse.c @@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid, } static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages) { - return; /* XXX: Not implemented yet */ + vmemmap_free(memmap, nr_pages); } static void free_map_bootmem(struct page *memmap, unsigned long nr_pages) { + vmemmap_free(memmap, nr_pages); } #else static struct page *__kmalloc_section_memmap(unsigned long nr_pages) -- 1.7.1
[PATCH v6 12/15] memory-hotplug: memory_hotplug: clear zone when removing the memory
From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com When memory is added, we update the zone's and pgdat's start_pfn and spanned_pages in the function __add_zone(). So we should revert them when the memory is removed. The patch adds a new function __remove_zone() to do this. Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- mm/memory_hotplug.c | 207 +++ 1 files changed, 207 insertions(+), 0 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b20c4c7..da20c14 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -430,8 +430,211 @@ static int __meminit __add_section(int nid, struct zone *zone, return register_new_memory(nid, __pfn_to_section(phys_start_pfn)); } +/* find the smallest valid pfn in the range [start_pfn, end_pfn) */ +static int find_smallest_section_pfn(int nid, struct zone *zone, +unsigned long start_pfn, +unsigned long end_pfn) +{ + struct mem_section *ms; + + for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(start_pfn); + + if (unlikely(!valid_section(ms))) + continue; + + if (unlikely(pfn_to_nid(start_pfn) != nid)) + continue; + + if (zone && zone != page_zone(pfn_to_page(start_pfn))) + continue; + + return start_pfn; + } + + return 0; +} + +/* find the biggest valid pfn in the range [start_pfn, end_pfn). */ +static int find_biggest_section_pfn(int nid, struct zone *zone, + unsigned long start_pfn, + unsigned long end_pfn) +{ + struct mem_section *ms; + unsigned long pfn; + + /* pfn is the end pfn of a memory section. */ + pfn = end_pfn - 1; + for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + + if (unlikely(!valid_section(ms))) + continue; + + if (unlikely(pfn_to_nid(pfn) != nid)) + continue; + + if (zone && zone != page_zone(pfn_to_page(pfn))) + continue; + + return pfn; + } + + return 0; +} + +static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, +unsigned long end_pfn) +{ + unsigned long zone_start_pfn = zone->zone_start_pfn; + unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages; + unsigned long pfn; + struct mem_section *ms; + int nid = zone_to_nid(zone); + + zone_span_writelock(zone); + if (zone_start_pfn == start_pfn) { + /* +* If the section is the smallest section in the zone, it needs +* to shrink zone->zone_start_pfn and zone->spanned_pages. +* In this case, we find the second smallest valid mem_section +* for shrinking the zone. +*/ + pfn = find_smallest_section_pfn(nid, zone, end_pfn, + zone_end_pfn); + if (pfn) { + zone->zone_start_pfn = pfn; + zone->spanned_pages = zone_end_pfn - pfn; + } + } else if (zone_end_pfn == end_pfn) { + /* +* If the section is the biggest section in the zone, it needs +* to shrink zone->spanned_pages. +* In this case, we find the second biggest valid mem_section +* for shrinking the zone. +*/ + pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn, + start_pfn); + if (pfn) + zone->spanned_pages = pfn - zone_start_pfn + 1; + } + + /* +* The section is not the biggest or smallest mem_section in the zone, +* it only creates a hole in the zone. So in this case, we need not +* change the zone. But perhaps the zone now contains only holes. Thus +* we check whether the zone has only holes or not. +*/ + pfn = zone_start_pfn; + for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + + if (unlikely(!valid_section(ms))) + continue; + + if (page_zone(pfn_to_page(pfn)) != zone) + continue; + +/* If the section is current section, it continues the loop */ + if (start_pfn == pfn) + continue; + + /* If we find valid section, we have nothing to do */ + zone_span_writeunlock(zone); + return; + } + + /* The zone has no
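find_smallest_section_pfn()/find_biggest_section_pfn() scan section-aligned pfns from opposite ends looking for the first valid section. A simplified userspace version over a validity array (toy section size of 4 pages; the real PAGES_PER_SECTION is architecture-dependent):

```c
#include <assert.h>

#define TOY_PAGES_PER_SECTION 4   /* toy value; real one is arch-dependent */

/* first pfn in [start, end) whose section is valid, 0 when none found
 * (mirroring the kernel helpers' use of 0 as "nothing left") */
static long smallest_valid_pfn(const int valid[], long start, long end)
{
    for (long pfn = start; pfn < end; pfn += TOY_PAGES_PER_SECTION)
        if (valid[pfn / TOY_PAGES_PER_SECTION])
            return pfn;
    return 0;
}

/* last pfn in [start, end) whose section is valid, scanning backwards
 * in section-sized steps from end_pfn - 1 */
static long biggest_valid_pfn(const int valid[], long start, long end)
{
    for (long pfn = end - 1; pfn >= start; pfn -= TOY_PAGES_PER_SECTION)
        if (valid[pfn / TOY_PAGES_PER_SECTION])
            return pfn;
    return 0;
}

static long demo_smallest(void)
{
    int valid[4] = { 0, 1, 1, 0 };   /* sections 1 and 2 are valid */
    return smallest_valid_pfn(valid, 0, 16);
}

static long demo_biggest(void)
{
    int valid[4] = { 0, 1, 1, 0 };
    return biggest_valid_pfn(valid, 0, 16);
}
```

With sections 1 and 2 valid, the forward scan lands on pfn 4 (start of section 1) and the backward scan on pfn 11 (last pfn of section 2), which is exactly the pair shrink_zone_span() uses to recompute zone_start_pfn and spanned_pages.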
[PATCH v6 14/15] memory-hotplug: free node_data when a node is offlined
From: Wen Congyang we...@cn.fujitsu.com We call hotadd_new_pgdat() to allocate memory to store node_data. So we should free it when removing a node. Signed-off-by: Wen Congyang we...@cn.fujitsu.com Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com --- mm/memory_hotplug.c | 30 +++--- 1 files changed, 27 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a8703f7..8b67752 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1699,9 +1699,12 @@ static int check_cpu_on_node(void *data) /* offline the node if all memory sections of this node are removed */ static void try_offline_node(int nid) { - unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn; - unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages; + pg_data_t *pgdat = NODE_DATA(nid); + unsigned long start_pfn = pgdat->node_start_pfn; + unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages; unsigned long pfn; + struct page *pgdat_page = virt_to_page(pgdat); + int i; for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { unsigned long section_nr = pfn_to_section_nr(pfn); @@ -1719,7 +1722,7 @@ static void try_offline_node(int nid) return; } - if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL)) + if (stop_machine(check_cpu_on_node, pgdat, NULL)) return; /* @@ -1728,6 +1731,27 @@ static void try_offline_node(int nid) */ node_set_offline(nid); unregister_one_node(nid); + + if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page)) + /* node data is allocated from boot memory */ + return; + + /* free waittable in each zone */ + for (i = 0; i < MAX_NR_ZONES; i++) { + struct zone *zone = pgdat->node_zones + i; + + if (zone->wait_table) + vfree(zone->wait_table); + } + + /* +* Since there is no way to guarantee the address of pgdat/zone is not +* on stack of any kernel threads or used by other kernel objects +* without reference counting or other synchronizing method, do not +* reset node_data and free pgdat here. Just reset it to 0 and reuse +* the memory when the node is online again. +*/ + memset(pgdat, 0, sizeof(*pgdat)); } int __ref remove_memory(int nid, u64 start, u64 size) -- 1.7.1
[PATCH v6 11/15] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.
Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Tang Chen tangc...@cn.fujitsu.com --- mm/memory_hotplug.c | 11 --- 1 files changed, 0 insertions(+), 11 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 674e791..b20c4c7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -430,16 +430,6 @@ static int __meminit __add_section(int nid, struct zone *zone, return register_new_memory(nid, __pfn_to_section(phys_start_pfn)); } -#ifdef CONFIG_SPARSEMEM_VMEMMAP -static int __remove_section(struct zone *zone, struct mem_section *ms) -{ - /* -* XXX: Freeing memmap with vmemmap is not implement yet. -* This should be removed later. -*/ - return -EBUSY; -} -#else static int __remove_section(struct zone *zone, struct mem_section *ms) { int ret = -EINVAL; @@ -454,7 +444,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms) sparse_remove_one_section(zone, ms); return 0; } -#endif /* * Reasonably generic function for adding memory. It is -- 1.7.1
[PATCH v6 15/15] memory-hotplug: Do not allocate pdgat if it was not freed when offline.
Since there is no way to guarantee the address of pgdat/zone is not on the stack of any kernel thread or used by other kernel objects without reference counting or another synchronizing method, we cannot reset node_data and free pgdat when offlining a node. Just reset pgdat to 0 and reuse the memory when the node is online again. The problem was suggested by Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com The idea is from Wen Congyang we...@cn.fujitsu.com NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node() will be triggered. Signed-off-by: Tang Chen tangc...@cn.fujitsu.com Reviewed-by: Wen Congyang we...@cn.fujitsu.com --- mm/memory_hotplug.c | 20 1 files changed, 12 insertions(+), 8 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 8b67752..8aa2b56 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1015,11 +1015,14 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start) unsigned long zholes_size[MAX_NR_ZONES] = {0}; unsigned long start_pfn = start >> PAGE_SHIFT; - pgdat = arch_alloc_nodedata(nid); - if (!pgdat) - return NULL; + pgdat = NODE_DATA(nid); + if (!pgdat) { + pgdat = arch_alloc_nodedata(nid); + if (!pgdat) + return NULL; - arch_refresh_nodedata(nid, pgdat); + arch_refresh_nodedata(nid, pgdat); + } /* we can use NODE_DATA(nid) from here */ @@ -1072,7 +1075,7 @@ out: int __ref add_memory(int nid, u64 start, u64 size) { pg_data_t *pgdat = NULL; - int new_pgdat = 0; + int new_pgdat = 0, new_node = 0; struct resource *res; int ret; @@ -1083,12 +1086,13 @@ int __ref add_memory(int nid, u64 start, u64 size) if (!res) goto out; - if (!node_online(nid)) { + new_pgdat = NODE_DATA(nid) ? 0 : 1; + new_node = node_online(nid) ? 0 : 1; + if (new_node) { pgdat = hotadd_new_pgdat(nid, start); ret = -ENOMEM; if (!pgdat) goto error; - new_pgdat = 1; } /* call arch's memory hotadd */ @@ -1100,7 +1104,7 @@ int __ref add_memory(int nid, u64 start, u64 size) /* we online node here. we can't roll back from here. */ node_set_online(nid); - if (new_pgdat) { + if (new_node) { ret = register_one_node(nid); /* * If sysfs file of new node can't create, cpu on the node -- 1.7.1
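The change above makes hotadd_new_pgdat() reuse a previously allocated pgdat instead of allocating a fresh one, while patch 14 merely zeroes it on offline. A toy sketch of that allocate-once, wipe-and-reuse pattern (toy_pgdat is a stand-in structure, not the kernel's pg_data_t):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_NODES 4

struct toy_pgdat { unsigned long node_start_pfn, node_spanned_pages; };

static struct toy_pgdat *toy_node_data[TOY_MAX_NODES];  /* NODE_DATA() analogue */

/* online path: allocate only the first time, otherwise reuse the old pgdat */
static struct toy_pgdat *toy_hotadd_pgdat(int nid)
{
    if (!toy_node_data[nid])
        toy_node_data[nid] = calloc(1, sizeof(struct toy_pgdat));
    return toy_node_data[nid];
}

/* offline path: wipe the contents but keep the allocation alive */
static void toy_offline_pgdat(int nid)
{
    if (toy_node_data[nid])
        memset(toy_node_data[nid], 0, sizeof(struct toy_pgdat));
}

static int demo_reuse(void)
{
    struct toy_pgdat *first = toy_hotadd_pgdat(2);

    first->node_spanned_pages = 100;
    toy_offline_pgdat(2);                       /* node goes away */

    struct toy_pgdat *second = toy_hotadd_pgdat(2);  /* node comes back */
    return second == first && second->node_spanned_pages == 0;
}
```

Reusing the same allocation sidesteps the danger the commit message describes: stale pointers to the old pgdat stay valid (if zeroed), because the object is never actually freed.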
Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture
On Wed, Jan 09, 2013 at 02:32:56PM +1100, Benjamin Herrenschmidt wrote: Ok. I think at least you can move that construct: + if (addr < SLICE_LOW_TOP) { + slice = GET_LOW_SLICE_INDEX(addr); + addr = (slice + 1) << SLICE_LOW_SHIFT; + if (!(available.low_slices & (1u << slice))) + continue; + } else { + slice = GET_HIGH_SLICE_INDEX(addr); + addr = (slice + 1) << SLICE_HIGH_SHIFT; + if (!(available.high_slices & (1u << slice))) + continue; + } Into some kind of helper. It will probably compile to the same thing but at least it's more readable and it will avoid a fuckup in the future if somebody changes the algorithm and forgets to update one of the copies :-) All right, does the following look more palatable then? (didn't re-test it, though) Signed-off-by: Michel Lespinasse wal...@google.com --- arch/powerpc/mm/slice.c | 123 ++- 1 files changed, 78 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 999a74f25ebe..3e99c149271a 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -237,36 +237,69 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psize) #endif } +/* + * Compute which slice addr is part of; + * set *boundary_addr to the start or end boundary of that slice + * (depending on 'end' parameter); + * return boolean indicating if the slice is marked as available in the + * 'available' slice_mask. + */ +static bool slice_scan_available(unsigned long addr, +struct slice_mask available, +int end, +unsigned long *boundary_addr) +{ + unsigned long slice; + if (addr < SLICE_LOW_TOP) { + slice = GET_LOW_SLICE_INDEX(addr); + *boundary_addr = (slice + end) << SLICE_LOW_SHIFT; + return !!(available.low_slices & (1u << slice)); + } else { + slice = GET_HIGH_SLICE_INDEX(addr); + *boundary_addr = (slice + end) ? + ((slice + end) << SLICE_HIGH_SHIFT) : SLICE_LOW_TOP; + return !!(available.high_slices & (1u << slice)); + } +} + static unsigned long slice_find_area_bottomup(struct mm_struct *mm, unsigned long len, struct slice_mask available, int psize) { - struct vm_area_struct *vma; - unsigned long addr; - struct slice_mask mask; int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT); + unsigned long addr, found, next_end; + struct vm_unmapped_area_info info; - addr = TASK_UNMAPPED_BASE; - - for (;;) { - addr = _ALIGN_UP(addr, 1ul << pshift); - if ((TASK_SIZE - len) < addr) - break; - vma = find_vma(mm, addr); - BUG_ON(vma && (addr >= vma->vm_end)); + info.flags = 0; + info.length = len; + info.align_mask = PAGE_MASK & ((1ul << pshift) - 1); + info.align_offset = 0; - mask = slice_range_to_mask(addr, len); - if (!slice_check_fit(mask, available)) { - if (addr < SLICE_LOW_TOP) - addr = _ALIGN_UP(addr + 1, 1ul << SLICE_LOW_SHIFT); - else - addr = _ALIGN_UP(addr + 1, 1ul << SLICE_HIGH_SHIFT); + addr = TASK_UNMAPPED_BASE; + while (addr < TASK_SIZE) { + info.low_limit = addr; + if (!slice_scan_available(addr, available, 1, &addr)) continue; + + next_slice: + /* +* At this point [info.low_limit; addr) covers +* available slices only and ends at a slice boundary. +* Check if we need to reduce the range, or if we can +* extend it to cover the next available slice. +*/ + if (addr >= TASK_SIZE) + addr = TASK_SIZE; + else if (slice_scan_available(addr, available, 1, &next_end)) { + addr = next_end; + goto next_slice; } - if (!vma || addr + len <= vma->vm_start) - return addr; - addr = vma->vm_end; + info.high_limit = addr; + + found = vm_unmapped_area(&info); + if (!(found & ~PAGE_MASK)) + return found; } return -ENOMEM; @@ -277,39 +310,39 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
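slice_scan_available() maps an address to its slice and reports both availability and the next slice boundary; the bottomup loop then extends the search range across consecutive available slices. A userspace sketch of that coalescing scan (toy 4-page slices in a single 32-bit mask, unlike the real low/high split):

```c
#include <assert.h>

#define TOY_SLICE_SHIFT 2   /* toy: 4 pages per slice, up to 32 slices */

/* slice_scan_available() analogue: is addr's slice available, and where
 * does that slice end? */
static int toy_scan_available(unsigned long addr, unsigned int mask,
                              unsigned long *boundary)
{
    unsigned long slice = addr >> TOY_SLICE_SHIFT;

    *boundary = (slice + 1) << TOY_SLICE_SHIFT;
    return !!(mask & (1u << slice));
}

/* Extend a search range across consecutive available slices, like the
 * "goto next_slice" loop in slice_find_area_bottomup(). */
static unsigned long toy_run_end(unsigned long addr, unsigned int mask,
                                 unsigned long limit)
{
    unsigned long next;

    while (addr < limit && toy_scan_available(addr, mask, &next))
        addr = next;
    return addr;
}

static unsigned long demo_run(void)
{
    /* mask 0xE: slices 1, 2 and 3 available; a run starting at addr 4
     * (slice 1) should extend to the end of slice 3, addr 16 */
    return toy_run_end(4, 0xEu, 64);
}
```

Coalescing the run first and calling vm_unmapped_area() once per run is what lets the rewritten function drop the per-vma brute-force loop.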
Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
On 12/30/2012 09:58 AM, Wen Congyang wrote: At 12/25/2012 04:35 PM, Glauber Costa Wrote: On 12/24/2012 04:09 PM, Tang Chen wrote: From: Wen Congyang we...@cn.fujitsu.com memory can't be offlined when CONFIG_MEMCG is selected. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10, and memory11 under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup when we online pages. When we online memory8, the memory stored page cgroup is not provided by this memory device. But when we online memory9, the memory stored page cgroup may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reversed order. When the memory device is hotremoved, we will auto offline memory provided by this memory device. But we don't know which memory is onlined first, so offlining memory may fail. In such case, iterate twice to offline the memory. 1st iterate: offline every non primary memory block. 2nd iterate: offline primary (i.e. first added) memory block. This idea is suggested by KOSAKI Motohiro. Signed-off-by: Wen Congyang we...@cn.fujitsu.com Maybe there is something here that I am missing - I admit that I came late to this one, but this really sounds like a very ugly hack, that really has no place in here. Retrying, of course, may make sense, if we have reasonable belief that we may now succeed. If this is the case, you need to document - in the code - while is that. The memcg argument, however, doesn't really cut it. Why can't we make all page_cgroup allocations local to the node they are describing? If memcg is the culprit here, we should fix it, and not retry. If there is still any benefit in retrying, then we retry being very specific about why. We try to make all page_cgroup allocations local to the node they are describing now. 
If the memory is the first memory onlined in this node, we will allocate it from the other node. For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11 1. memory block 8, page_cgroup allocations are in the other nodes 2. memory block 9, page_cgroup allocations are in memory block 8 So we should offline memory block 9 first. But we don't know in which order the user online the memory block. I think we can modify memcg like this: allocate the memory from the memory block they are describing I am not sure it is OK to do so. I don't see a reason why not. You would have to tweak a bit the lookup function for page_cgroup, but assuming you will always have the pfns and limits, it should be easy to do. I think the only tricky part is that today we have a single node_page_cgroup, and we would of course have to have one per memory block. My assumption is that the number of memory blocks is limited and likely not very big. So even a static array would do. Kamezawa, do you have any input in here?
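The two-iteration strategy described in the patch (first offline every non-primary block, then retry the primary one whose page_cgroup storage the others depended on) can be sketched as:

```c
#include <assert.h>

#define TOY_NBLK 4

/* Toy model: block 0 is "primary" — it holds the page_cgroup storage that
 * describes the others, so it can only go offline once they are all gone. */
static int toy_try_offline(int online[], int blk)
{
    if (blk == 0)
        for (int i = 1; i < TOY_NBLK; i++)
            if (online[i])
                return -1;   /* still depended upon, offline fails */
    online[blk] = 0;
    return 0;
}

/* The two iterations suggested in the patch: the first pass offlines the
 * non-primary blocks (the primary fails), the second pass retries it. */
static int demo_two_pass(void)
{
    int online[TOY_NBLK] = { 1, 1, 1, 1 };

    for (int pass = 0; pass < 2; pass++)
        for (int b = 0; b < TOY_NBLK; b++)
            if (online[b])
                toy_try_offline(online, b);

    for (int b = 0; b < TOY_NBLK; b++)
        if (online[b])
            return -1;       /* something stayed online */
    return 0;                /* everything offlined */
}
```

The dependency direction ("first added" blocks host metadata for later ones) is modeled by the single primary block here; the real code does not know the online order, which is exactly why it retries rather than computing an ordering.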
Re: [PATCH 1/8] mm: use vm_unmapped_area() on parisc architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Update the parisc arch_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 2/8] mm: use vm_unmapped_area() on alpha architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Update the alpha arch_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 3/8] mm: use vm_unmapped_area() on frv architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Update the frv arch_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 4/8] mm: use vm_unmapped_area() on ia64 architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Update the ia64 arch_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 5/8] mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Update the ia64 hugetlb_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 6/8] mm: remove free_area_cache use in powerpc architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: As all other architectures have been converted to use vm_unmapped_area(), we are about to retire the free_area_cache. This change simply removes the use of that cache in slice_get_unmapped_area(), which will most certainly have a performance cost. Next one will convert that function to use the vm_unmapped_area() infrastructure and regain the performance. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Update the powerpc slice_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 8/8] mm: remove free_area_cache
On 01/08/2013 08:28 PM, Michel Lespinasse wrote: Since all architectures have been converted to use vm_unmapped_area(), there is no remaining use for the free_area_cache. Signed-off-by: Michel Lespinasse wal...@google.com Yay Acked-by: Rik van Riel r...@redhat.com
Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture
On 01/09/2013 06:23 AM, Michel Lespinasse wrote: On Wed, Jan 09, 2013 at 02:32:56PM +1100, Benjamin Herrenschmidt wrote: Ok. I think at least you can move that construct:
+		if (addr < SLICE_LOW_TOP) {
+			slice = GET_LOW_SLICE_INDEX(addr);
+			addr = (slice + 1) << SLICE_LOW_SHIFT;
+			if (!(available.low_slices & (1u << slice)))
+				continue;
+		} else {
+			slice = GET_HIGH_SLICE_INDEX(addr);
+			addr = (slice + 1) << SLICE_HIGH_SHIFT;
+			if (!(available.high_slices & (1u << slice)))
+				continue;
+		}
Into some kind of helper. It will probably compile to the same thing but at least it's more readable and it will avoid a fuckup in the future if somebody changes the algorithm and forgets to update one of the copies :-) All right, does the following look more palatable then? (didn't re-test it, though) Looks equivalent. I have also not tested :) Signed-off-by: Michel Lespinasse wal...@google.com Acked-by: Rik van Riel r...@redhat.com
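The kind of helper Ben is asking for could look roughly like the sketch below. This is a userspace illustration, not the patch that was actually merged: the constants and the struct layout are stand-ins for the real powerpc definitions.

```c
#include <assert.h>

/* Illustrative stand-ins for the powerpc slice constants; the real
 * values live in arch/powerpc and are not reproduced here. */
#define SLICE_LOW_TOP    0x100000000ul  /* hypothetical 4GB boundary */
#define SLICE_LOW_SHIFT  28
#define SLICE_HIGH_SHIFT 40
#define GET_LOW_SLICE_INDEX(a)  ((a) >> SLICE_LOW_SHIFT)
#define GET_HIGH_SLICE_INDEX(a) ((a) >> SLICE_HIGH_SHIFT)

struct slice_mask {
    unsigned int low_slices;
    unsigned int high_slices;
};

/* Factored-out helper: advance *addr to the end of its slice and
 * report whether that slice is marked available.  Both branches of
 * the duplicated construct collapse into this one function. */
static int slice_scan_available(unsigned long *addr,
                                struct slice_mask available)
{
    unsigned int slice;

    if (*addr < SLICE_LOW_TOP) {
        slice = GET_LOW_SLICE_INDEX(*addr);
        *addr = (unsigned long)(slice + 1) << SLICE_LOW_SHIFT;
        return !!(available.low_slices & (1u << slice));
    }
    slice = GET_HIGH_SLICE_INDEX(*addr);
    *addr = (unsigned long)(slice + 1) << SLICE_HIGH_SHIFT;
    return !!(available.high_slices & (1u << slice));
}
```

A caller then loops with `while (addr < limit) { if (!slice_scan_available(&addr, available)) continue; ... }`, which is exactly the deduplication Ben suggests.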
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Dec 18, 2012, at 10:31 AM, Peter Bergner berg...@vnet.ibm.com wrote: On Tue, 2012-12-18 at 07:28 -0600, Jimi Xenidis wrote: On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote: Jimi, are you using an old binutils from before my patch that changed the operand order for these types of instructions? http://sourceware.org/ml/binutils/2009-02/msg00044.html Actually, this confused me as well, that embedded has the same instruction encoding but different mnemonic. The mnemonic is the same (ie, dcbtst), and yes, the encoding is the same. All that is different is the accepted operand ordering...and yes, it is very unfortunate the operand ordering is different between embedded and server. :( I was under the impression that the assembler made no instruction decisions based on CPU. So your only hint would be that '0b' prefix. Does AS even see that? GAS definitely makes decisions based on CPU (ie, -mcpu option). Below is the GAS code used in recognizing the dcbtst instruction. This shows that the server operand ordering is enabled for POWER4 and later cpus while the embedded operand ordering is enabled for pre POWER4 cpus (yes, not exactly a server versus embedded trigger, but that's we agreed on to mitigate breaking any old asm code out there). {dcbtst,X(31,246), X_MASK, POWER4,PPCNONE,{RA0, RB, CT}}, {dcbtst,X(31,246), X_MASK, PPC|PPCVLE, POWER4,{CT, RA0, RB}}, GAS doesn't look at how the operands are written to try and guess what operand ordering you are attempting to use. Rather, it knows what ordering it expects and the values had better match that ordering. I agree, but that means it is impossible for the same .S file can be compiled but -mcpu=e500mc and -mcpu=powerpc? So either these files have to be Book3S versus Book3E --or-- we use a CPP macro to get them right. FWIW, I prefer the latter which allows more code reuse. -jx Peter ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Wed, 9 Jan 2013 17:32:24 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: Here is the physical memory hot-remove patch-set based on 3.8rc-2. This patch-set aims to implement physical memory hot-removing. The patches can free/remove the following things: - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15] - memmap of sparse-vmemmap : [PATCH 6,7,8,10/15] - page table of removed memory : [RFC PATCH 7,8,10/15] - node and related sysfs files : [RFC PATCH 13-15/15] Existing problem: If CONFIG_MEMCG is selected, we will allocate memory to store page cgroups when we online pages. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10, and memory11 under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, when we online memory8, the memory that stores its page cgroups is not provided by this memory device. But when we online memory9, the memory that stores its page cgroups may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reverse order. When the memory device is hot-removed, we will automatically offline the memory provided by this device. But we don't know which memory was onlined first, so offlining memory may fail. This does sound like a significant problem. We should assume that memcg is available and in use. In patch1, we provide a solution which is not good enough: Iterate twice to offline the memory. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block. Let's flesh this out a bit. If we online memory8, memory9, memory10 and memory11 then I'd have thought that they would need to be offlined in reverse order, which will require four iterations, not two. Is this wrong, and if so, why? Also, what happens if we wish to offline only memory9? Do we offline memory11, then memory10, then memory9, and then re-online memory10 and memory11?
And a new idea from Wen Congyang we...@cn.fujitsu.com is: allocate the memory from the memory block it is describing. Yes. But we are not sure if it is OK to do so because there is no existing API to do so, and we need to move the page_cgroup memory allocation from MEM_GOING_ONLINE to MEM_ONLINE. This all sounds solvable - can we proceed in this fashion? And also, it may interfere with hugepages. Please provide full details on this problem. Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug. Right. But how often does this happen in testing? In other words, please provide an overall description of how well memory hot-remove is presently operating. Is it reliable? What is the success rate in real-world situations? Are there precautions which the administrator can take to improve the success rate? What are the remaining problems, and are there plans to address them?
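The two-iteration scheme from patch 1 can be modeled in a few lines of C. This is a toy simulation under the patch's own assumption - that only the primary (first-onlined) block backs its siblings' page_cgroup storage, so only it must go last. Andrew's "four iterations" concern corresponds to dependency chains this model deliberately does not have; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the two-pass offline: the primary block holds the
 * page_cgroup storage for the other blocks in the device, so it can
 * only be offlined once every other block is already offline. */
struct mem_block {
    int online;
    int primary;  /* first block onlined in this device */
};

static int try_offline(struct mem_block *b, struct mem_block *blks, size_t n)
{
    size_t i;

    if (b->primary)  /* primary still backs siblings' page_cgroup? */
        for (i = 0; i < n; i++)
            if (&blks[i] != b && blks[i].online)
                return -1;  /* would leave dangling page_cgroup */
    b->online = 0;
    return 0;
}

/* 1st pass: offline every non-primary memory block.
 * 2nd pass: offline the primary memory block. */
static int offline_device(struct mem_block *blks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (!blks[i].primary && try_offline(&blks[i], blks, n))
            return -1;
    for (i = 0; i < n; i++)
        if (blks[i].online && try_offline(&blks[i], blks, n))
            return -1;
    return 0;
}
```

If block 9's page_cgroup could also live in block 9 itself (Wen's idea above), even the two passes become unnecessary and any offline order works.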
[PATCH v2] powerpc/mm: eliminate unneeded for_each_memblock
The only persistent change made by this loop is calling memblock_set_node() once for each memblock, which is not useful (and has no effect) as memblock_set_node() is not called with any memblock-specific parameters. Substitute a single memblock_set_node() call.

Signed-off-by: Cody P Schafer c...@linux.vnet.ibm.com
---
Now with a signoff & wrapped comment line.

 arch/powerpc/mm/mem.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..40df7c8 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -195,13 +195,10 @@ void __init do_init_bootmem(void)
 	min_low_pfn = MEMORY_START >> PAGE_SHIFT;
 	boot_mapsize = init_bootmem_node(NODE_DATA(0), start >> PAGE_SHIFT,
 					 min_low_pfn, max_low_pfn);
-	/* Add active regions with valid PFNs */
-	for_each_memblock(memory, reg) {
-		unsigned long start_pfn, end_pfn;
-		start_pfn = memblock_region_memory_base_pfn(reg);
-		end_pfn = memblock_region_memory_end_pfn(reg);
-		memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
-	}
+	/* Place all memblock_regions in the same node and merge contiguous
+	 * memblock_regions
+	 */
+	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
 	/* Add all physical memory to the bootmem map, mark each area
 	 * present.
--
1.8.0.3
Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
On Wed, 9 Jan 2013 17:32:28 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: When (hot)adding memory into the system, /sys/firmware/memmap/X/{end, start, type} sysfs files are created. But there is no code to remove these files. This patch implements the function to remove them. Note: The code does not free firmware_map_entry, which is allocated by bootmem. So the patch introduces a memory leak. But I think the leaked memory is very small, and it does not affect the system. Well that's bad. Can we remember the address of that memory and then reuse the storage if/when the memory is re-added? That at least puts an upper bound on the leak.
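Andrew's suggestion - remember the unfreeable entry and reuse its storage when the memory comes back - amounts to a small free-list cache. A userspace sketch, with malloc standing in for the bootmem allocation and all names hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

struct fw_map_entry {
    unsigned long long start, end;
    struct fw_map_entry *next;
};

/* Entries we cannot really free (bootmem-allocated in the kernel
 * case) are parked on this list instead of being leaked outright. */
static struct fw_map_entry *entry_cache;

static void fw_map_entry_release(struct fw_map_entry *e)
{
    e->next = entry_cache;   /* remember the storage... */
    entry_cache = e;
}

static struct fw_map_entry *fw_map_entry_get(void)
{
    struct fw_map_entry *e = entry_cache;

    if (e) {                 /* ...and reuse it when memory is re-added */
        entry_cache = e->next;
        return e;
    }
    return malloc(sizeof(*e));
}
```

With this scheme the number of parked entries can never exceed the number of memory devices ever hot-added, which is the upper bound on the leak Andrew is asking for.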
Re: [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
On Wed, 9 Jan 2013 17:32:29 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: For removing memory, we need to remove the page table. But it depends on the architecture. So this patch introduces arch_remove_memory() for removing the page table. Now it only calls __remove_pages(). Note: __remove_pages() for some architectures is not implemented (I don't know how to implement it for s390). Can this break the build for s390?
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Wed, 2013-01-09 at 16:19 -0600, Jimi Xenidis wrote: I agree, but that means it is impossible for the same .S file to be compiled with both -mcpu=e500mc and -mcpu=powerpc? So either these files have to be Book3S versus Book3E --or-- we use a CPP macro to get them right. FWIW, I prefer the latter which allows more code reuse. I agree - using a CPP macro, like we do for new instructions that some older assemblers may not yet support, is probably the best solution. Peter
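Such a CPP macro could look like the sketch below, which follows the operand orderings in the binutils opcode table quoted earlier in the thread (server: RA,RB,CT; embedded/pre-POWER4: CT,RA,RB). The macro name and the CONFIG_PPC_BOOK3S selector are illustrative, and stringification is used here only so the emitted text can be checked in plain C; in a real .S file the macro would expand to the bare mnemonic.

```c
#include <assert.h>
#include <string.h>

/* dcbtst has the same encoding everywhere, but the assembler accepts
 * "dcbtst RA,RB,CT" for POWER4-and-later (server) targets and
 * "dcbtst CT,RA,RB" for embedded/pre-POWER4 targets.  One macro hides
 * the difference; CONFIG_PPC_BOOK3S is a stand-in for whatever symbol
 * selects the server flavor of the build. */
#ifdef CONFIG_PPC_BOOK3S
#define DCBTST(ct, ra, rb) "dcbtst " #ra "," #rb "," #ct
#else
#define DCBTST(ct, ra, rb) "dcbtst " #ct "," #ra "," #rb
#endif
```

The same .S source can then be assembled with either -mcpu value, as Jimi wants, with only the config symbol changing.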
Re: [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
On Wed, 9 Jan 2013 17:32:26 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory (TODO)
7. unlock memory hotplug
All memory blocks must be offlined before removing memory. But we don't hold the lock across the whole operation, so we should check whether all memory blocks are offlined before step 6. Otherwise, the kernel may panic. Well, the obvious question is: why don't we hold lock_memory_hotplug() for all of steps 1-4? Please send the reasons for this in a form which I can paste into the changelog. Actually, I wonder if doing this would fix a race in the current remove_memory() repeat: loop. That code does a find_memory_block_hinted() followed by offline_memory_block(), but afaict find_memory_block_hinted() only does a get_device(). Is the get_device() sufficiently strong to prevent problems if another thread concurrently offlines or otherwise alters this memory_block's state?
Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
On Wed, 9 Jan 2013 17:32:28 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com When (hot)adding memory into the system, /sys/firmware/memmap/X/{end, start, type} sysfs files are created. But there is no code to remove these files. This patch implements the function to remove them. Note: The code does not free firmware_map_entry, which is allocated by bootmem. So the patch introduces a memory leak. But I think the leaked memory is very small, and it does not affect the system. ...
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	spin_lock(&map_entries_lock);
+	list_for_each_entry(entry, &map_entries, list)
+		if ((entry->start == start) && (entry->end == end) &&
+		    (!strcmp(entry->type, type))) {
+			spin_unlock(&map_entries_lock);
+			return entry;
+		}
+
+	spin_unlock(&map_entries_lock);
+	return NULL;
+}
...
+	entry = firmware_map_find_entry(start, end - 1, type);
+	if (!entry)
+		return -EINVAL;
+
+	firmware_map_remove_entry(entry);
...
The above code looks racy. After firmware_map_find_entry() does the spin_unlock() there is nothing to prevent a concurrent firmware_map_remove_entry() from removing the entry, so the kernel ends up calling firmware_map_remove_entry() twice against the same entry. An easy fix for this is to hold the spinlock across the entire lookup/remove operation. This problem is inherent to firmware_map_find_entry() as you have implemented it, so this function simply should not exist in its current form - no caller can use it without being buggy! A simple fix is to remove the spin_lock()/spin_unlock() from firmware_map_find_entry() and add locking documentation to it, explaining that the caller must hold map_entries_lock and must not release that lock until processing of firmware_map_find_entry()'s return value has completed.
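Andrew's suggested fix - lookup and removal inside one critical section - can be sketched in userspace, with a pthread mutex standing in for the kernel spinlock. The list handling is simplified and every name here is illustrative, not the real firmware_map code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct map_entry {
    unsigned long long start, end;
    const char *type;
    struct map_entry *next;
    int removed;
};

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static struct map_entry *map_head;

/* Lookup WITHOUT taking the lock: the caller must hold map_lock and
 * keep holding it until it is done with the returned entry.  This is
 * exactly the locking rule Andrew asks to have documented. */
static struct map_entry *map_find_entry(unsigned long long start,
                                        unsigned long long end,
                                        const char *type)
{
    struct map_entry *e;

    for (e = map_head; e; e = e->next)
        if (e->start == start && e->end == end && !strcmp(e->type, type))
            return e;
    return NULL;
}

/* Lookup and removal happen under one critical section, so a
 * concurrent caller can never remove the same entry twice. */
static int map_remove(unsigned long long start, unsigned long long end,
                      const char *type)
{
    struct map_entry **pp, *e;
    int ret = -1;  /* stands in for -EINVAL */

    pthread_mutex_lock(&map_lock);
    e = map_find_entry(start, end, type);
    if (e) {
        for (pp = &map_head; *pp; pp = &(*pp)->next)
            if (*pp == e) {
                *pp = e->next;
                e->removed = 1;
                ret = 0;
                break;
            }
    }
    pthread_mutex_unlock(&map_lock);
    return ret;
}
```

The second concurrent remover now finds the list already empty and bails out with an error instead of removing the same entry again.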
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Wed, 9 Jan 2013 17:32:24 +0800 Tang Chen tangc...@cn.fujitsu.com wrote: This patch-set aims to implement physical memory hot-removing. As you were on the patch delivery path, all of these patches should have your Signed-off-by:. But some were missing it. I fixed this in my copy of the patches. I suspect this patchset adds a significant amount of code which will not be used if CONFIG_MEMORY_HOTPLUG=n. [PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap, for example. This is not a good thing, so please go through the patchset (in fact, go through all the memhotplug code) and let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n kernels. This needn't be done immediately - it would be OK by me if you were to defer this exercise until all the new memhotplug code is largely in place. But please, let's do it.
[PATCH 1/6][v3] perf/Power7: Use macros to identify perf events
Define and use macros to identify perf events codes. This would make it easier and more readable when these event codes need to be used in more than one place. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- arch/powerpc/perf/power7-pmu.c | 28 1 files changed, 20 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c index 441af08..44e70d2 100644 --- a/arch/powerpc/perf/power7-pmu.c +++ b/arch/powerpc/perf/power7-pmu.c @@ -51,6 +51,18 @@ #define MMCR1_PMCSEL_MSK 0xff /* + * Power7 event codes. + */ +#definePME_PM_CYC 0x1e +#definePME_PM_GCT_NOSLOT_CYC 0x100f8 +#definePME_PM_CMPLU_STALL 0x4000a +#definePME_PM_INST_CMPL0x2 +#definePME_PM_LD_REF_L10xc880 +#definePME_PM_LD_MISS_L1 0x400f0 +#definePME_PM_BRU_FIN 0x10068 +#definePME_PM_BRU_MPRED0x400f6 + +/* * Layout of constraint bits: * 554433221100 * 3210987654321098765432109876543210987654321098765432109876543210 @@ -296,14 +308,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned long mmcr[]) } static int power7_generic_events[] = { - [PERF_COUNT_HW_CPU_CYCLES] = 0x1e, - [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */ - [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a, /* CMPLU_STALL */ - [PERF_COUNT_HW_INSTRUCTIONS] = 2, - [PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880, /* LD_REF_L1_LSU*/ - [PERF_COUNT_HW_CACHE_MISSES] = 0x400f0, /* LD_MISS_L1 */ - [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068, /* BRU_FIN */ - [PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,/* BR_MPRED */ + [PERF_COUNT_HW_CPU_CYCLES] =PME_PM_CYC, + [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = PME_PM_GCT_NOSLOT_CYC, + [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =PME_PM_CMPLU_STALL, + [PERF_COUNT_HW_INSTRUCTIONS] = PME_PM_INST_CMPL, + [PERF_COUNT_HW_CACHE_REFERENCES] = PME_PM_LD_REF_L1, + [PERF_COUNT_HW_CACHE_MISSES] = PME_PM_LD_MISS_L1, + [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = PME_PM_BRU_FIN, + [PERF_COUNT_HW_BRANCH_MISSES] = PME_PM_BRU_MPRED, }; #define C(x) 
PERF_COUNT_HW_CACHE_##x -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/6][v3] perf: Make EVENT_ATTR global
Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is available to all architectures. Further to allow architectures flexibility, have PMU_EVENT_ATTR() pass in the variable name as a parameter. Changelog[v3] - [Jiri Olsa] No need to define PMU_EVENT_PTR() Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- arch/x86/kernel/cpu/perf_event.c | 13 +++-- include/linux/perf_event.h | 11 +++ 2 files changed, 14 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 4428fd1..59a1238 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -1316,11 +1316,6 @@ static struct attribute_group x86_pmu_format_group = { .attrs = NULL, }; -struct perf_pmu_events_attr { - struct device_attribute attr; - u64 id; -}; - /* * Remove all undefined events (x86_pmu.event_map(id) == 0) * out of events_attr attributes. @@ -1354,11 +1349,9 @@ static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *at #define EVENT_VAR(_id) event_attr_##_id #define EVENT_PTR(_id) event_attr_##_id.attr.attr -#define EVENT_ATTR(_name, _id) \ -static struct perf_pmu_events_attr EVENT_VAR(_id) = { \ - .attr = __ATTR(_name, 0444, events_sysfs_show, NULL), \ - .id = PERF_COUNT_HW_##_id, \ -}; +#define EVENT_ATTR(_name, _id) \ + PMU_EVENT_ATTR(_name, EVENT_VAR(_id), PERF_COUNT_HW_##_id, \ + events_sysfs_show) EVENT_ATTR(cpu-cycles, CPU_CYCLES ); EVENT_ATTR(instructions, INSTRUCTIONS); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 6bfb2fa..42adf01 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -817,6 +817,17 @@ do { \ } while (0) +struct perf_pmu_events_attr { + struct device_attribute attr; + u64 id; +}; + +#define PMU_EVENT_ATTR(_name, _var, _id, _show) \ +static struct perf_pmu_events_attr _var = {\ + .attr = __ATTR(_name, 0444, _show, NULL), \ + .id = _id, \ +}; + #define PMU_FORMAT_ATTR(_name, _format) \ static 
ssize_t \ _name##_show(struct device *dev, \ -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 6/6][v3] perf: Document the ABI of perf sysfs entries
This patchset addes two new sets of files to sysfs: - generic and POWER-specific perf events in /sys/devices/cpu/events/ - perf event config format in /sys/devices/cpu/format/event Document the format of these files which would become part of the ABI. Changelog[v3]: [Greg KH] Include ABI documentation. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- Documentation/ABI/stable/sysfs-devices-cpu-events | 54 + Documentation/ABI/stable/sysfs-devices-cpu-format | 27 ++ 2 files changed, 81 insertions(+), 0 deletions(-) create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-format diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events index e69de29..f37d542 100644 --- a/Documentation/ABI/stable/sysfs-devices-cpu-events +++ b/Documentation/ABI/stable/sysfs-devices-cpu-events @@ -0,0 +1,54 @@ +What: /sys/devices/cpu/events/ + /sys/devices/cpu/events/branch-misses + /sys/devices/cpu/events/cache-references + /sys/devices/cpu/events/cache-misses + /sys/devices/cpu/events/stalled-cycles-frontend + /sys/devices/cpu/events/branch-instructions + /sys/devices/cpu/events/stalled-cycles-backend + /sys/devices/cpu/events/instructions + /sys/devices/cpu/events/cpu-cycles + +Date: 2013/01/08 + +Contact: Linux kernel mailing list linux-ker...@vger.kernel.org + +Description: Generic performance monitoring events + + A collection of performance monitoring events that may be + supported by many/most CPUs. These events can be monitored + using the 'perf(1)' tool. + + The contents of each file would look like: + + event=0x + + where 'N' is a hex digit. 
+ + +What: /sys/devices/cpu/events/PM_LD_MISS_L1 + /sys/devices/cpu/events/PM_LD_REF_L1 + /sys/devices/cpu/events/PM_CYC + /sys/devices/cpu/events/PM_BRU_FIN + /sys/devices/cpu/events/PM_GCT_NOSLOT_CYC + /sys/devices/cpu/events/PM_BRU_MPRED + /sys/devices/cpu/events/PM_INST_CMPL + /sys/devices/cpu/events/PM_CMPLU_STALL + +Date: 2013/01/08 + +Contact: Linux kernel mailing list linux-ker...@vger.kernel.org + Linux Powerpc mailing list linuxppc-...@ozlabs.org + +Description: POWER specific performance monitoring events + + A collection of performance monitoring events that may be + supported by the POWER CPU. These events can be monitored + using the 'perf(1)' tool. + + These events may not be supported by other CPUs. + + The contents of each file would look like: + + event=0x + + where 'N' is a hex digit. diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-format b/Documentation/ABI/stable/sysfs-devices-cpu-format new file mode 100644 index 000..b15cfb2 --- /dev/null +++ b/Documentation/ABI/stable/sysfs-devices-cpu-format @@ -0,0 +1,27 @@ +What: /sys/devices/cpu/format/ + /sys/devices/cpu/format/event + +Date: 2013/01/08 + +Contact: Linux kernel mailing list linux-ker...@vger.kernel.org + +Description: Format of performance monitoring events + + Each CPU/architecture may use different format to represent + the perf event. The 'event' file describes the configuration + format of the performance monitoring event on the CPU/system. + + The contents of each file would look like: + + config:m-n + + where m and n are the starting and ending bits that are + used to represent the event. + + For example, on POWER, + + $ cat /sys/devices/cpu/format/event + config:0-20 + + meaning that POWER uses the first 20-bits to represent a perf + event. -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/6][v3] perf/POWER7: Make generic event translations available in sysfs
Make the generic perf events in POWER7 available via sysfs. $ ls /sys/bus/event_source/devices/cpu/events branch-instructions branch-misses cache-misses cache-references cpu-cycles instructions stalled-cycles-backend stalled-cycles-frontend $ cat /sys/bus/event_source/devices/cpu/events/cache-misses event=0x400f0 This patch is based on commits that implement this functionality on x86. Eg: commit a47473939db20e3961b200eb00acf5fcf084d755 Author: Jiri Olsa jo...@redhat.com Date: Wed Oct 10 14:53:11 2012 +0200 perf/x86: Make hardware event translations available in sysfs Changelog:[v3] [Jiri Olsa] Drop EVENT_ID() macro since it is only used once. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- arch/powerpc/include/asm/perf_event_server.h | 24 ++ arch/powerpc/perf/core-book3s.c | 12 +++ arch/powerpc/perf/power7-pmu.c| 34 + 3 files changed, 70 insertions(+), 0 deletions(-) create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events new file mode 100644 index 000..e69de29 diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 9710be3..3f21d89 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -11,6 +11,7 @@ #include linux/types.h #include asm/hw_irq.h +#include linux/device.h #define MAX_HWEVENTS 8 #define MAX_EVENT_ALTERNATIVES 8 @@ -35,6 +36,7 @@ struct power_pmu { void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); int (*limited_pmc_event)(u64 event_id); u32 flags; + const struct attribute_group**attr_groups; int n_generic; int *generic_events; int (*cache_events)[PERF_COUNT_HW_CACHE_MAX] @@ -109,3 +111,25 @@ extern unsigned long perf_instruction_pointer(struct pt_regs *regs); * If an event_id is not subject to the constraint expressed by a particular * field, then it will have 0 in both the mask and value 
for that field. */ + +extern ssize_t power_events_sysfs_show(struct device *dev, + struct device_attribute *attr, char *page); + +/* + * EVENT_VAR() is same as PMU_EVENT_VAR with a suffix. + * + * Having a suffix allows us to have aliases in sysfs - eg: the generic + * event 'cpu-cycles' can have two entries in sysfs: 'cpu-cycles' and + * 'PM_CYC' where the latter is the name by which the event is known in + * POWER CPU specification. + */ +#defineEVENT_VAR(_id, _suffix) event_attr_##_id##_suffix +#defineEVENT_PTR(_id, _suffix) EVENT_VAR(_id, _suffix) + +#defineEVENT_ATTR(_name, _id, _suffix) \ + PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,\ + power_events_sysfs_show) + +#defineGENERIC_EVENT_ATTR(_name, _id) EVENT_ATTR(_name, _id, _g) +#defineGENERIC_EVENT_PTR(_id) EVENT_PTR(_id, _g) + diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index aa2465e..fa476d5 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -1305,6 +1305,16 @@ static int power_pmu_event_idx(struct perf_event *event) return event-hw.idx; } +ssize_t power_events_sysfs_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct perf_pmu_events_attr *pmu_attr; + + pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr); + + return sprintf(page, event=0x%02llx\n, pmu_attr-id); +} + struct pmu power_pmu = { .pmu_enable = power_pmu_enable, .pmu_disable= power_pmu_disable, @@ -1537,6 +1547,8 @@ int __cpuinit register_power_pmu(struct power_pmu *pmu) pr_info(%s performance monitor hardware support registered\n, pmu-name); + power_pmu.attr_groups = ppmu-attr_groups; + #ifdef MSR_HV /* * Use FCHV to ignore kernel events if MSR.HV is set. 
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c index 44e70d2..ae5d757 100644 --- a/arch/powerpc/perf/power7-pmu.c +++ b/arch/powerpc/perf/power7-pmu.c @@ -363,6 +363,39 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { }, }; + +GENERIC_EVENT_ATTR(cpu-cycles, CYC); +GENERIC_EVENT_ATTR(stalled-cycles-frontend,GCT_NOSLOT_CYC); +GENERIC_EVENT_ATTR(stalled-cycles-backend, CMPLU_STALL); +GENERIC_EVENT_ATTR(instructions, INST_CMPL); +GENERIC_EVENT_ATTR(cache-references,
[PATCH 4/6][v3] perf/POWER7: Make some POWER7 events available in sysfs
Make some POWER7-specific perf events available in sysfs. $ /bin/ls -1 /sys/bus/event_source/devices/cpu/events/ branch-instructions branch-misses cache-misses cache-references cpu-cycles instructions PM_BRU_FIN PM_BRU_MPRED PM_CMPLU_STALL PM_CYC PM_GCT_NOSLOT_CYC PM_INST_CMPL PM_LD_MISS_L1 PM_LD_REF_L1 stalled-cycles-backend stalled-cycles-frontend where the 'PM_*' events are POWER specific and the others are the generic events. This will enable users to specify these events with their symbolic names rather than with their raw code. perf stat -e 'cpu/PM_CYC/' ... Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- arch/powerpc/include/asm/perf_event_server.h |2 ++ arch/powerpc/perf/power7-pmu.c | 18 ++ 2 files changed, 20 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 3f21d89..b29fcc6 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -133,3 +133,5 @@ extern ssize_t power_events_sysfs_show(struct device *dev, #defineGENERIC_EVENT_ATTR(_name, _id) EVENT_ATTR(_name, _id, _g) #defineGENERIC_EVENT_PTR(_id) EVENT_PTR(_id, _g) +#definePOWER_EVENT_ATTR(_name, _id)EVENT_ATTR(PM_##_name, _id, _p) +#definePOWER_EVENT_PTR(_id)EVENT_PTR(_id, _p) diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c index ae5d757..5627940 100644 --- a/arch/powerpc/perf/power7-pmu.c +++ b/arch/powerpc/perf/power7-pmu.c @@ -373,6 +373,15 @@ GENERIC_EVENT_ATTR(cache-misses, LD_MISS_L1); GENERIC_EVENT_ATTR(branch-instructions,BRU_FIN); GENERIC_EVENT_ATTR(branch-misses, BRU_MPRED); +POWER_EVENT_ATTR(CYC, CYC); +POWER_EVENT_ATTR(GCT_NOSLOT_CYC, GCT_NOSLOT_CYC); +POWER_EVENT_ATTR(CMPLU_STALL, CMPLU_STALL); +POWER_EVENT_ATTR(INST_CMPL,INST_CMPL); +POWER_EVENT_ATTR(LD_REF_L1,LD_REF_L1); +POWER_EVENT_ATTR(LD_MISS_L1, LD_MISS_L1); +POWER_EVENT_ATTR(BRU_FIN, BRU_FIN) +POWER_EVENT_ATTR(BRU_MPRED,BRU_MPRED); + static 
struct attribute *power7_events_attr[] = { GENERIC_EVENT_PTR(CYC), GENERIC_EVENT_PTR(GCT_NOSLOT_CYC), @@ -382,6 +391,15 @@ static struct attribute *power7_events_attr[] = { GENERIC_EVENT_PTR(LD_MISS_L1), GENERIC_EVENT_PTR(BRU_FIN), GENERIC_EVENT_PTR(BRU_MPRED), + + POWER_EVENT_PTR(CYC), + POWER_EVENT_PTR(GCT_NOSLOT_CYC), + POWER_EVENT_PTR(CMPLU_STALL), + POWER_EVENT_PTR(INST_CMPL), + POWER_EVENT_PTR(LD_REF_L1), + POWER_EVENT_PTR(LD_MISS_L1), + POWER_EVENT_PTR(BRU_FIN), + POWER_EVENT_PTR(BRU_MPRED), NULL }; -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 5/6][v3] perf: Create a sysfs entry for Power event format
Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event',
which describes the format of a POWER cpu event. The event format is the
same for all POWER cpus (at least Power6 and Power7), so the bulk of this
change lives in code common to POWER cpus.

This code is based on corresponding code in x86.

Changelog[v2]:
	[Jiri Olsa] Use PMU_FORMAT_ATTR() rather than duplicating it.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |  6 ++++++
 arch/powerpc/perf/core-book3s.c              | 12 ++++++++++++
 arch/powerpc/perf/power7-pmu.c               |  1 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index b29fcc6..ee63205 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -135,3 +135,9 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 
 #define	POWER_EVENT_ATTR(_name, _id)	EVENT_ATTR(PM_##_name, _id, _p)
 #define	POWER_EVENT_PTR(_id)		EVENT_PTR(_id, _p)
+
+/*
+ * Format of a perf event is the same on all POWER cpus. Declare a
+ * common sysfs attribute group that individual POWER cpus can share.
+ */
+extern struct attribute_group power_pmu_format_group;
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index fa476d5..4ae044b 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1315,6 +1315,18 @@ ssize_t power_events_sysfs_show(struct device *dev,
 	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
 }
 
+PMU_FORMAT_ATTR(event, "config:0-20");
+
+static struct attribute *power_pmu_format_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+struct attribute_group power_pmu_format_group = {
+	.name = "format",
+	.attrs = power_pmu_format_attr,
+};
+
 struct pmu power_pmu = {
 	.pmu_enable	= power_pmu_enable,
 	.pmu_disable	= power_pmu_disable,
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 5627940..5fb3c9b 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -410,6 +410,7 @@ static struct attribute_group power7_pmu_events_group = {
 };
 
 static const struct attribute_group *power7_pmu_attr_groups[] = {
+	&power_pmu_format_group,
 	&power7_pmu_events_group,
 	NULL,
 };
-- 
1.7.1
Re: [PATCH] Added device tree binding for TDM and TDM phy
On 01/09/2013 01:10:24 AM, Singh Sandeep-B37400 wrote:
> A gentle reminder. Any comments are appreciated.
>
> Regards,
> Sandeep
>
> -----Original Message-----
> From: Singh Sandeep-B37400
> Sent: Wednesday, January 02, 2013 6:55 PM
> To: devicetree-disc...@lists.ozlabs.org; linuxppc-...@ozlabs.org
> Cc: Singh Sandeep-B37400; Aggrwal Poonam-B10812
> Subject: [PATCH] Added device tree binding for TDM and TDM phy
>
> This controller is available on many Freescale SOCs like MPC8315, P1020,
> P1010 and P1022
>
> Signed-off-by: Sandeep Singh sand...@freescale.com
> Signed-off-by: Poonam Aggrwal poonam.aggr...@freescale.com
> ---
>  .../devicetree/bindings/powerpc/fsl/fsl-tdm.txt | 63 ++++++++++++++
>  .../devicetree/bindings/powerpc/fsl/tdm-phy.txt | 38 ++++++++
>  2 files changed, 101 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
>  create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt
>
> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt b/Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
> new file mode 100644
> index 000..ceb2ef1
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
> @@ -0,0 +1,63 @@
> +TDM Device Tree Binding
> +
> +NOTE: The bindings described in this document are preliminary and
> +subject to change.
> +
> +TDM (Time Division Multiplexing)
> +
> +Description:
> +
> +The TDM is full duplex serial port designed to allow various devices
> +including digital signal processors (DSPs) to communicate with a
> +variety of serial devices including industry standard framers, codecs,
> +other DSPs and microprocessors.
> +
> +The below properties describe the device tree bindings for Freescale
> +TDM controller. This TDM controller is available on various Freescale
> +Processors like MPC8315, P1020, P1022 and P1010.
> +
> +Required properties:
> +
> +- compatible
> +Value type: string
> +Definition: Should contain "fsl,tdm1.0".
> +
> +- reg
> +Definition: A standard property. The first reg specifier describes the
> +TDM registers, and the second describes the TDM DMAC registers.
> +
> +- tdm_tx_clk
> +Value type: u32 or u64
> +Definition: This specifies the value of transmit clock. It should not
> +exceed 50Mhz.
> +
> +- tdm_rx_clk
> +Value type: u32 or u64
> +Definition: This specifies the value of receive clock. Its value could be
> +zero, in which case tdm will operate in shared mode. Its value should not
> +exceed 50Mhz.

Please don't use underscores in property names, and use the vendor
prefix: fsl,tdm-tx-clk and fsl,tdm-rx-clk.

> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt b/Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt
> new file mode 100644
> index 000..2563934
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt
> @@ -0,0 +1,38 @@
> +TDM PHY Device Tree Binding
> +
> +NOTE: The bindings described in this document are preliminary and
> +subject to change.
> +
> +Description:
> +TDM PHY is the terminal interface of TDM subsystem. It is typically a
> +line control device like E1/T1 framer or SLIC. A TDM device can have
> +multiple TDM PHYs.
> +
> +Required properties:
> +
> +- compatible
> +Value type: string
> +Definition: Should contain generic compatibility like "tdm-phy-slic" or
> +"tdm-phy-e1" or "tdm-phy-t1".

Does this generic string (plus the other properties) tell you all you
need to know about the device?  If there are other possible generic
compatibles, they should be listed or else different people will make up
different strings for the same thing.

-Scott
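A hypothetical example node illustrating the reviewer's suggested vendor-prefixed, hyphenated property names (the unit address, reg values, and clock rates below are invented for illustration, they are not taken from the patch):

```dts
tdm@16000 {
	compatible = "fsl,tdm1.0";
	/* first specifier: TDM registers; second: TDM DMAC registers */
	reg = <0x16000 0x200  0x2c000 0x2000>;
	/* vendor-prefixed names as requested in review */
	fsl,tdm-tx-clk = <2048000>;	/* 2.048 MHz, well under the 50 MHz limit */
	fsl,tdm-rx-clk = <0>;		/* zero selects shared mode */
};
```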
[PATCH 1/6][v3] perf/Power7: Use macros to identify perf events
Define and use macros to identify perf event codes. This makes the code
easier to read when these event codes need to be used in more than one
place.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/perf/power7-pmu.c | 28 ++++++++++++++++++----------
 1 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 441af08..44e70d2 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -51,6 +51,18 @@
 #define MMCR1_PMCSEL_MSK	0xff
 
 /*
+ * Power7 event codes.
+ */
+#define	PME_PM_CYC		0x1e
+#define	PME_PM_GCT_NOSLOT_CYC	0x100f8
+#define	PME_PM_CMPLU_STALL	0x4000a
+#define	PME_PM_INST_CMPL	0x2
+#define	PME_PM_LD_REF_L1	0xc880
+#define	PME_PM_LD_MISS_L1	0x400f0
+#define	PME_PM_BRU_FIN		0x10068
+#define	PME_PM_BRU_MPRED	0x400f6
+
+/*
  * Layout of constraint bits:
  *	6666555555555544444444443333333333222222222211111111110000000000
  *	3210987654321098765432109876543210987654321098765432109876543210
@@ -296,14 +308,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned long mmcr[])
 }
 
 static int power7_generic_events[] = {
-	[PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
-	[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
-	[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a,  /* CMPLU_STALL */
-	[PERF_COUNT_HW_INSTRUCTIONS] = 2,
-	[PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880,	/* LD_REF_L1_LSU */
-	[PERF_COUNT_HW_CACHE_MISSES] = 0x400f0,		/* LD_MISS_L1 */
-	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068,	/* BRU_FIN */
-	[PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,	/* BR_MPRED */
+	[PERF_COUNT_HW_CPU_CYCLES] =			PME_PM_CYC,
+	[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =	PME_PM_GCT_NOSLOT_CYC,
+	[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =	PME_PM_CMPLU_STALL,
+	[PERF_COUNT_HW_INSTRUCTIONS] =			PME_PM_INST_CMPL,
+	[PERF_COUNT_HW_CACHE_REFERENCES] =		PME_PM_LD_REF_L1,
+	[PERF_COUNT_HW_CACHE_MISSES] =			PME_PM_LD_MISS_L1,
+	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] =		PME_PM_BRU_FIN,
+	[PERF_COUNT_HW_BRANCH_MISSES] =			PME_PM_BRU_MPRED,
 };
 
 #define C(x)	PERF_COUNT_HW_CACHE_##x
-- 
1.7.1
[PATCH 2/6][v3] perf: Make EVENT_ATTR global
Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is
available to all architectures. Further, to give architectures
flexibility, have PMU_EVENT_ATTR() take the variable name as a
parameter.

Changelog[v3]:
	- [Jiri Olsa] No need to define PMU_EVENT_PTR()

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/kernel/cpu/perf_event.c | 13 +++++--------
 include/linux/perf_event.h       | 11 +++++++++++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4428fd1..59a1238 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1316,11 +1316,6 @@ static struct attribute_group x86_pmu_format_group = {
 	.attrs = NULL,
 };
 
-struct perf_pmu_events_attr {
-	struct device_attribute attr;
-	u64 id;
-};
-
 /*
  * Remove all undefined events (x86_pmu.event_map(id) == 0)
  * out of events_attr attributes.
@@ -1354,11 +1349,9 @@ static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *at
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
 
-#define EVENT_ATTR(_name, _id)					\
-static struct perf_pmu_events_attr EVENT_VAR(_id) = {		\
-	.attr = __ATTR(_name, 0444, events_sysfs_show, NULL),	\
-	.id = PERF_COUNT_HW_##_id,				\
-};
+#define EVENT_ATTR(_name, _id)					\
+	PMU_EVENT_ATTR(_name, EVENT_VAR(_id), PERF_COUNT_HW_##_id,	\
+			events_sysfs_show)
 
 EVENT_ATTR(cpu-cycles,		CPU_CYCLES	);
 EVENT_ATTR(instructions,	INSTRUCTIONS	);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2fa..42adf01 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -817,6 +817,17 @@
 do {									\
 } while (0)
 
+struct perf_pmu_events_attr {
+	struct device_attribute attr;
+	u64 id;
+};
+
+#define PMU_EVENT_ATTR(_name, _var, _id, _show)			\
+static struct perf_pmu_events_attr _var = {			\
+	.attr = __ATTR(_name, 0444, _show, NULL),		\
+	.id   = _id,						\
+};
+
 #define PMU_FORMAT_ATTR(_name, _format)				\
 static ssize_t							\
 _name##_show(struct device *dev,				\
-- 
1.7.1
[PATCH 3/6][v3] perf/POWER7: Make generic event translations available in sysfs
Make the generic perf events in POWER7 available via sysfs.

	$ ls /sys/bus/event_source/devices/cpu/events
	branch-instructions
	branch-misses
	cache-misses
	cache-references
	cpu-cycles
	instructions
	stalled-cycles-backend
	stalled-cycles-frontend

	$ cat /sys/bus/event_source/devices/cpu/events/cache-misses
	event=0x400f0

This patch is based on commits that implement this functionality on x86.
Eg:

	commit a47473939db20e3961b200eb00acf5fcf084d755
	Author: Jiri Olsa jo...@redhat.com
	Date:   Wed Oct 10 14:53:11 2012 +0200

	    perf/x86: Make hardware event translations available in sysfs

Changelog:[v3]
	[Jiri Olsa] Drop EVENT_ID() macro since it is only used once.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h | 24 ++++++++++++++++
 arch/powerpc/perf/core-book3s.c              | 12 ++++++++
 arch/powerpc/perf/power7-pmu.c               | 34 ++++++++++++++++++++
 3 files changed, 70 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events
new file mode 100644
index 000..e69de29
diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 9710be3..3f21d89 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -11,6 +11,7 @@
 
 #include <linux/types.h>
 #include <asm/hw_irq.h>
+#include <linux/device.h>
 
 #define MAX_HWEVENTS		8
 #define MAX_EVENT_ALTERNATIVES	8
@@ -35,6 +36,7 @@ struct power_pmu {
 	void		(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
 	int		(*limited_pmc_event)(u64 event_id);
 	u32		flags;
+	const struct attribute_group	**attr_groups;
 	int		n_generic;
 	int		*generic_events;
 	int		(*cache_events)[PERF_COUNT_HW_CACHE_MAX]
@@ -109,3 +111,25 @@ extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
  * If an event_id is not subject to the constraint expressed by a particular
  * field, then it will have 0 in both the mask and value for that field.
  */
+
+extern ssize_t power_events_sysfs_show(struct device *dev,
+				struct device_attribute *attr, char *page);
+
+/*
+ * EVENT_VAR() is same as PMU_EVENT_VAR with a suffix.
+ *
+ * Having a suffix allows us to have aliases in sysfs - eg: the generic
+ * event 'cpu-cycles' can have two entries in sysfs: 'cpu-cycles' and
+ * 'PM_CYC' where the latter is the name by which the event is known in
+ * POWER CPU specification.
+ */
+#define	EVENT_VAR(_id, _suffix)		event_attr_##_id##_suffix
+#define	EVENT_PTR(_id, _suffix)		EVENT_VAR(_id, _suffix)
+
+#define	EVENT_ATTR(_name, _id, _suffix)				\
+	PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,	\
+			power_events_sysfs_show)
+
+#define	GENERIC_EVENT_ATTR(_name, _id)	EVENT_ATTR(_name, _id, _g)
+#define	GENERIC_EVENT_PTR(_id)		EVENT_PTR(_id, _g)
+
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index aa2465e..fa476d5 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1305,6 +1305,16 @@ static int power_pmu_event_idx(struct perf_event *event)
 	return event->hw.idx;
 }
 
+ssize_t power_events_sysfs_show(struct device *dev,
+				struct device_attribute *attr, char *page)
+{
+	struct perf_pmu_events_attr *pmu_attr;
+
+	pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+
+	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
+}
+
 struct pmu power_pmu = {
 	.pmu_enable	= power_pmu_enable,
 	.pmu_disable	= power_pmu_disable,
@@ -1537,6 +1547,8 @@ int __cpuinit register_power_pmu(struct power_pmu *pmu)
 	pr_info("%s performance monitor hardware support registered\n",
 		pmu->name);
 
+	power_pmu.attr_groups = ppmu->attr_groups;
+
 #ifdef MSR_HV
 	/*
 	 * Use FCHV to ignore kernel events if MSR.HV is set.
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 44e70d2..ae5d757 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -363,6 +363,39 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
 	},
 };
 
+
+GENERIC_EVENT_ATTR(cpu-cycles,			CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-frontend,	GCT_NOSLOT_CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-backend,	CMPLU_STALL);
+GENERIC_EVENT_ATTR(instructions,		INST_CMPL);
+GENERIC_EVENT_ATTR(cache-references,
[PATCH 6/6][v3] perf: Document the ABI of perf sysfs entries This patchset addes two new sets of files to sysfs: - generic and POWER-specific perf events in /sys/devices/cpu/events/ - perf event config format in /sys/devices/cpu/format/event Document the format of these files which would become part of the ABI. Changelog[v3]: [Greg KH] Include ABI documentation. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- Documentation/ABI/stable/sysfs-devices-cpu-events | 54 + Documentation/ABI/stable/sysfs-devices-cpu-format | 27 ++ 2 files changed, 81 insertions(+), 0 deletions(-) create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-format diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events index e69de29..f37d542 100644 --- a/Documentation/ABI/stable/sysfs-devices-cpu-events +++ b/Documentation/ABI/stable/sysfs-devices-cpu-events @@ -0,0 +1,54 @@ +What: /sys/devices/cpu/events/ + /sys/devices/cpu/events/branch-misses + /sys/devices/cpu/events/cache-references + /sys/devices/cpu/events/cache-misses + /sys/devices/cpu/events/stalled-cycles-frontend + /sys/devices/cpu/events/branch-instructions + /sys/devices/cpu/events/stalled-cycles-backend + /sys/devices/cpu/events/instructions + /sys/devices/cpu/events/cpu-cycles + +Date: 2013/01/08 + +Contact: Linux kernel mailing list linux-ker...@vger.kernel.org + +Description: Generic performance monitoring events + + A collection of performance monitoring events that may be + supported by many/most CPUs. These events can be monitored + using the 'perf(1)' tool. + + The contents of each file would look like: + + event=0x + + where 'N' is a hex digit. 
+ + +What: /sys/devices/cpu/events/PM_LD_MISS_L1 + /sys/devices/cpu/events/PM_LD_REF_L1 + /sys/devices/cpu/events/PM_CYC + /sys/devices/cpu/events/PM_BRU_FIN + /sys/devices/cpu/events/PM_GCT_NOSLOT_CYC + /sys/devices/cpu/events/PM_BRU_MPRED + /sys/devices/cpu/events/PM_INST_CMPL + /sys/devices/cpu/events/PM_CMPLU_STALL + +Date: 2013/01/08 + +Contact: Linux kernel mailing list linux-ker...@vger.kernel.org + Linux Powerpc mailing list linuxppc-...@ozlabs.org + +Description: POWER specific performance monitoring events + + A collection of performance monitoring events that may be + supported by the POWER CPU. These events can be monitored + using the 'perf(1)' tool. + + These events may not be supported by other CPUs. + + The contents of each file would look like: + + event=0x + + where 'N' is a hex digit. diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-format b/Documentation/ABI/stable/sysfs-devices-cpu-format new file mode 100644 index 000..b15cfb2 --- /dev/null +++ b/Documentation/ABI/stable/sysfs-devices-cpu-format @@ -0,0 +1,27 @@ +What: /sys/devices/cpu/format/ + /sys/devices/cpu/format/event + +Date: 2013/01/08 + +Contact: Linux kernel mailing list linux-ker...@vger.kernel.org + +Description: Format of performance monitoring events + + Each CPU/architecture may use different format to represent + the perf event. The 'event' file describes the configuration + format of the performance monitoring event on the CPU/system. + + The contents of each file would look like: + + config:m-n + + where m and n are the starting and ending bits that are + used to represent the event. + + For example, on POWER, + + $ cat /sys/devices/cpu/format/event + config:0-20 + + meaning that POWER uses the first 20-bits to represent a perf + event. -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
Hi Glauber,

On 01/09/2013 11:09 PM, Glauber Costa wrote:
>> We try to make all page_cgroup allocations local to the node they are
>> describing now. If the memory is the first memory onlined in this node,
>> we will allocate it from the other node. For example, node1 has 4
>> memory blocks, 8-11, and we online it from 8 to 11:
>> 1. memory block 8: page_cgroup allocations are in the other nodes
>> 2. memory block 9: page_cgroup allocations are in memory block 8
>> So we should offline memory block 9 first. But we don't know in which
>> order the user onlined the memory blocks. I think we can modify memcg
>> like this: allocate the memory from the memory block they are
>> describing. I am not sure it is OK to do so.
>
> I don't see a reason why not.

I'm not sure, but if we do this, we could bring in a fragment for each
memory block (a memory section, 128MB, right?). Is this a problem when we
use large pages (such as 1GB pages)? Even if not, will these fragments
have any bad effects?

Thanks. :)

> You would have to tweak a bit the lookup function for page_cgroup, but
> assuming you will always have the pfns and limits, it should be easy to
> do. I think the only tricky part is that today we have a single
> node_page_cgroup, and we would of course have to have one per memory
> block. My assumption is that the number of memory blocks is limited and
> likely not very big. So even a static array would do.
>
> Kamezawa, do you have any input in here?
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Andrew,

Thank you very much for your pushing. :)

On 01/10/2013 06:23 AM, Andrew Morton wrote:
> This does sound like a significant problem. We should assume that memcg
> is available and in use.
>
>> In patch1, we provide a solution which is not good enough:
>> Iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>
> Let's flesh this out a bit. If we online memory8, memory9, memory10 and
> memory11 then I'd have thought that they would need to be offlined in
> reverse order, which will require four iterations, not two. Is this
> wrong and if so, why?

Well, we may need more than two iterations if memory8, memory9 and
memory10 are all in use by the kernel, and 10 depends on 9, and 9 depends
on 8. So, as you see here, the iteration method is not good enough. But
this only happens when the memory is used by the kernel, which cannot be
migrated. So if we use a boot option such as movablecore_map, or the
movable_online functionality, to limit the memory as movable, the kernel
will not use this memory, and it is safe when we are doing node
hot-remove.

> Also, what happens if we wish to offline only memory9? Do we offline
> memory11 then memory10 then memory9 and then re-online memory10 and
> memory11?

In this case, offlining memory9 could fail if the user does it himself,
for example via sysfs. But here we are in the memory hot-remove path, so
when we remove a memory device, it will automatically offline all pages,
and that happens in reverse order by itself. And again, this is not good
enough. We will figure out a reasonable way to solve it soon.

>> And a new idea from Wen Congyang <we...@cn.fujitsu.com> is: allocate
>> the memory from the memory block they are describing.

Yes. But we are not sure if it is OK to do so, because there is no
existing API for it, and we need to move the page_cgroup memory
allocation from MEM_GOING_ONLINE to MEM_ONLINE.

> This all sounds solvable - can we proceed in this fashion?

Yes, we are in progress now.

>> And also, it may interfere with hugepages.
>
> Please provide full details on this problem.

It is not very clear now, and if I find something, I'll share it out.

>> Note: if the memory provided by the memory device is used by the
>> kernel, it can't be offlined. It is not a bug.
>
> Right. But how often does this happen in testing? In other words, please
> provide an overall description of how well memory hot-remove is
> presently operating. Is it reliable? What is the success rate in
> real-world situations?

We test the hot-remove functionality mostly with movable_online used, and
the memory used by the kernel is not allowed to be removed. We will do
some tests on the kernel-memory offline cases, and tell you the test
results soon. And since we are trying out some other ways, I think the
problem will be solved soon.

> Are there precautions which the administrator can take to improve the
> success rate?

The administrator could use the movablecore_map boot option or the
movable_online functionality (which is now in the kernel) to limit memory
as movable, to avoid this problem.

> What are the remaining problems and are there plans to address them?

For now, we will try to allocate page_cgroup on the memory block it is
itself describing. All the other parts seem to work well now, and we are
still testing. If we find any problem, we will share it.

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Andrew,

On 01/10/2013 07:33 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:24 +0800 Tang Chen <tangc...@cn.fujitsu.com> wrote:
>> This patch-set aims to implement physical memory hot-removing.
>
> As you were on the patch delivery path, all of these patches should have
> your Signed-off-by:. But some were missing it. I fixed this in my copy
> of the patches.

Thank you very much for the help. Next time I'll add it myself.

> I suspect this patchset adds a significant amount of code which will not
> be used if CONFIG_MEMORY_HOTPLUG=n.
>
> [PATCH v6 06/15] memory-hotplug: implement
> register_page_bootmem_info_section of sparse-vmemmap, for example.
>
> This is not a good thing, so please go through the patchset (in fact, go
> through all the memhotplug code) and let's see if we can reduce the
> bloat for CONFIG_MEMORY_HOTPLUG=n kernels.
>
> This needn't be done immediately - it would be OK by me if you were to
> defer this exercise until all the new memhotplug code is largely in
> place. But please, let's do it.

OK, I'll have a check on it when the page_cgroup problem is solved.

Thanks. :)
Re: [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
Hi Andrew,

On 01/10/2013 06:50 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:29 +0800 Tang Chen <tangc...@cn.fujitsu.com> wrote:
>> For removing memory, we need to remove the page table. But this depends
>> on the architecture, so the patch introduces arch_remove_memory() for
>> removing the page table. Now it only calls __remove_pages().
>>
>> Note: __remove_pages() for some architectures is not implemented (I
>> don't know how to implement it for s390).
>
> Can this break the build for s390?

No, I don't think so. The arch_remove_memory() in s390 will only return
-EBUSY.

Thanks. :)
[PATCH 6/7] powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint registers
This is a rewrite so that we don't assume we are using the DABR throughout the code. We now use the arch_hw_breakpoint to store the breakpoint in a generic manner in the thread_struct, rather than storing the raw DABR value. The ptrace GET/SET_DEBUGREG interface currently passes the raw DABR in from userspace. We keep this functionality, so that future changes (like the POWER8 DAWR), will still fake the DABR to userspace. Signed-off-by: Michael Neuling mi...@neuling.org --- Resending to fix a problem with 8xx defconfigs. Noticed by benh. arch/powerpc/include/asm/debug.h | 15 +++--- arch/powerpc/include/asm/hw_breakpoint.h | 33 ++--- arch/powerpc/include/asm/processor.h |4 +- arch/powerpc/include/asm/reg.h |3 -- arch/powerpc/kernel/exceptions-64s.S |2 +- arch/powerpc/kernel/hw_breakpoint.c | 72 arch/powerpc/kernel/kgdb.c | 10 ++-- arch/powerpc/kernel/process.c| 75 +- arch/powerpc/kernel/ptrace.c | 60 +--- arch/powerpc/kernel/ptrace32.c |8 +++- arch/powerpc/kernel/signal.c |5 +- arch/powerpc/kernel/traps.c |4 +- arch/powerpc/mm/fault.c |4 +- arch/powerpc/xmon/xmon.c | 21 ++--- 14 files changed, 187 insertions(+), 129 deletions(-) diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h index 32de257..8d85ffb 100644 --- a/arch/powerpc/include/asm/debug.h +++ b/arch/powerpc/include/asm/debug.h @@ -4,6 +4,8 @@ #ifndef _ASM_POWERPC_DEBUG_H #define _ASM_POWERPC_DEBUG_H +#include asm/hw_breakpoint.h + struct pt_regs; extern struct dentry *powerpc_debugfs_root; @@ -15,7 +17,7 @@ extern int (*__debugger_ipi)(struct pt_regs *regs); extern int (*__debugger_bpt)(struct pt_regs *regs); extern int (*__debugger_sstep)(struct pt_regs *regs); extern int (*__debugger_iabr_match)(struct pt_regs *regs); -extern int (*__debugger_dabr_match)(struct pt_regs *regs); +extern int (*__debugger_break_match)(struct pt_regs *regs); extern int (*__debugger_fault_handler)(struct pt_regs *regs); #define DEBUGGER_BOILERPLATE(__NAME) \ @@ -31,7 +33,7 @@ 
DEBUGGER_BOILERPLATE(debugger_ipi) DEBUGGER_BOILERPLATE(debugger_bpt) DEBUGGER_BOILERPLATE(debugger_sstep) DEBUGGER_BOILERPLATE(debugger_iabr_match) -DEBUGGER_BOILERPLATE(debugger_dabr_match) +DEBUGGER_BOILERPLATE(debugger_break_match) DEBUGGER_BOILERPLATE(debugger_fault_handler) #else @@ -40,17 +42,18 @@ static inline int debugger_ipi(struct pt_regs *regs) { return 0; } static inline int debugger_bpt(struct pt_regs *regs) { return 0; } static inline int debugger_sstep(struct pt_regs *regs) { return 0; } static inline int debugger_iabr_match(struct pt_regs *regs) { return 0; } -static inline int debugger_dabr_match(struct pt_regs *regs) { return 0; } +static inline int debugger_break_match(struct pt_regs *regs) { return 0; } static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; } #endif -extern int set_dabr(unsigned long dabr, unsigned long dabrx); +int set_break(struct arch_hw_breakpoint *brk); #ifdef CONFIG_PPC_ADV_DEBUG_REGS extern void do_send_trap(struct pt_regs *regs, unsigned long address, unsigned long error_code, int signal_code, int brkpt); #else -extern void do_dabr(struct pt_regs *regs, unsigned long address, - unsigned long error_code); + +extern void do_break(struct pt_regs *regs, unsigned long address, +unsigned long error_code); #endif #endif /* _ASM_POWERPC_DEBUG_H */ diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h index 4234245..2c91faf 100644 --- a/arch/powerpc/include/asm/hw_breakpoint.h +++ b/arch/powerpc/include/asm/hw_breakpoint.h @@ -24,16 +24,30 @@ #define _PPC_BOOK3S_64_HW_BREAKPOINT_H #ifdef __KERNEL__ -#ifdef CONFIG_HAVE_HW_BREAKPOINT - struct arch_hw_breakpoint { unsigned long address; - unsigned long dabrx; - int type; - u8 len; /* length of the target data symbol */ - boolextraneous_interrupt; + u16 type; + u16 len; /* length of the target data symbol */ }; +/* Note: Don't change the the first 6 bits below as they are in the same order + * as the dabr and dabrx. 
+ */ +#define HW_BRK_TYPE_READ 0x01 +#define HW_BRK_TYPE_WRITE 0x02 +#define HW_BRK_TYPE_TRANSLATE 0x04 +#define HW_BRK_TYPE_USER 0x08 +#define HW_BRK_TYPE_KERNEL 0x10 +#define HW_BRK_TYPE_HYP0x20 +#define HW_BRK_TYPE_EXTRANEOUS_IRQ 0x80 + +/* bits that overlap with the bottom 3 bits of the dabr */ +#define HW_BRK_TYPE_RDWR (HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE) +#define HW_BRK_TYPE_DABR (HW_BRK_TYPE_RDWR |
Re: [TRIVIAL PATCH 11/26] powerpc: Convert print_symbol to %pSR
On Thu, 2012-12-13 at 11:58 +, Arnd Bergmann wrote:
> On Wednesday 12 December 2012, Joe Perches wrote:
>> Use the new vsprintf extension to avoid any possible message
>> interleaving. Convert the #ifdef DEBUG block to a single pr_debug.
>>
>> Signed-off-by: Joe Perches <j...@perches.com>
>
> nice cleanup!

... which also breaks the build :-(

> Acked-by: Arnd Bergmann <a...@arndb.de>

I'll fix it up locally.

Ben.
Re: [RFC PATCH powerpc ] Protect smp_processor_id() in arch_spin_unlock_wait()
On Mon, 2012-11-19 at 14:16 +0800, Li Zhong wrote:
> This patch tries to disable preemption for using smp_processor_id() in
> arch_spin_unlock_wait(), to avoid following report:

.../...

> diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
> index bb7cfec..7a7c31b 100644
> --- a/arch/powerpc/lib/locks.c
> +++ b/arch/powerpc/lib/locks.c
> @@ -72,8 +72,10 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
>  {
>  	while (lock->slock) {
>  		HMT_low();
> +		preempt_disable();
>  		if (SHARED_PROCESSOR)
>  			__spin_yield(lock);
> +		preempt_enable();
>  	}

I assume what you are protecting is the PACA access in SHARED_PROCESSOR,
or is there more?

In that case I'd say just make it use local_paca-> directly or something
like that. It doesn't matter if the access is racy, all processors will
have the same value for that field as far as I can tell.

Cheers,
Ben.
Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
Hi Andrew,

On 01/10/2013 06:49 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:28 +0800 Tang Chen <tangc...@cn.fujitsu.com> wrote:
>> When (hot)adding memory into the system, /sys/firmware/memmap/X/{end,
>> start, type} sysfs files are created. But there is no code to remove
>> these files. This patch implements the function to remove them.
>>
>> Note: The code does not free firmware_map_entry, which is allocated by
>> bootmem. So the patch introduces a memory leak. But I think the leaked
>> size is very small, and it does not affect the system.
>
> Well that's bad. Can we remember the address of that memory and then
> reuse the storage if/when the memory is re-added? That at least puts an
> upper bound on the leak.

I think we can do this. I'll post a new patch to do so.

Thanks. :)
Re: [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
Hi Andrew,

On 01/10/2013 07:11 AM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 17:32:26 +0800 Tang Chen <tangc...@cn.fujitsu.com> wrote:
>> We remove the memory like this:
>> 1. lock memory hotplug
>> 2. offline a memory block
>> 3. unlock memory hotplug
>> 4. repeat 1-3 to offline all memory blocks
>> 5. lock memory hotplug
>> 6. remove memory (TODO)
>> 7. unlock memory hotplug
>>
>> All memory blocks must be offlined before removing memory. But we don't
>> hold the lock over the whole operation, so we should check whether all
>> memory blocks are offlined before step 6. Otherwise, the kernel may
>> panic.
>
> Well, the obvious question is: why don't we hold lock_memory_hotplug()
> for all of steps 1-4? Please send the reasons for this in a form which I
> can paste into the changelog.

In changelog form:

Offlining a memory block and removing a memory device can be two
different operations. Users can just offline some memory blocks without
removing the memory device. For this purpose, the kernel has held
lock_memory_hotplug() in __offline_pages(). To reuse the code for memory
hot-remove, we repeat steps 1-3 to offline all the memory blocks,
repeatedly locking and unlocking memory hotplug, but we do not hold the
memory hotplug lock over the whole operation.

> Actually, I wonder if doing this would fix a race in the current
> remove_memory() repeat: loop. That code does a
> find_memory_block_hinted() followed by offline_memory_block(), but
> afaict find_memory_block_hinted() only does a get_device(). Is the
> get_device() sufficiently strong to prevent problems if another thread
> concurrently offlines or otherwise alters this memory_block's state?

I think we already have memory_block->state_mutex to protect concurrent
changes of a memory_block's state. The find_memory_block_hinted() here is
to find the memory_block corresponding to the memory section we are
dealing with.

Thanks. :)
Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs
Hi Andrew,

On 01/10/2013 07:19 AM, Andrew Morton wrote:
> ...
>> +	entry = firmware_map_find_entry(start, end - 1, type);
>> +	if (!entry)
>> +		return -EINVAL;
>> +
>> +	firmware_map_remove_entry(entry);
> ...
>
> The above code looks racy. After firmware_map_find_entry() does the
> spin_unlock() there is nothing to prevent a concurrent
> firmware_map_remove_entry() from removing the entry, so the kernel ends
> up calling firmware_map_remove_entry() twice against the same entry.
>
> An easy fix for this is to hold the spinlock across the entire
> lookup/remove operation. This problem is inherent to
> firmware_map_find_entry() as you have implemented it, so this function
> simply should not exist in the current form - no caller can use it
> without being buggy!
>
> A simple fix for this is to remove the spin_lock()/spin_unlock() from
> firmware_map_find_entry() and add locking documentation to
> firmware_map_find_entry(), explaining that the caller must hold
> map_entries_lock and must not release that lock until processing of
> firmware_map_find_entry()'s return value has completed.

Thank you for your advice, I'll fix it soon.

Since you have merged the patch-set, do I need to resend all these
patches again, or just send a patch to fix it based on the current one?

Thanks. :)
[PATCH] powerpc: Make room in exception vector area
The FWNMI region is fixed at 0x7000 and the vector are now overflowing that with some configurations. Fix that by moving some hash management code out of that region as it doesn't need to be that close to the call sites (isn't accessed using conditional branches). Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org --- arch/powerpc/kernel/exceptions-64s.S | 110 +- 1 file changed, 55 insertions(+), 55 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a28a65f..7a1c87c 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1180,6 +1180,61 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX) .globl __end_handlers __end_handlers: + /* Equivalents to the above handlers for relocation-on interrupt vectors */ + STD_RELON_EXCEPTION_HV(., 0xe00, h_data_storage) + KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe00) + STD_RELON_EXCEPTION_HV(., 0xe20, h_instr_storage) + KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe20) + STD_RELON_EXCEPTION_HV(., 0xe40, emulation_assist) + KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe40) + STD_RELON_EXCEPTION_HV(., 0xe60, hmi_exception) + KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe60) + MASKABLE_RELON_EXCEPTION_HV(., 0xe80, h_doorbell) + KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe80) + + STD_RELON_EXCEPTION_PSERIES(., 0xf00, performance_monitor) + STD_RELON_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable) + STD_RELON_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable) + +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) +/* + * Data area reserved for FWNMI option. + * This address (0x7000) is fixed by the RPA. + */ + .= 0x7000 + .globl fwnmi_data_area +fwnmi_data_area: + + /* pseries and powernv need to keep the whole page from +* 0x7000 to 0x8000 free for use by the firmware +*/ + . 
= 0x8000 +#endif /* defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) */ + +/* Space for CPU0's segment table */ + .balign 4096 + .globl initial_stab +initial_stab: + .space 4096 + +#ifdef CONFIG_PPC_POWERNV +_GLOBAL(opal_mc_secondary_handler) + HMT_MEDIUM_PPR_DISCARD + SET_SCRATCH0(r13) + GET_PACA(r13) + clrldi r3,r3,2 + tovirt(r3,r3) + std r3,PACA_OPAL_MC_EVT(r13) + ld r13,OPAL_MC_SRR0(r3) + mtspr SPRN_SRR0,r13 + ld r13,OPAL_MC_SRR1(r3) + mtspr SPRN_SRR1,r13 + ld r3,OPAL_MC_GPR3(r3) + GET_SCRATCH0(r13) + b machine_check_pSeries +#endif /* CONFIG_PPC_POWERNV */ + + /* * Hash table stuff */ @@ -1373,58 +1428,3 @@ _GLOBAL(do_stab_bolted) ld r13,PACA_EXSLB+EX_R13(r13) rfid b . /* prevent speculative execution */ - - - /* Equivalents to the above handlers for relocation-on interrupt vectors */ - STD_RELON_EXCEPTION_HV(., 0xe00, h_data_storage) - KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe00) - STD_RELON_EXCEPTION_HV(., 0xe20, h_instr_storage) - KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe20) - STD_RELON_EXCEPTION_HV(., 0xe40, emulation_assist) - KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe40) - STD_RELON_EXCEPTION_HV(., 0xe60, hmi_exception) - KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe60) - MASKABLE_RELON_EXCEPTION_HV(., 0xe80, h_doorbell) - KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe80) - - STD_RELON_EXCEPTION_PSERIES(., 0xf00, performance_monitor) - STD_RELON_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable) - STD_RELON_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable) - -#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) -/* - * Data area reserved for FWNMI option. - * This address (0x7000) is fixed by the RPA. - */ - .= 0x7000 - .globl fwnmi_data_area -fwnmi_data_area: - - /* pseries and powernv need to keep the whole page from -* 0x7000 to 0x8000 free for use by the firmware -*/ - . 
= 0x8000 -#endif /* defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) */ - -/* Space for CPU0's segment table */ - .balign 4096 - .globl initial_stab -initial_stab: - .space 4096 - -#ifdef CONFIG_PPC_POWERNV -_GLOBAL(opal_mc_secondary_handler) - HMT_MEDIUM_PPR_DISCARD - SET_SCRATCH0(r13) - GET_PACA(r13) - clrldi r3,r3,2 - tovirt(r3,r3) - std r3,PACA_OPAL_MC_EVT(r13) - ld r13,OPAL_MC_SRR0(r3) - mtspr SPRN_SRR0,r13 - ld r13,OPAL_MC_SRR1(r3) - mtspr SPRN_SRR1,r13 - ld r3,OPAL_MC_GPR3(r3) - GET_SCRATCH0(r13) - b machine_check_pSeries -#endif /* CONFIG_PPC_POWERNV */ ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/10/2013 06:17 AM, Tang Chen wrote:
>>> Note: if the memory provided by the memory device is used by the
>>> kernel, it can't be offlined. It is not a bug.
>>
>> Right. But how often does this happen in testing? In other words,
>> please provide an overall description of how well memory hot-remove is
>> presently operating. Is it reliable? What is the success rate in
>> real-world situations?
>
> We test the hot-remove functionality mostly with movable_online used.
> And the memory used by kernel is not allowed to be removed.

Can you try doing this using cpusets configured to hardwall? It is my
understanding that the object allocators will try hard not to allocate
anything outside the walls defined by a cpuset. Which means that if you
have one process per node, and they are hardwalled, your kernel memory
will be spread evenly among the machine. With a big enough load, it
should eventually be present in all blocks.

Another question I have for you: have you considered calling shrink_slab
to try to deplete the caches and therefore free at least slab memory in
the nodes that can't be offlined? Is it relevant?
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
(2013/01/10 16:14), Glauber Costa wrote:
> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>> Note: if the memory provided by the memory device is used by the
>>>> kernel, it can't be offlined. It is not a bug.
>>>
>>> Right. But how often does this happen in testing? In other words,
>>> please provide an overall description of how well memory hot-remove
>>> is presently operating. Is it reliable? What is the success rate in
>>> real-world situations?
>>
>> We test the hot-remove functionality mostly with movable_online used.
>> And the memory used by kernel is not allowed to be removed.
>
> Can you try doing this using cpusets configured to hardwall? It is my
> understanding that the object allocators will try hard not to allocate
> anything outside the walls defined by cpuset. Which means that if you
> have one process per node, and they are hardwalled, your kernel memory
> will be spread evenly among the machine. With a big enough load, they
> should eventually be present in all blocks.

I'm sorry, I couldn't catch your point. Do you want to confirm whether
cpusets can work well enough instead of ZONE_MOVABLE? Or do you want to
confirm whether ZONE_MOVABLE will not work if it's used with cpusets?

> Another question I have for you: Have you considered calling
> shrink_slab to try to deplete the caches and therefore free at least
> slab memory in the nodes that can't be offlined? Is it relevant?

At this stage, we don't consider calling shrink_slab(). We require nearly
100% success at offlining memory for removing a DIMM. That's my
understanding.

IMHO, I don't think shrink_slab() can kill all objects in a node even if
they are in some caches. We need more study for doing that.

Thanks,
-Kame
Re: [RFC PATCH powerpc ] Protect smp_processor_id() in arch_spin_unlock_wait()
On Thu, 2013-01-10 at 17:02 +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2012-11-19 at 14:16 +0800, Li Zhong wrote:
>> This patch tries to disable preemption for using smp_processor_id() in
>> arch_spin_unlock_wait(), to avoid following report:
>
> .../...
>
>> diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
>> index bb7cfec..7a7c31b 100644
>> --- a/arch/powerpc/lib/locks.c
>> +++ b/arch/powerpc/lib/locks.c
>> @@ -72,8 +72,10 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
>>  {
>>  	while (lock->slock) {
>>  		HMT_low();
>> +		preempt_disable();
>>  		if (SHARED_PROCESSOR)
>>  			__spin_yield(lock);
>> +		preempt_enable();
>>  	}
>
> I assume what you are protecting is the PACA access in SHARED_PROCESSOR,
> or is there more?

Yes, only the one in SHARED_PROCESSOR.

> In that case I'd say just make it use local_paca-> directly or something
> like that. It doesn't matter if the access is racy, all processors will
> have the same value for that field as far as I can tell.

It also seemed to me that all processors have the same value :). I'll
send an updated version based on your suggestion soon.

Thanks,
Zhong

> Cheers,
> Ben.
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
> (2013/01/10 16:14), Glauber Costa wrote:
>> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>>> Note: if the memory provided by the memory device is used by the
>>>>> kernel, it can't be offlined. It is not a bug.
>>>>
>>>> Right. But how often does this happen in testing? In other words,
>>>> please provide an overall description of how well memory hot-remove
>>>> is presently operating. Is it reliable? What is the success rate in
>>>> real-world situations?
>>>
>>> We test the hot-remove functionality mostly with movable_online used.
>>> And the memory used by kernel is not allowed to be removed.
>>
>> Can you try doing this using cpusets configured to hardwall? It is my
>> understanding that the object allocators will try hard not to allocate
>> anything outside the walls defined by cpuset. Which means that if you
>> have one process per node, and they are hardwalled, your kernel memory
>> will be spread evenly among the machine. With a big enough load, they
>> should eventually be present in all blocks.
>
> I'm sorry I couldn't catch your point. Do you want to confirm whether
> cpuset can work enough instead of ZONE_MOVABLE? Or do you want to
> confirm whether ZONE_MOVABLE will not work if it's used with cpuset?

No, I am not proposing to use cpusets to tackle the problem. I am just
wondering if you would still have high success rates with cpusets in use
with hardwalls. This is just one example of a workload that would spread
kernel memory around quite heavily. So this is just me trying to
understand the limitations of the mechanism.

>> Another question I have for you: Have you considered calling
>> shrink_slab to try to deplete the caches and therefore free at least
>> slab memory in the nodes that can't be offlined? Is it relevant?
>
> At this stage, we don't consider to call shrink_slab(). We require
> nearly 100% success at offlining memory for removing DIMM. It's my
> understanding.

Of course, this is indisputable.

> IMHO, I don't think shrink_slab() can kill all objects in a node even
> if they are in some caches. We need more study for doing that.

Indeed, shrink_slab can only kill cached objects. They, however, are
usually a very big part of kernel memory. I wonder, though, if in case of
failure it is worth it to try at least one shrink pass before you give
up. It is not very different from what is in memory-failure.c, except
that we could do better and do more targeted shrinking (support for that
is being worked on).