[PATCH v6 07/15] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()

2013-01-09 Thread Tang Chen
In __remove_section(), we took pgdat_resize_lock around the call to
sparse_remove_one_section(). This lock disables irqs. But we don't need to
hold it across the whole function: if we do some work to free pagetables in
free_section_usemap(), we need to call flush_tlb_all(), which requires irqs
to be enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many()
is triggered.
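The fix is easiest to see in miniature. Below is a userspace sketch (not kernel code; `struct section`, `resize_lock()` and `expensive_flush()` are illustrative stand-ins for pgdat_resize_lock() and flush_tlb_all()) of the pattern the patch applies: mutate the section metadata inside a narrow critical section, then run the flush only after the lock is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace analogy of the fix. The toy "irq-disabling" lock is just a
 * flag plus helpers, since the point is only where the critical section
 * ends relative to the flush. */

struct section {
	int irqs_disabled;	/* stands in for irq state under pgdat_resize_lock */
	void *mem_map;		/* stands in for ms->section_mem_map */
	int flushes;		/* counts post-unlock "tlb flushes" */
};

static void resize_lock(struct section *s)   { s->irqs_disabled = 1; }
static void resize_unlock(struct section *s) { s->irqs_disabled = 0; }

/* like flush_tlb_all(): must run with "irqs" enabled */
static void expensive_flush(struct section *s)
{
	assert(!s->irqs_disabled);	/* the WARN_ON_ONCE() analogue */
	s->flushes++;
}

static void remove_section(struct section *s)
{
	void *memmap;

	resize_lock(s);			/* narrow critical section: metadata only */
	memmap = s->mem_map;
	s->mem_map = NULL;
	resize_unlock(s);		/* drop the lock BEFORE the flush */

	if (memmap)
		expensive_flush(s);	/* would trip the assert if moved up */
}
```

Moving `expensive_flush()` inside the locked region makes the assert fire, which is exactly the shape of the warning the trace below shows.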

If we lock the whole of sparse_remove_one_section(), we hit the following
call trace:

[  454.796248] ------------[ cut here ]------------
[  454.851408] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
[  454.935620] Hardware name: PRIMEQUEST 1800E
..
[  455.652201] Call Trace:
[  455.681391]  [8106e73f] warn_slowpath_common+0x7f/0xc0
[  455.753151]  [810560a0] ? leave_mm+0x50/0x50
[  455.814527]  [8106e79a] warn_slowpath_null+0x1a/0x20
[  455.884208]  [810e7a9d] smp_call_function_many+0xbd/0x260
[  455.959082]  [810e7ecb] smp_call_function+0x3b/0x50
[  456.027722]  [810560a0] ? leave_mm+0x50/0x50
[  456.089098]  [810e7f4b] on_each_cpu+0x3b/0xc0
[  456.151512]  [81055f0c] flush_tlb_all+0x1c/0x20
[  456.216004]  [8104f8de] remove_pagetable+0x14e/0x1d0
[  456.285683]  [8104f978] vmemmap_free+0x18/0x20
[  456.349139]  [811b8797] sparse_remove_one_section+0xf7/0x100
[  456.427126]  [811c5fc2] __remove_section+0xa2/0xb0
[  456.494726]  [811c6070] __remove_pages+0xa0/0xd0
[  456.560258]  [81669c7b] arch_remove_memory+0x6b/0xc0
[  456.629937]  [8166ad28] remove_memory+0xb8/0xf0
[  456.694431]  [813e686f] acpi_memory_device_remove+0x53/0x96
[  456.771379]  [813b33c4] acpi_device_remove+0x90/0xb2
[  456.841059]  [8144b02c] __device_release_driver+0x7c/0xf0
[  456.915928]  [8144b1af] device_release_driver+0x2f/0x50
[  456.988719]  [813b4476] acpi_bus_remove+0x32/0x6d
[  457.055285]  [813b4542] acpi_bus_trim+0x91/0x102
[  457.120814]  [813b463b] acpi_bus_hot_remove_device+0x88/0x16b
[  457.199840]  [813afda7] acpi_os_execute_deferred+0x27/0x34
[  457.275756]  [81091ece] process_one_work+0x20e/0x5c0
[  457.345434]  [81091e5f] ? process_one_work+0x19f/0x5c0
[  457.417190]  [813afd80] ? acpi_os_wait_events_complete+0x23/0x23
[  457.499332]  [81093f6e] worker_thread+0x12e/0x370
[  457.565896]  [81093e40] ? manage_workers+0x180/0x180
[  457.635574]  [8109a09e] kthread+0xee/0x100
[  457.694871]  [810dfaf9] ? __lock_release+0x129/0x190
[  457.764552]  [81099fb0] ? __init_kthread_worker+0x70/0x70
[  457.839427]  [81690aac] ret_from_fork+0x7c/0xb0
[  457.903914]  [81099fb0] ? __init_kthread_worker+0x70/0x70
[  457.978784] ---[ end trace 25e85300f542aa01 ]---

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/memory_hotplug.c |    4 ----
 mm/sparse.c         |    5 ++++-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0682d2a..674e791 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -442,8 +442,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 #else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
-   unsigned long flags;
-   struct pglist_data *pgdat = zone->zone_pgdat;
int ret = -EINVAL;
 
if (!valid_section(ms))
@@ -453,9 +451,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
if (ret)
return ret;
 
-   pgdat_resize_lock(pgdat, &flags);
	sparse_remove_one_section(zone, ms);
-   pgdat_resize_unlock(pgdat, &flags);
return 0;
 }
 #endif
diff --git a/mm/sparse.c b/mm/sparse.c
index aadbb2a..05ca73a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -796,8 +796,10 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
struct page *memmap = NULL;
-   unsigned long *usemap = NULL;
+   unsigned long *usemap = NULL, flags;
+   struct pglist_data *pgdat = zone->zone_pgdat;
 
+   pgdat_resize_lock(pgdat, &flags);
	if (ms->section_mem_map) {
		usemap = ms->pageblock_flags;
		memmap = sparse_decode_mem_map(ms->section_mem_map,
@@ -805,6 +807,7 @@ void sparse_remove_one_section(struct zone *zone, struct 
mem_section *ms)
	ms->section_mem_map = 0;
	ms->pageblock_flags = NULL;
}
+   pgdat_resize_unlock(pgdat, &flags);
 
clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
free_section_usemap(memmap, usemap);
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2013-01-09 Thread Tang Chen
From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

When (hot)adding memory into the system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. This patch
implements the function to remove them.

Note: The code does not free firmware_map_entry which is allocated by bootmem,
  so the patch introduces a memory leak. But the leaked memory is very small,
  and it does not affect the system.
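The remove half of this patch follows a common pattern: entries sit on a list protected by a lock and are matched by their (start, end, type) triple. Below is a minimal userspace sketch of that pattern, using a pthread mutex in place of the kernel spinlock (all names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for firmware_map_entry / map_entries / map_entries_lock. */
struct map_entry {
	uint64_t start, end;
	const char *type;
	struct map_entry *next;
};

static struct map_entry *entries;
static pthread_mutex_t entries_lock = PTHREAD_MUTEX_INITIALIZER;

static void map_add(struct map_entry *e)
{
	pthread_mutex_lock(&entries_lock);
	e->next = entries;
	entries = e;
	pthread_mutex_unlock(&entries_lock);
}

/* match by the (start, end, type) triple, unlink while still holding the
 * lock, and hand the entry back to the caller for release */
static struct map_entry *map_remove(uint64_t start, uint64_t end,
				    const char *type)
{
	struct map_entry **p, *e;

	pthread_mutex_lock(&entries_lock);
	for (p = &entries; (e = *p) != NULL; p = &e->next) {
		if (e->start == start && e->end == end &&
		    strcmp(e->type, type) == 0) {
			*p = e->next;	/* unlink under the lock */
			break;
		}
	}
	pthread_mutex_unlock(&entries_lock);
	return e;	/* NULL if no entry matched */
}
```

Note one difference: the patch looks the entry up in firmware_map_find_entry() and unlinks it in a separate firmware_map_remove_entry() call, taking the lock twice; the sketch folds both into one critical section, which is the safer shape if concurrent removers were possible.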

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 drivers/firmware/memmap.c|   96 +-
 include/linux/firmware-map.h |6 +++
 mm/memory_hotplug.c  |5 ++-
 3 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 90723e6..4211da5 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 
 /*
 * Data types ------------------------------------------------------------------
@@ -79,7 +80,26 @@ static const struct sysfs_ops memmap_attr_ops = {
.show = memmap_attr_show,
 };
 
+
+static inline struct firmware_map_entry *
+to_memmap_entry(struct kobject *kobj)
+{
+   return container_of(kobj, struct firmware_map_entry, kobj);
+}
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+   struct firmware_map_entry *entry = to_memmap_entry(kobj);
+
+   if (PageReserved(virt_to_page(entry)))
+   /* There is no way to free memory allocated from bootmem */
+   return;
+
+   kfree(entry);
+}
+
 static struct kobj_type memmap_ktype = {
+   .release= release_firmware_map_entry,
.sysfs_ops  = memmap_attr_ops,
.default_attrs  = def_attrs,
 };
@@ -94,6 +114,7 @@ static struct kobj_type memmap_ktype = {
  * in firmware initialisation code in one single thread of execution.
  */
 static LIST_HEAD(map_entries);
+static DEFINE_SPINLOCK(map_entries_lock);
 
 /**
 * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
@@ -118,11 +139,25 @@ static int firmware_map_add_entry(u64 start, u64 end,
	INIT_LIST_HEAD(&entry->list);
	kobject_init(&entry->kobj, &memmap_ktype);
 
+   spin_lock(&map_entries_lock);
	list_add_tail(&entry->list, &map_entries);
+   spin_unlock(&map_entries_lock);
 
return 0;
 }
 
+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+   spin_lock(&map_entries_lock);
+   list_del(&entry->list);
+   spin_unlock(&map_entries_lock);
+}
+
 /*
  * Add memmap entry on sysfs
  */
@@ -144,6 +179,35 @@ static int add_sysfs_fw_map_entry(struct firmware_map_entry *entry)
return 0;
 }
 
+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+   kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+   struct firmware_map_entry *entry;
+
+   spin_lock(&map_entries_lock);
+   list_for_each_entry(entry, &map_entries, list)
+   if ((entry->start == start) && (entry->end == end) &&
+       (!strcmp(entry->type, type))) {
+   spin_unlock(&map_entries_lock);
+   return entry;
+   }
+
+   spin_unlock(&map_entries_lock);
+   return NULL;
+}
+
 /**
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
@@ -196,6 +260,32 @@ int __init firmware_map_add_early(u64 start, u64 end, const char *type)
return firmware_map_add_entry(start, end, type, entry);
 }
 
+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end:   End of the memory range.
+ * @type:  Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+   struct firmware_map_entry *entry;
+
+   entry = firmware_map_find_entry(start, end - 1, type);
+   if (!entry)
+   return -EINVAL;
+
+   firmware_map_remove_entry(entry);
+
+   /* remove the memmap entry */
+   remove_sysfs_fw_map_entry(entry);
+
+   return 0;
+}
+
 /*
  * Sysfs functions -------------------------------------------------------------
  */
@@ -217,8 +307,10 @@ static ssize_t type_show(struct firmware_map_entry *entry, char *buf)
return snprintf(buf, 

[PATCH v6 01/15] memory-hotplug: try to offline the memory twice to avoid dependence

2013-01-09 Thread Tang Chen
From: Wen Congyang we...@cn.fujitsu.com

Memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10,
and memory11, under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store page cgroups when
we online pages. When we online memory8, the memory that stores its page
cgroups is not provided by this memory device. But when we online memory9,
the memory that stores its page cgroups may be provided by memory8. So we
can't offline memory8 now; we should offline the memory in the reverse order.

When the memory device is hot-removed, we automatically offline the memory
provided by this memory device. But we don't know which memory was onlined
first, so offlining memory may fail. In that case, iterate twice to offline
the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.

This idea is suggested by KOSAKI Motohiro.
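As a sanity check of the two-pass idea, here is a self-contained userspace model. It is purely illustrative: the `depends_on` field is a toy stand-in for the page-cgroup dependency, and `offline_block()` for offline_memory_block(). The first pass tolerates failures; the second does not:

```c
#include <assert.h>

struct block {
	int online;
	int depends_on;	/* index of block whose page cgroups live here, or -1 */
};

/* a block can go offline only once no online block still depends on it */
static int offline_block(struct block *b, int n, int idx)
{
	for (int i = 0; i < n; i++)
		if (b[i].online && b[i].depends_on == idx)
			return -1;	/* still in use: -EBUSY in the kernel */
	b[idx].online = 0;
	return 0;
}

static int remove_blocks(struct block *b, int n)
{
	int return_on_error = 0, retry = 0;

repeat:
	for (int i = 0; i < n; i++) {
		if (!b[i].online)
			continue;
		if (offline_block(b, n, i)) {
			if (return_on_error)
				return -1;	/* 2nd pass: failure is fatal */
			retry = 1;		/* 1st pass: tolerate, revisit */
		}
	}
	if (retry) {
		return_on_error = 1;
		retry = 0;
		goto repeat;
	}
	return 0;
}
```

With memory9..11 depending on the primary block memory8, the first pass offlines everything except memory8, and the second pass can then offline memory8 itself.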

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
---
 mm/memory_hotplug.c |   16 ++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d04ed87..62e04c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
unsigned long start_pfn, end_pfn;
unsigned long pfn, section_nr;
int ret;
+   int return_on_error = 0;
+   int retry = 0;
 
start_pfn = PFN_DOWN(start);
end_pfn = start_pfn + PFN_DOWN(size);
 
+repeat:
	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
section_nr = pfn_to_section_nr(pfn);
if (!present_section_nr(section_nr))
@@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
 
ret = offline_memory_block(mem);
if (ret) {
-   kobject_put(&mem->dev.kobj);
-   return ret;
+   if (return_on_error) {
+   kobject_put(&mem->dev.kobj);
+   return ret;
+   } else {
+   retry = 1;
+   }
}
}
 
if (mem)
	kobject_put(&mem->dev.kobj);
 
+   if (retry) {
+   return_on_error = 1;
+   goto repeat;
+   }
+
return 0;
 }
 #else
-- 
1.7.1



[PATCH v6 03/15] memory-hotplug: remove redundant codes

2013-01-09 Thread Tang Chen
From: Wen Congyang we...@cn.fujitsu.com

Offlining memory blocks and checking whether memory blocks are offlined
are very similar. This patch introduces a new function to remove the
redundant code.
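The shape of the refactoring is a classic walker-plus-callback split. Here is a self-contained sketch (plain C over an array rather than mem sections; all names are illustrative) showing how one walker serves both the offline pass and the offlined check:

```c
#include <assert.h>
#include <stddef.h>

struct memory_block { int online; };

/* generic walker, mirroring walk_memory_range(): stop on first non-zero */
static int walk_blocks(struct memory_block *blocks, int n, void *arg,
		       int (*func)(struct memory_block *, void *))
{
	for (int i = 0; i < n; i++) {
		int ret = func(&blocks[i], arg);
		if (ret)
			return ret;
	}
	return 0;
}

/* callback 1: offline a block; stash the first error in *arg but return 0
 * so the walk visits every block, as offline_memory_block_cb does */
static int offline_cb(struct memory_block *mem, void *arg)
{
	int *first_err = arg;
	int error = 0;		/* pretend offlining succeeds in this model */

	mem->online = 0;
	if (error != 0 && *first_err == 0)
		*first_err = error;
	return 0;
}

/* callback 2: report failure as soon as one block is still online */
static int is_offlined_cb(struct memory_block *mem, void *arg)
{
	(void)arg;
	return mem->online;	/* non-zero aborts the walk */
}
```

The payoff is the same as in the patch: the iteration, lookup, and reference-dropping logic lives in one place, and each pass is reduced to a small callback.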

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/memory_hotplug.c |  129 --
 1 files changed, 82 insertions(+), 47 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5808045..69d62eb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1381,20 +1381,26 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
 }
 
-int remove_memory(u64 start, u64 size)
+/**
+ * walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
+ * @start_pfn: start pfn of the memory range
+ * @end_pfn: end pfn of the memory range
+ * @arg: argument passed to func
+ * @func: callback for each memory section walked
+ *
+ * This function walks through all present mem sections in range
+ * [start_pfn, end_pfn) and calls func on each mem section.
+ *
+ * Returns the return value of func.
+ */
+static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
+   void *arg, int (*func)(struct memory_block *, void *))
 {
struct memory_block *mem = NULL;
struct mem_section *section;
-   unsigned long start_pfn, end_pfn;
unsigned long pfn, section_nr;
int ret;
-   int return_on_error = 0;
-   int retry = 0;
-
-   start_pfn = PFN_DOWN(start);
-   end_pfn = start_pfn + PFN_DOWN(size);
 
-repeat:
	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
section_nr = pfn_to_section_nr(pfn);
if (!present_section_nr(section_nr))
@@ -1411,22 +1417,76 @@ repeat:
if (!mem)
continue;
 
-   ret = offline_memory_block(mem);
+   ret = func(mem, arg);
if (ret) {
-   if (return_on_error) {
-   kobject_put(&mem->dev.kobj);
-   return ret;
-   } else {
-   retry = 1;
-   }
+   kobject_put(&mem->dev.kobj);
+   return ret;
}
}
 
if (mem)
	kobject_put(&mem->dev.kobj);
 
-   if (retry) {
-   return_on_error = 1;
+   return 0;
+}
+
+/**
+ * offline_memory_block_cb - callback function for offlining memory block
+ * @mem: the memory block to be offlined
+ * @arg: buffer to hold error msg
+ *
+ * Always return 0, and put the error msg in arg if any.
+ */
+static int offline_memory_block_cb(struct memory_block *mem, void *arg)
+{
+   int *ret = arg;
+   int error = offline_memory_block(mem);
+
+   if (error != 0 && *ret == 0)
+   *ret = error;
+
+   return 0;
+}
+
+static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
+{
+   int ret = !is_memblock_offlined(mem);
+
+   if (unlikely(ret))
+   pr_warn("removing memory fails, because memory "
+   "[%#010llx-%#010llx] is onlined\n",
+   PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+   PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
+
+   return ret;
+}
+
+int remove_memory(u64 start, u64 size)
+{
+   unsigned long start_pfn, end_pfn;
+   int ret = 0;
+   int retry = 1;
+
+   start_pfn = PFN_DOWN(start);
+   end_pfn = start_pfn + PFN_DOWN(size);
+
+   /*
+* When CONFIG_MEMCG is on, one memory block may be used by other
+* blocks to store page cgroup when onlining pages. But we don't know
+* in what order pages are onlined. So we iterate twice to offline
+* memory:
+* 1st iterate: offline every non primary memory block.
+* 2nd iterate: offline primary (i.e. first added) memory block.
+*/
+repeat:
+   walk_memory_range(start_pfn, end_pfn, &ret,
+ offline_memory_block_cb);
+   if (ret) {
+   if (!retry)
+   return ret;
+
+   retry = 0;
+   ret = 0;
goto repeat;
}
 
@@ -1444,38 +1504,13 @@ repeat:
 * memory blocks are offlined.
 */
 
-   mem = NULL;
-   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-   section_nr = pfn_to_section_nr(pfn);
-   if (!present_section_nr(section_nr))
-   continue;
-
-   section = __nr_to_section(section_nr);
-   /* same memblock? */
-   if (mem)
-   if ((section_nr >= mem->start_section_nr) &&
-   

[PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Tang Chen
Here is the physical memory hot-remove patch-set based on 3.8rc-2.

This patch-set aims to implement physical memory hot-removing.

The patches can free/remove the following things:

  - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
  - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
  - page table of removed memory  : [RFC PATCH 7,8,10/15]
  - node and related sysfs files  : [RFC PATCH 13-15/15]


Existing problem:
If CONFIG_MEMCG is selected, we allocate memory to store page cgroups
when we online pages.

For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, when we online memory8, the memory that stores
its page cgroups is not provided by this memory device. But when we online
memory9, the memory that stores its page cgroups may be provided by memory8.
So we can't offline memory8 now; we should offline the memory in the reverse
order.

When the memory device is hot-removed, we automatically offline the memory
provided by this memory device. But we don't know which memory was onlined
first, so offlining memory may fail.

In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.

And a new idea from Wen Congyang we...@cn.fujitsu.com is:
allocate the memory from the memory block they are describing.

But we are not sure if it is OK to do so, because there is no existing API
for it, and we would need to move page_cgroup memory allocation from
MEM_GOING_ONLINE to MEM_ONLINE. Also, it may interfere with hugepages.



How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
   ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device (it depends on your hardware)
   You will see the memory device under the directory /sys/bus/acpi/devices/.
   Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
   You can write online/offline to /sys/devices/system/memory/memoryX/state to
   online/offline pages provided by this memory device
5. hotremove the memory device
   You can hotremove the memory device by the hardware, or writing 1 to
   /sys/bus/acpi/devices/PNP0C80:XX/eject.


Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.


Changelogs from v5 to v6:
 Patch3: Add some more comments to explain memory hot-remove.
 Patch4: Remove bootmem member in struct firmware_map_entry.
 Patch6: Repeatedly register bootmem pages when using hugepage.
 Patch8: Repeatedly free bootmem pages when using hugepage.
 Patch14: Don't free pgdat when offlining a node, just reset it to 0.
 Patch15: New patch; pgdat is not freed in patch14, so don't allocate a new
  one when onlining a node.

Changelogs from v4 to v5:
 Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
 avoid disabling irqs, because we need to flush the tlb when freeing pagetables.
 Patch8: new patch, pick up some common APIs that are used to free direct mapping
 and vmemmap pagetables.
 Patch9: free direct mapping pagetables on x86_64 arch.
 Patch10: free vmemmap pagetables.
 Patch11: since freeing memmap with vmemmap has been implemented, the config
  macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
  no longer needed.
 Patch13: no need to modify acpi_memory_disable_device() since it was removed,
  and add nid parameter when calling remove_memory().

Changelogs from v3 to v4:
 Patch7: remove unused code.
 Patch8: fix nr_pages that is passed to free_map_bootmem()

Changelogs from v2 to v3:
 Patch9: call sync_global_pgds() if pgd is changed
 Patch10: fix a problem in the patch

Changelogs from v1 to v2:
 Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
 memory block. 2nd iterate: offline primary (i.e. first added) memory
 block.

 Patch3: new patch, no logical change, just remove redundant code.

 Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
 after the pagetable is changed.

 Patch12: new patch, free node_data when a node is offlined.


Tang Chen (6):
  memory-hotplug: move pgdat_resize_lock into
sparse_remove_one_section()
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: Integrated __remove_section() of
CONFIG_SPARSEMEM_VMEMMAP.
  memory-hotplug: remove sysfs file of node
  memory-hotplug: Do not allocate pgdat if it was not freed when
offline.

Wen Congyang (5):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug: 

[PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture

2013-01-09 Thread Tang Chen
From: Wen Congyang we...@cn.fujitsu.com

For removing memory, we need to remove its page tables. But this depends on
the architecture, so this patch introduces arch_remove_memory() for removing
page tables. For now it only calls __remove_pages().

Note: __remove_pages() is not implemented for some architectures
  (I don't know how to implement it for s390).

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 arch/ia64/mm/init.c|   18 ++
 arch/powerpc/mm/mem.c  |   12 
 arch/s390/mm/init.c|   12 
 arch/sh/mm/init.c  |   17 +
 arch/tile/mm/init.c|8 
 arch/x86/mm/init_32.c  |   12 
 arch/x86/mm/init_64.c  |   15 +++
 include/linux/memory_hotplug.h |1 +
 mm/memory_hotplug.c|2 ++
 9 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b755ea9..20bc967 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -688,6 +688,24 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+   struct zone *zone;
+   int ret;
+
+   zone = page_zone(pfn_to_page(start_pfn));
+   ret = __remove_pages(zone, start_pfn, nr_pages);
+   if (ret)
+   pr_warn("%s: Problem encountered in __remove_pages() as "
+   "ret=%d\n", __func__, ret);
+
+   return ret;
+}
+#endif
 #endif
 
 /*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..09c6451 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+   struct zone *zone;
+
+   zone = page_zone(pfn_to_page(start_pfn));
+   return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ae672f4..49ce6bb 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -228,4 +228,16 @@ int arch_add_memory(int nid, u64 start, u64 size)
vmem_remove_mapping(start, size);
return rc;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+   /*
+* There is no hardware or firmware interface which could trigger a
+* hot memory remove on s390. So there is nothing that needs to be
+* implemented.
+*/
+   return -EBUSY;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 82cc576..1057940 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+   struct zone *zone;
+   int ret;
+
+   zone = page_zone(pfn_to_page(start_pfn));
+   ret = __remove_pages(zone, start_pfn, nr_pages);
+   if (unlikely(ret))
+   pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
+   ret);
+
+   return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..2749515 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
 {
return -EINVAL;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+   /* TODO */
+   return -EBUSY;
+}
+#endif
 #endif
 
 struct kmem_cache *pgd_cache;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 745d66b..3166e78 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -836,6 +836,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+   struct zone *zone;
+
+   zone = page_zone(pfn_to_page(start_pfn));
+   return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e779e0b..f78509c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,21 @@ int 

[PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove

2013-01-09 Thread Tang Chen
From: Wen Congyang we...@cn.fujitsu.com

When memory is removed, the corresponding pagetables should also be removed.
This patch introduces some common APIs to support vmemmap pagetable and x86_64
architecture pagetable removal.

Not all pages of a virtual mapping in removed memory can be freed, because a
page used as a PGD/PUD may cover not only the removed memory but also other
memory. So the patch uses the following way to check whether a page can be
freed or not:

 1. When removing memory, the page structs of the removed memory are filled
with 0xFD.
 2. If all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD can
be cleared. In this case, the page used as the PT/PMD can be freed.
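The freeability test in step 2 reduces to "is every slot on the page the 0xFD poison?". A minimal userspace sketch of just that check (PAGE_INUSE is the patch's constant; STRUCTS_PER_PAGE is an illustrative size, not the kernel's real geometry):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_INUSE 0xFD		/* same poison byte as the patch */
#define STRUCTS_PER_PAGE 64	/* illustrative, not the kernel's layout */

/* A pagetable page may be released only if every page-struct slot on it
 * carries the 0xFD poison, i.e. the whole page maps removed memory. */
static bool page_table_page_free(const unsigned char *slots)
{
	for (int i = 0; i < STRUCTS_PER_PAGE; i++)
		if (slots[i] != PAGE_INUSE)
			return false;	/* slot still backs live memory */
	return true;
}
```

A single non-poisoned slot keeps the whole page alive, which is why a PGD/PUD page shared between removed and surviving memory must not be freed.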

Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Jianguo Wu wujian...@huawei.com
Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/include/asm/pgtable_types.h |1 +
 arch/x86/mm/init_64.c|  299 ++
 arch/x86/mm/pageattr.c   |   47 +++---
 include/linux/bootmem.h  |1 +
 4 files changed, 326 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3c32db8..4b6fd2a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9ac1723..fe01116 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#define PAGE_INUSE 0xFD
+
+static void __meminit free_pagetable(struct page *page, int order)
+{
+   struct zone *zone;
+   bool bootmem = false;
+   unsigned long magic;
+   unsigned int nr_pages = 1 << order;
+
+   /* bootmem page has reserved flag */
+   if (PageReserved(page)) {
+   __ClearPageReserved(page);
+   bootmem = true;
+
+   magic = (unsigned long)page->lru.next;
+   if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+   while (nr_pages--)
+   put_page_bootmem(page++);
+   } else
+   __free_pages_bootmem(page, order);
+   } else
+   free_pages((unsigned long)page_address(page), order);
+
+   /*
+* SECTION_INFO pages and MIX_SECTION_INFO pages
+* are all allocated by bootmem.
+*/
+   if (bootmem) {
+   zone = page_zone(page);
+   zone_span_writelock(zone);
+   zone->present_pages += nr_pages;
+   zone_span_writeunlock(zone);
+   totalram_pages += nr_pages;
+   }
+}
+
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (pte_val(*pte))
+   return;
+   }
+
+   /* free a pte table */
+   free_pagetable(pmd_page(*pmd), 0);
+   spin_lock(&init_mm.page_table_lock);
+   pmd_clear(pmd);
+   spin_unlock(&init_mm.page_table_lock);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (pmd_val(*pmd))
+   return;
+   }
+
+   /* free a pmd table */
+   free_pagetable(pud_page(*pud), 0);
+   spin_lock(&init_mm.page_table_lock);
+   pud_clear(pud);
+   spin_unlock(&init_mm.page_table_lock);
+}
+
+/* Return true if pgd is changed, otherwise return false. */
+static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (pud_val(*pud))
+   return false;
+   }
+
+   /* free a pud table */
+   free_pagetable(pgd_page(*pgd), 0);
+   spin_lock(&init_mm.page_table_lock);
+   pgd_clear(pgd);
+   spin_unlock(&init_mm.page_table_lock);
+
+   return true;
+}
+
+static void __meminit
+remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
+bool direct)
+{
+   unsigned long next, pages = 0;
+   pte_t *pte;
+   void *page_addr;
+   phys_addr_t phys_addr;
+
+   pte = pte_start + pte_index(addr);
+   for (; addr < end; addr = next, pte++) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   if 

[PATCH v6 06/15] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap

2013-01-09 Thread Tang Chen
From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

To remove a memmap region of sparse-vmemmap that was allocated from bootmem,
the region needs to be registered by get_page_bootmem(). So the patch searches
the pages of the virtual mapping and registers them with get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Reviewed-by: Wu Jianguo wujian...@huawei.com
---
 arch/ia64/mm/discontig.c   |6 
 arch/powerpc/mm/init_64.c  |6 
 arch/s390/mm/vmem.c|6 
 arch/sparc/mm/init_64.c|6 
 arch/x86/mm/init_64.c  |   58 
 include/linux/memory_hotplug.h |   11 +--
 include/linux/mm.h |3 +-
 mm/memory_hotplug.c|   33 ---
 8 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index c641333..33943db 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 {
return vmemmap_populate_basepages(start_page, size, node);
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+   /* TODO */
+}
 #endif
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..6466440 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,11 @@ int __meminit vmemmap_populate(struct page *start_page,
 
return 0;
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+   /* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..2c14bc2 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,12 @@ out:
return ret;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+   /* TODO */
+}
+
 /*
  * Add memory segment to the segment list if it doesn't overlap with
  * an already present segment.
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index c3b7242..1f30db3 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2231,6 +2231,12 @@ void __meminit vmemmap_populate_print_last(void)
node_start = 0;
}
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+   /* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
 static void prot_init_common(unsigned long page_none,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f78509c..9ac1723 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1000,6 +1000,64 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
return 0;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+   unsigned long addr = (unsigned long)start_page;
+   unsigned long end = (unsigned long)(start_page + size);
+   unsigned long next;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   unsigned int nr_pages;
+   struct page *page;
+
+   for (; addr < end; addr = next) {
+   pte_t *pte = NULL;
+
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   continue;
+   }
+   get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud)) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   continue;
+   }
+   get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
+
+   if (!cpu_has_pse) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   continue;
+   get_page_bootmem(section_nr, pmd_page(*pmd),
+MIX_SECTION_INFO);
+
+   pte = pte_offset_kernel(pmd, addr);
+   if (pte_none(*pte))
+   continue;
+   get_page_bootmem(section_nr, pte_page(*pte),
+SECTION_INFO);
+   } else {
+   next = pmd_addr_end(addr, end);
+
+

[PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory

2013-01-09 Thread Tang Chen
From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory(TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing memory. But we don't hold
the lock for the whole operation. So we should check whether all memory blocks
are offlined before step 6. Otherwise, the kernel may panic.

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 drivers/base/memory.c  |6 +
 include/linux/memory_hotplug.h |1 +
 mm/memory_hotplug.c|   48 
 3 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 987604d..8300a18 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -693,6 +693,12 @@ int offline_memory_block(struct memory_block *mem)
return ret;
 }
 
+/* return true if the memory block is offlined, otherwise, return false */
+bool is_memblock_offlined(struct memory_block *mem)
+{
+   return mem->state == MEM_OFFLINE;
+}
+
 /*
  * Initialize the sysfs support for memory devices...
  */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4a45c4e..8dd0950 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -247,6 +247,7 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
+extern bool is_memblock_offlined(struct memory_block *mem);
 extern int remove_memory(u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
int nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 62e04c9..5808045 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1430,6 +1430,54 @@ repeat:
goto repeat;
}
 
+   lock_memory_hotplug();
+
+   /*
+* we have offlined all memory blocks like this:
+*   1. lock memory hotplug
+*   2. offline a memory block
+*   3. unlock memory hotplug
+*
+* repeat step1-3 to offline the memory block. All memory blocks
+* must be offlined before removing memory. But we don't hold the
+* lock in the whole operation. So we should check whether all
+* memory blocks are offlined.
+*/
+
+   mem = NULL;
+   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   section_nr = pfn_to_section_nr(pfn);
+   if (!present_section_nr(section_nr))
+   continue;
+
+   section = __nr_to_section(section_nr);
+   /* same memblock? */
+   if (mem)
+   if ((section_nr >= mem->start_section_nr) &&
+       (section_nr <= mem->end_section_nr))
+   continue;
+
+   mem = find_memory_block_hinted(section, mem);
+   if (!mem)
+   continue;
+
+   ret = is_memblock_offlined(mem);
+   if (!ret) {
+   pr_warn("removing memory fails, because memory "
+   "[%#010llx-%#010llx] is onlined\n",
+   PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+   PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
+
+   kobject_put(&mem->dev.kobj);
+   unlock_memory_hotplug();
+   return ret;
+   }
+   }
+
+   if (mem)
+   kobject_put(&mem->dev.kobj);
+   unlock_memory_hotplug();
+
return 0;
 }
 #else
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v6 09/15] memory-hotplug: remove page table of x86_64 architecture

2013-01-09 Thread Tang Chen
This patch searches the page tables covering the removed memory, and clears
those page tables, for the x86_64 architecture.

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Jianguo Wu wujian...@huawei.com
Signed-off-by: Jiang Liu jiang@huawei.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/mm/init_64.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index fe01116..d950f9b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -981,6 +981,15 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
flush_tlb_all();
 }
 
+void __meminit
+kernel_physical_mapping_remove(unsigned long start, unsigned long end)
+{
+   start = (unsigned long)__va(start);
+   end = (unsigned long)__va(end);
+
+   remove_pagetable(start, end, true);
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(u64 start, u64 size)
 {
@@ -990,6 +999,7 @@ int __ref arch_remove_memory(u64 start, u64 size)
int ret;
 
zone = page_zone(pfn_to_page(start_pfn));
+   kernel_physical_mapping_remove(start, start + size);
ret = __remove_pages(zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
 
-- 
1.7.1



[PATCH v6 13/15] memory-hotplug: remove sysfs file of node

2013-01-09 Thread Tang Chen
This patch introduces a new function try_offline_node() to
remove the sysfs file of a node when all memory sections of that
node are removed. If some memory sections of the node are
not removed, this function does nothing.

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 drivers/acpi/acpi_memhotplug.c |8 -
 include/linux/memory_hotplug.h |2 +-
 mm/memory_hotplug.c|   58 ++-
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index eb30e5a..9c53cc6 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -295,9 +295,11 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 
 static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 {
-   int result = 0;
+   int result = 0, nid;
struct acpi_memory_info *info, *n;
 
+   nid = acpi_get_node(mem_device->device->handle);
+
list_for_each_entry_safe(info, n, mem_device-res_list, list) {
if (info->failed)
/* The kernel does not use this memory block */
@@ -310,7 +312,9 @@ static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 */
return -EBUSY;
 
-   result = remove_memory(info->start_addr, info->length);
+   if (nid < 0)
+   nid = memory_add_physaddr_to_nid(info->start_addr);
+   result = remove_memory(nid, info->start_addr, info->length);
if (result)
return result;
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2441f36..f60e728 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -242,7 +242,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int remove_memory(u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
int nr_pages);
extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da20c14..a8703f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -29,6 +29,7 @@
 #include <linux/suspend.h>
 #include <linux/mm_inline.h>
 #include <linux/firmware-map.h>
+#include <linux/stop_machine.h>
 
 #include <asm/tlbflush.h>
 
@@ -1678,7 +1679,58 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
return ret;
 }
 
-int __ref remove_memory(u64 start, u64 size)
+static int check_cpu_on_node(void *data)
+{
+   struct pglist_data *pgdat = data;
+   int cpu;
+
+   for_each_present_cpu(cpu) {
+   if (cpu_to_node(cpu) == pgdat->node_id)
+   /*
+* the cpu on this node isn't removed, and we can't
+* offline this node.
+*/
+   return -EBUSY;
+   }
+
+   return 0;
+}
+
+/* offline the node if all memory sections of this node are removed */
+static void try_offline_node(int nid)
+{
+   unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+   unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+   unsigned long pfn;
+
+   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   unsigned long section_nr = pfn_to_section_nr(pfn);
+
+   if (!present_section_nr(section_nr))
+   continue;
+
+   if (pfn_to_nid(pfn) != nid)
+   continue;
+
+   /*
+* some memory sections of this node are not removed, and we
+* can't offline node now.
+*/
+   return;
+   }
+
+   if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+   return;
+
+   /*
+* all memory/cpu of this node are removed, we can offline this
+* node now.
+*/
+   node_set_offline(nid);
+   unregister_one_node(nid);
+}
+
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
unsigned long start_pfn, end_pfn;
int ret = 0;
@@ -1733,6 +1785,8 @@ repeat:
 
arch_remove_memory(start, size);
 
+   try_offline_node(nid);
+
unlock_memory_hotplug();
 
return 0;
@@ -1742,7 +1796,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
return -EINVAL;
 }
-int remove_memory(u64 start, u64 size)
+int remove_memory(int nid, u64 start, u64 size)
 {
  

[PATCH v6 10/15] memory-hotplug: remove memmap of sparse-vmemmap

2013-01-09 Thread Tang Chen
This patch introduces a new API vmemmap_free() to free and remove
vmemmap pagetables. Since page table implementations differ, each
architecture has to provide its own version of vmemmap_free(), just
like vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Jianguo Wu wujian...@huawei.com
Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/arm64/mm/mmu.c   |3 +++
 arch/ia64/mm/discontig.c  |4 
 arch/powerpc/mm/init_64.c |4 
 arch/s390/mm/vmem.c   |4 
 arch/sparc/mm/init_64.c   |4 
 arch/x86/mm/init_64.c |8 
 include/linux/mm.h|1 +
 mm/sparse.c   |3 ++-
 8 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a6885d8..9834886 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -392,4 +392,7 @@ int __meminit vmemmap_populate(struct page *start_page,
return 0;
 }
 #endif /* CONFIG_ARM64_64K_PAGES */
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 33943db..882a0fd 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -823,6 +823,10 @@ int __meminit vmemmap_populate(struct page *start_page,
return vmemmap_populate_basepages(start_page, size, node);
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
 {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 6466440..2969591 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -298,6 +298,10 @@ int __meminit vmemmap_populate(struct page *start_page,
return 0;
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
 {
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 2c14bc2..81e6ba3 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,10 @@ out:
return ret;
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
 {
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 1f30db3..5afe21a 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2232,6 +2232,10 @@ void __meminit vmemmap_populate_print_last(void)
}
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
 {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d950f9b..e829113 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1309,6 +1309,14 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
return 0;
 }
 
+void __ref vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+   unsigned long start = (unsigned long)memmap;
+   unsigned long end = (unsigned long)(memmap + nr_pages);
+
+   remove_pagetable(start, end, false);
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1eca498..31d5e5d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1709,6 +1709,7 @@ int vmemmap_populate_basepages(struct page *start_page, unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
+void vmemmap_free(struct page *memmap, unsigned long nr_pages);
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
  unsigned long size);
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 05ca73a..cff9796 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 }
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-   return; /* XXX: Not implemented yet */
+   vmemmap_free(memmap, nr_pages);
 }
 static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
 {
+   vmemmap_free(memmap, nr_pages);
 }
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
-- 
1.7.1


[PATCH v6 12/15] memory-hotplug: memory_hotplug: clear zone when removing the memory

2013-01-09 Thread Tang Chen
From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

When memory is added, we update the zone's and pgdat's start_pfn and
spanned_pages in the function __add_zone(). So we should revert them
when the memory is removed.

The patch adds a new function __remove_zone() to do this.

Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Wen Congyang we...@cn.fujitsu.com
---
 mm/memory_hotplug.c |  207 +++
 1 files changed, 207 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b20c4c7..da20c14 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -430,8 +430,211 @@ static int __meminit __add_section(int nid, struct zone *zone,
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
+/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
+static int find_smallest_section_pfn(int nid, struct zone *zone,
+unsigned long start_pfn,
+unsigned long end_pfn)
+{
+   struct mem_section *ms;
+
+   for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(start_pfn);
+
+   if (unlikely(!valid_section(ms)))
+   continue;
+
+   if (unlikely(pfn_to_nid(start_pfn) != nid))
+   continue;
+
+   if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+   continue;
+
+   return start_pfn;
+   }
+
+   return 0;
+}
+
+/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
+static int find_biggest_section_pfn(int nid, struct zone *zone,
+   unsigned long start_pfn,
+   unsigned long end_pfn)
+{
+   struct mem_section *ms;
+   unsigned long pfn;
+
+   /* pfn is the end pfn of a memory section. */
+   pfn = end_pfn - 1;
+   for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+
+   if (unlikely(!valid_section(ms)))
+   continue;
+
+   if (unlikely(pfn_to_nid(pfn) != nid))
+   continue;
+
+   if (zone && zone != page_zone(pfn_to_page(pfn)))
+   continue;
+
+   return pfn;
+   }
+
+   return 0;
+}
+
+static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+unsigned long end_pfn)
+{
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
+   unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+   unsigned long pfn;
+   struct mem_section *ms;
+   int nid = zone_to_nid(zone);
+
+   zone_span_writelock(zone);
+   if (zone_start_pfn == start_pfn) {
+   /*
+    * If the section is the smallest section in the zone, we need to
+    * shrink zone->zone_start_pfn and zone->spanned_pages.
+    * In this case, we find the second smallest valid mem_section
+    * for shrinking the zone.
+    */
+   pfn = find_smallest_section_pfn(nid, zone, end_pfn,
+   zone_end_pfn);
+   if (pfn) {
+   zone->zone_start_pfn = pfn;
+   zone->spanned_pages = zone_end_pfn - pfn;
+   }
+   } else if (zone_end_pfn == end_pfn) {
+   /*
+    * If the section is the biggest section in the zone, we need to
+    * shrink zone->spanned_pages.
+    * In this case, we find the second biggest valid mem_section for
+    * shrinking the zone.
+    */
+   pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+  start_pfn);
+   if (pfn)
+   zone->spanned_pages = pfn - zone_start_pfn + 1;
+   }
+
+   /*
+    * The section is not the biggest or smallest mem_section in the zone;
+    * it only creates a hole in the zone. So in this case, we need not
+    * change the zone. But perhaps the zone now consists of holes only.
+    * Thus we check whether the zone has any valid section left.
+    */
+   pfn = zone_start_pfn;
+   for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+
+   if (unlikely(!valid_section(ms)))
+   continue;
+
+   if (page_zone(pfn_to_page(pfn)) != zone)
+   continue;
+
+/* If the section is the current section, continue the loop */
+   if (start_pfn == pfn)
+   continue;
+
+   /* If we find valid section, we have nothing to do */
+   zone_span_writeunlock(zone);
+   return;
+   }
+
+   /* The zone has no 

[PATCH v6 14/15] memory-hotplug: free node_data when a node is offlined

2013-01-09 Thread Tang Chen
From: Wen Congyang we...@cn.fujitsu.com

We call hotadd_new_pgdat() to allocate memory to store node_data. So we
should free it when removing a node.

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/memory_hotplug.c |   30 +++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a8703f7..8b67752 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1699,9 +1699,12 @@ static int check_cpu_on_node(void *data)
 /* offline the node if all memory sections of this node are removed */
 static void try_offline_node(int nid)
 {
-   unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
-   unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+   pg_data_t *pgdat = NODE_DATA(nid);
+   unsigned long start_pfn = pgdat->node_start_pfn;
+   unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
unsigned long pfn;
+   struct page *pgdat_page = virt_to_page(pgdat);
+   int i;
 
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
unsigned long section_nr = pfn_to_section_nr(pfn);
@@ -1719,7 +1722,7 @@ static void try_offline_node(int nid)
return;
}
 
-   if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+   if (stop_machine(check_cpu_on_node, pgdat, NULL))
return;
 
/*
@@ -1728,6 +1731,27 @@ static void try_offline_node(int nid)
 */
node_set_offline(nid);
unregister_one_node(nid);
+
+   if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
+   /* node data is allocated from boot memory */
+   return;
+
+   /* free wait_table in each zone */
+   for (i = 0; i < MAX_NR_ZONES; i++) {
+   struct zone *zone = pgdat->node_zones + i;
+
+   if (zone->wait_table)
+   vfree(zone->wait_table);
+   }
+
+   /*
+    * Since there is no way to guarantee the address of pgdat/zone is not
+    * on the stack of any kernel threads or used by other kernel objects
+    * without reference counting or another synchronizing method, do not
+    * reset node_data and free pgdat here. Just reset it to 0 and reuse
+    * the memory when the node is online again.
+    */
+   memset(pgdat, 0, sizeof(*pgdat));
 }
 
 int __ref remove_memory(int nid, u64 start, u64 size)
-- 
1.7.1



[PATCH v6 11/15] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.

2013-01-09 Thread Tang Chen
Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 mm/memory_hotplug.c |   11 ---
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 674e791..b20c4c7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -430,16 +430,6 @@ static int __meminit __add_section(int nid, struct zone *zone,
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static int __remove_section(struct zone *zone, struct mem_section *ms)
-{
-   /*
-* XXX: Freeing memmap with vmemmap is not implement yet.
-*  This should be removed later.
-*/
-   return -EBUSY;
-}
-#else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
int ret = -EINVAL;
@@ -454,7 +444,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
sparse_remove_one_section(zone, ms);
return 0;
 }
-#endif
 
 /*
  * Reasonably generic function for adding memory.  It is
-- 
1.7.1



[PATCH v6 15/15] memory-hotplug: Do not allocate pgdat if it was not freed when offlined.

2013-01-09 Thread Tang Chen
Since there is no way to guarantee the address of pgdat/zone is not
on the stack of any kernel threads or used by other kernel objects
without reference counting or another synchronizing method, we cannot
reset node_data and free pgdat when offlining a node. Just reset pgdat
to 0 and reuse the memory when the node is online again.

The problem is suggested by Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
The idea is from Wen Congyang we...@cn.fujitsu.com

NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
  will be triggered.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
Reviewed-by: Wen Congyang we...@cn.fujitsu.com
---
 mm/memory_hotplug.c |   20 
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8b67752..8aa2b56 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1015,11 +1015,14 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
unsigned long zholes_size[MAX_NR_ZONES] = {0};
unsigned long start_pfn = start >> PAGE_SHIFT;
 
-   pgdat = arch_alloc_nodedata(nid);
-   if (!pgdat)
-   return NULL;
+   pgdat = NODE_DATA(nid);
+   if (!pgdat) {
+   pgdat = arch_alloc_nodedata(nid);
+   if (!pgdat)
+   return NULL;
 
-   arch_refresh_nodedata(nid, pgdat);
+   arch_refresh_nodedata(nid, pgdat);
+   }
 
/* we can use NODE_DATA(nid) from here */
 
@@ -1072,7 +1075,7 @@ out:
 int __ref add_memory(int nid, u64 start, u64 size)
 {
pg_data_t *pgdat = NULL;
-   int new_pgdat = 0;
+   int new_pgdat = 0, new_node = 0;
struct resource *res;
int ret;
 
@@ -1083,12 +1086,13 @@ int __ref add_memory(int nid, u64 start, u64 size)
if (!res)
goto out;
 
-   if (!node_online(nid)) {
+   new_pgdat = NODE_DATA(nid) ? 0 : 1;
+   new_node = node_online(nid) ? 0 : 1;
+   if (new_node) {
pgdat = hotadd_new_pgdat(nid, start);
ret = -ENOMEM;
if (!pgdat)
goto error;
-   new_pgdat = 1;
}
 
/* call arch's memory hotadd */
@@ -1100,7 +1104,7 @@ int __ref add_memory(int nid, u64 start, u64 size)
/* we online node here. we can't roll back from here. */
node_set_online(nid);
 
-   if (new_pgdat) {
+   if (new_node) {
ret = register_one_node(nid);
/*
 * If sysfs file of new node can't create, cpu on the node
-- 
1.7.1



Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-09 Thread Michel Lespinasse
On Wed, Jan 09, 2013 at 02:32:56PM +1100, Benjamin Herrenschmidt wrote:
 Ok. I think at least you can move that construct:
 
 +   if (addr < SLICE_LOW_TOP) {
 +   slice = GET_LOW_SLICE_INDEX(addr);
 +   addr = (slice + 1) << SLICE_LOW_SHIFT;
 +   if (!(available.low_slices & (1u << slice)))
 +   continue;
 +   } else {
 +   slice = GET_HIGH_SLICE_INDEX(addr);
 +   addr = (slice + 1) << SLICE_HIGH_SHIFT;
 +   if (!(available.high_slices & (1u << slice)))
 +   continue;
 +   }
 
 Into some kind of helper. It will probably compile to the same thing but
 at least it's more readable and it will avoid a fuckup in the future if
 somebody changes the algorithm and forgets to update one of the
 copies :-)

All right, does the following look more palatable then ?
(didn't re-test it, though)

Signed-off-by: Michel Lespinasse wal...@google.com

---
 arch/powerpc/mm/slice.c |  123 ++-
 1 files changed, 78 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 999a74f25ebe..3e99c149271a 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -237,36 +237,69 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psiz
 #endif
 }
 
+/*
+ * Compute which slice addr is part of;
+ * set *boundary_addr to the start or end boundary of that slice
+ * (depending on 'end' parameter);
+ * return boolean indicating if the slice is marked as available in the
+ * 'available' slice_mask.
+ */
+static bool slice_scan_available(unsigned long addr,
+struct slice_mask available,
+int end,
+unsigned long *boundary_addr)
+{
+   unsigned long slice;
+   if (addr < SLICE_LOW_TOP) {
+   slice = GET_LOW_SLICE_INDEX(addr);
+   *boundary_addr = (slice + end) << SLICE_LOW_SHIFT;
+   return !!(available.low_slices & (1u << slice));
+   } else {
+   slice = GET_HIGH_SLICE_INDEX(addr);
+   *boundary_addr = (slice + end) ?
+   ((slice + end) << SLICE_HIGH_SHIFT) : SLICE_LOW_TOP;
+   return !!(available.high_slices & (1u << slice));
+   }
+}
+
 static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
  unsigned long len,
  struct slice_mask available,
  int psize)
 {
-   struct vm_area_struct *vma;
-   unsigned long addr;
-   struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
+   unsigned long addr, found, next_end;
+   struct vm_unmapped_area_info info;
 
-   addr = TASK_UNMAPPED_BASE;
-
-   for (;;) {
-   addr = _ALIGN_UP(addr, 1ul << pshift);
-   if ((TASK_SIZE - len) < addr)
-   break;
-   vma = find_vma(mm, addr);
-   BUG_ON(vma && (addr >= vma->vm_end));
+   info.flags = 0;
+   info.length = len;
+   info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
+   info.align_offset = 0;
 
-   mask = slice_range_to_mask(addr, len);
-   if (!slice_check_fit(mask, available)) {
-   if (addr < SLICE_LOW_TOP)
-   addr = _ALIGN_UP(addr + 1, 1ul << SLICE_LOW_SHIFT);
-   else
-   addr = _ALIGN_UP(addr + 1, 1ul << SLICE_HIGH_SHIFT);
+   addr = TASK_UNMAPPED_BASE;
+   while (addr < TASK_SIZE) {
+   info.low_limit = addr;
+   if (!slice_scan_available(addr, available, 1, &addr))
continue;
+
+ next_slice:
+   /*
+* At this point [info.low_limit; addr) covers
+* available slices only and ends at a slice boundary.
+* Check if we need to reduce the range, or if we can
+* extend it to cover the next available slice.
+*/
+   if (addr >= TASK_SIZE)
+   addr = TASK_SIZE;
+   else if (slice_scan_available(addr, available, 1, &next_end)) {
+   addr = next_end;
+   goto next_slice;
}
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
+   info.high_limit = addr;
+
+   found = vm_unmapped_area(info);
+   if (!(found & ~PAGE_MASK))
+   return found;
}
 
return -ENOMEM;
@@ -277,39 +310,39 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
   

Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence

2013-01-09 Thread Glauber Costa
On 12/30/2012 09:58 AM, Wen Congyang wrote:
 At 12/25/2012 04:35 PM, Glauber Costa Wrote:
 On 12/24/2012 04:09 PM, Tang Chen wrote:
 From: Wen Congyang we...@cn.fujitsu.com

 memory can't be offlined when CONFIG_MEMCG is selected.
 For example: there is a memory device on node 1. The address range
 is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
 and memory11 under the directory /sys/devices/system/memory/.

 If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
 when we online pages. When we online memory8, the memory stored page cgroup
 is not provided by this memory device. But when we online memory9, the 
 memory
 stored page cgroup may be provided by memory8. So we can't offline memory8
 now. We should offline the memory in the reversed order.

 When the memory device is hotremoved, we will auto offline memory provided
 by this memory device. But we don't know which memory is onlined first, so
 offlining memory may fail. In such case, iterate twice to offline the 
 memory.
 1st iterate: offline every non primary memory block.
 2nd iterate: offline primary (i.e. first added) memory block.

 This idea is suggested by KOSAKI Motohiro.

 Signed-off-by: Wen Congyang we...@cn.fujitsu.com

 Maybe there is something here that I am missing - I admit that I came
 late to this one, but this really sounds like a very ugly hack, that
 really has no place in here.

 Retrying, of course, may make sense, if we have reasonable belief that
 we may now succeed. If this is the case, you need to document - in the
 code - while is that.

 The memcg argument, however, doesn't really cut it. Why can't we make
 all page_cgroup allocations local to the node they are describing? If
 memcg is the culprit here, we should fix it, and not retry. If there is
 still any benefit in retrying, then we retry being very specific about why.
 
 We try to make all page_cgroup allocations local to the node they are 
 describing
 now. If the memory is the first memory onlined in this node, we will allocate
 it from the other node.
 
 For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
 1. memory block 8, page_cgroup allocations are in the other nodes
 2. memory block 9, page_cgroup allocations are in memory block 8
 
 So we should offline memory block 9 first. But we don't know in which order
 the user onlined the memory blocks.
 
 I think we can modify memcg like this:
 allocate the memory from the memory block they are describing
 
 I am not sure it is OK to do so.

I don't see a reason why not.

You would have to tweak a bit the lookup function for page_cgroup, but
assuming you will always have the pfns and limits, it should be easy to do.

I think the only tricky part is that today we have a single
node_page_cgroup, and we would of course have to have one per memory
block. My assumption is that the number of memory blocks is limited and
likely not very big. So even a static array would do.
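A minimal userspace sketch of the static-array idea above: look up a page's page_cgroup through one slot per memory block instead of a single per-node base pointer. Every name, size, and constant here is a hypothetical stand-in, not the kernel's page_cgroup API.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SHIFT 15                 /* assume 2^15 pages per memory block */
#define MAX_BLOCKS  64                 /* "even a static array would do" */

struct page_cgroup {
    unsigned long flags;               /* placeholder member */
};

static struct page_cgroup *block_page_cgroup[MAX_BLOCKS];

/* Called when a block's page_cgroup storage is allocated (ideally from the
 * memory block it describes, so it can be freed when that block goes away). */
static void register_block_page_cgroup(unsigned int block,
                                       struct page_cgroup *base)
{
    if (block < MAX_BLOCKS)
        block_page_cgroup[block] = base;
}

/* The tweaked lookup: pfn -> owning block -> offset within that block. */
static struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
{
    unsigned long block = pfn >> BLOCK_SHIFT;

    if (block >= MAX_BLOCKS || !block_page_cgroup[block])
        return NULL;
    return block_page_cgroup[block] + (pfn & ((1UL << BLOCK_SHIFT) - 1));
}

/* Tiny self-check exercising the two functions above. */
static int page_cgroup_selfcheck(void)
{
    static struct page_cgroup pool[1u << BLOCK_SHIFT];

    register_block_page_cgroup(1, pool);
    return lookup_page_cgroup((1ul << BLOCK_SHIFT) + 7) == &pool[7] &&
           lookup_page_cgroup(3) == NULL;
}
```

With this shape, freeing a block's page_cgroup storage on hot-remove is just clearing its slot, which is what makes the offline ordering problem go away.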

Kamezawa, do you have any input in here?
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/8] mm: use vm_unmapped_area() on parisc architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Update the parisc arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 2/8] mm: use vm_unmapped_area() on alpha architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Update the alpha arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 3/8] mm: use vm_unmapped_area() on frv architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Update the frv arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 4/8] mm: use vm_unmapped_area() on ia64 architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Update the ia64 arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 5/8] mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Update the ia64 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 6/8] mm: remove free_area_cache use in powerpc architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.

This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Update the powerpc slice_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 8/8] mm: remove free_area_cache

2013-01-09 Thread Rik van Riel

On 01/08/2013 08:28 PM, Michel Lespinasse wrote:

Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse wal...@google.com


Yay

Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-09 Thread Rik van Riel

On 01/09/2013 06:23 AM, Michel Lespinasse wrote:

On Wed, Jan 09, 2013 at 02:32:56PM +1100, Benjamin Herrenschmidt wrote:

Ok. I think at least you can move that construct:

+   if (addr < SLICE_LOW_TOP) {
+   slice = GET_LOW_SLICE_INDEX(addr);
+   addr = (slice + 1) << SLICE_LOW_SHIFT;
+   if (!(available.low_slices & (1u << slice)))
+   continue;
+   } else {
+   slice = GET_HIGH_SLICE_INDEX(addr);
+   addr = (slice + 1) << SLICE_HIGH_SHIFT;
+   if (!(available.high_slices & (1u << slice)))
+   continue;
+   }

Into some kind of helper. It will probably compile to the same thing but
at least it's more readable and it will avoid a fuckup in the future if
somebody changes the algorithm and forgets to update one of the
copies :-)
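A hypothetical sketch of the kind of helper being asked for here: one function that tests whether the slice containing addr is available and reports the next slice boundary, folding the duplicated low/high branches. The constants and mask layout are illustrative stand-ins for the powerpc definitions, not the real ones.

```c
#include <assert.h>

#define SLICE_LOW_SHIFT  28
#define SLICE_HIGH_SHIFT 40
#define SLICE_LOW_TOP    (1ULL << 32)

struct slice_mask {
    unsigned int low_slices;           /* one bit per low slice */
    unsigned long long high_slices;    /* one bit per high slice */
};

/* Returns nonzero if addr falls in an available slice; *next_addr is set to
 * the start of the following slice either way, so the caller's scan loop
 * can simply do: addr = next_addr; continue; */
static int slice_scan_available(unsigned long long addr,
                                struct slice_mask available,
                                unsigned long long *next_addr)
{
    unsigned int slice;

    if (addr < SLICE_LOW_TOP) {
        slice = (unsigned int)(addr >> SLICE_LOW_SHIFT);
        *next_addr = (unsigned long long)(slice + 1) << SLICE_LOW_SHIFT;
        return (available.low_slices >> slice) & 1;
    }
    slice = (unsigned int)(addr >> SLICE_HIGH_SHIFT);
    *next_addr = (unsigned long long)(slice + 1) << SLICE_HIGH_SHIFT;
    return (int)((available.high_slices >> slice) & 1);
}

/* Self-check: only low slice 1 is available. */
static int slice_selfcheck(void)
{
    struct slice_mask m = { 1u << 1, 0 };
    unsigned long long next;
    int ok;

    ok = slice_scan_available(1ULL << SLICE_LOW_SHIFT, m, &next);
    ok = ok && next == (2ULL << SLICE_LOW_SHIFT);
    ok = ok && !slice_scan_available(0, m, &next);
    return ok && next == (1ULL << SLICE_LOW_SHIFT);
}
```

Both call sites then share one copy of the indexing logic, which is the maintainability point being made.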


All right, does the following look more palatable, then?
(didn't re-test it, though)


Looks equivalent. I have also not tested :)


Signed-off-by: Michel Lespinasse wal...@google.com


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2013-01-09 Thread Jimi Xenidis

On Dec 18, 2012, at 10:31 AM, Peter Bergner berg...@vnet.ibm.com wrote:

 On Tue, 2012-12-18 at 07:28 -0600, Jimi Xenidis wrote:
 On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:
 Jimi, are you using an old binutils from before my patch that
 changed the operand order for these types of instructions?
 
   http://sourceware.org/ml/binutils/2009-02/msg00044.html
 
 Actually, this confused me as well, that embedded has the same instruction
 encoding but different mnemonic.
 
 The mnemonic is the same (ie, dcbtst), and yes, the encoding is the same.
 All that is different is the accepted operand ordering...and yes, it is
 very unfortunate the operand ordering is different between embedded and
 server. :(
 
 
 I was under the impression that the assembler made no instruction decisions
 based on CPU.  So your only hint would be that '0b' prefix.
 Does AS even see that?
 
 GAS definitely makes decisions based on CPU (ie, -mcpu option).  Below is
 the GAS code used in recognizing the dcbtst instruction.  This shows that
 the server operand ordering is enabled for POWER4 and later cpus while
 the embedded operand ordering is enabled for pre POWER4 cpus (yes, not
 exactly a server versus embedded trigger, but that's what we agreed on to
 mitigate breaking any old asm code out there).
 
  {"dcbtst", X(31,246),  X_MASK,  POWER4, PPCNONE, {RA0, RB, CT}},
  {"dcbtst", X(31,246),  X_MASK,  PPC|PPCVLE, POWER4, {CT, RA0, RB}},
 
 GAS doesn't look at how the operands are written to try and guess what
 operand ordering you are attempting to use.  Rather, it knows what ordering
 it expects and the values had better match that ordering.
 

I agree, but that means the same .S file cannot be compiled with both
-mcpu=e500mc and -mcpu=powerpc?
So either these files have to be Book3S versus Book3E --or-- we use a CPP macro 
to get them right.
FWIW, I prefer the latter which allows more code reuse.

-jx


 
 Peter
 
 
 



Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 Here is the physical memory hot-remove patch-set based on 3.8rc-2.
 
 This patch-set aims to implement physical memory hot-removing.
 
 The patches can free/remove the following things:
 
   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
   - memmap of sparse-vmemmap  : [PATCH 6,7,8,10/15]
   - page table of removed memory  : [RFC PATCH 7,8,10/15]
   - node and related sysfs files  : [RFC PATCH 13-15/15]
 
 
 Existing problem:
 If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
 when we online pages.
 
 For example: there is a memory device on node 1. The address range
 is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
 and memory11 under the directory /sys/devices/system/memory/.
 
 If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
 cgroup is not provided by this memory device. But when we online memory9, the
 memory stored page cgroup may be provided by memory8. So we can't offline
 memory8 now. We should offline the memory in the reversed order.
 
 When the memory device is hotremoved, we will auto offline memory provided
 by this memory device. But we don't know which memory is onlined first, so
 offlining memory may fail.

This does sound like a significant problem.  We should assume that
memcg is available and in use.

 In patch1, we provide a solution which is not good enough:
 Iterate twice to offline the memory.
 1st iterate: offline every non primary memory block.
 2nd iterate: offline primary (i.e. first added) memory block.

Let's flesh this out a bit.

If we online memory8, memory9, memory10 and memory11 then I'd have
thought that they would need to be offlined in reverse order, which will
require four iterations, not two.  Is this wrong and if so, why?

Also, what happens if we wish to offline only memory9?  Do we offline
memory11 then memory10 then memory9 and then re-online memory10 and
memory11?

 And a new idea from Wen Congyang we...@cn.fujitsu.com is:
 allocate the memory from the memory block they are describing.

Yes.

 But we are not sure if it is OK to do so because there is no existing API
 to do so, and we need to move page_cgroup memory allocation from 
 MEM_GOING_ONLINE
 to MEM_ONLINE.

This all sounds solvable - can we proceed in this fashion?

 And also, it may interfere with hugepages.

Please provide full details on this problem.

 Note: if the memory provided by the memory device is used by the kernel, it
 can't be offlined. It is not a bug.

Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?  Are there precautions which the administrator
can take to improve the success rate?  What are the remaining problems
and are there plans to address them?




[PATCH v2] powerpc/mm: eliminate unneeded for_each_memblock

2013-01-09 Thread Cody P Schafer
The only persistent change made by this loop is calling
memblock_set_node() once for each memblock, which is not useful (and has
no effect) as memblock_set_node() is not called with any
memblock-specific parameters.

Substitute a single memblock_set_node() call.

Signed-off-by: Cody P Schafer c...@linux.vnet.ibm.com
---

Now with a signoff & wrapped comment line.

 arch/powerpc/mm/mem.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..40df7c8 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -195,13 +195,10 @@ void __init do_init_bootmem(void)
	min_low_pfn = MEMORY_START >> PAGE_SHIFT;
	boot_mapsize = init_bootmem_node(NODE_DATA(0), start >> PAGE_SHIFT, 
min_low_pfn, max_low_pfn);
 
-   /* Add active regions with valid PFNs */
-   for_each_memblock(memory, reg) {
-   unsigned long start_pfn, end_pfn;
-   start_pfn = memblock_region_memory_base_pfn(reg);
-   end_pfn = memblock_region_memory_end_pfn(reg);
-   memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
-   }
+   /* Place all memblock_regions in the same node and merge contiguous
+* memblock_regions
+*/
+   memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
 
/* Add all physical memory to the bootmem map, mark each area
 * present.
-- 
1.8.0.3



Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:28 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
 sysfs files are created. But there is no code to remove these files. The patch
 implements the function to remove them.
 
 Note: The code does not free firmware_map_entry which is allocated by bootmem.
   So the patch introduces a memory leak. But I think the memory leak size is
   very small. And it does not affect the system.

Well that's bad.  Can we remember the address of that memory and then
reuse the storage if/when the memory is re-added?  That at least puts an upper
bound on the leak.



Re: [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:29 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 For removing memory, we need to remove page table. But it depends
 on architecture. So the patch introduces arch_remove_memory() for
 removing page table. Now it only calls __remove_pages().
 
 Note: __remove_pages() for some architecture is not implemented
   (I don't know how to implement it for s390).

Can this break the build for s390?




Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2013-01-09 Thread Peter Bergner
On Wed, 2013-01-09 at 16:19 -0600, Jimi Xenidis wrote:
 I agree, but that means the same .S file cannot be compiled with both
 -mcpu=e500mc and -mcpu=powerpc?  So either these files have to be Book3S
 versus Book3E --or-- we use a CPP macro to get them right.
 FWIW, I prefer the latter which allows more code reuse.

I agree using a CPP macro - like we do for new instructions for which some
older assemblers might not support yet - is probably the best solution.

Peter
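A hypothetical sketch of the CPP-macro approach both prefer: one macro emits dcbtst with whichever operand order the target assembler expects, so a single .S file builds for both server and embedded targets. The config symbol and macro name are illustrative, not from the kernel tree; the operand orders follow the GAS table quoted earlier in the thread.

```c
#ifdef CONFIG_PPC_BOOK3E
#define DCBTST(ct, ra, rb)	dcbtst	ct, ra, rb	/* embedded (pre-POWER4) order */
#else
#define DCBTST(ct, ra, rb)	dcbtst	ra, rb, ct	/* server (POWER4+) order */
#endif
```

The .S files then use `DCBTST(...)` everywhere, keeping one copy of the code and pushing the Book3S/Book3E difference into a single header.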




Re: [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:26 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 We remove the memory like this:
 1. lock memory hotplug
 2. offline a memory block
 3. unlock memory hotplug
 4. repeat 1-3 to offline all memory blocks
 5. lock memory hotplug
 6. remove memory(TODO)
 7. unlock memory hotplug
 
 All memory blocks must be offlined before removing memory. But we don't hold
 the lock in the whole operation. So we should check whether all memory blocks
 are offlined before step 6. Otherwise, the kernel may panic.

Well, the obvious question is: why don't we hold lock_memory_hotplug()
for all of steps 1-4?  Please send the reasons for this in a form which
I can paste into the changelog.


Actually, I wonder if doing this would fix a race in the current
remove_memory() repeat: loop.  That code does a
find_memory_block_hinted() followed by offline_memory_block(), but
afaict find_memory_block_hinted() only does a get_device().  Is the
get_device() sufficiently strong to prevent problems if another thread
concurrently offlines or otherwise alters this memory_block's state?


Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:28 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
 
 When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
 sysfs files are created. But there is no code to remove these files. The patch
 implements the function to remove them.
 
 Note: The code does not free firmware_map_entry which is allocated by bootmem.
   So the patch introduces a memory leak. But I think the memory leak size is
   very small. And it does not affect the system.
 
 ...

 +static struct firmware_map_entry * __meminit
 +firmware_map_find_entry(u64 start, u64 end, const char *type)
 +{
 + struct firmware_map_entry *entry;
 +
 + spin_lock(&map_entries_lock);
 + list_for_each_entry(entry, &map_entries, list)
 + if ((entry->start == start) && (entry->end == end) &&
 + (!strcmp(entry->type, type))) {
 + spin_unlock(&map_entries_lock);
 + return entry;
 + }
 +
 + spin_unlock(&map_entries_lock);
 + return NULL;
 +}

 ...

 + entry = firmware_map_find_entry(start, end - 1, type);
 + if (!entry)
 + return -EINVAL;
 +
 + firmware_map_remove_entry(entry);

 ...


The above code looks racy.  After firmware_map_find_entry() does the
spin_unlock() there is nothing to prevent a concurrent
firmware_map_remove_entry() from removing the entry, so the kernel ends
up calling firmware_map_remove_entry() twice against the same entry.

An easy fix for this is to hold the spinlock across the entire
lookup/remove operation.


This problem is inherent to firmware_map_find_entry() as you have
implemented it, so this function simply should not exist in the current
form - no caller can use it without being buggy!  A simple fix for this
is to remove the spin_lock()/spin_unlock() from
firmware_map_find_entry() and add locking documentation to
firmware_map_find_entry(), explaining that the caller must hold
map_entries_lock and must not release that lock until processing of
firmware_map_find_entry()'s return value has completed.
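A userspace model of the fix suggested above: firmware_map_find_entry() does no locking itself and documents that the caller must hold map_entries_lock across the whole lookup-plus-removal. The "lock" here is a plain flag standing in for the kernel spinlock, and all names are loose stand-ins for the patch, not the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct firmware_map_entry {
    unsigned long long start, end;
    const char *type;
    struct firmware_map_entry *next;
};

static int map_entries_lock;           /* stand-in for the spinlock */
static struct firmware_map_entry *map_entries;

static void map_lock(void)   { assert(!map_entries_lock); map_entries_lock = 1; }
static void map_unlock(void) { assert(map_entries_lock);  map_entries_lock = 0; }

/* Caller must hold map_entries_lock and must not drop it until it has
 * finished using the returned entry. */
static struct firmware_map_entry *
firmware_map_find_entry(unsigned long long start, unsigned long long end,
                        const char *type)
{
    struct firmware_map_entry *e;

    assert(map_entries_lock);          /* enforce the documented locking rule */
    for (e = map_entries; e; e = e->next)
        if (e->start == start && e->end == end && !strcmp(e->type, type))
            return e;
    return NULL;
}

/* Lookup and unlink form one critical section, so a racing caller cannot
 * remove the same entry twice. */
static int firmware_map_remove(unsigned long long start, unsigned long long end,
                               const char *type)
{
    struct firmware_map_entry **pp, *victim;
    int ret = -1;

    map_lock();
    victim = firmware_map_find_entry(start, end, type);
    for (pp = &map_entries; victim && *pp; pp = &(*pp)->next)
        if (*pp == victim) {
            *pp = victim->next;
            ret = 0;
            break;
        }
    map_unlock();
    return ret;
}

/* Self-check: removing the same entry twice fails the second time. */
static int fwmap_selfcheck(void)
{
    static struct firmware_map_entry e = { 0x1000, 0x1fff, "System RAM", NULL };

    map_entries = &e;
    return firmware_map_remove(0x1000, 0x1fff, "System RAM") == 0 &&
           firmware_map_remove(0x1000, 0x1fff, "System RAM") == -1;
}
```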


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Andrew Morton
On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chen tangc...@cn.fujitsu.com wrote:

 This patch-set aims to implement physical memory hot-removing.

As you were on the patch delivery path, all of these patches should have
your Signed-off-by:.  But some were missing it.  I fixed this in my
copy of the patches.


I suspect this patchset adds a significant amount of code which will
not be used if CONFIG_MEMORY_HOTPLUG=n.  [PATCH v6 06/15]
memory-hotplug: implement register_page_bootmem_info_section of
sparse-vmemmap, for example.  This is not a good thing, so please go
through the patchset (in fact, go through all the memhotplug code) and
let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
kernels.

This needn't be done immediately - it would be OK by me if you were to
defer this exercise until all the new memhotplug code is largely in
place.  But please, let's do it.




[PATCH 1/6][v3] perf/Power7: Use macros to identify perf events

2013-01-09 Thread sukadev
Define and use macros to identify perf events codes. This would make it
easier and more readable when these event codes need to be used in more
than one place.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/perf/power7-pmu.c |   28 
 1 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 441af08..44e70d2 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -51,6 +51,18 @@
 #define MMCR1_PMCSEL_MSK   0xff
 
 /*
+ * Power7 event codes.
+ */
+#define PME_PM_CYC             0x1e
+#define PME_PM_GCT_NOSLOT_CYC  0x100f8
+#define PME_PM_CMPLU_STALL     0x4000a
+#define PME_PM_INST_CMPL       0x2
+#define PME_PM_LD_REF_L1       0xc880
+#define PME_PM_LD_MISS_L1      0x400f0
+#define PME_PM_BRU_FIN         0x10068
+#define PME_PM_BRU_MPRED       0x400f6
+
+/*
  * Layout of constraint bits:
  * 554433221100
  * 3210987654321098765432109876543210987654321098765432109876543210
@@ -296,14 +308,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned 
long mmcr[])
 }
 
 static int power7_generic_events[] = {
-   [PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
-   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
-   [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a,  /* CMPLU_STALL */
-   [PERF_COUNT_HW_INSTRUCTIONS] = 2,
-   [PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880,  /* LD_REF_L1_LSU*/
-   [PERF_COUNT_HW_CACHE_MISSES] = 0x400f0, /* LD_MISS_L1   */
-   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068,  /* BRU_FIN  */
-   [PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,/* BR_MPRED */
+   [PERF_COUNT_HW_CPU_CYCLES] =PME_PM_CYC,
+   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =   PME_PM_GCT_NOSLOT_CYC,
+   [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =PME_PM_CMPLU_STALL,
+   [PERF_COUNT_HW_INSTRUCTIONS] =  PME_PM_INST_CMPL,
+   [PERF_COUNT_HW_CACHE_REFERENCES] =  PME_PM_LD_REF_L1,
+   [PERF_COUNT_HW_CACHE_MISSES] =  PME_PM_LD_MISS_L1,
+   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] =   PME_PM_BRU_FIN,
+   [PERF_COUNT_HW_BRANCH_MISSES] = PME_PM_BRU_MPRED,
 };
 
 #define C(x)   PERF_COUNT_HW_CACHE_##x
-- 
1.7.1



[PATCH 2/6][v3] perf: Make EVENT_ATTR global

2013-01-09 Thread sukadev
Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is
available to all architectures.

Further to allow architectures flexibility, have PMU_EVENT_ATTR() pass
in the variable name as a parameter.

Changelog[v3]
- [Jiri Olsa] No need to define PMU_EVENT_PTR()

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/kernel/cpu/perf_event.c |   13 +++--
 include/linux/perf_event.h   |   11 +++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4428fd1..59a1238 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1316,11 +1316,6 @@ static struct attribute_group x86_pmu_format_group = {
.attrs = NULL,
 };
 
-struct perf_pmu_events_attr {
-   struct device_attribute attr;
-   u64 id;
-};
-
 /*
  * Remove all undefined events (x86_pmu.event_map(id) == 0)
  * out of events_attr attributes.
@@ -1354,11 +1349,9 @@ static ssize_t events_sysfs_show(struct device *dev, 
struct device_attribute *at
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) event_attr_##_id.attr.attr
 
-#define EVENT_ATTR(_name, _id) \
-static struct perf_pmu_events_attr EVENT_VAR(_id) = {  \
-   .attr = __ATTR(_name, 0444, events_sysfs_show, NULL),   \
-   .id   =  PERF_COUNT_HW_##_id,   \
-};
+#define EVENT_ATTR(_name, _id) \
+   PMU_EVENT_ATTR(_name, EVENT_VAR(_id), PERF_COUNT_HW_##_id,  \
+   events_sysfs_show)
 
 EVENT_ATTR(cpu-cycles, CPU_CYCLES  );
 EVENT_ATTR(instructions,   INSTRUCTIONS);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2fa..42adf01 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -817,6 +817,17 @@ do {   
\
 } while (0)
 
 
+struct perf_pmu_events_attr {
+   struct device_attribute attr;
+   u64 id;
+};
+
+#define PMU_EVENT_ATTR(_name, _var, _id, _show)
\
+static struct perf_pmu_events_attr _var = {\
+   .attr = __ATTR(_name, 0444, _show, NULL),   \
+   .id   =  _id,   \
+};
+
 #define PMU_FORMAT_ATTR(_name, _format)
\
 static ssize_t \
 _name##_show(struct device *dev,   \
-- 
1.7.1



[PATCH 6/6][v3] perf: Document the ABI of perf sysfs entries

2013-01-09 Thread sukadev
This patchset adds two new sets of files to sysfs:

- generic and POWER-specific perf events in /sys/devices/cpu/events/
- perf event config format in /sys/devices/cpu/format/event

Document the format of these files which would become part of the ABI.

Changelog[v3]:
[Greg KH] Include ABI documentation.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 Documentation/ABI/stable/sysfs-devices-cpu-events |   54 +
 Documentation/ABI/stable/sysfs-devices-cpu-format |   27 ++
 2 files changed, 81 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-format

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events 
b/Documentation/ABI/stable/sysfs-devices-cpu-events
index e69de29..f37d542 100644
--- a/Documentation/ABI/stable/sysfs-devices-cpu-events
+++ b/Documentation/ABI/stable/sysfs-devices-cpu-events
@@ -0,0 +1,54 @@
+What:  /sys/devices/cpu/events/
+   /sys/devices/cpu/events/branch-misses
+   /sys/devices/cpu/events/cache-references
+   /sys/devices/cpu/events/cache-misses
+   /sys/devices/cpu/events/stalled-cycles-frontend
+   /sys/devices/cpu/events/branch-instructions
+   /sys/devices/cpu/events/stalled-cycles-backend
+   /sys/devices/cpu/events/instructions
+   /sys/devices/cpu/events/cpu-cycles
+
+Date:  2013/01/08
+
+Contact:   Linux kernel mailing list linux-ker...@vger.kernel.org
+
+Description:   Generic performance monitoring events
+
+   A collection of performance monitoring events that may be
+   supported by many/most CPUs. These events can be monitored
+   using the 'perf(1)' tool.
+
+   The contents of each file would look like:
+
+   event=0xNNNN
+
+   where 'N' is a hex digit.
+
+
+What:  /sys/devices/cpu/events/PM_LD_MISS_L1
+   /sys/devices/cpu/events/PM_LD_REF_L1
+   /sys/devices/cpu/events/PM_CYC
+   /sys/devices/cpu/events/PM_BRU_FIN
+   /sys/devices/cpu/events/PM_GCT_NOSLOT_CYC
+   /sys/devices/cpu/events/PM_BRU_MPRED
+   /sys/devices/cpu/events/PM_INST_CMPL
+   /sys/devices/cpu/events/PM_CMPLU_STALL
+
+Date:  2013/01/08
+
+Contact:   Linux kernel mailing list linux-ker...@vger.kernel.org
+   Linux Powerpc mailing list linuxppc-...@ozlabs.org
+
+Description:   POWER specific performance monitoring events
+
+   A collection of performance monitoring events that may be
+   supported by the POWER CPU. These events can be monitored
+   using the 'perf(1)' tool.
+
+   These events may not be supported by other CPUs.
+
+   The contents of each file would look like:
+
+   event=0xNNNN
+
+   where 'N' is a hex digit.
diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-format 
b/Documentation/ABI/stable/sysfs-devices-cpu-format
new file mode 100644
index 000..b15cfb2
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-devices-cpu-format
@@ -0,0 +1,27 @@
+What:  /sys/devices/cpu/format/
+   /sys/devices/cpu/format/event
+
+Date:  2013/01/08
+
+Contact:   Linux kernel mailing list linux-ker...@vger.kernel.org
+
+Description:   Format of performance monitoring events
+
+   Each CPU/architecture may use different format to represent
+   the perf event.  The 'event' file describes the configuration
+   format of the performance monitoring event on the CPU/system.
+
+   The contents of each file would look like:
+
+   config:m-n
+
+   where m and n are the starting and ending bits that are
+   used to represent the event.
+
+   For example, on POWER,
+
+   $ cat /sys/devices/cpu/format/event
+   config:0-20
+
+   meaning that POWER uses the first 20-bits to represent a perf
+   event.
-- 
1.7.1
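The config:m-n format documented above is simple enough to parse mechanically. An illustrative sketch, not perf code: read the line and compute the mask the bits cover.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a "config:m-n" line as described in the ABI document. */
static int parse_event_format(const char *line, unsigned int *m, unsigned int *n)
{
    return sscanf(line, "config:%u-%u", m, n) == 2;
}

/* Mask covering bits m..n inclusive. */
static unsigned long long event_format_mask(unsigned int m, unsigned int n)
{
    return ((1ULL << (n - m + 1)) - 1) << m;
}

/* Self-check against the POWER example above: config:0-20 covers 21 bits. */
static int format_selfcheck(void)
{
    unsigned int m, n;

    if (!parse_event_format("config:0-20", &m, &n))
        return 0;
    return m == 0 && n == 20 && event_format_mask(m, n) == 0x1fffffULL;
}
```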



[PATCH 3/6][v3] perf/POWER7: Make generic event translations available in sysfs

2013-01-09 Thread sukadev
Make the generic perf events in POWER7 available via sysfs.

$ ls /sys/bus/event_source/devices/cpu/events
branch-instructions
branch-misses
cache-misses
cache-references
cpu-cycles
instructions
stalled-cycles-backend
stalled-cycles-frontend

$ cat /sys/bus/event_source/devices/cpu/events/cache-misses
event=0x400f0

This patch is based on commits that implement this functionality on x86.
Eg:
commit a47473939db20e3961b200eb00acf5fcf084d755
Author: Jiri Olsa jo...@redhat.com
Date:   Wed Oct 10 14:53:11 2012 +0200

perf/x86: Make hardware event translations available in sysfs

Changelog:[v3]
[Jiri Olsa] Drop EVENT_ID() macro since it is only used once.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h  |   24 ++
 arch/powerpc/perf/core-book3s.c   |   12 +++
 arch/powerpc/perf/power7-pmu.c|   34 +
 3 files changed, 70 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events 
b/Documentation/ABI/stable/sysfs-devices-cpu-events
new file mode 100644
index 000..e69de29
diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 9710be3..3f21d89 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -11,6 +11,7 @@
 
 #include <linux/types.h>
 #include <asm/hw_irq.h>
+#include <linux/device.h>
 
 #define MAX_HWEVENTS   8
 #define MAX_EVENT_ALTERNATIVES 8
@@ -35,6 +36,7 @@ struct power_pmu {
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
+   const struct attribute_group**attr_groups;
int n_generic;
int *generic_events;
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
@@ -109,3 +111,25 @@ extern unsigned long perf_instruction_pointer(struct 
pt_regs *regs);
  * If an event_id is not subject to the constraint expressed by a particular
  * field, then it will have 0 in both the mask and value for that field.
  */
+
+extern ssize_t power_events_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *page);
+
+/*
+ * EVENT_VAR() is same as PMU_EVENT_VAR with a suffix.
+ *
+ * Having a suffix allows us to have aliases in sysfs - eg: the generic
+ * event 'cpu-cycles' can have two entries in sysfs: 'cpu-cycles' and
+ * 'PM_CYC' where the latter is the name by which the event is known in
+ * POWER CPU specification.
+ */
+#define EVENT_VAR(_id, _suffix) event_attr_##_id##_suffix
+#define EVENT_PTR(_id, _suffix) EVENT_VAR(_id, _suffix)
+
+#define EVENT_ATTR(_name, _id, _suffix)                                \
+   PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,\
+   power_events_sysfs_show)
+
+#define GENERIC_EVENT_ATTR(_name, _id)  EVENT_ATTR(_name, _id, _g)
+#define GENERIC_EVENT_PTR(_id)  EVENT_PTR(_id, _g)
+
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index aa2465e..fa476d5 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1305,6 +1305,16 @@ static int power_pmu_event_idx(struct perf_event *event)
return event-hw.idx;
 }
 
+ssize_t power_events_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *page)
+{
+   struct perf_pmu_events_attr *pmu_attr;
+
+   pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+
+   return sprintf(page, event=0x%02llx\n, pmu_attr-id);
+}
+
 struct pmu power_pmu = {
.pmu_enable = power_pmu_enable,
.pmu_disable= power_pmu_disable,
@@ -1537,6 +1547,8 @@ int __cpuinit register_power_pmu(struct power_pmu *pmu)
pr_info(%s performance monitor hardware support registered\n,
pmu-name);
 
+   power_pmu.attr_groups = ppmu-attr_groups;
+
 #ifdef MSR_HV
/*
 * Use FCHV to ignore kernel events if MSR.HV is set.
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 44e70d2..ae5d757 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -363,6 +363,39 @@ static int 
power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
},
 };
 
+
+GENERIC_EVENT_ATTR(cpu-cycles, CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-frontend,GCT_NOSLOT_CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-backend, CMPLU_STALL);
+GENERIC_EVENT_ATTR(instructions,   INST_CMPL);
+GENERIC_EVENT_ATTR(cache-references,   

[PATCH 4/6][v3] perf/POWER7: Make some POWER7 events available in sysfs

2013-01-09 Thread sukadev
Make some POWER7-specific perf events available in sysfs.

$ /bin/ls -1 /sys/bus/event_source/devices/cpu/events/
branch-instructions
branch-misses
cache-misses
cache-references
cpu-cycles
instructions
PM_BRU_FIN
PM_BRU_MPRED
PM_CMPLU_STALL
PM_CYC
PM_GCT_NOSLOT_CYC
PM_INST_CMPL
PM_LD_MISS_L1
PM_LD_REF_L1
stalled-cycles-backend
stalled-cycles-frontend

where the 'PM_*' events are POWER specific and the others are the
generic events.

This enables users to specify these events by their symbolic names
rather than by their raw codes.

perf stat -e 'cpu/PM_CYC/' ...

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |2 ++
 arch/powerpc/perf/power7-pmu.c   |   18 ++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 3f21d89..b29fcc6 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -133,3 +133,5 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 #define	GENERIC_EVENT_ATTR(_name, _id)	EVENT_ATTR(_name, _id, _g)
 #define	GENERIC_EVENT_PTR(_id)		EVENT_PTR(_id, _g)
 
+#define	POWER_EVENT_ATTR(_name, _id)	EVENT_ATTR(PM_##_name, _id, _p)
+#define	POWER_EVENT_PTR(_id)		EVENT_PTR(_id, _p)
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index ae5d757..5627940 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -373,6 +373,15 @@ GENERIC_EVENT_ATTR(cache-misses,   LD_MISS_L1);
 GENERIC_EVENT_ATTR(branch-instructions,BRU_FIN);
 GENERIC_EVENT_ATTR(branch-misses,  BRU_MPRED);
 
+POWER_EVENT_ATTR(CYC,  CYC);
+POWER_EVENT_ATTR(GCT_NOSLOT_CYC,   GCT_NOSLOT_CYC);
+POWER_EVENT_ATTR(CMPLU_STALL,  CMPLU_STALL);
+POWER_EVENT_ATTR(INST_CMPL,INST_CMPL);
+POWER_EVENT_ATTR(LD_REF_L1,LD_REF_L1);
+POWER_EVENT_ATTR(LD_MISS_L1,   LD_MISS_L1);
+POWER_EVENT_ATTR(BRU_FIN,	BRU_FIN);
+POWER_EVENT_ATTR(BRU_MPRED,BRU_MPRED);
+
 static struct attribute *power7_events_attr[] = {
GENERIC_EVENT_PTR(CYC),
GENERIC_EVENT_PTR(GCT_NOSLOT_CYC),
@@ -382,6 +391,15 @@ static struct attribute *power7_events_attr[] = {
GENERIC_EVENT_PTR(LD_MISS_L1),
GENERIC_EVENT_PTR(BRU_FIN),
GENERIC_EVENT_PTR(BRU_MPRED),
+
+   POWER_EVENT_PTR(CYC),
+   POWER_EVENT_PTR(GCT_NOSLOT_CYC),
+   POWER_EVENT_PTR(CMPLU_STALL),
+   POWER_EVENT_PTR(INST_CMPL),
+   POWER_EVENT_PTR(LD_REF_L1),
+   POWER_EVENT_PTR(LD_MISS_L1),
+   POWER_EVENT_PTR(BRU_FIN),
+   POWER_EVENT_PTR(BRU_MPRED),
NULL
 };
 
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

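The aliasing above lets users refer to the same counter by either its
generic name or its POWER-specific PM_* name. A purely illustrative
user-space sketch of the resulting name-to-code mapping (the codes are
the PME_PM_* values from patch 1/6; the table and lookup function are
my own, not kernel code):

```c
#include <string.h>

/* Illustrative mapping mirroring the sysfs entries: the generic name and
 * the POWER-specific PM_* alias resolve to the same raw event code. */
struct event_alias {
	const char *name;
	unsigned long long code;
};

static const struct event_alias events[] = {
	{ "cpu-cycles",    0x1e },    { "PM_CYC",        0x1e },
	{ "instructions",  0x2 },     { "PM_INST_CMPL",  0x2 },
	{ "cache-misses",  0x400f0 }, { "PM_LD_MISS_L1", 0x400f0 },
};

/* Return the raw code for a symbolic name, or -1 if it is unknown. */
long long event_code(const char *name)
{
	for (size_t i = 0; i < sizeof(events) / sizeof(events[0]); i++)
		if (strcmp(events[i].name, name) == 0)
			return (long long)events[i].code;
	return -1;
}
```

With such a table, `event_code("PM_CYC")` and `event_code("cpu-cycles")`
return the same raw code, which is exactly the property the aliased
sysfs entries give to perf tooling.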

[PATCH 5/6][v3] perf: Create a sysfs entry for Power event format

2013-01-09 Thread sukadev
Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event',
which describes the format of a POWER cpu event.

The event format is the same on all POWER cpus (at least Power6 and
Power7), so the bulk of this change lives in code common to POWER cpus.

This code is based on corresponding code in x86.

Changelog[v2]: [Jiri Olsa] Use PMU_FORMAT_ATTR() rather than duplicating it.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |6 ++
 arch/powerpc/perf/core-book3s.c  |   12 
 arch/powerpc/perf/power7-pmu.c   |1 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index b29fcc6..ee63205 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -135,3 +135,9 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 
 #define	POWER_EVENT_ATTR(_name, _id)	EVENT_ATTR(PM_##_name, _id, _p)
 #define	POWER_EVENT_PTR(_id)		EVENT_PTR(_id, _p)
+
+/*
+ * Format of a perf event is the same on all POWER cpus. Declare a
+ * common sysfs attribute group that individual POWER cpus can share.
+ */
+extern struct attribute_group power_pmu_format_group;
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index fa476d5..4ae044b 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1315,6 +1315,18 @@ ssize_t power_events_sysfs_show(struct device *dev,
	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
 }
 
+PMU_FORMAT_ATTR(event, "config:0-20");
+
+static struct attribute *power_pmu_format_attr[] = {
+   format_attr_event.attr,
+   NULL,
+};
+
+struct attribute_group power_pmu_format_group = {
+	.name = "format",
+   .attrs = power_pmu_format_attr,
+};
+
 struct pmu power_pmu = {
.pmu_enable = power_pmu_enable,
.pmu_disable= power_pmu_disable,
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 5627940..5fb3c9b 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -410,6 +410,7 @@ static struct attribute_group power7_pmu_events_group = {
 };
 
 static const struct attribute_group *power7_pmu_attr_groups[] = {
+	&power_pmu_format_group,
power7_pmu_events_group,
NULL,
 };
-- 
1.7.1


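The "config:0-20" format string tells perf tooling that the raw event
code occupies bits 0 through 20 of perf_event_attr.config. A hedged
user-space sketch of that decoding (the mask width follows from the
format string above; the function names are my own):

```c
#include <stdio.h>

/* Extract the field declared as "config:0-20": bits 0..20 inclusive. */
unsigned long long config_event(unsigned long long config)
{
	return config & ((1ULL << 21) - 1);
}

/* Render the code the way the events/ files do, e.g. "event=0x1e". */
int format_event(char *buf, size_t n, unsigned long long config)
{
	return snprintf(buf, n, "event=0x%02llx", config_event(config));
}
```

The 21-bit mask is the only POWER-specific piece; tools that parse the
format directory can stay generic because the width comes from the
sysfs file rather than being hard-coded.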

Re: [PATCH] Added device tree binding for TDM and TDM phy

2013-01-09 Thread Scott Wood

On 01/09/2013 01:10:24 AM, Singh Sandeep-B37400 wrote:

A gentle reminder.
Any comments are appreciated.

Regards,
Sandeep

 -Original Message-
 From: Singh Sandeep-B37400
 Sent: Wednesday, January 02, 2013 6:55 PM
 To: devicetree-disc...@lists.ozlabs.org; linuxppc-...@ozlabs.org
 Cc: Singh Sandeep-B37400; Aggrwal Poonam-B10812
 Subject: [PATCH] Added device tree binding for TDM and TDM phy

 This controller is available on many Freescale SOCs like MPC8315, P1020,
 P1010 and P1022.

 Signed-off-by: Sandeep Singh sand...@freescale.com
 Signed-off-by: Poonam Aggrwal poonam.aggr...@freescale.com
 ---
  .../devicetree/bindings/powerpc/fsl/fsl-tdm.txt    |   63
  .../devicetree/bindings/powerpc/fsl/tdm-phy.txt    |   38
  2 files changed, 101 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
  create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt

 diff --git a/Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
 b/Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
 new file mode 100644
 index 000..ceb2ef1
 --- /dev/null
 +++ b/Documentation/devicetree/bindings/powerpc/fsl/fsl-tdm.txt
 @@ -0,0 +1,63 @@
 +TDM Device Tree Binding
 +
 +NOTE: The bindings described in this document are preliminary and
 +subject to change.
 +
 +TDM (Time Division Multiplexing)
 +
 +Description:
 +
 +The TDM is a full duplex serial port designed to allow various devices
 +including digital signal processors (DSPs) to communicate with a
 +variety of serial devices including industry standard framers, codecs,
 +other DSPs and microprocessors.
 +
 +The below properties describe the device tree bindings for Freescale
 +TDM controller. This TDM controller is available on various Freescale
 +Processors like MPC8315, P1020, P1022 and P1010.
 +
 +Required properties:
 +
 +- compatible
 +Value type: string
 +Definition: Should contain "fsl,tdm1.0".
 +
 +- reg
 +Definition: A standard property. The first reg specifier describes
 +the TDM registers, and the second describes the TDM DMAC registers.
 +
 +- tdm_tx_clk
 +Value type: u32 or u64
 +Definition: This specifies the value of transmit clock. It should
 +not exceed 50MHz.
 +
 +- tdm_rx_clk
 +Value type: u32 or u64
 +Definition: This specifies the value of receive clock. Its value
 +could be zero, in which case tdm will operate in shared mode. Its
 +value should not exceed 50MHz.


Please don't use underscores in property names, and use the vendor  
prefix: fsl,tdm-tx-clk and fsl,tdm-rx-clk.


 diff --git a/Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt
 b/Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt
 new file mode 100644
 index 000..2563934
 --- /dev/null
 +++ b/Documentation/devicetree/bindings/powerpc/fsl/tdm-phy.txt
 @@ -0,0 +1,38 @@
 +TDM PHY Device Tree Binding
 +
 +NOTE: The bindings described in this document are preliminary and
 +subject to change.
 +
 +Description:
 +TDM PHY is the terminal interface of TDM subsystem. It is typically a
 +line control device like E1/T1 framer or SLIC. A TDM device can have
 +multiple TDM PHYs.
 +
 +Required properties:
 +
 +- compatible
 +Value type: string
 +Definition: Should contain generic compatibility like "tdm-phy-slic"
 +or "tdm-phy-e1" or "tdm-phy-t1".


Does this generic string (plus the other properties) tell you all you  
need to know about the device?  If there are other possible generic  
compatibles, they should be listed or else different people will make  
up different strings for the same thing.


-Scott


[PATCH 1/6][v3] perf/Power7: Use macros to identify perf events

2013-01-09 Thread sukadev
Define and use macros to identify perf event codes. This makes the code
easier to read when these event codes need to be used in more than one
place.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/perf/power7-pmu.c |   28 
 1 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 441af08..44e70d2 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -51,6 +51,18 @@
 #define MMCR1_PMCSEL_MSK   0xff
 
 /*
+ * Power7 event codes.
+ */
+#define	PME_PM_CYC		0x1e
+#define	PME_PM_GCT_NOSLOT_CYC	0x100f8
+#define	PME_PM_CMPLU_STALL	0x4000a
+#define	PME_PM_INST_CMPL	0x2
+#define	PME_PM_LD_REF_L1	0xc880
+#define	PME_PM_LD_MISS_L1	0x400f0
+#define	PME_PM_BRU_FIN		0x10068
+#define	PME_PM_BRU_MPRED	0x400f6
+
+/*
  * Layout of constraint bits:
  * 554433221100
  * 3210987654321098765432109876543210987654321098765432109876543210
@@ -296,14 +308,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned 
long mmcr[])
 }
 
 static int power7_generic_events[] = {
-   [PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
-   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
-   [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a,  /* CMPLU_STALL */
-   [PERF_COUNT_HW_INSTRUCTIONS] = 2,
-   [PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880,  /* LD_REF_L1_LSU*/
-   [PERF_COUNT_HW_CACHE_MISSES] = 0x400f0, /* LD_MISS_L1   */
-   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068,  /* BRU_FIN  */
-   [PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,/* BR_MPRED */
+   [PERF_COUNT_HW_CPU_CYCLES] =PME_PM_CYC,
+   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =   PME_PM_GCT_NOSLOT_CYC,
+   [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =PME_PM_CMPLU_STALL,
+   [PERF_COUNT_HW_INSTRUCTIONS] =  PME_PM_INST_CMPL,
+   [PERF_COUNT_HW_CACHE_REFERENCES] =  PME_PM_LD_REF_L1,
+   [PERF_COUNT_HW_CACHE_MISSES] =  PME_PM_LD_MISS_L1,
+   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] =   PME_PM_BRU_FIN,
+   [PERF_COUNT_HW_BRANCH_MISSES] = PME_PM_BRU_MPRED,
 };
 
 #define C(x)   PERF_COUNT_HW_CACHE_##x
-- 
1.7.1


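The change above replaces magic numbers with named constants. A
standalone sketch of the before/after pattern (the event codes are the
ones defined in the patch; the enum is a stand-in for the generic
PERF_COUNT_HW_* indices, and the lookup helper is my own addition):

```c
/* POWER7 raw event codes, as defined in the patch above. */
#define PME_PM_CYC		0x1e
#define PME_PM_INST_CMPL	0x2
#define PME_PM_LD_MISS_L1	0x400f0

/* Stand-ins for the generic PERF_COUNT_HW_* indices. */
enum { HW_CPU_CYCLES, HW_INSTRUCTIONS, HW_CACHE_MISSES, HW_MAX };

/* The table now reads as a name-to-name mapping rather than raw hex,
 * and the same constant can be reused elsewhere (e.g. sysfs attrs). */
static const int generic_events[HW_MAX] = {
	[HW_CPU_CYCLES]   = PME_PM_CYC,
	[HW_INSTRUCTIONS] = PME_PM_INST_CMPL,
	[HW_CACHE_MISSES] = PME_PM_LD_MISS_L1,
};

/* Bounds-checked lookup; returns -1 for an out-of-range index. */
int generic_event_code(int idx)
{
	return (idx >= 0 && idx < HW_MAX) ? generic_events[idx] : -1;
}
```

The payoff comes in later patches in this series, where the same
PME_PM_* names feed both the generic-events table and the sysfs event
attributes without repeating the hex values.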

[PATCH 2/6][v3] perf: Make EVENT_ATTR global

2013-01-09 Thread sukadev
Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is
available to all architectures.

Further, to allow architectures flexibility, have PMU_EVENT_ATTR() pass
in the variable name as a parameter.

Changelog[v3]
- [Jiri Olsa] No need to define PMU_EVENT_PTR()

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/kernel/cpu/perf_event.c |   13 +++--
 include/linux/perf_event.h   |   11 +++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4428fd1..59a1238 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1316,11 +1316,6 @@ static struct attribute_group x86_pmu_format_group = {
.attrs = NULL,
 };
 
-struct perf_pmu_events_attr {
-   struct device_attribute attr;
-   u64 id;
-};
-
 /*
  * Remove all undefined events (x86_pmu.event_map(id) == 0)
  * out of events_attr attributes.
@@ -1354,11 +1349,9 @@ static ssize_t events_sysfs_show(struct device *dev, 
struct device_attribute *at
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) event_attr_##_id.attr.attr
 
-#define EVENT_ATTR(_name, _id) \
-static struct perf_pmu_events_attr EVENT_VAR(_id) = {  \
-   .attr = __ATTR(_name, 0444, events_sysfs_show, NULL),   \
-   .id   =  PERF_COUNT_HW_##_id,   \
-};
+#define EVENT_ATTR(_name, _id) \
+   PMU_EVENT_ATTR(_name, EVENT_VAR(_id), PERF_COUNT_HW_##_id,  \
+   events_sysfs_show)
 
 EVENT_ATTR(cpu-cycles, CPU_CYCLES  );
 EVENT_ATTR(instructions,   INSTRUCTIONS);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2fa..42adf01 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -817,6 +817,17 @@ do {   
\
 } while (0)
 
 
+struct perf_pmu_events_attr {
+   struct device_attribute attr;
+   u64 id;
+};
+
+#define PMU_EVENT_ATTR(_name, _var, _id, _show)
\
+static struct perf_pmu_events_attr _var = {\
+   .attr = __ATTR(_name, 0444, _show, NULL),   \
+   .id   =  _id,   \
+};
+
 #define PMU_FORMAT_ATTR(_name, _format)
\
 static ssize_t \
 _name##_show(struct device *dev,   \
-- 
1.7.1


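A user-space sketch of the PMU_EVENT_ATTR() pattern: the macro names
the variable itself (the flexibility the changelog mentions), and a
callback that only receives the embedded device_attribute recovers the
id via the container_of idiom. The struct stand-ins below are
assumptions for illustration, not the kernel definitions:

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel types. */
struct device_attribute {
	const char *name;
};

struct perf_pmu_events_attr {
	struct device_attribute attr;
	unsigned long long id;
};

/* Like PMU_EVENT_ATTR(): the caller picks the variable name via _var,
 * so each architecture can use its own naming/suffix scheme. */
#define PMU_EVENT_ATTR(_name, _var, _id)			\
static struct perf_pmu_events_attr _var = {			\
	.attr = { .name = #_name },				\
	.id   = _id,						\
}

PMU_EVENT_ATTR(cpu-cycles, event_attr_CYC_g, 0x1e);

/* container_of(): step back from the embedded member to its container. */
unsigned long long event_id(struct device_attribute *attr)
{
	struct perf_pmu_events_attr *p = (struct perf_pmu_events_attr *)
		((char *)attr - offsetof(struct perf_pmu_events_attr, attr));
	return p->id;
}

/* What a sysfs show routine would do: given only the attr pointer,
 * reach the id stored alongside it. */
unsigned long long cyc_id(void)
{
	return event_id(&event_attr_CYC_g.attr);
}
```

This is why making PMU_EVENT_ATTR() take the variable name lets POWER
define its _g/_p suffixed aliases while x86 keeps its existing names.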

[PATCH 3/6][v3] perf/POWER7: Make generic event translations available in sysfs

2013-01-09 Thread sukadev
Make the generic perf events in POWER7 available via sysfs.

$ ls /sys/bus/event_source/devices/cpu/events
branch-instructions
branch-misses
cache-misses
cache-references
cpu-cycles
instructions
stalled-cycles-backend
stalled-cycles-frontend

$ cat /sys/bus/event_source/devices/cpu/events/cache-misses
event=0x400f0

This patch is based on commits that implement this functionality on x86.
Eg:
commit a47473939db20e3961b200eb00acf5fcf084d755
Author: Jiri Olsa jo...@redhat.com
Date:   Wed Oct 10 14:53:11 2012 +0200

perf/x86: Make hardware event translations available in sysfs

Changelog:[v3]
[Jiri Olsa] Drop EVENT_ID() macro since it is only used once.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h  |   24 ++
 arch/powerpc/perf/core-book3s.c   |   12 +++
 arch/powerpc/perf/power7-pmu.c|   34 +
 3 files changed, 70 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events 
b/Documentation/ABI/stable/sysfs-devices-cpu-events
new file mode 100644
index 000..e69de29
diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 9710be3..3f21d89 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -11,6 +11,7 @@
 
 #include <linux/types.h>
 #include <asm/hw_irq.h>
+#include <linux/device.h>
 
 #define MAX_HWEVENTS   8
 #define MAX_EVENT_ALTERNATIVES 8
@@ -35,6 +36,7 @@ struct power_pmu {
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
+	const struct attribute_group	**attr_groups;
int n_generic;
int *generic_events;
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
@@ -109,3 +111,25 @@ extern unsigned long perf_instruction_pointer(struct 
pt_regs *regs);
  * If an event_id is not subject to the constraint expressed by a particular
  * field, then it will have 0 in both the mask and value for that field.
  */
+
+extern ssize_t power_events_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *page);
+
+/*
+ * EVENT_VAR() is same as PMU_EVENT_VAR with a suffix.
+ *
+ * Having a suffix allows us to have aliases in sysfs - eg: the generic
+ * event 'cpu-cycles' can have two entries in sysfs: 'cpu-cycles' and
+ * 'PM_CYC' where the latter is the name by which the event is known in
+ * POWER CPU specification.
+ */
+#define	EVENT_VAR(_id, _suffix)		event_attr_##_id##_suffix
+#define	EVENT_PTR(_id, _suffix)		EVENT_VAR(_id, _suffix)
+
+#define	EVENT_ATTR(_name, _id, _suffix)				\
+	PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,	\
+			power_events_sysfs_show)
+
+#define	GENERIC_EVENT_ATTR(_name, _id)	EVENT_ATTR(_name, _id, _g)
+#define	GENERIC_EVENT_PTR(_id)		EVENT_PTR(_id, _g)
+
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index aa2465e..fa476d5 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1305,6 +1305,16 @@ static int power_pmu_event_idx(struct perf_event *event)
	return event->hw.idx;
 }
 
+ssize_t power_events_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *page)
+{
+   struct perf_pmu_events_attr *pmu_attr;
+
+   pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+
+	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
+}
+
 struct pmu power_pmu = {
.pmu_enable = power_pmu_enable,
.pmu_disable= power_pmu_disable,
@@ -1537,6 +1547,8 @@ int __cpuinit register_power_pmu(struct power_pmu *pmu)
	pr_info("%s performance monitor hardware support registered\n",
		pmu->name);
 
+	power_pmu.attr_groups = ppmu->attr_groups;
+
 #ifdef MSR_HV
/*
 * Use FCHV to ignore kernel events if MSR.HV is set.
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 44e70d2..ae5d757 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -363,6 +363,39 @@ static int 
power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
},
 };
 
+
+GENERIC_EVENT_ATTR(cpu-cycles, CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-frontend,GCT_NOSLOT_CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-backend, CMPLU_STALL);
+GENERIC_EVENT_ATTR(instructions,   INST_CMPL);
+GENERIC_EVENT_ATTR(cache-references,   

[PATCH 4/6][v3] perf/POWER7: Make some POWER7 events available in sysfs

2013-01-09 Thread sukadev
Make some POWER7-specific perf events available in sysfs.

$ /bin/ls -1 /sys/bus/event_source/devices/cpu/events/
branch-instructions
branch-misses
cache-misses
cache-references
cpu-cycles
instructions
PM_BRU_FIN
PM_BRU_MPRED
PM_CMPLU_STALL
PM_CYC
PM_GCT_NOSLOT_CYC
PM_INST_CMPL
PM_LD_MISS_L1
PM_LD_REF_L1
stalled-cycles-backend
stalled-cycles-frontend

where the 'PM_*' events are POWER specific and the others are the
generic events.

This will enable users to specify these events with their symbolic
names rather than with their raw code.

perf stat -e 'cpu/PM_CYC/' ...

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |2 ++
 arch/powerpc/perf/power7-pmu.c   |   18 ++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 3f21d89..b29fcc6 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -133,3 +133,5 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 #defineGENERIC_EVENT_ATTR(_name, _id)  EVENT_ATTR(_name, _id, _g)
 #defineGENERIC_EVENT_PTR(_id)  EVENT_PTR(_id, _g)
 
+#definePOWER_EVENT_ATTR(_name, _id)EVENT_ATTR(PM_##_name, _id, _p)
+#definePOWER_EVENT_PTR(_id)EVENT_PTR(_id, _p)
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index ae5d757..5627940 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -373,6 +373,15 @@ GENERIC_EVENT_ATTR(cache-misses,   LD_MISS_L1);
 GENERIC_EVENT_ATTR(branch-instructions,BRU_FIN);
 GENERIC_EVENT_ATTR(branch-misses,  BRU_MPRED);
 
+POWER_EVENT_ATTR(CYC,  CYC);
+POWER_EVENT_ATTR(GCT_NOSLOT_CYC,   GCT_NOSLOT_CYC);
+POWER_EVENT_ATTR(CMPLU_STALL,  CMPLU_STALL);
+POWER_EVENT_ATTR(INST_CMPL,INST_CMPL);
+POWER_EVENT_ATTR(LD_REF_L1,LD_REF_L1);
+POWER_EVENT_ATTR(LD_MISS_L1,   LD_MISS_L1);
+POWER_EVENT_ATTR(BRU_FIN,  BRU_FIN)
+POWER_EVENT_ATTR(BRU_MPRED,BRU_MPRED);
+
 static struct attribute *power7_events_attr[] = {
GENERIC_EVENT_PTR(CYC),
GENERIC_EVENT_PTR(GCT_NOSLOT_CYC),
@@ -382,6 +391,15 @@ static struct attribute *power7_events_attr[] = {
GENERIC_EVENT_PTR(LD_MISS_L1),
GENERIC_EVENT_PTR(BRU_FIN),
GENERIC_EVENT_PTR(BRU_MPRED),
+
+   POWER_EVENT_PTR(CYC),
+   POWER_EVENT_PTR(GCT_NOSLOT_CYC),
+   POWER_EVENT_PTR(CMPLU_STALL),
+   POWER_EVENT_PTR(INST_CMPL),
+   POWER_EVENT_PTR(LD_REF_L1),
+   POWER_EVENT_PTR(LD_MISS_L1),
+   POWER_EVENT_PTR(BRU_FIN),
+   POWER_EVENT_PTR(BRU_MPRED),
NULL
 };
 
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 5/6][v3] perf: Create a sysfs entry for Power event format

2013-01-09 Thread sukadev
Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event'
which describes the format of a POWER cpu.

The format of the event is the same for all POWER cpus at least in
(Power6, Power7), so bulk of this change is common in the code common
to POWER cpus.

This code is based on corresponding code in x86.

Changelog[v2]: [Jiri Olsa] Use PMU_FORMAT_ATTR() rather than duplicating it.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |6 ++
 arch/powerpc/perf/core-book3s.c  |   12 
 arch/powerpc/perf/power7-pmu.c   |1 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index b29fcc6..ee63205 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -135,3 +135,9 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 
 #definePOWER_EVENT_ATTR(_name, _id)EVENT_ATTR(PM_##_name, _id, _p)
 #definePOWER_EVENT_PTR(_id)EVENT_PTR(_id, _p)
+
+/*
+ * Format of a perf event is the same on all POWER cpus. Declare a
+ * common sysfs attribute group that individual POWER cpus can share.
+ */
+extern struct attribute_group power_pmu_format_group;
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index fa476d5..4ae044b 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1315,6 +1315,18 @@ ssize_t power_events_sysfs_show(struct device *dev,
return sprintf(page, event=0x%02llx\n, pmu_attr-id);
 }
 
+PMU_FORMAT_ATTR(event, config:0-20);
+
+static struct attribute *power_pmu_format_attr[] = {
+   format_attr_event.attr,
+   NULL,
+};
+
+struct attribute_group power_pmu_format_group = {
+   .name = format,
+   .attrs = power_pmu_format_attr,
+};
+
 struct pmu power_pmu = {
.pmu_enable = power_pmu_enable,
.pmu_disable= power_pmu_disable,
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 5627940..5fb3c9b 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -410,6 +410,7 @@ static struct attribute_group power7_pmu_events_group = {
 };
 
 static const struct attribute_group *power7_pmu_attr_groups[] = {
+   power_pmu_format_group,
power7_pmu_events_group,
NULL,
 };
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 1/6][v3] perf/Power7: Use macros to identify perf events

2013-01-09 Thread Sukadev Bhattiprolu
[PATCH 1/6][v3] perf/Power7: Use macros to identify perf events

Define and use macros to identify perf events codes. This would make it
easier and more readable when these event codes need to be used in more
than one place.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/perf/power7-pmu.c |   28 
 1 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 441af08..44e70d2 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -51,6 +51,18 @@
 #define MMCR1_PMCSEL_MSK   0xff
 
 /*
+ * Power7 event codes.
+ */
+#definePME_PM_CYC  0x1e
+#definePME_PM_GCT_NOSLOT_CYC   0x100f8
+#definePME_PM_CMPLU_STALL  0x4000a
+#definePME_PM_INST_CMPL0x2
+#definePME_PM_LD_REF_L10xc880
+#definePME_PM_LD_MISS_L1   0x400f0
+#definePME_PM_BRU_FIN  0x10068
+#definePME_PM_BRU_MPRED0x400f6
+
+/*
  * Layout of constraint bits:
  * 554433221100
  * 3210987654321098765432109876543210987654321098765432109876543210
@@ -296,14 +308,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned 
long mmcr[])
 }
 
 static int power7_generic_events[] = {
-   [PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
-   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
-   [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a,  /* CMPLU_STALL */
-   [PERF_COUNT_HW_INSTRUCTIONS] = 2,
-   [PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880,  /* LD_REF_L1_LSU*/
-   [PERF_COUNT_HW_CACHE_MISSES] = 0x400f0, /* LD_MISS_L1   */
-   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068,  /* BRU_FIN  */
-   [PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,/* BR_MPRED */
+   [PERF_COUNT_HW_CPU_CYCLES] =PME_PM_CYC,
+   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =   PME_PM_GCT_NOSLOT_CYC,
+   [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =PME_PM_CMPLU_STALL,
+   [PERF_COUNT_HW_INSTRUCTIONS] =  PME_PM_INST_CMPL,
+   [PERF_COUNT_HW_CACHE_REFERENCES] =  PME_PM_LD_REF_L1,
+   [PERF_COUNT_HW_CACHE_MISSES] =  PME_PM_LD_MISS_L1,
+   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] =   PME_PM_BRU_FIN,
+   [PERF_COUNT_HW_BRANCH_MISSES] = PME_PM_BRU_MPRED,
 };
 
 #define C(x)   PERF_COUNT_HW_CACHE_##x
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 2/6][v3] perf: Make EVENT_ATTR global

2013-01-09 Thread Sukadev Bhattiprolu
[PATCH 2/6][v3] perf: Make EVENT_ATTR global

Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is
available to all architectures.

Further to allow architectures flexibility, have PMU_EVENT_ATTR() pass
in the variable name as a parameter.

Changelog[v3]
- [Jiri Olsa] No need to define PMU_EVENT_PTR()

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/kernel/cpu/perf_event.c |   13 +++--
 include/linux/perf_event.h   |   11 +++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4428fd1..59a1238 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1316,11 +1316,6 @@ static struct attribute_group x86_pmu_format_group = {
.attrs = NULL,
 };
 
-struct perf_pmu_events_attr {
-   struct device_attribute attr;
-   u64 id;
-};
-
 /*
  * Remove all undefined events (x86_pmu.event_map(id) == 0)
  * out of events_attr attributes.
@@ -1354,11 +1349,9 @@ static ssize_t events_sysfs_show(struct device *dev, 
struct device_attribute *at
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) event_attr_##_id.attr.attr
 
-#define EVENT_ATTR(_name, _id) \
-static struct perf_pmu_events_attr EVENT_VAR(_id) = {  \
-   .attr = __ATTR(_name, 0444, events_sysfs_show, NULL),   \
-   .id   =  PERF_COUNT_HW_##_id,   \
-};
+#define EVENT_ATTR(_name, _id) \
+   PMU_EVENT_ATTR(_name, EVENT_VAR(_id), PERF_COUNT_HW_##_id,  \
+   events_sysfs_show)
 
 EVENT_ATTR(cpu-cycles, CPU_CYCLES  );
 EVENT_ATTR(instructions,   INSTRUCTIONS);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2fa..42adf01 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -817,6 +817,17 @@ do {   
\
 } while (0)
 
 
+struct perf_pmu_events_attr {
+   struct device_attribute attr;
+   u64 id;
+};
+
+#define PMU_EVENT_ATTR(_name, _var, _id, _show)
\
+static struct perf_pmu_events_attr _var = {\
+   .attr = __ATTR(_name, 0444, _show, NULL),   \
+   .id   =  _id,   \
+};
+
 #define PMU_FORMAT_ATTR(_name, _format)
\
 static ssize_t \
 _name##_show(struct device *dev,   \
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 3/6][v3] perf/POWER7: Make generic event translations available in sysfs

2013-01-09 Thread Sukadev Bhattiprolu
[PATCH 3/6][v3] perf/POWER7: Make generic event translations available in sysfs

Make the generic perf events in POWER7 available via sysfs.

$ ls /sys/bus/event_source/devices/cpu/events
branch-instructions
branch-misses
cache-misses
cache-references
cpu-cycles
instructions
stalled-cycles-backend
stalled-cycles-frontend

$ cat /sys/bus/event_source/devices/cpu/events/cache-misses
event=0x400f0

This patch is based on commits that implement this functionality on x86.
Eg:
commit a47473939db20e3961b200eb00acf5fcf084d755
Author: Jiri Olsa jo...@redhat.com
Date:   Wed Oct 10 14:53:11 2012 +0200

perf/x86: Make hardware event translations available in sysfs

Changelog:[v3]
[Jiri Olsa] Drop EVENT_ID() macro since it is only used once.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h  |   24 ++
 arch/powerpc/perf/core-book3s.c   |   12 +++
 arch/powerpc/perf/power7-pmu.c|   34 +
 3 files changed, 70 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events 
b/Documentation/ABI/stable/sysfs-devices-cpu-events
new file mode 100644
index 000..e69de29
diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 9710be3..3f21d89 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -11,6 +11,7 @@
 
 #include linux/types.h
 #include asm/hw_irq.h
+#include linux/device.h
 
 #define MAX_HWEVENTS   8
 #define MAX_EVENT_ALTERNATIVES 8
@@ -35,6 +36,7 @@ struct power_pmu {
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
+   const struct attribute_group**attr_groups;
int n_generic;
int *generic_events;
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
@@ -109,3 +111,25 @@ extern unsigned long perf_instruction_pointer(struct 
pt_regs *regs);
  * If an event_id is not subject to the constraint expressed by a particular
  * field, then it will have 0 in both the mask and value for that field.
  */
+
+extern ssize_t power_events_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *page);
+
+/*
+ * EVENT_VAR() is same as PMU_EVENT_VAR with a suffix.
+ *
+ * Having a suffix allows us to have aliases in sysfs - eg: the generic
+ * event 'cpu-cycles' can have two entries in sysfs: 'cpu-cycles' and
+ * 'PM_CYC' where the latter is the name by which the event is known in
+ * POWER CPU specification.
+ */
+#define EVENT_VAR(_id, _suffix)		event_attr_##_id##_suffix
+#define EVENT_PTR(_id, _suffix)		EVENT_VAR(_id, _suffix)
+
+#define EVENT_ATTR(_name, _id, _suffix)				\
+	PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,	\
+			power_events_sysfs_show)
+
+#define GENERIC_EVENT_ATTR(_name, _id)	EVENT_ATTR(_name, _id, _g)
+#define GENERIC_EVENT_PTR(_id)		EVENT_PTR(_id, _g)
+
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index aa2465e..fa476d5 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1305,6 +1305,16 @@ static int power_pmu_event_idx(struct perf_event *event)
	return event->hw.idx;
 }
 
+ssize_t power_events_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *page)
+{
+   struct perf_pmu_events_attr *pmu_attr;
+
+   pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+
+	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
+}
+
 struct pmu power_pmu = {
.pmu_enable = power_pmu_enable,
.pmu_disable= power_pmu_disable,
@@ -1537,6 +1547,8 @@ int __cpuinit register_power_pmu(struct power_pmu *pmu)
	pr_info("%s performance monitor hardware support registered\n",
pmu-name);
 
+	power_pmu.attr_groups = ppmu->attr_groups;
+
 #ifdef MSR_HV
/*
 * Use FCHV to ignore kernel events if MSR.HV is set.
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 44e70d2..ae5d757 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -363,6 +363,39 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
},
 };
 
+
+GENERIC_EVENT_ATTR(cpu-cycles, CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-frontend,GCT_NOSLOT_CYC);
+GENERIC_EVENT_ATTR(stalled-cycles-backend, CMPLU_STALL);

[PATCH 4/6][v3] perf/POWER7: Make some POWER7 events available in sysfs

2013-01-09 Thread Sukadev Bhattiprolu

Make some POWER7-specific perf events available in sysfs.

$ /bin/ls -1 /sys/bus/event_source/devices/cpu/events/
branch-instructions
branch-misses
cache-misses
cache-references
cpu-cycles
instructions
PM_BRU_FIN
PM_BRU_MPRED
PM_CMPLU_STALL
PM_CYC
PM_GCT_NOSLOT_CYC
PM_INST_CMPL
PM_LD_MISS_L1
PM_LD_REF_L1
stalled-cycles-backend
stalled-cycles-frontend

where the 'PM_*' events are POWER specific and the others are the
generic events.

This will enable users to specify these events with their symbolic
names rather than with their raw code.

perf stat -e 'cpu/PM_CYC/' ...

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |2 ++
 arch/powerpc/perf/power7-pmu.c   |   18 ++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3f21d89..b29fcc6 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -133,3 +133,5 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 #define GENERIC_EVENT_ATTR(_name, _id)	EVENT_ATTR(_name, _id, _g)
 #define GENERIC_EVENT_PTR(_id)		EVENT_PTR(_id, _g)
 
+#define POWER_EVENT_ATTR(_name, _id)	EVENT_ATTR(PM_##_name, _id, _p)
+#define POWER_EVENT_PTR(_id)		EVENT_PTR(_id, _p)
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index ae5d757..5627940 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -373,6 +373,15 @@ GENERIC_EVENT_ATTR(cache-misses,   LD_MISS_L1);
 GENERIC_EVENT_ATTR(branch-instructions,BRU_FIN);
 GENERIC_EVENT_ATTR(branch-misses,  BRU_MPRED);
 
+POWER_EVENT_ATTR(CYC,  CYC);
+POWER_EVENT_ATTR(GCT_NOSLOT_CYC,   GCT_NOSLOT_CYC);
+POWER_EVENT_ATTR(CMPLU_STALL,  CMPLU_STALL);
+POWER_EVENT_ATTR(INST_CMPL,INST_CMPL);
+POWER_EVENT_ATTR(LD_REF_L1,LD_REF_L1);
+POWER_EVENT_ATTR(LD_MISS_L1,   LD_MISS_L1);
+POWER_EVENT_ATTR(BRU_FIN,  BRU_FIN);
+POWER_EVENT_ATTR(BRU_MPRED,BRU_MPRED);
+
 static struct attribute *power7_events_attr[] = {
GENERIC_EVENT_PTR(CYC),
GENERIC_EVENT_PTR(GCT_NOSLOT_CYC),
@@ -382,6 +391,15 @@ static struct attribute *power7_events_attr[] = {
GENERIC_EVENT_PTR(LD_MISS_L1),
GENERIC_EVENT_PTR(BRU_FIN),
GENERIC_EVENT_PTR(BRU_MPRED),
+
+   POWER_EVENT_PTR(CYC),
+   POWER_EVENT_PTR(GCT_NOSLOT_CYC),
+   POWER_EVENT_PTR(CMPLU_STALL),
+   POWER_EVENT_PTR(INST_CMPL),
+   POWER_EVENT_PTR(LD_REF_L1),
+   POWER_EVENT_PTR(LD_MISS_L1),
+   POWER_EVENT_PTR(BRU_FIN),
+   POWER_EVENT_PTR(BRU_MPRED),
NULL
 };
 
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 5/6][v3] perf: Create a sysfs entry for Power event format

2013-01-09 Thread Sukadev Bhattiprolu

Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event',
which describes the format of a perf event on POWER CPUs.

The event format is the same for all POWER CPUs (at least Power6 and
Power7), so the bulk of this change is in code common to POWER CPUs.

This code is based on corresponding code in x86.

Changelog[v2]: [Jiri Olsa] Use PMU_FORMAT_ATTR() rather than duplicating it.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/perf_event_server.h |6 ++
 arch/powerpc/perf/core-book3s.c  |   12 
 arch/powerpc/perf/power7-pmu.c   |1 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index b29fcc6..ee63205 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -135,3 +135,9 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
 
 #define POWER_EVENT_ATTR(_name, _id)	EVENT_ATTR(PM_##_name, _id, _p)
 #define POWER_EVENT_PTR(_id)		EVENT_PTR(_id, _p)
+
+/*
+ * Format of a perf event is the same on all POWER cpus. Declare a
+ * common sysfs attribute group that individual POWER cpus can share.
+ */
+extern struct attribute_group power_pmu_format_group;
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index fa476d5..4ae044b 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1315,6 +1315,18 @@ ssize_t power_events_sysfs_show(struct device *dev,
 	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
 }
 
+PMU_FORMAT_ATTR(event, "config:0-20");
+
+static struct attribute *power_pmu_format_attr[] = {
+   format_attr_event.attr,
+   NULL,
+};
+
+struct attribute_group power_pmu_format_group = {
+	.name = "format",
+   .attrs = power_pmu_format_attr,
+};
+
 struct pmu power_pmu = {
.pmu_enable = power_pmu_enable,
.pmu_disable= power_pmu_disable,
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 5627940..5fb3c9b 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -410,6 +410,7 @@ static struct attribute_group power7_pmu_events_group = {
 };
 
 static const struct attribute_group *power7_pmu_attr_groups[] = {
+	&power_pmu_format_group,
power7_pmu_events_group,
NULL,
 };
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 6/6][v3] perf: Document the ABI of perf sysfs entries

2013-01-09 Thread Sukadev Bhattiprolu

This patchset adds two new sets of files to sysfs:

- generic and POWER-specific perf events in /sys/devices/cpu/events/
- perf event config format in /sys/devices/cpu/format/event

Document the format of these files which would become part of the ABI.

Changelog[v3]:
[Greg KH] Include ABI documentation.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 Documentation/ABI/stable/sysfs-devices-cpu-events |   54 +
 Documentation/ABI/stable/sysfs-devices-cpu-format |   27 ++
 2 files changed, 81 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-format

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events
index e69de29..f37d542 100644
--- a/Documentation/ABI/stable/sysfs-devices-cpu-events
+++ b/Documentation/ABI/stable/sysfs-devices-cpu-events
@@ -0,0 +1,54 @@
+What:  /sys/devices/cpu/events/
+   /sys/devices/cpu/events/branch-misses
+   /sys/devices/cpu/events/cache-references
+   /sys/devices/cpu/events/cache-misses
+   /sys/devices/cpu/events/stalled-cycles-frontend
+   /sys/devices/cpu/events/branch-instructions
+   /sys/devices/cpu/events/stalled-cycles-backend
+   /sys/devices/cpu/events/instructions
+   /sys/devices/cpu/events/cpu-cycles
+
+Date:  2013/01/08
+
+Contact:   Linux kernel mailing list linux-ker...@vger.kernel.org
+
+Description:   Generic performance monitoring events
+
+   A collection of performance monitoring events that may be
+   supported by many/most CPUs. These events can be monitored
+   using the 'perf(1)' tool.
+
+   The contents of each file would look like:
+
+		event=0xNNNN
+
+   where 'N' is a hex digit.
+
+
+What:  /sys/devices/cpu/events/PM_LD_MISS_L1
+   /sys/devices/cpu/events/PM_LD_REF_L1
+   /sys/devices/cpu/events/PM_CYC
+   /sys/devices/cpu/events/PM_BRU_FIN
+   /sys/devices/cpu/events/PM_GCT_NOSLOT_CYC
+   /sys/devices/cpu/events/PM_BRU_MPRED
+   /sys/devices/cpu/events/PM_INST_CMPL
+   /sys/devices/cpu/events/PM_CMPLU_STALL
+
+Date:  2013/01/08
+
+Contact:   Linux kernel mailing list linux-ker...@vger.kernel.org
+   Linux Powerpc mailing list linuxppc-...@ozlabs.org
+
+Description:   POWER specific performance monitoring events
+
+   A collection of performance monitoring events that may be
+   supported by the POWER CPU. These events can be monitored
+   using the 'perf(1)' tool.
+
+   These events may not be supported by other CPUs.
+
+   The contents of each file would look like:
+
+		event=0xNNNN
+
+   where 'N' is a hex digit.
diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-format b/Documentation/ABI/stable/sysfs-devices-cpu-format
new file mode 100644
index 000..b15cfb2
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-devices-cpu-format
@@ -0,0 +1,27 @@
+What:  /sys/devices/cpu/format/
+   /sys/devices/cpu/format/event
+
+Date:  2013/01/08
+
+Contact:   Linux kernel mailing list linux-ker...@vger.kernel.org
+
+Description:   Format of performance monitoring events
+
+	Each CPU/architecture may use a different format to represent
+   the perf event.  The 'event' file describes the configuration
+   format of the performance monitoring event on the CPU/system.
+
+   The contents of each file would look like:
+
+   config:m-n
+
+   where m and n are the starting and ending bits that are
+   used to represent the event.
+
+   For example, on POWER,
+
+   $ cat /sys/devices/cpu/format/event
+   config:0-20
+
+	meaning that POWER uses bits 0-20 (the low 21 bits) to
+	represent a perf event.
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence

2013-01-09 Thread Tang Chen

Hi Glauber,

On 01/09/2013 11:09 PM, Glauber Costa wrote:


We try to make all page_cgroup allocations local to the node they are
describing now. If the memory is the first memory onlined in this node, we
will allocate it from another node.

For example, node1 has 4 memory blocks, 8-11, and we online them from 8 to 11:
1. memory block 8: page_cgroup allocations are in the other nodes
2. memory block 9: page_cgroup allocations are in memory block 8

So we should offline memory block 9 first. But we don't know in which order
the user onlined the memory blocks.

I think we can modify memcg like this:
allocate the memory from the memory block they are describing

I am not sure it is OK to do so.


I don't see a reason why not.


I'm not sure, but if we do this, we could introduce a fragment for each
memory block (a memory section, 128MB, right?). Is this a problem when
we use large pages (such as 1GB pages)?

Even if not, will these fragments have any bad effects?

Thanks. :)



You would have to tweak a bit the lookup function for page_cgroup, but
assuming you will always have the pfns and limits, it should be easy to do.

I think the only tricky part is that today we have a single
node_page_cgroup, and we would of course have to have one per memory
block. My assumption is that the number of memory blocks is limited and
likely not very big. So even a static array would do.

Kamezawa, do you have any input in here?



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Tang Chen

Hi Andrew,

Thank you very much for your pushing. :)

On 01/10/2013 06:23 AM, Andrew Morton wrote:


This does sound like a significant problem.  We should assume that
memcg is available and in use.


In patch1, we provide a solution which is not good enough:
Iterate twice to offline the memory.
1st iterate: offline every non primary memory block.
2nd iterate: offline primary (i.e. first added) memory block.


Let's flesh this out a bit.

If we online memory8, memory9, memory10 and memory11 then I'd have
thought that they would need to offlined in reverse order, which will
require four iterations, not two.  Is this wrong and if so, why?


Well, we may need more than two iterations if all of memory8, memory9, and
memory10 are in use by the kernel, and 10 depends on 9, 9 depends on 8.

So, as you see here, the iteration method is not good enough.

But this only happens when the memory is used by the kernel, which cannot
be migrated. So if we use a boot option such as movablecore_map, or the
movable_online functionality, to limit the memory to movable, the kernel
will not use this memory. So it is safe when we are doing node hot-remove.



Also, what happens if we wish to offline only memory9?  Do we offline
memory11 then memory10 then memory9 and then re-online memory10 and
memory11?


In this case, offlining memory9 could fail if the user does this himself,
for example using sysfs.

This is the memory hot-remove path. So when we remove a memory device,
it will automatically offline all pages, and it does so in reverse
order by itself.

And again, this is not good enough. We will figure out a reasonable way
to solve it soon.




And a new idea from Wen Congyangwe...@cn.fujitsu.com  is:
allocate the memory from the memory block they are describing.


Yes.


But we are not sure if it is OK to do so because there is no existing API
to do so, and we need to move the page_cgroup memory allocation from
MEM_GOING_ONLINE to MEM_ONLINE.


This all sounds solvable - can we proceed in this fashion?


Yes, we are in progress now.




And also, it may interfere the hugepage.


Please provide full details on this problem.


It is not very clear now, and if I find something, I'll share it.




Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.


Right.  But how often does this happen in testing?  In other words,
please provide an overall description of how well memory hot-remove is
presently operating.  Is it reliable?  What is the success rate in
real-world situations?


We test the hot-remove functionality mostly with movable_online in use.
And memory used by the kernel is not allowed to be removed.

We will do some tests in the kernel memory offline cases, and tell you
the test results soon.

And since we are trying out some other ways, I think the problem will
be solved soon.


Are there precautions which the administrator
can take to improve the success rate?


Administrator could use movablecore_map boot option or movable_online
functionality (which is now in kernel) to limit memory as movable to
avoid this problem.


What are the remaining problems
and are there plans to address them?


For now, we will try to allocate page_cgroup on the memory block which
it is describing. And all the other parts seem to work well now.

And we are still testing. If we have any problem, we will share.

Thanks. :)




--
To unsubscribe from this list: send the line unsubscribe linux-acpi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Tang Chen

Hi Andrew,

On 01/10/2013 07:33 AM, Andrew Morton wrote:

On Wed, 9 Jan 2013 17:32:24 +0800
Tang Chentangc...@cn.fujitsu.com  wrote:


This patch-set aims to implement physical memory hot-removing.


As you were on the patch delivery path, all of these patches should have
your Signed-off-by:.  But some were missing it.  I fixed this in my
copy of the patches.


Thank you very much for the help. Next time I'll add it myself.




I suspect this patchset adds a significant amount of code which will
not be used if CONFIG_MEMORY_HOTPLUG=n.  [PATCH v6 06/15]
memory-hotplug: implement register_page_bootmem_info_section of
sparse-vmemmap, for example.  This is not a good thing, so please go
through the patchset (in fact, go through all the memhotplug code) and
let's see if we can reduce the bloat for CONFIG_MEMORY_HOTPLUG=n
kernels.

This needn't be done immediately - it would be OK by me if you were to
defer this exercise until all the new memhotplug code is largely in
place.  But please, let's do it.


OK, I'll do have a check on it when the page_cgroup problem is solved.

Thanks. :)







___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 05/15] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture

2013-01-09 Thread Tang Chen

Hi Andrew,

On 01/10/2013 06:50 AM, Andrew Morton wrote:

On Wed, 9 Jan 2013 17:32:29 +0800
Tang Chentangc...@cn.fujitsu.com  wrote:


For removing memory, we need to remove the page tables. But this depends
on the architecture. So this patch introduces arch_remove_memory() for
removing page tables. Now it only calls __remove_pages().

Note: __remove_pages() for some architectures is not implemented
   (I don't know how to implement it for s390).


Can this break the build for s390?


No, I don't think so. The arch_remove_memory() in s390 will only
return -EBUSY.

Thanks. :)







___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 6/7] powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint registers

2013-01-09 Thread Michael Neuling
This is a rewrite so that we don't assume we are using the DABR throughout the
code.  We now use the arch_hw_breakpoint to store the breakpoint in a generic
manner in the thread_struct, rather than storing the raw DABR value.

The ptrace GET/SET_DEBUGREG interface currently passes the raw DABR in from
userspace.  We keep this functionality, so that future changes (like the POWER8
DAWR), will still fake the DABR to userspace.

Signed-off-by: Michael Neuling mi...@neuling.org
---
Resending to fix a problem with 8xx defconfigs. Noticed by benh.

 arch/powerpc/include/asm/debug.h |   15 +++---
 arch/powerpc/include/asm/hw_breakpoint.h |   33 ++---
 arch/powerpc/include/asm/processor.h |4 +-
 arch/powerpc/include/asm/reg.h   |3 --
 arch/powerpc/kernel/exceptions-64s.S |2 +-
 arch/powerpc/kernel/hw_breakpoint.c  |   72 
 arch/powerpc/kernel/kgdb.c   |   10 ++--
 arch/powerpc/kernel/process.c|   75 +-
 arch/powerpc/kernel/ptrace.c |   60 +---
 arch/powerpc/kernel/ptrace32.c   |8 +++-
 arch/powerpc/kernel/signal.c |5 +-
 arch/powerpc/kernel/traps.c  |4 +-
 arch/powerpc/mm/fault.c  |4 +-
 arch/powerpc/xmon/xmon.c |   21 ++---
 14 files changed, 187 insertions(+), 129 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 32de257..8d85ffb 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -4,6 +4,8 @@
 #ifndef _ASM_POWERPC_DEBUG_H
 #define _ASM_POWERPC_DEBUG_H
 
+#include asm/hw_breakpoint.h
+
 struct pt_regs;
 
 extern struct dentry *powerpc_debugfs_root;
@@ -15,7 +17,7 @@ extern int (*__debugger_ipi)(struct pt_regs *regs);
 extern int (*__debugger_bpt)(struct pt_regs *regs);
 extern int (*__debugger_sstep)(struct pt_regs *regs);
 extern int (*__debugger_iabr_match)(struct pt_regs *regs);
-extern int (*__debugger_dabr_match)(struct pt_regs *regs);
+extern int (*__debugger_break_match)(struct pt_regs *regs);
 extern int (*__debugger_fault_handler)(struct pt_regs *regs);
 
 #define DEBUGGER_BOILERPLATE(__NAME) \
@@ -31,7 +33,7 @@ DEBUGGER_BOILERPLATE(debugger_ipi)
 DEBUGGER_BOILERPLATE(debugger_bpt)
 DEBUGGER_BOILERPLATE(debugger_sstep)
 DEBUGGER_BOILERPLATE(debugger_iabr_match)
-DEBUGGER_BOILERPLATE(debugger_dabr_match)
+DEBUGGER_BOILERPLATE(debugger_break_match)
 DEBUGGER_BOILERPLATE(debugger_fault_handler)
 
 #else
@@ -40,17 +42,18 @@ static inline int debugger_ipi(struct pt_regs *regs) { return 0; }
 static inline int debugger_bpt(struct pt_regs *regs) { return 0; }
 static inline int debugger_sstep(struct pt_regs *regs) { return 0; }
 static inline int debugger_iabr_match(struct pt_regs *regs) { return 0; }
-static inline int debugger_dabr_match(struct pt_regs *regs) { return 0; }
+static inline int debugger_break_match(struct pt_regs *regs) { return 0; }
 static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
-extern int set_dabr(unsigned long dabr, unsigned long dabrx);
+int set_break(struct arch_hw_breakpoint *brk);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
 unsigned long error_code, int signal_code, int brkpt);
 #else
-extern void do_dabr(struct pt_regs *regs, unsigned long address,
-   unsigned long error_code);
+
+extern void do_break(struct pt_regs *regs, unsigned long address,
+unsigned long error_code);
 #endif
 
 #endif /* _ASM_POWERPC_DEBUG_H */
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index 4234245..2c91faf 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -24,16 +24,30 @@
 #define _PPC_BOOK3S_64_HW_BREAKPOINT_H
 
 #ifdef __KERNEL__
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
-
 struct arch_hw_breakpoint {
unsigned long   address;
-   unsigned long   dabrx;
-   int type;
-   u8  len; /* length of the target data symbol */
-   boolextraneous_interrupt;
+   u16 type;
+   u16 len; /* length of the target data symbol */
 };
 
+/* Note: Don't change the first 6 bits below as they are in the same order
+ * as the dabr and dabrx.
+ */
+#define HW_BRK_TYPE_READ   0x01
+#define HW_BRK_TYPE_WRITE  0x02
+#define HW_BRK_TYPE_TRANSLATE  0x04
+#define HW_BRK_TYPE_USER   0x08
+#define HW_BRK_TYPE_KERNEL 0x10
+#define HW_BRK_TYPE_HYP0x20
+#define HW_BRK_TYPE_EXTRANEOUS_IRQ 0x80
+
+/* bits that overlap with the bottom 3 bits of the dabr */
+#define HW_BRK_TYPE_RDWR   (HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE)
+#define HW_BRK_TYPE_DABR   (HW_BRK_TYPE_RDWR | 

Re: [TRIVIAL PATCH 11/26] powerpc: Convert print_symbol to %pSR

2013-01-09 Thread Benjamin Herrenschmidt
On Thu, 2012-12-13 at 11:58 +, Arnd Bergmann wrote:
 On Wednesday 12 December 2012, Joe Perches wrote:
  Use the new vsprintf extension to avoid any possible
  message interleaving.
  
  Convert the #ifdef DEBUG block to a single pr_debug.
  
  Signed-off-by: Joe Perches j...@perches.com
 
 nice cleanup!

 ... which also breaks the build :-(

 Acked-by: Arnd Bergmann a...@arndb.de

I'll fix it up locally.

Ben.

 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC PATCH powerpc ] Protect smp_processor_id() in arch_spin_unlock_wait()

2013-01-09 Thread Benjamin Herrenschmidt
On Mon, 2012-11-19 at 14:16 +0800, Li Zhong wrote:
 This patch tries to disable preemption for using smp_processor_id() in 
 arch_spin_unlock_wait(), 
 to avoid following report:

 .../...

 diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
 index bb7cfec..7a7c31b 100644
 --- a/arch/powerpc/lib/locks.c
 +++ b/arch/powerpc/lib/locks.c
 @@ -72,8 +72,10 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
  {
  while (lock->slock) {
   HMT_low();
 + preempt_disable();
   if (SHARED_PROCESSOR)
   __spin_yield(lock);
 + preempt_enable();
   }

I assume what you are protecting is the PACA access in SHARED_PROCESSOR
or is there more ?

In that case I'd say just make it use local_paca-> directly or something
like that. It doesn't matter if the access is racy, all processors will
have the same value for that field as far as I can tell.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2013-01-09 Thread Tang Chen

Hi Andrew,

On 01/10/2013 06:49 AM, Andrew Morton wrote:

On Wed, 9 Jan 2013 17:32:28 +0800
Tang Chentangc...@cn.fujitsu.com  wrote:


When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. The patch
implements the function to remove them.

Note: The code does not free firmware_map_entry, which is allocated by
   bootmem, so the patch introduces a memory leak. But I think the leak is
   very small, and it does not affect the system.


Well that's bad.  Can we remember the address of that memory and then
reuse the storage if/when the memory is re-added?  That at least puts an upper
bound on the leak.


I think we can do this. I'll post a new patch to do so.

Thanks. :)






___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 02/15] memory-hotplug: check whether all memory blocks are offlined or not when removing memory

2013-01-09 Thread Tang Chen

Hi Andrew,

On 01/10/2013 07:11 AM, Andrew Morton wrote:

On Wed, 9 Jan 2013 17:32:26 +0800
Tang Chentangc...@cn.fujitsu.com  wrote:


We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory(TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing memory. But we don't hold
the lock in the whole operation. So we should check whether all memory blocks
are offlined before step 6. Otherwise, the kernel may panic.


Well, the obvious question is: why don't we hold lock_memory_hotplug()
for all of steps 1-4?  Please send the reasons for this in a form which
I can paste into the changelog.


In the changelog form:

Offlining a memory block and removing a memory device can be two
different operations. Users can just offline some memory blocks
without removing the memory device. For this purpose, the kernel
holds lock_memory_hotplug() in __offline_pages(). To reuse the code
for memory hot-remove, we repeat steps 1-3 to offline all the memory
blocks, repeatedly locking and unlocking memory hotplug, but we do
not hold the memory hotplug lock across the whole operation.




Actually, I wonder if doing this would fix a race in the current
remove_memory() repeat: loop.  That code does a
find_memory_block_hinted() followed by offline_memory_block(), but
afaict find_memory_block_hinted() only does a get_device().  Is the
get_device() sufficiently strong to prevent problems if another thread
concurrently offlines or otherwise alters this memory_block's state?


I think we already have memory_block->state_mutex to protect against
concurrent changes of the memory_block's state.

The find_memory_block_hinted() here is to find the memory_block
corresponding to the memory section we are dealing with.

Thanks. :)





___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 04/15] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2013-01-09 Thread Tang Chen

Hi Andrew,

On 01/10/2013 07:19 AM, Andrew Morton wrote:

...

+   entry = firmware_map_find_entry(start, end - 1, type);
+   if (!entry)
+   return -EINVAL;
+
+   firmware_map_remove_entry(entry);

...



The above code looks racy.  After firmware_map_find_entry() does the
spin_unlock() there is nothing to prevent a concurrent
firmware_map_remove_entry() from removing the entry, so the kernel ends
up calling firmware_map_remove_entry() twice against the same entry.

An easy fix for this is to hold the spinlock across the entire
lookup/remove operation.


This problem is inherent to firmware_map_find_entry() as you have
implemented it, so this function simply should not exist in the current
form - no caller can use it without being buggy!  A simple fix for this
is to remove the spin_lock()/spin_unlock() from
firmware_map_find_entry() and add locking documentation to
firmware_map_find_entry(), explaining that the caller must hold
map_entries_lock and must not release that lock until processing of
firmware_map_find_entry()'s return value has completed.


Thank you for your advice, I'll fix it soon.

Since you have merged the patch-set, do I need to resend all these
patches again, or just send a patch to fix it based on the current
one ?

Thanks. :)





___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] powerpc: Make room in exception vector area

2013-01-09 Thread Benjamin Herrenschmidt
The FWNMI region is fixed at 0x7000 and the vector are now
overflowing that with some configurations. Fix that by moving
some hash management code out of that region as it doesn't need
to be that close to the call sites (isn't accessed using
conditional branches).

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---
 arch/powerpc/kernel/exceptions-64s.S |  110 +-
 1 file changed, 55 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index a28a65f..7a1c87c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1180,6 +1180,61 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
.globl  __end_handlers
 __end_handlers:
 
+	/* Equivalents to the above handlers for relocation-on interrupt vectors */
+   STD_RELON_EXCEPTION_HV(., 0xe00, h_data_storage)
+   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe00)
+   STD_RELON_EXCEPTION_HV(., 0xe20, h_instr_storage)
+   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe20)
+   STD_RELON_EXCEPTION_HV(., 0xe40, emulation_assist)
+   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe40)
+   STD_RELON_EXCEPTION_HV(., 0xe60, hmi_exception)
+   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe60)
+   MASKABLE_RELON_EXCEPTION_HV(., 0xe80, h_doorbell)
+   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe80)
+
+   STD_RELON_EXCEPTION_PSERIES(., 0xf00, performance_monitor)
+   STD_RELON_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable)
+   STD_RELON_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable)
+
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
+/*
+ * Data area reserved for FWNMI option.
+ * This address (0x7000) is fixed by the RPA.
+ */
	. = 0x7000
+   .globl fwnmi_data_area
+fwnmi_data_area:
+
+   /* pseries and powernv need to keep the whole page from
+* 0x7000 to 0x8000 free for use by the firmware
+*/
+   . = 0x8000
+#endif /* defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) */
+
+/* Space for CPU0's segment table */
+   .balign 4096
+   .globl initial_stab
+initial_stab:
+   .space  4096
+
+#ifdef CONFIG_PPC_POWERNV
+_GLOBAL(opal_mc_secondary_handler)
+   HMT_MEDIUM_PPR_DISCARD
+   SET_SCRATCH0(r13)
+   GET_PACA(r13)
+   clrldi  r3,r3,2
+   tovirt(r3,r3)
+   std r3,PACA_OPAL_MC_EVT(r13)
+   ld  r13,OPAL_MC_SRR0(r3)
+   mtspr   SPRN_SRR0,r13
+   ld  r13,OPAL_MC_SRR1(r3)
+   mtspr   SPRN_SRR1,r13
+   ld  r3,OPAL_MC_GPR3(r3)
+   GET_SCRATCH0(r13)
+   b   machine_check_pSeries
+#endif /* CONFIG_PPC_POWERNV */
+
+
 /*
  * Hash table stuff
  */
@@ -1373,58 +1428,3 @@ _GLOBAL(do_stab_bolted)
ld  r13,PACA_EXSLB+EX_R13(r13)
rfid
b   .   /* prevent speculative execution */
-
-
-   /* Equivalents to the above handlers for relocation-on interrupt vectors */
-   STD_RELON_EXCEPTION_HV(., 0xe00, h_data_storage)
-   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe00)
-   STD_RELON_EXCEPTION_HV(., 0xe20, h_instr_storage)
-   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe20)
-   STD_RELON_EXCEPTION_HV(., 0xe40, emulation_assist)
-   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe40)
-   STD_RELON_EXCEPTION_HV(., 0xe60, hmi_exception)
-   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe60)
-   MASKABLE_RELON_EXCEPTION_HV(., 0xe80, h_doorbell)
-   KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe80)
-
-   STD_RELON_EXCEPTION_PSERIES(., 0xf00, performance_monitor)
-   STD_RELON_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable)
-   STD_RELON_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable)
-
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-/*
- * Data area reserved for FWNMI option.
- * This address (0x7000) is fixed by the RPA.
- */
-   . = 0x7000
-   .globl fwnmi_data_area
-fwnmi_data_area:
-
-   /* pseries and powernv need to keep the whole page from
-* 0x7000 to 0x8000 free for use by the firmware
-*/
-   . = 0x8000
-#endif /* defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) */
-
-/* Space for CPU0's segment table */
-   .balign 4096
-   .globl initial_stab
-initial_stab:
-   .space  4096
-
-#ifdef CONFIG_PPC_POWERNV
-_GLOBAL(opal_mc_secondary_handler)
-   HMT_MEDIUM_PPR_DISCARD
-   SET_SCRATCH0(r13)
-   GET_PACA(r13)
-   clrldi  r3,r3,2
-   tovirt(r3,r3)
-   std r3,PACA_OPAL_MC_EVT(r13)
-   ld  r13,OPAL_MC_SRR0(r3)
-   mtspr   SPRN_SRR0,r13
-   ld  r13,OPAL_MC_SRR1(r3)
-   mtspr   SPRN_SRR1,r13
-   ld  r3,OPAL_MC_GPR3(r3)
-   GET_SCRATCH0(r13)
-   b   machine_check_pSeries
-#endif /* CONFIG_PPC_POWERNV */




___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Glauber Costa
On 01/10/2013 06:17 AM, Tang Chen wrote:
>>> Note: if the memory provided by the memory device is used by the
>>> kernel, it can't be offlined. It is not a bug.
>>
>> Right.  But how often does this happen in testing?  In other words,
>> please provide an overall description of how well memory hot-remove is
>> presently operating.  Is it reliable?  What is the success rate in
>> real-world situations?
>
> We test the hot-remove functionality mostly with movable_online used.
> And the memory used by the kernel is not allowed to be removed.

Can you try doing this using cpusets configured to hardwall?
It is my understanding that the object allocators will try hard not to
allocate anything outside the walls defined by a cpuset. Which means that
if you have one process per node, and they are hardwalled, your kernel
memory will be spread evenly across the machine. With a big enough load,
it should eventually be present in all blocks.

Another question I have for you: have you considered calling
shrink_slab() to try to deplete the caches and therefore free at least
slab memory in the nodes that can't be offlined? Is it relevant?


Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Kamezawa Hiroyuki

(2013/01/10 16:14), Glauber Costa wrote:
> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>> Note: if the memory provided by the memory device is used by the
>>>> kernel, it can't be offlined. It is not a bug.
>>>
>>> Right.  But how often does this happen in testing?  In other words,
>>> please provide an overall description of how well memory hot-remove is
>>> presently operating.  Is it reliable?  What is the success rate in
>>> real-world situations?
>>
>> We test the hot-remove functionality mostly with movable_online used.
>> And the memory used by the kernel is not allowed to be removed.
>
> Can you try doing this using cpusets configured to hardwall?
> It is my understanding that the object allocators will try hard not to
> allocate anything outside the walls defined by a cpuset. Which means that
> if you have one process per node, and they are hardwalled, your kernel
> memory will be spread evenly across the machine. With a big enough load,
> it should eventually be present in all blocks.

I'm sorry, I couldn't catch your point.
Do you want to confirm whether cpusets can work well enough instead of
ZONE_MOVABLE? Or whether ZONE_MOVABLE will not work if it's used
together with cpusets?

> Another question I have for you: have you considered calling
> shrink_slab() to try to deplete the caches and therefore free at least
> slab memory in the nodes that can't be offlined? Is it relevant?

At this stage, we don't plan to call shrink_slab(). We require nearly
100% success at offlining memory when removing a DIMM. That is my
understanding.

IMHO, I don't think shrink_slab() can kill all objects in a node even
if some of them are cached. We need more study before doing that.

Thanks,
-Kame




Re: [RFC PATCH powerpc ] Protect smp_processor_id() in arch_spin_unlock_wait()

2013-01-09 Thread Li Zhong
On Thu, 2013-01-10 at 17:02 +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2012-11-19 at 14:16 +0800, Li Zhong wrote:
>> This patch tries to disable preemption for using smp_processor_id() in
>> arch_spin_unlock_wait(), to avoid the following report:
>>
>> .../...
>>
>> diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
>> index bb7cfec..7a7c31b 100644
>> --- a/arch/powerpc/lib/locks.c
>> +++ b/arch/powerpc/lib/locks.c
>> @@ -72,8 +72,10 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
>>  {
>>  	while (lock->slock) {
>>  		HMT_low();
>> +		preempt_disable();
>>  		if (SHARED_PROCESSOR)
>>  			__spin_yield(lock);
>> +		preempt_enable();
>>  	}
>
> I assume what you are protecting is the PACA access in SHARED_PROCESSOR,
> or is there more?

Yes, only the one in SHARED_PROCESSOR.

> In that case I'd say just make it use local_paca-> directly or something
> like that. It doesn't matter if the access is racy; all processors will
> have the same value for that field as far as I can tell.

It also seemed to me that all processors have the same value :). I'll
send an updated version based on your suggestion soon.

Thanks, Zhong

 
 Cheers,
 Ben.
 
 




Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory

2013-01-09 Thread Glauber Costa
On 01/10/2013 11:31 AM, Kamezawa Hiroyuki wrote:
> (2013/01/10 16:14), Glauber Costa wrote:
>> On 01/10/2013 06:17 AM, Tang Chen wrote:
>>>>> Note: if the memory provided by the memory device is used by the
>>>>> kernel, it can't be offlined. It is not a bug.
>>>>
>>>> Right.  But how often does this happen in testing?  In other words,
>>>> please provide an overall description of how well memory hot-remove is
>>>> presently operating.  Is it reliable?  What is the success rate in
>>>> real-world situations?
>>>
>>> We test the hot-remove functionality mostly with movable_online used.
>>> And the memory used by the kernel is not allowed to be removed.
>>
>> Can you try doing this using cpusets configured to hardwall?
>> It is my understanding that the object allocators will try hard not to
>> allocate anything outside the walls defined by a cpuset. Which means that
>> if you have one process per node, and they are hardwalled, your kernel
>> memory will be spread evenly across the machine. With a big enough load,
>> it should eventually be present in all blocks.
>
> I'm sorry, I couldn't catch your point.
> Do you want to confirm whether cpusets can work well enough instead of
> ZONE_MOVABLE? Or whether ZONE_MOVABLE will not work if it's used
> together with cpusets?

No, I am not proposing to use cpusets to tackle the problem. I am just
wondering if you would still have high success rates with cpusets in use
with hardwalls. This is just one example of a workload that would spread
kernel memory around quite heavily.

So this is just me trying to understand the limitations of the mechanism.

>> Another question I have for you: have you considered calling
>> shrink_slab() to try to deplete the caches and therefore free at least
>> slab memory in the nodes that can't be offlined? Is it relevant?
>
> At this stage, we don't plan to call shrink_slab(). We require nearly
> 100% success at offlining memory when removing a DIMM. That is my
> understanding.

Of course; this is indisputable.

> IMHO, I don't think shrink_slab() can kill all objects in a node even
> if some of them are cached. We need more study before doing that.

Indeed, shrink_slab() can only kill cached objects. They, however, are
usually a very big part of kernel memory. I wonder, though, whether in
case of failure it is worth trying at least one shrink pass before you
give up.

It is not very different from what is in memory-failure.c, except that
we could do better and do more targeted shrinking (support for that
is being worked on).

