Re: [PATCH v6 0/5] Migrate non-migrated pages of a SVM.

2020-07-27 Thread Paul Mackerras
On Mon, Jul 27, 2020 at 11:07:13AM -0700, Ram Pai wrote:
> The time to switch a VM to a Secure-VM increases with the size of the
> VM.  A 100GB VM takes about 7 minutes, which is unacceptable.  This
> linear increase is caused by suboptimal behavior of the Ultravisor and
> the Hypervisor.  The Ultravisor unnecessarily migrates all the GFNs of
> the VM from normal-memory to secure-memory; it only has to migrate the
> necessary and sufficient GFNs.
> 
> However, when that optimization is incorporated in the Ultravisor, the
> Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption
> that the Ultravisor will explicitly request migration of each and every
> GFN of the VM. If only the necessary and sufficient GFNs are requested
> for migration, the Hypervisor continues to manage the remaining GFNs as
> normal GFNs. This leads to memory corruption, manifested consistently
> when the SVM reboots.
> 
> The same is true when a memory slot is hotplugged into an SVM. The
> Hypervisor expects the Ultravisor to request migration of all GFNs to
> secure-GFNs.  But the Hypervisor cannot handle any H_SVM_PAGE_IN
> requests from the Ultravisor made in the context of the
> UV_REGISTER_MEM_SLOT ucall.  This problem manifests as random errors
> in the SVM when a memory-slot is hotplugged.
> 
> This patch series automatically migrates the non-migrated pages of an
> SVM, and thus solves the problem.
> 
> Testing: Passed rigorous testing using various sized SVMs.

Thanks, series applied to my kvm-ppc-next branch and pull request sent.

Paul.


Re: [PATCH v2 0/2] Rework secure memslot dropping

2020-07-27 Thread Paul Mackerras
On Mon, Jul 27, 2020 at 12:24:27PM -0700, Ram Pai wrote:
> From: Laurent Dufour 
> 
> When doing memory hotplug on a secure VM, the secure pages are not
> properly cleaned from the secure device when dropping the memslot.
> This silent error then prevents the SVM from rebooting properly after
> the following sequence of commands is run in the QEMU monitor:
> 
> device_add pc-dimm,id=dimm1,memdev=mem1
> device_del dimm1
> device_add pc-dimm,id=dimm1,memdev=mem1
> 
> At reboot time, when the kernel boots again and switches to secure
> mode, page_in fails for the pages in the memslot because the cleanup
> was not done properly: the memslot is flagged as invalid during the hot
> unplug, so the page fault mechanism is not triggered.
> 
> To prevent that, when dropping the memslot, instead of relying on the
> page fault mechanism to trigger the page out of the secured pages, it
> seems simpler to directly call the function doing the page out. This
> way the state of the memslot does not interfere with the page out
> process.
> 
> This series applies on top of Ram's series titled:
> "[v6 0/5] Migrate non-migrated pages of a SVM."

Thanks, series applied to my kvm-ppc-next branch and pull request sent.

Paul.


[PATCH 15/15] memblock: remove 'type' parameter from for_each_memblock()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

for_each_memblock() is used exclusively to iterate over memblock.memory in
a few places that use data from memblock_region rather than the memory
ranges.

Remove type parameter from the for_each_memblock() iterator to improve
encapsulation of memblock internals from its users.
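
For call sites the conversion is mechanical; as a rough before/after
sketch (not a hunk from this patch; do_something() is just a placeholder
for whatever the caller does with the region):

	struct memblock_region *reg;

	/* before: the type had to be spelled out, and was always "memory" */
	for_each_memblock(memory, reg)
		do_something(reg->base, reg->size);

	/* after: the iterator implies memblock.memory */
	for_each_memblock(reg)
		do_something(reg->base, reg->size);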

Signed-off-by: Mike Rapoport 
---
 arch/arm64/kernel/setup.c  |  2 +-
 arch/arm64/mm/numa.c   |  2 +-
 arch/mips/netlogic/xlp/setup.c |  2 +-
 include/linux/memblock.h   | 10 +++---
 mm/memblock.c  |  4 ++--
 mm/page_alloc.c|  8 
 6 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 93b3844cf442..23da7908cbed 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -217,7 +217,7 @@ static void __init request_standard_resources(void)
if (!standard_resources)
panic("%s: Failed to allocate %zu bytes\n", __func__, res_size);
 
-   for_each_memblock(memory, region) {
+   for_each_memblock(region) {
		res = &standard_resources[i++];
if (memblock_is_nomap(region)) {
res->name  = "reserved";
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 0cbdbcc885fb..08721d2c0b79 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -350,7 +350,7 @@ static int __init numa_register_nodes(void)
struct memblock_region *mblk;
 
/* Check that valid nid is set to memblks */
-   for_each_memblock(memory, mblk) {
+   for_each_memblock(mblk) {
int mblk_nid = memblock_get_region_node(mblk);
 
if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) {
diff --git a/arch/mips/netlogic/xlp/setup.c b/arch/mips/netlogic/xlp/setup.c
index 1a0fc5b62ba4..e69d9fc468cf 100644
--- a/arch/mips/netlogic/xlp/setup.c
+++ b/arch/mips/netlogic/xlp/setup.c
@@ -70,7 +70,7 @@ static void nlm_fixup_mem(void)
const int pref_backup = 512;
struct memblock_region *mem;
 
-   for_each_memblock(memory, mem) {
+   for_each_memblock(mem) {
memblock_remove(mem->base + mem->size - pref_backup,
pref_backup);
}
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d70c2835e913..c901cb8ecf92 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -527,9 +527,13 @@ static inline unsigned long memblock_region_reserved_end_pfn(const struct memblo
return PFN_UP(reg->base + reg->size);
 }
 
-#define for_each_memblock(memblock_type, region)			\
-	for (region = memblock.memblock_type.regions;			\
-	     region < (memblock.memblock_type.regions + memblock.memblock_type.cnt);\
+/**
+ * for_each_memblock - iterate over registered memory regions
+ * @region: loop variable
+ */
+#define for_each_memblock(region)  \
+   for (region = memblock.memory.regions;  \
+region < (memblock.memory.regions + memblock.memory.cnt);  \
 region++)
 
 extern void *alloc_large_system_hash(const char *tablename,
diff --git a/mm/memblock.c b/mm/memblock.c
index 2ad5e6e47215..550bb72cf6cb 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1694,7 +1694,7 @@ static phys_addr_t __init_memblock __find_max_addr(phys_addr_t limit)
 * the memory memblock regions, if the @limit exceeds the total size
 * of those regions, max_addr will keep original value PHYS_ADDR_MAX
 */
-   for_each_memblock(memory, r) {
+   for_each_memblock(r) {
if (limit <= r->size) {
max_addr = r->base + limit;
break;
@@ -1864,7 +1864,7 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
phys_addr_t start, end, orig_start, orig_end;
struct memblock_region *r;
 
-   for_each_memblock(memory, r) {
+   for_each_memblock(r) {
orig_start = r->base;
orig_end = r->base + r->size;
start = round_up(orig_start, align);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95af111d69d3..8a19f46dc86e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5927,7 +5927,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
 
if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
if (!r || *pfn >= memblock_region_memory_end_pfn(r)) {
-   for_each_memblock(memory, r) {
+   for_each_memblock(r) {
if (*pfn < memblock_region_memory_end_pfn(r))
break;
}
@@ -6528,7 +6528,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
unsigned long start_pfn, end_pfn;
struct memblock_region *r;
 
- 

[PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

The numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
regions to set the node ID in memblock.reserved and then traverses
memblock.reserved to update reserved_nodemask with the node IDs that were
set in the first loop.

Remove redundant traversal over memblock.reserved and update
reserved_nodemask while iterating over numa_meminfo.

Signed-off-by: Mike Rapoport 
---
 arch/x86/mm/numa.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8ee952038c80..4078abd33938 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -498,31 +498,25 @@ static void __init numa_clear_kernel_node_hotplug(void)
 * and use those ranges to set the nid in memblock.reserved.
 * This will split up the memblock regions along node
 * boundaries and will set the node IDs as well.
+*
+* The nid will also be set in reserved_nodemask which is later
+* used to clear MEMBLOCK_HOTPLUG flag.
+*
+* [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
+*   numa_meminfo might not include all memblock.reserved
+*   memory ranges, because quirks such as trim_snb_memory()
+*   reserve specific pages for Sandy Bridge graphics.
+*   These ranges will remain with nid == MAX_NUMNODES. ]
 */
for (i = 0; i < numa_meminfo.nr_blks; i++) {
struct numa_memblk *mb = numa_meminfo.blk + i;
int ret;
 
		ret = memblock_set_node(mb->start, mb->end - mb->start,
					&memblock.reserved, mb->nid);
+   node_set(mb->nid, reserved_nodemask);
WARN_ON_ONCE(ret);
}
 
-   /*
-* Now go over all reserved memblock regions, to construct a
-* node mask of all kernel reserved memory areas.
-*
-* [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
-*   numa_meminfo might not include all memblock.reserved
-*   memory ranges, because quirks such as trim_snb_memory()
-*   reserve specific pages for Sandy Bridge graphics. ]
-*/
-   for_each_memblock(reserved, mb_region) {
-   int nid = memblock_get_region_node(mb_region);
-
-   if (nid != MAX_NUMNODES)
-   node_set(nid, reserved_nodemask);
-   }
-
/*
 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
 * belonging to the reserved node mask.
-- 
2.26.2



[PATCH 13/15] arch, drivers: replace for_each_memblock() with for_each_mem_range()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

There are several occurrences of the following pattern:

for_each_memblock(memory, reg) {
start = __pfn_to_phys(memblock_region_memory_base_pfn(reg));
end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));

/* do something with start and end */
}

Using the for_each_mem_range() iterator is more appropriate in such cases
and allows for simpler and cleaner code.
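
For reference, the converted form used throughout the hunks below looks
roughly like this (a sketch; 'i', 'start' and 'end' are the iterator
variables each call site declares):

	phys_addr_t start, end;
	u64 i;

	for_each_mem_range(i, &start, &end) {
		/* do something with start and end */
	}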

Signed-off-by: Mike Rapoport 
---
 arch/arm/kernel/setup.c  | 18 +++
 arch/arm/mm/mmu.c| 39 
 arch/arm/mm/pmsa-v7.c| 20 ++--
 arch/arm/mm/pmsa-v8.c| 17 +--
 arch/arm/xen/mm.c|  7 +++--
 arch/arm64/mm/kasan_init.c   |  8 ++---
 arch/arm64/mm/mmu.c  | 11 ++-
 arch/c6x/kernel/setup.c  |  9 +++---
 arch/microblaze/mm/init.c|  9 +++---
 arch/mips/cavium-octeon/dma-octeon.c | 12 
 arch/mips/kernel/setup.c | 31 +--
 arch/openrisc/mm/init.c  |  8 +++--
 arch/powerpc/kernel/fadump.c | 27 +++-
 arch/powerpc/mm/book3s64/hash_utils.c| 16 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 11 +++
 arch/powerpc/mm/kasan/kasan_init_32.c|  8 ++---
 arch/powerpc/mm/mem.c| 16 ++
 arch/powerpc/mm/pgtable_32.c |  8 ++---
 arch/riscv/mm/init.c | 24 ++-
 arch/riscv/mm/kasan_init.c   | 10 +++---
 arch/s390/kernel/setup.c | 27 ++--
 arch/s390/mm/vmem.c  | 16 +-
 arch/sparc/mm/init_64.c  | 12 +++-
 drivers/bus/mvebu-mbus.c | 12 
 drivers/s390/char/zcore.c|  9 +++---
 25 files changed, 187 insertions(+), 198 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index d8e18cdd96d3..3f65d0ac9f63 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -843,19 +843,25 @@ early_param("mem", early_mem);
 
 static void __init request_standard_resources(const struct machine_desc *mdesc)
 {
-   struct memblock_region *region;
+   phys_addr_t start, end, res_end;
struct resource *res;
+   u64 i;
 
kernel_code.start   = virt_to_phys(_text);
kernel_code.end = virt_to_phys(__init_begin - 1);
kernel_data.start   = virt_to_phys(_sdata);
kernel_data.end = virt_to_phys(_end - 1);
 
-   for_each_memblock(memory, region) {
-		phys_addr_t start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
-		phys_addr_t end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
+	for_each_mem_range(i, &start, &end) {
unsigned long boot_alias_start;
 
+   /*
+* In memblock, end points to the first byte after the
+* range while in resources, end points to the last byte in
+* the range.
+*/
+   res_end = end - 1;
+
/*
 * Some systems have a special memory alias which is only
 * used for booting.  We need to advertise this region to
@@ -869,7 +875,7 @@ static void __init request_standard_resources(const struct machine_desc *mdesc)
  __func__, sizeof(*res));
res->name = "System RAM (boot alias)";
res->start = boot_alias_start;
-   res->end = phys_to_idmap(end);
+   res->end = phys_to_idmap(res_end);
res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
request_resource(_resource, res);
}
@@ -880,7 +886,7 @@ static void __init request_standard_resources(const struct machine_desc *mdesc)
  sizeof(*res));
res->name  = "System RAM";
res->start = start;
-   res->end = end;
+   res->end = res_end;
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
 
request_resource(_resource, res);
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 628028bfbb92..a149d9cb4fdb 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1155,9 +1155,8 @@ phys_addr_t arm_lowmem_limit __initdata = 0;
 
 void __init adjust_lowmem_bounds(void)
 {
-   phys_addr_t memblock_limit = 0;
-   u64 vmalloc_limit;
-   struct memblock_region *reg;
+   phys_addr_t block_start, block_end, memblock_limit = 0;
+   u64 vmalloc_limit, i;
phys_addr_t lowmem_limit = 0;
 
/*
@@ -1173,26 +1172,18 @@ void __init adjust_lowmem_bounds(void)
 * The first usable region must be PMD aligned. Mark its start
 * as MEMBLOCK_NOMAP if it isn't
   

[PATCH 12/15] arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

There are several occurrences of the following pattern:

for_each_memblock(memory, reg) {
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);

/* do something with start_pfn and end_pfn */
}

Rather than iterate over all memblock.memory regions and each time query
for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
simpler and clearer code.
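
The converted loops then take roughly this shape (a sketch mirroring the
hunks below; the PFN variables and the NULL nid pointer follow the way
each call site declares them):

	unsigned long start_pfn, end_pfn;
	int i;

	for_each_mem_pfn_range(i, NUMA_NO_NODE, &start_pfn, &end_pfn, NULL) {
		/* do something with start_pfn and end_pfn */
	}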

Signed-off-by: Mike Rapoport 
---
 arch/arm/mm/init.c   | 11 ---
 arch/arm64/mm/init.c | 11 ---
 arch/powerpc/kernel/fadump.c | 11 ++-
 arch/powerpc/mm/mem.c| 15 ---
 arch/powerpc/mm/numa.c   |  7 ++-
 arch/s390/mm/page-states.c   |  6 ++
 arch/sh/mm/init.c|  9 +++--
 mm/memblock.c|  6 ++
 mm/sparse.c  | 10 --
 9 files changed, 35 insertions(+), 51 deletions(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 626af348eb8f..bb56668b4f54 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -304,16 +304,14 @@ free_memmap(unsigned long start_pfn, unsigned long end_pfn)
  */
 static void __init free_unused_memmap(void)
 {
-   unsigned long start, prev_end = 0;
-   struct memblock_region *reg;
+   unsigned long start, end, prev_end = 0;
+   int i;
 
/*
 * This relies on each bank being in address order.
 * The banks are sorted previously in bootmem_init().
 */
-   for_each_memblock(memory, reg) {
-   start = memblock_region_memory_base_pfn(reg);
-
+	for_each_mem_pfn_range(i, NUMA_NO_NODE, &start, &end, NULL) {
 #ifdef CONFIG_SPARSEMEM
/*
 * Take care not to free memmap entries that don't exist
@@ -341,8 +339,7 @@ static void __init free_unused_memmap(void)
 * memmap entries are valid from the bank end aligned to
 * MAX_ORDER_NR_PAGES.
 */
-   prev_end = ALIGN(memblock_region_memory_end_pfn(reg),
-MAX_ORDER_NR_PAGES);
+   prev_end = ALIGN(end, MAX_ORDER_NR_PAGES);
}
 
 #ifdef CONFIG_SPARSEMEM
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..271a8ea32482 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -473,12 +473,10 @@ static inline void free_memmap(unsigned long start_pfn, unsigned long end_pfn)
  */
 static void __init free_unused_memmap(void)
 {
-   unsigned long start, prev_end = 0;
-   struct memblock_region *reg;
-
-   for_each_memblock(memory, reg) {
-   start = __phys_to_pfn(reg->base);
+   unsigned long start, end, prev_end = 0;
+   int i;
 
+	for_each_mem_pfn_range(i, NUMA_NO_NODE, &start, &end, NULL) {
 #ifdef CONFIG_SPARSEMEM
/*
 * Take care not to free memmap entries that don't exist due
@@ -498,8 +496,7 @@ static void __init free_unused_memmap(void)
 * memmap entries are valid from the bank end aligned to
 * MAX_ORDER_NR_PAGES.
 */
-   prev_end = ALIGN(__phys_to_pfn(reg->base + reg->size),
-MAX_ORDER_NR_PAGES);
+   prev_end = ALIGN(end, MAX_ORDER_NR_PAGES);
}
 
 #ifdef CONFIG_SPARSEMEM
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 2446a61e3c25..fdbafe417139 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1216,14 +1216,15 @@ static void fadump_free_reserved_memory(unsigned long start_pfn,
  */
 static void fadump_release_reserved_area(u64 start, u64 end)
 {
-   u64 tstart, tend, spfn, epfn;
-   struct memblock_region *reg;
+   u64 tstart, tend, spfn, epfn, reg_spfn, reg_epfn, i;
 
spfn = PHYS_PFN(start);
epfn = PHYS_PFN(end);
-   for_each_memblock(memory, reg) {
-   tstart = max_t(u64, spfn, memblock_region_memory_base_pfn(reg));
-   tend   = min_t(u64, epfn, memblock_region_memory_end_pfn(reg));
+
+	for_each_mem_pfn_range(i, NUMA_NO_NODE, &reg_spfn, &reg_epfn, NULL) {
+   tstart = max_t(u64, spfn, reg_spfn);
+   tend   = min_t(u64, epfn, reg_epfn);
+
if (tstart < tend) {
fadump_free_reserved_memory(tstart, tend);
 
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c2c11eb8dcfc..38d1acd7c8ef 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -192,15 +192,16 @@ void __init initmem_init(void)
 /* mark pages that don't exist as nosave */
 static int __init mark_nonram_nosave(void)
 {
-   struct memblock_region *reg, *prev = NULL;
+   unsigned long spfn, epfn, prev = 0;
+   int i;
 
-   for_each_memblock(memory, reg) {
-   if (prev &&
-   memblock_region_memory_end_pfn(prev) < 

[PATCH 11/15] memblock: reduce number of parameters in for_each_mem_range()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

Currently for_each_mem_range() iterator is the most generic way to traverse
memblock regions. As such, it has 8 parameters and it is hardly convenient
to users. Most users choose to utilize one of its wrappers and the only
user that actually needs most of the parameters outside memblock is s390
crash dump implementation.

To avoid yet another naming for memblock iterators, rename the existing
for_each_mem_range() to __for_each_mem_range() and add a new
for_each_mem_range() wrapper with only index, start and end parameters.

The new wrapper nicely fits into init_unavailable_mem() and will be used in
upcoming changes to simplify memblock traversals.
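
As a rough sketch of the difference (taken from the arm64 kexec_file
hunk below), a caller that only needs the physical ranges goes from the
old eight-parameter form, which remains available as
__for_each_mem_range():

	for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
			   MEMBLOCK_NONE, &start, &end, NULL)
		nr_ranges++;

to the new three-parameter wrapper:

	for_each_mem_range(i, &start, &end)
		nr_ranges++;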

Signed-off-by: Mike Rapoport 
---
 .clang-format  |  1 +
 arch/arm64/kernel/machine_kexec_file.c |  6 ++
 arch/s390/kernel/crash_dump.c  |  8 
 include/linux/memblock.h   | 18 ++
 mm/page_alloc.c|  3 +--
 5 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/.clang-format b/.clang-format
index a0a96088c74f..52ededab25ce 100644
--- a/.clang-format
+++ b/.clang-format
@@ -205,6 +205,7 @@ ForEachMacros:
   - 'for_each_memblock_type'
   - 'for_each_memcg_cache_index'
   - 'for_each_mem_pfn_range'
+  - '__for_each_mem_range'
   - 'for_each_mem_range'
   - 'for_each_mem_range_rev'
   - 'for_each_migratetype_order'
diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index 361a1143e09e..5b0e67b93cdc 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -215,8 +215,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
phys_addr_t start, end;
 
nr_ranges = 1; /* for exclusion of crashkernel region */
-	for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-			MEMBLOCK_NONE, &start, &end, NULL)
+	for_each_mem_range(i, &start, &end)
nr_ranges++;
 
cmem = kmalloc(struct_size(cmem, ranges, nr_ranges), GFP_KERNEL);
@@ -225,8 +224,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
 
cmem->max_nr_ranges = nr_ranges;
cmem->nr_ranges = 0;
-	for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-			MEMBLOCK_NONE, &start, &end, NULL) {
+	for_each_mem_range(i, &start, &end) {
cmem->ranges[cmem->nr_ranges].start = start;
cmem->ranges[cmem->nr_ranges].end = end - 1;
cmem->nr_ranges++;
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index f96a5857bbfd..e28085c725ff 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -549,8 +549,8 @@ static int get_mem_chunk_cnt(void)
int cnt = 0;
u64 idx;
 
-	for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
-			   MEMBLOCK_NONE, NULL, NULL, NULL)
+	__for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
+			     MEMBLOCK_NONE, NULL, NULL, NULL)
cnt++;
return cnt;
 }
@@ -563,8 +563,8 @@ static void loads_init(Elf64_Phdr *phdr, u64 loads_offset)
phys_addr_t start, end;
u64 idx;
 
-	for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
-			   MEMBLOCK_NONE, &start, &end, NULL) {
+	__for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
+			     MEMBLOCK_NONE, &start, &end, NULL) {
phdr->p_filesz = end - start;
phdr->p_type = PT_LOAD;
phdr->p_offset = start;
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e6a23b3db696..d70c2835e913 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -142,7 +142,7 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
 void __memblock_free_late(phys_addr_t base, phys_addr_t size);
 
 /**
- * for_each_mem_range - iterate through memblock areas from type_a and not
+ * __for_each_mem_range - iterate through memblock areas from type_a and not
  * included in type_b. Or just type_a if type_b is NULL.
  * @i: u64 used as loop variable
  * @type_a: ptr to memblock_type to iterate
@@ -153,7 +153,7 @@ void __memblock_free_late(phys_addr_t base, phys_addr_t size);
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
  * @p_nid: ptr to int for nid of the range, can be %NULL
  */
-#define for_each_mem_range(i, type_a, type_b, nid, flags,  \
+#define __for_each_mem_range(i, type_a, type_b, nid, flags,\
   p_start, p_end, p_nid)   \
	for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,	\
 p_start, p_end, p_nid);\
@@ -182,6 +182,16 @@ void __memblock_free_late(phys_addr_t base, phys_addr_t size);
	     __next_mem_range_rev(&i, nid, flags, type_a, type_b,	\
  p_start, p_end, p_nid))
 

[PATCH 10/15] memblock: make memblock_debug and related functionality private

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

The only user of memblock_dbg() outside memblock was the s390 setup code,
and it is converted to use pr_debug() instead.
This allows us to stop exposing memblock_debug and memblock_dbg() to the
rest of the kernel.

Signed-off-by: Mike Rapoport 
---
 arch/s390/kernel/setup.c |  4 ++--
 include/linux/memblock.h | 12 +---
 mm/memblock.c| 13 +++--
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 07aa15ba43b3..8b284cf6e199 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -776,8 +776,8 @@ static void __init memblock_add_mem_detect_info(void)
unsigned long start, end;
int i;
 
-   memblock_dbg("physmem info source: %s (%hhd)\n",
-get_mem_info_source(), mem_detect.info_source);
+   pr_debug("physmem info source: %s (%hhd)\n",
+get_mem_info_source(), mem_detect.info_source);
/* keep memblock lists close to the kernel */
memblock_set_bottom_up(true);
	for_each_mem_detect_block(i, &start, &end) {
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 220b5f0dad42..e6a23b3db696 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -90,7 +90,6 @@ struct memblock {
 };
 
 extern struct memblock memblock;
-extern int memblock_debug;
 
 #ifndef CONFIG_ARCH_KEEP_MEMBLOCK
 #define __init_memblock __meminit
@@ -102,9 +101,6 @@ void memblock_discard(void);
 static inline void memblock_discard(void) {}
 #endif
 
-#define memblock_dbg(fmt, ...) \
-   if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
-
 phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
   phys_addr_t size, phys_addr_t align);
 void memblock_allow_resize(void);
@@ -456,13 +452,7 @@ bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
 bool memblock_is_reserved(phys_addr_t addr);
 bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
 
-extern void __memblock_dump_all(void);
-
-static inline void memblock_dump_all(void)
-{
-   if (memblock_debug)
-   __memblock_dump_all();
-}
+void memblock_dump_all(void);
 
 /**
  * memblock_set_current_limit - Set the current allocation limit to allow
diff --git a/mm/memblock.c b/mm/memblock.c
index a5b9b3df81fc..824938849f6d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -134,7 +134,10 @@ struct memblock memblock __initdata_memblock = {
 i < memblock_type->cnt;\
 i++, rgn = _type->regions[i])
 
-int memblock_debug __initdata_memblock;
+#define memblock_dbg(fmt, ...) \
+   if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
+
+static int memblock_debug __initdata_memblock;
 static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
 static int memblock_memory_in_slab __initdata_memblock = 0;
@@ -1919,7 +1922,7 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
}
 }
 
-void __init_memblock __memblock_dump_all(void)
+static void __init_memblock __memblock_dump_all(void)
 {
pr_info("MEMBLOCK configuration:\n");
pr_info(" memory size = %pa reserved size = %pa\n",
@@ -1933,6 +1936,12 @@ void __init_memblock __memblock_dump_all(void)
 #endif
 }
 
+void __init_memblock memblock_dump_all(void)
+{
+   if (memblock_debug)
+   __memblock_dump_all();
+}
+
 void __init memblock_allow_resize(void)
 {
memblock_can_resize = 1;
-- 
2.26.2



[PATCH 09/15] memblock: make for_each_memblock_type() iterator private

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

for_each_memblock_type() is not used outside mm/memblock.c, move it there
from include/linux/memblock.h

Signed-off-by: Mike Rapoport 
---
 include/linux/memblock.h | 5 -
 mm/memblock.c| 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 017fae833d4a..220b5f0dad42 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -532,11 +532,6 @@ static inline unsigned long memblock_region_reserved_end_pfn(const struct memblo
	     region < (memblock.memblock_type.regions + memblock.memblock_type.cnt);\
 region++)
 
-#define for_each_memblock_type(i, memblock_type, rgn)  \
-	for (i = 0, rgn = &memblock_type->regions[0];			\
-	     i < memblock_type->cnt;					\
-	     i++, rgn = &memblock_type->regions[i])
-
 extern void *alloc_large_system_hash(const char *tablename,
 unsigned long bucketsize,
 unsigned long numentries,
diff --git a/mm/memblock.c b/mm/memblock.c
index 39aceafc57f6..a5b9b3df81fc 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -129,6 +129,11 @@ struct memblock memblock __initdata_memblock = {
.current_limit  = MEMBLOCK_ALLOC_ANYWHERE,
 };
 
+#define for_each_memblock_type(i, memblock_type, rgn)  \
+	for (i = 0, rgn = &memblock_type->regions[0];			\
+	     i < memblock_type->cnt;					\
+	     i++, rgn = &memblock_type->regions[i])
+
 int memblock_debug __initdata_memblock;
 static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
-- 
2.26.2



[PATCH 08/15] microblaze: drop unneeded NUMA and sparsemem initializations

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

microblaze supports neither NUMA nor SPARSEMEM, so there is no point in
calling the memblock_set_node() and
sparse_memory_present_with_active_regions() functions during microblaze
memory initialization.

Remove these calls and the surrounding code.

Signed-off-by: Mike Rapoport 
---
 arch/microblaze/mm/init.c | 17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 521b59ba716c..49e0c241f9b1 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -105,9 +105,8 @@ static void __init paging_init(void)
 
 void __init setup_memory(void)
 {
-   struct memblock_region *reg;
-
 #ifndef CONFIG_MMU
+   struct memblock_region *reg;
u32 kernel_align_start, kernel_align_size;
 
/* Find main memory where is the kernel */
@@ -161,20 +160,6 @@ void __init setup_memory(void)
pr_info("%s: max_low_pfn: %#lx\n", __func__, max_low_pfn);
pr_info("%s: max_pfn: %#lx\n", __func__, max_pfn);
 
-   /* Add active regions with valid PFNs */
-   for_each_memblock(memory, reg) {
-   unsigned long start_pfn, end_pfn;
-
-   start_pfn = memblock_region_memory_base_pfn(reg);
-   end_pfn = memblock_region_memory_end_pfn(reg);
-   memblock_set_node(start_pfn << PAGE_SHIFT,
- (end_pfn - start_pfn) << PAGE_SHIFT,
-			  &memblock.memory, 0);
-   }
-
-   /* XXX need to clip this if using highmem? */
-   sparse_memory_present_with_active_regions(0);
-
paging_init();
 }
 
-- 
2.26.2



[PATCH 07/15] riscv: drop unneeded node initialization

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

RISC-V does not (yet) support NUMA, and for UMA architectures node 0 is
used implicitly during early memory initialization.

There is no need to call memblock_set_node(), remove this call and the
surrounding code.

Signed-off-by: Mike Rapoport 
---
 arch/riscv/mm/init.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 79e9d55bdf1a..7440ba2cdaaa 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -191,15 +191,6 @@ void __init setup_bootmem(void)
early_init_fdt_scan_reserved_mem();
memblock_allow_resize();
memblock_dump_all();
-
-   for_each_memblock(memory, reg) {
-   unsigned long start_pfn = memblock_region_memory_base_pfn(reg);
-   unsigned long end_pfn = memblock_region_memory_end_pfn(reg);
-
-   memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn),
-			  &memblock.memory, 0);
-   }
 }
 
 #ifdef CONFIG_MMU
-- 
2.26.2



[PATCH 06/15] powerpc: fadump: simplify fadump_reserve_crash_area()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

fadump_reserve_crash_area() reserves memory from a specified base address
till the end of the RAM.

Replace the iteration through memblock.memory with a single call to
memblock_reserve() with an appropriate size that will take care of proper
memory reservation.

Signed-off-by: Mike Rapoport 
---
 arch/powerpc/kernel/fadump.c | 20 +---
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 78ab9a6ee6ac..2446a61e3c25 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1658,25 +1658,7 @@ int __init fadump_reserve_mem(void)
 /* Preserve everything above the base address */
 static void __init fadump_reserve_crash_area(u64 base)
 {
-   struct memblock_region *reg;
-   u64 mstart, msize;
-
-   for_each_memblock(memory, reg) {
-   mstart = reg->base;
-   msize  = reg->size;
-
-   if ((mstart + msize) < base)
-   continue;
-
-   if (mstart < base) {
-   msize -= (base - mstart);
-   mstart = base;
-   }
-
-		pr_info("Reserving %lluMB of memory at %#016llx for preserving crash data",
-   (msize >> 20), mstart);
-   memblock_reserve(mstart, msize);
-   }
+   memblock_reserve(base, memblock_end_of_DRAM() - base);
 }
 
 unsigned long __init arch_reserved_kernel_pages(void)
-- 
2.26.2



[PATCH 05/15] h8300, nds32, openrisc: simplify detection of memory extents

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

Instead of traversing memblock.memory regions to find memory_start and
memory_end, simply query memblock_{start,end}_of_DRAM().

Signed-off-by: Mike Rapoport 
---
 arch/h8300/kernel/setup.c| 8 +++-
 arch/nds32/kernel/setup.c| 8 ++--
 arch/openrisc/kernel/setup.c | 9 ++---
 3 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/h8300/kernel/setup.c b/arch/h8300/kernel/setup.c
index 28ac88358a89..0281f92eea3d 100644
--- a/arch/h8300/kernel/setup.c
+++ b/arch/h8300/kernel/setup.c
@@ -74,17 +74,15 @@ static void __init bootmem_init(void)
memory_end = memory_start = 0;
 
/* Find main memory where is the kernel */
-   for_each_memblock(memory, region) {
-   memory_start = region->base;
-   memory_end = region->base + region->size;
-   }
+   memory_start = memblock_start_of_DRAM();
+   memory_end = memblock_end_of_DRAM();
 
if (!memory_end)
panic("No memory!");
 
	/* setup bootmem globals (we use no_bootmem, but mm still depends on this) */
min_low_pfn = PFN_UP(memory_start);
-   max_low_pfn = PFN_DOWN(memblock_end_of_DRAM());
+   max_low_pfn = PFN_DOWN(memory_end);
max_pfn = max_low_pfn;
 
memblock_reserve(__pa(_stext), _end - _stext);
diff --git a/arch/nds32/kernel/setup.c b/arch/nds32/kernel/setup.c
index a066efbe53c0..c356e484dcab 100644
--- a/arch/nds32/kernel/setup.c
+++ b/arch/nds32/kernel/setup.c
@@ -249,12 +249,8 @@ static void __init setup_memory(void)
memory_end = memory_start = 0;
 
/* Find main memory where is the kernel */
-   for_each_memblock(memory, region) {
-   memory_start = region->base;
-   memory_end = region->base + region->size;
-   pr_info("%s: Memory: 0x%x-0x%x\n", __func__,
-   memory_start, memory_end);
-   }
+   memory_start = memblock_start_of_DRAM();
+   memory_end = memblock_end_of_DRAM();
 
if (!memory_end) {
panic("No memory!");
diff --git a/arch/openrisc/kernel/setup.c b/arch/openrisc/kernel/setup.c
index 8aa438e1f51f..c5706153d3b6 100644
--- a/arch/openrisc/kernel/setup.c
+++ b/arch/openrisc/kernel/setup.c
@@ -48,17 +48,12 @@ static void __init setup_memory(void)
unsigned long ram_start_pfn;
unsigned long ram_end_pfn;
phys_addr_t memory_start, memory_end;
-   struct memblock_region *region;
 
memory_end = memory_start = 0;
 
/* Find main memory where is the kernel, we assume its the only one */
-   for_each_memblock(memory, region) {
-   memory_start = region->base;
-   memory_end = region->base + region->size;
-   printk(KERN_INFO "%s: Memory: 0x%x-0x%x\n", __func__,
-  memory_start, memory_end);
-   }
+   memory_start = memblock_start_of_DRAM();
+   memory_end = memblock_end_of_DRAM();
 
if (!memory_end) {
panic("No memory!");
-- 
2.26.2



[PATCH 04/15] arm64: numa: simplify dummy_numa_init()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

dummy_numa_init() loops over memblock.memory and passes nid=0 to
numa_add_memblk() which essentially wraps memblock_set_node(). However,
memblock_set_node() can cope with the entire memory span itself, so the
loop over memblock.memory regions is redundant.

Replace the loop with a single call to memblock_set_node() covering the
entire memory.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/mm/numa.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..0cbdbcc885fb 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -423,19 +423,16 @@ static int __init numa_init(int (*init_func)(void))
  */
 static int __init dummy_numa_init(void)
 {
+   phys_addr_t start = memblock_start_of_DRAM();
+   phys_addr_t end = memblock_end_of_DRAM();
int ret;
-   struct memblock_region *mblk;
 
if (numa_off)
pr_info("NUMA disabled\n"); /* Forced off on command line. */
-   pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
-   memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);
-
-   for_each_memblock(memory, mblk) {
-   ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
-   if (!ret)
-   continue;
+   pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", start, end - 1);
 
+   ret = numa_add_memblk(0, start, end);
+   if (ret) {
pr_err("NUMA init failed\n");
return ret;
}
-- 
2.26.2



[PATCH 03/15] arm, xtensa: simplify initialization of high memory pages

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

The free_highpages() function in both arm and xtensa essentially open-codes
a for_each_free_mem_range() loop to detect high memory pages that were not
reserved and that should be initialized and passed to the buddy allocator.

Replace open-coded implementation of for_each_free_mem_range() with usage
of memblock API to simplify the code.

Signed-off-by: Mike Rapoport 
---
 arch/arm/mm/init.c| 48 +++--
 arch/xtensa/mm/init.c | 55 ---
 2 files changed, 18 insertions(+), 85 deletions(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 01e18e43b174..626af348eb8f 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -352,61 +352,29 @@ static void __init free_unused_memmap(void)
 #endif
 }
 
-#ifdef CONFIG_HIGHMEM
-static inline void free_area_high(unsigned long pfn, unsigned long end)
-{
-   for (; pfn < end; pfn++)
-   free_highmem_page(pfn_to_page(pfn));
-}
-#endif
-
 static void __init free_highpages(void)
 {
 #ifdef CONFIG_HIGHMEM
unsigned long max_low = max_low_pfn;
-   struct memblock_region *mem, *res;
+   phys_addr_t range_start, range_end;
+   u64 i;
 
/* set highmem page free */
-   for_each_memblock(memory, mem) {
-   unsigned long start = memblock_region_memory_base_pfn(mem);
-   unsigned long end = memblock_region_memory_end_pfn(mem);
+   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+				&range_start, &range_end, NULL) {
+   unsigned long start = PHYS_PFN(range_start);
+   unsigned long end = PHYS_PFN(range_end);
 
/* Ignore complete lowmem entries */
if (end <= max_low)
continue;
 
-   if (memblock_is_nomap(mem))
-   continue;
-
/* Truncate partial highmem entries */
if (start < max_low)
start = max_low;
 
-   /* Find and exclude any reserved regions */
-   for_each_memblock(reserved, res) {
-   unsigned long res_start, res_end;
-
-   res_start = memblock_region_reserved_base_pfn(res);
-   res_end = memblock_region_reserved_end_pfn(res);
-
-   if (res_end < start)
-   continue;
-   if (res_start < start)
-   res_start = start;
-   if (res_start > end)
-   res_start = end;
-   if (res_end > end)
-   res_end = end;
-   if (res_start != start)
-   free_area_high(start, res_start);
-   start = res_end;
-   if (start == end)
-   break;
-   }
-
-   /* And now free anything which remains */
-   if (start < end)
-   free_area_high(start, end);
+   for (; start < end; start++)
+   free_highmem_page(pfn_to_page(start));
}
 #endif
 }
diff --git a/arch/xtensa/mm/init.c b/arch/xtensa/mm/init.c
index a05b306cf371..ad9d59d93f39 100644
--- a/arch/xtensa/mm/init.c
+++ b/arch/xtensa/mm/init.c
@@ -79,67 +79,32 @@ void __init zones_init(void)
free_area_init(max_zone_pfn);
 }
 
-#ifdef CONFIG_HIGHMEM
-static void __init free_area_high(unsigned long pfn, unsigned long end)
-{
-   for (; pfn < end; pfn++)
-   free_highmem_page(pfn_to_page(pfn));
-}
-
 static void __init free_highpages(void)
 {
+#ifdef CONFIG_HIGHMEM
unsigned long max_low = max_low_pfn;
-   struct memblock_region *mem, *res;
+   phys_addr_t range_start, range_end;
+   u64 i;
 
-   reset_all_zones_managed_pages();
/* set highmem page free */
-   for_each_memblock(memory, mem) {
-   unsigned long start = memblock_region_memory_base_pfn(mem);
-   unsigned long end = memblock_region_memory_end_pfn(mem);
+   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+				&range_start, &range_end, NULL) {
+   unsigned long start = PHYS_PFN(range_start);
+   unsigned long end = PHYS_PFN(range_end);
 
/* Ignore complete lowmem entries */
if (end <= max_low)
continue;
 
-   if (memblock_is_nomap(mem))
-   continue;
-
/* Truncate partial highmem entries */
if (start < max_low)
start = max_low;
 
-   /* Find and exclude any reserved regions */
-   for_each_memblock(reserved, res) {
-   unsigned long res_start, res_end;
-
-   res_start = memblock_region_reserved_base_pfn(res);
-  

[PATCH 02/15] dma-contiguous: simplify cma_early_percent_memory()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

The memory size calculation in cma_early_percent_memory() traverses
memblock.memory rather than simply calling memblock_phys_mem_size(). The
comment in that function suggests that at some point there should have been
a call to memblock_analyze() before memblock_phys_mem_size() could be used.
As of now, there is no memblock_analyze() at all and
memblock_phys_mem_size() can be used as soon as cold-plug memory is
registered with memblock.

Replace loop over memblock.memory with a call to memblock_phys_mem_size().

Signed-off-by: Mike Rapoport 
---
 kernel/dma/contiguous.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 15bc5026c485..1992afd8ca7b 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -73,16 +73,7 @@ early_param("cma", early_cma);
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
 {
-   struct memblock_region *reg;
-   unsigned long total_pages = 0;
-
-   /*
-* We cannot use memblock_phys_mem_size() here, because
-* memblock_analyze() has not been called yet.
-*/
-   for_each_memblock(memory, reg)
-   total_pages += memblock_region_memory_end_pfn(reg) -
-  memblock_region_memory_base_pfn(reg);
+   unsigned long total_pages = PHYS_PFN(memblock_phys_mem_size());
 
return (total_pages * CONFIG_CMA_SIZE_PERCENTAGE / 100) << PAGE_SHIFT;
 }
-- 
2.26.2



[PATCH 01/15] KVM: PPC: Book3S HV: simplify kvm_cma_reserve()

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

The memory size calculation in kvm_cma_reserve() traverses memblock.memory
rather than simply calling memblock_phys_mem_size(). The comment in that
function suggests that at some point there should have been a call to
memblock_analyze() before memblock_phys_mem_size() could be used.
As of now, there is no memblock_analyze() at all and
memblock_phys_mem_size() can be used as soon as cold-plug memory is
registered with memblock.

Replace loop over memblock.memory with a call to memblock_phys_mem_size().

Signed-off-by: Mike Rapoport 
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 7cd3cf3d366b..56ab0d28de2a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -95,22 +95,15 @@ EXPORT_SYMBOL_GPL(kvm_free_hpt_cma);
 void __init kvm_cma_reserve(void)
 {
unsigned long align_size;
-   struct memblock_region *reg;
-   phys_addr_t selected_size = 0;
+   phys_addr_t selected_size;
 
/*
 * We need CMA reservation only when we are in HV mode
 */
if (!cpu_has_feature(CPU_FTR_HVMODE))
return;
-   /*
-* We cannot use memblock_phys_mem_size() here, because
-* memblock_analyze() has not been called yet.
-*/
-   for_each_memblock(memory, reg)
-   selected_size += memblock_region_memory_end_pfn(reg) -
-memblock_region_memory_base_pfn(reg);
 
+   selected_size = PHYS_PFN(memblock_phys_mem_size());
	selected_size = (selected_size * kvm_cma_resv_ratio / 100) << PAGE_SHIFT;
if (selected_size) {
pr_debug("%s: reserving %ld MiB for global area\n", __func__,
-- 
2.26.2



[PATCH 00/15] memblock: seasonal cleaning^w cleanup

2020-07-27 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches simplify several uses of memblock iterators and hide some of
the memblock implementation details from the rest of the system.

The patches are on top of v5.8-rc7 + cherry-pick of "mm/sparse: cleanup the
code surrounding memory_present()" [1] from mmotm tree.

[1] http://lkml.kernel.org/r/20200712083130.22919-1-r...@kernel.org 

Mike Rapoport (15):
  KVM: PPC: Book3S HV: simplify kvm_cma_reserve()
  dma-contiguous: simplify cma_early_percent_memory()
  arm, xtensa: simplify initialization of high memory pages
  arm64: numa: simplify dummy_numa_init()
  h8300, nds32, openrisc: simplify detection of memory extents
  powerpc: fadump: simplify fadump_reserve_crash_area()
  riscv: drop unneeded node initialization
  microblaze: drop unneeded NUMA and sparsemem initializations
  memblock: make for_each_memblock_type() iterator private
  memblock: make memblock_debug and related functionality private
  memblock: reduce number of parameters in for_each_mem_range()
  arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()
  arch, drivers: replace for_each_memblock() with for_each_mem_range()
  x86/numa: remove redundant iteration over memblock.reserved
  memblock: remove 'type' parameter from for_each_memblock()

 .clang-format|  1 +
 arch/arm/kernel/setup.c  | 18 +---
 arch/arm/mm/init.c   | 59 +---
 arch/arm/mm/mmu.c| 39 ++--
 arch/arm/mm/pmsa-v7.c| 20 
 arch/arm/mm/pmsa-v8.c| 17 ---
 arch/arm/xen/mm.c|  7 +--
 arch/arm64/kernel/machine_kexec_file.c   |  6 +--
 arch/arm64/kernel/setup.c|  2 +-
 arch/arm64/mm/init.c | 11 ++---
 arch/arm64/mm/kasan_init.c   |  8 ++--
 arch/arm64/mm/mmu.c  | 11 ++---
 arch/arm64/mm/numa.c | 15 +++---
 arch/c6x/kernel/setup.c  |  9 ++--
 arch/h8300/kernel/setup.c|  8 ++--
 arch/microblaze/mm/init.c| 24 ++
 arch/mips/cavium-octeon/dma-octeon.c | 12 ++---
 arch/mips/kernel/setup.c | 31 ++---
 arch/mips/netlogic/xlp/setup.c   |  2 +-
 arch/nds32/kernel/setup.c|  8 +---
 arch/openrisc/kernel/setup.c |  9 +---
 arch/openrisc/mm/init.c  |  8 ++--
 arch/powerpc/kernel/fadump.c | 58 ---
 arch/powerpc/kvm/book3s_hv_builtin.c | 11 +
 arch/powerpc/mm/book3s64/hash_utils.c| 16 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c | 11 ++---
 arch/powerpc/mm/kasan/kasan_init_32.c|  8 ++--
 arch/powerpc/mm/mem.c| 33 +++--
 arch/powerpc/mm/numa.c   |  7 +--
 arch/powerpc/mm/pgtable_32.c |  8 ++--
 arch/riscv/mm/init.c | 33 -
 arch/riscv/mm/kasan_init.c   | 10 ++--
 arch/s390/kernel/crash_dump.c|  8 ++--
 arch/s390/kernel/setup.c | 31 -
 arch/s390/mm/page-states.c   |  6 +--
 arch/s390/mm/vmem.c  | 16 ---
 arch/sh/mm/init.c|  9 ++--
 arch/sparc/mm/init_64.c  | 12 ++---
 arch/x86/mm/numa.c   | 26 ---
 arch/xtensa/mm/init.c| 55 --
 drivers/bus/mvebu-mbus.c | 12 ++---
 drivers/s390/char/zcore.c|  9 ++--
 include/linux/memblock.h | 45 +-
 kernel/dma/contiguous.c  | 11 +
 mm/memblock.c| 28 +++
 mm/page_alloc.c  | 11 ++---
 mm/sparse.c  | 10 ++--
 47 files changed, 324 insertions(+), 485 deletions(-)

-- 
2.26.2



Re: [RESEND PATCH v5 00/11] ppc64: enable kdump support for kexec_file_load syscall

2020-07-27 Thread piliu



On 07/27/2020 03:36 AM, Hari Bathini wrote:
> Sorry! There was a gateway issue on my system while posting v5, due to
> which some patches did not make it through. Resending...
> 
> This patch series enables kdump support for kexec_file_load system
> call (kexec -s -p) on PPC64. The changes are inspired from kexec-tools
> code but heavily modified for kernel consumption.
> 
> The first patch adds a weak arch_kexec_locate_mem_hole() function to
> override locate memory hole logic suiting arch needs. There are some
> special regions in ppc64 which should be avoided while loading buffer
> & there are multiple callers to kexec_add_buffer making it complicated
> to maintain range sanity and using generic lookup at the same time.
> 
> The second patch marks ppc64 specific code within arch/powerpc/kexec
> and arch/powerpc/purgatory to make the subsequent code changes easy
> to understand.
> 
> The next patch adds helper function to setup different memory ranges
> needed for loading kdump kernel, booting into it and exporting the
> crashing kernel's elfcore.
> 
> The fourth patch overrides arch_kexec_locate_mem_hole() function to
> locate memory hole for kdump segments by accounting for the special
> memory regions, referred to as excluded memory ranges, and sets
> kbuf->mem when a suitable memory region is found.
> 
> The fifth patch moves walk_drmem_lmbs() out of .init section with
> a few changes to reuse it for setting up kdump kernel's usable memory
> ranges. The next patch uses walk_drmem_lmbs() to look up the LMBs
> and set linux,drconf-usable-memory & linux,usable-memory properties
> in order to restrict kdump kernel's memory usage.
> 
> The seventh patch updates purgatory to setup r8 & r9 with opal base
> and opal entry addresses respectively to aid kernels built with
> CONFIG_PPC_EARLY_DEBUG_OPAL enabled. The next patch setups up backup
> region as a kexec segment while loading kdump kernel and teaches
> purgatory to copy data from source to destination.
> 
> Patch 09 builds the elfcore header for the running kernel & passes
> the info to kdump kernel via "elfcorehdr=" parameter to export as
> /proc/vmcore file. The next patch sets up the memory reserve map
> for the kexec kernel and also claims kdump support for kdump as
> all the necessary changes are added.
> 
> The last patch fixes a lookup issue for `kexec -l -s` case when
> memory is reserved for crashkernel.
> 
> Tested the changes successfully on P8, P9 lpars, couple of OpenPOWER
> boxes, one with secureboot enabled, KVM guest and a simulator.
> 
> v4 -> v5:
> * Dropped patches 07/12 & 08/12 and updated purgatory to do everything
>   in assembly.
I guess you achieve this by carefully selecting instructions to avoid
relocation issues, right?

Thanks,
Pingfan



[PATCH 16/24] powerpc: use asm-generic/mmu_context.h for no-op implementations

2020-07-27 Thread Nicholas Piggin
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/mmu_context.h | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 1a474f6b1992..242bd987247b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -14,7 +14,9 @@
 /*
  * Most if the context management is out of line
  */
+#define init_new_context init_new_context
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+#define destroy_context destroy_context
 extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
@@ -237,27 +239,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 }
 #define switch_mm_irqs_off switch_mm_irqs_off
 
-
-#define deactivate_mm(tsk,mm)  do { } while (0)
-
-/*
- * After we have set current->mm to a new value, this activates
- * the context for the new mm so we see the new mappings.
- */
-static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
-{
-   switch_mm(prev, next, current);
-}
-
-/* We don't currently use enter_lazy_tlb() for anything */
+#ifdef CONFIG_PPC_BOOK3E_64
+#define enter_lazy_tlb enter_lazy_tlb
 static inline void enter_lazy_tlb(struct mm_struct *mm,
  struct task_struct *tsk)
 {
/* 64-bit Book3E keeps track of current PGD in the PACA */
-#ifdef CONFIG_PPC_BOOK3E_64
get_paca()->pgd = NULL;
-#endif
 }
+#endif
 
 extern void arch_exit_mmap(struct mm_struct *mm);
 
@@ -300,5 +290,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+#include <asm-generic/mmu_context.h>
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
-- 
2.23.0



Re: [RESEND PATCH v5 08/11] ppc64/kexec_file: setup backup region for kdump kernel

2020-07-27 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> Though the kdump kernel boots from the loaded address, the first 64KB
> of it is copied down to real 0. So, set up a backup region and let
> purgatory copy the first 64KB of the crashed kernel into this backup
> region before booting into the kdump kernel. Update the reserve map
> with the backup region and the crashed kernel's memory to prevent the
> kdump kernel from accidentally using that memory.
>
> Signed-off-by: Hari Bathini 

Reviewed-by: Thiago Jung Bauermann 

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH v2 4/5] powerpc/mm: Remove custom stack expansion checking

2020-07-27 Thread Michael Ellerman
Daniel Axtens  writes:
> Hi Michael,
>
> I tested v1 of this. I ran the test from the bug with a range of stack
> sizes, in a loop, for several hours and didn't see any crashes/signal
> delivery failures.
>
> I retested v2 for a few minutes just to be sure, and I ran stress-ng's
> stack, stackmmap and bad-altstack stressors to make sure no obvious
> kernel bugs were exposed. Nothing crashed.
>
> All tests done on a P8 LE guest under KVM.
>
> On that basis:
>
> Tested-by: Daniel Axtens 

Thanks.

Always nice to have someone review my patches!

cheers


[PATCH -next] powerpc: use for_each_child_of_node() macro

2020-07-27 Thread Qinglang Miao
Use for_each_child_of_node() macro instead of open coding it.
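
For illustration, the open-coded pattern and its replacement look
roughly like this (a sketch; 'parent', 'child' and do_something() are
placeholders for what each call site actually uses):

	struct device_node *child;

	/* before: open-coded child iteration */
	for (child = NULL; (child = of_get_next_child(parent, child)) != NULL;)
		do_something(child);

	/* after */
	for_each_child_of_node(parent, child)
		do_something(child);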

Signed-off-by: Qinglang Miao 
---
 arch/powerpc/platforms/pasemi/misc.c | 3 +--
 arch/powerpc/platforms/powermac/low_i2c.c| 6 ++
 arch/powerpc/platforms/powermac/pfunc_base.c | 4 ++--
 arch/powerpc/platforms/powermac/udbg_scc.c   | 2 +-
 4 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pasemi/misc.c b/arch/powerpc/platforms/pasemi/misc.c
index 1cd4ca14b..1bf65d02d 100644
--- a/arch/powerpc/platforms/pasemi/misc.c
+++ b/arch/powerpc/platforms/pasemi/misc.c
@@ -56,8 +56,7 @@ static int __init pasemi_register_i2c_devices(void)
if (!adap_node)
continue;
 
-   node = NULL;
-   while ((node = of_get_next_child(adap_node, node))) {
+   for_each_child_of_node(adap_node, node) {
struct i2c_board_info info = {};
const u32 *addr;
int len;
diff --git a/arch/powerpc/platforms/powermac/low_i2c.c b/arch/powerpc/platforms/powermac/low_i2c.c
index bf4be4b53..f77a59b5c 100644
--- a/arch/powerpc/platforms/powermac/low_i2c.c
+++ b/arch/powerpc/platforms/powermac/low_i2c.c
@@ -629,8 +629,7 @@ static void __init kw_i2c_probe(void)
for (i = 0; i < chans; i++)
kw_i2c_add(host, np, np, i);
} else {
-   for (child = NULL;
-(child = of_get_next_child(np, child)) != NULL;) {
+   for_each_child_of_node(np, child) {
const u32 *reg = of_get_property(child,
"reg", NULL);
if (reg == NULL)
@@ -1193,8 +1192,7 @@ static void pmac_i2c_devscan(void (*callback)(struct device_node *dev,
 * platform function instance
 */
	list_for_each_entry(bus, &pmac_i2c_busses, link) {
-   for (np = NULL;
-(np = of_get_next_child(bus->busnode, np)) != NULL;) {
+   for_each_child_of_node(bus->busnode, np) {
struct whitelist_ent *p;
/* If multibus, check if device is on that bus */
if (bus->flags & pmac_i2c_multibus)
diff --git a/arch/powerpc/platforms/powermac/pfunc_base.c b/arch/powerpc/platforms/powermac/pfunc_base.c
index 62311e84a..f5422506d 100644
--- a/arch/powerpc/platforms/powermac/pfunc_base.c
+++ b/arch/powerpc/platforms/powermac/pfunc_base.c
@@ -114,7 +114,7 @@ static void macio_gpio_init_one(struct macio_chip *macio)
 * Ok, got one, we dont need anything special to track them down, so
 * we just create them all
 */
-   for (gp = NULL; (gp = of_get_next_child(gparent, gp)) != NULL;) {
+   for_each_child_of_node(gparent, gp) {
const u32 *reg = of_get_property(gp, "reg", NULL);
unsigned long offset;
if (reg == NULL)
@@ -133,7 +133,7 @@ static void macio_gpio_init_one(struct macio_chip *macio)
macio->of_node);
 
/* And now we run all the init ones */
-   for (gp = NULL; (gp = of_get_next_child(gparent, gp)) != NULL;)
+   for_each_child_of_node(gparent, gp)
pmf_do_functions(gp, NULL, 0, PMF_FLAGS_ON_INIT, NULL);
 
/* Note: We do not at this point implement the "at sleep" or "at wake"
diff --git a/arch/powerpc/platforms/powermac/udbg_scc.c b/arch/powerpc/platforms/powermac/udbg_scc.c
index 6b61a18e8..f286bdfe8 100644
--- a/arch/powerpc/platforms/powermac/udbg_scc.c
+++ b/arch/powerpc/platforms/powermac/udbg_scc.c
@@ -80,7 +80,7 @@ void udbg_scc_init(int force_scc)
path = of_get_property(of_chosen, "linux,stdout-path", NULL);
if (path != NULL)
stdout = of_find_node_by_path(path);
-   for (ch = NULL; (ch = of_get_next_child(escc, ch)) != NULL;) {
+   for_each_child_of_node(escc, ch) {
if (ch == stdout)
ch_def = of_node_get(ch);
if (of_node_name_eq(ch, "ch-a"))
-- 
2.25.1



Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls

2020-07-27 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
> to be filled with OPAL base & entry addresses respectively. Setting
> these registers allows the kernel to perform OPAL calls before the
> device tree is parsed.
>
> Signed-off-by: Hari Bathini 

Reviewed-by: Thiago Jung Bauermann 

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [RESEND PATCH v5 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel

2020-07-27 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> Kdump kernel, used for capturing the kernel core image, is supposed
> to use only specific memory regions to avoid corrupting the image to
> be captured. The regions are crashkernel range - the memory reserved
> explicitly for kdump kernel, memory used for the tce-table, the OPAL
> region and RTAS region as applicable. Restrict kdump kernel memory
> to use only these regions by setting up usable-memory DT property.
> Also, tell the kdump kernel to run at the loaded address by setting
> the magic word at 0x5c.
>
> Signed-off-by: Hari Bathini 
> Tested-by: Pingfan Liu 

I liked the new versions of get_node_path_size() and get_node_path().

Reviewed-by: Thiago Jung Bauermann 

--
Thiago Jung Bauermann
IBM Linux Technology Center


[Bug 205183] PPC64: Signal delivery fails with SIGSEGV if between about 1KB and 4KB bytes of stack remain

2020-07-27 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205183

Michael Ellerman (mich...@ellerman.id.au) changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |CODE_FIX

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 205183] PPC64: Signal delivery fails with SIGSEGV if between about 1KB and 4KB bytes of stack remain

2020-07-27 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205183

Michael Ellerman (mich...@ellerman.id.au) changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 CC||mich...@ellerman.id.au

--- Comment #5 from Michael Ellerman (mich...@ellerman.id.au) ---
Patches posted:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=192046

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH] powerpc/64s/hash: Fix hash_preload running with interrupts enabled

2020-07-27 Thread Michael Ellerman
Athira Rajeev  writes:
>> On 27-Jul-2020, at 6:05 PM, Michael Ellerman  wrote:
>> 
>> Athira Rajeev  writes:
 On 27-Jul-2020, at 11:39 AM, Nicholas Piggin  wrote:
 
 Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
 caller") removed the local_irq_disable from hash_preload, but it was
 required for more than just the page table walk: the hash pte busy bit is
 effectively a lock which may be taken in interrupt context, and the local
 update flag test must not be preempted before it's used.
 
 This solves apparent lockups with perf interrupting __hash_page_64K. If
 get_perf_callchain then also takes a hash fault on the same page while it
 is already locked, it will loop forever taking hash faults, which looks 
 like
 this:
 
 cpu 0x49e: Vector: 100 (System Reset) at [c0001a4f7d70]
   pc: c0072dc8: hash_page_mm+0x8/0x800
   lr: c000c5a4: do_hash_page+0x24/0x38
   sp: c0002ac1cc69ac70
  msr: 80081033
 current = 0xc0002ac1cc602e00
 paca= 0xc0001de1f280   irqmask: 0x03   irq_happened: 0x01
   pid   = 20118, comm = pread2_processe
 Linux version 5.8.0-rc6-00345-g1fad14f18bc6
 49e:mon> t
 [c0002ac1cc69ac70] c000c5a4 do_hash_page+0x24/0x38 (unreliable)
 --- Exception: 300 (Data Access) at c008fa60 
 __copy_tofrom_user_power7+0x20c/0x7ac
 [link register   ] c0335d10 copy_from_user_nofault+0xf0/0x150
 [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
 [c0002ac1cc69afa0] c0109df0 read_user_stack_64+0x70/0xf0
 [c0002ac1cc69afd0] c0109fcc perf_callchain_user_64+0x15c/0x410
 [c0002ac1cc69b060] c0109c00 perf_callchain_user+0x20/0x40
 [c0002ac1cc69b080] c031c6cc get_perf_callchain+0x25c/0x360
 [c0002ac1cc69b120] c0316b50 perf_callchain+0x70/0xa0
 [c0002ac1cc69b140] c0316ddc perf_prepare_sample+0x25c/0x790
 [c0002ac1cc69b1a0] c0317350 perf_event_output_forward+0x40/0xb0
 [c0002ac1cc69b220] c0306138 __perf_event_overflow+0x88/0x1a0
 [c0002ac1cc69b270] c010cf70 record_and_restart+0x230/0x750
 [c0002ac1cc69b620] c010d69c perf_event_interrupt+0x20c/0x510
 [c0002ac1cc69b730] c0027d9c performance_monitor_exception+0x4c/0x60
 [c0002ac1cc69b750] c000b2f8 
 performance_monitor_common_virt+0x1b8/0x1c0
 --- Exception: f00 (Performance Monitor) at c00cb5b0 
 pSeries_lpar_hpte_insert+0x0/0x160
 [link register   ] c00846f0 __hash_page_64K+0x210/0x540
 [c0002ac1cc69ba50]  (unreliable)
 [c0002ac1cc69bb00] c0073ae0 update_mmu_cache+0x390/0x3a0
 [c0002ac1cc69bb70] c037f024 wp_page_copy+0x364/0xce0
 [c0002ac1cc69bc20] c038272c do_wp_page+0xdc/0xa60
 [c0002ac1cc69bc70] c03857bc handle_mm_fault+0xb9c/0x1b60
 [c0002ac1cc69bd50] c006c434 __do_page_fault+0x314/0xc90
 [c0002ac1cc69be20] c000c5c8 handle_page_fault+0x10/0x2c
 --- Exception: 300 (Data Access) at 7fff8c861fe8
 SP (76b19660) is in userspace
 
 Reported-by: Athira Rajeev 
 Reported-by: Anton Blanchard 
 Reviewed-by: Aneesh Kumar K.V 
 Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
 caller")
 Signed-off-by: Nicholas Piggin 
>>> 
>>> 
>>> Hi,
>>> 
>>> Tested with the patch and it fixes the lockups I was seeing with my test 
>>> run.
>>> Thanks for the fix.
>>> 
>>> Tested-by: Athira Rajeev 
>> 
>> Thanks for testing.
>> 
>> What test are you running?
>
> Hi Michael
>
> I was running  “perf record”  and Unixbench tests ( 
> https://github.com/kdlucas/byte-unixbench ) in parallel where we were getting 
> soft lockups
>
> 1. Perf command run:
> # perf record -a -g -c 1000 -o  sleep 60
>
> 2. Unixbench tests
> # Run -q -c  spawn

Thanks, I can reproduce it with that.

cheers


Re: [PATCH -next] powerpc/powernv/sriov: Remove unused but set variable 'phb'

2020-07-27 Thread Oliver O'Halloran
On Tue, Jul 28, 2020 at 3:01 AM Wei Yongjun  wrote:
>
> Gcc reports the following warning:
>
> arch/powerpc/platforms/powernv/pci-sriov.c:602:25: warning:
>  variable 'phb' set but not used [-Wunused-but-set-variable]
>   602 |  struct pnv_phb*phb;
>   | ^~~
>
> This variable is not used, so this commit removes it.
>
> Reported-by: Hulk Robot 
> Signed-off-by: Wei Yongjun 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 8404d8c3901d..7894745fd4f8 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -599,10 +599,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev 
> *dev, int offset)
>  static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>  {
> u16num_vfs, base_pe;
> -   struct pnv_phb*phb;
> struct pnv_iov_data   *iov;
>
> -   phb = pci_bus_to_pnvhb(pdev->bus);
> iov = pnv_iov_get(pdev);
> num_vfs = iov->num_vfs;
> base_pe = iov->vf_pe_arr[0].pe_number;
>

Acked-by: Oliver O'Halloran 


[PATCH][next] powerpc: Use fallthrough pseudo-keyword

2020-07-27 Thread Gustavo A. R. Silva
Replace the existing /* fall through */ comments and their variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings where they are no longer needed.

[1] 
https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva 
---
 arch/powerpc/kernel/align.c | 8 
 arch/powerpc/platforms/powermac/feature.c   | 2 +-
 arch/powerpc/platforms/powernv/opal-async.c | 2 +-
 arch/powerpc/platforms/pseries/hvcserver.c  | 2 +-
 arch/powerpc/xmon/xmon.c| 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index 1f1ce8b86d5b..c7797eb958c7 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -178,11 +178,11 @@ static int emulate_spe(struct pt_regs *regs, unsigned int 
reg,
ret |= __get_user_inatomic(temp.v[1], p++);
ret |= __get_user_inatomic(temp.v[2], p++);
ret |= __get_user_inatomic(temp.v[3], p++);
-   /* fall through */
+   fallthrough;
case 4:
ret |= __get_user_inatomic(temp.v[4], p++);
ret |= __get_user_inatomic(temp.v[5], p++);
-   /* fall through */
+   fallthrough;
case 2:
ret |= __get_user_inatomic(temp.v[6], p++);
ret |= __get_user_inatomic(temp.v[7], p++);
@@ -263,11 +263,11 @@ static int emulate_spe(struct pt_regs *regs, unsigned int 
reg,
ret |= __put_user_inatomic(data.v[1], p++);
ret |= __put_user_inatomic(data.v[2], p++);
ret |= __put_user_inatomic(data.v[3], p++);
-   /* fall through */
+   fallthrough;
case 4:
ret |= __put_user_inatomic(data.v[4], p++);
ret |= __put_user_inatomic(data.v[5], p++);
-   /* fall through */
+   fallthrough;
case 2:
ret |= __put_user_inatomic(data.v[6], p++);
ret |= __put_user_inatomic(data.v[7], p++);
diff --git a/arch/powerpc/platforms/powermac/feature.c 
b/arch/powerpc/platforms/powermac/feature.c
index 181caa3f6717..5c77b9a24c0e 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -1465,7 +1465,7 @@ static long g5_i2s_enable(struct device_node *node, long 
param, long value)
case 2:
if (macio->type == macio_shasta)
break;
-   /* fall through */
+   fallthrough;
default:
return -ENODEV;
}
diff --git a/arch/powerpc/platforms/powernv/opal-async.c 
b/arch/powerpc/platforms/powernv/opal-async.c
index 1656e8965d6b..c094fdf5825c 100644
--- a/arch/powerpc/platforms/powernv/opal-async.c
+++ b/arch/powerpc/platforms/powernv/opal-async.c
@@ -104,7 +104,7 @@ static int __opal_async_release_token(int token)
 */
case ASYNC_TOKEN_DISPATCHED:
opal_async_tokens[token].state = ASYNC_TOKEN_ABANDONED;
-   /* Fall through */
+   fallthrough;
default:
rc = 1;
}
diff --git a/arch/powerpc/platforms/pseries/hvcserver.c 
b/arch/powerpc/platforms/pseries/hvcserver.c
index 267139b13530..96e18d3b2fcf 100644
--- a/arch/powerpc/platforms/pseries/hvcserver.c
+++ b/arch/powerpc/platforms/pseries/hvcserver.c
@@ -45,7 +45,7 @@ static int hvcs_convert(long to_convert)
case H_LONG_BUSY_ORDER_10_SEC:
case H_LONG_BUSY_ORDER_100_SEC:
return -EBUSY;
-   case H_FUNCTION: /* fall through */
+   case H_FUNCTION:
default:
return -EPERM;
}
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 8fb1f857c11c..ed1a9f43709d 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -4278,7 +4278,7 @@ static int do_spu_cmd(void)
subcmd = inchar();
if (isxdigit(subcmd) || subcmd == '\n')
termch = subcmd;
-   /* fall through */
+   fallthrough;
case 'f':
scanhex();
if (num >= XMON_NUM_SPUS || !spu_info[num].spu) {
-- 
2.27.0



[PATCH][next] dmaengine: Use fallthrough pseudo-keyword

2020-07-27 Thread Gustavo A. R. Silva
Replace the existing /* fall through */ comments and their variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings where they are no longer needed.

[1] 
https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/dma/amba-pl08x.c| 10 +-
 drivers/dma/fsldma.c|  2 +-
 drivers/dma/imx-dma.c   |  2 +-
 drivers/dma/iop-adma.h  | 12 ++--
 drivers/dma/nbpfaxi.c   |  2 +-
 drivers/dma/pl330.c | 10 +++---
 drivers/dma/sh/shdma-base.c |  2 +-
 7 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/drivers/dma/amba-pl08x.c b/drivers/dma/amba-pl08x.c
index 9adc7a2fa3d3..a24882ba3764 100644
--- a/drivers/dma/amba-pl08x.c
+++ b/drivers/dma/amba-pl08x.c
@@ -1767,7 +1767,7 @@ static u32 pl08x_memcpy_cctl(struct pl08x_driver_data 
*pl08x)
default:
dev_err(&pl08x->adev->dev,
"illegal burst size for memcpy, set to 1\n");
-   /* Fall through */
+   fallthrough;
case PL08X_BURST_SZ_1:
cctl |= PL080_BSIZE_1 << PL080_CONTROL_SB_SIZE_SHIFT |
PL080_BSIZE_1 << PL080_CONTROL_DB_SIZE_SHIFT;
@@ -1806,7 +1806,7 @@ static u32 pl08x_memcpy_cctl(struct pl08x_driver_data 
*pl08x)
default:
dev_err(&pl08x->adev->dev,
"illegal bus width for memcpy, set to 8 bits\n");
-   /* Fall through */
+   fallthrough;
case PL08X_BUS_WIDTH_8_BITS:
cctl |= PL080_WIDTH_8BIT << PL080_CONTROL_SWIDTH_SHIFT |
PL080_WIDTH_8BIT << PL080_CONTROL_DWIDTH_SHIFT;
@@ -1850,7 +1850,7 @@ static u32 pl08x_ftdmac020_memcpy_cctl(struct 
pl08x_driver_data *pl08x)
default:
dev_err(&pl08x->adev->dev,
"illegal bus width for memcpy, set to 8 bits\n");
-   /* Fall through */
+   fallthrough;
case PL08X_BUS_WIDTH_8_BITS:
cctl |= PL080_WIDTH_8BIT << FTDMAC020_LLI_SRC_WIDTH_SHIFT |
PL080_WIDTH_8BIT << FTDMAC020_LLI_DST_WIDTH_SHIFT;
@@ -2612,7 +2612,7 @@ static int pl08x_of_probe(struct amba_device *adev,
switch (val) {
default:
dev_err(>dev, "illegal burst size for memcpy, set to 
1\n");
-   /* Fall through */
+   fallthrough;
case 1:
pd->memcpy_burst_size = PL08X_BURST_SZ_1;
break;
@@ -2647,7 +2647,7 @@ static int pl08x_of_probe(struct amba_device *adev,
switch (val) {
default:
dev_err(>dev, "illegal bus width for memcpy, set to 8 
bits\n");
-   /* Fall through */
+   fallthrough;
case 8:
pd->memcpy_bus_width = PL08X_BUS_WIDTH_8_BITS;
break;
diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
index ad72b3f42ffa..e342cf52d296 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
@@ -1163,7 +1163,7 @@ static int fsl_dma_chan_probe(struct fsldma_device *fdev,
switch (chan->feature & FSL_DMA_IP_MASK) {
case FSL_DMA_IP_85XX:
chan->toggle_ext_pause = fsl_chan_toggle_ext_pause;
-   /* Fall through */
+   fallthrough;
case FSL_DMA_IP_83XX:
chan->toggle_ext_start = fsl_chan_toggle_ext_start;
chan->set_src_loop_size = fsl_chan_set_src_loop_size;
diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
index 5c0fb3134825..88717506c1f6 100644
--- a/drivers/dma/imx-dma.c
+++ b/drivers/dma/imx-dma.c
@@ -556,7 +556,7 @@ static int imxdma_xfer_desc(struct imxdma_desc *d)
 * We fall-through here intentionally, since a 2D transfer is
 * similar to MEMCPY just adding the 2D slot configuration.
 */
-   /* Fall through */
+   fallthrough;
case IMXDMA_DESC_MEMCPY:
imx_dmav1_writel(imxdma, d->src, DMA_SAR(imxdmac->channel));
imx_dmav1_writel(imxdma, d->dest, DMA_DAR(imxdmac->channel));
diff --git a/drivers/dma/iop-adma.h b/drivers/dma/iop-adma.h
index c499c9578f00..d44eabb6f5eb 100644
--- a/drivers/dma/iop-adma.h
+++ b/drivers/dma/iop-adma.h
@@ -496,7 +496,7 @@ iop3xx_desc_init_xor(struct iop3xx_desc_aau *hw_desc, int 
src_cnt,
}
hw_desc->src_edc[AAU_EDCR2_IDX].e_desc_ctrl = edcr;
src_cnt = 24;
-   /* fall through */
+   fallthrough;
case 17 ... 24:
if (!u_desc_ctrl.field.blk_ctrl) {
hw_desc->src_edc[AAU_EDCR2_IDX].e_desc_ctrl = 0;
@@ -510,7 +510,7 @@ iop3xx_desc_init_xor(struct iop3xx_desc_aau *hw_desc, int 
src_cnt,
}
hw_desc->src_edc[AAU_EDCR1_IDX].e_desc_ctrl = edcr;
src_cnt = 16;
- 

[PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-27 Thread Ram Pai
From: Laurent Dufour 

When a secure memslot is dropped, all the pages backed in the secure
device (aka really backed by secure memory by the Ultravisor)
should be paged out to normal pages. Previously, this was
achieved by triggering the page fault mechanism, which calls
kvmppc_svm_page_out() on each page.

This can't work when hot unplugging a memory slot because the memory
slot is flagged as invalid and gfn_to_pfn() then does not try to access
the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out(), it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call is made to __kvmppc_svm_page_out().  As
__kvmppc_svm_page_out() needs the vma pointer to migrate the pages,
the VMA is fetched lazily, to avoid calling find_vma() for every
page. In addition, the mmap_sem is held in read mode during
that time, not in write mode, since the virtual memory layout is not
impacted, and kvm->arch.uvmem_lock prevents concurrent operations
on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Reviewed-by: Bharata B Rao 
Signed-off-by: Ram Pai 
[modified the changelog description]
Signed-off-by: Laurent Dufour 
[modified check on the VMA in kvmppc_uvmem_drop_pages]
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 52 +-
 1 file changed, 35 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 565f24b..0d49e34 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -594,35 +594,53 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
 {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr;
+
+   mmap_read_lock(kvm->mm);
+
+   addr = slot->userspace_addr;
 
-   for (i = free->npages; i; --i, ++gfn) {
-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || addr >= vma->vm_end) {
+   vma = find_vma_intersection(kvm->mm, addr, addr+1);
+   if (!vma) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }
 
mutex_lock(&kvm->arch.uvmem_lock);
-   if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = skip_page_out;
+   pvt->remove_gfn = true;
+
+   if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+ PAGE_SHIFT, kvm, pvt->gpa))
+   pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+  pvt->gpa, addr);
+   } else {
+   /* Remove the shared flag if any */
kvmppc_gfn_remove(gfn, kvm);
-   mutex_unlock(&kvm->arch.uvmem_lock);
-   continue;
}
 
-   uvmem_page = pfn_to_page(uvmem_pfn);
-   pvt = uvmem_page->zone_device_data;
-   pvt->skip_page_out = skip_page_out;
-   pvt->remove_gfn = true;
mutex_unlock(&kvm->arch.uvmem_lock);
-
-   pfn = gfn_to_pfn(kvm, gfn);
-   if (is_error_noslot_pfn(pfn))
-   continue;
-   kvm_release_pfn_clean(pfn);
}
+
+   mmap_read_unlock(kvm->mm);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
1.8.3.1



[PATCH v2 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up

2020-07-27 Thread Ram Pai
From: Laurent Dufour 

kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
so move it earlier in this file.

Furthermore, it will be useful to call this function while already
holding kvm->arch.uvmem_lock, so prefix the original function with __,
remove the locking from it, and introduce a wrapper which calls that
function with the lock held.

There is no functional change.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Reviewed-by: Bharata B Rao 
Signed-off-by: Ram Pai 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 166 -
 1 file changed, 90 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5b917ea..565f24b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -497,6 +497,96 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 }
 
 /*
+ * Provision a new page on HV side and copy over the contents
+ * from secure memory using UV_PAGE_OUT uvcall.
+ * Caller must hold kvm->arch.uvmem_lock.
+ */
+static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long page_shift,
+   struct kvm *kvm, unsigned long gpa)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma mig;
+   struct page *dpage, *spage;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   int ret = U_SUCCESS;
+
+   memset(&mig, 0, sizeof(mig));
+   mig.vma = vma;
+   mig.start = start;
+   mig.end = end;
+   mig.src = &src_pfn;
+   mig.dst = &dst_pfn;
+   mig.src_owner = &kvmppc_uvmem_pgmap;
+
+   /* The requested page is already paged-out, nothing to do */
+   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
+   return ret;
+
+   ret = migrate_vma_setup(&mig);
+   if (ret)
+   return -1;
+
+   spage = migrate_pfn_to_page(*mig.src);
+   if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
+   goto out_finalize;
+
+   if (!is_zone_device_page(spage))
+   goto out_finalize;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
+   if (!dpage) {
+   ret = -1;
+   goto out_finalize;
+   }
+
+   lock_page(dpage);
+   pvt = spage->zone_device_data;
+   pfn = page_to_pfn(dpage);
+
+   /*
+* This function is used in two cases:
+* - When HV touches a secure page, for which we do UV_PAGE_OUT
+* - When a secure page is converted to shared page, we *get*
+*   the page to essentially unmap the device page. In this
+*   case we skip page-out.
+*/
+   if (!pvt->skip_page_out)
+   ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
+ gpa, 0, page_shift);
+
+   if (ret == U_SUCCESS)
+   *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+   else {
+   unlock_page(dpage);
+   __free_page(dpage);
+   goto out_finalize;
+   }
+
+   migrate_vma_pages(&mig);
+
+out_finalize:
+   migrate_vma_finalize(&mig);
+   return ret;
+}
+
+static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long page_shift,
+ struct kvm *kvm, unsigned long gpa)
+{
+   int ret;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+
+   return ret;
+}
+
+/*
  * Drop device pages that we maintain for the secure guest
  *
  * We first mark the pages to be skipped from UV_PAGE_OUT when there
@@ -866,82 +956,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, 
unsigned long gpa,
return ret;
 }
 
-/*
- * Provision a new page on HV side and copy over the contents
- * from secure memory using UV_PAGE_OUT uvcall.
- */
-static int kvmppc_svm_page_out(struct vm_area_struct *vma,
-   unsigned long start,
-   unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
-{
-   unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
-   struct page *dpage, *spage;
-   struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn;
-   int ret = U_SUCCESS;
-
-   memset(&mig, 0, sizeof(mig));
-   mig.vma = vma;
-   mig.start = start;
-   mig.end = end;
-   mig.src = &src_pfn;
-   mig.dst = &dst_pfn;
-   mig.src_owner = &kvmppc_uvmem_pgmap;
-
-   mutex_lock(&kvm->arch.uvmem_lock);
-   /* The requested page is already paged-out, nothing to do */
-   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
-   goto out;
-
-   ret = migrate_vma_setup(&mig);
- 

[PATCH v2 0/2] Rework secure memslot dropping

2020-07-27 Thread Ram Pai
From: Laurent Dufour 

When doing memory hotplug on a secure VM, the secure pages are not well
cleaned from the secure device when dropping the memslot.  This silent
error, is then preventing the SVM to reboot properly after the following
sequence of commands are run in the Qemu monitor:

device_add pc-dimm,id=dimm1,memdev=mem1
device_del dimm1
device_add pc-dimm,id=dimm1,memdev=mem1

At reboot time, when the kernel is booting again and switching to the
secure mode, the page_in is failing for the pages in the memslot because
the cleanup was not done properly, because the memslot is flagged as
invalid during the hot unplug and thus the page fault mechanism is not
triggered.

To prevent that during the memslot dropping, instead of belonging on the
page fault mechanism to trigger the page out of the secured pages, it seems
simpler to directly call the function doing the page out. This way the
state of the memslot is not interfering on the page out process.

This series applies on top of the Ram's one titled:
"[v6 0/5] Migrate non-migrated pages of a SVM."


Changes since V2:
 - fix to vma boundary check in kvmppc_uvmem_drop_pages().

Changes since V1:
 - Rebase on top of Ram's V4 series
 - Address Bharata's comment to use mmap_read_*lock().

Laurent Dufour (2):
  KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
  KVM: PPC: Book3S HV: rework secure mem slot dropping

 arch/powerpc/kvm/book3s_hv_uvmem.c | 218 +
 1 file changed, 125 insertions(+), 93 deletions(-)

-- 
1.8.3.1


Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-27 Thread Gautham R Shenoy
Hello Rafael,

On Mon, Jul 27, 2020 at 04:14:12PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 7, 2020 at 1:32 PM Gautham R Shenoy  
> wrote:
> >
> > Hi,
> >
> > On Tue, Jul 07, 2020 at 04:41:34PM +0530, Gautham R. Shenoy wrote:
> > > From: "Gautham R. Shenoy" 
> > >
> > > Hi,
> > >
> > >
> > >
> > >
> > > Gautham R. Shenoy (5):
> > >   cpuidle-pseries: Set the latency-hint before entering CEDE
> > >   cpuidle-pseries: Add function to parse extended CEDE records
> > >   cpuidle-pseries : Fixup exit latency for CEDE(0)
> > >   cpuidle-pseries : Include extended CEDE states in cpuidle framework
> > >   cpuidle-pseries: Block Extended CEDE(1) which adds no additional
> > > value.
> >
> > Forgot to mention that these patches are on top of Nathan's series to
> > remove extended CEDE offline and bogus topology update code :
> > https://lore.kernel.org/linuxppc-dev/20200612051238.1007764-1-nath...@linux.ibm.com/
> 
> OK, so this is targeted at the powerpc maintainers, isn't it?

Yes, the code is powerpc specific.

Also, I noticed that Nathan's patches have been merged by Michael
Ellerman in the powerpc/merge tree. I will rebase and post a v2 of
this patch series.

--
Thanks and Regards
gautham.


Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-27 Thread Gautham R Shenoy
Hi Srikar,

On Mon, Jul 27, 2020 at 11:02:29AM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

This version looks good to me.

Reviewed-by: Gautham R. Shenoy 


> ---
> Changelog v3 ->v4:
>   if coregroup_support doesn't exist, update MC mask to the next
>   smaller domain mask.
> 
> Changelog v2 -> v3:
>   Add optimization for mask updation under coregroup_support
> 
> Changelog v1 -> v2:
>   Moved coregroup topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/include/asm/topology.h | 10 +++
>  arch/powerpc/kernel/smp.c   | 44 +
>  arch/powerpc/mm/numa.c  |  5 
>  3 files changed, 59 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 
> *cpu2_assoc)
> 
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
>  extern int find_and_online_cpu_nid(int cpu);
> +extern int cpu_to_coregroup_id(int cpu);
>  #else
>  static inline int find_and_online_cpu_nid(int cpu)
>  {
>   return 0;
>  }
> 
> +static inline int cpu_to_coregroup_id(int cpu)
> +{
> +#ifdef CONFIG_SMP
> + return cpu_to_core_id(cpu);
> +#else
> + return 0;
> +#endif
> +}
> +
>  #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
> 
>  #include 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index dab96a1203ec..95f0bf72e283 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
> +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
> 
>  EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> @@ -91,6 +92,7 @@ enum {
>   smt_idx,
>  #endif
>   bigcore_idx,
> + mc_idx,
>   die_idx,
>  };
> 
> @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> + return per_cpu(cpu_coregroup_map, cpu);
> +}
> +
> +static bool has_coregroup_support(void)
> +{
> + return coregroup_enabled;
> +}
> +
> +static const struct cpumask *cpu_mc_mask(int cpu)
> +{
> + return cpu_coregroup_mask(cpu);
> +}
> +
>  static const struct cpumask *cpu_bigcore_mask(int cpu)
>  {
>   return per_cpu(cpu_sibling_map, cpu);
> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level 
> powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>   { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> + { cpu_mc_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };
> @@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   GFP_KERNEL, cpu_to_node(cpu));
>   zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
>   GFP_KERNEL, cpu_to_node(cpu));
> + if (has_coregroup_support())
> + zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, 
> cpu),
> + GFP_KERNEL, cpu_to_node(cpu));
> +
>  #ifdef CONFIG_NEED_MULTIPLE_NODES
>   /*
>* numa_node_id() works after this.
> @@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
>   cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
> 
> + if (has_coregroup_support())
> + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> +
>   init_big_cores();
>   if (has_big_cores) {
>   cpumask_set_cpu(boot_cpuid,
> @@ -1233,6 +1258,8 @@ static void remove_cpu_from_masks(int cpu)
>   set_cpus_unrelated(cpu, i, cpu_sibling_mask);
>   if (has_big_cores)
>   set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
> + if (has_coregroup_support())
> + set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
>   }
>  }
>  #endif
> @@ -1293,6 +1320,20 @@ static void add_cpu_to_masks(int cpu)
>   add_cpu_to_smallcore_masks(cpu);
>   

[PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score

2020-07-27 Thread Scott Cheloha
The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall,
Affinity_Domain_Info_By_Partition, which returns, among other things,
a "partition affinity score" for a given LPAR.  This score, a value on
[0-100], represents the processor-memory affinity for the LPAR in
question.  A score of 0 indicates the worst possible affinity while a
score of 100 indicates perfect affinity.  The score can be used to
reason about performance.

This patch adds the score for the local LPAR to the lparcfg procfile
under a new 'partition_affinity_score' key.
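
A quick way to read the new value once this patch is applied is to scan
/proc/powerpc/lparcfg from user space. The short program below is only an
illustration written for this posting (it is not part of the patch); the
key name comes from the patch above, everything else is assumed:

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/powerpc/lparcfg", "r");
            char line[128];
            unsigned int score;

            if (!f)
                    return 1;
            /* lparcfg is a list of key=value lines; pick out the new key */
            while (fgets(line, sizeof(line), f)) {
                    if (sscanf(line, "partition_affinity_score=%u", &score) == 1)
                            printf("partition affinity score: %u\n", score);
            }
            fclose(f);
            return 0;
    }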

Signed-off-by: Scott Cheloha 
---
 arch/powerpc/platforms/pseries/lparcfg.c | 35 
 1 file changed, 35 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lparcfg.c 
b/arch/powerpc/platforms/pseries/lparcfg.c
index b8d28ab88178..e278390ab28d 100644
--- a/arch/powerpc/platforms/pseries/lparcfg.c
+++ b/arch/powerpc/platforms/pseries/lparcfg.c
@@ -136,6 +136,39 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data 
*ppp_data)
return rc;
 }
 
+static void show_gpci_data(struct seq_file *m)
+{
+   struct hv_gpci_request_buffer *buf;
+   unsigned int affinity_score;
+   long ret;
+
+   buf = kmalloc(sizeof(*buf), GFP_KERNEL);
+   if (buf == NULL)
+   return;
+
+   /*
+* Show the local LPAR's affinity score.
+*
+* 0xB1 selects the Affinity_Domain_Info_By_Partition subcall.
+* The score is at byte 0xB in the output buffer.
+*/
+   memset(&buf->params, 0, sizeof(buf->params));
+   buf->params.counter_request = cpu_to_be32(0xB1);
+   buf->params.starting_index = cpu_to_be32(-1);   /* local LPAR */
+   buf->params.counter_info_version_in = 0x5;  /* v5+ for score */
+   ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf),
+sizeof(*buf));
+   if (ret != H_SUCCESS) {
+   pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n",
+ret, be32_to_cpu(buf->params.detail_rc));
+   goto out;
+   }
+   affinity_score = buf->bytes[0xB];
+   seq_printf(m, "partition_affinity_score=%u\n", affinity_score);
+out:
+   kfree(buf);
+}
+
 static unsigned h_pic(unsigned long *pool_idle_time,
  unsigned long *num_procs)
 {
@@ -487,6 +520,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
   partition_active_processors * 100);
}
 
+   show_gpci_data(m);
+
seq_printf(m, "partition_active_processors=%d\n",
   partition_active_processors);
 
-- 
2.24.1



[PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h

2020-07-27 Thread Scott Cheloha
The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are
useful to modules outside of perf/, so move them into asm/hvcall.h to live
alongside the other powerpc hypercall structs.

Leave the perf-specific GPCI stuff in perf/hv-gpci.h.

Signed-off-by: Scott Cheloha 
---
 arch/powerpc/include/asm/hvcall.h | 36 +++
 arch/powerpc/perf/hv-gpci.c   |  9 
 arch/powerpc/perf/hv-gpci.h   | 27 ---
 3 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index e90c073e437e..c338480b4551 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -527,6 +527,42 @@ struct hv_guest_state {
 /* Latest version of hv_guest_state structure */
 #define HV_GUEST_STATE_VERSION 1
 
+/*
+ * From the document "H_GetPerformanceCounterInfo Interface" v1.07
+ *
+ * H_GET_PERF_COUNTER_INFO argument
+ */
+struct hv_get_perf_counter_info_params {
+   __be32 counter_request; /* I */
+   __be32 starting_index;  /* IO */
+   __be16 secondary_index; /* IO */
+   __be16 returned_values; /* O */
+   __be32 detail_rc; /* O, only needed when called via *_norets() */
+
+   /*
+* O, size each of counter_value element in bytes, only set for version
+* >= 0x3
+*/
+   __be16 cv_element_size;
+
+   /* I, 0 (zero) for versions < 0x3 */
+   __u8 counter_info_version_in;
+
+   /* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */
+   __u8 counter_info_version_out;
+   __u8 reserved[0xC];
+   __u8 counter_value[];
+} __packed;
+
+#define HGPCI_REQ_BUFFER_SIZE  4096
+#define HGPCI_MAX_DATA_BYTES \
+   (HGPCI_REQ_BUFFER_SIZE - sizeof(struct hv_get_perf_counter_info_params))
+
+struct hv_gpci_request_buffer {
+   struct hv_get_perf_counter_info_params params;
+   uint8_t bytes[HGPCI_MAX_DATA_BYTES];
+} __packed;
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_HVCALL_H */
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 6884d16ec19b..1667315b82e9 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -123,17 +123,8 @@ static const struct attribute_group *attr_groups[] = {
NULL,
 };
 
-#define HGPCI_REQ_BUFFER_SIZE  4096
-#define HGPCI_MAX_DATA_BYTES \
-   (HGPCI_REQ_BUFFER_SIZE - sizeof(struct hv_get_perf_counter_info_params))
-
 static DEFINE_PER_CPU(char, hv_gpci_reqb[HGPCI_REQ_BUFFER_SIZE]) 
__aligned(sizeof(uint64_t));
 
-struct hv_gpci_request_buffer {
-   struct hv_get_perf_counter_info_params params;
-   uint8_t bytes[HGPCI_MAX_DATA_BYTES];
-} __packed;
-
 static unsigned long single_gpci_request(u32 req, u32 starting_index,
u16 secondary_index, u8 version_in, u32 offset, u8 length,
u64 *value)
diff --git a/arch/powerpc/perf/hv-gpci.h b/arch/powerpc/perf/hv-gpci.h
index a3053eda5dcc..4d108262bed7 100644
--- a/arch/powerpc/perf/hv-gpci.h
+++ b/arch/powerpc/perf/hv-gpci.h
@@ -2,33 +2,6 @@
 #ifndef LINUX_POWERPC_PERF_HV_GPCI_H_
 #define LINUX_POWERPC_PERF_HV_GPCI_H_
 
-#include 
-
-/* From the document "H_GetPerformanceCounterInfo Interface" v1.07 */
-
-/* H_GET_PERF_COUNTER_INFO argument */
-struct hv_get_perf_counter_info_params {
-   __be32 counter_request; /* I */
-   __be32 starting_index;  /* IO */
-   __be16 secondary_index; /* IO */
-   __be16 returned_values; /* O */
-   __be32 detail_rc; /* O, only needed when called via *_norets() */
-
-   /*
-* O, size each of counter_value element in bytes, only set for version
-* >= 0x3
-*/
-   __be16 cv_element_size;
-
-   /* I, 0 (zero) for versions < 0x3 */
-   __u8 counter_info_version_in;
-
-   /* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */
-   __u8 counter_info_version_out;
-   __u8 reserved[0xC];
-   __u8 counter_value[];
-} __packed;
-
 /*
  * counter info version => fw version/reference (spec version)
  *
-- 
2.24.1



[PATCH v6 5/5] KVM: PPC: Book3S HV: migrate hot plugged memory

2020-07-27 Thread Ram Pai
From: Laurent Dufour 

When a memory slot is hot plugged into a SVM, the PFNs associated with the
GFNs in that slot must be migrated to secure-PFNs, aka device-PFNs.

Call kvmppc_uv_migrate_mem_slot() to accomplish this.
Disable page-merge for all pages in the memory slot.

Reviewed-by: Bharata B Rao 
Signed-off-by: Ram Pai 
[rearranged the code, and modified the commit log]
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 14 ++
 arch/powerpc/kvm/book3s_hv.c| 14 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 23 +++
 3 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9cb7d8b..0a63194 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -23,6 +23,10 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
 void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out);
+int kvmppc_uvmem_memslot_create(struct kvm *kvm,
+   const struct kvm_memory_slot *new);
+void kvmppc_uvmem_memslot_delete(struct kvm *kvm,
+   const struct kvm_memory_slot *old);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -82,5 +86,15 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
 static inline void
 kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
struct kvm *kvm, bool skip_page_out) { }
+
+static inline int  kvmppc_uvmem_memslot_create(struct kvm *kvm,
+   const struct kvm_memory_slot *new)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline void  kvmppc_uvmem_memslot_delete(struct kvm *kvm,
+   const struct kvm_memory_slot *old) { }
+
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d331b46..a93bc65 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4515,16 +4515,14 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
 
switch (change) {
case KVM_MR_CREATE:
-   if (kvmppc_uvmem_slot_init(kvm, new))
-   return;
-   uv_register_mem_slot(kvm->arch.lpid,
-new->base_gfn << PAGE_SHIFT,
-new->npages * PAGE_SIZE,
-0, new->id);
+   /*
+* @TODO kvmppc_uvmem_memslot_create() can fail and
+* return error. Fix this.
+*/
+   kvmppc_uvmem_memslot_create(kvm, new);
break;
case KVM_MR_DELETE:
-   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
-   kvmppc_uvmem_slot_free(kvm, old);
+   kvmppc_uvmem_memslot_delete(kvm, old);
break;
default:
/* TODO: Handle KVM_MR_MOVE */
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a1664ae..5b917ea 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -418,7 +418,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
return ret;
 }
 
-static void kvmppc_uvmem_memslot_delete(struct kvm *kvm,
+static void __kvmppc_uvmem_memslot_delete(struct kvm *kvm,
const struct kvm_memory_slot *memslot)
 {
uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
@@ -426,7 +426,7 @@ static void kvmppc_uvmem_memslot_delete(struct kvm *kvm,
kvmppc_memslot_page_merge(kvm, memslot, true);
 }
 
-static int kvmppc_uvmem_memslot_create(struct kvm *kvm,
+static int __kvmppc_uvmem_memslot_create(struct kvm *kvm,
const struct kvm_memory_slot *memslot)
 {
int ret = H_PARAMETER;
@@ -478,7 +478,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
/* register the memslot */
slots = kvm_memslots(kvm);
kvm_for_each_memslot(memslot, slots) {
-   ret = kvmppc_uvmem_memslot_create(kvm, memslot);
+   ret = __kvmppc_uvmem_memslot_create(kvm, memslot);
if (ret)
break;
}
@@ -488,7 +488,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
kvm_for_each_memslot(m, slots) {
if (m == memslot)
break;
-   kvmppc_uvmem_memslot_delete(kvm, memslot);
+   __kvmppc_uvmem_memslot_delete(kvm, memslot);
}
}
 
@@ -1057,6 +1057,21 @@ int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned 
long gfn)
return (ret == U_SUCCESS) ? RESUME_GUEST : -EFAULT;
 }
 
+int kvmppc_uvmem_memslot_create(struct kvm 

[PATCH v6 4/5] KVM: PPC: Book3S HV: in H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs.

2020-07-27 Thread Ram Pai
The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the
pages of the SVM before calling H_SVM_INIT_DONE. This causes a huge
delay in transitioning the VM to SVM. The Ultravisor is only interested
in the pages that contain the kernel, initrd and other important data
structures. The rest contain throw-away content.

However if not all pages are requested by the Ultravisor, the Hypervisor
continues to consider the GFNs corresponding to the non-requested pages
as normal GFNs. This can lead to data-corruption and undefined behavior.

In the H_SVM_INIT_DONE handler, move all the PFNs associated with the SVM's
GFNs to secure-PFNs. Skip the GFNs that are already paged-in, shared, or
paged-in and subsequently paged-out.
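
For readers following the flow, here is a minimal, stand-alone sketch of the
scan performed at H_SVM_INIT_DONE time as described above. It is plain C
written for this posting, not kernel code; the flag names and the migrate()
helper are stand-ins for the internals added by this patch:

    #include <stdio.h>

    #define GFN_SECURE  0x1u   /* already paged in (or paged in, then paged out) */
    #define GFN_SHARED  0x2u   /* explicitly shared with the Hypervisor */

    /* stand-in for the per-GFN migration done by the real handler */
    static int migrate(unsigned long gfn)
    {
            printf("migrating gfn %lu\n", gfn);
            return 0;
    }

    int main(void)
    {
            /* per-GFN state for a toy 8-page memslot */
            unsigned int state[8] = { GFN_SECURE, 0, GFN_SHARED, 0, 0, GFN_SECURE, 0, 0 };
            unsigned long gfn;

            for (gfn = 0; gfn < 8; gfn++) {
                    if (state[gfn] & (GFN_SECURE | GFN_SHARED))
                            continue;       /* skip GFNs the Ultravisor already handled */
                    if (migrate(gfn))
                            return 1;       /* the real handler fails H_SVM_INIT_DONE instead */
            }
            return 0;
    }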

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Bharata B Rao 
Signed-off-by: Ram Pai 
---
 Documentation/powerpc/ultravisor.rst |   2 +
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 154 ++-
 2 files changed, 134 insertions(+), 22 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index a1c8c37..ba6b1bf 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -934,6 +934,8 @@ Return values
* H_UNSUPPORTED if called from the wrong context (e.g.
from an SVM or before an H_SVM_INIT_START
hypercall).
+   * H_STATE   if the hypervisor could not successfully
+transition the VM to Secure VM.
 
 Description
 ~~~
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 1b2b029..a1664ae 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -93,6 +93,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct dev_pagemap kvmppc_uvmem_pgmap;
 static unsigned long *kvmppc_uvmem_bitmap;
@@ -348,6 +349,41 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
struct kvm *kvm,
return false;
 }
 
+/*
+ * starting from *gfn search for the next available GFN that is not yet
+ * transitioned to a secure GFN.  return the value of that GFN in *gfn.  If a
+ * GFN is found, return true, else return false
+ *
+ * Must be called with kvm->arch.uvmem_lock  held.
+ */
+static bool kvmppc_next_nontransitioned_gfn(const struct kvm_memory_slot 
*memslot,
+   struct kvm *kvm, unsigned long *gfn)
+{
+   struct kvmppc_uvmem_slot *p;
+   bool ret = false;
+   unsigned long i;
+
+   list_for_each_entry(p, &kvm->arch.uvmem_pfns, list)
+   if (*gfn >= p->base_pfn && *gfn < p->base_pfn + p->nr_pfns)
+   break;
+   if (!p)
+   return ret;
+   /*
+* The code below assumes, one to one correspondence between
+* kvmppc_uvmem_slot and memslot.
+*/
+   for (i = *gfn; i < p->base_pfn + p->nr_pfns; i++) {
+   unsigned long index = i - p->base_pfn;
+
+   if (!(p->pfns[index] & KVMPPC_GFN_FLAG_MASK)) {
+   *gfn = i;
+   ret = true;
+   break;
+   }
+   }
+   return ret;
+}
+
 static int kvmppc_memslot_page_merge(struct kvm *kvm,
const struct kvm_memory_slot *memslot, bool merge)
 {
@@ -460,16 +496,6 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
return ret;
 }
 
-unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
-{
-   if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
-   return H_UNSUPPORTED;
-
-   kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_DONE;
-   pr_info("LPID %d went secure\n", kvm->arch.lpid);
-   return H_SUCCESS;
-}
-
 /*
  * Drop device pages that we maintain for the secure guest
  *
@@ -588,12 +614,14 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 }
 
 /*
- * Alloc a PFN from private device memory pool and copy page from normal
- * memory to secure memory using UV_PAGE_IN uvcall.
+ * Alloc a PFN from private device memory pool. If @pagein is true,
+ * copy page from normal memory to secure memory using UV_PAGE_IN uvcall.
  */
-static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
-  unsigned long end, unsigned long gpa, struct kvm *kvm,
-  unsigned long page_shift)
+static int kvmppc_svm_page_in(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long gpa, struct kvm *kvm,
+   unsigned long page_shift,
+   bool pagein)
 {
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma 

[PATCH v6 2/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START

2020-07-27 Thread Ram Pai
Page-merging of pages in memory-slots associated with a Secure VM
is disabled in the H_SVM_PAGE_IN handler.

This operation should have been done much earlier, at the moment the VM
is initiated for secure-transition. Delaying this operation increases
the probability of those pages acquiring new references, making it
impossible to migrate those pages in the H_SVM_PAGE_IN handler.

Disable page-merging in H_SVM_INIT_START handling.
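
For context, the effect of kvmppc_memslot_page_merge() on each VMA is the
kernel-side equivalent of the user-space madvise() calls below. This is an
illustration written for this posting, not part of the patch:

    #include <stddef.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 16 * 4096;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* opt the range out of KSM page-merging, as done at H_SVM_INIT_START */
            if (madvise(p, len, MADV_UNMERGEABLE))
                    return 1;
            /* ... and allow merging again, as done when the memslot is dropped */
            if (madvise(p, len, MADV_MERGEABLE))
                    return 1;
            munmap(p, len);
            return 0;
    }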

Reviewed-by: Bharata B Rao 
Signed-off-by: Ram Pai 
---
 Documentation/powerpc/ultravisor.rst |   1 +
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 123 +--
 2 files changed, 89 insertions(+), 35 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index df136c8..a1c8c37 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -895,6 +895,7 @@ Return values
 One of the following values:
 
* H_SUCCESS  on success.
+* H_STATEif the VM is not in a position to switch to secure.
 
 Description
 ~~~
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e6f76bc..533b608 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -211,10 +211,79 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
struct kvm *kvm,
return false;
 }
 
+static int kvmppc_memslot_page_merge(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot, bool merge)
+{
+   unsigned long gfn = memslot->base_gfn;
+   unsigned long end, start = gfn_to_hva(kvm, gfn);
+   int ret = 0;
+   struct vm_area_struct *vma;
+   int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
+
+   if (kvm_is_error_hva(start))
+   return H_STATE;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+
+   mmap_write_lock(kvm->mm);
+   do {
+   vma = find_vma_intersection(kvm->mm, start, end);
+   if (!vma) {
+   ret = H_STATE;
+   break;
+   }
+   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ merge_flag, &vma->vm_flags);
+   if (ret) {
+   ret = H_STATE;
+   break;
+   }
+   start = vma->vm_end;
+   } while (end > vma->vm_end);
+
+   mmap_write_unlock(kvm->mm);
+   return ret;
+}
+
+static void kvmppc_uvmem_memslot_delete(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot)
+{
+   uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
+   kvmppc_uvmem_slot_free(kvm, memslot);
+   kvmppc_memslot_page_merge(kvm, memslot, true);
+}
+
+static int kvmppc_uvmem_memslot_create(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot)
+{
+   int ret = H_PARAMETER;
+
+   if (kvmppc_memslot_page_merge(kvm, memslot, false))
+   return ret;
+
+   if (kvmppc_uvmem_slot_init(kvm, memslot))
+   goto out1;
+
+   ret = uv_register_mem_slot(kvm->arch.lpid,
+  memslot->base_gfn << PAGE_SHIFT,
+  memslot->npages * PAGE_SIZE,
+  0, memslot->id);
+   if (ret < 0) {
+   ret = H_PARAMETER;
+   goto out;
+   }
+   return 0;
+out:
+   kvmppc_uvmem_slot_free(kvm, memslot);
+out1:
+   kvmppc_memslot_page_merge(kvm, memslot, true);
+   return ret;
+}
+
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 {
struct kvm_memslots *slots;
-   struct kvm_memory_slot *memslot;
+   struct kvm_memory_slot *memslot, *m;
int ret = H_SUCCESS;
int srcu_idx;
 
@@ -232,23 +301,24 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
return H_AUTHORITY;
 
srcu_idx = srcu_read_lock(&kvm->srcu);
+
+   /* register the memslot */
slots = kvm_memslots(kvm);
kvm_for_each_memslot(memslot, slots) {
-   if (kvmppc_uvmem_slot_init(kvm, memslot)) {
-   ret = H_PARAMETER;
-   goto out;
-   }
-   ret = uv_register_mem_slot(kvm->arch.lpid,
-  memslot->base_gfn << PAGE_SHIFT,
-  memslot->npages * PAGE_SIZE,
-  0, memslot->id);
-   if (ret < 0) {
-   kvmppc_uvmem_slot_free(kvm, memslot);
-   ret = H_PARAMETER;
-   goto out;
+   ret = kvmppc_uvmem_memslot_create(kvm, memslot);
+   if (ret)
+   break;
+   }
+
+   if (ret) {
+   slots = kvm_memslots(kvm);
+   kvm_for_each_memslot(m, slots) {
+   if (m == memslot)
+   

[PATCH v6 3/5] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-07-27 Thread Ram Pai
During the life of an SVM, its GFNs transition through normal, secure and
shared states. Since the kernel does not track GFNs that are shared, it
is not possible to disambiguate a shared GFN from a GFN whose PFN has
not yet been migrated to a secure-PFN. It is also not possible to
disambiguate a secure-GFN from a GFN whose PFN has been paged out from
the ultravisor.

The ability to identify the state of a GFN is needed to skip migration
of its PFN to secure-PFN during ESM transition.

The code is re-organized to track the states of a GFN as explained
below.


 1. States of a GFN
---
 The GFN can be in one of the following states.

 (a) Secure - The GFN is secure. The GFN is associated with
a Secure VM; the contents of the GFN are not accessible
to the Hypervisor.  This GFN can be backed by a secure-PFN,
or can be backed by a normal-PFN with contents encrypted.
The former is true when the GFN is paged into the
ultravisor. The latter is true when the GFN is paged out
of the ultravisor.

 (b) Shared - The GFN is shared. The GFN is associated with
a secure VM. The contents of the GFN are accessible to
the Hypervisor. This GFN is backed by a normal-PFN and its
content is un-encrypted.

 (c) Normal - The GFN is normal. The GFN is associated with
a normal VM. The contents of the GFN are accessible to
the Hypervisor. Its content is never encrypted.

 2. States of a VM.
---

 (a) Normal VM:  A VM whose contents are always accessible to
the hypervisor.  All its GFNs are normal-GFNs.

 (b) Secure VM: A VM whose contents are not accessible to the
hypervisor without the VM's consent.  Its GFNs are
either Shared-GFN or Secure-GFNs.

 (c) Transient VM: A Normal VM that is transitioning to secure VM.
The transition starts on successful return of
H_SVM_INIT_START, and ends on successful return
of H_SVM_INIT_DONE. This transient VM, can have GFNs
in any of the three states; i.e Secure-GFN, Shared-GFN,
and Normal-GFN. The VM never executes in this state
in supervisor-mode.

 3. Memory slot State.
--
The state of a memory slot mirrors the state of the
VM the memory slot is associated with.

 4. VM State transition.


  A VM always starts in Normal Mode.

  H_SVM_INIT_START moves the VM into transient state. During this
  time the Ultravisor may request some of its GFNs to be shared or
  secured. So its GFNs can be in one of the three GFN states.

  H_SVM_INIT_DONE moves the VM entirely from transient state to
  secure-state. At this point any left-over normal-GFNs are
  transitioned to Secure-GFN.

  H_SVM_INIT_ABORT moves the transient VM back to normal VM.
  All its GFNs are moved to Normal-GFNs.

  UV_TERMINATE transitions the secure-VM back to normal-VM. All
  the secure-GFNs and shared-GFNs are transitioned to normal-GFNs.
  Note: The contents of the normal-GFNs are undefined at this point.

 5. GFN state implementation:
-

 Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
 when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
 set, and contains the value of the secure-PFN.
 It is associated with a normal-PFN; also called mem_pfn, when
 the GFN is paged out. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
 The value of the normal-PFN is not tracked.

 Shared GFN is associated with a normal-PFN. Its pfn[] has
 KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
 is not tracked.

 Normal GFN is associated with normal-PFN. Its pfn[] has
 no flag set. The value of the normal-PFN is not tracked.
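
To make the encoding above concrete, here is a tiny stand-alone model of how
such flags can live in a pfns[] word. It is written for this posting only;
the actual flag names, bit positions and masks in the patch may differ:

    #include <stdio.h>

    #define GFN_UVMEM_PFN  (1UL << 63)  /* secure GFN backed by a secure (device) PFN */
    #define GFN_MEM_PFN    (1UL << 62)  /* secure GFN currently paged out to a normal PFN */
    #define GFN_SHARED     (1UL << 61)  /* shared GFN */
    #define GFN_FLAG_MASK  (GFN_UVMEM_PFN | GFN_MEM_PFN | GFN_SHARED)
    #define GFN_PFN_MASK   (~GFN_FLAG_MASK)

    int main(void)
    {
            /* a secure GFN whose backing device PFN is 0x12345 */
            unsigned long entry = GFN_UVMEM_PFN | 0x12345UL;

            if (entry & GFN_UVMEM_PFN)
                    printf("secure GFN, device pfn 0x%lx\n", entry & GFN_PFN_MASK);
            else if (entry & (GFN_MEM_PFN | GFN_SHARED))
                    printf("paged-out or shared GFN (PFN value not tracked)\n");
            else
                    printf("normal GFN\n");
            return 0;
    }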

 6. Life cycle of a GFN

 --
 || Share  |  Unshare | SVM   |H_SVM_INIT_DONE|
 ||operation   |operation | abort/|   |
 |||  | terminate |   |
 -
 |||  |   |   |
 | Secure | Shared | Secure   |Normal |Secure |
 |||  |   |   |
 | Shared | Shared | Secure   |Normal |Shared |
 |||  |   |   |
 | Normal | Shared | Secure   |Normal |Secure |
 --

 7. Life cycle of a VM

 
 | |  start|  H_SVM_  |H_SVM_   |H_SVM_ |UV_SVM_|
 | |  VM   |INIT_START|INIT_DONE|INIT_ABORT |TERMINATE  |
 | |   |  | |   |   |
 - 

[PATCH v6 0/5] Migrate non-migrated pages of a SVM.

2020-07-27 Thread Ram Pai
The time to switch a VM to Secure-VM, increases by the size of the VM.
A 100GB VM takes about 7minutes. This is unacceptable.  This linear
increase is caused by a suboptimal behavior by the Ultravisor and the
Hypervisor.  The Ultravisor unnecessarily migrates all the GFN of the
VM from normal-memory to secure-memory. It has to just migrate the
necessary and sufficient GFNs.

However when the optimization is incorporated in the Ultravisor, the
Hypervisor starts misbehaving. The Hypervisor has a inbuilt assumption
that the Ultravisor will explicitly request to migrate, each and every
GFN of the VM. If only necessary and sufficient GFNs are requested for
migration, the Hypervisor continues to manage the remaining GFNs as
normal GFNs. This leads to memory corruption; manifested
consistently when the SVM reboots.

The same is true, when a memory slot is hotplugged into a SVM. The
Hypervisor expects the ultravisor to request migration of all GFNs to
secure-GFN.  But the hypervisor cannot handle any H_SVM_PAGE_IN
requests from the Ultravisor, done in the context of
UV_REGISTER_MEM_SLOT ucall.  This problem manifests as random errors
in the SVM, when a memory-slot is hotplugged.

This patch series automatically migrates the non-migrated pages of a
SVM, and thus solves the problem.

Testing: Passed rigorous testing using various sized SVMs.

Changelog:

v6: . rearrangement of functions in book3s_hv_uvmem.c. No functional
change.
. decoupling this patch series from Laurent's memory-hotplug/unplug,
since the memhotplug/unplug/hotplug/reboot test is failing.

v5:  .  This patch series includes Laurent's fix for memory hotplug/unplug
  . drop pages first and then delete the memslot. Otherwise
the memslot does not get cleanly deleted, causing
problems during reboot.
  . recreatable through the following set of commands
 . device_add pc-dimm,id=dimm1,memdev=mem1
 . device_del dimm1
 . device_add pc-dimm,id=dimm1,memdev=mem1
Further incorporates comments from Bharata:
. fix for off-by-one while disabling migration.
. code-reorganized to maximize sharing in init_start path
and in memory-hotplug path
. locking adjustments in mass-page migration during H_SVM_INIT_DONE.
. improved recovery on error paths.
. additional comments in the code for better understanding.
. removed the retry-on-migration-failure code.
. re-added the initial patch that adjusts some prototypes to overcome
   a git problem, where it messes up the code context. Had
accidentally dropped the patch in the last version.

v4:  .  Incorported Bharata's comments:
- Optimization -- replace write mmap semaphore with read mmap semaphore.
- disable page-merge during memory hotplug.
- rearranged the patches. consolidated the page-migration-retry logic
in a single patch.

v3: . Optimized the page-migration retry-logic. 
. Relax and relinquish the cpu regularly while bulk migrating
the non-migrated pages. This issue was causing soft-lockups.
Fixed it.
. Added a new patch, to retry page-migration a couple of times
before returning H_BUSY in H_SVM_PAGE_IN. This issue was
seen a few times in a 24hour continuous reboot test of the SVMs.

v2: . fixed a bug observed by Laurent. The state of the GFN's associated
with Secure-VMs were not reset during memslot flush.
. Re-organized the code, for easier review.
. Better description of the patch series.

v1: . fixed a bug observed by Bharata. Pages that were paged-in and later
paged-out must also be skipped from migration during H_SVM_INIT_DONE.


Laurent Dufour (1):
  KVM: PPC: Book3S HV: migrate hot plugged memory

Ram Pai (4):
  KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c
  KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
  KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs
  KVM: PPC: Book3S HV: in H_SVM_INIT_DONE, migrate remaining normal-GFNs
to secure-GFNs.

 Documentation/powerpc/ultravisor.rst|   3 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  14 +
 arch/powerpc/kvm/book3s_hv.c|  14 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 498 +++-
 4 files changed, 437 insertions(+), 92 deletions(-)

-- 
1.8.3.1



[PATCH v6 1/5] KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c

2020-07-27 Thread Ram Pai
Without this fix, git gets confused: it generates the wrong
function context for code changes in subsequent patches (as can be seen
in the hunk headers of this very patch, which all show
kvmppc_uvmem_get_page() instead of the functions actually being changed).
Weird, but true.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 09d8119..e6f76bc 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -382,8 +382,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * Alloc a PFN from private device memory pool and copy page from normal
  * memory to secure memory using UV_PAGE_IN uvcall.
  */
-static int
-kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
+static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
   unsigned long end, unsigned long gpa, struct kvm *kvm,
   unsigned long page_shift, bool *downgrade)
 {
@@ -450,8 +449,8 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
  * to unmap the device page from QEMU's page tables.
  */
-static unsigned long
-kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+static unsigned long kvmppc_share_page(struct kvm *kvm, unsigned long gpa,
+   unsigned long page_shift)
 {
 
int ret = H_PARAMETER;
@@ -500,9 +499,9 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * H_PAGE_IN_SHARED flag makes the page shared which means that the same
  * memory in is visible from both UV and HV.
  */
-unsigned long
-kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
-unsigned long flags, unsigned long page_shift)
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
+   unsigned long flags,
+   unsigned long page_shift)
 {
bool downgrade = false;
unsigned long start, end;
@@ -559,10 +558,10 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * Provision a new page on HV side and copy over the contents
  * from secure memory using UV_PAGE_OUT uvcall.
  */
-static int
-kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
+static int kvmppc_svm_page_out(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long page_shift,
+   struct kvm *kvm, unsigned long gpa)
 {
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma mig;
-- 
1.8.3.1



Re: [PATCH] powerpc/64s/hash: Fix hash_preload running with interrupts enabled

2020-07-27 Thread Athira Rajeev



> On 27-Jul-2020, at 6:05 PM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>>> On 27-Jul-2020, at 11:39 AM, Nicholas Piggin  wrote:
>>> 
>>> Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
>>> caller") removed the local_irq_disable from hash_preload, but it was
>>> required for more than just the page table walk: the hash pte busy bit is
>>> effectively a lock which may be taken in interrupt context, and the local
>>> update flag test must not be preempted before it's used.
>>> 
>>> This solves apparent lockups with perf interrupting __hash_page_64K. If
>>> get_perf_callchain then also takes a hash fault on the same page while it
>>> is already locked, it will loop forever taking hash faults, which looks like
>>> this:
>>> 
>>> cpu 0x49e: Vector: 100 (System Reset) at [c0001a4f7d70]
>>>   pc: c0072dc8: hash_page_mm+0x8/0x800
>>>   lr: c000c5a4: do_hash_page+0x24/0x38
>>>   sp: c0002ac1cc69ac70
>>>  msr: 80081033
>>> current = 0xc0002ac1cc602e00
>>> paca= 0xc0001de1f280   irqmask: 0x03   irq_happened: 0x01
>>>   pid   = 20118, comm = pread2_processe
>>> Linux version 5.8.0-rc6-00345-g1fad14f18bc6
>>> 49e:mon> t
>>> [c0002ac1cc69ac70] c000c5a4 do_hash_page+0x24/0x38 (unreliable)
>>> --- Exception: 300 (Data Access) at c008fa60 
>>> __copy_tofrom_user_power7+0x20c/0x7ac
>>> [link register   ] c0335d10 copy_from_user_nofault+0xf0/0x150
>>> [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
>>> [c0002ac1cc69afa0] c0109df0 read_user_stack_64+0x70/0xf0
>>> [c0002ac1cc69afd0] c0109fcc perf_callchain_user_64+0x15c/0x410
>>> [c0002ac1cc69b060] c0109c00 perf_callchain_user+0x20/0x40
>>> [c0002ac1cc69b080] c031c6cc get_perf_callchain+0x25c/0x360
>>> [c0002ac1cc69b120] c0316b50 perf_callchain+0x70/0xa0
>>> [c0002ac1cc69b140] c0316ddc perf_prepare_sample+0x25c/0x790
>>> [c0002ac1cc69b1a0] c0317350 perf_event_output_forward+0x40/0xb0
>>> [c0002ac1cc69b220] c0306138 __perf_event_overflow+0x88/0x1a0
>>> [c0002ac1cc69b270] c010cf70 record_and_restart+0x230/0x750
>>> [c0002ac1cc69b620] c010d69c perf_event_interrupt+0x20c/0x510
>>> [c0002ac1cc69b730] c0027d9c performance_monitor_exception+0x4c/0x60
>>> [c0002ac1cc69b750] c000b2f8 
>>> performance_monitor_common_virt+0x1b8/0x1c0
>>> --- Exception: f00 (Performance Monitor) at c00cb5b0 
>>> pSeries_lpar_hpte_insert+0x0/0x160
>>> [link register   ] c00846f0 __hash_page_64K+0x210/0x540
>>> [c0002ac1cc69ba50]  (unreliable)
>>> [c0002ac1cc69bb00] c0073ae0 update_mmu_cache+0x390/0x3a0
>>> [c0002ac1cc69bb70] c037f024 wp_page_copy+0x364/0xce0
>>> [c0002ac1cc69bc20] c038272c do_wp_page+0xdc/0xa60
>>> [c0002ac1cc69bc70] c03857bc handle_mm_fault+0xb9c/0x1b60
>>> [c0002ac1cc69bd50] c006c434 __do_page_fault+0x314/0xc90
>>> [c0002ac1cc69be20] c000c5c8 handle_page_fault+0x10/0x2c
>>> --- Exception: 300 (Data Access) at 7fff8c861fe8
>>> SP (76b19660) is in userspace
>>> 
>>> Reported-by: Athira Rajeev 
>>> Reported-by: Anton Blanchard 
>>> Reviewed-by: Aneesh Kumar K.V 
>>> Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
>>> caller")
>>> Signed-off-by: Nicholas Piggin 
>> 
>> 
>> Hi,
>> 
>> Tested with the patch and it fixes the lockups I was seeing with my test run.
>> Thanks for the fix.
>> 
>> Tested-by: Athira Rajeev 
> 
> Thanks for testing.
> 
> What test are you running?

Hi Michael

I was running  “perf record”  and Unixbench tests ( 
https://github.com/kdlucas/byte-unixbench ) in parallel where we were getting 
soft lockups

1. Perf command run:
# perf record -a -g -c 1000 -o  sleep 60

2. Unixbench tests
# Run -q -c  spawn

With the fix, perf completes successfully.

Thanks
Athira

> 
> cheers



[PATCH V5 4/4] tools/perf: Add perf tools support for extended regs in power10

2020-07-27 Thread Athira Rajeev
Add the registers which are new in power10
(MMCR3, SIER2, SIER3) to sample_reg_mask on the tools side,
so they can be used with the `-I?` option. Also add a PVR check
to send the power10 extended mask to the kernel when capturing
extended regs in each sample.
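
For illustration only (not part of the patch): with both the kernel and
tools changes applied, the new registers are expected to be listed and
selectable roughly as below. The register list, event and duration are
examples and depend on the platform (mmcr3/sier2/sier3 only on power10):

# perf record -I?
   available registers: ... sier mmcra mmcr0 mmcr1 mmcr2 mmcr3 sier2 sier3
# perf record -I mmcr3,sier2,sier3 -e cycles -a sleep 1
# perf report -D | grep -A4 "intr regs"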

Signed-off-by: Athira Rajeev 
Reviewed-by: Kajol Jain 
Reviewed-and-tested-by: Ravi Bangoria 
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 6 ++
 tools/perf/arch/powerpc/include/perf_regs.h | 3 +++
 tools/perf/arch/powerpc/util/perf_regs.c| 6 ++
 3 files changed, 15 insertions(+)

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index 225c64c..bdf5f10 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR0,
PERF_REG_POWERPC_MMCR1,
PERF_REG_POWERPC_MMCR2,
+   PERF_REG_POWERPC_MMCR3,
+   PERF_REG_POWERPC_SIER2,
+   PERF_REG_POWERPC_SIER3,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
@@ -60,6 +63,9 @@ enum perf_event_powerpc_regs {
 
 /* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
 #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
+#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - 
PERF_REG_PMU_MASK)
 
 #define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
+#define PERF_REG_MAX_ISA_31(PERF_REG_POWERPC_SIER3 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h 
b/tools/perf/arch/powerpc/include/perf_regs.h
index 46ed00d..63f3ac9 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -68,6 +68,9 @@
[PERF_REG_POWERPC_MMCR0] = "mmcr0",
[PERF_REG_POWERPC_MMCR1] = "mmcr1",
[PERF_REG_POWERPC_MMCR2] = "mmcr2",
+   [PERF_REG_POWERPC_MMCR3] = "mmcr3",
+   [PERF_REG_POWERPC_SIER2] = "sier2",
+   [PERF_REG_POWERPC_SIER3] = "sier3",
 };
 
 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c 
b/tools/perf/arch/powerpc/util/perf_regs.c
index d64ba0c..2b6d470 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -14,6 +14,7 @@
 #include 
 
 #define PVR_POWER9 0x004E
+#define PVR_POWER10 0x0080
 
 const struct sample_reg sample_reg_masks[] = {
SMPL_REG(r0, PERF_REG_POWERPC_R0),
@@ -64,6 +65,9 @@
SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
+   SMPL_REG(mmcr3, PERF_REG_POWERPC_MMCR3),
+   SMPL_REG(sier2, PERF_REG_POWERPC_SIER2),
+   SMPL_REG(sier3, PERF_REG_POWERPC_SIER3),
SMPL_REG_END
 };
 
@@ -194,6 +198,8 @@ uint64_t arch__intr_reg_mask(void)
version = (((mfspr(SPRN_PVR)) >>  16) & 0x);
if (version == PVR_POWER9)
extended_mask = PERF_REG_PMU_MASK_300;
+   else if (version == PVR_POWER10)
+   extended_mask = PERF_REG_PMU_MASK_31;
else
return mask;
 
-- 
1.8.3.1



[PATCH V5 3/4] tools/perf: Add perf tools support for extended register capability in powerpc

2020-07-27 Thread Athira Rajeev
From: Anju T Sudhakar 

Add extended regs to sample_reg_mask on the tools side for use
with the `-I?` option. The perf tools side uses the extended mask to
display the platform-supported register names (with the -I? option) to
the user and also sends this mask to the kernel to capture the extended
registers in each sample. Hence the mask value is decided based on the
processor version.

Currently the definitions of `mfspr` and `SPRN_PVR` are part of
`arch/powerpc/util/header.c`. Move them to a header file so that
these definitions can be re-used in other source files as well.

Signed-off-by: Anju T Sudhakar 
[Decide extended mask at run time based on platform]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
Reviewed-by: Kajol Jain 
Reviewed-and-tested-by: Ravi Bangoria 
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++-
 tools/perf/arch/powerpc/include/perf_regs.h |  5 ++-
 tools/perf/arch/powerpc/util/header.c   |  9 +
 tools/perf/arch/powerpc/util/perf_regs.c| 49 +
 tools/perf/arch/powerpc/util/utils_header.h | 15 
 5 files changed, 82 insertions(+), 10 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/utils_header.h

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..225c64c 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+
+#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h 
b/tools/perf/arch/powerpc/include/perf_regs.h
index e18a355..46ed00d 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -64,7 +64,10 @@
[PERF_REG_POWERPC_DAR] = "dar",
[PERF_REG_POWERPC_DSISR] = "dsisr",
[PERF_REG_POWERPC_SIER] = "sier",
-   [PERF_REG_POWERPC_MMCRA] = "mmcra"
+   [PERF_REG_POWERPC_MMCRA] = "mmcra",
+   [PERF_REG_POWERPC_MMCR0] = "mmcr0",
+   [PERF_REG_POWERPC_MMCR1] = "mmcr1",
+   [PERF_REG_POWERPC_MMCR2] = "mmcr2",
 };
 
 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/header.c 
b/tools/perf/arch/powerpc/util/header.c
index d487007..1a95017 100644
--- a/tools/perf/arch/powerpc/util/header.c
+++ b/tools/perf/arch/powerpc/util/header.c
@@ -7,17 +7,10 @@
 #include 
 #include 
 #include "header.h"
+#include "utils_header.h"
 #include "metricgroup.h"
 #include 
 
-#define mfspr(rn)   ({unsigned long rval; \
-asm volatile("mfspr %0," __stringify(rn) \
- : "=r" (rval)); rval; })
-
-#define SPRN_PVR0x11F  /* Processor Version Register */
-#define PVR_VER(pvr)(((pvr) >>  16) & 0x) /* Version field */
-#define PVR_REV(pvr)(((pvr) >>   0) & 0x) /* Revison field */
-
 int
 get_cpuid(char *buffer, size_t sz)
 {
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c 
b/tools/perf/arch/powerpc/util/perf_regs.c
index 0a52429..d64ba0c 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -6,9 +6,15 @@
 
 #include "../../../util/perf_regs.h"
 #include "../../../util/debug.h"
+#include "../../../util/event.h"
+#include "../../../util/header.h"
+#include "../../../perf-sys.h"
+#include "utils_header.h"
 
 #include 
 
+#define PVR_POWER9 0x004E
+
 const struct sample_reg sample_reg_masks[] = {
SMPL_REG(r0, PERF_REG_POWERPC_R0),
SMPL_REG(r1, PERF_REG_POWERPC_R1),
@@ -55,6 +61,9 @@
SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
SMPL_REG(sier, PERF_REG_POWERPC_SIER),
SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
+   SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
+   SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
+   SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
SMPL_REG_END
 };
 
@@ -163,3 +172,43 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 
return SDT_ARG_VALID;
 }
+
+uint64_t arch__intr_reg_mask(void)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .sample_type= PERF_SAMPLE_REGS_INTR,
+  

[PATCH V5 2/4] powerpc/perf: Add extended regs support for power10 platform

2020-07-27 Thread Athira Rajeev
Include the capability flag `PERF_PMU_CAP_EXTENDED_REGS` for power10
and expose the MMCR3, SIER2, SIER3 registers as part of extended regs.
Also introduce `PERF_REG_PMU_MASK_31` to define the extended mask
value at runtime for power10.

Signed-off-by: Athira Rajeev 
[Fix build failure on PPC32 platform]
Suggested-by: Ryan Grimm 
Reported-by: kernel test robot 
Reviewed-by: Kajol Jain 
Tested-by: Nageswara R Sastry 
Reviewed-and-tested-by: Ravi Bangoria 
---
 arch/powerpc/include/uapi/asm/perf_regs.h |  6 ++
 arch/powerpc/perf/perf_regs.c | 12 +++-
 arch/powerpc/perf/power10-pmu.c   |  6 ++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
b/arch/powerpc/include/uapi/asm/perf_regs.h
index 225c64c..bdf5f10 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR0,
PERF_REG_POWERPC_MMCR1,
PERF_REG_POWERPC_MMCR2,
+   PERF_REG_POWERPC_MMCR3,
+   PERF_REG_POWERPC_SIER2,
+   PERF_REG_POWERPC_SIER3,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
@@ -60,6 +63,9 @@ enum perf_event_powerpc_regs {
 
 /* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
 #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
+#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - 
PERF_REG_PMU_MASK)
 
 #define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
+#define PERF_REG_MAX_ISA_31(PERF_REG_POWERPC_SIER3 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 9301e68..8e53f2f 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -81,6 +81,14 @@ static u64 get_ext_regs_value(int idx)
return mfspr(SPRN_MMCR1);
case PERF_REG_POWERPC_MMCR2:
return mfspr(SPRN_MMCR2);
+#ifdef CONFIG_PPC64
+   case PERF_REG_POWERPC_MMCR3:
+   return mfspr(SPRN_MMCR3);
+   case PERF_REG_POWERPC_SIER2:
+   return mfspr(SPRN_SIER2);
+   case PERF_REG_POWERPC_SIER3:
+   return mfspr(SPRN_SIER3);
+#endif
default: return 0;
}
 }
@@ -89,7 +97,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
u64 perf_reg_extended_max = PERF_REG_POWERPC_MAX;
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   perf_reg_extended_max = PERF_REG_MAX_ISA_31;
+   else if (cpu_has_feature(CPU_FTR_ARCH_300))
perf_reg_extended_max = PERF_REG_MAX_ISA_300;
 
if (idx == PERF_REG_POWERPC_SIER &&
diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index f7cff7f..8314865 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -87,6 +87,8 @@
 #define POWER10_MMCRA_IFM3 0xC000UL
#define POWER10_MMCRA_BHRB_MASK 0xC000UL
 
+extern u64 PERF_REG_EXTENDED_MASK;
+
 /* Table of alternatives, sorted by column 0 */
 static const unsigned int power10_event_alternatives[][MAX_ALT] = {
{ PM_RUN_CYC_ALT,   PM_RUN_CYC },
@@ -397,6 +399,7 @@ static void power10_config_bhrb(u64 pmu_bhrb_filter)
.cache_events   = _cache_events,
.attr_groups= power10_pmu_attr_groups,
.bhrb_nr= 32,
+   .capabilities   = PERF_PMU_CAP_EXTENDED_REGS,
 };
 
 int init_power10_pmu(void)
@@ -408,6 +411,9 @@ int init_power10_pmu(void)
strcmp(cur_cpu_spec->oprofile_cpu_type, "ppc64/power10"))
return -ENODEV;
 
+   /* Set the PERF_REG_EXTENDED_MASK here */
+   PERF_REG_EXTENDED_MASK = PERF_REG_PMU_MASK_31;
+
rc = register_power_pmu(_pmu);
if (rc)
return rc;
-- 
1.8.3.1



[PATCH V5 1/4] powerpc/perf: Add support for outputting extended regs in perf intr_regs

2020-07-27 Thread Athira Rajeev
From: Anju T Sudhakar 

Add support for the perf extended register capability in powerpc.
The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to indicate a
PMU which supports extended registers. The generic code defines the mask
of extended registers as 0 for unsupported architectures.

This patch adds extended regs support for the power9 platform by
exposing the MMCR0, MMCR1 and MMCR2 registers.

The REG_RESERVED mask needs updating to include the extended regs.
`PERF_REG_EXTENDED_MASK`, which contains the mask value of the supported
registers, is defined at runtime in the kernel based on the platform,
since the supported registers (and hence the mask value) may differ from
one processor version to another.
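
For readers following the mask arithmetic, here is a small standalone
sketch (not part of the patch) of how the base and extended masks split
the register index space. The enum indices are assumptions made for
illustration; the authoritative values are in the uapi header changes
below:

/* build: gcc -o mask_sketch mask_sketch.c && ./mask_sketch */
#include <stdio.h>

int main(void)
{
	/* Assumed indices: mmcra is the last "classic" register, and the
	 * three ISA 3.0 extended regs (mmcr0/1/2) follow it. */
	unsigned int mmcra = 44;		/* PERF_REG_POWERPC_MMCRA */
	unsigned int max   = mmcra + 1;		/* PERF_REG_POWERPC_MAX */
	unsigned int mmcr2 = max + 2;		/* PERF_REG_POWERPC_MMCR2 */

	/* PERF_REG_PMU_MASK: one bit for every classic register */
	unsigned long long pmu_mask = (1ULL << max) - 1;

	/* PERF_REG_PMU_MASK_300: only the bits for mmcr0/mmcr1/mmcr2,
	 * i.e. everything up to MMCR2 minus the classic bits */
	unsigned long long mask_300 = ((1ULL << (mmcr2 + 1)) - 1) - pmu_mask;

	printf("PERF_REG_PMU_MASK     = 0x%llx\n", pmu_mask);
	printf("PERF_REG_PMU_MASK_300 = 0x%llx\n", mask_300);
	return 0;
}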

with patch
--

available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2

PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
... intr regs: mask 0x ABI 64-bit
 r00xc012b77c
 r10xc03fe5e03930
 r20xc1b0e000
 r30xc03fdcddf800
 r40xc03fc788
 r50x9c422724be
 r60xc03fe5e03908
 r70xff63bddc8706
 r80x9e4
 r90x0
 r10   0x1
 r11   0x0
 r12   0xc01299c0
 r13   0xc03c4800
 r14   0x0
 r15   0x7fffdd8b8b00
 r16   0x0
 r17   0x7fffdd8be6b8
 r18   0x7e7076607730
 r19   0x2f
 r20   0xc0001fc26c68
 r21   0xc0002041e4227e00
 r22   0xc0002018fb60
 r23   0x1
 r24   0xc03ffec4d900
 r25   0x8000
 r26   0x0
 r27   0x1
 r28   0x1
 r29   0xc1be1260
 r30   0x6008010
 r31   0xc03ffebb7218
 nip   0xc012b910
 msr   0x90009033
 orig_r3 0xc012b86c
 ctr   0xc01299c0
 link  0xc012b77c
 xer   0x0
 ccr   0x2800
 softe 0x1
 trap  0xf00
 dar   0x0
 dsisr 0x800
 sier  0x0
 mmcra 0x800
 mmcr0 0x82008090
 mmcr1 0x1e00
 mmcr2 0x0
 ... thread: perf:4784

Signed-off-by: Anju T Sudhakar 
[Defined PERF_REG_EXTENDED_MASK at run time to add support for different 
platforms ]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
[Fix build issue using CONFIG_PERF_EVENTS without CONFIG_PPC_PERF_CTRS]
Reported-by: kernel test robot 
Reviewed-by: Kajol Jain 
Tested-by: Nageswara R Sastry 
Reviewed-and-tested-by: Ravi Bangoria 
---
 arch/powerpc/include/asm/perf_event.h|  3 +++
 arch/powerpc/include/asm/perf_event_server.h |  5 
 arch/powerpc/include/uapi/asm/perf_regs.h| 14 +++-
 arch/powerpc/perf/core-book3s.c  |  1 +
 arch/powerpc/perf/perf_regs.c| 34 +---
 arch/powerpc/perf/power9-pmu.c   |  6 +
 6 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event.h 
b/arch/powerpc/include/asm/perf_event.h
index 1e8b2e1..daec64d 100644
--- a/arch/powerpc/include/asm/perf_event.h
+++ b/arch/powerpc/include/asm/perf_event.h
@@ -40,4 +40,7 @@
 
 /* To support perf_regs sier update */
 extern bool is_sier_available(void);
+/* To define perf extended regs mask value */
+extern u64 PERF_REG_EXTENDED_MASK;
+#define PERF_REG_EXTENDED_MASK PERF_REG_EXTENDED_MASK
 #endif
diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 86c9eb06..f6acabb 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -62,6 +62,11 @@ struct power_pmu {
int *blacklist_ev;
/* BHRB entries in the PMU */
int bhrb_nr;
+   /*
+* set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
+* the pmu supports extended perf regs capability
+*/
+   int capabilities;
 };
 
 /*
diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
b/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..225c64c 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+
+#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff 

[PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc

2020-07-27 Thread Athira Rajeev
Patch set to add support for the perf extended register capability in
powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to
indicate a PMU which supports extended registers. The generic code
defines the mask of extended registers as 0 for unsupported architectures.

Patches 1 and 2 are the kernel side changes needed to include
base support for extended regs in powerpc and in power10.
Patches 3 and 4 are the perf tools side changes needed to support the
extended registers.

patch 1/4 adds the PERF_PMU_CAP_EXTENDED_REGS capability to output the
values of mmcr0, mmcr1, mmcr2 for POWER9, and defines `PERF_REG_EXTENDED_MASK`
at runtime, which contains the mask value of the registers supported under
extended regs.

patch 2/4 adds the extended regs support for power10 and exposes
MMCR3, SIER2, SIER3 registers as part of extended regs.

Patches 3/4 and 4/4 add extended regs to sample_reg_mask on the tools
side, to use with the `-I?` option for power9 and power10 respectively.

Ravi Bangoria found an issue with `perf record -I` while testing the
changes. The same issue is currently being worked on here:
https://lkml.org/lkml/2020/7/19/413 and will be resolved once fix
from Jin Yao is merged.

This patch series is based on powerpc/next

Changelog:

Changes from v4 -> v5
- initialize `perf_reg_extended_max` to work on
  all platforms as suggested by Ravi Bangoria
- Added Reviewed-and-Tested-by from Ravi Bangoria

Changes from v3 -> v4
- Split the series and send extended regs as separate patch set here.
  Link to previous series :
  https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=190462=*
  Other PMU patches are already merged in powerpc/next.

- Fixed a kernel build issue, reported by the kernel build bot, when using
  a config with CONFIG_PERF_EVENTS set and without CONFIG_PPC_PERF_CTRS.
- Included Reviewed-by from Kajol Jain.
- Addressed review comments from Ravi Bangoria to initialize 
`perf_reg_extended_max`
  and define it in lowercase since it is a local variable.

Anju T Sudhakar (2):
  powerpc/perf: Add support for outputting extended regs in perf
intr_regs
  tools/perf: Add perf tools support for extended register capability in
powerpc

Athira Rajeev (2):
  powerpc/perf: Add extended regs support for power10 platform
  tools/perf: Add perf tools support for extended regs in power10

 arch/powerpc/include/asm/perf_event.h   |  3 ++
 arch/powerpc/include/asm/perf_event_server.h|  5 +++
 arch/powerpc/include/uapi/asm/perf_regs.h   | 20 -
 arch/powerpc/perf/core-book3s.c |  1 +
 arch/powerpc/perf/perf_regs.c   | 44 ++--
 arch/powerpc/perf/power10-pmu.c |  6 +++
 arch/powerpc/perf/power9-pmu.c  |  6 +++
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 20 -
 tools/perf/arch/powerpc/include/perf_regs.h |  8 +++-
 tools/perf/arch/powerpc/util/header.c   |  9 +---
 tools/perf/arch/powerpc/util/perf_regs.c| 55 +
 tools/perf/arch/powerpc/util/utils_header.h | 15 +++
 12 files changed, 178 insertions(+), 14 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/utils_header.h

-- 
1.8.3.1



[PATCH -next] powerpc/powernv/sriov: Remove unused but set variable 'phb'

2020-07-27 Thread Wei Yongjun
Gcc reports the following warning:

arch/powerpc/platforms/powernv/pci-sriov.c:602:25: warning:
 variable 'phb' set but not used [-Wunused-but-set-variable]
  602 |  struct pnv_phb*phb;
  | ^~~

This variable is not used, so this commit removes it.

Reported-by: Hulk Robot 
Signed-off-by: Wei Yongjun 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 8404d8c3901d..7894745fd4f8 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -599,10 +599,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
 static void pnv_pci_sriov_disable(struct pci_dev *pdev)
 {
u16num_vfs, base_pe;
-   struct pnv_phb*phb;
struct pnv_iov_data   *iov;
 
-   phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
num_vfs = iov->num_vfs;
base_pe = iov->vf_pe_arr[0].pe_number;



Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-27 Thread Rafael J. Wysocki
On Tue, Jul 7, 2020 at 1:32 PM Gautham R Shenoy  wrote:
>
> Hi,
>
> On Tue, Jul 07, 2020 at 04:41:34PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" 
> >
> > Hi,
> >
> >
> >
> >
> > Gautham R. Shenoy (5):
> >   cpuidle-pseries: Set the latency-hint before entering CEDE
> >   cpuidle-pseries: Add function to parse extended CEDE records
> >   cpuidle-pseries : Fixup exit latency for CEDE(0)
> >   cpuidle-pseries : Include extended CEDE states in cpuidle framework
> >   cpuidle-pseries: Block Extended CEDE(1) which adds no additional
> > value.
>
> Forgot to mention that these patches are on top of Nathan's series to
> remove extended CEDE offline and bogus topology update code :
> https://lore.kernel.org/linuxppc-dev/20200612051238.1007764-1-nath...@linux.ibm.com/

OK, so this is targeted at the powerpc maintainers, isn't it?


Re: [PATCH v2 4/5] powerpc/mm: Remove custom stack expansion checking

2020-07-27 Thread Daniel Axtens
Hi Michael,

I tested v1 of this. I ran the test from the bug with a range of stack
sizes, in a loop, for several hours and didn't see any crashes/signal
delivery failures.

I retested v2 for a few minutes just to be sure, and I ran stress-ng's
stack, stackmmap and bad-altstack stressors to make sure no obvious
kernel bugs were exposed. Nothing crashed.

All tests done on a P8 LE guest under KVM.

On that basis:

Tested-by: Daniel Axtens 

The more I look at this the less qualified I feel to Review it, but
certainly it looks better than my ugly hack from late last year.

Kind regards,
Daniel

> We have powerpc specific logic in our page fault handling to decide if
> an access to an unmapped address below the stack pointer should expand
> the stack VMA.
>
> The logic aims to prevent userspace from doing bad accesses below the
> stack pointer. However as long as the stack is < 1MB in size, we allow
> all accesses without further checks. Adding some debug I see that I
> can do a full kernel build and LTP run, and not a single process has
> used more than 1MB of stack. So for the majority of processes the
> logic never even fires.
>
> We also recently found a nasty bug in this code which could cause
> userspace programs to be killed during signal delivery. It went
> unnoticed presumably because most processes use < 1MB of stack.
>
> The generic mm code has also grown support for stack guard pages since
> this code was originally written, so the most heinous case of the
> stack expanding into other mappings is now handled for us.
>
> Finally although some other arches have special logic in this path,
> from what I can tell none of x86, arm64, arm and s390 impose any extra
> checks other than those in expand_stack().
>
> So drop our complicated logic and like other architectures just let
> the stack expand as long as its within the rlimit.
>
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/fault.c | 109 ++--
>  1 file changed, 5 insertions(+), 104 deletions(-)
>
> v2: no change just rebased.
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 3ebb1792e636..925a7231abb3 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -42,39 +42,7 @@
>  #include 
>  #include 
>  
> -/*
> - * Check whether the instruction inst is a store using
> - * an update addressing form which will update r1.
> - */
> -static bool store_updates_sp(struct ppc_inst inst)
> -{
> - /* check for 1 in the rA field */
> - if (((ppc_inst_val(inst) >> 16) & 0x1f) != 1)
> - return false;
> - /* check major opcode */
> - switch (ppc_inst_primary_opcode(inst)) {
> - case OP_STWU:
> - case OP_STBU:
> - case OP_STHU:
> - case OP_STFSU:
> - case OP_STFDU:
> - return true;
> - case OP_STD:/* std or stdu */
> - return (ppc_inst_val(inst) & 3) == 1;
> - case OP_31:
> - /* check minor opcode */
> - switch ((ppc_inst_val(inst) >> 1) & 0x3ff) {
> - case OP_31_XOP_STDUX:
> - case OP_31_XOP_STWUX:
> - case OP_31_XOP_STBUX:
> - case OP_31_XOP_STHUX:
> - case OP_31_XOP_STFSUX:
> - case OP_31_XOP_STFDUX:
> - return true;
> - }
> - }
> - return false;
> -}
> +
>  /*
>   * do_page_fault error handling helpers
>   */
> @@ -267,57 +235,6 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>   return false;
>  }
>  
> -// This comes from 64-bit struct rt_sigframe + __SIGNAL_FRAMESIZE
> -#define SIGFRAME_MAX_SIZE(4096 + 128)
> -
> -static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
> - struct vm_area_struct *vma, unsigned int flags,
> - bool *must_retry)
> -{
> - /*
> -  * N.B. The POWER/Open ABI allows programs to access up to
> -  * 288 bytes below the stack pointer.
> -  * The kernel signal delivery code writes a bit over 4KB
> -  * below the stack pointer (r1) before decrementing it.
> -  * The exec code can write slightly over 640kB to the stack
> -  * before setting the user r1.  Thus we allow the stack to
> -  * expand to 1MB without further checks.
> -  */
> - if (address + 0x10 < vma->vm_end) {
> - struct ppc_inst __user *nip = (struct ppc_inst __user 
> *)regs->nip;
> - /* get user regs even if this fault is in kernel mode */
> - struct pt_regs *uregs = current->thread.regs;
> - if (uregs == NULL)
> - return true;
> -
> - /*
> -  * A user-mode access to an address a long way below
> -  * the stack pointer is only valid if the instruction
> -  * is one which would update the stack pointer to the
> -  * address accessed if the instruction completed,
> -  * 

Re: [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states

2020-07-27 Thread Rafael J. Wysocki
On Tue, Jul 21, 2020 at 2:43 PM Pratik Rajesh Sampat
 wrote:
>
> Fire directed smp_call_function_single IPIs from a specified source
> CPU to the specified target CPU to reduce the noise we have to wade
> through in the trace log.

And what's the purpose of it?

> The module is based on the idea written by Srivatsa Bhat and maintained
> by Vaidyanathan Srinivasan internally.
>
> Queue HR timer and measure jitter. Wakeup latency measurement for idle
> states using hrtimer.  Echo a value in ns to timer_test_function and
> watch trace. A HRtimer will be queued and when it fires the expected
> wakeup vs actual wakeup is computes and delay printed in ns.
>
> Implemented as a module which utilizes debugfs so that it can be
> integrated with selftests.
>
> To include the module, check option and include as module
> kernel hacking -> Cpuidle latency selftests
>
> [srivatsa.b...@linux.vnet.ibm.com: Initial implementation in
>  cpidle/sysfs]
>
> [sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
>  and fix some of the time calculation]
>
> [e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and
>  increase the resolution of IPI wakeup]
>
> Signed-off-by: Pratik Rajesh Sampat 
> Reviewed-by: Gautham R. Shenoy 
> ---
>  drivers/cpuidle/Makefile   |   1 +
>  drivers/cpuidle/test-cpuidle_latency.c | 150 +
>  lib/Kconfig.debug  |  10 ++
>  3 files changed, 161 insertions(+)
>  create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
>
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index f07800cbb43f..2ae05968078c 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
>  obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
>  obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
>  obj-$(CONFIG_HALTPOLL_CPUIDLE)   += cpuidle-haltpoll.o
> +obj-$(CONFIG_IDLE_LATENCY_SELFTEST)  += test-cpuidle_latency.o
>
>  
> ##
>  # ARM SoC drivers
> diff --git a/drivers/cpuidle/test-cpuidle_latency.c 
> b/drivers/cpuidle/test-cpuidle_latency.c
> new file mode 100644
> index ..61574665e972
> --- /dev/null
> +++ b/drivers/cpuidle/test-cpuidle_latency.c
> @@ -0,0 +1,150 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Module-based API test facility for cpuidle latency using IPIs and timers

I'd like to see a more detailed description of what it does and how it
works here.

> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +/* IPI based wakeup latencies */
> +struct latency {
> +   unsigned int src_cpu;
> +   unsigned int dest_cpu;
> +   ktime_t time_start;
> +   ktime_t time_end;
> +   u64 latency_ns;
> +} ipi_wakeup;
> +
> +static void measure_latency(void *info)
> +{
> +   struct latency *v;
> +   ktime_t time_diff;
> +
> +   v = (struct latency *)info;
> +   v->time_end = ktime_get();
> +   time_diff = ktime_sub(v->time_end, v->time_start);
> +   v->latency_ns = ktime_to_ns(time_diff);
> +}
> +
> +void run_smp_call_function_test(unsigned int cpu)
> +{
> +   ipi_wakeup.src_cpu = smp_processor_id();
> +   ipi_wakeup.dest_cpu = cpu;
> +   ipi_wakeup.time_start = ktime_get();
> +   smp_call_function_single(cpu, measure_latency, _wakeup, 1);
> +}
> +
> +/* Timer based wakeup latencies */
> +struct timer_data {
> +   unsigned int src_cpu;
> +   u64 timeout;
> +   ktime_t time_start;
> +   ktime_t time_end;
> +   struct hrtimer timer;
> +   u64 timeout_diff_ns;
> +} timer_wakeup;
> +
> +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
> +{
> +   struct timer_data *w;
> +   ktime_t time_diff;
> +
> +   w = container_of(hrtimer, struct timer_data, timer);
> +   w->time_end = ktime_get();
> +
> +   time_diff = ktime_sub(w->time_end, w->time_start);
> +   time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
> +   w->timeout_diff_ns = ktime_to_ns(time_diff);
> +   return HRTIMER_NORESTART;
> +}
> +
> +static void run_timer_test(unsigned int ns)
> +{
> +   hrtimer_init(_wakeup.timer, CLOCK_MONOTONIC,
> +HRTIMER_MODE_REL);
> +   timer_wakeup.timer.function = timer_called;
> +   timer_wakeup.time_start = ktime_get();
> +   timer_wakeup.src_cpu = smp_processor_id();
> +   timer_wakeup.timeout = ns;
> +
> +   hrtimer_start(_wakeup.timer, ns_to_ktime(ns),
> + HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static struct dentry *dir;
> +
> +static int cpu_read_op(void *data, u64 *value)
> +{
> +   *value = ipi_wakeup.dest_cpu;
> +   return 0;
> +}
> +
> +static int cpu_write_op(void *data, u64 value)
> +{
> +   run_smp_call_function_test(value);
> +   return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(ipi_ops, cpu_read_op, 

Re: [PATCH] lockdep: Fix TRACE_IRQFLAGS vs NMIs

2020-07-27 Thread Ingo Molnar


* pet...@infradead.org  wrote:

> 
> Prior to commit 859d069ee1dd ("lockdep: Prepare for NMI IRQ state
> tracking") IRQ state tracking was disabled in NMIs due to nmi_enter()
> doing lockdep_off() -- with the obvious requirement that NMI entry
> call nmi_enter() before trace_hardirqs_off().
> 
> [ afaict, PowerPC and SH violate this order on their NMI entry ]
> 
> However, that commit explicitly changed lockdep_hardirqs_*() to ignore
> lockdep_off() and breaks every architecture that has irq-tracing in
> it's NMI entry that hasn't been fixed up (x86 being the only fixed one
> at this point).
> 
> The reason for this change is that by ignoring lockdep_off() we can:
> 
>   - get rid of 'current->lockdep_recursion' in lockdep_assert_irqs*()
> which was going to to give header-recursion issues with the
> seqlock rework.
> 
>   - allow these lockdep_assert_*() macros to function in NMI context.
> 
> Restore the previous state of things and allow an architecture to
> opt-in to the NMI IRQ tracking support, however instead of relying on
> lockdep_off(), rely on in_nmi(), both are part of nmi_enter() and so
> over-all entry ordering doesn't need to change.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/Kconfig.debug   |3 +++
>  kernel/locking/lockdep.c |8 +++-
>  lib/Kconfig.debug|6 ++
>  3 files changed, 16 insertions(+), 1 deletion(-)

Tree management side note: to apply this I've created a new 
tip:locking/nmi branch, which is based off the existing NMI vs. IRQ 
tracing commits included in locking/core:

ed00495333cc: ("locking/lockdep: Fix TRACE_IRQFLAGS vs. NMIs")
ba1f2b2eaa2a: ("x86/entry: Fix NMI vs IRQ state tracking")
859d069ee1dd: ("lockdep: Prepare for NMI IRQ state tracking")
248591f5d257: ("kcsan: Make KCSAN compatible with new IRQ state tracking")
e1bcad609f5a: ("Merge branch 'tip/x86/entry'")
b037b09b9058: ("x86/entry: Rename idtentry_enter/exit_cond_rcu() to 
idtentry_enter/exit()")
dcb7fd82c75e: ("Linux 5.8-rc4")

This locking/nmi branch can then be merged into irq/entry (there's a 
bunch of conflicts between them), without coupling all of v5.9's 
locking changes to Thomas's generic entry work.

Thanks,

Ingo


[PATCH] lockdep: Fix TRACE_IRQFLAGS vs NMIs

2020-07-27 Thread peterz


Prior to commit 859d069ee1dd ("lockdep: Prepare for NMI IRQ state
tracking") IRQ state tracking was disabled in NMIs due to nmi_enter()
doing lockdep_off() -- with the obvious requirement that NMI entry
call nmi_enter() before trace_hardirqs_off().

[ afaict, PowerPC and SH violate this order on their NMI entry ]

However, that commit explicitly changed lockdep_hardirqs_*() to ignore
lockdep_off() and breaks every architecture that has irq-tracing in
it's NMI entry that hasn't been fixed up (x86 being the only fixed one
at this point).

The reason for this change is that by ignoring lockdep_off() we can:

  - get rid of 'current->lockdep_recursion' in lockdep_assert_irqs*()
which was going to to give header-recursion issues with the
seqlock rework.

  - allow these lockdep_assert_*() macros to function in NMI context.

Restore the previous state of things and allow an architecture to
opt-in to the NMI IRQ tracking support, however instead of relying on
lockdep_off(), rely on in_nmi(), both are part of nmi_enter() and so
over-all entry ordering doesn't need to change.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/Kconfig.debug   |3 +++
 kernel/locking/lockdep.c |8 +++-
 lib/Kconfig.debug|6 ++
 3 files changed, 16 insertions(+), 1 deletion(-)

--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -3,6 +3,9 @@
 config TRACE_IRQFLAGS_SUPPORT
def_bool y
 
+config TRACE_IRQFLAGS_NMI_SUPPORT
+   def_bool y
+
 config EARLY_PRINTK_USB
bool
 
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3712,6 +3712,9 @@ void noinstr lockdep_hardirqs_on(unsigne
 * and not rely on hardware state like normal interrupts.
 */
if (unlikely(in_nmi())) {
+   if (!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_NMI))
+   return;
+
/*
 * Skip:
 *  - recursion check, because NMI can hit lockdep;
@@ -3773,7 +3776,10 @@ void noinstr lockdep_hardirqs_off(unsign
 * they will restore the software state. This ensures the software
 * state is consistent inside NMIs as well.
 */
-   if (unlikely(!in_nmi() && (current->lockdep_recursion & 
LOCKDEP_RECURSION_MASK)))
+   if (in_nmi()) {
+   if (!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_NMI))
+   return;
+   } else if (current->lockdep_recursion & LOCKDEP_RECURSION_MASK)
return;
 
/*
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1325,11 +1325,17 @@ config WW_MUTEX_SELFTEST
 endmenu # lock debugging
 
 config TRACE_IRQFLAGS
+   depends on TRACE_IRQFLAGS_SUPPORT
bool
help
  Enables hooks to interrupt enabling and disabling for
  either tracing or lock debugging.
 
+config TRACE_IRQFLAGS_NMI
+   def_bool y
+   depends on TRACE_IRQFLAGS
+   depends on TRACE_IRQFLAGS_NMI_SUPPORT
+
 config STACKTRACE
bool "Stack backtrace support"
depends on STACKTRACE_SUPPORT


Re: [PATCH] powerpc/64s/hash: Fix hash_preload running with interrupts enabled

2020-07-27 Thread Michael Ellerman
Athira Rajeev  writes:
>> On 27-Jul-2020, at 11:39 AM, Nicholas Piggin  wrote:
>> 
>> Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
>> caller") removed the local_irq_disable from hash_preload, but it was
>> required for more than just the page table walk: the hash pte busy bit is
>> effectively a lock which may be taken in interrupt context, and the local
>> update flag test must not be preempted before it's used.
>> 
>> This solves apparent lockups with perf interrupting __hash_page_64K. If
>> get_perf_callchain then also takes a hash fault on the same page while it
>> is already locked, it will loop forever taking hash faults, which looks like
>> this:
>> 
>> cpu 0x49e: Vector: 100 (System Reset) at [c0001a4f7d70]
>>pc: c0072dc8: hash_page_mm+0x8/0x800
>>lr: c000c5a4: do_hash_page+0x24/0x38
>>sp: c0002ac1cc69ac70
>>   msr: 80081033
>>  current = 0xc0002ac1cc602e00
>>  paca= 0xc0001de1f280   irqmask: 0x03   irq_happened: 0x01
>>pid   = 20118, comm = pread2_processe
>> Linux version 5.8.0-rc6-00345-g1fad14f18bc6
>> 49e:mon> t
>> [c0002ac1cc69ac70] c000c5a4 do_hash_page+0x24/0x38 (unreliable)
>> --- Exception: 300 (Data Access) at c008fa60 
>> __copy_tofrom_user_power7+0x20c/0x7ac
>> [link register   ] c0335d10 copy_from_user_nofault+0xf0/0x150
>> [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
>> [c0002ac1cc69afa0] c0109df0 read_user_stack_64+0x70/0xf0
>> [c0002ac1cc69afd0] c0109fcc perf_callchain_user_64+0x15c/0x410
>> [c0002ac1cc69b060] c0109c00 perf_callchain_user+0x20/0x40
>> [c0002ac1cc69b080] c031c6cc get_perf_callchain+0x25c/0x360
>> [c0002ac1cc69b120] c0316b50 perf_callchain+0x70/0xa0
>> [c0002ac1cc69b140] c0316ddc perf_prepare_sample+0x25c/0x790
>> [c0002ac1cc69b1a0] c0317350 perf_event_output_forward+0x40/0xb0
>> [c0002ac1cc69b220] c0306138 __perf_event_overflow+0x88/0x1a0
>> [c0002ac1cc69b270] c010cf70 record_and_restart+0x230/0x750
>> [c0002ac1cc69b620] c010d69c perf_event_interrupt+0x20c/0x510
>> [c0002ac1cc69b730] c0027d9c performance_monitor_exception+0x4c/0x60
>> [c0002ac1cc69b750] c000b2f8 
>> performance_monitor_common_virt+0x1b8/0x1c0
>> --- Exception: f00 (Performance Monitor) at c00cb5b0 
>> pSeries_lpar_hpte_insert+0x0/0x160
>> [link register   ] c00846f0 __hash_page_64K+0x210/0x540
>> [c0002ac1cc69ba50]  (unreliable)
>> [c0002ac1cc69bb00] c0073ae0 update_mmu_cache+0x390/0x3a0
>> [c0002ac1cc69bb70] c037f024 wp_page_copy+0x364/0xce0
>> [c0002ac1cc69bc20] c038272c do_wp_page+0xdc/0xa60
>> [c0002ac1cc69bc70] c03857bc handle_mm_fault+0xb9c/0x1b60
>> [c0002ac1cc69bd50] c006c434 __do_page_fault+0x314/0xc90
>> [c0002ac1cc69be20] c000c5c8 handle_page_fault+0x10/0x2c
>> --- Exception: 300 (Data Access) at 7fff8c861fe8
>> SP (76b19660) is in userspace
>> 
>> Reported-by: Athira Rajeev 
>> Reported-by: Anton Blanchard 
>> Reviewed-by: Aneesh Kumar K.V 
>> Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
>> caller")
>> Signed-off-by: Nicholas Piggin 
>
>
> Hi,
>
> Tested with the patch and it fixes the lockups I was seeing with my test run.
> Thanks for the fix.
>
> Tested-by: Athira Rajeev 

Thanks for testing.

What test are you running?

cheers


Re: [PATCH v2 2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame

2020-07-27 Thread Michael Ellerman
Gabriel Paubert  writes:
> On Fri, Jul 24, 2020 at 07:25:25PM +1000, Michael Ellerman wrote:
>> We have powerpc specific logic in our page fault handling to decide if
>> an access to an unmapped address below the stack pointer should expand
>> the stack VMA.
>> 
>> The code was originally added in 2004 "ported from 2.4". The rough
>> logic is that the stack is allowed to grow to 1MB with no extra
>> checking. Over 1MB the access must be within 2048 bytes of the stack
>> pointer, or be from a user instruction that updates the stack pointer.
>> 
>> The 2048 byte allowance below the stack pointer is there to cover the
>> 288 byte "red zone" as well as the "about 1.5kB" needed by the signal
>> delivery code.
>> 
>> Unfortunately since then the signal frame has expanded, and is now
>> 4224 bytes on 64-bit kernels with transactional memory enabled.
>
> Are there really users of transactional memory in the wild? 

Not many that I've heard of, but some.

Though anything that does use it needs to be written to fall back to
regular locking if TM is not available anyway.

> Just asking because Power10 removes TM, and Power9 has had some issues
> with it AFAICT.

It varies on different Power9 chip levels. For guests it should work.

> Getting rid of it (if possible) would result in smaller signal frames,
> with simpler signal delivery code (probably slightly faster also).

All the kernel code should be behind CONFIG_PPC_TRANSACTIONAL_MEM.

Deciding to disable that is really a distro decision.

In upstream we tend not to drop support for existing hardware while
people are still using it. But we could make a special case for TM,
because it's quite intrusive. I think we'd wait for a major distro to
ship without TM enabled before we did that though.

cheers


Re: [PATCH] powerpc/test_emulate_sstep: Fix build error

2020-07-27 Thread Michael Ellerman
On Fri, 24 Jul 2020 10:41:09 +1000, Michael Ellerman wrote:
> ppc64_book3e_allmodconfig fails with:
> 
>   arch/powerpc/lib/test_emulate_step.c: In function 'test_pld':
>   arch/powerpc/lib/test_emulate_step.c:113:7: error: implicit declaration of 
> function 'cpu_has_feature'
> 113 |  if (!cpu_has_feature(CPU_FTR_ARCH_31)) {
> |   ^~~
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/test_emulate_sstep: Fix build error
  https://git.kernel.org/powerpc/c/70cc062c47e7851335ff4c44ba9b362174baf7d4

cheers


Re: [PATCH] powerpc/sstep: Fix incorrect CONFIG symbol in scv handling

2020-07-27 Thread Michael Ellerman
On Fri, 24 Jul 2020 23:16:09 +1000, Michael Ellerman wrote:
> When I "fixed" the ppc64e build in Nick's recent patch, I typoed the
> CONFIG symbol, resulting in one that doesn't exist. Fix it to use the
> correct symbol.

Applied to powerpc/next.

[1/1] powerpc/sstep: Fix incorrect CONFIG symbol in scv handling
  https://git.kernel.org/powerpc/c/826b07b190c8ca69ce674f13b4dc9be2bc536fcd

cheers


[powerpc:next] BUILD SUCCESS 86052e407e8e1964c81965de25832258875a0e6d

2020-07-27 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next
branch HEAD: 86052e407e8e1964c81965de25832258875a0e6d  powerpc/powernv/pci.h: 
delete duplicated word

elapsed time: 1289m

configs tested: 70
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 allyesconfig
i386defconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
m68k   sun3_defconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
c6x  allyesconfig
nios2allyesconfig
openriscdefconfig
nds32   defconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
xtensa  defconfig
parisc  defconfig
s390 allyesconfig
parisc   allyesconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a003-20200727
i386 randconfig-a005-20200727
i386 randconfig-a004-20200727
i386 randconfig-a006-20200727
i386 randconfig-a002-20200727
i386 randconfig-a001-20200727
x86_64   randconfig-a005-20200727
x86_64   randconfig-a004-20200727
x86_64   randconfig-a003-20200727
x86_64   randconfig-a006-20200727
x86_64   randconfig-a002-20200727
x86_64   randconfig-a001-20200727
i386 randconfig-a016-20200727
i386 randconfig-a013-20200727
i386 randconfig-a012-20200727
i386 randconfig-a015-20200727
i386 randconfig-a011-20200727
i386 randconfig-a014-20200727
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
sparc64 defconfig
x86_64rhel-7.6-kselftests
x86_64   rhel-8.3
x86_64  kexec
x86_64   rhel
x86_64   allyesconfig
x86_64  defconfig

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v2] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE

2020-07-27 Thread Bharata B Rao
On Mon, Jul 27, 2020 at 02:29:08PM +0530, Aneesh Kumar K.V wrote:
> This adds a kernel command line option that can be used to disable GTSE 
> support.
> Disabling GTSE implies kernel will make hcalls to invalidate TLB entries.
> 
> This was done so that we can do VM migration between configs that 
> enable/disable
> GTSE support via hypervisor. To migrate a VM from a system that supports
> GTSE to a system that doesn't, we can boot the guest with
> radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
> invalidates.
> 
> The check for hcall availability is done in pSeries_setup_arch so that
> the panic message appears on the console. This should only happen on
> a hypervisor that doesn't force the guest to hash translation even
> though it can't handle the radix GTSE=0 request via CAS. With
> radix_hcall_invalidate=on if the hypervisor doesn't support 
> hcall_rpt_invalidate
> hcall it should force the LPAR to hash translation.
> 
> Signed-off-by: Aneesh Kumar K.V 

Tested

1. radix_hcall_invalidate=on with KVM implementation of H_RPT_INVALIDATE hcall,
   the tlb flush calls get off-loaded to the hcall.
2. radix_hcall_invalidate=on w/o H_RPT_INVALIDATE hcall, the guest kernel
   panics as per design.

Tested-by: Bharata B Rao 


Re: [PATCH v3 09/10] powerpc/smp: Create coregroup domain

2020-07-27 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-07-27 10:09:41]:

> > 
> >  static void fixup_topology(void)
> >  {
> > +   if (!has_coregroup_support())
> > +   powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> > +
> > if (shared_caches) {
> > pr_info("Using shared cache scheduler topology\n");
> > powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> 
> 
> Suppose we consider a topology which does not have coregroup_support,
> but has shared_caches. In that case, we would want our coregroup
> domain to degenerate.
> 
> From the above code, after the fixup, our topology will look as
> follows:
> 
> static struct sched_domain_topology_level powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>   { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
>   { cpu_bigcore_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
> 
> So, in this case, the core-group domain (identified by MC) will
> degenerate only if cpu_bigcore_mask() and shared_cache_mask() return
> the same value. This may work for existing platforms, because either
> shared_caches don't exist, or when they do, cpu_bigcore_mask and
> shared_cache_mask return the same set of CPUs. But this may or may not
> continue to hold good in the future.
> 
> Furthermore, if that is always going to be the case that in the
> presence of shared_caches the cpu_bigcore_mask() and
> shared_cache_mask() will always be the same, then why even define two
> separate masks and not just have only the cpu_bigcore_mask() ?
> 

Your two statements contradict each other. In the former you are saying we
should be future-proof, and in the latter you are asking why add two masks
if they are both going to be the same.

> The correct way would be to set the powerpc_topology[mc_idx].mask to
> powerpc_topology[bigcore_idx].mask *after* we have fixedup the
> big_core level.

The reason I modified it in v4 is not for degeneration or for a future case
but for the current PowerNV/SMT 4 case. I could just as well have detected
the same and modified bigcore, but thought a fixup in one place would be
better.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()

2020-07-27 Thread Michael Ellerman
Michael Ellerman  writes:
> On Wed, 22 Jul 2020 14:26:15 +1000, Oliver O'Halloran wrote:
>> This function is a one line wrapper around eeh_phb_pe_create() and despite
>> the name it doesn't create any eeh_dev structures. Replace it with direct
>> calls to eeh_phb_pe_create() since that does what it says on the tin
>> and removes a layer of indirection.
>
> Applied to powerpc/next.
>
> [01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()
> 
> https://git.kernel.org/powerpc/c/475028efc708880e16e61cc4cbbc00af784cb39b

Something weird happened with the "thanks" script. Pretty sure I applied v3.

I think I applied this version previously and the script just matched
the subjects?

Anyway, ignore this mail.

cheers


Re: [PATCH v2 2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame

2020-07-27 Thread Daniel Axtens
Hi Michael,

I have tested this with the test from the bug and it now seems to pass
fine. On that basis:

Tested-by: Daniel Axtens 

Thank you for coming up with a better solution than my gross hack!

Kind regards,
Daniel

> We have powerpc specific logic in our page fault handling to decide if
> an access to an unmapped address below the stack pointer should expand
> the stack VMA.
>
> The code was originally added in 2004 "ported from 2.4". The rough
> logic is that the stack is allowed to grow to 1MB with no extra
> checking. Over 1MB the access must be within 2048 bytes of the stack
> pointer, or be from a user instruction that updates the stack pointer.
>
> The 2048 byte allowance below the stack pointer is there to cover the
> 288 byte "red zone" as well as the "about 1.5kB" needed by the signal
> delivery code.
>
> Unfortunately since then the signal frame has expanded, and is now
> 4224 bytes on 64-bit kernels with transactional memory enabled. This
> means if a process has consumed more than 1MB of stack, and its stack
> pointer lies less than 4224 bytes from the next page boundary, signal
> delivery will fault when trying to expand the stack and the process
> will see a SEGV.
>
> The total size of the signal frame is the size of struct rt_sigframe
> (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
> 64-bit).
>
> The 2048 byte allowance was correct until 2008 as the signal frame
> was:
>
> struct rt_sigframe {
> struct ucontextuc;   /* 0  1440 */
> /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
> long unsigned int  _unused[2];   /*  144016 */
> unsigned int   tramp[6]; /*  145624 */
> struct siginfo *   pinfo;/*  1480 8 */
> void * puc;  /*  1488 8 */
> struct siginfo info; /*  1496   128 */
> /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
> char   abigap[288];  /*  1624   288 */
>
> /* size: 1920, cachelines: 15, members: 7 */
> /* padding: 8 */
> };
>
> 1920 + 128 = 2048
>
> Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore,
> ptrace and signal support") (Jul 2008) the signal frame expanded to
> 2304 bytes:
>
> struct rt_sigframe {
> struct ucontextuc;   /* 0  1696 */
> <--
> /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
> long unsigned int  _unused[2];   /*  169616 */
> unsigned int   tramp[6]; /*  171224 */
> struct siginfo *   pinfo;/*  1736 8 */
> void * puc;  /*  1744 8 */
> struct siginfo info; /*  1752   128 */
> /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
> char   abigap[288];  /*  1880   288 */
>
> /* size: 2176, cachelines: 17, members: 7 */
> /* padding: 8 */
> };
>
> 2176 + 128 = 2304
>
> At this point we should have been exposed to the bug, though as far as
> I know it was never reported. I no longer have a system old enough to
> easily test on.
>
> Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a
> grow-down stack segment") caused our stack expansion code to never
> trigger, as there was always a VMA found for a write up to PAGE_SIZE
> below r1.
>
> That meant the bug was hidden as we continued to expand the signal
> frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory
> state to the signal context") (Feb 2013):
>
> struct rt_sigframe {
> struct ucontextuc;   /* 0  1696 */
> /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
> struct ucontextuc_transact;  /*  1696  1696 */
> <--
> /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
> long unsigned int  _unused[2];   /*  339216 */
> unsigned int   tramp[6]; /*  340824 */
> struct siginfo *   pinfo;/*  3432 8 */
> void * puc;  /*  3440 8 */
> struct siginfo info; /*  3448   128 */
> /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
> char   abigap[288];  /*  3576   288 */
>
> /* size: 3872, cachelines: 31, members: 8 */
> /* padding: 8 */
> /* last cacheline: 32 bytes */
> };
>
> 3872 + 128 = 4000
>
> And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit
> userspace to 512 bytes") (Feb 2014):
>
> struct rt_sigframe {

[PATCH] powerpc/mm: Limit resize_hpt_for_hotplug() call to hash guests only

2020-07-27 Thread Bharata B Rao
During memory hotplug and unplug, resize_hpt_for_hotplug() gets called
for both hash and radix guests, but it should be called only for hash
guests. Though the call does nothing in the radix guest case, it is
cleaner to push this call into the hash-specific memory hotplug routines.

Reported-by: Nathan Lynch 
Signed-off-by: Bharata B Rao 
---
Tested with memory hotplug and unplug for hash and radix KVM guests.

 arch/powerpc/include/asm/sparsemem.h  | 6 --
 arch/powerpc/mm/book3s64/hash_utils.c | 8 +++-
 arch/powerpc/mm/mem.c | 5 -
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index c89b32443cff..1e6fa371cc38 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,12 +17,6 @@ extern int create_section_mapping(unsigned long start, 
unsigned long end,
  int nid, pgprot_t prot);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
-#ifdef CONFIG_PPC_BOOK3S_64
-extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
-#else
-static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { return 
0; }
-#endif
-
 #ifdef CONFIG_NUMA
 extern int hot_add_scn_to_nid(unsigned long scn_addr);
 #else
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 9fdabea04990..30a4a91d9987 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -785,7 +785,7 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int resize_hpt_for_hotplug(unsigned long new_mem_size)
+static int resize_hpt_for_hotplug(unsigned long new_mem_size)
 {
unsigned target_hpt_shift;
 
@@ -819,6 +819,8 @@ int hash__create_section_mapping(unsigned long start, 
unsigned long end,
return -1;
}
 
+   resize_hpt_for_hotplug(memblock_phys_mem_size());
+
rc = htab_bolt_mapping(start, end, __pa(start),
   pgprot_val(prot), mmu_linear_psize,
   mmu_kernel_ssize);
@@ -836,6 +838,10 @@ int hash__remove_section_mapping(unsigned long start, 
unsigned long end)
int rc = htab_remove_mapping(start, end, mmu_linear_psize,
 mmu_kernel_ssize);
WARN_ON(rc < 0);
+
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   pr_warn("Hash collision while resizing HPT\n");
+
return rc;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c2c11eb8dcfc..9dafc636588f 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -127,8 +127,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
 
-   resize_hpt_for_hotplug(memblock_phys_mem_size());
-
start = (unsigned long)__va(start);
rc = create_section_mapping(start, start + size, nid,
params->pgprot);
@@ -161,9 +159,6 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 * hit that section of memory
 */
vm_unmap_aliases();
-
-   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
-   pr_warn("Hash collision while resizing HPT\n");
 }
 #endif
 
-- 
2.26.2



Re: [PATCH] KVM: PPC: Book3S HV: increase KVMPPC_NR_LPIDS on POWER8 and POWER9

2020-07-27 Thread Cédric Le Goater
On 7/23/20 8:20 AM, Paul Mackerras wrote:
> On Mon, Jun 08, 2020 at 01:57:14PM +0200, Cédric Le Goater wrote:
>> POWER8 and POWER9 have 12-bit LPIDs. Change LPID_RSVD to support up to
>> (4096 - 2) guests on these processors. POWER7 is kept the same with a
>> limitation of (1024 - 2), but it might be time to drop KVM support for
>> POWER7.
>>
>> Tested with 2048 guests * 4 vCPUs on a witherspoon system with 512G
>> RAM and a bit of swap.
>>
>> Signed-off-by: Cédric Le Goater 
> 
> Thanks, patch applied to my kvm-ppc-next branch.

We have pushed the limits further on a 1TB system and reached the limit
of 4094 guests with 16 vCPUs. 

With more vCPUs, the system starts to check-stop. We believe that the 
pages used by the interrupt controller for the backing store of the 
XIVE internal tables (END and NVT) allocated with GFP_KERNEL are 
reclaimable.

I am thinking of changing the allocation flags to:

__GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC

because XIVE should be able to fail gracefully if the system is
low on memory. Is that correct?
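
A minimal sketch of the kind of change I have in mind (the helper name and
call site below are only an illustration, not the actual XIVE code, and it
assumes the new flags are added on top of GFP_KERNEL):

	/* Illustrative fragment only: fail fast instead of retrying hard
	 * or dipping into reserves when memory is tight. */
	static struct page *xive_table_page_alloc(unsigned int order)
	{
		gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN |
			    __GFP_NOMEMALLOC;

		return alloc_pages(gfp, order);
	}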

Thanks,  

C.


Re: [PATCH] powerpc/64s/hash: Fix hash_preload running with interrupts enabled

2020-07-27 Thread Athira Rajeev



> On 27-Jul-2020, at 11:39 AM, Nicholas Piggin  wrote:
> 
> Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
> caller") removed the local_irq_disable from hash_preload, but it was
> required for more than just the page table walk: the hash pte busy bit is
> effectively a lock which may be taken in interrupt context, and the local
> update flag test must not be preempted before it's used.
> 
> This solves apparent lockups with perf interrupting __hash_page_64K. If
> get_perf_callchain then also takes a hash fault on the same page while it
> is already locked, it will loop forever taking hash faults, which looks like
> this:
> 
> cpu 0x49e: Vector: 100 (System Reset) at [c0001a4f7d70]
>pc: c0072dc8: hash_page_mm+0x8/0x800
>lr: c000c5a4: do_hash_page+0x24/0x38
>sp: c0002ac1cc69ac70
>   msr: 80081033
>  current = 0xc0002ac1cc602e00
>  paca= 0xc0001de1f280   irqmask: 0x03   irq_happened: 0x01
>pid   = 20118, comm = pread2_processe
> Linux version 5.8.0-rc6-00345-g1fad14f18bc6
> 49e:mon> t
> [c0002ac1cc69ac70] c000c5a4 do_hash_page+0x24/0x38 (unreliable)
> --- Exception: 300 (Data Access) at c008fa60 
> __copy_tofrom_user_power7+0x20c/0x7ac
> [link register   ] c0335d10 copy_from_user_nofault+0xf0/0x150
> [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
> [c0002ac1cc69afa0] c0109df0 read_user_stack_64+0x70/0xf0
> [c0002ac1cc69afd0] c0109fcc perf_callchain_user_64+0x15c/0x410
> [c0002ac1cc69b060] c0109c00 perf_callchain_user+0x20/0x40
> [c0002ac1cc69b080] c031c6cc get_perf_callchain+0x25c/0x360
> [c0002ac1cc69b120] c0316b50 perf_callchain+0x70/0xa0
> [c0002ac1cc69b140] c0316ddc perf_prepare_sample+0x25c/0x790
> [c0002ac1cc69b1a0] c0317350 perf_event_output_forward+0x40/0xb0
> [c0002ac1cc69b220] c0306138 __perf_event_overflow+0x88/0x1a0
> [c0002ac1cc69b270] c010cf70 record_and_restart+0x230/0x750
> [c0002ac1cc69b620] c010d69c perf_event_interrupt+0x20c/0x510
> [c0002ac1cc69b730] c0027d9c performance_monitor_exception+0x4c/0x60
> [c0002ac1cc69b750] c000b2f8 
> performance_monitor_common_virt+0x1b8/0x1c0
> --- Exception: f00 (Performance Monitor) at c00cb5b0 
> pSeries_lpar_hpte_insert+0x0/0x160
> [link register   ] c00846f0 __hash_page_64K+0x210/0x540
> [c0002ac1cc69ba50]  (unreliable)
> [c0002ac1cc69bb00] c0073ae0 update_mmu_cache+0x390/0x3a0
> [c0002ac1cc69bb70] c037f024 wp_page_copy+0x364/0xce0
> [c0002ac1cc69bc20] c038272c do_wp_page+0xdc/0xa60
> [c0002ac1cc69bc70] c03857bc handle_mm_fault+0xb9c/0x1b60
> [c0002ac1cc69bd50] c006c434 __do_page_fault+0x314/0xc90
> [c0002ac1cc69be20] c000c5c8 handle_page_fault+0x10/0x2c
> --- Exception: 300 (Data Access) at 7fff8c861fe8
> SP (76b19660) is in userspace
> 
> Reported-by: Athira Rajeev 
> Reported-by: Anton Blanchard 
> Reviewed-by: Aneesh Kumar K.V 
> Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
> caller")
> Signed-off-by: Nicholas Piggin 


Hi,

Tested with the patch and it fixes the lockups I was seeing with my test run.
Thanks for the fix.

Tested-by: Athira Rajeev 

> ---
> arch/powerpc/kernel/exceptions-64s.S  | 14 +++---
> arch/powerpc/mm/book3s64/hash_utils.c | 25 +
> arch/powerpc/perf/core-book3s.c   |  6 ++
> 3 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 0fc8bad878b2..446e54c3f71e 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -3072,10 +3072,18 @@ do_hash_page:
>   ori r0,r0,DSISR_BAD_FAULT_64S@l
>   and.r0,r5,r0/* weird error? */
>   bne-handle_page_fault   /* if not, try to insert a HPTE */
> +
> + /*
> +  * If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
> +  * don't call hash_page, just fail the fault. This is required to
> +  * prevent re-entrancy problems in the hash code, namely perf
> +  * interrupts hitting while something holds H_PAGE_BUSY, and taking a
> +  * hash fault. See the comment in hash_preload().
> +  */
>   ld  r11, PACA_THREAD_INFO(r13)
> - lwz r0,TI_PREEMPT(r11)  /* If we're in an "NMI" */
> - andis.  r0,r0,NMI_MASK@h/* (i.e. an irq when soft-disabled) */
> - bne 77f /* then don't call hash_page now */
> + lwz r0,TI_PREEMPT(r11)
> + andis.  r0,r0,NMI_MASK@h
> + bne 77f
> 
>   /*
>* r3 contains the trap number
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
> b/arch/powerpc/mm/book3s64/hash_utils.c
> index 468169e33c86..9b9f92ad0e7a 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ 

[PATCH v2] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE

2020-07-27 Thread Aneesh Kumar K.V
This adds a kernel command line option that can be used to disable GTSE support.
Disabling GTSE implies the kernel will make hcalls to invalidate TLB entries.

This was done so that we can do VM migration between configs that enable/disable
GTSE support via the hypervisor. To migrate a VM from a system that supports
GTSE to a system that doesn't, we can boot the guest with
radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
invalidates.
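
For example, the guest could be booted with something like the following on
its kernel command line (illustrative only; everything apart from the new
parameter depends on the guest configuration):

	root=/dev/vda2 ro radix_hcall_invalidate=on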

The check for hcall availability is done in pSeries_setup_arch so that
the panic message appears on the console. This should only happen on
a hypervisor that doesn't force the guest to hash translation even
though it can't handle the radix GTSE=0 request via CAS. With
radix_hcall_invalidate=on, if the hypervisor doesn't support the
hcall_rpt_invalidate hcall, it should force the LPAR to hash translation.

Signed-off-by: Aneesh Kumar K.V 
---
Changes from v1:
* rename kernel parameter
* Drop a kernel warn

 Documentation/admin-guide/kernel-parameters.txt |  4 
 arch/powerpc/include/asm/firmware.h |  4 +++-
 arch/powerpc/kernel/prom_init.c | 13 +
 arch/powerpc/mm/init_64.c   |  1 -
 arch/powerpc/platforms/pseries/firmware.c   |  1 +
 arch/powerpc/platforms/pseries/setup.c  |  5 +
 6 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..3ab61cd0f89c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -896,6 +896,10 @@
disable_radix   [PPC]
Disable RADIX MMU mode on POWER9
 
+   radix_hcall_invalidate=on  [PPC/PSERIES]
+   Disable RADIX GTSE feature and use hcall for TLB
+   invalidate.
+
disable_tlbie   [PPC]
Disable TLBIE instruction. Currently does not work
with KVM, with HASH MMU, or with coherent accelerators.
diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCMASM_CONST(0x0020)
 #define FW_FEATURE_ULTRAVISOR  ASM_CONST(0x0040)
 #define FW_FEATURE_STUFF_TCE   ASM_CONST(0x0080)
+#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100)
 
 #ifndef __ASSEMBLY__
 
@@ -71,7 +72,8 @@ enum {
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+   FW_FEATURE_RPT_INVALIDATE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index cbc605cfdec0..f279a1f58fa7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -169,6 +169,7 @@ static unsigned long __prombss prom_tce_alloc_end;
 
 #ifdef CONFIG_PPC_PSERIES
 static bool __prombss prom_radix_disable;
+static bool __prombss prom_radix_gtse_disable;
 static bool __prombss prom_xive_disable;
 #endif
 
@@ -823,6 +824,12 @@ static void __init early_cmdline_parse(void)
if (prom_radix_disable)
prom_debug("Radix disabled from cmdline\n");
 
+   opt = prom_strstr(prom_cmd_line, "radix_hcall_invalidate=on");
+   if (opt) {
+   prom_radix_gtse_disable = true;
+   prom_debug("Radix GTSE disabled from cmdline\n");
+   }
+
opt = prom_strstr(prom_cmd_line, "xive=off");
if (opt) {
prom_xive_disable = true;
@@ -1285,10 +1292,8 @@ static void __init prom_parse_platform_support(u8 index, 
u8 val,
prom_parse_mmu_model(val & OV5_FEAT(OV5_MMU_SUPPORT), support);
break;
case OV5_INDX(OV5_RADIX_GTSE): /* Radix Extensions */
-   if (val & OV5_FEAT(OV5_RADIX_GTSE)) {
-   prom_debug("Radix - GTSE supported\n");
-   support->radix_gtse = true;
-   }
+   if (val & OV5_FEAT(OV5_RADIX_GTSE))
+   support->radix_gtse = !prom_radix_gtse_disable;
break;
case OV5_INDX(OV5_XIVE_SUPPORT): /* Interrupt mode */
prom_parse_xive_model(val & OV5_FEAT(OV5_XIVE_SUPPORT),
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 152aa0200cef..4ae5fc0ceb30 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -406,7 +406,6 @@ static void __init 

Re: [PATCH v2 2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame

2020-07-27 Thread Gabriel Paubert
On Fri, Jul 24, 2020 at 07:25:25PM +1000, Michael Ellerman wrote:
> We have powerpc specific logic in our page fault handling to decide if
> an access to an unmapped address below the stack pointer should expand
> the stack VMA.
> 
> The code was originally added in 2004 "ported from 2.4". The rough
> logic is that the stack is allowed to grow to 1MB with no extra
> checking. Over 1MB the access must be within 2048 bytes of the stack
> pointer, or be from a user instruction that updates the stack pointer.
> 
> The 2048 byte allowance below the stack pointer is there to cover the
> 288 byte "red zone" as well as the "about 1.5kB" needed by the signal
> delivery code.
> 
> Unfortunately since then the signal frame has expanded, and is now
> 4224 bytes on 64-bit kernels with transactional memory enabled.

Are there really users of transactional memory in the wild? 

Just asking because Power10 removes TM, and Power9 has had some issues
with it AFAICT.

Getting rid of it (if possible) would result in smaller signal frames,
with simpler signal delivery code (probably slightly faster also).

Gabriel
 



[powerpc:next-test] BUILD SUCCESS 78807804b0854ecb7dc6906e379fc688aca36456

2020-07-27 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next-test
branch HEAD: 78807804b0854ecb7dc6906e379fc688aca36456  selftests/powerpc: Add 
test for pkey siginfo verification

elapsed time: 1065m

configs tested: 58
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm  allyesconfig
arm  allmodconfig
arm64allyesconfig
arm64   defconfig
i386 allyesconfig
i386defconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68k   sun3_defconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
c6x  allyesconfig
nios2allyesconfig
openriscdefconfig
nds32   defconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
xtensa  defconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
parisc   allyesconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a016-20200727
i386 randconfig-a013-20200727
i386 randconfig-a012-20200727
i386 randconfig-a015-20200727
i386 randconfig-a011-20200727
i386 randconfig-a014-20200727
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
sparc64 defconfig
x86_64rhel-7.6-kselftests
x86_64   rhel-8.3
x86_64  kexec
x86_64   rhel
x86_64   allyesconfig
x86_64  defconfig

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[PATCH 7/7] powerpc/smp: Depend on cpu_l1_cache_map when adding cpus

2020-07-27 Thread Srikar Dronamraju
Currently on hotplug/hotunplug, the cpu iterates through all the cpus in
its core to find threads in its thread group. However this info is
already captured in cpu_l1_cache_map. Hence we can reduce the
iteration and clean up the add_cpu_to_smallcore_masks() function.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index eceb7aa0f4b8..22f4b3856470 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1276,16 +1276,15 @@ static void remove_cpu_from_masks(int cpu)
 
 static inline void add_cpu_to_smallcore_masks(int cpu)
 {
-   struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
-   int i, first_thread = cpu_first_thread_sibling(cpu);
+   int i;
 
if (!has_big_cores)
return;
 
cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu));
 
-   for (i = first_thread; i < first_thread + threads_per_core; i++) {
-   if (cpu_online(i) && cpumask_test_cpu(i, this_l1_cache_map))
+   for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) {
+   if (cpu_online(i))
set_cpus_related(i, cpu, cpu_smallcore_mask);
}
 }
-- 
2.17.1



[PATCH 5/7] powerpc/smp: Limit cpus traversed to within a node.

2020-07-27 Thread Srikar Dronamraju
All the arch-specific topology cpumasks are within a node/die.
However when setting these per-cpu cpumasks, the system traverses through
all the online cpus. This is redundant.

Reduce the traversal to only the cpus that are online in the node to which
the cpu belongs.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index cde157483abf..9b03aad0beac 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1232,7 +1232,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
*(*mask_fn)(int))
return false;
 
cpumask_set_cpu(cpu, mask_fn(cpu));
-   for_each_cpu(i, cpu_online_mask) {
+   for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
/*
 * when updating the marks the current CPU has not been marked
 * online, but we need to update the cache masks
-- 
2.17.1



[PATCH 6/7] powerpc/smp: Stop passing mask to update_mask_by_l2

2020-07-27 Thread Srikar Dronamraju
update_mask_by_l2 is called only once, but it takes cpu_l2_cache_mask
as a parameter. Instead of passing cpu_l2_cache_mask, use it directly in
update_mask_by_l2.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9b03aad0beac..eceb7aa0f4b8 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1222,7 +1222,7 @@ static struct device_node *cpu_to_l2cache(int cpu)
return cache;
 }
 
-static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
+static bool update_mask_by_l2(int cpu)
 {
struct device_node *l2_cache, *np;
int i;
@@ -1231,7 +1231,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
*(*mask_fn)(int))
if (!l2_cache)
return false;
 
-   cpumask_set_cpu(cpu, mask_fn(cpu));
+   cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
/*
 * when updating the marks the current CPU has not been marked
@@ -1242,7 +1242,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
*(*mask_fn)(int))
continue;
 
if (np == l2_cache)
-   set_cpus_related(cpu, i, mask_fn);
+   set_cpus_related(cpu, i, cpu_l2_cache_mask);
 
of_node_put(np);
}
@@ -1306,7 +1306,7 @@ static void add_cpu_to_masks(int cpu)
set_cpus_related(i, cpu, cpu_sibling_mask);
 
add_cpu_to_smallcore_masks(cpu);
-   update_mask_by_l2(cpu, cpu_l2_cache_mask);
+   update_mask_by_l2(cpu);
 
if (has_coregroup_support()) {
int coregroup_id = cpu_to_coregroup_id(cpu);
-- 
2.17.1



[PATCH 4/7] powerpc/smp: Optimize remove_cpu_from_masks

2020-07-27 Thread Srikar Dronamraju
Currently while offlining a cpu, we iterate through all the cpus in the
DIE to clear the sibling, l2_cache and smallcore maps. However if there is
a larger number of cores in a DIE, we end up spending more time iterating
through cpus which are completely unrelated.

Optimize this by only iterating through the smaller but relevant cpumask.
If shared_caches is set, cpu_l2_cache_map is the relevant mask, else
cpu_sibling_map is.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index d476098fc25c..cde157483abf 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1254,14 +1254,21 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
*(*mask_fn)(int))
 #ifdef CONFIG_HOTPLUG_CPU
 static void remove_cpu_from_masks(int cpu)
 {
+   struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
-   for_each_cpu(i, cpu_cpu_mask(cpu)) {
+   if (shared_caches)
+   mask_fn = cpu_l2_cache_mask;
+
+   for_each_cpu(i, mask_fn(cpu)) {
set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
set_cpus_unrelated(cpu, i, cpu_sibling_mask);
if (has_big_cores)
set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
-   if (has_coregroup_support())
+   }
+
+   if (has_coregroup_support()) {
+   for_each_cpu(i, cpu_coregroup_mask(cpu))
set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
}
 }
-- 
2.17.1



[PATCH 3/7] powerpc/smp: Remove get_physical_package_id

2020-07-27 Thread Srikar Dronamraju
Now that cpu_core_mask has been removed and topology_core_cpumask has
been updated to use cpu_cpu_mask, we no longer need
get_physical_package_id.

Please note get_physical_package_id is an exported symbol. However
it was introduced recently and probably has no users outside the kernel.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h |  5 -
 arch/powerpc/kernel/smp.c   | 20 
 2 files changed, 25 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e0f232533c9d..e45219f74be0 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -114,12 +114,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #ifdef CONFIG_PPC64
 #include 
 
-#ifdef CONFIG_PPC_SPLPAR
-int get_physical_package_id(int cpu);
-#define topology_physical_package_id(cpu)  (get_physical_package_id(cpu))
-#else
 #define topology_physical_package_id(cpu)  (cpu_to_chip_id(cpu))
-#endif
 
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c28e1b4957b..d476098fc25c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1283,26 +1283,6 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
}
 }
 
-int get_physical_package_id(int cpu)
-{
-   int pkg_id = cpu_to_chip_id(cpu);
-
-   /*
-* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
-* defined. Hence we would return the chip-id as the result of
-* get_physical_package_id.
-*/
-   if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) &&
-   IS_ENABLED(CONFIG_PPC_SPLPAR)) {
-   struct device_node *np = of_get_cpu_node(cpu, NULL);
-   pkg_id = of_node_to_nid(np);
-   of_node_put(np);
-   }
-
-   return pkg_id;
-}
-EXPORT_SYMBOL_GPL(get_physical_package_id);
-
 static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);
-- 
2.17.1



[PATCH 2/7] powerpc/smp: Stop updating cpu_core_mask

2020-07-27 Thread Srikar Dronamraju
Anton Blanchard reported that his 4096 vCPU KVM guest took around 30
minutes to boot. He traced it to the time taken to iterate while
setting the cpu_core_mask.

Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU
would be equal on Power. However updating cpu_core_mask takes forever
as it's a per-cpu cpumask variable, whereas cpu_cpu_mask is a per-NODE/
per-DIE cpumask that is shared by all the respective CPUs.
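
For scale: on a 4096 vCPU guest, filling a per-cpu cpu_core_mask means very
roughly 4096 * 4096 (~16.7 million) mask-update iterations over the course
of a boot, whereas a shared per-node cpu_cpu_mask is touched only once per
CPU.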

Also cpu_cpu_mask is needed from a scheduler perspective. However
cpu_core_map is an exported symbol. Hence stop updating cpu_core_map
and make it point to cpu_cpu_mask.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/smp.h |  5 -
 arch/powerpc/kernel/smp.c  | 33 +++--
 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 5bdc17a7049f..cf6e7c7be62b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -119,11 +119,6 @@ static inline struct cpumask *cpu_sibling_mask(int cpu)
return per_cpu(cpu_sibling_map, cpu);
 }
 
-static inline struct cpumask *cpu_core_mask(int cpu)
-{
-   return per_cpu(cpu_core_map, cpu);
-}
-
 static inline struct cpumask *cpu_l2_cache_mask(int cpu)
 {
return per_cpu(cpu_l2_cache_map, cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 95f0bf72e283..8c28e1b4957b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -957,12 +957,17 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
local_memory_node(numa_cpu_lookup_table[cpu]));
}
 #endif
+   /*
+* cpu_core_map is no more updated and exists only since
+* its been exported for long. It only will have a snapshot
+* of cpu_cpu_mask.
+*/
+   cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
}
 
/* Init the cpumasks so the boot CPU is related to itself */
cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
-   cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
 
if (has_coregroup_support())
cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
@@ -1251,9 +1256,7 @@ static void remove_cpu_from_masks(int cpu)
 {
int i;
 
-   /* NB: cpu_core_mask is a superset of the others */
-   for_each_cpu(i, cpu_core_mask(cpu)) {
-   set_cpus_unrelated(cpu, i, cpu_core_mask);
+   for_each_cpu(i, cpu_cpu_mask(cpu)) {
set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
set_cpus_unrelated(cpu, i, cpu_sibling_mask);
if (has_big_cores)
@@ -1303,7 +1306,6 @@ EXPORT_SYMBOL_GPL(get_physical_package_id);
 static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);
-   int pkg_id = get_physical_package_id(cpu);
int i;
 
/*
@@ -1311,7 +1313,6 @@ static void add_cpu_to_masks(int cpu)
 * add it to it's own thread sibling mask.
 */
cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
-   cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
for (i = first_thread; i < first_thread + threads_per_core; i++)
if (cpu_online(i))
@@ -1333,26 +1334,6 @@ static void add_cpu_to_masks(int cpu)
set_cpus_related(cpu, i, cpu_coregroup_mask);
}
}
-
-   if (pkg_id == -1) {
-   struct cpumask *(*mask)(int) = cpu_sibling_mask;
-
-   /*
-* Copy the sibling mask into core sibling mask and
-* mark any CPUs on the same chip as this CPU.
-*/
-   if (shared_caches)
-   mask = cpu_l2_cache_mask;
-
-   for_each_cpu(i, mask(cpu))
-   set_cpus_related(cpu, i, cpu_core_mask);
-
-   return;
-   }
-
-   for_each_cpu(i, cpu_online_mask)
-   if (get_physical_package_id(i) == pkg_id)
-   set_cpus_related(cpu, i, cpu_core_mask);
 }
 
 /* Activate a secondary processor. */
-- 
2.17.1



[PATCH 1/7] powerpc/topology: Update topology_core_cpumask

2020-07-27 Thread Srikar Dronamraju
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs.
cpu_cpu_mask is needed by the scheduler, hence look at deprecating
cpu_core_mask. Before deleting cpu_core_mask, ensure its only user
is moved to cpu_cpu_mask.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 6609174918ab..e0f232533c9d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -122,7 +122,7 @@ int get_physical_package_id(int cpu);
 #endif
 
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
-#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
+#define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu))
 #define topology_core_id(cpu)  (cpu_to_core_id(cpu))
 
 #endif
-- 
2.17.1



[PATCH 0/7] Optimization to improve cpu online/offline on Powerpc

2020-07-27 Thread Srikar Dronamraju
Anton reported that his 4096 CPU system (1024 cores in a socket) was taking
too long to boot. He also analyzed that most of the time was being spent
updating cpu_core_mask.

Here are some optimizations and fixes to make ppc64_cpu --smt=8/ppc64_cpu
--smt=1 run faster and hence boot the kernel also faster.

It's based on top of my v4 coregroup support patchset.
http://lore.kernel.org/lkml/20200727053230.19753-1-sri...@linux.vnet.ibm.com/t/#u

The first two patches should solve Anton's immediate problem.
With the unofficial patches, Anton reported that the boot time came down
from 30 minutes to 6 seconds (basically a high core count in a single-socket
configuration). Satheesh also reported similar numbers.

The rest are simple cleanups/optimizations.

Since cpu_core_mask has been an exported symbol for a long time, let's
retain it as a snapshot of cpumask_of_node.

Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  4
Core(s) per socket:  20
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9, altivec supported
CPU max MHz: 3800.
CPU min MHz: 2166.
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node8 CPU(s):   80-159

without patch (powerpc/next)
[0.099347] smp: Bringing up secondary CPUs ...
[0.832513] smp: Brought up 2 nodes, 160 CPUs

with powerpc/next + coregroup support patchset
[0.099241] smp: Bringing up secondary CPUs ...
[0.835627] smp: Brought up 2 nodes, 160 CPUs

with powerpc/next + coregroup + this patchset
[0.097232] smp: Bringing up secondary CPUs ...
[0.528457] smp: Brought up 2 nodes, 160 CPUs

x ppc64_cpu --smt=1
+ ppc64_cpu --smt=4

without patch
    N       Min       Max    Median       Avg      Stddev
x 100     11.82     17.06     14.01     14.05      1.2665247
+ 100     12.25     16.59     13.86     14.1143    1.164293

with patch
    N       Min       Max    Median       Avg      Stddev
x 100     12.68     16.15     14.24     14.238     0.75489246
+ 100     12.93     15.85     14.35     14.2897    0.60041813

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 

Srikar Dronamraju (7):
  powerpc/topology: Update topology_core_cpumask
  powerpc/smp: Stop updating cpu_core_mask
  powerpc/smp: Remove get_physical_package_id
  powerpc/smp: Optimize remove_cpu_from_masks
  powerpc/smp: Limit cpus traversed to within a node.
  powerpc/smp: Stop passing mask to update_mask_by_l2
  powerpc/smp: Depend on cpu_l1_cache_map when adding cpus

 arch/powerpc/include/asm/smp.h  |  5 --
 arch/powerpc/include/asm/topology.h |  7 +--
 arch/powerpc/kernel/smp.c   | 79 +
 3 files changed, 24 insertions(+), 67 deletions(-)

-- 
2.17.1



Re: [PATCH] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE

2020-07-27 Thread Bharata B Rao
On Fri, Jul 24, 2020 at 01:26:00PM +0530, Aneesh Kumar K.V wrote:
> This adds a kernel command line option that can be used to disable GTSE 
> support.
> Disabling GTSE implies kernel will make hcalls to invalidate TLB entries.
> 
> This was done so that we can do VM migration between configs that 
> enable/disable
> GTSE support via hypervisor. To migrate a VM from a system that supports
> GTSE to a system that doesn't, we can boot the guest with radix_gtse=off, 
> thereby
> forcing the guest to use hcalls for TLB invalidates.
> 
> The check for hcall availability is done in pSeries_setup_arch so that
> the panic message appears on the console. This should only happen on
> a hypervisor that doesn't force the guest to hash translation even
> though it can't handle the radix GTSE=0 request via CAS. With radix_gtse=off
> if the hypervisor doesn't support hcall_rpt_invalidate hcall it should
> force the LPAR to hash translation.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  3 +++
>  arch/powerpc/include/asm/firmware.h |  4 +++-
>  arch/powerpc/kernel/prom_init.c | 13 +
>  arch/powerpc/platforms/pseries/firmware.c   |  1 +
>  arch/powerpc/platforms/pseries/setup.c  |  5 +
>  5 files changed, 21 insertions(+), 5 deletions(-)
 
Tested:

1. radix_gtse=off with the KVM implementation of the H_RPT_INVALIDATE hcall:
   the TLB flush calls get off-loaded to hcalls.
2. radix_gtse=off w/o the H_RPT_INVALIDATE hcall: the guest kernel panics
   as per design.

However in both cases, the guest kernel prints out
"WARNING: Hypervisor doesn't support RADIX with GTSE" which can be a bit
confusing in case 1, as GTSE has been disabled by the guest and the
hypervisor is capable of supporting the same via hcall.

Regards,
Bharata.


Re: [PATCH -next] powerpc/papr_scm: Make some symbols static

2020-07-27 Thread Michael Ellerman
On Sat, 25 Jul 2020 17:19:49 +0800, Wei Yongjun wrote:
> The sparse tool complains as follows:
> 
> arch/powerpc/platforms/pseries/papr_scm.c:97:1: warning:
>  symbol 'papr_nd_regions' was not declared. Should it be static?
> arch/powerpc/platforms/pseries/papr_scm.c:98:1: warning:
>  symbol 'papr_ndr_lock' was not declared. Should it be static?
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/papr_scm: Make some symbols static
  https://git.kernel.org/powerpc/c/19a551b254e6c308348a46a65332aa03c01767ed

cheers


Re: [PATCH v2] powerpc/numa: Limit possible nodes to within num_possible_nodes

2020-07-27 Thread Michael Ellerman
On Fri, 24 Jul 2020 16:28:09 +0530, Srikar Dronamraju wrote:
> MAX_NUMNODES is the theoretical maximum number of nodes that is supported
> by the kernel. Device tree properties expose the number of possible
> nodes on the current platform. The kernel detects this and
> uses it for most of its resource allocations.  If the platform now
> increases the nodes to over what was already exposed, then it may lead
> to inconsistencies. Hence limit it to the already exposed nodes.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/numa: Limit possible nodes to within num_possible_nodes
  https://git.kernel.org/powerpc/c/dbce456280857f329af9069af5e48a9b6ebad146

cheers


Re: [PATCH 0/9] powerpc: delete duplicated words

2020-07-27 Thread Michael Ellerman
On Sat, 25 Jul 2020 17:38:00 -0700, Randy Dunlap wrote:
> Drop duplicated words in arch/powerpc/ header files.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> 
> [...]

Applied to powerpc/next.

[1/9] powerpc/book3s/mmu-hash.h: delete duplicated word
  https://git.kernel.org/powerpc/c/10a4a016d6a882ba7601159b0f719330b102c41b
[2/9] powerpc/book3s/radix-4k.h: delete duplicated word
  https://git.kernel.org/powerpc/c/92be1fca08eabe8ab083b1dfccd3e932b4fb6f1a
[3/9] powerpc/cputime.h: delete duplicated word
  https://git.kernel.org/powerpc/c/dc9bf323d6b8996d22c111add0ac8b0c895dcf52
[4/9] powerpc/epapr_hcalls.h: delete duplicated words
  https://git.kernel.org/powerpc/c/8965aa4b684f022c4d0bc6429097ddb38a26eaef
[5/9] powerpc/hw_breakpoint.h: delete duplicated word
  https://git.kernel.org/powerpc/c/028cc22d29959b501add32fc62660e5484c8092d
[6/9] powerpc/ppc_asm.h: delete duplicated word
  https://git.kernel.org/powerpc/c/db10f554268b29e3c5bfd1e44ef53a1e25c9
[7/9] powerpc/reg.h: delete duplicated word
  https://git.kernel.org/powerpc/c/850659392abc303d41c3f9217d45ab4fa79d201c
[8/9] powerpc/smu.h: delete duplicated word
  https://git.kernel.org/powerpc/c/3b56ed4b461fd92b66f6ea44d81837e12878031f
[9/9] powerpc/powernv/pci.h: delete duplicated word
  https://git.kernel.org/powerpc/c/86052e407e8e1964c81965de25832258875a0e6d

cheers


Re: [PATCH v5 00/10] powerpc/watchpoint: Enable 2nd DAWR on baremetal and powervm

2020-07-27 Thread Michael Ellerman
On Thu, 23 Jul 2020 14:38:03 +0530, Ravi Bangoria wrote:
> Last series[1] was to add basic infrastructure support for more than
> one watchpoint on Book3S powerpc. This series actually enables the 2nd
> DAWR for baremetal and powervm. Kvm guest is still not supported.
> 
> v4: 
> https://lore.kernel.org/r/20200717040958.70561-1-ravi.bango...@linux.ibm.com
> 
> v4->v5:
>  - Using hardcoded values instead of macros HBP_NUM_ONE and HBP_NUM_TWO.
>Comment above HBP_NUM_MAX changed to explain it's value.
>  - Included CPU_FTR_DAWR1 into CPU_FTRS_POWER10
>  - Using generic function feat_enable() instead of
>feat_enable_debug_facilities_v31() to enable CPU_FTR_DAWR1.
>  - ISA still includes 512B boundary in match criteria. But that's a
>documentation mistake. Mentioned about this in the last patch.
>  - Rebased to powerpc/next
>  - Added Jordan's Reviewed-by/Tested-by tags
> 
> [...]

Applied to powerpc/next.

[01/10] powerpc/watchpoint: Fix 512 byte boundary limit

https://git.kernel.org/powerpc/c/3190ecbfeeb2ab17778887ce3fa964615d6460fd
[02/10] powerpc/watchpoint: Fix DAWR exception constraint

https://git.kernel.org/powerpc/c/f6780ce619f8daa285760302d56e95892087bd1f
[03/10] powerpc/watchpoint: Fix DAWR exception for CACHEOP

https://git.kernel.org/powerpc/c/f3c832f1350bcf1e6906113ee3168066f4235dbe
[04/10] powerpc/watchpoint: Enable watchpoint functionality on power10 guest

https://git.kernel.org/powerpc/c/8f460a8175e6d85537d581734e9fa7ef97036b1a
[05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR

https://git.kernel.org/powerpc/c/dc1cedca54704d336c333b5398daaf13b23e391b
[06/10] powerpc/watchpoint: Set CPU_FTR_DAWR1 based on pa-features bit

https://git.kernel.org/powerpc/c/8f45ca3f8b87c4810674fbfe65de6d041ee0baee
[07/10] powerpc/watchpoint: Rename current H_SET_MODE DAWR macro

https://git.kernel.org/powerpc/c/6f3fe297f95134e9b2386dae0067bf530e1ddca0
[08/10] powerpc/watchpoint: Guest support for 2nd DAWR hcall

https://git.kernel.org/powerpc/c/03f3e54abd95061ea11bdb4eedbe3cab6553704f
[09/10] powerpc/watchpoint: Return available watchpoints dynamically

https://git.kernel.org/powerpc/c/deb2bd9bcc8428d4b65b6ba640ba8b57c1b20b17
[10/10] powerpc/watchpoint: Remove 512 byte boundary

https://git.kernel.org/powerpc/c/3f31e49dc4588d396023028791e36c23235e1334

cheers


Re: [PATCH v3 01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()

2020-07-27 Thread Michael Ellerman
On Sat, 25 Jul 2020 18:12:18 +1000, Oliver O'Halloran wrote:
> This function is a one line wrapper around eeh_phb_pe_create() and despite
> the name it doesn't create any eeh_dev structures. Replace it with direct
> calls to eeh_phb_pe_create() since that does what it says on the tin
> and removes a layer of indirection.

Applied to powerpc/next.

[01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()

https://git.kernel.org/powerpc/c/475028efc708880e16e61cc4cbbc00af784cb39b
[02/14] powerpc/eeh: Remove eeh_dev.c

https://git.kernel.org/powerpc/c/d74ee8e9d12e2071014ecec96a1ce2744f77639d
[03/14] powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev

https://git.kernel.org/powerpc/c/dffa91539e80355402c0716a91af17fc8ddd1abf
[04/14] powerpc/pseries: Stop using pdn->pe_number

https://git.kernel.org/powerpc/c/c408ce9075b8e1533f30fd3a113b75fb745f722f
[05/14] powerpc/eeh: Kill off eeh_ops->get_pe_addr()

https://git.kernel.org/powerpc/c/a40db934312cb2a4bef16b3edc962bc8c7f6462f
[06/14] powerpc/eeh: Remove VF config space restoration

https://git.kernel.org/powerpc/c/21b43bd59c7838825b94eea288333affb53dd399
[07/14] powerpc/eeh: Pass eeh_dev to eeh_ops->restore_config()

https://git.kernel.org/powerpc/c/0c2c76523c04ac184c7d7bbb8756f603375b7fc4
[08/14] powerpc/eeh: Pass eeh_dev to eeh_ops->resume_notify()

https://git.kernel.org/powerpc/c/8225d543dc0170e5b61af8559af07ec4f26f0bd6
[09/14] powerpc/eeh: Pass eeh_dev to eeh_ops->{read|write}_config()

https://git.kernel.org/powerpc/c/17d2a4870467bc8e8966304c08980571da943558
[10/14] powerpc/eeh: Remove spurious use of pci_dn in eeh_dump_dev_log

https://git.kernel.org/powerpc/c/1a303d8844d082ef58ff5fc3005b99621a3263ba
[11/14] powerpc/eeh: Remove class code field from edev

https://git.kernel.org/powerpc/c/768a42845b9ecdb28ba1991e17088b7eeb23a3eb
[12/14] powerpc/eeh: Rename eeh_{add_to|remove_from}_parent_pe()

https://git.kernel.org/powerpc/c/d923ab7a96fcc2b46aac9b2fc38ffdca72436fd1
[13/14] powerpc/eeh: Drop pdn use in eeh_pe_tree_insert()

https://git.kernel.org/powerpc/c/31595ae5aece519be5faa2e2013278ce45894d26
[14/14] powerpc/eeh: Move PE tree setup into the platform

https://git.kernel.org/powerpc/c/a131bfc69bc868083a6c7f9b5dad1331902a3534

cheers


Re: [PATCH v2 01/16] powernv/pci: Add pci_bus_to_pnvhb() helper

2020-07-27 Thread Michael Ellerman
On Wed, 22 Jul 2020 16:57:00 +1000, Oliver O'Halloran wrote:
> Add a helper to go from a pci_bus structure to the pnv_phb that hosts that
> bus. There's a lot of instances of the following pattern:
> 
>   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>   struct pnv_phb *phb = hose->private_data;
> 
> Without any other uses of the pci_controller inside the function. This is
> hard to read since it requires you to memorise the contents of the
> private data fields and kind of error prone since it involves blindly
> assigning a void pointer. Add a helper to make it more concise and
> explicit.

Applied to powerpc/next.

[01/16] powerpc/powernv/pci: Add pci_bus_to_pnvhb() helper

https://git.kernel.org/powerpc/c/5609ffddd19dd52019d78b197e86b0331aeef8ae
[02/16] powerpc/powernv/pci: Always tear down DMA windows on PE release

https://git.kernel.org/powerpc/c/7a52ffabe867c0d93e47af113e5107340974047a
[03/16] powerpc/powernv/pci: Add explicit tracking of the DMA setup state

https://git.kernel.org/powerpc/c/01e12629af4e0e4864ed4d83e07783d7cb5b06be
[04/16] powerpc/powernv/pci: Initialise M64 for IODA1 as a 1-1 window

https://git.kernel.org/powerpc/c/369633654fcb9639cd4cd0e1a448ffde3533d776
[05/16] powerpc/powernv/sriov: Move SR-IOV into a separate file

https://git.kernel.org/powerpc/c/37b59ef08c546c6f54cdc52eed749f494619a102
[06/16] powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV

https://git.kernel.org/powerpc/c/ff79e11af0979b25ecb38e4c843779d4a759a4e2
[07/16] powerpc/powernv/sriov: Rename truncate_iov

https://git.kernel.org/powerpc/c/fac248f8119170e3f8f54900985498ff6ee560bf
[08/16] powerpc/powernv/sriov: Simplify used window tracking

https://git.kernel.org/powerpc/c/ad9add529d99d195195c27abf99e42d4965d35e2
[09/16] powerpc/powernv/sriov: Factor out M64 BAR setup

https://git.kernel.org/powerpc/c/a610d35cc8780e781321ea8d002d5fef8484bf59
[10/16] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()

https://git.kernel.org/powerpc/c/a4bc676ed5c3f53781cc342b73097eb7e8d43fa5
[11/16] powerpc/powernv/sriov: Drop iov->pe_num_map[]

https://git.kernel.org/powerpc/c/d29a2488d2c020032fdb1fe052347a6021e3591d
[12/16] powerpc/powernv/sriov: De-indent setup and teardown

https://git.kernel.org/powerpc/c/052da31d45fc71238ea8bed7e9a84648a1ee0bf3
[13/16] powerpc/powernv/sriov: Move M64 BAR allocation into a helper

https://git.kernel.org/powerpc/c/39efc03e3ee8f41909b7542be70b4061b38ca277
[14/16] powerpc/powernv/sriov: Refactor M64 BAR setup

https://git.kernel.org/powerpc/c/a0be516f8160fdb4836237cba037229e88a1de7d
[15/16] powerpc/powernv/sriov: Make single PE mode a per-BAR setting

https://git.kernel.org/powerpc/c/4c51f3e1e8702cbd0e53159fc3d1f54c20c70574
[16/16] powerpc/powernv/sriov: Remove vfs_expanded

https://git.kernel.org/powerpc/c/84d8505ed1dafb2e62d49fca5e7aa7d96cfcec49

cheers


Re: [PATCH v2 01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()

2020-07-27 Thread Michael Ellerman
On Wed, 22 Jul 2020 14:26:15 +1000, Oliver O'Halloran wrote:
> This function is a one line wrapper around eeh_phb_pe_create() and despite
> the name it doesn't create any eeh_dev structures. Replace it with direct
> calls to eeh_phb_pe_create() since that does what it says on the tin
> and removes a layer of indirection.

Applied to powerpc/next.

[01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()

https://git.kernel.org/powerpc/c/475028efc708880e16e61cc4cbbc00af784cb39b
[02/14] powerpc/eeh: Remove eeh_dev.c

https://git.kernel.org/powerpc/c/d74ee8e9d12e2071014ecec96a1ce2744f77639d
[03/14] powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev

https://git.kernel.org/powerpc/c/dffa91539e80355402c0716a91af17fc8ddd1abf
[04/14] powerpc/pseries: Stop using pdn->pe_number

https://git.kernel.org/powerpc/c/c408ce9075b8e1533f30fd3a113b75fb745f722f
[05/14] powerpc/eeh: Kill off eeh_ops->get_pe_addr()

https://git.kernel.org/powerpc/c/a40db934312cb2a4bef16b3edc962bc8c7f6462f
[06/14] powerpc/eeh: Remove VF config space restoration

https://git.kernel.org/powerpc/c/21b43bd59c7838825b94eea288333affb53dd399
[07/14] powerpc/eeh: Pass eeh_dev to eeh_ops->restore_config()

https://git.kernel.org/powerpc/c/0c2c76523c04ac184c7d7bbb8756f603375b7fc4
[08/14] powerpc/eeh: Pass eeh_dev to eeh_ops->resume_notify()

https://git.kernel.org/powerpc/c/8225d543dc0170e5b61af8559af07ec4f26f0bd6
[09/14] powerpc/eeh: Pass eeh_dev to eeh_ops->{read|write}_config()

https://git.kernel.org/powerpc/c/17d2a4870467bc8e8966304c08980571da943558
[10/14] powerpc/eeh: Remove spurious use of pci_dn in eeh_dump_dev_log

https://git.kernel.org/powerpc/c/1a303d8844d082ef58ff5fc3005b99621a3263ba
[11/14] powerpc/eeh: Remove class code field from edev

https://git.kernel.org/powerpc/c/768a42845b9ecdb28ba1991e17088b7eeb23a3eb
[12/14] powerpc/eeh: Rename eeh_{add_to|remove_from}_parent_pe()

https://git.kernel.org/powerpc/c/d923ab7a96fcc2b46aac9b2fc38ffdca72436fd1
[13/14] powerpc/eeh: Drop pdn use in eeh_pe_tree_insert()

https://git.kernel.org/powerpc/c/31595ae5aece519be5faa2e2013278ce45894d26
[14/14] powerpc/eeh: Move PE tree setup into the platform

https://git.kernel.org/powerpc/c/a131bfc69bc868083a6c7f9b5dad1331902a3534

cheers


Re: [PATCH v4 0/6] powerpc: queued spinlocks and rwlocks

2020-07-27 Thread Michael Ellerman
On Fri, 24 Jul 2020 23:14:17 +1000, Nicholas Piggin wrote:
> Updated with everybody's feedback (thanks all), and more performance
> results.
> 
> What I've found is I might have been measuring the worst load point for
> the paravirt case, and by looking at a range of loads it's clear that
> queued spinlocks are overall better even on PV, doubly so when you look
> at the generally much improved worst case latencies.
> 
> [...]

Applied to powerpc/next.

[1/6] powerpc/pseries: Move some PAPR paravirt functions to their own file
  https://git.kernel.org/powerpc/c/20d444d06f97504d165b08558678b4737dcefb02
[2/6] powerpc: Move spinlock implementation to simple_spinlock
  https://git.kernel.org/powerpc/c/12d0b9d6c843e7dbe739ebefcf16c7e4a45e4e78
[3/6] powerpc/64s: Implement queued spinlocks and rwlocks
  https://git.kernel.org/powerpc/c/aa65ff6b18e0366db1790609956a4ac7308c5668
[4/6] powerpc/pseries: Implement paravirt qspinlocks for SPLPAR
  https://git.kernel.org/powerpc/c/20c0e8269e9d515e677670902c7e1cc0209d6ad9
[5/6] powerpc/qspinlock: Optimised atomic_try_cmpxchg_lock() that adds the lock 
hint
  https://git.kernel.org/powerpc/c/2f6560e652dfdbdb59df28b45a3458bf36d3c580
[6/6] powerpc: Implement smp_cond_load_relaxed()
  https://git.kernel.org/powerpc/c/49a7d46a06c30c7beabbf9d1a8ea1de0f9e4fdfe

cheers


Re: [PATCH v2] powerpc/64s: allow for clang's objdump differences

2020-07-27 Thread Michael Ellerman
On Fri, 24 Jul 2020 15:49:01 -0700, Bill Wendling wrote:
> Clang's objdump emits slightly different output from GNU's objdump,
> causing a list of warnings to be emitted during relocatable builds.
> E.g., clang's objdump emits this:
> 
>c004: 2c 00 00 48  b  0xc030
>...
>c0005c6c: 10 00 82 40  bf 2, 0xc0005c7c
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/64s: allow for clang's objdump differences
  https://git.kernel.org/powerpc/c/faedc380129501bdd7f669bf14e9c7ee3e7a2feb

cheers


Re: [PATCH][v3] powerpc/lib: remove memcpy_flushcache redundant return

2020-07-27 Thread Michael Ellerman
On Fri, 26 Apr 2019 19:36:30 +0800, Li RongQing wrote:
> Align it with other architectures; none of the callers has
> been interested in its return value

Applied to powerpc/next.

[1/1] powerpc/lib: remove memcpy_flushcache redundant return
  https://git.kernel.org/powerpc/c/e2802618970566277cf5cf5c99df66f21ee83766

cheers


Re: [PATCH 0/9] Macintosh II ADB driver fixes

2020-07-27 Thread Michael Ellerman
On Sun, 28 Jun 2020 14:23:12 +1000, Finn Thain wrote:
> Various issues with the via-macii driver have become apparent over the
> years. Some examples:
> 
>  - A Talk command response can be lost. This can result in phantom devices
> being probed or an incorrect device handler ID being retrieved.
> 
>  - A reply packet containing a null byte can get truncated. Such packets
> are sometimes generated by ADB keyboards.
> 
> [...]

Applied to powerpc/next.

[1/9] macintosh/via-macii: Access autopoll_devs when inside lock
  https://git.kernel.org/powerpc/c/59ea38f6b3af5636edf541768a1ed721eeaca99e
[2/9] macintosh/via-macii: Poll the device most likely to respond
  https://git.kernel.org/powerpc/c/f93bfeb55255bddaa16597e187a99ae6131b964a
[3/9] macintosh/via-macii: Handle /CTLR_IRQ signal correctly
  https://git.kernel.org/powerpc/c/b4d76c28eca369b8105fe3a0a9f396e3fbcd0dd5
[4/9] macintosh/via-macii: Remove read_done state
  https://git.kernel.org/powerpc/c/b16b67689baa01a5616b651356df7ad3e47a8763
[5/9] macintosh/via-macii: Handle poll replies correctly
  https://git.kernel.org/powerpc/c/624cf5b538b507293ec761797bd8ce0702fefe64
[6/9] macintosh/via-macii: Use bool type for reading_reply variable
  https://git.kernel.org/powerpc/c/f87a162572c9f7c839a207c7de6c73ffe54a777c
[7/9] macintosh/via-macii: Use unsigned type for autopoll_devs variable
  https://git.kernel.org/powerpc/c/5c0c15a1953a7de2878d7e6f5711fd3322b11faa
[8/9] macintosh/via-macii: Use the stack for reset request storage
  https://git.kernel.org/powerpc/c/046ace8256489f32740da07de55a913ca09ce5cf
[9/9] macintosh/via-macii: Clarify definition of macii_init()
  https://git.kernel.org/powerpc/c/3327e58a04501e06aa531cdb4044aab214a6254a

cheers


Re: [PATCH v2 1/2] powerpc/ptdump: Refactor update of st->last_pa

2020-07-27 Thread Michael Ellerman
On Mon, 29 Jun 2020 11:17:18 + (UTC), Christophe Leroy wrote:
> st->last_pa is always updated in note_page() so it can
> be done outside the if/elseif/else block.

Applied to powerpc/next.

[1/2] powerpc/ptdump: Refactor update of st->last_pa
  https://git.kernel.org/powerpc/c/846feeace51bce13f5c645d5bf162455b89841fd
[2/2] powerpc/ptdump: Refactor update of pg_state
  https://git.kernel.org/powerpc/c/e54e30bca40233139290aecfce932dea9b996516

cheers


Re: [PATCH 0/8] Mac ADB IOP driver fixes

2020-07-27 Thread Michael Ellerman
On Sun, 31 May 2020 09:17:03 +1000, Finn Thain wrote:
> The adb-iop driver was never finished. Some deficiencies have become
> apparent over the years. For example,
> 
>  - Mouse and/or keyboard may stop working if used together
> 
>  - SRQ autopoll list cannot be changed
> 
> [...]

Applied to powerpc/next.

[1/8] macintosh/adb-iop: Remove dead and redundant code
  https://git.kernel.org/powerpc/c/8384c82ab0860cd7db2ce4ec403e574f4ee54b6e
[2/8] macintosh/adb-iop: Correct comment text
  https://git.kernel.org/powerpc/c/ff785e179faf4bb06a2f73b8dcde6dabb66a83d2
[3/8] macintosh/adb-iop: Adopt bus reset algorithm from via-macii driver
  https://git.kernel.org/powerpc/c/303511edb859b1fbf48b3c1d1d53b33a6ebd2a2b
[4/8] macintosh/adb-iop: Access current_req and adb_iop_state when inside lock
  https://git.kernel.org/powerpc/c/aac840eca8fec02d594560647130d4e4447e10d9
[5/8] macintosh/adb-iop: Resolve static checker warnings
  https://git.kernel.org/powerpc/c/56b732edda96b1942fff974fd298ea2a2c543b94
[6/8] macintosh/adb-iop: Implement idle -> sending state transition
  https://git.kernel.org/powerpc/c/32226e81704398317e1cc5a82d24c0ef3cc25e5e
[7/8] macintosh/adb-iop: Implement sending -> idle state transition
  https://git.kernel.org/powerpc/c/e2954e5f727fad126258e83259b513988973c166
[8/8] macintosh/adb-iop: Implement SRQ autopolling
  https://git.kernel.org/powerpc/c/c66da95a39ec2bb95544c3def974d96e8c178f57

cheers


Re: [PATCH v2 0/6] powerpc/32s: Allocate modules outside of vmalloc space for STRICT_KERNEL_RWX

2020-07-27 Thread Michael Ellerman
On Mon, 29 Jun 2020 11:15:19 + (UTC), Christophe Leroy wrote:
> On book3s32 (hash), exec protection is set per 256MB segment with the NX bit.
> Instead of clearing the NX bit on vmalloc space when CONFIG_MODULES is selected,
> allocate modules in a dedicated segment (0xb000-0xbfff by default).
> This allows keeping exec protection on vmalloc space while allowing exec
> on modules.
> 
> v2:
> - Removed the two patches that fix ptdump. Will submitted independently
> - Only changing the user/kernel boundary for PPC32 now.
> - Reordered the patches inside the series.
> 
> [...]

Applied to powerpc/next.

[1/6] powerpc/lib: Prepare code-patching for modules allocated outside vmalloc space
  https://git.kernel.org/powerpc/c/ccc8fcf72a6953fbfd6998999d622295f522b952
[2/6] powerpc: Use MODULES_VADDR if defined
  https://git.kernel.org/powerpc/c/7fbc22ce29931630da200cfc90fe5a454f54a794
[3/6] powerpc/32s: Only leave NX unset on segments used for modules
  https://git.kernel.org/powerpc/c/c496433197154144c310a17939736bc5c155914d
[4/6] powerpc/32: Set user/kernel boundary at TASK_SIZE instead of PAGE_OFFSET
  https://git.kernel.org/powerpc/c/b6be1bb7f7216b9e9f33f57abe6e3290c0e66bd4
[5/6] powerpc/32s: Kernel space starts at TASK_SIZE
  https://git.kernel.org/powerpc/c/f1a1f7a15eb0e13b84791ff2738b84e414501718
[6/6] powerpc/32s: Use dedicated segment for modules with STRICT_KERNEL_RWX
  https://git.kernel.org/powerpc/c/6ca055322da8fe25ff9ac50db6f3b7b59b6f961c
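
For context, a rough C sketch of the idea behind patches 2/6 and 6/6 above:
when the architecture defines a dedicated module range, module_alloc() draws
from that range instead of the general vmalloc space, so the rest of vmalloc
can keep the NX bit set. The segment boundaries, allocation flags and headers
below are illustrative assumptions, not the committed powerpc code.

#include <linux/vmalloc.h>
#include <linux/moduleloader.h>

/* Illustrative boundaries for the dedicated executable segment; the
 * series describes a 256MB segment, the real defaults may differ.
 */
#define MODULES_VADDR	0xb0000000UL
#define MODULES_END	0xc0000000UL

void *module_alloc(unsigned long size)
{
	/* Allocate module text/data inside the module segment only */
	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}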

cheers


Re: [v4] powerpc/perf: Initialize power10 PMU registers in cpu setup routine

2020-07-27 Thread Michael Ellerman
On Thu, 23 Jul 2020 03:32:37 -0400, Athira Rajeev wrote:
> Initialize the Monitor Mode Control Register 3 (MMCR3) SPR, which is new
> in power10. For PowerISA v3.1, BHRB disable is controlled via a Monitor
> Mode Control Register A (MMCRA) bit, namely "BHRB Recording Disable
> (BHRBRD)". This patch also initializes MMCRA BHRBRD to disable the BHRB
> feature at boot for power10.

Applied to powerpc/next.

[1/1] powerpc/perf: Initialize power10 PMU registers in cpu setup routine
  https://git.kernel.org/powerpc/c/65156f2b1d9d5bf3fd0eac54b0a7fd515c92773c
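
As a rough illustration of what the commit above does at boot: the committed
code lives in the CPU setup path and may be structured differently, the
function name here is made up, and MMCRA_BHRB_DISABLE is the bit name used by
the power10 PMU support patches.

/* Hypothetical helper: initialise the new power10 PMU registers at boot */
static void init_pmu_power10(void)
{
	/* Clear MMCR3, the Monitor Mode Control Register new in ISA v3.1 */
	mtspr(SPRN_MMCR3, 0);

	/* Set BHRB Recording Disable so BHRB is off by default at boot */
	mtspr(SPRN_MMCRA, mfspr(SPRN_MMCRA) | MMCRA_BHRB_DISABLE);
}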

cheers


Re: [PATCH v2] powerpc/book3s64/pkey: Disable pkey on POWER6 and before

2020-07-27 Thread Michael Ellerman
On Sun, 26 Jul 2020 18:55:17 +0530, Aneesh Kumar K.V wrote:
> POWER6 only supports AMR updates via privileged mode (MSR[PR] = 0, SPRN_AMR=29).
> The PR=1 alias for that SPR (SPRN_AMR=13) was only supported from POWER7.
> Since we don't allow userspace to modify the AMR value, we should disable
> pkey support on P6 and before.
> 
> The hypervisor will still report pkey support via ibm,processor-storage-keys.
> Hence check for the P7 CPU_FTR bit to decide on pkey support.

Applied to powerpc/next.

[1/1] powerpc/book3s64/pkey: Disable pkey on POWER6 and before
  https://git.kernel.org/powerpc/c/269e829f48a0d0d27667abe25ca5c9e5b6ab08e2
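
A minimal sketch of the check the commit message describes; the surrounding
function and its placement in the pkey init path are assumptions, only the
feature test itself comes from the description above.

/* Hypothetical placement inside the pkey early-init path */
static bool pkeys_supported_by_cpu(void)
{
	/*
	 * Userspace AMR updates (SPRN_AMR=13) only exist from POWER7
	 * (ISA 2.06) onwards, so treat pkeys as unsupported on P6 and
	 * earlier even if the device tree advertises
	 * ibm,processor-storage-keys.
	 */
	return early_cpu_has_feature(CPU_FTR_ARCH_206);
}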

cheers


[powerpc:merge] BUILD SUCCESS 163c4334d6564fc8cb638d9f816ef5e1db1b8f1d

2020-07-27 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 163c4334d6564fc8cb638d9f816ef5e1db1b8f1d  Automatic merge of 'master', 'next' and 'fixes' (2020-07-26 23:57)

elapsed time: 1017m

configs tested: 58
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm64        allyesconfig
arm64        defconfig
arm          defconfig
arm          allyesconfig
arm          allmodconfig
i386         allyesconfig
i386         defconfig
ia64         allmodconfig
ia64         defconfig
ia64         allyesconfig
m68k         allmodconfig
m68k         sun3_defconfig
m68k         defconfig
m68k         allyesconfig
nds32        defconfig
nds32        allnoconfig
csky         defconfig
alpha        defconfig
alpha        allyesconfig
nios2        allyesconfig
xtensa       allyesconfig
h8300        allyesconfig
xtensa       defconfig
arc          defconfig
sh           allmodconfig
arc          allyesconfig
parisc       defconfig
s390         allyesconfig
parisc       allyesconfig
s390         defconfig
nios2        defconfig
openrisc     defconfig
c6x          allyesconfig
sparc        allyesconfig
sparc        defconfig
mips         allyesconfig
mips         allmodconfig
powerpc      defconfig
powerpc      allyesconfig
powerpc      allmodconfig
powerpc      allnoconfig
i386         randconfig-a016-20200727
i386         randconfig-a013-20200727
i386         randconfig-a012-20200727
i386         randconfig-a015-20200727
i386         randconfig-a011-20200727
i386         randconfig-a014-20200727
riscv        allyesconfig
riscv        allnoconfig
riscv        defconfig
riscv        allmodconfig
sparc64      defconfig
x86_64       rhel-7.6-kselftests
x86_64       rhel-8.3
x86_64       kexec
x86_64       rhel
x86_64       allyesconfig
x86_64       defconfig

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH V4 0/4] powerpc/perf: Add support for perf extended regs in powerpc

2020-07-27 Thread Ravi Bangoria




On 7/26/20 8:15 PM, Athira Rajeev wrote:

Patch set to add support for perf extended register capability in
powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to
indicate a PMU that supports extended registers. The generic code
defines the mask of extended registers as 0 for unsupported architectures.

Patches 1 and 2 are the kernel side changes needed to include
base support for extended regs in powerpc and in power10.
Patches 3 and 4 are the perf tools side changes needed to support the
extended registers.

Patch 1/4 defines the PERF_PMU_CAP_EXTENDED_REGS mask to output the
values of mmcr0, mmcr1, mmcr2 for POWER9. It defines `PERF_REG_EXTENDED_MASK`
at runtime, which contains the mask of the registers supported under
extended regs.

Patch 2/4 adds extended regs support for power10 and exposes
MMCR3, SIER2, SIER3 registers as part of extended regs.

Patch 3/4 and 4/4 adds extended regs to sample_reg_mask in the tool
side to use with `-I?` option for power9 and power10 respectively.

Ravi Bangoria found an issue with `perf record -I` while testing the
changes. The same issue is currently being worked on here:
https://lkml.org/lkml/2020/7/19/413 and will be resolved once the fix
from Jin Yao is merged.

This patch series is based on powerpc/next


Apart from the issue with patch #1, the series LGTM..

Reviewed-and-Tested-by: Ravi Bangoria 



[PATCH] powerpc/fadump: Fix build error with CONFIG_PRESERVE_FA_DUMP=y

2020-07-27 Thread Michael Ellerman
skiroot_defconfig fails:

arch/powerpc/kernel/fadump.c:48:17: error: ‘cpus_in_fadump’ defined but not used
   48 | static atomic_t cpus_in_fadump;

Fix it by moving the definition into the #ifdef where it's used.

Fixes: ba608c4fa12c ("powerpc/fadump: fix race between pstore write and fadump crash trigger")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/fadump.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 1858896d6809..10ebb4bf71ad 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -45,10 +45,12 @@ static struct fw_dump fw_dump;
 static void __init fadump_reserve_crash_area(u64 base);
 
 struct kobject *fadump_kobj;
-static atomic_t cpus_in_fadump;
 
 #ifndef CONFIG_PRESERVE_FA_DUMP
+
+static atomic_t cpus_in_fadump;
 static DEFINE_MUTEX(fadump_mutex);
+
struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0, false };
 
 #define RESERVED_RNGS_SZ   16384 /* 16K - 128 entries */
-- 
2.25.1



Re: [PATCH V4 1/4] powerpc/perf: Add support for outputting extended regs in perf intr_regs

2020-07-27 Thread Ravi Bangoria

Hi Athira,


+/* Function to return the extended register values */
+static u64 get_ext_regs_value(int idx)
+{
+   switch (idx) {
+   case PERF_REG_POWERPC_MMCR0:
+   return mfspr(SPRN_MMCR0);
+   case PERF_REG_POWERPC_MMCR1:
+   return mfspr(SPRN_MMCR1);
+   case PERF_REG_POWERPC_MMCR2:
+   return mfspr(SPRN_MMCR2);
+   default: return 0;
+   }
+}
+
  u64 perf_reg_value(struct pt_regs *regs, int idx)
  {
-   if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
-   return 0;
+   u64 perf_reg_extended_max = 0;


This should be initialized to PERF_REG_POWERPC_MAX. Defaulting to 0 will always
trigger the WARN_ON_ONCE(idx >= perf_reg_extended_max) below on non-P9/P10 systems.


+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   perf_reg_extended_max = PERF_REG_MAX_ISA_300;
  
  	if (idx == PERF_REG_POWERPC_SIER &&

   (IS_ENABLED(CONFIG_FSL_EMB_PERF_EVENT) ||
@@ -85,6 +103,16 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
IS_ENABLED(CONFIG_PPC32)))
return 0;
  
+	if (idx >= PERF_REG_POWERPC_MAX && idx < perf_reg_extended_max)

+   return get_ext_regs_value(idx);
+
+   /*
+* If the idx is referring to value beyond the
+* supported registers, return 0 with a warning
+*/
+   if (WARN_ON_ONCE(idx >= perf_reg_extended_max))
+   return 0;
+
return regs_get_register(regs, pt_regs_offset[idx]);
  }


Ravi
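
For reference, a minimal sketch of perf_reg_value() with the initialisation
Ravi suggests, i.e. perf_reg_extended_max defaulting to PERF_REG_POWERPC_MAX
rather than 0 so the WARN_ON_ONCE() keeps its old behaviour on pre-P9 systems.
The SIER special case from the patch is elided and this is not necessarily the
final committed code.

u64 perf_reg_value(struct pt_regs *regs, int idx)
{
	/* Without CPU_FTR_ARCH_300 there are no extended regs, so the
	 * upper bound must stay at PERF_REG_POWERPC_MAX, not 0.
	 */
	u64 perf_reg_extended_max = PERF_REG_POWERPC_MAX;

	if (cpu_has_feature(CPU_FTR_ARCH_300))
		perf_reg_extended_max = PERF_REG_MAX_ISA_300;

	/* ... SIER special-casing from the patch elided ... */

	if (idx >= PERF_REG_POWERPC_MAX && idx < perf_reg_extended_max)
		return get_ext_regs_value(idx);

	if (WARN_ON_ONCE(idx >= perf_reg_extended_max))
		return 0;

	return regs_get_register(regs, pt_regs_offset[idx]);
}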

