[PATCH 0/2] powerpc/cacheinfo: Add "ibm,thread-groups" awareness

2021-01-18 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) using the device-tree node
corresponding to each cache as the key. On some POWER server platforms,
thread-groups within a core share different sets of caches (e.g. on
SMT8 POWER9 systems, threads 0,2,4,6 of a core share one L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms there is a single device-tree node corresponding to that
cache, and the cache configuration within the threads of the core is
indicated via the "ibm,thread-groups" device-tree property.
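
As an illustration (not taken from the patches themselves), the
property is an array of the form [property-ID, #groups,
#threads-per-group, thread list...]. For the SMT8 POWER9 example above,
the L1-sharing groups would be described roughly as follows, where
property-ID 1 is assumed to denote threads sharing common core
resources such as the L1 cache:

	ibm,thread-groups = <1 2 4	/* property 1, 2 groups, 4 threads/group */
			     0 2 4 6	/* group 0: threads 0,2,4,6 */
			     1 3 5 7>;	/* group 1: threads 1,3,5,7 */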

Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementioned systems the cacheinfo code still treats
all the threads in the core as sharing the cache because of the
single device-tree node (in the earlier example, the cacheinfo code
would say that CPUs 0-7 share the L1 cache).

In this patch series, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We index the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPU X, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPU X. For levels
of cache which are not represented in the "ibm,thread-groups" property,
the thread-group id is -1.
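
A minimal sketch of how such a group id could be computed (the helper
name is illustrative; it assumes per-CPU masks populated from
"ibm,thread-groups" at boot, and flags indicating whether the property
describes a given cache level at all):

static int get_group_id(unsigned int cpu, int level)
{
	/* L1 is split between thread groups on big-core systems */
	if (has_big_cores && level == 1)
		return cpumask_first(per_cpu(thread_group_l1_cache_map, cpu));
	/* Some platforms additionally split the L2 between thread groups */
	else if (thread_group_shares_l2 && level == 2)
		return cpumask_first(per_cpu(thread_group_l2_cache_map, cpu));

	/* This cache level is not described by "ibm,thread-groups" */
	return -1;
}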

We can now remove the helper functions get_shared_cpu_map() and
index_dir_to_cpu(), since cache->shared_cpu_map contains the
correct state of the thread-siblings sharing the cache.

This has been tested on an SMT8 POWER9 system where the L1 cache is
split between groups of threads in the core, and on an SMT8 POWER10
system where the L1 and L2 caches are split between groups of threads
in a core.

With this patch series, on the POWER10 SMT8 system, we see the
following reported via sysfs:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Gautham R. Shenoy (2):
  powerpc/cacheinfo: Lookup cache by dt node and thread-group id
  powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

 arch/powerpc/include/asm/smp.h  |   3 +
 arch/powerpc/kernel/cacheinfo.c | 121 
 2 files changed, 62 insertions(+), 62 deletions(-)

-- 
1.9.4



[PATCH 2/2] powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

2021-01-18 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The helper function get_shared_cpu_map() was added in

'commit 500fe5f550ec ("powerpc/cacheinfo: Report the correct
shared_cpu_map on big-cores")'

and subsequently expanded upon in

'commit 0be47634db0b ("powerpc/cacheinfo: Print correct cache-sibling
map/list for L2 cache")'

in order to help report the correct groups of threads sharing these caches
on big-core systems where groups of threads within a core can share
different sets of caches.

Now that powerpc/cacheinfo is aware of the "ibm,thread-groups"
property, cache->shared_cpu_map contains the correct set of
thread-siblings sharing the cache. Hence we no longer need
get_shared_cpu_map(); this patch removes it. We also remove the helper
function index_dir_to_cpu(), which was only called by
get_shared_cpu_map().
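
For reference, a rough sketch of what the sysfs show routine reduces to
once cache->shared_cpu_map can be trusted (the types follow the
existing cacheinfo.c code, but treat this as an approximation rather
than the exact resulting function):

static ssize_t show_shared_cpu_map(struct kobject *k,
				   struct kobj_attribute *attr, char *buf)
{
	struct cache_index_dir *index = kobj_to_cache_index_dir(k);
	struct cache *cache = index->cache;

	/* shared_cpu_map is now correct even on big-core systems */
	return cpumap_print_to_pagebuf(false, buf, &cache->shared_cpu_map);
}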

With these functions removed, we still see the correct cache-sibling
map/list for the L1 and L2 caches on systems with L1 and L2 caches
distributed among groups of threads in a core.

With this patch, on an SMT8 POWER10 system where the L1 and L2 caches
are split between the two groups of threads in a core, the L1-Data,
L1-Instruction, L2 and L3 cache CPU sibling lists for CPUs 8 and 9 are
as follows:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/cacheinfo.c | 41 +
 1 file changed, 1 insertion(+), 40 deletions(-)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 5a6925d..20d9169 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -675,45 +675,6 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
 static struct kobj_attribute cache_level_attr =
__ATTR(level, 0444, level_show, NULL);
 
-static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
-{
-   struct kobject *index_dir_kobj = &index->kobj;
-   struct kobject *cache_dir_kobj = index_dir_kobj->parent;
-   struct kobject *cpu_dev_kobj = cache_dir_kobj->parent;
-   struct device *dev = kobj_to_dev(cpu_dev_kobj);
-
-   return dev->id;
-}
-
-/*
- * On big-core systems, each core has two groups of CPUs each of which
- * has its own L1-cache. The thread-siblings which share l1-cache with
- * @cpu can be obtained via cpu_smallcore_mask().
- *
- * On some big-core systems, the L2 cache is shared only between some
- * groups of siblings. This is already parsed and encoded in
- * cpu_l2_cache_mask().
- *
- * TODO: cache_lookup_or_instantiate() needs to be made aware of the
- *   "ibm,thread-groups" property so that cache->shared_cpu_map
- *   reflects the correct siblings on platforms that have this
- *   device-tree property. This helper function is only a stop-gap
- *   solution so that we report the correct siblings to the
- *   userspace via sysfs.
- */
-s

[PATCH 1/2] powerpc/cacheinfo: Lookup cache by dt node and thread-group id

2021-01-18 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) using the device-tree node
corresponding to each cache as the key. On some POWER server platforms,
thread-groups within a core share different sets of caches (e.g. on
SMT8 POWER9 systems, threads 0,2,4,6 of a core share one L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms there is a single device-tree node corresponding to that
cache, and the cache configuration within the threads of the core is
indicated via the "ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementioned systems the cacheinfo code still treats
all the threads in the core as sharing the cache because of the
single device-tree node (in the earlier example, the cacheinfo code
would say that CPUs 0-7 share the L1 cache).

In this patch, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We index the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPU X, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPU X. For levels
of cache which are not represented in the "ibm,thread-groups" property,
the thread-group id is -1.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/smp.h  |  3 ++
 arch/powerpc/kernel/cacheinfo.c | 80 +
 2 files changed, 61 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index c4e2d53..39de24b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -32,6 +32,9 @@
 
 extern int cpu_to_chip_id(int cpu);
 
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 #ifdef CONFIG_SMP
 
 struct smp_ops_t {
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 6f903e9a..5a6925d 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -120,6 +120,7 @@ struct cache {
struct cpumask shared_cpu_map; /* online CPUs using this cache */
int type;  /* split cache disambiguation */
int level; /* level not explicit in device tree */
+   int group_id;  /* id of the group of threads that share this cache */
struct list_head list; /* global list of cache objects */
struct cache *next_local;  /* next cache of >= level */
 };
@@ -142,22 +143,24 @@ static const char *cache_type_string(const struct cache *cache)
 }
 
 static void cache_init(struct cache *cache, int type, int level,
-  struct device_node *ofnode)
+  struct device_node *ofnode, int group_id)
 {
cache->type = type;
cache->level = level;
cache->ofnode = of_node_get(ofnode);
+   cache->group_id = group_id;
INIT_LIST_HEAD(&cache->list);
list_add(&cache->list, &cache_list);
 }
 
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
+static struct cache *new_cache(int type, int level,
+  struct device_node *ofnode, int group_id)
 {
struct cache *cache;
 
cache = kzalloc(sizeof(*cache), GFP_KERNEL);
if (cache)
-   cache_init(cache, type, level, ofnode);
+   cache_init(cache, type, level, ofnode, group_id);
 
return cache;
 }
@@ -309,20 +312,24 @@ static struct cache *cache_find_first_sibling(struct cache *cache)
return cache;
 
list_for_each_entry(iter, &cache_list, list)
-   if (iter->ofnode == cache->ofnode && iter->next_local == cache)
+   if (iter->ofnode == cache->ofnode &&
+   iter->group_id == cache->group_id &&
+   iter->next_local == cache)
return iter;
 
return cache;
 }
 
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
+/* return the first cache on a local list matching node and thread-group id */
+static struct cache *cache_lookup_by_node_group(const struct device_node *node,
+   int group_id)
 {
struct cache *cache = NULL;
struct cache *iter;
 
list_for_each_entry(iter, &cache_list, list) {
-   if (iter->ofnode != node)
+   if (iter->ofnode != node ||
+   iter->group_id != group_id)
continue;
cache = cache_find_first_sibling(iter);
break;
@@ -352,14 +359,15 @@ static int cache_is_unified_d(const struct device_node *np)
CACHE_TYPE_UNIFIED_D : CACHE_TYPE_UNIFIED;
 }
 
-static struct ca

Re: [RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level

2021-01-18 Thread Shivaprasad G Bhat

Thanks for the comments!


On 12/28/20 2:08 PM, David Gibson wrote:
> On Mon, Dec 21, 2020 at 01:08:53PM +0100, Greg Kurz wrote:
>
> ...
>
>> The overall idea looks good but I think you should consider using
>> a thread pool to implement it. See below.
>
> I am not convinced, however.  Specifically, attaching this to the DRC
> doesn't make sense to me.  We're adding exactly one DRC related async
> hcall, and I can't really see much call for another one.  We could
> have other async hcalls - indeed we already have one for HPT resizing
> - but attaching this to DRCs doesn't help for those.


The semantics of the hcall made me wonder whether it would be reusable
in the future if implemented at the DRC level. The other option is to
move the async-hcall-state/list into the NVDIMMState structure in
include/hw/mem/nvdimm.h and handle it with machine->nvdimms_state at a
global level.

Hope you are okay with using the pool-based approach that Greg
suggested; a rough sketch of what that might look like is below.

Please let me know.
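
As a hypothetical sketch of the pool-based approach (only
thread_pool_submit_aio(), aio_get_thread_pool() and
qemu_get_aio_context() are real QEMU APIs here; SpaprAsyncHcallState,
do_slow_operation() and the other spapr-side names are made up for
illustration):

#include "qemu/osdep.h"
#include "block/aio.h"
#include "block/thread-pool.h"

typedef struct SpaprAsyncHcallState {
    bool complete;   /* polled by the guest via a query hcall */
    int ret;
} SpaprAsyncHcallState;

/* Hypothetical slow work, e.g. an H_SCM_BIND_MEM-style operation */
static int do_slow_operation(SpaprAsyncHcallState *state);

/* Runs in a worker thread: the slow part of the hcall */
static int async_hcall_worker(void *opaque)
{
    SpaprAsyncHcallState *state = opaque;

    state->ret = do_slow_operation(state);
    return 0;
}

/* Runs back in the main loop once the worker finishes */
static void async_hcall_done(void *opaque, int ret)
{
    SpaprAsyncHcallState *state = opaque;

    state->complete = true;
}

static void start_async_hcall(SpaprAsyncHcallState *state)
{
    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());

    thread_pool_submit_aio(pool, async_hcall_worker, state,
                           async_hcall_done, state);
}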


Thanks,

Shivaprasad




Re: [PATCH] selftests/powerpc: Fix exit status of pkey tests

2021-01-18 Thread Aneesh Kumar K.V
Sandipan Das  writes:

> Since main() does not return a value explicitly, the
> return values from FAIL_IF() conditions are ignored
> and the tests can still pass irrespective of failures.
> This makes sure that we always explicitly return the
> correct test exit status.
>

Reviewed-by: Aneesh Kumar K.V 

> Reported-by: Eirik Fuller 
> Fixes: 1addb6444791 ("selftests/powerpc: Add test for execute-disabled pkeys")
> Fixes: c27f2fd1705a ("selftests/powerpc: Add test for pkey siginfo verification")
> Signed-off-by: Sandipan Das 
> ---
>  tools/testing/selftests/powerpc/mm/pkey_exec_prot.c | 2 +-
>  tools/testing/selftests/powerpc/mm/pkey_siginfo.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c b/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
> index 9e5c7f3f498a..0af4f02669a1 100644
> --- a/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
> +++ b/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
> @@ -290,5 +290,5 @@ static int test(void)
>  
>  int main(void)
>  {
> - test_harness(test, "pkey_exec_prot");
> + return test_harness(test, "pkey_exec_prot");
>  }
> diff --git a/tools/testing/selftests/powerpc/mm/pkey_siginfo.c b/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
> index 4f815d7c1214..2db76e56d4cb 100644
> --- a/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
> +++ b/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
> @@ -329,5 +329,5 @@ static int test(void)
>  
>  int main(void)
>  {
> - test_harness(test, "pkey_siginfo");
> + return test_harness(test, "pkey_siginfo");
>  }
> -- 
> 2.25.1


[PATCH] selftests/powerpc: Only test lmw/stmw on big endian

2021-01-18 Thread Michael Ellerman
Newer binutils (>= 2.36) refuse to assemble lmw/stmw when building in
little endian mode. That breaks compilation of our alignment handler
test:

  /tmp/cco4l14N.s: Assembler messages:
  /tmp/cco4l14N.s:1440: Error: `lmw' invalid when little-endian
  /tmp/cco4l14N.s:1814: Error: `stmw' invalid when little-endian
  make[2]: *** [../../lib.mk:139: /output/kselftest/powerpc/alignment/alignment_handler] Error 1

These tests do pass on little endian machines, as the kernel will
still emulate those instructions even when running little
endian (which is arguably a kernel bug).

But we don't really need to test that case, so ifdef those
instructions out to get the alignment test building again.

Reported-by: Libor Pechacek 
Signed-off-by: Michael Ellerman 
---
 .../testing/selftests/powerpc/alignment/alignment_handler.c  | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/alignment/alignment_handler.c b/tools/testing/selftests/powerpc/alignment/alignment_handler.c
index cb53a8b777e6..c25cf7cd45e9 100644
--- a/tools/testing/selftests/powerpc/alignment/alignment_handler.c
+++ b/tools/testing/selftests/powerpc/alignment/alignment_handler.c
@@ -443,7 +443,6 @@ int test_alignment_handler_integer(void)
LOAD_DFORM_TEST(ldu);
LOAD_XFORM_TEST(ldx);
LOAD_XFORM_TEST(ldux);
-   LOAD_DFORM_TEST(lmw);
STORE_DFORM_TEST(stb);
STORE_XFORM_TEST(stbx);
STORE_DFORM_TEST(stbu);
@@ -462,7 +461,11 @@ int test_alignment_handler_integer(void)
STORE_XFORM_TEST(stdx);
STORE_DFORM_TEST(stdu);
STORE_XFORM_TEST(stdux);
+
+#ifdef __BIG_ENDIAN__
+   LOAD_DFORM_TEST(lmw);
STORE_DFORM_TEST(stmw);
+#endif
 
return rc;
 }
-- 
2.25.1



Re: [PATCH v3] powerpc/mce: Remove per cpu variables from MCE handlers

2021-01-18 Thread Nicholas Piggin
Excerpts from Ganesh Goudar's message of January 15, 2021 10:58 pm:
> Access to per-cpu variables requires translation to be enabled on a
> pseries machine running in hash MMU mode. Since part of the MCE handler
> runs in real mode and part of the MCE handling code is shared between
> the pseries and powernv platforms, it becomes difficult to manage
> these variables differently on different platforms. So keep these
> variables in the paca instead of having them as per-cpu variables, to
> avoid complications.

Seems okay.

> 
> The maximum recursive depth of MCE is 4. Considering the maximum depth
> allowed, reduce the size of the event array to 10 from 100.

Could you make this a separate patch, with memory saving numbers?
"Delayed" MCEs are not necessarily the same as recursive (several 
sequential MCEs can occur before the first event is processed).
But I agree 100 is pretty overboard (as is 4 recursive MCEs really).

> 
> Signed-off-by: Ganesh Goudar 
> ---
> v2: Dynamically allocate memory for machine check event info
> 
> v3: Remove check for hash mmu lpar, use memblock_alloc_try_nid
> to allocate memory.
> ---
>  arch/powerpc/include/asm/mce.h | 21 -
>  arch/powerpc/include/asm/paca.h|  4 ++
>  arch/powerpc/kernel/mce.c  | 76 +-
>  arch/powerpc/kernel/setup-common.c |  2 +-
>  4 files changed, 69 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index e6c27ae843dc..8d6e3a7a9f37 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -204,7 +204,18 @@ struct mce_error_info {
>   boolignore_event;
>  };
>  
> -#define MAX_MC_EVT   100
> +#define MAX_MC_EVT   10

> +
> +struct mce_info {
> + int mce_nest_count;
> + struct machine_check_event mce_event[MAX_MC_EVT];
> + /* Queue for delayed MCE events. */
> + int mce_queue_count;
> + struct machine_check_event mce_event_queue[MAX_MC_EVT];
> + /* Queue for delayed MCE UE events. */
> + int mce_ue_count;
> + struct machine_check_event  mce_ue_event_queue[MAX_MC_EVT];
> +};
>  
>  /* Release flags for get_mce_event() */
>  #define MCE_EVENT_RELEASEtrue
> @@ -233,5 +244,13 @@ long __machine_check_early_realmode_p7(struct pt_regs *regs);
>  long __machine_check_early_realmode_p8(struct pt_regs *regs);
>  long __machine_check_early_realmode_p9(struct pt_regs *regs);
>  long __machine_check_early_realmode_p10(struct pt_regs *regs);
> +#define get_mce_info() local_paca->mce_info

I don't think this adds anything. Could you open code it?
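
For instance, a call site would then read something like (illustrative
only, using the mce_info pointer this patch adds to the paca):

	/* open-coded, rather than hidden behind get_mce_info() */
	int index = local_paca->mce_info->mce_nest_count++;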

Thanks,
Nick


Re: [PATCH 1/4] KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support

2021-01-18 Thread Nicholas Piggin
Excerpts from Fabiano Rosas's message of January 19, 2021 11:46 am:
> Resending because the previous got spam-filtered:
> 
> Nicholas Piggin  writes:
> 
>> This reverts much of commit c01015091a770 ("KVM: PPC: Book3S HV: Run HPT
>> guests on POWER9 radix hosts"), which was required to run HPT guests on
>> RPT hosts on early POWER9 CPUs without support for "mixed mode", which
>> meant the host could not run with MMU on while guests were running.
>>
>> This code has some corner case bugs, e.g., when the guest hits a machine
>> check or HMI the primary locks up waiting for secondaries to switch LPCR
>> to host, which they never do. This could all be fixed in software, but
>> most CPUs in production have mixed mode support, and those that don't
>> are believed to be all in installations that don't use this capability.
>> So simplify things and remove support.
> 
> With this patch in a DD2.1 machine + indep_threads_mode=N +
> disable_radix, QEMU aborts and dumps registers, is that intended?

Yes. That configuration hangs while handling MCEs in the guest, with
some threads waiting forever to synchronize. Paul suggested it was
never a supported configuration, so we might just remove it.

> Could we use the 'no_mixing_hpt_and_radix' logic in check_extension to
> advertise only KVM_CAP_PPC_MMU_RADIX to the guest via OV5 so it doesn't
> try to run hash?
> 
> For instance, if I hack QEMU's 'spapr_dt_ov5_platform_support' from
> OV5_MMU_BOTH to OV5_MMU_RADIX_300 then it boots successfully, but the
> guest turns into radix, due to this code in prom_init:
> 
> prom_parse_mmu_model:
> 
> case OV5_FEAT(OV5_MMU_RADIX): /* Only Radix */
>   prom_debug("MMU - radix only\n");
>   if (prom_radix_disable) {
>   /*
>* If we __have__ to do radix, we're better off ignoring
>* the command line rather than not booting.
>*/
>   prom_printf("WARNING: Ignoring cmdline option disable_radix\n");
>   }
>   support->radix_mmu = true;
>   break;
> 
> It seems we could explicitly say that the host does not support hash and
> that would align with the above code.

I'm not sure; it sounds like you could. On the other hand, these aborts
seem like the preferred failure mode for these kinds of configuration
issues. I don't know what the policy is: is reverting back to radix
acceptable?

Thanks,
Nick



Re: [PATCH v3 1/8] powerpc/uaccess: Add unsafe_copy_from_user

2021-01-18 Thread Michael Ellerman
"Christopher M. Riedl"  writes:
> On Mon Jan 11, 2021 at 7:22 AM CST, Christophe Leroy wrote:
>> Le 09/01/2021 à 04:25, Christopher M. Riedl a écrit :
>> > Implement raw_copy_from_user_allowed() which assumes that userspace read
>> > access is open. Use this new function to implement raw_copy_from_user().
>> > Finally, wrap the new function to follow the usual "unsafe_" convention
>> > of taking a label argument.
>>
>> I think there is no point implementing raw_copy_from_user_allowed(), see
>> https://github.com/linuxppc/linux/commit/4b842e4e25b1 and
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8c74fc9ce8131cabb10b3e95dc0e430f396ee83e.1610369143.git.christophe.le...@csgroup.eu/
>>
>> You should simply do:
>>
>> #define unsafe_copy_from_user(d, s, l, e) \
>>	unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
>>
>
> I gave this a try and the signal ops decreased by ~8K. Now, to be
> honest, I am not sure what an "acceptable" benchmark number here
> actually is - so maybe this is ok? Same loss with both radix and hash:
>
>   |  | hash   | radix  |
>   |  | -- | -- |
>   | linuxppc/next| 118693 | 133296 |
>   | linuxppc/next w/o KUAP+KUEP  | 228911 | 228654 |
>   | unsafe-signal64  | 200480 | 234067 |
>   | unsafe-signal64 (__copy_tofrom_user) | 192467 | 225119 |
>
> To put this into perspective, prior to KUAP and uaccess flush, signal
> performance in this benchmark was ~290K on hash.

If I'm doing the math right 8K is ~4% of the best number.

It seems like 4% is worth a few lines of code to handle these constant
sizes. It's not like we have performance to throw away.

Or, we should chase down where the call sites are that are doing small
constant copies with copy_to/from_user() and change them to use
get/put_user().
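
As a hypothetical illustration of that kind of conversion (the call
site and variable names are made up):

	u64 val;

	/* before: a small constant-size copy goes through copy_from_user() */
	if (copy_from_user(&val, uptr, sizeof(val)))
		return -EFAULT;

	/* after: get_user() lets the compiler emit an inline access */
	if (get_user(val, (u64 __user *)uptr))
		return -EFAULT;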

cheers


Re: [PATCH 1/4] KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support

2021-01-18 Thread Fabiano Rosas
Resending because the previous got spam-filtered:

Nicholas Piggin  writes:

> This reverts much of commit c01015091a770 ("KVM: PPC: Book3S HV: Run HPT
> guests on POWER9 radix hosts"), which was required to run HPT guests on
> RPT hosts on early POWER9 CPUs without support for "mixed mode", which
> meant the host could not run with MMU on while guests were running.
>
> This code has some corner case bugs, e.g., when the guest hits a machine
> check or HMI the primary locks up waiting for secondaries to switch LPCR
> to host, which they never do. This could all be fixed in software, but
> most CPUs in production have mixed mode support, and those that don't
> are believed to be all in installations that don't use this capability.
> So simplify things and remove support.

With this patch in a DD2.1 machine + indep_threads_mode=N +
disable_radix, QEMU aborts and dumps registers, is that intended?

Could we use the 'no_mixing_hpt_and_radix' logic in check_extension to
advertise only KVM_CAP_PPC_MMU_RADIX to the guest via OV5 so it doesn't
try to run hash?

For instance, if I hack QEMU's 'spapr_dt_ov5_platform_support' from
OV5_MMU_BOTH to OV5_MMU_RADIX_300 then it boots successfully, but the
guest turns into radix, due to this code in prom_init:

prom_parse_mmu_model:

case OV5_FEAT(OV5_MMU_RADIX): /* Only Radix */
prom_debug("MMU - radix only\n");
if (prom_radix_disable) {
/*
 * If we __have__ to do radix, we're better off ignoring
 * the command line rather than not booting.
 */
prom_printf("WARNING: Ignoring cmdline option disable_radix\n");
}
support->radix_mmu = true;
break;

It seems we could explicitly say that the host does not support hash and
that would align with the above code.

>
> Signed-off-by: Nicholas Piggin 
> ---


Re: [PATCH 2/4] KVM: PPC: Book3S HV: Fix radix guest SLB side channel

2021-01-18 Thread Fabiano Rosas
Nicholas Piggin  writes:

> The slbmte instruction is legal in radix mode, including radix guest
> mode. This means radix guests can load the SLB with arbitrary data.
>
> KVM host does not clear the SLB when exiting a guest if it was a
> radix guest, which would allow a rogue radix guest to use the SLB as
> a side channel to communicate with other guests.
>
> Fix this by ensuring the SLB is cleared when coming out of a radix
> guest. Only the first 4 entries are a concern, because radix guests
> always run with LPCR[UPRT]=1, which limits the reach of slbmte. slbia
> is not used (except in a non-performance-critical path) because it
> can clear cached translations.
>
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Fabiano Rosas 

> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 39 -
>  1 file changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index d5a9b57ec129..0e1f5bf168a1 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1157,6 +1157,20 @@ EXPORT_SYMBOL_GPL(__kvmhv_vcpu_entry_p9)
>   mr  r4, r3
>   b   fast_guest_entry_c
>  guest_exit_short_path:
> + /*
> +  * Malicious or buggy radix guests may have inserted SLB entries
> +  * (only 0..3 because radix always runs with UPRT=1), so these must
> +  * be cleared here to avoid side-channels. slbmte is used rather
> +  * than slbia, as it won't clear cached translations.
> +  */
> + li  r0,0
> + slbmte  r0,r0
> + li  r4,1
> + slbmte  r0,r4
> + li  r4,2
> + slbmte  r0,r4
> + li  r4,3
> + slbmte  r0,r4
>
>   li  r0, KVM_GUEST_MODE_NONE
>   stb r0, HSTATE_IN_GUEST(r13)
> @@ -1469,7 +1483,7 @@ guest_exit_cont:	/* r9 = vcpu, r12 = trap, r13 = paca */
>   lbz r0, KVM_RADIX(r5)
>   li  r5, 0
>   cmpwi   r0, 0
> - bne 3f  /* for radix, save 0 entries */
> + bne 0f  /* for radix, save 0 entries */
>   lwz r0,VCPU_SLB_NR(r9)  /* number of entries in SLB */
>   mtctr   r0
>   li  r6,0
> @@ -1490,12 +1504,9 @@ guest_exit_cont:	/* r9 = vcpu, r12 = trap, r13 = paca */
>   slbmte  r0,r0
>   slbia
>   ptesync
> -3:   stw r5,VCPU_SLB_MAX(r9)
> + stw r5,VCPU_SLB_MAX(r9)
>
>   /* load host SLB entries */
> -BEGIN_MMU_FTR_SECTION
> - b   0f
> -END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
>   ld  r8,PACA_SLBSHADOWPTR(r13)
>
>   .rept   SLB_NUM_BOLTED
> @@ -1508,7 +1519,17 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
>   slbmte  r6,r5
>  1:   addir8,r8,16
>   .endr
> -0:
> + b   guest_bypass
> +
> +0:   /* Sanitise radix guest SLB, see guest_exit_short_path comment. */
> + li  r0,0
> + slbmte  r0,r0
> + li  r4,1
> + slbmte  r0,r4
> + li  r4,2
> + slbmte  r0,r4
> + li  r4,3
> + slbmte  r0,r4
>
>  guest_bypass:
>   stw r12, STACK_SLOT_TRAP(r1)
> @@ -3302,12 +3323,14 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
>   mtspr   SPRN_CIABR, r0
>   mtspr   SPRN_DAWRX0, r0
>
> + /* Clear hash and radix guest SLB, see guest_exit_short_path comment. */
> + slbmte  r0, r0
> + slbia
> +
>  BEGIN_MMU_FTR_SECTION
>   b   4f
>  END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
>
> - slbmte  r0, r0
> - slbia
>   ptesync
>   ld  r8, PACA_SLBSHADOWPTR(r13)
>   .rept   SLB_NUM_BOLTED


[PATCH v2] powerpc: always enable queued spinlocks for 64s, disable for others

2021-01-18 Thread Nicholas Piggin
Queued spinlocks have been shown to have good performance and fairness
properties even on smaller (2 socket) POWER systems. This selects
them automatically for 64s. For other platforms they are de-selected:
the standard spinlock is far simpler and smaller code, and single
chips with a handful of cores are unlikely to show any improvement.

CONFIG_EXPERT still allows this to be changed, e.g., to help debug
performance or correctness issues.

Signed-off-by: Nicholas Piggin 
---
Since v1, made the condition simpler as suggested by Christophe.
Verified this works as expected for a bunch of old configs
(e.g., updates old !EXPERT configs to PPC_QUEUED_SPINLOCKS=y if they
were previously =n).

Thanks,
Nick

 arch/powerpc/Kconfig | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..eebb4a8c156c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -505,18 +505,14 @@ config HOTPLUG_CPU
  Say N if you are unsure.
 
 config PPC_QUEUED_SPINLOCKS
-   bool "Queued spinlocks"
+   bool "Queued spinlocks" if EXPERT
depends on SMP
+   default PPC_BOOK3S_64
help
  Say Y here to use queued spinlocks which give better scalability and
  fairness on large SMP and NUMA systems without harming single threaded
  performance.
 
- This option is currently experimental, the code is more complex and
- less tested so it defaults to "N" for the moment.
-
- If unsure, say "N".
-
 config ARCH_CPU_PROBE_RELEASE
def_bool y
depends on HOTPLUG_CPU
-- 
2.23.0



[PATCH][next] scsi: ibmvfc: Fix spelling mistake "succeded" -> "succeeded"

2021-01-18 Thread Colin King
From: Colin Ian King 

There is a spelling mistake in an ibmvfc_dbg debug message. Fix it.

Signed-off-by: Colin Ian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 1db9e04f1ad3..755313b766b9 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -4808,7 +4808,7 @@ static void ibmvfc_channel_setup_done(struct ibmvfc_event *evt)
 
switch (mad_status) {
case IBMVFC_MAD_SUCCESS:
-   ibmvfc_dbg(vhost, "Channel Setup succeded\n");
+   ibmvfc_dbg(vhost, "Channel Setup succeeded\n");
flags = be32_to_cpu(setup->flags);
vhost->do_enquiry = 0;
active_queues = be32_to_cpu(setup->num_scsi_subq_channels);
-- 
2.29.2



Re: [PATCH] tools/perf: Fix powerpc gap between kernel end and module start

2021-01-18 Thread kajoljain



On 1/12/21 3:08 PM, Jiri Olsa wrote:
> On Mon, Dec 28, 2020 at 09:14:14PM -0500, Athira Rajeev wrote:
> 
> SNIP
> 
>> c2799370 b backtrace_flag
>> c2799378 B radix_tree_node_cachep
>> c2799380 B __bss_stop
>> c27a B _end
>> c0080389 t icmp_checkentry  [ip_tables]
>> c00803890038 t ipt_alloc_initial_table  [ip_tables]
>> c00803890468 T ipt_do_table [ip_tables]
>> c00803890de8 T ipt_unregister_table_pre_exit[ip_tables]
>> ...
>>
>> Perf calls the function symbols__fixup_end(), which sets the end of a
>> symbol to the next address. Here that is 0xc0080389, the start
>> address of the first module (icmp_checkentry above), which results in
>> a huge symbol size of 0x8010f.
>>
>> After symbols__fixup_end:
>> symbols__fixup_end: sym->name: _end, sym->start: 0xc27a,
>> sym->end: 0xc0080389
>>
>> On powerpc, the kernel text segment is located at 0xc000
>> whereas the modules are located at very high memory addresses,
>> 0xc0080xxx. Since the gap between the end of the kernel text segment
>> and the beginning of the first module's address is large, histogram
>> allocation using calloc fails.
>>
>> Fix this by detecting the kernel's last symbol and limiting the
>> range of the last kernel symbol to pagesize.

Patch looks good to me.

Tested-By: Kajol Jain

Thanks,
Kajol Jain
>>
>> Signed-off-by: Athira Rajeev
> 
> I can't test, but since the same approach works for arm and s390,
> this also looks ok
> 
> Acked-by: Jiri Olsa 
> 
> thanks,
> jirka
> 
>> ---
>>  tools/perf/arch/powerpc/util/Build |  1 +
>>  tools/perf/arch/powerpc/util/machine.c | 24 
>>  2 files changed, 25 insertions(+)
>>  create mode 100644 tools/perf/arch/powerpc/util/machine.c
>>
>> diff --git a/tools/perf/arch/powerpc/util/Build b/tools/perf/arch/powerpc/util/Build
>> index e86e210bf514..b7945e5a543b 100644
>> --- a/tools/perf/arch/powerpc/util/Build
>> +++ b/tools/perf/arch/powerpc/util/Build
>> @@ -1,4 +1,5 @@
>>  perf-y += header.o
>> +perf-y += machine.o
>>  perf-y += kvm-stat.o
>>  perf-y += perf_regs.o
>>  perf-y += mem-events.o
>> diff --git a/tools/perf/arch/powerpc/util/machine.c b/tools/perf/arch/powerpc/util/machine.c
>> new file mode 100644
>> index ..c30e5cc88c16
>> --- /dev/null
>> +++ b/tools/perf/arch/powerpc/util/machine.c
>> @@ -0,0 +1,24 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include 
>> +#include 
>> +#include  // page_size
>> +#include "debug.h"
>> +#include "symbol.h"
>> +
>> +/* On powerpc kernel text segment start at memory addresses, 0xc000
>> + * whereas the modules are located at very high memory addresses,
>> + * for example 0xc0080xxx. The gap between end of kernel text segment
>> + * and beginning of first module's text segment is very high.
>> + * Therefore do not fill this gap and do not assign it to the kernel dso map.
>> + */
>> +
>> +void arch__symbols__fixup_end(struct symbol *p, struct symbol *c)
>> +{
>> +	if (strchr(p->name, '[') == NULL && strchr(c->name, '['))
>> +		/* Limit the range of last kernel symbol */
>> +		p->end += page_size;
>> +	else
>> +		p->end = c->start;
>> +	pr_debug4("%s sym:%s end:%#lx\n", __func__, p->name, p->end);
>> +}
>> -- 
>> 1.8.3.1
>>
> 


[PATCH] selftests/powerpc: Fix exit status of pkey tests

2021-01-18 Thread Sandipan Das
Since main() does not return a value explicitly, the
return values from FAIL_IF() conditions are ignored
and the tests can still pass irrespective of failures.
This makes sure that we always explicitly return the
correct test exit status.

Reported-by: Eirik Fuller 
Fixes: 1addb6444791 ("selftests/powerpc: Add test for execute-disabled pkeys")
Fixes: c27f2fd1705a ("selftests/powerpc: Add test for pkey siginfo verification")
Signed-off-by: Sandipan Das 
---
 tools/testing/selftests/powerpc/mm/pkey_exec_prot.c | 2 +-
 tools/testing/selftests/powerpc/mm/pkey_siginfo.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c b/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
index 9e5c7f3f498a..0af4f02669a1 100644
--- a/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
+++ b/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
@@ -290,5 +290,5 @@ static int test(void)
 
 int main(void)
 {
-   test_harness(test, "pkey_exec_prot");
+   return test_harness(test, "pkey_exec_prot");
 }
diff --git a/tools/testing/selftests/powerpc/mm/pkey_siginfo.c b/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
index 4f815d7c1214..2db76e56d4cb 100644
--- a/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
+++ b/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
@@ -329,5 +329,5 @@ static int test(void)
 
 int main(void)
 {
-   test_harness(test, "pkey_siginfo");
+   return test_harness(test, "pkey_siginfo");
 }
-- 
2.25.1