Re: [PATCH v3 03/12] powerpc/kexec_file: add helper functions for getting memory ranges

2020-07-17 Thread Hari Bathini



On 17/07/20 10:02 am, Hari Bathini wrote:
> 
> 
> On 15/07/20 5:19 am, Thiago Jung Bauermann wrote:
>>
>> Hello Hari,
>>
>> Hari Bathini  writes:
>>
>>> In kexec case, the kernel to be loaded uses the same memory layout as
>>> the running kernel. So, passing on the DT of the running kernel would
>>> be good enough.
>>>
>>> But in case of kdump, different memory ranges are needed to manage
>>> loading the kdump kernel, booting into it and exporting the elfcore
>>> of the crashing kernel. The ranges are exlude memory ranges, usable
>>
>> s/exlude/exclude/
>>
>>> memory ranges, reserved memory ranges and crash memory ranges.
>>>
>>> Exclude memory ranges specify the list of memory ranges to avoid while
>>> loading kdump segments. Usable memory ranges list the memory ranges
>>> that could be used for booting kdump kernel. Reserved memory ranges
>>> list the memory regions for the loading kernel's reserve map. Crash
>>> memory ranges list the memory ranges to be exported as the crashing
>>> kernel's elfcore.
>>>
>>> Add helper functions for setting up the above mentioned memory ranges.
>>> These helpers facilitate understanding the subsequent changes better
>>> and make it easy to set up the different memory ranges listed above, as
>>> and when appropriate.
>>>
>>> Signed-off-by: Hari Bathini 
>>> Tested-by: Pingfan Liu 
>>
> 
> 
> 
>>> +/**
>>> + * add_reserved_ranges - Adds "/reserved-ranges" regions exported by f/w
>>> + *   to the given memory ranges list.
>>> + * @mem_ranges:  Range list to add the memory ranges to.
>>> + *
>>> + * Returns 0 on success, negative errno on error.
>>> + */
>>> +int add_reserved_ranges(struct crash_mem **mem_ranges)
>>> +{
>>> +   int i, len, ret = 0;
>>> +   const __be32 *prop;
>>> +
>>> +   prop = of_get_property(of_root, "reserved-ranges", &len);
>>> +   if (!prop)
>>> +   return 0;
>>> +
>>> +   /*
>>> +* Each reserved range is an (address,size) pair, 2 cells each,
>>> +* totalling 4 cells per range.
>>
>> Can you assume that, or do you need to check the #address-cells and
>> #size-cells properties of the root node?
> 
> Taken from early_reserve_mem_dt() which did not seem to care.
> Should we be doing any different here?

On second thoughts, wouldn't hurt to be extra cautious. Will use
#address-cells & #size-cells to parse reserved-ranges.
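
Something along the lines of the below (a rough sketch only, not the final
patch; add_mem_range() here is a placeholder for the range-list helper this
series introduces):

/*
 * Sketch: parse "/reserved-ranges" honouring the root node's
 * #address-cells/#size-cells instead of assuming 2+2 cells per entry.
 */
int add_reserved_ranges(struct crash_mem **mem_ranges)
{
	int n_mem_addr_cells, n_mem_size_cells, cells, i, len, ret = 0;
	const __be32 *prop;

	prop = of_get_property(of_root, "reserved-ranges", &len);
	if (!prop)
		return 0;

	n_mem_addr_cells = of_n_addr_cells(of_root);
	n_mem_size_cells = of_n_size_cells(of_root);
	cells = n_mem_addr_cells + n_mem_size_cells;
	len /= sizeof(__be32);

	/* Each reserved range is an (address, size) pair. */
	for (i = 0; i < len / cells; i++) {
		u64 base, size;

		base = of_read_number(prop + (i * cells), n_mem_addr_cells);
		size = of_read_number(prop + (i * cells) + n_mem_addr_cells,
				      n_mem_size_cells);

		/* Placeholder for the helper that appends to the range list. */
		ret = add_mem_range(mem_ranges, base, size);
		if (ret)
			break;
	}

	return ret;
}

of_read_number() handles both 1-cell and 2-cell encodings, so the same loop
works whatever #address-cells/#size-cells the firmware reports.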

Thanks
Hari


[PATCH v3 3/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above

2020-07-17 Thread Pratik Rajesh Sampat
From POWER9 onwards, support for the HID1, HID4 and HID5 registers has been
removed.
Although mfspr on the above registers worked on Power9, on the Power10
simulator it is unrecognized. Move their assignment under the
check for machines below Power9.

Signed-off-by: Pratik Rajesh Sampat 
Reviewed-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/idle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index d439e11af101..d24d6671f3e8 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
 */
uint64_t lpcr_val   = mfspr(SPRN_LPCR);
uint64_t hid0_val   = mfspr(SPRN_HID0);
-   uint64_t hid1_val   = mfspr(SPRN_HID1);
-   uint64_t hid4_val   = mfspr(SPRN_HID4);
-   uint64_t hid5_val   = mfspr(SPRN_HID5);
uint64_t hmeer_val  = mfspr(SPRN_HMEER);
uint64_t msr_val = MSR_IDLE;
uint64_t psscr_val = pnv_deepest_stop_psscr_val;
@@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
 
/* Only p8 needs to set extra HID regiters */
if (!pvr_version_is(PVR_POWER9)) {
+   uint64_t hid1_val = mfspr(SPRN_HID1);
+   uint64_t hid4_val = mfspr(SPRN_HID4);
+   uint64_t hid5_val = mfspr(SPRN_HID5);
 
rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
if (rc != 0)
-- 
2.25.4



[PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

2020-07-17 Thread Pratik Rajesh Sampat
Rename the variable "pnv_first_spr_loss_level" to
"pnv_first_fullstate_loss_level".

pnv_first_spr_loss_level is supposed to be the earliest state that
has OPAL_PM_LOSE_FULL_CONTEXT set. However, since shallower states also
lose SPR values, the current name is misleading.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index f62904f70fc6..d439e11af101 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -48,7 +48,7 @@ static bool default_stop_found;
  * First stop state levels when SPR and TB loss can occur.
  */
 static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+static u64 pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
 
 /*
  * psscr value and mask of the deepest stop idle state.
@@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
  */
mmcr0   = mfspr(SPRN_MMCR0);
}
-   if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
+   if ((psscr & PSSCR_RL_MASK) >= pnv_first_fullstate_loss_level) {
sprs.lpcr   = mfspr(SPRN_LPCR);
sprs.hfscr  = mfspr(SPRN_HFSCR);
sprs.fscr   = mfspr(SPRN_FSCR);
@@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
 * just always test PSSCR for SPR/TB state loss.
 */
pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
-   if (likely(pls < pnv_first_spr_loss_level)) {
+   if (likely(pls < pnv_first_fullstate_loss_level)) {
if (sprs_saved)
atomic_stop_thread_idle();
goto out;
@@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void)
 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 */
pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-   pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+   pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
for (i = 0; i < nr_pnv_idle_states; i++) {
int err;
struct pnv_idle_states_t *state = &pnv_idle_states[i];
@@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void)
pnv_first_tb_loss_level = psscr_rl;
 
if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
-(pnv_first_spr_loss_level > psscr_rl))
-   pnv_first_spr_loss_level = psscr_rl;
+(pnv_first_fullstate_loss_level > psscr_rl))
+   pnv_first_fullstate_loss_level = psscr_rl;
 
/*
 * The idle code does not deal with TB loss occurring
@@ -,8 +,8 @@ static void __init pnv_power9_idle_init(void)
 * compatibility.
 */
if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
-(pnv_first_spr_loss_level > psscr_rl))
-   pnv_first_spr_loss_level = psscr_rl;
+(pnv_first_fullstate_loss_level > psscr_rl))
+   pnv_first_fullstate_loss_level = psscr_rl;
 
err = validate_psscr_val_mask(&state->psscr_val,
  &state->psscr_mask,
@@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void)
}
 
pr_info("cpuidle-powernv: First stop level that may lose SPRs = 
0x%llx\n",
-   pnv_first_spr_loss_level);
+   pnv_first_fullstate_loss_level);
 
pr_info("cpuidle-powernv: First stop level that may lose timebase = 
0x%llx\n",
pnv_first_tb_loss_level);
-- 
2.25.4



[PATCH v3 0/3] powernv/idle: Power9 idle cleanup

2020-07-17 Thread Pratik Rajesh Sampat
v2: https://lkml.org/lkml/2020/7/10/28
Changelog v2-->v3:
1. Based on comments from Nicholas Piggin, introduce a cleanup patch
   which checks for PVR_POWER9 instead of CPU_FTR_ARCH_300, allowing
   for finer-grained checks where one processor generation can have
   multiple ways of handling idle

2. Removed the patch saving/restoring DAWR, DAWRX for P10 systems. Based
   on the discussions it has become evident that PVR-based checks are
   the way to go; however, the P10 PVR is yet to be upstreamed, hence
   this patch is shelved for later.

Pratik Rajesh Sampat (3):
  powerpc/powernv/idle: Replace CPU features checks with PVR checks
  powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
  powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above

 arch/powerpc/platforms/powernv/idle.c | 38 +--
 1 file changed, 19 insertions(+), 19 deletions(-)

-- 
2.25.4



[PATCH v3 1/3] powerpc/powernv/idle: Replace CPU features checks with PVR checks

2020-07-17 Thread Pratik Rajesh Sampat
As the idle framework's architecture support is incomplete, instead of
checking just the processor type advertised in the device tree CPU
features, check the Processor Version Register (PVR) so that finer
granularity can be leveraged while making processor checks.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 2dd467383a88..f62904f70fc6 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -92,7 +92,7 @@ static int pnv_save_sprs_for_deep_states(void)
if (rc != 0)
return rc;
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (pvr_version_is(PVR_POWER9)) {
rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
if (rc)
return rc;
@@ -116,7 +116,7 @@ static int pnv_save_sprs_for_deep_states(void)
return rc;
 
/* Only p8 needs to set extra HID regiters */
-   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (!pvr_version_is(PVR_POWER9)) {
 
rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
if (rc != 0)
@@ -971,7 +971,7 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
 
__ppc64_runlatch_off();
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300) && deepest_stop_found) {
+   if (pvr_version_is(PVR_POWER9) && deepest_stop_found) {
unsigned long psscr;
 
psscr = mfspr(SPRN_PSSCR);
@@ -1175,7 +1175,7 @@ static void __init pnv_disable_deep_states(void)
pr_warn("cpuidle-powernv: Disabling idle states that lose full 
context\n");
pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300) &&
+   if (pvr_version_is(PVR_POWER9) &&
(pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
/*
 * Use the default stop state for CPU-Hotplug
@@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void)
return;
}
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   if (pvr_version_is(PVR_POWER9))
pnv_power9_idle_init();
 
for (i = 0; i < nr_pnv_idle_states; i++)
@@ -1278,7 +1278,7 @@ static int pnv_parse_cpuidle_dt(void)
pnv_idle_states[i].residency_ns = temp_u32[i];
 
/* For power9 */
-   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (pvr_version_is(PVR_POWER9)) {
/* Read pm_crtl_val */
if (of_property_read_u64_array(np, "ibm,cpu-idle-state-psscr",
   temp_u64, nr_idle_states)) {
@@ -1337,7 +1337,7 @@ static int __init pnv_init_idle_states(void)
if (cpu == cpu_first_thread_sibling(cpu))
p->idle_state = (1 << threads_per_core) - 1;
 
-   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (!pvr_version_is(PVR_POWER9)) {
/* P7/P8 nap */
p->thread_idle_state = PNV_THREAD_RUNNING;
} else {
-- 
2.25.4



Re: [PATCH v6] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime

2020-07-17 Thread Bruno Meneguele
On Mon, Jul 13, 2020 at 01:48:30PM -0300, Bruno Meneguele wrote:
> The IMA_APPRAISE_BOOTPARAM config allows enabling different "ima_appraise="
> modes - log, fix, enforce - at run time, but not when IMA architecture
> specific policies are enabled.  This prevents properly labeling the
> filesystem on systems where secure boot is supported, but not enabled on the
> platform.  Only when secure boot is actually enabled should these IMA
> appraise modes be disabled.
> 
> This patch removes the compile time dependency and makes it a runtime
> decision, based on the secure boot state of that platform.
> 
> Test results as follows:
> 
> -> x86-64 with secure boot enabled
> 
> [0.015637] Kernel command line: <...> ima_policy=appraise_tcb 
> ima_appraise=fix
> [0.015668] ima: Secure boot enabled: ignoring ima_appraise=fix boot 
> parameter option
> 
> -> powerpc with secure boot disabled
> 
> [0.00] Kernel command line: <...> ima_policy=appraise_tcb 
> ima_appraise=fix
> [0.00] Secure boot mode disabled
> 
> -> Running the system without secure boot and with both options set:
> 
> CONFIG_IMA_APPRAISE_BOOTPARAM=y
> CONFIG_IMA_ARCH_POLICY=y
> 
> Audit prompts "missing-hash" but still allows execution and, consequently,
> filesystem labeling:
> 
> type=INTEGRITY_DATA msg=audit(07/09/2020 12:30:27.778:1691) : pid=4976
> uid=root auid=root ses=2
> subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 op=appraise_data
> cause=missing-hash comm=bash name=/usr/bin/evmctl dev="dm-0" ino=493150
> res=no
> 
> Cc: sta...@vger.kernel.org
> Fixes: d958083a8f64 ("x86/ima: define arch_get_ima_policy() for x86")
> Signed-off-by: Bruno Meneguele 
> ---
> v6:
>   - explicitly print the bootparam being ignored to the user (Mimi)
> v5:
>   - add pr_info() to inform user the ima_appraise= boot param is being
>   ignored due to secure boot enabled (Nayna)
>   - add some testing results to commit log
> v4:
>   - instead of change arch_policy loading code, check secure boot state at
>   "ima_appraise=" parameter handler (Mimi)
> v3:
>   - extend secure boot arch checker to also consider trusted boot
>   - enforce IMA appraisal when secure boot is effectively enabled (Nayna)
>   - fix ima_appraise flag assignment by or'ing it (Mimi)
> v2:
>   - pr_info() message prefix correction
>  security/integrity/ima/Kconfig| 2 +-
>  security/integrity/ima/ima_appraise.c | 6 ++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
> index edde88dbe576..62dc11a5af01 100644
> --- a/security/integrity/ima/Kconfig
> +++ b/security/integrity/ima/Kconfig
> @@ -232,7 +232,7 @@ config IMA_APPRAISE_REQUIRE_POLICY_SIGS
>  
>  config IMA_APPRAISE_BOOTPARAM
>   bool "ima_appraise boot parameter"
> - depends on IMA_APPRAISE && !IMA_ARCH_POLICY
> + depends on IMA_APPRAISE
>   default y
>   help
> This option enables the different "ima_appraise=" modes
> diff --git a/security/integrity/ima/ima_appraise.c 
> b/security/integrity/ima/ima_appraise.c
> index a9649b04b9f1..28a59508c6bd 100644
> --- a/security/integrity/ima/ima_appraise.c
> +++ b/security/integrity/ima/ima_appraise.c
> @@ -19,6 +19,12 @@
>  static int __init default_appraise_setup(char *str)
>  {
>  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> + if (arch_ima_get_secureboot()) {
> + pr_info("Secure boot enabled: ignoring ima_appraise=%s boot 
> parameter option",
> + str);
> + return 1;
> + }
> +
>   if (strncmp(str, "off", 3) == 0)
>   ima_appraise = 0;
>   else if (strncmp(str, "log", 3) == 0)
> -- 
> 2.26.2
> 

Ping for review.

Many thanks.

-- 
bmeneg 
PGP Key: http://bmeneg.com/pubkey.txt




[PATCH] macintosh/adb: Replace HTTP links with HTTPS ones

2020-07-17 Thread Alexander A. Klimov
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
For each line:
  If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
  Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov 
---
 Continuing my work started at 93431e0607e5.
 See also: git log --oneline '--author=Alexander A. Klimov 
' v5.7..master

 If there are any URLs to be removed completely
 or at least not (just) HTTPSified:
 Just clearly say so and I'll *undo my change*.
 See also: https://lkml.org/lkml/2020/6/27/64

 If there are any valid, but yet not changed URLs:
 See: https://lkml.org/lkml/2020/6/26/837

 If you apply the patch, please let me know.


 drivers/macintosh/adb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index e49d1f287a17..73b396189039 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -163,7 +163,7 @@ static int adb_scan_bus(void)
 * See if anybody actually moved. This is suggested
 * by HW TechNote 01:
 *
-* http://developer.apple.com/technotes/hw/hw_01.html
+* https://developer.apple.com/technotes/hw/hw_01.html
 */
adb_request(&req, NULL, ADBREQ_SYNC | ADBREQ_REPLY, 1,
(highFree << 4) | 0xf);
-- 
2.27.0



Re: [PATCH v3 02/12] powerpc/kexec_file: mark PPC64 specific code

2020-07-17 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> On 16/07/20 7:19 am, Thiago Jung Bauermann wrote:
>> 
>> I didn't forget about this patch. I just wanted to see more of the
>> changes before commenting on it.
>> 
>> Hari Bathini  writes:
>> 
>>> Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
>>> specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
>>> rename purgatory/trampoline.S to purgatory/trampoline_64.S in the
>>> same spirit.
>> 
>> There's only a 64 bit implementation of kexec_file_load() so this is a
>> somewhat theoretical exercise, but there's no harm in getting the code
>> organized, so:
>> 
>> Reviewed-by: Thiago Jung Bauermann 
>> 
>> I have just one question below.
>
> 
>
>>> +/**
>>> + * setup_new_fdt_ppc64 - Update the flattened device-tree of the kernel
>>> + *   being loaded.
>>> + * @image:   kexec image being loaded.
>>> + * @fdt: Flattened device tree for the next kernel.
>>> + * @initrd_load_addr:Address where the next initrd will be loaded.
>>> + * @initrd_len:  Size of the next initrd, or 0 if there will be 
>>> none.
>>> + * @cmdline: Command line for the next kernel, or NULL if 
>>> there will
>>> + *   be none.
>>> + *
>>> + * Returns 0 on success, negative errno on error.
>>> + */
>>> +int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>>> +   unsigned long initrd_load_addr,
>>> +   unsigned long initrd_len, const char *cmdline)
>>> +{
>>> +   int chosen_node, ret;
>>> +
>>> +   /* Remove memory reservation for the current device tree. */
>>> +   ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
>>> +fdt_totalsize(initial_boot_params));
>>> +   if (ret == 0)
>>> +   pr_debug("Removed old device tree reservation.\n");
>>> +   else if (ret != -ENOENT) {
>>> +   pr_err("Failed to remove old device-tree reservation.\n");
>>> +   return ret;
>>> +   }
>>> +
>>> +   ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len,
>>> +   cmdline, &chosen_node);
>>> +   if (ret)
>>> +   return ret;
>>> +
>>> +   ret = fdt_setprop(fdt, chosen_node, "linux,booted-from-kexec", NULL, 0);
>>> +   if (ret)
>>> +   pr_err("Failed to update device-tree with 
>>> linux,booted-from-kexec\n");
>>> +
>>> +   return ret;
>>> +}
>> 
>> For setup_purgatory_ppc64() you start with an empty function and build
>> from there, but for setup_new_fdt_ppc64() you moved some code here. Is
>> the code above 64 bit specific?
>
> Actually, I was not quite sure if fdt updates like in patch 6 & patch 9 can be
> done after setup_ima_buffer() call. If you can confirm, I will move them back
> to setup_purgatory()

Hari and I discussed this off-line and we came to the conclusion that
this code can be moved back to setup_new_fdt().

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


[PATCH] macintosh/therm_adt746x: Replace HTTP links with HTTPS ones

2020-07-17 Thread Alexander A. Klimov
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
For each line:
  If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
  Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov 
---
 Continuing my work started at 93431e0607e5.
 See also: git log --oneline '--author=Alexander A. Klimov 
' v5.7..master

 If there are any URLs to be removed completely
 or at least not (just) HTTPSified:
 Just clearly say so and I'll *undo my change*.
 See also: https://lkml.org/lkml/2020/6/27/64

 If there are any valid, but yet not changed URLs:
 See: https://lkml.org/lkml/2020/6/26/837

 If you apply the patch, please let me know.


 drivers/macintosh/therm_adt746x.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/therm_adt746x.c 
b/drivers/macintosh/therm_adt746x.c
index 8f7725dc2166..7e218437730c 100644
--- a/drivers/macintosh/therm_adt746x.c
+++ b/drivers/macintosh/therm_adt746x.c
@@ -5,8 +5,8 @@
  * Copyright (C) 2003, 2004 Colin Leroy, Rasmus Rohde, Benjamin Herrenschmidt
  *
  * Documentation from 115254175ADT7467_pra.pdf and 3686221171167ADT7460_b.pdf
- * http://www.onsemi.com/PowerSolutions/product.do?id=ADT7467
- * http://www.onsemi.com/PowerSolutions/product.do?id=ADT7460
+ * https://www.onsemi.com/PowerSolutions/product.do?id=ADT7467
+ * https://www.onsemi.com/PowerSolutions/product.do?id=ADT7460
  *
  */
 
-- 
2.27.0



Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 1:44 PM, Alan Stern st...@rowland.harvard.edu wrote:

> On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote:
>> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu 
>> wrote:
>> 
>> >> > I agree with Nick: A memory barrier is needed somewhere between the
>> >> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
>> >> > with the Store Buffer pattern having a memory barrier on only one side,
>> >> > and it is well known that this arrangement does not guarantee any
>> >> > ordering.
>> >> 
>> >> Yes, I see this now. I'm still trying to wrap my head around why the 
>> >> memory
>> >> barrier at the end of membarrier() needs to be paired with a scheduler
>> >> barrier though.
>> > 
>> > The memory barrier at the end of membarrier() on CPU0 is necessary in
>> > order to enforce the guarantee that any writes occurring on CPU1 before
>> > the membarrier() is executed will be visible to any code executing on
>> > CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
>> > 
>> >CPU0CPU1
>> >x = 1
>> >barrier()
>> >y = 1
>> >r2 = y
>> >membarrier():
>> >  a: smp_mb()
>> >  b: send IPI   IPI-induced mb
>> >  c: smp_mb()
>> >r1 = x
>> > 
>> > The writes to x and y are unordered by the hardware, so it's possible to
>> > have r2 = 1 even though the write to x doesn't execute until b.  If the
>> > memory barrier at c is omitted then "r1 = x" can be reordered before b
>> > (although not before a), so we get r1 = 0.  This violates the guarantee
>> > that membarrier() is supposed to provide.
>> > 
>> > The timing of the memory barrier at c has to ensure that it executes
>> > after the IPI-induced memory barrier on CPU1.  If it happened before
>> > then we could still end up with r1 = 0.  That's why the pairing matters.
>> > 
>> > I hope this helps your head get properly wrapped.  :-)
>> 
>> It does help a bit! ;-)
>> 
>> This explains this part of the comment near the smp_mb at the end of 
>> membarrier:
>> 
>>  * Memory barrier on the caller thread _after_ we finished
>>  * waiting for the last IPI. [...]
>> 
>> However, it does not explain why it needs to be paired with a barrier in the
>> scheduler, clearly for the case where the IPI is skipped. I wonder whether 
>> this
>> part
>> of the comment is factually correct:
>> 
>>  * [...] Matches memory barriers around rq->curr modification in 
>> scheduler.
> 
> The reasoning is pretty much the same as above:
> 
>   CPU0CPU1
>   x = 1
>   barrier()
>   y = 1
>   r2 = y
>   membarrier():
> a: smp_mb()
>   switch to kthread (includes mb)
> b: read rq->curr == kthread
>   switch to user (includes mb)
> c: smp_mb()
>   r1 = x
> 
> Once again, it is possible that x = 1 doesn't become visible to CPU0
> until shortly before b.  But if c is omitted then "r1 = x" can be
> reordered before b (to any time after a), so we can have r1 = 0.
> 
> Here the timing requirement is that c executes after the first memory
> barrier on CPU1 -- which is one of the ones around the rq->curr
> modification.  (In fact, in this scenario CPU1's switch back to the user
> process is irrelevant.)

That indeed covers the last scenario I was wondering about. Thanks Alan!

Mathieu

> 
> Alan Stern

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Alan Stern
On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote:
> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu 
> wrote:
> 
> >> > I agree with Nick: A memory barrier is needed somewhere between the
> >> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
> >> > with the Store Buffer pattern having a memory barrier on only one side,
> >> > and it is well known that this arrangement does not guarantee any
> >> > ordering.
> >> 
> >> Yes, I see this now. I'm still trying to wrap my head around why the memory
> >> barrier at the end of membarrier() needs to be paired with a scheduler
> >> barrier though.
> > 
> > The memory barrier at the end of membarrier() on CPU0 is necessary in
> > order to enforce the guarantee that any writes occurring on CPU1 before
> > the membarrier() is executed will be visible to any code executing on
> > CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
> > 
> > CPU0CPU1
> > x = 1
> > barrier()
> > y = 1
> > r2 = y
> > membarrier():
> >   a: smp_mb()
> >   b: send IPI   IPI-induced mb
> >   c: smp_mb()
> > r1 = x
> > 
> > The writes to x and y are unordered by the hardware, so it's possible to
> > have r2 = 1 even though the write to x doesn't execute until b.  If the
> > memory barrier at c is omitted then "r1 = x" can be reordered before b
> > (although not before a), so we get r1 = 0.  This violates the guarantee
> > that membarrier() is supposed to provide.
> > 
> > The timing of the memory barrier at c has to ensure that it executes
> > after the IPI-induced memory barrier on CPU1.  If it happened before
> > then we could still end up with r1 = 0.  That's why the pairing matters.
> > 
> > I hope this helps your head get properly wrapped.  :-)
> 
> It does help a bit! ;-)
> 
> This explains this part of the comment near the smp_mb at the end of 
> membarrier:
> 
>  * Memory barrier on the caller thread _after_ we finished
>  * waiting for the last IPI. [...]
> 
> However, it does not explain why it needs to be paired with a barrier in the
> scheduler, clearly for the case where the IPI is skipped. I wonder whether 
> this part
> of the comment is factually correct:
> 
>  * [...] Matches memory barriers around rq->curr modification in 
> scheduler.

The reasoning is pretty much the same as above:

CPU0CPU1
x = 1
barrier()
y = 1
r2 = y
membarrier():
  a: smp_mb()
switch to kthread (includes mb)
  b: read rq->curr == kthread
switch to user (includes mb)
  c: smp_mb()
r1 = x

Once again, it is possible that x = 1 doesn't become visible to CPU0 
until shortly before b.  But if c is omitted then "r1 = x" can be 
reordered before b (to any time after a), so we can have r1 = 0.

Here the timing requirement is that c executes after the first memory 
barrier on CPU1 -- which is one of the ones around the rq->curr 
modification.  (In fact, in this scenario CPU1's switch back to the user 
process is irrelevant.)

Alan Stern


[powerpc:next] BUILD SUCCESS 61f879d97ce4510dd29d676a20d67692e3b34806

2020-07-17 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next
branch HEAD: 61f879d97ce4510dd29d676a20d67692e3b34806  powerpc/pseries: Detect 
secure and trusted boot state of the system.

elapsed time: 1716m

configs tested: 103
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm  allyesconfig
arm  allmodconfig
arm   allnoconfig
arm64allyesconfig
arm64   defconfig
arm64allmodconfig
arm64 allnoconfig
arm   aspeed_g4_defconfig
mips   ip27_defconfig
m68km5307c3_defconfig
sh   j2_defconfig
powerpc   ppc64_defconfig
parisc   allyesconfig
i386  allnoconfig
i386 allyesconfig
i386defconfig
i386  debian-10.3
ia64 allmodconfig
ia64  allnoconfig
ia64 allyesconfig
ia64defconfig
m68k allmodconfig
m68k  allnoconfig
m68k   sun3_defconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
nios2allyesconfig
openriscdefconfig
c6x  allyesconfig
c6x   allnoconfig
openrisc allyesconfig
nds32   defconfig
nds32 allnoconfig
csky allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
h8300allmodconfig
xtensa  defconfig
arc defconfig
arc  allyesconfig
sh   allmodconfig
shallnoconfig
microblazeallnoconfig
mips allyesconfig
mips  allnoconfig
mips allmodconfig
pariscallnoconfig
parisc  defconfig
parisc   allmodconfig
powerpc  allyesconfig
powerpc  rhel-kconfig
powerpc  allmodconfig
powerpc   allnoconfig
powerpc defconfig
x86_64   randconfig-a005-20200717
x86_64   randconfig-a006-20200717
x86_64   randconfig-a002-20200717
x86_64   randconfig-a001-20200717
x86_64   randconfig-a003-20200717
x86_64   randconfig-a004-20200717
i386 randconfig-a001-20200717
i386 randconfig-a005-20200717
i386 randconfig-a002-20200717
i386 randconfig-a006-20200717
i386 randconfig-a003-20200717
i386 randconfig-a004-20200717
x86_64   randconfig-a012-20200716
x86_64   randconfig-a011-20200716
x86_64   randconfig-a016-20200716
x86_64   randconfig-a014-20200716
x86_64   randconfig-a013-20200716
x86_64   randconfig-a015-20200716
i386 randconfig-a016-20200717
i386 randconfig-a011-20200717
i386 randconfig-a015-20200717
i386 randconfig-a012-20200717
i386 randconfig-a013-20200717
i386 randconfig-a014-20200717
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
s390 allyesconfig
s390  allnoconfig
s390 allmodconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
sparc64 defconfig
sparc64   allnoconfig
sparc64  allyesconfig
sparc64  allmodconfig
x86_64rhel-7.6-kselftests
x86_64   rhel-8.3
x86_64  kexec
x86_64

Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu wrote:

>> > I agree with Nick: A memory barrier is needed somewhere between the
>> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
>> > with the Store Buffer pattern having a memory barrier on only one side,
>> > and it is well known that this arrangement does not guarantee any
>> > ordering.
>> 
>> Yes, I see this now. I'm still trying to wrap my head around why the memory
>> barrier at the end of membarrier() needs to be paired with a scheduler
>> barrier though.
> 
> The memory barrier at the end of membarrier() on CPU0 is necessary in
> order to enforce the guarantee that any writes occurring on CPU1 before
> the membarrier() is executed will be visible to any code executing on
> CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
> 
>   CPU0CPU1
>   x = 1
>   barrier()
>   y = 1
>   r2 = y
>   membarrier():
> a: smp_mb()
> b: send IPI   IPI-induced mb
> c: smp_mb()
>   r1 = x
> 
> The writes to x and y are unordered by the hardware, so it's possible to
> have r2 = 1 even though the write to x doesn't execute until b.  If the
> memory barrier at c is omitted then "r1 = x" can be reordered before b
> (although not before a), so we get r1 = 0.  This violates the guarantee
> that membarrier() is supposed to provide.
> 
> The timing of the memory barrier at c has to ensure that it executes
> after the IPI-induced memory barrier on CPU1.  If it happened before
> then we could still end up with r1 = 0.  That's why the pairing matters.
> 
> I hope this helps your head get properly wrapped.  :-)

It does help a bit! ;-)

This explains this part of the comment near the smp_mb at the end of membarrier:

 * Memory barrier on the caller thread _after_ we finished
 * waiting for the last IPI. [...]

However, it does not explain why it needs to be paired with a barrier in the
scheduler, clearly for the case where the IPI is skipped. I wonder whether this 
part
of the comment is factually correct:

 * [...] Matches memory barriers around rq->curr modification in 
scheduler.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Alan Stern
> > I agree with Nick: A memory barrier is needed somewhere between the
> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
> > with the Store Buffer pattern having a memory barrier on only one side,
> > and it is well known that this arrangement does not guarantee any
> > ordering.
> 
> Yes, I see this now. I'm still trying to wrap my head around why the memory
> barrier at the end of membarrier() needs to be paired with a scheduler
> barrier though.

The memory barrier at the end of membarrier() on CPU0 is necessary in 
order to enforce the guarantee that any writes occurring on CPU1 before 
the membarrier() is executed will be visible to any code executing on 
CPU0 after the membarrier().  Ignoring the kthread issue, we can have:

CPU0CPU1
x = 1
barrier()
y = 1
r2 = y
membarrier():
  a: smp_mb()
  b: send IPI   IPI-induced mb
  c: smp_mb()
r1 = x

The writes to x and y are unordered by the hardware, so it's possible to 
have r2 = 1 even though the write to x doesn't execute until b.  If the 
memory barrier at c is omitted then "r1 = x" can be reordered before b 
(although not before a), so we get r1 = 0.  This violates the guarantee 
that membarrier() is supposed to provide.

The timing of the memory barrier at c has to ensure that it executes 
after the IPI-induced memory barrier on CPU1.  If it happened before 
then we could still end up with r1 = 0.  That's why the pairing matters.

I hope this helps your head get properly wrapped.  :-)

Alan Stern


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 10:51 AM, Alan Stern st...@rowland.harvard.edu wrote:

> On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote:
>> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu 
>> wrote:
>> 
>> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
>> >> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
>> >> mathieu.desnoy...@efficios.com wrote:
>> >> 
>> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> >> > mathieu.desnoy...@efficios.com wrote:
>> >> > 
>> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
>> >> >> wrote:
>> >> >>> I should be more complete here, especially since I was complaining
>> >> >>> about unclear barrier comment :)
>> >> >>> 
>> >> >>> 
>> >> >>> CPU0 CPU1
>> >> >>> a. user stuff1. user stuff
>> >> >>> b. membarrier()  2. enter kernel
>> >> >>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
>> >> >>> d. read rq->curr 4. rq->curr switched to kthread
>> >> >>> e. is kthread, skip IPI  5. switch_to kthread
>> >> >>> f. return to user6. rq->curr switched to user thread
>> >> >>> g. user stuff7. switch_to user thread
>> >> >>> 8. exit kernel
>> >> >>> 9. more user stuff
> 
> ...
> 
>> >> Requiring a memory barrier between update of rq->curr (back to current 
>> >> process's
>> >> thread) and following user-space memory accesses does not seem to 
>> >> guarantee
>> >> anything more than what the initial barrier at the beginning of __schedule
>> >> already
>> >> provides, because the guarantees are only about accesses to user-space 
>> >> memory.
> 
> ...
> 
>> > Is it correct to say that the switch_to operations in 5 and 7 include
>> > memory barriers?  If they do, then skipping the IPI should be okay.
>> > 
>> > The reason is as follows: The guarantee you need to enforce is that
>> > anything written by CPU0 before the membarrier() will be visible to CPU1
>> > after it returns to user mode.  Let's say that a writes to X and 9
>> > reads from X.
>> > 
>> > Then we have an instance of the Store Buffer pattern:
>> > 
>> >CPU0CPU1
>> >a. Write X  6. Write rq->curr for user thread
>> >c. smp_mb() 7. switch_to memory barrier
>> >d. Read rq->curr9. Read X
>> > 
>> > In this pattern, the memory barriers make it impossible for both reads
>> > to miss their corresponding writes.  Since d does fail to read 6 (it
>> > sees the earlier value stored by 4), 9 must read a.
>> > 
>> > The other guarantee you need is that g on CPU0 will observe anything
>> > written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
>> > memory barrier and d reads from 4.
>> 
>> Right, and Nick's reply involving pairs of loads/stores on each side
>> clarifies the situation even further.
> 
> The key part of my reply was the question: "Is it correct to say that
> the switch_to operations in 5 and 7 include memory barriers?"  From the
> text quoted above and from Nick's reply, it seems clear that they do
> not.

I remember that switch_mm implies it, but not switch_to.

The scenario that triggered this discussion is when the scheduler does a
lazy tlb entry/exit, which is basically a switch from a user task to
a kernel thread without changing the mm, usually switching back afterwards.
This optimization means the rq->curr mm temporarily differs, which prevents
IPIs from being sent by membarrier, but without involving a switch_mm.
This requires explicit memory barriers either on entry/exit of lazy tlb
mode, or explicit barriers in the scheduler for those special-cases.

> I agree with Nick: A memory barrier is needed somewhere between the
> assignment at 6 and the return to user mode at 8.  Otherwise you end up
> with the Store Buffer pattern having a memory barrier on only one side,
> and it is well known that this arrangement does not guarantee any
> ordering.

Yes, I see this now. I'm still trying to wrap my head around why the memory
barrier at the end of membarrier() needs to be paired with a scheduler
barrier though.

> One thing I don't understand about all this: Any context switch has to
> include a memory barrier somewhere, but both you and Nick seem to be
> saying that steps 6 and 7 don't include (or don't need) any memory
> barriers.  What am I missing?

All context switches have the smp_mb__before_spinlock at the beginning of
__schedule(), which I suspect is what you are referring to. However, this
barrier is before the store to rq->curr, not after.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Alan Stern
On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote:
> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote:
> 
> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
> >> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
> >> mathieu.desnoy...@efficios.com wrote:
> >> 
> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> >> > mathieu.desnoy...@efficios.com wrote:
> >> > 
> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
> >> >> wrote:
> >> >>> I should be more complete here, especially since I was complaining
> >> >>> about unclear barrier comment :)
> >> >>> 
> >> >>> 
> >> >>> CPU0 CPU1
> >> >>> a. user stuff1. user stuff
> >> >>> b. membarrier()  2. enter kernel
> >> >>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
> >> >>> d. read rq->curr 4. rq->curr switched to kthread
> >> >>> e. is kthread, skip IPI  5. switch_to kthread
> >> >>> f. return to user6. rq->curr switched to user thread
> >> >>> g. user stuff7. switch_to user thread
> >> >>> 8. exit kernel
> >> >>> 9. more user stuff

...

> >> Requiring a memory barrier between update of rq->curr (back to current 
> >> process's
> >> thread) and following user-space memory accesses does not seem to guarantee
> >> anything more than what the initial barrier at the beginning of __schedule
> >> already
> >> provides, because the guarantees are only about accesses to user-space 
> >> memory.

...

> > Is it correct to say that the switch_to operations in 5 and 7 include
> > memory barriers?  If they do, then skipping the IPI should be okay.
> > 
> > The reason is as follows: The guarantee you need to enforce is that
> > anything written by CPU0 before the membarrier() will be visible to CPU1
> > after it returns to user mode.  Let's say that a writes to X and 9
> > reads from X.
> > 
> > Then we have an instance of the Store Buffer pattern:
> > 
> > CPU0CPU1
> > a. Write X  6. Write rq->curr for user thread
> > c. smp_mb() 7. switch_to memory barrier
> > d. Read rq->curr9. Read X
> > 
> > In this pattern, the memory barriers make it impossible for both reads
> > to miss their corresponding writes.  Since d does fail to read 6 (it
> > sees the earlier value stored by 4), 9 must read a.
> > 
> > The other guarantee you need is that g on CPU0 will observe anything
> > written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
> > memory barrier and d reads from 4.
> 
> Right, and Nick's reply involving pairs of loads/stores on each side
> clarifies the situation even further.

The key part of my reply was the question: "Is it correct to say that 
the switch_to operations in 5 and 7 include memory barriers?"  From the 
text quoted above and from Nick's reply, it seems clear that they do 
not.

I agree with Nick: A memory barrier is needed somewhere between the 
assignment at 6 and the return to user mode at 8.  Otherwise you end up 
with the Store Buffer pattern having a memory barrier on only one side, 
and it is well known that this arrangement does not guarantee any 
ordering.
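
As an aside, here is a minimal user-space sketch of that one-sided Store
Buffer case (C11 atomics standing in for smp_mb(); illustration only, not
kernel code). With a full fence in only one thread, the r0 == r1 == 0
outcome stays permitted on weakly ordered hardware:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;

static void *thread0(void *unused)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier on this side only */
	r0 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

static void *thread1(void *unused)
{
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	/* no fence here: the one-sided arrangement under discussion */
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_t a, b;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_create(&a, NULL, thread0, NULL);
		pthread_create(&b, NULL, thread1, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (r0 == 0 && r1 == 0)
			printf("iteration %d: r0 == r1 == 0 observed\n", i);
	}
	return 0;
}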

One thing I don't understand about all this: Any context switch has to 
include a memory barrier somewhere, but both you and Nick seem to be 
saying that steps 6 and 7 don't include (or don't need) any memory 
barriers.  What am I missing?

Alan Stern


[v3 13/15] tools/perf: Add perf tools support for extended register capability in powerpc

2020-07-17 Thread Athira Rajeev
From: Anju T Sudhakar 

Add extended regs to sample_reg_mask on the tool side for use
with the `-I?` option. The perf tools side uses the extended mask to display
the platform-supported register names (with the -I? option) to the user
and also sends this mask to the kernel to capture the extended registers
in each sample. Hence, decide the mask value based on the processor
version.

Currently definitions for `mfspr`, `SPRN_PVR` are part of
`arch/powerpc/util/header.c`. Move this to a header file so that
these definitions can be re-used in other source files as well.

Signed-off-by: Anju T Sudhakar 
[Decide extended mask at run time based on platform]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++-
 tools/perf/arch/powerpc/include/perf_regs.h |  5 ++-
 tools/perf/arch/powerpc/util/header.c   |  9 +
 tools/perf/arch/powerpc/util/perf_regs.c| 49 +
 tools/perf/arch/powerpc/util/utils_header.h | 15 
 5 files changed, 82 insertions(+), 10 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/utils_header.h

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..225c64c 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+
+#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h 
b/tools/perf/arch/powerpc/include/perf_regs.h
index e18a355..46ed00d 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -64,7 +64,10 @@
[PERF_REG_POWERPC_DAR] = "dar",
[PERF_REG_POWERPC_DSISR] = "dsisr",
[PERF_REG_POWERPC_SIER] = "sier",
-   [PERF_REG_POWERPC_MMCRA] = "mmcra"
+   [PERF_REG_POWERPC_MMCRA] = "mmcra",
+   [PERF_REG_POWERPC_MMCR0] = "mmcr0",
+   [PERF_REG_POWERPC_MMCR1] = "mmcr1",
+   [PERF_REG_POWERPC_MMCR2] = "mmcr2",
 };
 
 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/header.c 
b/tools/perf/arch/powerpc/util/header.c
index d487007..1a95017 100644
--- a/tools/perf/arch/powerpc/util/header.c
+++ b/tools/perf/arch/powerpc/util/header.c
@@ -7,17 +7,10 @@
 #include 
 #include 
 #include "header.h"
+#include "utils_header.h"
 #include "metricgroup.h"
 #include 
 
-#define mfspr(rn)   ({unsigned long rval; \
-asm volatile("mfspr %0," __stringify(rn) \
- : "=r" (rval)); rval; })
-
-#define SPRN_PVR        0x11F  /* Processor Version Register */
-#define PVR_VER(pvr)    (((pvr) >>  16) & 0xFFFF) /* Version field */
-#define PVR_REV(pvr)    (((pvr) >>   0) & 0xFFFF) /* Revison field */
-
 int
 get_cpuid(char *buffer, size_t sz)
 {
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c 
b/tools/perf/arch/powerpc/util/perf_regs.c
index 0a52429..d64ba0c 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -6,9 +6,15 @@
 
 #include "../../../util/perf_regs.h"
 #include "../../../util/debug.h"
+#include "../../../util/event.h"
+#include "../../../util/header.h"
+#include "../../../perf-sys.h"
+#include "utils_header.h"
 
 #include 
 
+#define PVR_POWER9 0x004E
+
 const struct sample_reg sample_reg_masks[] = {
SMPL_REG(r0, PERF_REG_POWERPC_R0),
SMPL_REG(r1, PERF_REG_POWERPC_R1),
@@ -55,6 +61,9 @@
SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
SMPL_REG(sier, PERF_REG_POWERPC_SIER),
SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
+   SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
+   SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
+   SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
SMPL_REG_END
 };
 
@@ -163,3 +172,43 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 
return SDT_ARG_VALID;
 }
+
+uint64_t arch__intr_reg_mask(void)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .sample_type= PERF_SAMPLE_REGS_INTR,
+   .precise_ip = 1,
+   .disab

[v3 15/15] tools/perf: Add perf tools support for extended regs in power10

2020-07-17 Thread Athira Rajeev
Add support for the regs which are new in power10
(MMCR3, SIER2, SIER3) to sample_reg_mask on the tool side,
for use with the `-I?` option. Also add a PVR check to send the extended
mask for power10 to the kernel while capturing extended regs in
each sample.

Signed-off-by: Athira Rajeev 
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 6 ++
 tools/perf/arch/powerpc/include/perf_regs.h | 3 +++
 tools/perf/arch/powerpc/util/perf_regs.c| 6 ++
 3 files changed, 15 insertions(+)

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index 225c64c..bdf5f10 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR0,
PERF_REG_POWERPC_MMCR1,
PERF_REG_POWERPC_MMCR2,
+   PERF_REG_POWERPC_MMCR3,
+   PERF_REG_POWERPC_SIER2,
+   PERF_REG_POWERPC_SIER3,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
@@ -60,6 +63,9 @@ enum perf_event_powerpc_regs {
 
 /* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
 #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
+#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - 
PERF_REG_PMU_MASK)
 
 #define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
+#define PERF_REG_MAX_ISA_31    (PERF_REG_POWERPC_SIER3 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h 
b/tools/perf/arch/powerpc/include/perf_regs.h
index 46ed00d..63f3ac9 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -68,6 +68,9 @@
[PERF_REG_POWERPC_MMCR0] = "mmcr0",
[PERF_REG_POWERPC_MMCR1] = "mmcr1",
[PERF_REG_POWERPC_MMCR2] = "mmcr2",
+   [PERF_REG_POWERPC_MMCR3] = "mmcr3",
+   [PERF_REG_POWERPC_SIER2] = "sier2",
+   [PERF_REG_POWERPC_SIER3] = "sier3",
 };
 
 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c 
b/tools/perf/arch/powerpc/util/perf_regs.c
index d64ba0c..2b6d470 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -14,6 +14,7 @@
 #include 
 
 #define PVR_POWER9 0x004E
+#define PVR_POWER10 0x0080
 
 const struct sample_reg sample_reg_masks[] = {
SMPL_REG(r0, PERF_REG_POWERPC_R0),
@@ -64,6 +65,9 @@
SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
+   SMPL_REG(mmcr3, PERF_REG_POWERPC_MMCR3),
+   SMPL_REG(sier2, PERF_REG_POWERPC_SIER2),
+   SMPL_REG(sier3, PERF_REG_POWERPC_SIER3),
SMPL_REG_END
 };
 
@@ -194,6 +198,8 @@ uint64_t arch__intr_reg_mask(void)
	version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
if (version == PVR_POWER9)
extended_mask = PERF_REG_PMU_MASK_300;
+   else if (version == PVR_POWER10)
+   extended_mask = PERF_REG_PMU_MASK_31;
else
return mask;
 
-- 
1.8.3.1



[v3 14/15] powerpc/perf: Add extended regs support for power10 platform

2020-07-17 Thread Athira Rajeev
Include the capability flag `PERF_PMU_CAP_EXTENDED_REGS` for power10
and expose the MMCR3, SIER2, SIER3 registers as part of extended regs.
Also introduce `PERF_REG_PMU_MASK_31` to define the extended mask
value at runtime for power10.

Signed-off-by: Athira Rajeev 
[Fix build failure on PPC32 platform]
Suggested-by: Ryan Grimm 
Reported-by: kernel test robot 
---
 arch/powerpc/include/uapi/asm/perf_regs.h |  6 ++
 arch/powerpc/perf/perf_regs.c | 12 +++-
 arch/powerpc/perf/power10-pmu.c   |  6 ++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
b/arch/powerpc/include/uapi/asm/perf_regs.h
index 225c64c..bdf5f10 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR0,
PERF_REG_POWERPC_MMCR1,
PERF_REG_POWERPC_MMCR2,
+   PERF_REG_POWERPC_MMCR3,
+   PERF_REG_POWERPC_SIER2,
+   PERF_REG_POWERPC_SIER3,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
@@ -60,6 +63,9 @@ enum perf_event_powerpc_regs {
 
 /* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
 #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
+#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - 
PERF_REG_PMU_MASK)
 
 #define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
+#define PERF_REG_MAX_ISA_31    (PERF_REG_POWERPC_SIER3 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index b0cf68f..11b90d5 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -81,6 +81,14 @@ static u64 get_ext_regs_value(int idx)
return mfspr(SPRN_MMCR1);
case PERF_REG_POWERPC_MMCR2:
return mfspr(SPRN_MMCR2);
+#ifdef CONFIG_PPC64
+   case PERF_REG_POWERPC_MMCR3:
+   return mfspr(SPRN_MMCR3);
+   case PERF_REG_POWERPC_SIER2:
+   return mfspr(SPRN_SIER2);
+   case PERF_REG_POWERPC_SIER3:
+   return mfspr(SPRN_SIER3);
+#endif
default: return 0;
}
 }
@@ -89,7 +97,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
u64 PERF_REG_EXTENDED_MAX;
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_31;
+   else if (cpu_has_feature(CPU_FTR_ARCH_300))
PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300;
 
if (idx == PERF_REG_POWERPC_SIER &&
diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index b02aabb..f066ed9 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -87,6 +87,8 @@
 #define POWER10_MMCRA_IFM3 0xC000UL
 #define POWER10_MMCRA_BHRB_MASK0xC000UL
 
+extern u64 PERF_REG_EXTENDED_MASK;
+
 /* Table of alternatives, sorted by column 0 */
 static const unsigned int power10_event_alternatives[][MAX_ALT] = {
{ PM_RUN_CYC_ALT,   PM_RUN_CYC },
@@ -397,6 +399,7 @@ static void power10_config_bhrb(u64 pmu_bhrb_filter)
.cache_events   = &power10_cache_events,
.attr_groups= power10_pmu_attr_groups,
.bhrb_nr= 32,
+   .capabilities   = PERF_PMU_CAP_EXTENDED_REGS,
 };
 
 int init_power10_pmu(void)
@@ -408,6 +411,9 @@ int init_power10_pmu(void)
strcmp(cur_cpu_spec->oprofile_cpu_type, "ppc64/power10"))
return -ENODEV;
 
+   /* Set the PERF_REG_EXTENDED_MASK here */
+   PERF_REG_EXTENDED_MASK = PERF_REG_PMU_MASK_31;
+
rc = register_power_pmu(&power10_pmu);
if (rc)
return rc;
-- 
1.8.3.1



[v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs

2020-07-17 Thread Athira Rajeev
From: Anju T Sudhakar 

Add support for the perf extended register capability in powerpc.
The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to indicate a
PMU which supports extended registers. The generic code defines the mask
of extended registers as 0 for unsupported architectures.

The patch adds extended regs support for the power9 platform by
exposing the MMCR0, MMCR1 and MMCR2 registers.

The REG_RESERVED mask needs an update to include the extended regs.
`PERF_REG_EXTENDED_MASK`, which contains the mask value of the supported
registers, is defined at runtime in the kernel based on the platform, since
the supported registers may differ from one processor version to another
and hence so does the mask value.

with patch
--

available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2

PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
... intr regs: mask 0x ABI 64-bit
 r00xc012b77c
 r10xc03fe5e03930
 r20xc1b0e000
 r30xc03fdcddf800
 r40xc03fc788
 r50x9c422724be
 r60xc03fe5e03908
 r70xff63bddc8706
 r80x9e4
 r90x0
 r10   0x1
 r11   0x0
 r12   0xc01299c0
 r13   0xc03c4800
 r14   0x0
 r15   0x7fffdd8b8b00
 r16   0x0
 r17   0x7fffdd8be6b8
 r18   0x7e7076607730
 r19   0x2f
 r20   0xc0001fc26c68
 r21   0xc0002041e4227e00
 r22   0xc0002018fb60
 r23   0x1
 r24   0xc03ffec4d900
 r25   0x8000
 r26   0x0
 r27   0x1
 r28   0x1
 r29   0xc1be1260
 r30   0x6008010
 r31   0xc03ffebb7218
 nip   0xc012b910
 msr   0x90009033
 orig_r3 0xc012b86c
 ctr   0xc01299c0
 link  0xc012b77c
 xer   0x0
 ccr   0x2800
 softe 0x1
 trap  0xf00
 dar   0x0
 dsisr 0x800
 sier  0x0
 mmcra 0x800
 mmcr0 0x82008090
 mmcr1 0x1e00
 mmcr2 0x0
 ... thread: perf:4784

Signed-off-by: Anju T Sudhakar 
[Defined PERF_REG_EXTENDED_MASK at run time to add support for different 
platforms ]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/perf_event_server.h |  8 +++
 arch/powerpc/include/uapi/asm/perf_regs.h| 14 +++-
 arch/powerpc/perf/core-book3s.c  |  1 +
 arch/powerpc/perf/perf_regs.c| 34 +---
 arch/powerpc/perf/power9-pmu.c   |  6 +
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 832450a..bf85d1a 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -15,6 +15,9 @@
 #define MAX_EVENT_ALTERNATIVES 8
 #define MAX_LIMITED_HWCOUNTERS 2
 
+extern u64 PERF_REG_EXTENDED_MASK;
+#define PERF_REG_EXTENDED_MASK PERF_REG_EXTENDED_MASK
+
 struct perf_event;
 
 struct mmcr_regs {
@@ -62,6 +65,11 @@ struct power_pmu {
int *blacklist_ev;
/* BHRB entries in the PMU */
int bhrb_nr;
+   /*
+* set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
+* the pmu supports extended perf regs capability
+*/
+   int capabilities;
 };
 
 /*
diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
b/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..225c64c 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) 
- PERF_REG_PMU_MASK)
+
+#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 31c0535..d5a9529 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2316,6 +2316,7 @@ int register_power_pmu(struct power_pmu *pmu)
pmu->name);
 
power_pmu.attr_groups = ppmu->attr_groups;
+   power_pmu.capabilities |= (ppmu->capabilities & 
PERF_PMU_CAP_EXTENDED_REGS);
 
 #ifdef MSR_HV
/*
diff --git a/arch/powerpc/perf/perf_regs

[v3 11/15] powerpc/perf: BHRB control to disable BHRB logic when not used

2020-07-17 Thread Athira Rajeev
PowerISA v3.1 has a few updates for the Branch History Rolling Buffer (BHRB).

BHRB disable is controlled via a Monitor Mode Control Register A (MMCRA)
bit, namely "BHRB Recording Disable (BHRBRD)". This field controls
whether BHRB entries are written when BHRB recording is enabled by other
bits. This patch implements support for this BHRB disable bit.

Setting this bit to 0b1 disables the BHRB, and setting it to 0b0 keeps
the BHRB enabled. This preserves backward compatibility (for older OSes),
since the bit is cleared by default and hardware then keeps writing to
the BHRB as before.

This patch sets MMCRA (BHRBRD) at boot for power10 (thereby letting the
core run faster) and enables BHRB recording at runtime only on explicit
request from the user. It also saves/restores MMCRA in the restore path
of the state-loss idle state, to make sure BHRB stays disabled if it was
not requested at runtime.
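
For illustration only (not part of the patch), the intended MMCRA
handling boils down to the following standalone sketch; the BHRBRD bit
position here is an assumption, the real value comes from the reg.h
change earlier in this series:

#include <stdio.h>
#include <stdbool.h>

#define MMCRA_BHRB_DISABLE	(1ULL << 37)	/* assumed BHRBRD bit position */

static unsigned long long mmcra_for_event(bool wants_bhrb)
{
	unsigned long long mmcra = MMCRA_BHRB_DISABLE;	/* boot default: BHRB off */

	if (wants_bhrb)			/* user asked for branch-stack sampling */
		mmcra &= ~MMCRA_BHRB_DISABLE;
	return mmcra;
}

int main(void)
{
	printf("no BHRB request: mmcra = %#llx\n", mmcra_for_event(false));
	printf("BHRB requested : mmcra = %#llx\n", mmcra_for_event(true));
	return 0;
}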

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/perf/core-book3s.c   | 20 
 arch/powerpc/perf/isa207-common.c | 12 
 arch/powerpc/platforms/powernv/idle.c | 22 --
 3 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index bd125fe..31c0535 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1218,7 +1218,7 @@ static void write_mmcr0(struct cpu_hw_events *cpuhw, 
unsigned long mmcr0)
 static void power_pmu_disable(struct pmu *pmu)
 {
struct cpu_hw_events *cpuhw;
-   unsigned long flags, mmcr0, val;
+   unsigned long flags, mmcr0, val, mmcra;
 
if (!ppmu)
return;
@@ -1251,12 +1251,24 @@ static void power_pmu_disable(struct pmu *pmu)
mb();
isync();
 
+   val = mmcra = cpuhw->mmcr.mmcra;
+
/*
 * Disable instruction sampling if it was enabled
 */
-   if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE) {
-   mtspr(SPRN_MMCRA,
- cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
+   if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE)
+   val &= ~MMCRA_SAMPLE_ENABLE;
+
+   /* Disable BHRB via mmcra (BHRBRD) for p10 */
+   if (ppmu->flags & PPMU_ARCH_310S)
+   val |= MMCRA_BHRB_DISABLE;
+
+   /*
+* Write SPRN_MMCRA if mmcra has either disabled
+* instruction sampling or BHRB.
+*/
+   if (val != mmcra) {
+   mtspr(SPRN_MMCRA, val);
mb();
isync();
}
diff --git a/arch/powerpc/perf/isa207-common.c 
b/arch/powerpc/perf/isa207-common.c
index 77643f3..964437a 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -404,6 +404,13 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
 
mmcra = mmcr1 = mmcr2 = mmcr3 = 0;
 
+   /*
+* Disable bhrb unless explicitly requested
+* by setting MMCRA (BHRBRD) bit.
+*/
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   mmcra |= MMCRA_BHRB_DISABLE;
+
/* Second pass: assign PMCs, set all MMCR1 fields */
for (i = 0; i < n_ev; ++i) {
pmc = (event[i] >> EVENT_PMC_SHIFT) & EVENT_PMC_MASK;
@@ -479,6 +486,11 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
mmcra |= val << MMCRA_IFM_SHIFT;
}
 
+   /* set MMCRA (BHRBRD) to 0 if there is user request for BHRB */
+   if (cpu_has_feature(CPU_FTR_ARCH_31) &&
+   (has_branch_stack(pevents[i]) || (event[i] & 
EVENT_WANTS_BHRB)))
+   mmcra &= ~MMCRA_BHRB_DISABLE;
+
if (pevents[i]->attr.exclude_user)
mmcr2 |= MMCR2_FCP(pmc);
 
diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 2dd4673..1c9d0a9 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -611,6 +611,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
unsigned long srr1;
unsigned long pls;
unsigned long mmcr0 = 0;
+   unsigned long mmcra = 0;
struct p9_sprs sprs = {}; /* avoid false used-uninitialised */
bool sprs_saved = false;
 
@@ -657,6 +658,21 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
  */
mmcr0   = mfspr(SPRN_MMCR0);
}
+
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   /*
+* POWER10 uses MMCRA (BHRBRD) as BHRB disable bit.
+* If the user hasn't asked for the BHRB to be
+* written, the value of MMCRA[BHRBRD] is 1.
+* On wakeup from stop, MMCRA[BH

[v3 10/15] powerpc/perf: Add Power10 BHRB filter support for PERF_SAMPLE_BRANCH_IND_CALL/COND

2020-07-17 Thread Athira Rajeev
PowerISA v3.1 introduces filtering support for
PERF_SAMPLE_BRANCH_IND_CALL/COND. The patch adds BHRB filter
support for "ind_call" and "cond" in power10_bhrb_filter_map().
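
For context, such a filter is requested from userspace through
perf_event_open(); a minimal sketch (not part of the patch, error
handling trimmed) could look like this:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	/* One of the filters this patch starts honouring on power10 */
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_IND_CALL |
				  PERF_SAMPLE_BRANCH_USER;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		perror("perf_event_open");
	else
		close(fd);
	return 0;
}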

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/perf/power10-pmu.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index ba19f40..b02aabb 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -83,6 +83,8 @@
 
 /* MMCRA IFM bits - POWER10 */
 #define POWER10_MMCRA_IFM1 0x4000UL
+#define POWER10_MMCRA_IFM2 0x8000UL
+#define POWER10_MMCRA_IFM3 0xC000UL
 #define POWER10_MMCRA_BHRB_MASK0xC000UL
 
 /* Table of alternatives, sorted by column 0 */
@@ -233,8 +235,15 @@ static u64 power10_bhrb_filter_map(u64 branch_sample_type)
if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
return -1;
 
-   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
-   return -1;
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) {
+   pmu_bhrb_filter |= POWER10_MMCRA_IFM2;
+   return pmu_bhrb_filter;
+   }
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) {
+   pmu_bhrb_filter |= POWER10_MMCRA_IFM3;
+   return pmu_bhrb_filter;
+   }
 
if (branch_sample_type & PERF_SAMPLE_BRANCH_CALL)
return -1;
-- 
1.8.3.1



[v3 09/15] powerpc/perf: Ignore the BHRB kernel address filtering for P10

2020-07-17 Thread Athira Rajeev
Commit bb19af816025 ("powerpc/perf: Prevent kernel address leak to
userspace via BHRB buffer") added a check in bhrb_read() to filter
kernel addresses from the BHRB buffer. This patch modifies it to skip
that check for PowerISA v3.1 based processors, since PowerISA v3.1
allows only MSR[PR]=1 addresses to be written to the BHRB buffer.

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/perf/core-book3s.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 0ffb757d..bd125fe 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -469,8 +469,11 @@ static void power_pmu_bhrb_read(struct perf_event *event, 
struct cpu_hw_events *
 * addresses at this point. Check the privileges before
 * exporting it to userspace (avoid exposure of regions
 * where we could have speculative execution)
+* Incase of ISA v3.1, BHRB will capture only user-space
+* addresses, hence include a check before filtering 
code
 */
-   if (is_kernel_addr(addr) && 
perf_allow_kernel(&event->attr) != 0)
+   if (!(ppmu->flags & PPMU_ARCH_310S) &&
+   is_kernel_addr(addr) && 
perf_allow_kernel(&event->attr) != 0)
continue;
 
/* Branches are read most recent first (ie. mfbhrb 0 is
-- 
1.8.3.1



[v3 08/15] powerpc/perf: power10 Performance Monitoring support

2020-07-17 Thread Athira Rajeev
Base enablement patch to register performance monitoring
hardware support for power10. The patch introduces the raw event
encoding format, defines the supported list of events and the config
fields for the event attributes, and exports their corresponding bit
values via sysfs.

The patch also enhances the support functions in isa207-common.c to
cover the power10 PMU hardware.

[Enablement of base PMU driver code]
Signed-off-by: Madhavan Srinivasan 
[Addition of ISA macros for counter support functions]
Signed-off-by: Athira Rajeev 
[Fix compilation warning for missing prototype for init_power10_pmu]
Reported-by: kernel test robot 
---
 arch/powerpc/perf/Makefile  |   2 +-
 arch/powerpc/perf/core-book3s.c |   2 +
 arch/powerpc/perf/internal.h|   1 +
 arch/powerpc/perf/isa207-common.c   |  59 -
 arch/powerpc/perf/isa207-common.h   |  33 ++-
 arch/powerpc/perf/power10-events-list.h |  70 ++
 arch/powerpc/perf/power10-pmu.c | 410 
 7 files changed, 566 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/perf/power10-events-list.h
 create mode 100644 arch/powerpc/perf/power10-pmu.c

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 53d614e..c02854d 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   isa207-common.o power8-pmu.o power9-pmu.o \
-  generic-compat-pmu.o
+  generic-compat-pmu.o power10-pmu.o
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_PPC_POWERNV)  += imc-pmu.o
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index ca32fc0..0ffb757d 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2332,6 +2332,8 @@ static int __init init_ppc64_pmu(void)
return 0;
else if (!init_power9_pmu())
return 0;
+   else if (!init_power10_pmu())
+   return 0;
else if (!init_ppc970_pmu())
return 0;
else
diff --git a/arch/powerpc/perf/internal.h b/arch/powerpc/perf/internal.h
index f755c64..80bbf72 100644
--- a/arch/powerpc/perf/internal.h
+++ b/arch/powerpc/perf/internal.h
@@ -9,4 +9,5 @@
 extern int init_power7_pmu(void);
 extern int init_power8_pmu(void);
 extern int init_power9_pmu(void);
+extern int init_power10_pmu(void);
 extern int init_generic_compat_pmu(void);
diff --git a/arch/powerpc/perf/isa207-common.c 
b/arch/powerpc/perf/isa207-common.c
index 2fe63f2..77643f3 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -55,7 +55,9 @@ static bool is_event_valid(u64 event)
 {
u64 valid_mask = EVENT_VALID_MASK;
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   valid_mask = p10_EVENT_VALID_MASK;
+   else if (cpu_has_feature(CPU_FTR_ARCH_300))
valid_mask = p9_EVENT_VALID_MASK;
 
return !(event & ~valid_mask);
@@ -69,6 +71,14 @@ static inline bool is_event_marked(u64 event)
return false;
 }
 
+static unsigned long sdar_mod_val(u64 event)
+{
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   return p10_SDAR_MODE(event);
+
+   return p9_SDAR_MODE(event);
+}
+
 static void mmcra_sdar_mode(u64 event, unsigned long *mmcra)
 {
/*
@@ -79,7 +89,7 @@ static void mmcra_sdar_mode(u64 event, unsigned long *mmcra)
 * MMCRA[SDAR_MODE] will be programmed as "0b01" for continous sampling
 * mode and will be un-changed when setting MMCRA[63] (Marked events).
 *
-* Incase of Power9:
+* Incase of Power9/power10:
 * Marked event: MMCRA[SDAR_MODE] will be set to 0b00 ('No Updates'),
 *   or if group already have any marked events.
 * For rest
@@ -90,8 +100,8 @@ static void mmcra_sdar_mode(u64 event, unsigned long *mmcra)
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
if (is_event_marked(event) || (*mmcra & MMCRA_SAMPLE_ENABLE))
*mmcra &= MMCRA_SDAR_MODE_NO_UPDATES;
-   else if (p9_SDAR_MODE(event))
-   *mmcra |=  p9_SDAR_MODE(event) << MMCRA_SDAR_MODE_SHIFT;
+   else if (sdar_mod_val(event))
+   *mmcra |= sdar_mod_val(event) << MMCRA_SDAR_MODE_SHIFT;
else
*mmcra |= MMCRA_SDAR_MODE_DCACHE;
} else
@@ -134,7 +144,11 @@ static bool is_thresh_cmp_valid(u64 event)
/*
 * Check the mantissa upper two bits are not zero, unless the
 * exponent is also zero. See the THRESH_CMP_MANTISSA doc.
+* Power10: thresh_cmp is replac

[v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-17 Thread Athira Rajeev
From: Madhavan Srinivasan 

Add a power10 feature function to dt_cpu_ftrs.c along with a
power10-specific init() that initializes the PMU SPRs and sets the
oprofile_cpu_type and cpu_features. This enables the performance
monitoring unit (PMU) for Power10 via the CPU feature
"performance-monitor-power10".

For PowerISA v3.1, BHRB disable is controlled via a Monitor Mode
Control Register A (MMCRA) bit, namely "BHRB Recording Disable
(BHRBRD)". This patch initializes MMCRA[BHRBRD] at boot for power10
so that the BHRB feature is disabled by default.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/reg.h|  3 +++
 arch/powerpc/kernel/cpu_setup_power.S |  8 
 arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
 3 files changed, 37 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 21a1b2d..900ada1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1068,6 +1068,9 @@
 #define MMCR0_PMC2_LOADMISSTIME0x5
 #endif
 
+/* BHRB disable bit for PowerISA v3.10 */
+#define MMCRA_BHRB_DISABLE 0x0020
+
 /*
  * SPRG usage:
  *
diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index efdcfa7..b8e0d1e 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
 _GLOBAL(__setup_cpu_power10)
mflrr11
bl  __init_FSCR_power10
+   bl  __init_PMU_ISA31
b   1f
 
 _GLOBAL(__setup_cpu_power9)
@@ -233,3 +234,10 @@ __init_PMU_ISA207:
li  r5,0
mtspr   SPRN_MMCRS,r5
blr
+
+__init_PMU_ISA31:
+   li  r5,0
+   mtspr   SPRN_MMCR3,r5
+   LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
+   mtspr   SPRN_MMCRA,r5
+   blr
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 3a40951..f482286 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -450,6 +450,31 @@ static int __init feat_enable_pmu_power9(struct 
dt_cpu_feature *f)
return 1;
 }
 
+static void init_pmu_power10(void)
+{
+   init_pmu_power9();
+
+   mtspr(SPRN_MMCR3, 0);
+   mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
+}
+
+static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
+{
+   hfscr_pmu_enable();
+
+   init_pmu_power10();
+   init_pmu_registers = init_pmu_power10;
+
+   cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA;
+   cur_cpu_spec->cpu_user_features |= PPC_FEATURE_PSERIES_PERFMON_COMPAT;
+
+   cur_cpu_spec->num_pmcs  = 6;
+   cur_cpu_spec->pmc_type  = PPC_PMC_IBM;
+   cur_cpu_spec->oprofile_cpu_type = "ppc64/power10";
+
+   return 1;
+}
+
 static int __init feat_enable_tm(struct dt_cpu_feature *f)
 {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -639,6 +664,7 @@ struct dt_cpu_feature_match {
{"pc-relative-addressing", feat_enable, 0},
{"machine-check-power9", feat_enable_mce_power9, 0},
{"performance-monitor-power9", feat_enable_pmu_power9, 0},
+   {"performance-monitor-power10", feat_enable_pmu_power10, 0},
{"event-based-branch-v3", feat_enable, 0},
{"random-number-generator", feat_enable, 0},
{"system-call-vectored", feat_disable, 0},
-- 
1.8.3.1



[v3 06/15] powerpc/xmon: Add PowerISA v3.1 PMU SPRs

2020-07-17 Thread Athira Rajeev
From: Madhavan Srinivasan 

PowerISA v3.1 added three new performance
monitoring unit (PMU) special purpose registers (SPRs).
They are Monitor Mode Control Register 3 (MMCR3),
Sampled Instruction Event Register 2 (SIER2) and
Sampled Instruction Event Register 3 (SIER3).

This patch adds a new dump function, dump_310_sprs(),
to print these SPR values.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/xmon/xmon.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 7efe4bc..16f5493 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2022,6 +2022,18 @@ static void dump_300_sprs(void)
 #endif
 }
 
+static void dump_310_sprs(void)
+{
+#ifdef CONFIG_PPC64
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return;
+
+   printf("mmcr3  = %.16lx, sier2  = %.16lx, sier3  = %.16lx\n",
+   mfspr(SPRN_MMCR3), mfspr(SPRN_SIER2), mfspr(SPRN_SIER3));
+
+#endif
+}
+
 static void dump_one_spr(int spr, bool show_unimplemented)
 {
unsigned long val;
@@ -2076,6 +2088,7 @@ static void super_regs(void)
dump_206_sprs();
dump_207_sprs();
dump_300_sprs();
+   dump_310_sprs();
 
return;
}
-- 
1.8.3.1



[v3 05/15] KVM: PPC: Book3S HV: Save/restore new PMU registers

2020-07-17 Thread Athira Rajeev
PowerISA v3.1 added new performance monitoring unit (PMU)
special purpose registers (SPRs). They are

Monitor Mode Control Register 3 (MMCR3)
Sampled Instruction Event Register 2 (SIER2)
Sampled Instruction Event Register 3 (SIER3)

The patch adds support to save/restore these new
SPRs while entering/exiting the guest. It also includes
changes to support KVM_REG_PPC_MMCR3/SIER2/SIER3 and
adds the new SPRs to the KVM API documentation.
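
As a usage illustration (not part of the patch), userspace can read the
new MMCR3 register through the existing KVM_GET_ONE_REG interface once
this change is in. A rough sketch, assuming a vcpu fd is already set up:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_REG_PPC_MMCR3	/* as added by this patch */
#define KVM_REG_PPC_MMCR3  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc1)
#endif

/* Returns 0 on success, -1 with errno set otherwise. */
static int get_mmcr3(int vcpu_fd, uint64_t *val)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_MMCR3,
		.addr = (uintptr_t)val,
	};

	return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}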

Signed-off-by: Athira Rajeev 
---
 Documentation/virt/kvm/api.rst|  3 +++
 arch/powerpc/include/asm/kvm_book3s_asm.h |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  4 ++--
 arch/powerpc/include/uapi/asm/kvm.h   |  5 +
 arch/powerpc/kernel/asm-offsets.c |  3 +++
 arch/powerpc/kvm/book3s_hv.c  | 22 --
 arch/powerpc/kvm/book3s_hv_interrupts.S   |  8 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 24 
 tools/arch/powerpc/include/uapi/asm/kvm.h |  5 +
 9 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 426f945..8cb6eb7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -2156,9 +2156,12 @@ registers, find a list below:
   PPC KVM_REG_PPC_MMCRA   64
   PPC KVM_REG_PPC_MMCR2   64
   PPC KVM_REG_PPC_MMCRS   64
+  PPC KVM_REG_PPC_MMCR3   64
   PPC KVM_REG_PPC_SIAR64
   PPC KVM_REG_PPC_SDAR64
   PPC KVM_REG_PPC_SIER64
+  PPC KVM_REG_PPC_SIER2   64
+  PPC KVM_REG_PPC_SIER3   64
   PPC KVM_REG_PPC_PMC132
   PPC KVM_REG_PPC_PMC232
   PPC KVM_REG_PPC_PMC332
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 45704f2..078f464 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -119,7 +119,7 @@ struct kvmppc_host_state {
void __iomem *xive_tima_virt;
u32 saved_xirr;
u64 dabr;
-   u64 host_mmcr[7];   /* MMCR 0,1,A, SIAR, SDAR, MMCR2, SIER */
+   u64 host_mmcr[10];  /* MMCR 0,1,A, SIAR, SDAR, MMCR2, SIER, MMCR3, 
SIER2/3 */
u32 host_pmc[8];
u64 host_purr;
u64 host_spurr;
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index fc115e2..4f26623 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -637,14 +637,14 @@ struct kvm_vcpu_arch {
u32 ccr1;
u32 dbsr;
 
-   u64 mmcr[3];
+   u64 mmcr[4];
u64 mmcra;
u64 mmcrs;
u32 pmc[8];
u32 spmc[2];
u64 siar;
u64 sdar;
-   u64 sier;
+   u64 sier[3];
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
u64 tfhar;
u64 texasr;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index e55d847..1cc779d 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -640,6 +640,11 @@ struct kvm_ppc_cpu_char {
 #define KVM_REG_PPC_ONLINE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xbf)
 #define KVM_REG_PPC_PTCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc0)
 
+/* POWER10 registers */
+#define KVM_REG_PPC_MMCR3  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc1)
+#define KVM_REG_PPC_SIER2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc2)
+#define KVM_REG_PPC_SIER3  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc3)
+
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
  */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 6fa4853..8711c21 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -698,6 +698,9 @@ int main(void)
HSTATE_FIELD(HSTATE_SDAR, host_mmcr[4]);
HSTATE_FIELD(HSTATE_MMCR2, host_mmcr[5]);
HSTATE_FIELD(HSTATE_SIER, host_mmcr[6]);
+   HSTATE_FIELD(HSTATE_MMCR3, host_mmcr[7]);
+   HSTATE_FIELD(HSTATE_SIER2, host_mmcr[8]);
+   HSTATE_FIELD(HSTATE_SIER3, host_mmcr[9]);
HSTATE_FIELD(HSTATE_PMC1, host_pmc[0]);
HSTATE_FIELD(HSTATE_PMC2, host_pmc[1]);
HSTATE_FIELD(HSTATE_PMC3, host_pmc[2]);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3f90eee..d650d31 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1689,6 +1689,9 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_MMCRS:
*val = get_reg_val(id, vcpu->arch.mmcrs);
break;
+   case KVM_REG_PPC_MMCR3:
+   *val = get_reg_val(id, vcpu->arch.mmcr[3]);
+   break;
case KVM_REG_PPC_PMC1 ... KVM_REG_PPC_PMC8:
i = id - KVM_REG_PPC_PMC1;
*val = get_reg_val(id,

[v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs

2020-07-17 Thread Athira Rajeev
From: Madhavan Srinivasan 

PowerISA v3.1 includes new performance monitoring unit (PMU)
special purpose registers (SPRs). They are

Monitor Mode Control Register 3 (MMCR3)
Sampled Instruction Event Register 2 (SIER2)
Sampled Instruction Event Register 3 (SIER3)

MMCR3 is added for further sampling-related configuration
control. SIER2/SIER3 are added to provide additional
information about the sampled instruction.

The patch adds a new PPMU flag called "PPMU_ARCH_310S" to support
handling of these new SPRs, updates struct thread_struct to include
them, and includes MMCR3 in struct mmcr_regs. This is needed to
support programming of the MMCR3 SPR during event enable/disable.
The patch also adds sysfs support for the MMCR3 SPR along with the
SPRN_ macros for these new PMU SPRs.
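
Once the attribute is in place it can be read per CPU much like the
existing mmcra file. A small sketch (illustrative only; the sysfs path
is assumed to follow the existing per-cpu layout):

#include <stdio.h>

int main(void)
{
	unsigned long long mmcr3;
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/mmcr3", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llx", &mmcr3) == 1)
		printf("cpu0 MMCR3 = %#llx\n", mmcr3);
	fclose(f);
	return 0;
}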

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/perf_event_server.h |  2 ++
 arch/powerpc/include/asm/processor.h |  4 
 arch/powerpc/include/asm/reg.h   |  6 ++
 arch/powerpc/kernel/sysfs.c  |  8 
 arch/powerpc/perf/core-book3s.c  | 29 
 5 files changed, 49 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 14b8dc1..832450a 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -22,6 +22,7 @@ struct mmcr_regs {
unsigned long mmcr1;
unsigned long mmcr2;
unsigned long mmcra;
+   unsigned long mmcr3;
 };
 /*
  * This struct provides the constants and functions needed to
@@ -75,6 +76,7 @@ struct power_pmu {
 #define PPMU_HAS_SIER  0x0040 /* Has SIER */
 #define PPMU_ARCH_207S 0x0080 /* PMC is architecture v2.07S */
 #define PPMU_NO_SIAR   0x0100 /* Do not use SIAR */
+#define PPMU_ARCH_310S 0x0200 /* Has MMCR3, SIER2 and SIER3 */
 
 /*
  * Values for flags to get_alternatives()
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 52a6783..a466e94 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -272,6 +272,10 @@ struct thread_struct {
unsignedmmcr0;
 
unsignedused_ebb;
+   unsigned long   mmcr3;
+   unsigned long   sier2;
+   unsigned long   sier3;
+
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 88e6c78..21a1b2d 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -876,7 +876,9 @@
 #define   MMCR0_FCHV   0x0001UL /* freeze conditions in hypervisor mode */
 #define SPRN_MMCR1 798
 #define SPRN_MMCR2 785
+#define SPRN_MMCR3 754
 #define SPRN_UMMCR2769
+#define SPRN_UMMCR3738
 #define SPRN_MMCRA 0x312
 #define   MMCRA_SDSYNC 0x8000UL /* SDAR synced with SIAR */
 #define   MMCRA_SDAR_DCACHE_MISS 0x4000UL
@@ -918,6 +920,10 @@
 #define   SIER_SIHV0x100   /* Sampled MSR_HV */
 #define   SIER_SIAR_VALID  0x040   /* SIAR contents valid */
 #define   SIER_SDAR_VALID  0x020   /* SDAR contents valid */
+#define SPRN_SIER2 752
+#define SPRN_SIER3 753
+#define SPRN_USIER2736
+#define SPRN_USIER3737
 #define SPRN_SIAR  796
 #define SPRN_SDAR  797
 #define SPRN_TACR  888
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 571b325..46b4ebc 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -622,8 +622,10 @@ void ppc_enable_pmcs(void)
 SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
 
 SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
+SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3);
 
 static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
+static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3);
 #endif /* HAS_PPC_PMC56 */
 
 
@@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu)
 #ifdef CONFIG_PMU_SYSFS
if (cpu_has_feature(CPU_FTR_MMCRA))
device_create_file(s, &dev_attr_mmcra);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   device_create_file(s, &dev_attr_mmcr3);
 #endif /* CONFIG_PMU_SYSFS */
 
if (cpu_has_feature(CPU_FTR_PURR)) {
@@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu)
 #ifdef CONFIG_PMU_SYSFS
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, &dev_attr_mmcra);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   device_remove_file(s, &dev_attr_mmcr3);
 #endif /* CONFIG_PMU_SYSFS */
 
if (cpu_has_feature(CPU_FTR_PURR)) {
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index f4d07b5..ca32fc0 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -72,6 +72,11 @@ struct cpu_hw_events {
 /*
  * 32-bit doesn't have MMCRA but does have an MMCR2,
  * and a few other names are different.
+ * Also 32-bit doesn't have

[v3 03/15] powerpc/perf: Update Power PMU cache_events to u64 type

2020-07-17 Thread Athira Rajeev
Events of type PERF_TYPE_HW_CACHE were described for the Power PMU
as: int (*cache_events)[type][op][result];

where the type, op and result values unpacked from the event attribute
config value are used to generate the raw event code at runtime.

So far, the event code values used to create these cache-related
events fit within 32 bits and the `int` type worked. In power10,
some of the event codes are 64-bit values, hence update the
Power PMU cache_events to the `u64` type in the `power_pmu` struct.
Also propagate this change to all existing PMU driver code paths
which use ppmu->cache_events.
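
A trivial standalone illustration of the concern (the event code value
below is made up, not a real power10 code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t ev64 = 0x0000000300000056ULL;	/* hypothetical 64-bit event code */
	int ev32 = (int)ev64;			/* what the old 'int' table would keep */

	printf("u64 table entry: %#llx\n", (unsigned long long)ev64);
	printf("int table entry: %#x (upper bits lost)\n", (unsigned int)ev32);
	return 0;
}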

Signed-off-by: Athira Rajeev
---
 arch/powerpc/include/asm/perf_event_server.h | 2 +-
 arch/powerpc/perf/core-book3s.c  | 2 +-
 arch/powerpc/perf/generic-compat-pmu.c   | 2 +-
 arch/powerpc/perf/mpc7450-pmu.c  | 2 +-
 arch/powerpc/perf/power5+-pmu.c  | 2 +-
 arch/powerpc/perf/power5-pmu.c   | 2 +-
 arch/powerpc/perf/power6-pmu.c   | 2 +-
 arch/powerpc/perf/power7-pmu.c   | 2 +-
 arch/powerpc/perf/power8-pmu.c   | 2 +-
 arch/powerpc/perf/power9-pmu.c   | 2 +-
 arch/powerpc/perf/ppc970-pmu.c   | 2 +-
 11 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index f9a3668..14b8dc1 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -53,7 +53,7 @@ struct power_pmu {
const struct attribute_group**attr_groups;
int n_generic;
int *generic_events;
-   int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
+   u64 (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
 
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 18b1b6a..f4d07b5 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1790,7 +1790,7 @@ static void hw_perf_event_destroy(struct perf_event 
*event)
 static int hw_perf_cache_event(u64 config, u64 *eventp)
 {
unsigned long type, op, result;
-   int ev;
+   u64 ev;
 
if (!ppmu->cache_events)
return -EINVAL;
diff --git a/arch/powerpc/perf/generic-compat-pmu.c 
b/arch/powerpc/perf/generic-compat-pmu.c
index 5e5a54d..eb8a6aaf 100644
--- a/arch/powerpc/perf/generic-compat-pmu.c
+++ b/arch/powerpc/perf/generic-compat-pmu.c
@@ -101,7 +101,7 @@ enum {
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int generic_compat_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 generic_compat_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[ C(L1D) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = 0,
diff --git a/arch/powerpc/perf/mpc7450-pmu.c b/arch/powerpc/perf/mpc7450-pmu.c
index 826de25..1919e9d 100644
--- a/arch/powerpc/perf/mpc7450-pmu.c
+++ b/arch/powerpc/perf/mpc7450-pmu.c
@@ -361,7 +361,7 @@ static void mpc7450_disable_pmc(unsigned int pmc, struct 
mmcr_regs *mmcr)
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(L1D)] = {/*  RESULT_ACCESS   RESULT_MISS */
[C(OP_READ)] = {0,  0x225   },
[C(OP_WRITE)] = {   0,  0x227   },
diff --git a/arch/powerpc/perf/power5+-pmu.c b/arch/powerpc/perf/power5+-pmu.c
index 5f0821e..a62b2cd 100644
--- a/arch/powerpc/perf/power5+-pmu.c
+++ b/arch/powerpc/perf/power5+-pmu.c
@@ -619,7 +619,7 @@ static void power5p_disable_pmc(unsigned int pmc, struct 
mmcr_regs *mmcr)
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(L1D)] = {/*  RESULT_ACCESS   RESULT_MISS */
[C(OP_READ)] = {0x1c10a8,   0x3c1088},
[C(OP_WRITE)] = {   0x2c10a8,   0xc10c3 },
diff --git a/arch/powerpc/perf/power5-pmu.c b/arch/powerpc/perf/power5-pmu.c
index 426021d..8732b58 100644
--- a/arch/powerpc/perf/power5-pmu.c
+++ b/arch/powerpc/perf/power5-pmu.c
@@ -561,7 +561,7 @@ static void power5_disable_pmc(unsigned int pmc, struct 
mmcr_regs *mmcr)
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(L1D)] = {/*  RESULT_ACCESS   RESULT_

[v3 02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR

2020-07-17 Thread Athira Rajeev
Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
in a flat array, in the order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs.
Split this to give mmcra and mmcrs their own entries in the vcpu and
use a flat array only for mmcr0 to mmcr2. This patch implements this
cleanup to make the code easier to read.

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/include/asm/kvm_host.h   |  4 +++-
 arch/powerpc/include/uapi/asm/kvm.h   |  4 ++--
 arch/powerpc/kernel/asm-offsets.c |  2 ++
 arch/powerpc/kvm/book3s_hv.c  | 16 ++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 12 ++--
 tools/arch/powerpc/include/uapi/asm/kvm.h |  4 ++--
 6 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 7e2d061..fc115e2 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -637,7 +637,9 @@ struct kvm_vcpu_arch {
u32 ccr1;
u32 dbsr;
 
-   u64 mmcr[5];
+   u64 mmcr[3];
+   u64 mmcra;
+   u64 mmcrs;
u32 pmc[8];
u32 spmc[2];
u64 siar;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 264e266..e55d847 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -510,8 +510,8 @@ struct kvm_ppc_cpu_char {
 
 #define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
-#define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
-#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
 #define KVM_REG_PPC_MMCRS  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x14)
 #define KVM_REG_PPC_SIAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x15)
 #define KVM_REG_PPC_SDAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x16)
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 6657dc6..6fa4853 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -559,6 +559,8 @@ int main(void)
OFFSET(VCPU_IRQ_PENDING, kvm_vcpu, arch.irq_pending);
OFFSET(VCPU_DBELL_REQ, kvm_vcpu, arch.doorbell_request);
OFFSET(VCPU_MMCR, kvm_vcpu, arch.mmcr);
+   OFFSET(VCPU_MMCRA, kvm_vcpu, arch.mmcra);
+   OFFSET(VCPU_MMCRS, kvm_vcpu, arch.mmcrs);
OFFSET(VCPU_PMC, kvm_vcpu, arch.pmc);
OFFSET(VCPU_SPMC, kvm_vcpu, arch.spmc);
OFFSET(VCPU_SIAR, kvm_vcpu, arch.siar);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6bf66649..3f90eee 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1679,10 +1679,16 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_UAMOR:
*val = get_reg_val(id, vcpu->arch.uamor);
break;
-   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCRS:
+   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR2:
i = id - KVM_REG_PPC_MMCR0;
*val = get_reg_val(id, vcpu->arch.mmcr[i]);
break;
+   case KVM_REG_PPC_MMCRA:
+   *val = get_reg_val(id, vcpu->arch.mmcra);
+   break;
+   case KVM_REG_PPC_MMCRS:
+   *val = get_reg_val(id, vcpu->arch.mmcrs);
+   break;
case KVM_REG_PPC_PMC1 ... KVM_REG_PPC_PMC8:
i = id - KVM_REG_PPC_PMC1;
*val = get_reg_val(id, vcpu->arch.pmc[i]);
@@ -1900,10 +1906,16 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_UAMOR:
vcpu->arch.uamor = set_reg_val(id, *val);
break;
-   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCRS:
+   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR2:
i = id - KVM_REG_PPC_MMCR0;
vcpu->arch.mmcr[i] = set_reg_val(id, *val);
break;
+   case KVM_REG_PPC_MMCRA:
+   vcpu->arch.mmcra = set_reg_val(id, *val);
+   break;
+   case KVM_REG_PPC_MMCRS:
+   vcpu->arch.mmcrs = set_reg_val(id, *val);
+   break;
case KVM_REG_PPC_PMC1 ... KVM_REG_PPC_PMC8:
i = id - KVM_REG_PPC_PMC1;
vcpu->arch.pmc[i] = set_reg_val(id, *val);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 7194389..702eaa2 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -3428,7 +3428,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG)
mtspr   SPRN_PMC6, r9
ld  r3, VCPU_MMCR(r4)
ld  r5, VCPU_MMCR + 8(r4)
-   ld  r6, VCPU_MMCR + 16(r4)
+   ld  r6, VCPU_MMCRA(r4)
ld  r7, VCPU_SIAR(r4)
ld  r8,

[v3 01/15] powerpc/perf: Update cpu_hw_event to use `struct` for storing MMCR registers

2020-07-17 Thread Athira Rajeev
core-book3s currently uses an array to store the MMCR registers as
part of the per-cpu `cpu_hw_events`. This patch cleans that up by
using a `struct` to store the MMCR regs instead of an array. This
makes the code easier to read and reduces the chance of subtle bugs
in the future, say when new registers are added. The patch updates
all relevant code that was using the MMCR array (cpuhw->mmcr[x]) to
use the newly introduced `struct`. This includes the PMU driver code
for the supported platforms (power5 to power9) and the ISA macros for
the counter support functions.

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/include/asm/perf_event_server.h | 10 --
 arch/powerpc/perf/core-book3s.c  | 53 +---
 arch/powerpc/perf/isa207-common.c| 20 +--
 arch/powerpc/perf/isa207-common.h|  4 +--
 arch/powerpc/perf/mpc7450-pmu.c  | 21 +++
 arch/powerpc/perf/power5+-pmu.c  | 17 -
 arch/powerpc/perf/power5-pmu.c   | 17 -
 arch/powerpc/perf/power6-pmu.c   | 16 -
 arch/powerpc/perf/power7-pmu.c   | 17 -
 arch/powerpc/perf/ppc970-pmu.c   | 24 ++---
 10 files changed, 105 insertions(+), 94 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 3e9703f..f9a3668 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -17,6 +17,12 @@
 
 struct perf_event;
 
+struct mmcr_regs {
+   unsigned long mmcr0;
+   unsigned long mmcr1;
+   unsigned long mmcr2;
+   unsigned long mmcra;
+};
 /*
  * This struct provides the constants and functions needed to
  * describe the PMU on a particular POWER-family CPU.
@@ -28,7 +34,7 @@ struct power_pmu {
unsigned long   add_fields;
unsigned long   test_adder;
int (*compute_mmcr)(u64 events[], int n_ev,
-   unsigned int hwc[], unsigned long mmcr[],
+   unsigned int hwc[], struct mmcr_regs *mmcr,
struct perf_event *pevents[]);
int (*get_constraint)(u64 event_id, unsigned long *mskp,
unsigned long *valp);
@@ -41,7 +47,7 @@ struct power_pmu {
unsigned long   group_constraint_val;
u64 (*bhrb_filter_map)(u64 branch_sample_type);
void(*config_bhrb)(u64 pmu_bhrb_filter);
-   void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
+   void(*disable_pmc)(unsigned int pmc, struct mmcr_regs 
*mmcr);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
const struct attribute_group**attr_groups;
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index cd6a742..18b1b6a 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -37,12 +37,7 @@ struct cpu_hw_events {
struct perf_event *event[MAX_HWEVENTS];
u64 events[MAX_HWEVENTS];
unsigned int flags[MAX_HWEVENTS];
-   /*
-* The order of the MMCR array is:
-*  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2
-*  - 32-bit, MMCR0, MMCR1, MMCR2
-*/
-   unsigned long mmcr[4];
+   struct mmcr_regs mmcr;
struct perf_event *limited_counter[MAX_LIMITED_HWCOUNTERS];
u8  limited_hwidx[MAX_LIMITED_HWCOUNTERS];
u64 alternatives[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
@@ -121,7 +116,7 @@ static void ebb_event_add(struct perf_event *event) { }
 static void ebb_switch_out(unsigned long mmcr0) { }
 static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
 {
-   return cpuhw->mmcr[0];
+   return cpuhw->mmcr.mmcr0;
 }
 
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
@@ -590,7 +585,7 @@ static void ebb_switch_out(unsigned long mmcr0)
 
 static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
 {
-   unsigned long mmcr0 = cpuhw->mmcr[0];
+   unsigned long mmcr0 = cpuhw->mmcr.mmcr0;
 
if (!ebb)
goto out;
@@ -624,7 +619,7 @@ static unsigned long ebb_switch_in(bool ebb, struct 
cpu_hw_events *cpuhw)
 * unfreeze counters, it should not set exclude_xxx in its events and
 * instead manage the MMCR2 entirely by itself.
 */
-   mtspr(SPRN_MMCR2, cpuhw->mmcr[3] | current->thread.mmcr2);
+   mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2 | current->thread.mmcr2);
 out:
return mmcr0;
 }
@@ -1232,9 +1227,9 @@ static void power_pmu_disable(struct pmu *pmu)
/*
 * Disable instruction sampling if it was enabled
 */
-   if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) {
+   if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE) {
mtspr(SPRN_MMCRA,
-

[v3 00/15] powerpc/perf: Add support for power10 PMU Hardware

2020-07-17 Thread Athira Rajeev
The patch series adds support for the power10 PMU hardware.

Patches 1..3 are clean-up patches which refactor the way
PMU SPRs are stored in core-book3s and in KVM book3s, and update
the data type for PMU cache_events.

Patches 12 and 13 add base support for the perf extended register
capability in powerpc. Support for extended regs on power10 is
covered in patches 14 and 15.

The other patches contain the main changes adding support for the
power10 PMU.

Anju T Sudhakar (2):
  powerpc/perf: Add support for outputting extended regs in perf
intr_regs
  tools/perf: Add perf tools support for extended register capability in
powerpc

Athira Rajeev (10):
  powerpc/perf: Update cpu_hw_event to use `struct` for storing MMCR
registers
  KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR
  powerpc/perf: Update Power PMU cache_events to u64 type
  KVM: PPC: Book3S HV: Save/restore new PMU registers
  powerpc/perf: power10 Performance Monitoring support
  powerpc/perf: Ignore the BHRB kernel address filtering for P10
  powerpc/perf: Add Power10 BHRB filter support for
PERF_SAMPLE_BRANCH_IND_CALL/COND
  powerpc/perf: BHRB control to disable BHRB logic when not used
  powerpc/perf: Add extended regs support for power10 platform
  tools/perf: Add perf tools support for extended regs in power10

Madhavan Srinivasan (3):
  powerpc/perf: Add support for ISA3.1 PMU SPRs
  powerpc/xmon: Add PowerISA v3.1 PMU SPRs
  powerpc/perf: Add power10_feat to dt_cpu_ftrs

---
Changes from v2 -> v3
- Addressed review comments from Michael Neuling,
  Michael Ellerman, Gautham Shenoy and Paul Mackerras

Changes from v1 -> v2
- Added support for extended regs in powerpc
  for power9/power10 platform ( patches 12 to 15)
- Addressed change/removal of some event codes
  in the PMU driver
---

 Documentation/virt/kvm/api.rst  |   3 +
 arch/powerpc/include/asm/kvm_book3s_asm.h   |   2 +-
 arch/powerpc/include/asm/kvm_host.h |   6 +-
 arch/powerpc/include/asm/perf_event_server.h|  22 +-
 arch/powerpc/include/asm/processor.h|   4 +
 arch/powerpc/include/asm/reg.h  |   9 +
 arch/powerpc/include/uapi/asm/kvm.h |   9 +-
 arch/powerpc/include/uapi/asm/perf_regs.h   |  20 +-
 arch/powerpc/kernel/asm-offsets.c   |   5 +
 arch/powerpc/kernel/cpu_setup_power.S   |   8 +
 arch/powerpc/kernel/dt_cpu_ftrs.c   |  26 ++
 arch/powerpc/kernel/sysfs.c |   8 +
 arch/powerpc/kvm/book3s_hv.c|  38 ++-
 arch/powerpc/kvm/book3s_hv_interrupts.S |   8 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  36 +-
 arch/powerpc/perf/Makefile  |   2 +-
 arch/powerpc/perf/core-book3s.c | 108 --
 arch/powerpc/perf/generic-compat-pmu.c  |   2 +-
 arch/powerpc/perf/internal.h|   1 +
 arch/powerpc/perf/isa207-common.c   |  91 +++--
 arch/powerpc/perf/isa207-common.h   |  37 ++-
 arch/powerpc/perf/mpc7450-pmu.c |  23 +-
 arch/powerpc/perf/perf_regs.c   |  44 ++-
 arch/powerpc/perf/power10-events-list.h |  70 
 arch/powerpc/perf/power10-pmu.c | 425 
 arch/powerpc/perf/power5+-pmu.c |  19 +-
 arch/powerpc/perf/power5-pmu.c  |  19 +-
 arch/powerpc/perf/power6-pmu.c  |  18 +-
 arch/powerpc/perf/power7-pmu.c  |  19 +-
 arch/powerpc/perf/power8-pmu.c  |   2 +-
 arch/powerpc/perf/power9-pmu.c  |   8 +-
 arch/powerpc/perf/ppc970-pmu.c  |  26 +-
 arch/powerpc/platforms/powernv/idle.c   |  22 +-
 arch/powerpc/xmon/xmon.c|  13 +
 tools/arch/powerpc/include/uapi/asm/kvm.h   |   9 +-
 tools/arch/powerpc/include/uapi/asm/perf_regs.h |  20 +-
 tools/perf/arch/powerpc/include/perf_regs.h |   8 +-
 tools/perf/arch/powerpc/util/header.c   |   9 +-
 tools/perf/arch/powerpc/util/perf_regs.c|  55 +++
 tools/perf/arch/powerpc/util/utils_header.h |  15 +
 40 files changed, 1117 insertions(+), 152 deletions(-)
 create mode 100644 arch/powerpc/perf/power10-events-list.h
 create mode 100644 arch/powerpc/perf/power10-pmu.c
 create mode 100644 tools/perf/arch/powerpc/util/utils_header.h

-- 
1.8.3.1



Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
[...]
> 
> membarrier does replace barrier instructions on remote CPUs, which do
> order accesses performed by the kernel on the user address space. So
> membarrier should too I guess.
> 
> Normal process context accesses like read(2) will do so because they
> don't get filtered out from IPIs, but kernel threads using the mm may
> not.

But it should not be an issue, because membarrier's ordering is only with 
respect
to submit and completion of io_uring requests, which are performed through
system calls from the context of user-space threads, which are called from the
right mm.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote:

> On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
>> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> > mathieu.desnoy...@efficios.com wrote:
>> > 
>> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
>> >> wrote:
>> >>> I should be more complete here, especially since I was complaining
>> >>> about unclear barrier comment :)
>> >>> 
>> >>> 
>> >>> CPU0 CPU1
>> >>> a. user stuff1. user stuff
>> >>> b. membarrier()  2. enter kernel
>> >>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
>> >>> d. read rq->curr 4. rq->curr switched to kthread
>> >>> e. is kthread, skip IPI  5. switch_to kthread
>> >>> f. return to user6. rq->curr switched to user thread
>> >>> g. user stuff7. switch_to user thread
>> >>> 8. exit kernel
>> >>> 9. more user stuff
>> >>> 
>> >>> What you're really ordering is a, g vs 1, 9 right?
>> >>> 
>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> >>> etc.
>> >>> 
>> >>> Userspace does not care where the barriers are exactly or what kernel
>> >>> memory accesses might be being ordered by them, so long as there is a
>> >>> mb somewhere between a and g, and 1 and 9. Right?
>> >> 
>> >> This is correct.
>> > 
>> > Actually, sorry, the above is not quite right. It's been a while
>> > since I looked into the details of membarrier.
>> > 
>> > The smp_mb() at the beginning of membarrier() needs to be paired with a
>> > smp_mb() _after_ rq->curr is switched back to the user thread, so the
>> > memory barrier is between store to rq->curr and following user-space
>> > accesses.
>> > 
>> > The smp_mb() at the end of membarrier() needs to be paired with the
>> > smp_mb__after_spinlock() at the beginning of schedule, which is
>> > between accesses to userspace memory and switching rq->curr to kthread.
>> > 
>> > As to *why* this ordering is needed, I'd have to dig through additional
>> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
>> 
>> Thinking further about this, I'm beginning to consider that maybe we have 
>> been
>> overly cautious by requiring memory barriers before and after store to 
>> rq->curr.
>> 
>> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process
>> (current)
>> while running the membarrier system call, it necessarily means that CPU1 had
>> to issue smp_mb__after_spinlock when entering the scheduler, between any
>> user-space
>> loads/stores and update of rq->curr.
>> 
>> Requiring a memory barrier between update of rq->curr (back to current 
>> process's
>> thread) and following user-space memory accesses does not seem to guarantee
>> anything more than what the initial barrier at the beginning of __schedule
>> already
>> provides, because the guarantees are only about accesses to user-space 
>> memory.
>> 
>> Therefore, with the memory barrier at the beginning of __schedule, just
>> observing that
>> CPU1's rq->curr differs from current should guarantee that a memory barrier 
>> was
>> issued
>> between any sequentially consistent instructions belonging to the current
>> process on
>> CPU1.
>> 
>> Or am I missing/misremembering an important point here ?
> 
> Is it correct to say that the switch_to operations in 5 and 7 include
> memory barriers?  If they do, then skipping the IPI should be okay.
> 
> The reason is as follows: The guarantee you need to enforce is that
> anything written by CPU0 before the membarrier() will be visible to CPU1
> after it returns to user mode.  Let's say that a writes to X and 9
> reads from X.
> 
> Then we have an instance of the Store Buffer pattern:
> 
>   CPU0CPU1
>   a. Write X  6. Write rq->curr for user thread
>   c. smp_mb() 7. switch_to memory barrier
>   d. Read rq->curr9. Read X
> 
> In this pattern, the memory barriers make it impossible for both reads
> to miss their corresponding writes.  Since d does fail to read 6 (it
> sees the earlier value stored by 4), 9 must read a.
> 
> The other guarantee you need is that g on CPU0 will observe anything
> written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
> memory barrier and d reads from 4.

Right, and Nick's reply involving pairs of loads/stores on each side
clarifies the situation even further.
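
For the archive, the store-buffer pattern quoted above can also be
written as a small C11 litmus-style sketch (userspace illustration only,
not kernel code). With a full fence in each thread, the outcome
r0 == 0 && r1 == 0 is forbidden:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int X;	/* written by CPU0, read by CPU1 */
static atomic_int Y;	/* stands in for rq->curr */
static int r0, r1;

static void *cpu0(void *arg)
{
	atomic_store_explicit(&X, 1, memory_order_relaxed);	/* a. Write X */
	atomic_thread_fence(memory_order_seq_cst);		/* c. smp_mb() */
	r0 = atomic_load_explicit(&Y, memory_order_relaxed);	/* d. Read rq->curr */
	return NULL;
}

static void *cpu1(void *arg)
{
	atomic_store_explicit(&Y, 1, memory_order_relaxed);	/* 6. Write rq->curr */
	atomic_thread_fence(memory_order_seq_cst);		/* 7. switch_to barrier */
	r1 = atomic_load_explicit(&X, memory_order_relaxed);	/* 9. Read X */
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* Both reads missing both writes would violate the fences. */
	printf("r0=%d r1=%d%s\n", r0, r1,
	       (r0 == 0 && r1 == 0) ? "  <-- forbidden" : "");
	return 0;
}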

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH -next] powerpc: Remove unneeded inline functions

2020-07-17 Thread YueHaibing
Both of those functions are only called from 64-bit only code, so the
stubs should not be needed at all.

Suggested-by: Michael Ellerman 
Signed-off-by: YueHaibing 
---
 arch/powerpc/include/asm/mmu_context.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 1a474f6b1992..7f3658a97384 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -218,8 +218,6 @@ static inline void inc_mm_active_cpus(struct mm_struct *mm) 
{ }
 static inline void dec_mm_active_cpus(struct mm_struct *mm) { }
 static inline void mm_context_add_copro(struct mm_struct *mm) { }
 static inline void mm_context_remove_copro(struct mm_struct *mm) { }
-static inline void mm_context_add_vas_windows(struct mm_struct *mm) { }
-static inline void mm_context_remove_vas_windows(struct mm_struct *mm) { }
 #endif
 
 
-- 
2.17.1




Re: [PATCH 0/4] ASoC: fsl_asrc: allow selecting arbitrary clocks

2020-07-17 Thread Arnaud Ferraris
Hi Nic,

Le 02/07/2020 à 20:42, Nicolin Chen a écrit :
> Hi Arnaud,
> 
> On Thu, Jul 02, 2020 at 04:22:31PM +0200, Arnaud Ferraris wrote:
>> The current ASRC driver hardcodes the input and output clocks used for
>> sample rate conversions. In order to allow greater flexibility and to
>> cover more use cases, it would be preferable to select the clocks using
>> device-tree properties.
> 
> We recent just merged a new change that auto-selecting internal
> clocks based on sample rates as the first option -- ideal ratio
> mode is the fallback mode now. Please refer to:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200702&id=d0250cf4f2abfbea64ed247230f08f5ae23979f0

While working on fixing the automatic clock selection (see my v3), I
came across another potential issue, which would be better explained
with an example:
  - Input has sample rate 8kHz and uses clock SSI1 with rate 512kHz
  - Output has sample rate 16kHz and uses clock SSI2 with rate 1024kHz

Let's say my v3 patch is merged; then the selected input clock will be
SSI1, while the selected output clock will be SSI2. In that case, it's
all good, as the driver will calculate the dividers correctly.

Now, suppose a similar board has the input wired to SSI2 and output to
SSI1, meaning we're now in the following case:
  - Input has sample rate 8kHz and uses clock SSI2 with rate 512kHz
  - Output has sample rate 16kHz and uses clock SSI1 with rate 1024kHz
(the same result is achieved during capture with the initial example
setup, as input and output properties are then swapped)

In that case, the selected clocks will still be SSI1 for input (just
because it appears first in the clock table), and SSI2 for output,
meaning the calculated dividers will be:
  - input: 512 / 16 => 32 (should be 64)
  - output: 1024 / 8 => 128 (should be 64 here too)
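
To make the arithmetic explicit: the divider is simply the clock rate
divided by the sample rate, so pairing a sample rate with the wrong
clock halves or doubles it. A standalone illustration (the helper name
is mine, not the driver's):

#include <stdio.h>

static unsigned int asrc_div(unsigned int clk_rate_hz, unsigned int sample_rate_hz)
{
	return clk_rate_hz / sample_rate_hz;
}

int main(void)
{
	/* Mismatched pairing, as in the dividers above */
	printf("512 kHz / 16 kHz  = %u (should be 64)\n", asrc_div(512000, 16000));
	printf("1024 kHz / 8 kHz  = %u (should be 64)\n", asrc_div(1024000, 8000));
	/* Intended pairing */
	printf("512 kHz / 8 kHz   = %u\n", asrc_div(512000, 8000));
	printf("1024 kHz / 16 kHz = %u\n", asrc_div(1024000, 16000));
	return 0;
}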

---

I can't see how the clock selection algorithm could be made smart enough
to cover cases such as this one, as it would need to be aware of the
exact relationship between the sample rate and the clock rate (my
example demonstrates a case where the "sample rate to clock rate"
multiplier is identical for both input and output, but this can't be
assumed to be always the case).

Therefore, I still believe being able to force clock selection using
optional DT properties would make sense, while still using the
auto-selection by default.

Regards,
Arnaud

> 
> Having a quick review at your changes, I think the DT part may
> not be necessary as it's more likely a software configuration.
> I personally like the new auto-selecting solution more.
> 
>> This series also fix register configuration and clock assignment so
>> conversion can be conducted effectively in both directions with a good
>> quality.
> 
> If there's any further change that you feel you can improve on
> the top of mentioned change after rebasing, I'd like to review.
> 
> Thanks
> Nic
> 


Re: ASMedia USB 3.x host controllers triggering EEH on POWER9

2020-07-17 Thread Oliver O'Halloran
On Fri, Jul 17, 2020 at 8:10 PM Forest Crossman  wrote:
>
> > From word 2 of the PEST entry the faulting DMA address is:
> > 0x203974c0. That address is interesting since it looks a lot
> > like a memory address on the 2nd chip, but it doesn't have bit 59 set
> > so TVE#0 is used to validate it. Obviously that address is above 2GB
> > so we get the error.
>
> Ah, I see. Do you know if the information on the PEST registers is
> documented publicly somewhere? I tried searching for what those
> registers meant in the PHB4 spec but it basically just said, "the PEST
> registers contain PEST data," which isn't particularly helpful.

Most of it is in the IODA spec. See Sec 2.2.6 "Partitionable-Endpoint
State Table", the only part that isn't documented there is the top
bits of each word which are documented in the PHB spec for some
reason. For word one the top bit (PPCBIT(0)) means MMIO is frozen and
for word two (PPCBIT(64)) the top bit indicates DMA is frozen.

> > What's probably happening is that the ASmedia controller doesn't
> > actually implement all 64 address bits and truncates the upper bits of
> > the DMA address. Doing that is a blatant violation of the PCIe (and
> > probably the xHCI) spec, but it's also pretty common since "it works
> > on x86." Something to try would be booting with the iommu=nobypass in
> > the kernel command line. That'll disable TVE#1 and force all DMAs to
> > go through TVE#0.
>
> Thanks, iommu=nobypass fixed it! Plugging in one or more USB devices
> no longer triggers any EEH errors.
>
> > Assuming the nobypass trick above works, what you really need to do is
> > have the driver report that it can't address all 64bits by setting its
> > DMA mask accordingly. For the xhci driver it looks like this is done
> > in xhci_gen_setup(), there might be a quirks-style interface for
> > working around bugs in specific controllers that you can use. Have a
> > poke around and see what you can find :)
>
> Yup, the xhci driver has a quirks system, and conveniently one of
> those is XHCI_NO_64BIT_SUPPORT. After making a 3-line patch to
> xhci-pci.c to add that quirk for this chip, the host controller is now
> able to work without setting iommu=nobypass in the kernel arguments.

Cool

> Thank you so much for your help! You've likely saved me several hours
> of reading documentation, as well as several more hours of fiddling
> around with the xhci driver.

> I'm almost disappointed the fix was so
> simple, but the time savings alone more than makes up for it. I'll
> submit the patch to the USB ML shortly.

It might be only three lines, but it doesn't mean it's trivial. Most
bugs like that require a lot of context to understand, let alone fix.
Thanks for looking into it!

Oliver


Re: ASMedia USB 3.x host controllers triggering EEH on POWER9

2020-07-17 Thread Forest Crossman
> In the future you can use this script to automate some of the tedium
> of parsing the eeh dumps:
> https://patchwork.ozlabs.org/project/skiboot/patch/20200717044243.1195833-1-ooh...@gmail.com/

Ah, nice, thanks for showing me this! I had written my own parser that
just dumped a few register names and what bits were set in each, but
your script seems much more feature-complete.

> Anyway, for background the way PHB3 and PHB4 handle incoming DMAs goes
> as follows:
>
> 1. Map the 16bit  number to an IOMMU context, we call
> those PEs. PE means "partitionable endpoint", but for the purpose of
> processing DMAs you can ignore that and just treat it as an IOMMU
> context ID.
> 2. Use the PE number and some of the upper bits of the DMA address to
> form the index into the Translation Validation Table.
> 3. Use the table entry to validate the DMA address is within bounds
> and whether it should be translated by the IOMMU or used as-is.
>
> If the table entry says the DMA needs to be translated by the IOMMU we'll 
> also:
> 4. Walk the IOMMU table to get the relevant IOMMU table entry.
> 5. Validate the device has permission to read/write to that address.
>
> The "TVT Address Range Error" you're seeing means that the bounds
> checks done in 3) is failing. OPAL configures the PHB so there's two
> TVT entries (TVEs for short) assigned to each PE. Bit 59 of the DMA
> address is used to select which TVE to use. We typically configure
> TVE#0 to map 0x0...0x8000_ so there's a 2GB 32bit DMA window.
> TVE#1 is configured for no-translate (bypass) mode so you can convert
> from a system physical address to a DMA address by ORing in bit 59.
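
As a rough illustration of the conversion described above (the macro and
helper names here are invented; only the "OR in bit 59" rule comes from the
explanation itself):

#include <linux/types.h>

/* TVE#1 no-translate window: bus address = physical address with bit 59 set */
#define PHB_DMA_BYPASS_BIT    (1ULL << 59)

static inline u64 phys_to_bypass_dma(u64 phys)
{
    return phys | PHB_DMA_BYPASS_BIT;
}

/*
 * A device that truncates the upper address bits drops bit 59, so its DMAs
 * get validated against TVE#0's 2GB 32-bit window instead and fault for any
 * buffer above 2GB.
 */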

Thanks for the in-depth explanation, I find these low-level details
really fascinating.

> From word 2 of the PEST entry the faulting DMA address is:
> 0x203974c0. That address is interesting since it looks a lot
> like a memory address on the 2nd chip, but it doesn't have bit 59 set
> so TVE#0 is used to validate it. Obviously that address is above 2GB
> so we get the error.

Ah, I see. Do you know if the information on the PEST registers is
documented publicly somewhere? I tried searching for what those
registers meant in the PHB4 spec but it basically just said, "the PEST
registers contain PEST data," which isn't particularly helpful.

> What's probably happening is that the ASmedia controller doesn't
> actually implement all 64 address bits and truncates the upper bits of
> the DMA address. Doing that is a blatant violation of the PCIe (and
> probably the xHCI) spec, but it's also pretty common since "it works
> on x86." Something to try would be booting with the iommu=nobypass in
> the kernel command line. That'll disable TVE#1 and force all DMAs to
> go through TVE#0.

Thanks, iommu=nobypass fixed it! Plugging in one or more USB devices
no longer triggers any EEH errors.

> Assuming the nobypass trick above works, what you really need to do is
> have the driver report that it can't address all 64bits by setting its
> DMA mask accordingly. For the xhci driver it looks like this is done
> in xhci_gen_setup(), there might be a quirks-style interface for
> working around bugs in specific controllers that you can use. Have a
> poke around and see what you can find :)

Yup, the xhci driver has a quirks system, and conveniently one of
those is XHCI_NO_64BIT_SUPPORT. After making a 3-line patch to
xhci-pci.c to add that quirk for this chip, the host controller is now
able to work without setting iommu=nobypass in the kernel arguments.
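
As a rough illustration (not the actual patch), a quirk hunk of this kind in
drivers/usb/host/xhci-pci.c:xhci_pci_quirks() typically looks like the
following; the vendor/device IDs below are placeholders:

    /* Placeholder IDs -- the real patch matches the specific ASMedia part */
    if (pdev->vendor == 0x1b21 /* ASMedia */ &&
        pdev->device == 0x2142 /* assumed device ID */)
        xhci->quirks |= XHCI_NO_64BIT_SUPPORT;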

Thank you so much for your help! You've likely saved me several hours
of reading documentation, as well as several more hours of fiddling
around with the xhci driver. I'm almost disappointed the fix was so
simple, but the time savings alone more than makes up for it. I'll
submit the patch to the USB ML shortly.

Thanks again!

Forest


[PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2020-07-17 Thread Pratik Rajesh Sampat
This patch adds support to trace IPI-based and timer-based wakeup
latency from idle states.

It latches onto the test-cpuidle_latency kernel module using the debugfs
interface to send IPIs or schedule a timer-based event, which in turn
populates the debugfs with the latency measurements.

Currently, the IPI and timer tests first disable all idle states and
then measure latency while incrementally enabling each state.

Signed-off-by: Pratik Rajesh Sampat 
---
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/cpuidle/Makefile   |   6 +
 tools/testing/selftests/cpuidle/cpuidle.sh | 257 +
 tools/testing/selftests/cpuidle/settings   |   1 +
 4 files changed, 265 insertions(+)
 create mode 100644 tools/testing/selftests/cpuidle/Makefile
 create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
 create mode 100644 tools/testing/selftests/cpuidle/settings

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..ab6cf51f3518 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -7,6 +7,7 @@ TARGETS += capabilities
 TARGETS += cgroup
 TARGETS += clone3
 TARGETS += cpufreq
+TARGETS += cpuidle
 TARGETS += cpu-hotplug
 TARGETS += drivers/dma-buf
 TARGETS += efivarfs
diff --git a/tools/testing/selftests/cpuidle/Makefile 
b/tools/testing/selftests/cpuidle/Makefile
new file mode 100644
index ..72fd5d2e974d
--- /dev/null
+++ b/tools/testing/selftests/cpuidle/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+TEST_PROGS := cpuidle.sh
+
+include ../lib.mk
diff --git a/tools/testing/selftests/cpuidle/cpuidle.sh 
b/tools/testing/selftests/cpuidle/cpuidle.sh
new file mode 100755
index ..5c313a3abb0c
--- /dev/null
+++ b/tools/testing/selftests/cpuidle/cpuidle.sh
@@ -0,0 +1,257 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+LOG=cpuidle.log
+MODULE=/lib/modules/$(uname -r)/kernel/drivers/cpuidle/test-cpuidle_latency.ko
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+helpme()
+{
+   printf "Usage: $0 [-h] [-todg args]
+   [-h ]
+   [-m ]
+   [-o ]
+   \n"
+   exit 2
+}
+
+parse_arguments()
+{
+   while getopts ht:m:o: arg
+   do
+   case $arg in
+   h) # --help
+   helpme
+   ;;
+   m) # --mod-file
+   MODULE=$OPTARG
+   ;;
+   o) # output log files
+   LOG=$OPTARG
+   ;;
+   \?)
+   helpme
+   ;;
+   esac
+   done
+}
+
+ins_mod()
+{
+   if [ ! -f "$MODULE" ]; then
+   printf "$MODULE module does not exist. Exiting\n"
+   exit $ksft_skip
+   fi
+   printf "Inserting $MODULE module\n\n"
+   insmod $MODULE
+   if [ $? != 0 ]; then
+   printf "Insmod $MODULE failed\n"
+   exit $ksft_skip
+   fi
+}
+
+compute_average()
+{
+   arr=("$@")
+   sum=0
+   size=${#arr[@]}
+   for i in "${arr[@]}"
+   do
+   sum=$((sum + i))
+   done
+   avg=$((sum/size))
+}
+
+# Disable all stop states
+disable_idle()
+{
+   for ((cpu=0; cpu 
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable
+   done
+   done
+}
+
+# Perform operation on each CPU for the given state
+# $1 - Operation: enable (0) / disable (1)
+# $2 - State to enable
+op_state()
+{
+   for ((cpu=0; cpu 
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$2/disable
+   done
+}
+
+# Extract latency in microseconds and convert to nanoseconds
+extract_latency()
+{
+   for ((state=0; state /dev/null &
+   task_pid=$!
+   # Wait for the workload to achieve 100% CPU usage
+   sleep 1
+fi
+taskset 0x1 echo $dest_cpu > 
/sys/kernel/debug/latency_test/ipi_cpu_dest
+ipi_latency=$(cat /sys/kernel/debug/latency_test/ipi_latency_ns)
+src_cpu=$(cat /sys/kernel/debug/latency_test/ipi_cpu_src)
+if [ "$1" = "baseline" ]; then
+   kill $task_pid
+   wait $task_pid 2>/dev/null
+fi
+}
+
+# Incrementally Enable idle states one by one and compute the latency
+run_ipi_tests()
+{
+extract_latency
+disable_idle
+declare -a avg_arr
+echo -e "--IPI Latency Test---" >> $LOG
+
+   echo -e "--Baseline IPI Latency measurement: CPU Busy--" >> $LOG
+   printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" 
>> $LOG
+   for ((cpu=0; cpu> 
$LOG
+   avg_arr+=($ipi_latency)
+   done
+   compute_average "${avg_arr[@]}"
+   echo -e "Baseline Aver

[PATCH v2 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states

2020-07-17 Thread Pratik Rajesh Sampat
Fire directed smp_call_function_single IPIs from a specified source
CPU to the specified target CPU to reduce the noise we have to wade
through in the trace log.
The module is based on the idea written by Srivatsa Bhat and maintained
by Vaidyanathan Srinivasan internally.

Queue an hrtimer and measure jitter: wakeup latency measurement for idle
states using an hrtimer.  Echo a value in ns to timer_test_function and
watch the trace. An hrtimer will be queued, and when it fires the expected
vs actual wakeup delta is computed and the delay printed in ns.

Implemented as a module which utilizes debugfs so that it can be
integrated with selftests.

To include the module, enable the config option and build it as a module under:
kernel hacking -> Cpuidle latency selftests

[srivatsa.b...@linux.vnet.ibm.com: Initial implementation in
 cpidle/sysfs]

[sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
 and fix some of the time calculation]

[e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and
 increase the resolution of IPI wakeup]

Signed-off-by: Pratik Rajesh Sampat 
---
 drivers/cpuidle/Makefile   |   1 +
 drivers/cpuidle/test-cpuidle_latency.c | 150 +
 lib/Kconfig.debug  |  10 ++
 3 files changed, 161 insertions(+)
 create mode 100644 drivers/cpuidle/test-cpuidle_latency.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index f07800cbb43f..2ae05968078c 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
 obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)   += cpuidle-haltpoll.o
+obj-$(CONFIG_IDLE_LATENCY_SELFTEST)  += test-cpuidle_latency.o
 
 
##
 # ARM SoC drivers
diff --git a/drivers/cpuidle/test-cpuidle_latency.c 
b/drivers/cpuidle/test-cpuidle_latency.c
new file mode 100644
index ..61574665e972
--- /dev/null
+++ b/drivers/cpuidle/test-cpuidle_latency.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Module-based API test facility for cpuidle latency using IPIs and timers
+ */
+
+#include 
+#include 
+#include 
+
+/* IPI based wakeup latencies */
+struct latency {
+   unsigned int src_cpu;
+   unsigned int dest_cpu;
+   ktime_t time_start;
+   ktime_t time_end;
+   u64 latency_ns;
+} ipi_wakeup;
+
+static void measure_latency(void *info)
+{
+   struct latency *v;
+   ktime_t time_diff;
+
+   v = (struct latency *)info;
+   v->time_end = ktime_get();
+   time_diff = ktime_sub(v->time_end, v->time_start);
+   v->latency_ns = ktime_to_ns(time_diff);
+}
+
+void run_smp_call_function_test(unsigned int cpu)
+{
+   ipi_wakeup.src_cpu = smp_processor_id();
+   ipi_wakeup.dest_cpu = cpu;
+   ipi_wakeup.time_start = ktime_get();
+   smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
+}
+
+/* Timer based wakeup latencies */
+struct timer_data {
+   unsigned int src_cpu;
+   u64 timeout;
+   ktime_t time_start;
+   ktime_t time_end;
+   struct hrtimer timer;
+   u64 timeout_diff_ns;
+} timer_wakeup;
+
+static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
+{
+   struct timer_data *w;
+   ktime_t time_diff;
+
+   w = container_of(hrtimer, struct timer_data, timer);
+   w->time_end = ktime_get();
+
+   time_diff = ktime_sub(w->time_end, w->time_start);
+   time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
+   w->timeout_diff_ns = ktime_to_ns(time_diff);
+   return HRTIMER_NORESTART;
+}
+
+static void run_timer_test(unsigned int ns)
+{
+   hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC,
+HRTIMER_MODE_REL);
+   timer_wakeup.timer.function = timer_called;
+   timer_wakeup.time_start = ktime_get();
+   timer_wakeup.src_cpu = smp_processor_id();
+   timer_wakeup.timeout = ns;
+
+   hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static struct dentry *dir;
+
+static int cpu_read_op(void *data, u64 *value)
+{
+   *value = ipi_wakeup.dest_cpu;
+   return 0;
+}
+
+static int cpu_write_op(void *data, u64 value)
+{
+   run_smp_call_function_test(value);
+   return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(ipi_ops, cpu_read_op, cpu_write_op, "%llu\n");
+
+static int timeout_read_op(void *data, u64 *value)
+{
+   *value = timer_wakeup.timeout;
+   return 0;
+}
+
+static int timeout_write_op(void *data, u64 value)
+{
+   run_timer_test(value);
+   return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(timeout_ops, timeout_read_op, timeout_write_op, 
"%llu\n");
+
+static int __init latency_init(void)
+{
+   struct dentry *temp;
+
+   dir = debugfs_create_dir("latency_test", 0);
+   if (!dir) {

[PATCH v2 0/2] Selftest for cpuidle latency measurement

2020-07-17 Thread Pratik Rajesh Sampat
v1: https://lkml.org/lkml/2020/7/7/1036
Changelog v1 --> v2
1. Based on Shuah Khan's comment, changed exit code to ksft_skip to
   indicate the test is being skipped
2. Change the busy workload for baseline measurement from
   "yes > /dev/null" to "cat /dev/random to /dev/null", based on
   observed CPU utilization: "yes" consumes ~60% CPU while the
   latter consumes 100% of the CPUs, giving more accurate baseline numbers
---

The patch series introduces a mechanism to measure wakeup latency for
IPI and timer based interrupts
The motivation behind this series is to find significant deviations
from advertised latency and residency values

To achieve this, we introduce a kernel module and expose its control
knobs through the debugfs interface that the selftests can engage with.

The kernel module provides the following interfaces within
/sys/kernel/debug/latency_test/ for,
1. IPI test:
  ipi_cpu_dest   # Destination CPU for the IPI
  ipi_cpu_src# Origin of the IPI
  ipi_latency_ns # Measured latency time in ns
2. Timeout test:
  timeout_cpu_src # CPU on which the timer to be queued
  timeout_expected_ns # Timer duration
  timeout_diff_ns # Difference of actual duration vs expected timer
To include the module, enable the config option and build it as a module under:
kernel hacking -> Cpuidle latency selftests
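
A minimal userspace sketch that drives the knobs listed above by hand (for
illustration only; the CPU number and timeout value are arbitrary examples):

#include <stdio.h>
#include <stdlib.h>

static void write_knob(const char *path, long val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        exit(1);
    }
    fprintf(f, "%ld\n", val);
    fclose(f);
}

static long read_knob(const char *path)
{
    long val = -1;
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        exit(1);
    }
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

int main(void)
{
    /* Fire an IPI at CPU 4 and read back the measured wakeup latency */
    write_knob("/sys/kernel/debug/latency_test/ipi_cpu_dest", 4);
    printf("IPI latency: %ld ns\n",
           read_knob("/sys/kernel/debug/latency_test/ipi_latency_ns"));

    /* Queue a 100us timer and read back the actual vs expected wakeup delta */
    write_knob("/sys/kernel/debug/latency_test/timeout_expected_ns", 100000);
    printf("Timer wakeup delta: %ld ns\n",
           read_knob("/sys/kernel/debug/latency_test/timeout_diff_ns"));
    return 0;
}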

The selftest inserts the module, disables all the idle states and
enables them one by one testing the following:
1. Keeping the source CPU constant, iterates through all the CPUs measuring
   IPI latency for baseline (CPU is busy with the
   "cat /dev/random > /dev/null" workload) and then when the CPU is
   allowed to be at rest
2. Iterating through all the CPUs, sending expected timer durations to
   be equivalent to the residency of the deepest idle state
   enabled and extracting the difference in time between the time of
   wakeup and the expected timer duration

Usage
-
Can be used in conjunction with the rest of the selftests.
Default Output location in: tools/testing/cpuidle/cpuidle.log

To run this test specifically:
$ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

There are a few optional arguments too that the script can take
[-h ]
[-m ]
[-o ]

Sample output snippet
-
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU IPI_Latency(ns)
...
0    8 1996
0    9 2125
0   10 1264
0   11 1788
0   12 2045
Baseline Average IPI latency(ns): 1843
---Enabling state: 5---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0    8   621719
0    9   624752
0   10   622218
0   11   623968
0   12   621303
Expected IPI latency(ns): 10
Observed Average IPI latency(ns): 622792

--Timeout Latency Test--
--Baseline Timeout Latency measurement: CPU Busy--
Wakeup_src Baseline_delay(ns) 
...
8    2249
9    2226
10   2211
11   2183
12   2263
Baseline Average timeout diff(ns): 2226
---Enabling state: 5---
8   10749   
9   10911   
10  10912   
11  12100   
12  73276   
Expected timeout(ns): 1200
Observed Average timeout diff(ns): 23589

Pratik Rajesh Sampat (2):
  cpuidle: Trace IPI based and timer based wakeup latency from idle
states
  selftest/cpuidle: Add support for cpuidle latency measurement

 drivers/cpuidle/Makefile   |   1 +
 drivers/cpuidle/test-cpuidle_latency.c | 150 
 lib/Kconfig.debug  |  10 +
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/cpuidle/Makefile   |   6 +
 tools/testing/selftests/cpuidle/cpuidle.sh | 257 +
 tools/testing/selftests/cpuidle/settings   |   1 +
 7 files changed, 426 insertions(+)
 create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
 create mode 100644 tools/testing/selftests/cpuidle/Makefile
 create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
 create mode 100644 tools/testing/selftests/cpuidle/settings

-- 
2.25.4



linux-next: manual merge of the set_fs tree with the powerpc tree

2020-07-17 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the set_fs tree got a conflict in:

  arch/powerpc/mm/numa.c

between commit:

  c30f931e891e ("powerpc/numa: remove ability to enable topology updates")

from the powerpc tree and commit:

  16a04bde8169 ("proc: switch over direct seq_read method calls to 
seq_read_iter")

from the set_fs tree.

I fixed it up (the former removed the code updated by the latter, so I
just did that) and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging.  You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH v2 2/5] powerpc/lib: Initialize a temporary mm for code patching

2020-07-17 Thread kernel test robot
Hi "Christopher,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on char-misc/char-misc-testing v5.8-rc5 next-20200716]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Christopher-M-Riedl/Use-per-CPU-temporary-mappings-for-patching/20200709-123827
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-randconfig-r013-20200717 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 
ed6b578040a85977026c93bf4188f996148f3218)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# install powerpc cross compiling tool for clang build
# apt-get install binutils-powerpc-linux-gnu
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

>> arch/powerpc/lib/code-patching.c:53:13: warning: no previous prototype for 
>> function 'poking_init' [-Wmissing-prototypes]
   void __init poking_init(void)
   ^
   arch/powerpc/lib/code-patching.c:53:1: note: declare 'static' if the 
function is not intended to be used outside of this translation unit
   void __init poking_init(void)
   ^
   static 
   1 warning generated.

vim +/poking_init +53 arch/powerpc/lib/code-patching.c

52  
  > 53  void __init poking_init(void)
54  {
55  spinlock_t *ptl; /* for protecting pte table */
56  pte_t *ptep;
57  
58  /*
59   * Some parts of the kernel (static keys for example) depend on
60   * successful code patching. Code patching under 
STRICT_KERNEL_RWX
61   * requires this setup - otherwise we cannot patch at all. We 
use
62   * BUG_ON() here and later since an early failure is preferred 
to
63   * buggy behavior and/or strange crashes later.
64   */
65  patching_mm = copy_init_mm();
66  BUG_ON(!patching_mm);
67  
68  /*
69   * In hash we cannot go above DEFAULT_MAP_WINDOW easily.
70   * XXX: Do we want additional bits of entropy for radix?
71   */
72  patching_addr = (get_random_long() & PAGE_MASK) %
73  (DEFAULT_MAP_WINDOW - PAGE_SIZE);
74  
75  ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
76  BUG_ON(!ptep);
77  pte_unmap_unlock(ptep, ptl);
78  }
79  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org




Re: [PATCH v3] powerpc/pseries: detect secure and trusted boot state of the system.

2020-07-17 Thread Michal Suchánek
On Fri, Jul 17, 2020 at 03:58:01PM +1000, Daniel Axtens wrote:
> Michal Suchánek  writes:
> 
> > On Wed, Jul 15, 2020 at 07:52:01AM -0400, Nayna Jain wrote:
> >> The device-tree property to check secure and trusted boot state is
> >> different for guests(pseries) compared to baremetal(powernv).
> >> 
> >> This patch updates the existing is_ppc_secureboot_enabled() and
> >> is_ppc_trustedboot_enabled() functions to add support for pseries.
> >> 
> >> The secureboot and trustedboot state are exposed via device-tree property:
> >> /proc/device-tree/ibm,secure-boot and /proc/device-tree/ibm,trusted-boot
> >> 
> >> The values of ibm,secure-boot under pseries are interpreted as:
> >   ^^^
> >> 
> >> 0 - Disabled
> >> 1 - Enabled in Log-only mode. This patch interprets this value as
> >> disabled, since audit mode is currently not supported for Linux.
> >> 2 - Enabled and enforced.
> >> 3-9 - Enabled and enforcing; requirements are at the discretion of the
> >> operating system.
> >> 
> >> The values of ibm,trusted-boot under pseries are interpreted as:
> >^^^
> > These two should be different I suppose?
> 
> I'm not quite sure what you mean? They'll be documented in a future
> revision of the PAPR, once I get my act together and submit the
> relevant internal paperwork.

Nevermind, one talks about secure boot, the other about trusted boot.

Thanks

Michal


Re: [PATCH 11/11] powerpc/smp: Provide an ability to disable coregroup

2020-07-17 Thread Gautham R Shenoy
On Tue, Jul 14, 2020 at 10:06:24AM +0530, Srikar Dronamraju wrote:
> If user wants to enable coregroup sched_domain then they can boot with
> kernel parameter "coregroup_support=on"
> 
> Cc: linuxppc-dev 
> Cc: Michael Ellerman 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Srikar Dronamraju 


We need this documented in the
Documentation/admin-guide/kernel-parameters.txt

Other than that, the patch looks good to me.
Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/kernel/smp.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index bb25c13bbb79..c43909e6e8e9 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -118,6 +118,23 @@ struct smp_ops_t *smp_ops;
>  volatile unsigned int cpu_callin_map[NR_CPUS];
> 
>  int smt_enabled_at_boot = 1;
> +int coregroup_support;
> +
> +static int __init early_coregroup(char *p)
> +{
> + if (!p)
> + return 0;
> +
> + if (strstr(p, "on"))
> + coregroup_support = 1;
> +
> + if (strstr(p, "1"))
> + coregroup_support = 1;
> +
> + return 0;
> +}
> +
> +early_param("coregroup_support", early_coregroup);
> 
>  /*
>   * Returns 1 if the specified cpu should be brought up during boot.
> @@ -878,7 +895,7 @@ static struct cpumask *cpu_coregroup_mask(int cpu)
> 
>  static bool has_coregroup_support(void)
>  {
> - return coregroup_enabled;
> + return coregroup_enabled && coregroup_support;
>  }
> 
>  static const struct cpumask *cpu_mc_mask(int cpu)
> -- 
> 2.17.1
> 


Re: [PATCH 10/11] powerpc/smp: Implement cpu_to_coregroup_id

2020-07-17 Thread Gautham R Shenoy
On Tue, Jul 14, 2020 at 10:06:23AM +0530, Srikar Dronamraju wrote:
> Lookup the coregroup id from the associativity array.
> 
> If unable to detect the coregroup id, fallback on the core id.
> This way, ensure sched_domain degenerates and an extra sched domain is
> not created.
> 
> Ideally this function should have been implemented in
> arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> don't need to find the primary domain again.
> 
> If the device-tree mentions more than one coregroup, then kernel
> implements only the last or the smallest coregroup, which currently
> corresponds to the penultimate domain in the device-tree.
> 
> Cc: linuxppc-dev 
> Cc: Michael Ellerman 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Srikar Dronamraju 
> ---
>  arch/powerpc/mm/numa.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d9ab9da85eab..4e85564ef62a 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1697,6 +1697,23 @@ static const struct proc_ops topology_proc_ops = {
> 
>  int cpu_to_coregroup_id(int cpu)
>  {
> + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> + int index;
> +

It would be good to have an assert here to ensure that we are calling
this function only when coregroups are enabled.

Else, we may end up returning the penultimate index which maps to the
chip-id.
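
For illustration, such a guard at the top of cpu_to_coregroup_id() could look
like this (a sketch, not the submitted patch):

    if (WARN_ON_ONCE(!coregroup_enabled))
        return cpu_to_core_id(cpu);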



> + if (cpu < 0 || cpu > nr_cpu_ids)
> + return -1;
> +
> + if (!firmware_has_feature(FW_FEATURE_VPHN))
> + goto out;
> +
> + if (vphn_get_associativity(cpu, associativity))
> + goto out;
> +
> + index = of_read_number(associativity, 1);
> + if ((index > min_common_depth + 1) && coregroup_enabled)
> + return of_read_number(&associativity[index - 1], 1);
> +
> +out:
>   return cpu_to_core_id(cpu);
>  }
> 
> -- 
> 2.17.1
> 


Re: [PATCH 09/11] Powerpc/smp: Create coregroup domain

2020-07-17 Thread Gautham R Shenoy
On Fri, Jul 17, 2020 at 01:49:26PM +0530, Gautham R Shenoy wrote:

> > +int cpu_to_coregroup_id(int cpu)
> > +{
> > +   return cpu_to_core_id(cpu);
> > +}
> 
> 
> So, if has_coregroup_support() returns true, then since the core_group
> identification is currently done through the core-id, the
> coregroup_mask is going to be the same as the
> cpu_core_mask/cpu_cpu_mask. Thus, we will be degenerating the DIE
> domain. Right ? Instead we could have kept the core-group to be a
> single bigcore by default, so that those domains can get degenerated
> preserving the legacy SMT, DIE, NUMA hierarchy.

Never mind, got confused with core_id and cpu_core_mask (which
corresponds to the cores of a chip).  cpu_to_core_id() does exactly
what we have described above. Sorry for the noise.

I am ok with this patch, except that I would request that all the
changes to the topology structure be done within a single function, for
ease of tracking, instead of being distributed all over the code.
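
For example, a sketch of such a consolidation (the function name is made up;
the mc_idx override is the one from the patch under review):

static void __init fixup_topology(void)
{
    if (!has_coregroup_support())
        powerpc_topology[mc_idx].mask = cpu_bigcore_mask;

    /* other adjustments to powerpc_topology[] would live here as well */
}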


> 
> 
> 
> 
> > +
> >  static int topology_update_init(void)
> >  {
> > start_topology_update();
> > -- 
> > 2.17.1
> > 


Re: [PATCH 09/11] Powerpc/smp: Create coregroup domain

2020-07-17 Thread Gautham R Shenoy
On Tue, Jul 14, 2020 at 10:06:22AM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev 
> Cc: Michael Ellerman 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Srikar Dronamraju 
> ---
>  arch/powerpc/include/asm/topology.h | 10 
>  arch/powerpc/kernel/smp.c   | 37 +
>  arch/powerpc/mm/numa.c  |  5 
>  3 files changed, 52 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 2db7ba789720..34812c35018e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -98,6 +98,7 @@ extern int stop_topology_update(void);
>  extern int prrn_is_enabled(void);
>  extern int find_and_online_cpu_nid(int cpu);
>  extern int timed_topology_update(int nsecs);
> +extern int cpu_to_coregroup_id(int cpu);
>  #else
>  static inline int start_topology_update(void)
>  {
> @@ -120,6 +121,15 @@ static inline int timed_topology_update(int nsecs)
>   return 0;
>  }
> 
> +static inline int cpu_to_coregroup_id(int cpu)
> +{
> +#ifdef CONFIG_SMP
> + return cpu_to_core_id(cpu);
> +#else
> + return 0;
> +#endif
> +}
> +
>  #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
> 
>  #include 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index ef19eeccd21e..bb25c13bbb79 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
> +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
> 
>  EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> @@ -91,6 +92,7 @@ enum {
>   smt_idx,
>  #endif
>   bigcore_idx,
> + mc_idx,
>   die_idx,
>  };
> 
> @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> + return per_cpu(cpu_coregroup_map, cpu);
> +}
> +
> +static bool has_coregroup_support(void)
> +{
> + return coregroup_enabled;
> +}
> +
> +static const struct cpumask *cpu_mc_mask(int cpu)
> +{
> + return cpu_coregroup_mask(cpu);
> +}
> +
>  static const struct cpumask *cpu_bigcore_mask(int cpu)
>  {
>   return cpu_core_mask(cpu);
> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level 
> powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>   { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> + { cpu_mc_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };
> @@ -933,6 +951,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   GFP_KERNEL, node);
>   zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
>   GFP_KERNEL, node);
> + if (has_coregroup_support())
> + zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, 
> cpu),
> + GFP_KERNEL, node);
> +
>  #ifdef CONFIG_NEED_MULTIPLE_NODES
>   /*
>* numa_node_id() works after this.
> @@ -950,6 +972,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
>   cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
> 
> + if (has_coregroup_support())
> + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> + else
> + powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> +

The else part could be moved to the common function where we are
modifying the other attributes of the topology array.


>   init_big_cores();
>   if (has_big_cores) {
>   cpumask_set_cpu(boot_cpuid,
> @@ -1241,6 +1268,8 @@ static void remove_cpu_from_masks(int cpu)
>   set_cpus_unrelated(cpu, i, cpu_sibling_mask);
>   if (has_big_cores)
>   set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
> + if (has_coregroup_support())
> + set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
>   }
>  }
>  #endif
> @@ -1301,6 +1330,14 @@ static void add_cpu_to_masks(int cpu)
>   add_cpu_to_smallcore_masks(cpu);
>   update_mask_by_l2(cpu, cpu_l2_cache_mask);
> 
> + if (has_coregroup_support()) {
> + cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
> + for_each_cpu(i, cpu_online_mask) {
> + if (cpu_to_coregroup_id(cpu) == cpu

Re: [PATCH 08/11] powerpc/smp: Allocate cpumask only after searching thread group

2020-07-17 Thread Gautham R Shenoy
On Tue, Jul 14, 2020 at 10:06:21AM +0530, Srikar Dronamraju wrote:
> If allocated earlier and the search fails, then cpumask need to be
> freed. However cpu_l1_cache_map can be allocated after we search thread
> group.
> 
> Cc: linuxppc-dev 
> Cc: Michael Ellerman 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Srikar Dronamraju 

Good fix.

Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/kernel/smp.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 96e47450d9b3..ef19eeccd21e 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
>   if (err)
>   goto out;
> 
> - zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> - GFP_KERNEL,
> - cpu_to_node(cpu));
> -
>   cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> 
>   if (unlikely(cpu_group_start == -1)) {
> @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
>   goto out;
>   }
> 
> + zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> + GFP_KERNEL, cpu_to_node(cpu));
> +
>   for (i = first_thread; i < first_thread + threads_per_core; i++) {
>   int i_group_start = get_cpu_thread_group_start(i, &tg);
> 
> -- 
> 2.17.1
> 


Re: [PATCH 07/11] Powerpc/numa: Detect support for coregroup

2020-07-17 Thread Gautham R Shenoy
On Tue, Jul 14, 2020 at 10:06:20AM +0530, Srikar Dronamraju wrote:
> Add support for grouping cores based on the device-tree classification.
> - The last domain in the associativity domains always refers to the
> core.
> - If primary reference domain happens to be the penultimate domain in
> the associativity domains device-tree property, then there are no
> coregroups. However if its not a penultimate domain, then there are
> coregroups. There can be more than one coregroup. For now we would be
> interested in the last or the smallest coregroups.
> 
> Cc: linuxppc-dev 
> Cc: Michael Ellerman 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Srikar Dronamraju 

This looks good to me from the discovery of the core-group point of
view.

Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/include/asm/smp.h |  1 +
>  arch/powerpc/kernel/smp.c  |  1 +
>  arch/powerpc/mm/numa.c | 34 +-
>  3 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..5bdc17a7049f 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -28,6 +28,7 @@
>  extern int boot_cpuid;
>  extern int spinning_secondaries;
>  extern u32 *cpu_to_phys_id;
> +extern bool coregroup_enabled;
> 
>  extern void cpu_die(void);
>  extern int cpu_to_chip_id(int cpu);
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index f8faf75135af..96e47450d9b3 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
> 
>  struct task_struct *secondary_current;
>  bool has_big_cores;
> +bool coregroup_enabled;
> 
>  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index fc7b0505bdd8..a43eab455be4 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -887,7 +887,9 @@ static void __init setup_node_data(int nid, u64 
> start_pfn, u64 end_pfn)
>  static void __init find_possible_nodes(void)
>  {
>   struct device_node *rtas;
> - u32 numnodes, i;
> + const __be32 *domains;
> + int prop_length, max_nodes;
> + u32 i;
> 
>   if (!numa_enabled)
>   return;
> @@ -896,25 +898,31 @@ static void __init find_possible_nodes(void)
>   if (!rtas)
>   return;
> 
> - if (of_property_read_u32_index(rtas, 
> "ibm,current-associativity-domains",
> - min_common_depth, &numnodes)) {
> - /*
> -  * ibm,current-associativity-domains is a fairly recent
> -  * property. If it doesn't exist, then fallback on
> -  * ibm,max-associativity-domains. Current denotes what the
> -  * platform can support compared to max which denotes what the
> -  * Hypervisor can support.
> -  */
> - if (of_property_read_u32_index(rtas, 
> "ibm,max-associativity-domains",
> - min_common_depth, &numnodes))
> + /*
> +  * ibm,current-associativity-domains is a fairly recent property. If
> +  * it doesn't exist, then fallback on ibm,max-associativity-domains.
> +  * Current denotes what the platform can support compared to max
> +  * which denotes what the Hypervisor can support.
> +  */
> + domains = of_get_property(rtas, "ibm,current-associativity-domains",
> + &prop_length);
> + if (!domains) {
> + domains = of_get_property(rtas, "ibm,max-associativity-domains",
> + &prop_length);
> + if (!domains)
>   goto out;
>   }
> 
> - for (i = 0; i < numnodes; i++) {
> + max_nodes = of_read_number(&domains[min_common_depth], 1);
> + for (i = 0; i < max_nodes; i++) {
>   if (!node_possible(i))
>   node_set(i, node_possible_map);
>   }
> 
> + prop_length /= sizeof(int);
> + if (prop_length > min_common_depth + 2)
> + coregroup_enabled = 1;
> +
>  out:
>   of_node_put(rtas);
>  }
> -- 
> 2.17.1
> 


[v4 5/5] KVM: PPC: Book3S HV: migrate hot plugged memory

2020-07-17 Thread Ram Pai
From: Laurent Dufour 

When a memory slot is hot plugged to a SVM, PFNs associated with the
GFNs in that slot must be migrated to secure-PFNs, aka device-PFNs.

Call kvmppc_uv_migrate_mem_slot() to accomplish this.
Disable page-merge for all pages in the memory slot.

Signed-off-by: Ram Pai 
[rearranged the code, and modified the commit log]
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 10 ++
 arch/powerpc/kvm/book3s_hv.c| 10 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 22 ++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index f229ab5..6f7da00 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -25,6 +25,9 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot 
*free,
 struct kvm *kvm, bool skip_page_out);
 int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
const struct kvm_memory_slot *memslot);
+void kvmppc_memslot_create(struct kvm *kvm, const struct kvm_memory_slot *new);
+void kvmppc_memslot_delete(struct kvm *kvm, const struct kvm_memory_slot *old);
+
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -84,5 +87,12 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
 static inline void
 kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
struct kvm *kvm, bool skip_page_out) { }
+
+static inline void  kvmppc_memslot_create(struct kvm *kvm,
+   const struct kvm_memory_slot *new) { }
+
+static inline void  kvmppc_memslot_delete(struct kvm *kvm,
+   const struct kvm_memory_slot *old) { }
+
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d331b46..bf3be3b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4515,16 +4515,10 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
 
switch (change) {
case KVM_MR_CREATE:
-   if (kvmppc_uvmem_slot_init(kvm, new))
-   return;
-   uv_register_mem_slot(kvm->arch.lpid,
-new->base_gfn << PAGE_SHIFT,
-new->npages * PAGE_SIZE,
-0, new->id);
+   kvmppc_memslot_create(kvm, new);
break;
case KVM_MR_DELETE:
-   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
-   kvmppc_uvmem_slot_free(kvm, old);
+   kvmppc_memslot_delete(kvm, old);
break;
default:
/* TODO: Handle KVM_MR_MOVE */
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a206984..a2b4d25 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1089,6 +1089,28 @@ int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned 
long gfn)
return (ret == U_SUCCESS) ? RESUME_GUEST : -EFAULT;
 }
 
+void kvmppc_memslot_create(struct kvm *kvm, const struct kvm_memory_slot *new)
+{
+   if (kvmppc_uvmem_slot_init(kvm, new))
+   return;
+
+   if (kvmppc_memslot_page_merge(kvm, new, false))
+   return;
+
+   if (uv_register_mem_slot(kvm->arch.lpid, new->base_gfn << PAGE_SHIFT,
+   new->npages * PAGE_SIZE, 0, new->id))
+   return;
+
+   kvmppc_uv_migrate_mem_slot(kvm, new);
+}
+
+void kvmppc_memslot_delete(struct kvm *kvm, const struct kvm_memory_slot *old)
+{
+   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   kvmppc_memslot_page_merge(kvm, old, true);
+   kvmppc_uvmem_slot_free(kvm, old);
+}
+
 static u64 kvmppc_get_secmem_size(void)
 {
struct device_node *np;
-- 
1.8.3.1



[v4 4/5] KVM: PPC: Book3S HV: retry page migration before erroring-out

2020-07-17 Thread Ram Pai
The page requested for page-in can sometimes have transient
references, and hence cannot be migrated immediately. Retry a few times
before returning an error.

The same is true for non-migrated pages that are migrated in the
H_SVM_INIT_DONE handler.  Retry a few times before returning an error.

The H_SVM_PAGE_IN interface is enhanced to return H_BUSY if the page is
not in a migratable state.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org

Signed-off-by: Ram Pai 
---
 Documentation/powerpc/ultravisor.rst |   1 +
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 106 ---
 2 files changed, 74 insertions(+), 33 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index ba6b1bf..fe533ad 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -1035,6 +1035,7 @@ Return values
* H_PARAMETER   if ``guest_pa`` is invalid.
* H_P2  if ``flags`` is invalid.
* H_P3  if ``order`` of page is invalid.
+   * H_BUSYif ``page`` is not in a state to pagein
 
 Description
 ~~~
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 3274663..a206984 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -672,7 +672,7 @@ static int kvmppc_svm_migrate_page(struct vm_area_struct 
*vma,
return ret;
 
if (!(*mig.src & MIGRATE_PFN_MIGRATE)) {
-   ret = -1;
+   ret = -2;
goto out_finalize;
}
 
@@ -700,43 +700,73 @@ static int kvmppc_svm_migrate_page(struct vm_area_struct 
*vma,
return ret;
 }
 
-int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
-   const struct kvm_memory_slot *memslot)
+/*
+ * return 1, if some page migration failed because of transient error,
+ * while the remaining pages migrated successfully.
+ * The caller can use this as a hint to retry.
+ *
+ * return 0 otherwise. *ret indicates the success status
+ * of this call.
+ */
+static bool __kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot, int *ret)
 {
unsigned long gfn = memslot->base_gfn;
struct vm_area_struct *vma;
unsigned long start, end;
-   int ret = 0;
+   bool retry = false;
 
+   *ret = 0;
while (kvmppc_next_nontransitioned_gfn(memslot, kvm, &gfn)) {
 
mmap_read_lock(kvm->mm);
start = gfn_to_hva(kvm, gfn);
if (kvm_is_error_hva(start)) {
-   ret = H_STATE;
+   *ret = H_STATE;
goto next;
}
 
end = start + (1UL << PAGE_SHIFT);
vma = find_vma_intersection(kvm->mm, start, end);
if (!vma || vma->vm_start > start || vma->vm_end < end) {
-   ret = H_STATE;
+   *ret = H_STATE;
goto next;
}
 
mutex_lock(&kvm->arch.uvmem_lock);
-   ret = kvmppc_svm_migrate_page(vma, start, end,
+   *ret = kvmppc_svm_migrate_page(vma, start, end,
(gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
mutex_unlock(&kvm->arch.uvmem_lock);
-   if (ret)
-   ret = H_STATE;
 
 next:
mmap_read_unlock(kvm->mm);
+   if (*ret == -2) {
+   retry = true;
+   continue;
+   }
 
-   if (ret)
-   break;
+   if (*ret)
+   return false;
}
+   return retry;
+}
+
+#define REPEAT_COUNT 10
+
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot)
+{
+   int ret = 0, repeat_count = REPEAT_COUNT;
+
+   /*
+* try migration of pages in the memslot 'repeat_count' number of
+* times, provided each time migration fails because of transient
+* errors only.
+*/
+   while (__kvmppc_uv_migrate_mem_slot(kvm, memslot, &ret) &&
+   repeat_count--)
+   ;
+
return ret;
 }
 
@@ -812,7 +842,7 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, 
unsigned long gpa,
struct vm_area_struct *vma;
int srcu_idx;
unsigned long gfn = gpa >> page_shift;
-   int ret;
+   int ret, repeat_count = REPEAT_COUNT;
 
if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
return H_UNSUPPORTED;
@@ -826,34 +856,44 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, 
unsigned long gpa,
if (flags & H_PAG

[v4 3/5] KVM: PPC: Book3S HV: in H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs.

2020-07-17 Thread Ram Pai
The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the
pages of the SVM before calling H_SVM_INIT_DONE. This causes a huge
delay in transitioning the VM to an SVM. The Ultravisor is only interested
in the pages that contain the kernel, initrd and other important data
structures. The rest contain throw-away content.

However, if not all pages are requested by the Ultravisor, the Hypervisor
continues to consider the GFNs corresponding to the non-requested pages
as normal GFNs. This can lead to data corruption and undefined behavior.

In the H_SVM_INIT_DONE handler, move all the PFNs associated with the SVM's
GFNs to secure-PFNs. Skip the GFNs that are already Paged-in, Shared,
or Paged-in followed by a Paged-out.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
---
 Documentation/powerpc/ultravisor.rst|   2 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |   2 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 135 +---
 3 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index a1c8c37..ba6b1bf 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -934,6 +934,8 @@ Return values
* H_UNSUPPORTED if called from the wrong context (e.g.
from an SVM or before an H_SVM_INIT_START
hypercall).
+   * H_STATE   if the hypervisor could not successfully
+transition the VM to Secure VM.
 
 Description
 ~~~
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9cb7d8b..f229ab5 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -23,6 +23,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
 void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out);
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index df2e272..3274663 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -93,6 +93,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct dev_pagemap kvmppc_uvmem_pgmap;
 static unsigned long *kvmppc_uvmem_bitmap;
@@ -348,8 +349,45 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
struct kvm *kvm,
return false;
 }
 
+/*
+ * starting from *gfn search for the next available GFN that is not yet
+ * transitioned to a secure GFN.  return the value of that GFN in *gfn.  If a
+ * GFN is found, return true, else return false
+ */
+static bool kvmppc_next_nontransitioned_gfn(const struct kvm_memory_slot 
*memslot,
+   struct kvm *kvm, unsigned long *gfn)
+{
+   struct kvmppc_uvmem_slot *p;
+   bool ret = false;
+   unsigned long i;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+
+   list_for_each_entry(p, &kvm->arch.uvmem_pfns, list)
+   if (*gfn >= p->base_pfn && *gfn < p->base_pfn + p->nr_pfns)
+   break;
+   if (!p)
+   goto out;
+   /*
+* The code below assumes, one to one correspondence between
+* kvmppc_uvmem_slot and memslot.
+*/
+   for (i = *gfn; i < p->base_pfn + p->nr_pfns; i++) {
+   unsigned long index = i - p->base_pfn;
+
+   if (!(p->pfns[index] & KVMPPC_GFN_FLAG_MASK)) {
+   *gfn = i;
+   ret = true;
+   break;
+   }
+   }
+out:
+   mutex_unlock(&kvm->arch.uvmem_lock);
+   return ret;
+}
+
 static int kvmppc_memslot_page_merge(struct kvm *kvm,
-   struct kvm_memory_slot *memslot, bool merge)
+   const struct kvm_memory_slot *memslot, bool merge)
 {
unsigned long gfn = memslot->base_gfn;
unsigned long end, start = gfn_to_hva(kvm, gfn);
@@ -461,12 +499,31 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 {
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *memslot;
+   int srcu_idx;
+   long ret = H_SUCCESS;
+
if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
return H_UNSUPPORTED;
 
+   /* migrate any unmoved normal pfn to device pfns*/
+   srcu_idx = srcu_read_lock(&kvm->srcu);

[v4 1/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START

2020-07-17 Thread Ram Pai
Page-merging of pages in memory slots associated with a Secure VM
is disabled in the H_SVM_PAGE_IN handler.

This operation should have been done much earlier, the moment the VM
is initiated for secure transition. Delaying this operation increases
the probability of those pages acquiring new references, making it
impossible to migrate those pages.

Disable page-merging in H_SVM_INIT_START handling instead.

Signed-off-by: Ram Pai 
---
 Documentation/powerpc/ultravisor.rst |  1 +
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 98 +++-
 2 files changed, 76 insertions(+), 23 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index df136c8..a1c8c37 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -895,6 +895,7 @@ Return values
 One of the following values:
 
* H_SUCCESS  on success.
+* H_STATEif the VM is not in a position to switch to secure.
 
 Description
 ~~~
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e6f76bc..0baa293 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -211,6 +211,65 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
struct kvm *kvm,
return false;
 }
 
+static int kvmppc_memslot_page_merge(struct kvm *kvm,
+   struct kvm_memory_slot *memslot, bool merge)
+{
+   unsigned long gfn = memslot->base_gfn;
+   unsigned long end, start = gfn_to_hva(kvm, gfn);
+   int ret = 0;
+   struct vm_area_struct *vma;
+   int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
+
+   if (kvm_is_error_hva(start))
+   return H_STATE;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+
+   mmap_write_lock(kvm->mm);
+   do {
+   vma = find_vma_intersection(kvm->mm, start, end);
+   if (!vma) {
+   ret = H_STATE;
+   break;
+   }
+   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ merge_flag, &vma->vm_flags);
+   if (ret) {
+   ret = H_STATE;
+   break;
+   }
+   start = vma->vm_end + 1;
+   } while (end > vma->vm_end);
+
+   mmap_write_unlock(kvm->mm);
+   return ret;
+}
+
+static int __kvmppc_page_merge(struct kvm *kvm, bool merge)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *memslot;
+   int ret = 0;
+
+   slots = kvm_memslots(kvm);
+   kvm_for_each_memslot(memslot, slots) {
+   ret = kvmppc_memslot_page_merge(kvm, memslot, merge);
+   if (ret)
+   break;
+   }
+   return ret;
+}
+
+static inline int kvmppc_disable_page_merge(struct kvm *kvm)
+{
+   return __kvmppc_page_merge(kvm, false);
+}
+
+static inline int kvmppc_enable_page_merge(struct kvm *kvm)
+{
+   return __kvmppc_page_merge(kvm, true);
+}
+
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 {
struct kvm_memslots *slots;
@@ -232,11 +291,18 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
return H_AUTHORITY;
 
srcu_idx = srcu_read_lock(&kvm->srcu);
+
+   /* disable page-merging for all memslot */
+   ret = kvmppc_disable_page_merge(kvm);
+   if (ret)
+   goto out;
+
+   /* register the memslot */
slots = kvm_memslots(kvm);
kvm_for_each_memslot(memslot, slots) {
if (kvmppc_uvmem_slot_init(kvm, memslot)) {
ret = H_PARAMETER;
-   goto out;
+   break;
}
ret = uv_register_mem_slot(kvm->arch.lpid,
   memslot->base_gfn << PAGE_SHIFT,
@@ -245,9 +311,12 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
if (ret < 0) {
kvmppc_uvmem_slot_free(kvm, memslot);
ret = H_PARAMETER;
-   goto out;
+   break;
}
}
+
+   if (ret)
+   kvmppc_enable_page_merge(kvm);
 out:
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
@@ -384,7 +453,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  */
 static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
   unsigned long end, unsigned long gpa, struct kvm *kvm,
-  unsigned long page_shift, bool *downgrade)
+  unsigned long page_shift)
 {
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma mig;
@@ -400,18 +469,6 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, 
unsigned long start,
mig.src = &src_pfn;
mig.dst = &dst_pfn;
 
-   /*
-* We come here

[v4 2/5] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-07-17 Thread Ram Pai
During the life of an SVM, its GFNs transition through normal, secure and
shared states. Since the kernel does not track GFNs that are shared, it
is not possible to disambiguate a shared GFN from a GFN whose PFN has
not yet been migrated to a secure-PFN. It is also not possible to
disambiguate a secure-GFN from a GFN whose PFN has been paged out from
the ultravisor.

The ability to identify the state of a GFN is needed to skip migration
of its PFN to secure-PFN during ESM transition.

The code is re-organized to track the states of a GFN as explained
below.


 1. States of a GFN
---
 The GFN can be in one of the following states.

 (a) Secure - The GFN is secure. The GFN is associated with
a Secure VM, the contents of the GFN is not accessible
to the Hypervisor.  This GFN can be backed by a secure-PFN,
or can be backed by a normal-PFN with contents encrypted.
The former is true when the GFN is paged-in into the
ultravisor. The latter is true when the GFN is paged-out
of the ultravisor.

 (b) Shared - The GFN is shared. The GFN is associated with
a secure VM. The contents of the GFN are accessible to
the Hypervisor. This GFN is backed by a normal-PFN and its
content is un-encrypted.

 (c) Normal - The GFN is normal. The GFN is associated with
a normal VM. The contents of the GFN are accessible to
the Hypervisor. Its content is never encrypted.

 2. States of a VM.
---

 (a) Normal VM:  A VM whose contents are always accessible to
the hypervisor.  All its GFNs are normal-GFNs.

 (b) Secure VM: A VM whose contents are not accessible to the
hypervisor without the VM's consent.  Its GFNs are
either Shared-GFN or Secure-GFNs.

 (c) Transient VM: A Normal VM that is transitioning to secure VM.
The transition starts on successful return of
H_SVM_INIT_START, and ends on successful return
of H_SVM_INIT_DONE. This transient VM, can have GFNs
in any of the three states; i.e Secure-GFN, Shared-GFN,
and Normal-GFN. The VM never executes in this state
in supervisor-mode.

 3. Memory slot State.
--
The state of a memory slot mirrors the state of the
VM the memory slot is associated with.

 4. VM State transition.


  A VM always starts in Normal Mode.

  H_SVM_INIT_START moves the VM into transient state. During this
  time the Ultravisor may request some of its GFNs to be shared or
  secured. So its GFNs can be in one of the three GFN states.

  H_SVM_INIT_DONE moves the VM entirely from transient state to
  secure-state. At this point any left-over normal-GFNs are
  transitioned to Secure-GFN.

  H_SVM_INIT_ABORT moves the transient VM back to normal VM.
  All its GFNs are moved to Normal-GFNs.

  UV_TERMINATE transitions the secure-VM back to normal-VM. All
  the secure-GFN and shared-GFNs are tranistioned to normal-GFN
  Note: The contents of the normal-GFN is undefined at this point.

 5. GFN state implementation:
-

 Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
 when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
 set, and contains the value of the secure-PFN.
 It is associated with a normal-PFN; also called mem_pfn, when
 the GFN is pagedout. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
 The value of the normal-PFN is not tracked.

 Shared GFN is associated with a normal-PFN. Its pfn[] has
 KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
 is not tracked.

 Normal GFN is associated with normal-PFN. Its pfn[] has
 no flag set. The value of the normal-PFN is not tracked.

 6. Life cycle of a GFN

 --
 || Share  |  Unshare | SVM   |H_SVM_INIT_DONE|
 ||operation   |operation | abort/|   |
 |||  | terminate |   |
 -
 |||  |   |   |
 | Secure | Shared | Secure   |Normal |Secure |
 |||  |   |   |
 | Shared | Shared | Secure   |Normal |Shared |
 |||  |   |   |
 | Normal | Shared | Secure   |Normal |Secure |
 --

 7. Life cycle of a VM

 
 | |  start|  H_SVM_  |H_SVM_   |H_SVM_ |UV_SVM_|
 | |  VM   |INIT_START|INIT_DONE|INIT_ABORT |TERMINATE  |
 | |   |  | |   |   |
 - ---

[v4 0/5] Migrate non-migrated pages of a SVM.

2020-07-17 Thread Ram Pai
The time to switch a VM to a Secure-VM increases with the size of the VM.
A 100GB VM takes about 7 minutes. This is unacceptable.  This linear
increase is caused by suboptimal behavior of the Ultravisor and the
Hypervisor.  The Ultravisor unnecessarily migrates all the GFNs of the
VM from normal-memory to secure-memory. It only has to migrate the
necessary and sufficient GFNs.

However, when the optimization is incorporated in the Ultravisor, the
Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption
that the Ultravisor will explicitly request to migrate each and every
GFN of the VM. If only the necessary and sufficient GFNs are requested for
migration, the Hypervisor continues to manage the remaining GFNs as
normal GFNs. This leads to memory corruption, manifested
consistently when the SVM reboots.

The same is true when a memory slot is hotplugged into an SVM. The
Hypervisor expects the Ultravisor to request migration of all GFNs to
secure-GFNs.  But the Hypervisor cannot handle any H_SVM_PAGE_IN
requests from the Ultravisor done in the context of the
UV_REGISTER_MEM_SLOT ucall.  This problem manifests as random errors
in the SVM when a memory slot is hotplugged.

This patch series automatically migrates the non-migrated pages of a
SVM, and thus solves the problem.

Testing: Passed rigorous testing using various sized SVMs.

Changelog:

v4:  .  Incorporated Bharata's comments:
- Optimization -- replace the write mmap semaphore with the read mmap semaphore.
- Disable page-merge during memory hotplug.
- Rearranged the patches; consolidated the page-migration-retry logic
in a single patch.

v3: . Optimized the page-migration retry logic.
. Relax and relinquish the CPU regularly while bulk migrating
the non-migrated pages. This issue was causing soft lockups.
Fixed it.
. Added a new patch to retry page migration a couple of times
before returning H_BUSY in H_SVM_PAGE_IN. This issue was
seen a few times in a 24-hour continuous reboot test of the SVMs.

v2: . Fixed a bug observed by Laurent. The state of the GFNs associated
with Secure-VMs was not reset during memslot flush.
. Re-organized the code for easier review.
. Better description of the patch series.

v1: Fixed a bug observed by Bharata. Pages that were paged-in and later
paged-out must also be skipped from migration during H_SVM_INIT_DONE.


Laurent Dufour (1):
  KVM: PPC: Book3S HV: migrate hot plugged memory

Ram Pai (4):
  KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
  KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs
  KVM: PPC: Book3S HV: in H_SVM_INIT_DONE, migrate remaining normal-GFNs
to secure-GFNs.
  KVM: PPC: Book3S HV: retry page migration before erroring-out

 Documentation/powerpc/ultravisor.rst|   4 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  12 +
 arch/powerpc/kvm/book3s_hv.c|  10 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 508 
 4 files changed, 457 insertions(+), 77 deletions(-)

-- 
1.8.3.1