Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc

2020-07-30 Thread Athira Rajeev



> On 27-Jul-2020, at 10:46 PM, Athira Rajeev wrote:
> 
> Patch set to add support for perf extended register capability in
> powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to
> indicate that a PMU supports extended registers. The generic code
> defines the mask of extended registers as 0 for unsupported architectures.
> 
> Patches 1 and 2 are the kernel side changes needed to include
> base support for extended regs in powerpc and in power10.
> Patches 3 and 4 are the perf tools side changes needed to support the
> extended registers.
> 

Hi Arnaldo, Jiri

Please let me know if you have any comments or suggestions on this patch series to
add support for perf extended regs.

Thanks
Athira

> Patch 1/4 defines the PERF_PMU_CAP_EXTENDED_REGS mask to output the
> values of mmcr0, mmcr1, mmcr2 for POWER9. It defines `PERF_REG_EXTENDED_MASK`
> at runtime, which contains the mask value of the supported registers under
> extended regs.
> 
> Patch 2/4 adds the extended regs support for power10 and exposes
> MMCR3, SIER2, SIER3 registers as part of extended regs.
> 
> Patches 3/4 and 4/4 add extended regs to sample_reg_mask on the tool
> side for use with the `-I?` option on power9 and power10 respectively.
> 
> Ravi Bangoria found an issue with `perf record -I` while testing the
> changes. The same issue is currently being worked on here:
> https://lkml.org/lkml/2020/7/19/413 and will be resolved once the fix
> from Jin Yao is merged.
> 
> This patch series is based on powerpc/next
> 
> Changelog:
> 
> Changes from v4 -> v5
> - initialize `perf_reg_extended_max` to work on
>  all platforms as suggested by Ravi Bangoria
> - Added Reviewed-and-Tested-by from Ravi Bangoria
> 
> Changes from v3 -> v4
> - Split the series and send extended regs as separate patch set here.
>  Link to previous series :
>  https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=190462=*
>  Other PMU patches are already merged in powerpc/next.
> 
> - Fixed kernel build issue when using config having
>  CONFIG_PERF_EVENTS set and without CONFIG_PPC_PERF_CTRS
>  reported by kernel build bot.
> - Included Reviewed-by from Kajol Jain.
> - Addressed review comments from Ravi Bangoria to initialize
>  `perf_reg_extended_max` and define it in lowercase since it is a
>  local variable.
> 
> Anju T Sudhakar (2):
>  powerpc/perf: Add support for outputting extended regs in perf
>    intr_regs
>  tools/perf: Add perf tools support for extended register capability in
>    powerpc
> 
> Athira Rajeev (2):
>  powerpc/perf: Add extended regs support for power10 platform
>  tools/perf: Add perf tools support for extended regs in power10
> 
> arch/powerpc/include/asm/perf_event.h   |  3 ++
> arch/powerpc/include/asm/perf_event_server.h|  5 +++
> arch/powerpc/include/uapi/asm/perf_regs.h   | 20 -
> arch/powerpc/perf/core-book3s.c |  1 +
> arch/powerpc/perf/perf_regs.c   | 44 ++--
> arch/powerpc/perf/power10-pmu.c |  6 +++
> arch/powerpc/perf/power9-pmu.c  |  6 +++
> tools/arch/powerpc/include/uapi/asm/perf_regs.h | 20 -
> tools/perf/arch/powerpc/include/perf_regs.h |  8 +++-
> tools/perf/arch/powerpc/util/header.c   |  9 +---
> tools/perf/arch/powerpc/util/perf_regs.c| 55 +
> tools/perf/arch/powerpc/util/utils_header.h | 15 +++
> 12 files changed, 178 insertions(+), 14 deletions(-)
> create mode 100644 tools/perf/arch/powerpc/util/utils_header.h
> 
> -- 
> 1.8.3.1
> 
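
As a rough illustration of the mask idea in the cover letter above (not code from the series), the extended-register mask can be thought of as "all bits up to the last extended register, minus the base register bits". The register indices, `regs_below()`, `extended_mask()` and `sample_mask_ok()` below are hypothetical names chosen for this sketch; the real indices live in the powerpc perf_regs.h header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices, for illustration only. */
enum {
	REG_BASE_MAX = 45,		/* first index after the base registers */
	REG_MMCR0 = REG_BASE_MAX,
	REG_MMCR1,
	REG_MMCR2,
	REG_EXT_MAX			/* first index after the extended registers */
};

/* Bit mask with one bit set for every register index below max. */
uint64_t regs_below(unsigned int max)
{
	return max >= 64 ? ~0ULL : (1ULL << max) - 1;
}

/* Mask covering only the extended registers [base_max, ext_max). */
uint64_t extended_mask(unsigned int base_max, unsigned int ext_max)
{
	assert(base_max <= ext_max && ext_max <= 64);
	return regs_below(ext_max) & ~regs_below(base_max);
}

/*
 * A requested sample mask is acceptable when it asks for nothing beyond the
 * base registers plus the extended registers the PMU supports. A PMU without
 * the extended-regs capability passes pmu_ext_mask == 0.
 */
int sample_mask_ok(uint64_t requested, unsigned int base_max,
		   uint64_t pmu_ext_mask)
{
	return (requested & ~(regs_below(base_max) | pmu_ext_mask)) == 0;
}
```

With these assumed indices, `extended_mask(REG_BASE_MAX, REG_EXT_MAX)` selects exactly the three MMCR bits, and a request for MMCR1 is rejected on a PMU whose extended mask is 0.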



[powerpc:merge] BUILD SUCCESS 10a81441d89aa02486b3e710aa4761cb1cfcaf46

2020-07-30 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 10a81441d89aa02486b3e710aa4761cb1cfcaf46  Automatic merge of 'master', 'next' and 'fixes' (2020-07-28 13:16)

elapsed time: 3105m

configs tested: 60
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm       defconfig
arm64     allyesconfig
arm64     defconfig
arm       allyesconfig
arm       allmodconfig
ia64      allmodconfig
ia64      defconfig
ia64      allyesconfig
m68k      allmodconfig
m68k      defconfig
m68k      allyesconfig
nios2     defconfig
arc       allyesconfig
nds32     allnoconfig
c6x       allyesconfig
nds32     defconfig
nios2     allyesconfig
csky      defconfig
alpha     defconfig
alpha     allyesconfig
xtensa    allyesconfig
h8300     allyesconfig
arc       defconfig
sh        allmodconfig
parisc    defconfig
s390      allyesconfig
parisc    allyesconfig
s390      defconfig
i386      allyesconfig
sparc     allyesconfig
sparc     defconfig
i386      defconfig
mips      allyesconfig
mips      allmodconfig
powerpc   defconfig
powerpc   allyesconfig
powerpc   allmodconfig
powerpc   allnoconfig
i386      randconfig-a003-20200728
i386      randconfig-a004-20200728
i386      randconfig-a005-20200728
i386      randconfig-a002-20200728
i386      randconfig-a006-20200728
i386      randconfig-a001-20200728
i386      randconfig-a016-20200728
i386      randconfig-a012-20200728
i386      randconfig-a013-20200728
i386      randconfig-a014-20200728
i386      randconfig-a011-20200728
i386      randconfig-a015-20200728
riscv     allyesconfig
riscv     allnoconfig
riscv     defconfig
riscv     allmodconfig
x86_64    rhel
x86_64    allyesconfig
x86_64    rhel-7.6-kselftests
x86_64    rhel-8.3
x86_64    defconfig
x86_64    kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [RESEND PATCH v5 00/11] ppc64: enable kdump support for kexec_file_load syscall

2020-07-30 Thread Hari Bathini




On 28/07/20 8:02 am, piliu wrote:



On 07/27/2020 03:36 AM, Hari Bathini wrote:

Sorry! There was a gateway issue on my system while posting v5, due to
which some patches did not make it through. Resending...

This patch series enables kdump support for kexec_file_load system
call (kexec -s -p) on PPC64. The changes are inspired by kexec-tools
code but heavily modified for kernel consumption.

The first patch adds a weak arch_kexec_locate_mem_hole() function to
override locate memory hole logic suiting arch needs. There are some
special regions in ppc64 which should be avoided while loading buffer
& there are multiple callers to kexec_add_buffer making it complicated
to maintain range sanity and using generic lookup at the same time.

The second patch marks ppc64 specific code within arch/powerpc/kexec
and arch/powerpc/purgatory to make the subsequent code changes easy
to understand.

The next patch adds a helper function to set up the different memory ranges
needed for loading kdump kernel, booting into it and exporting the
crashing kernel's elfcore.

The fourth patch overrides arch_kexec_locate_mem_hole() function to
locate memory hole for kdump segments by accounting for the special
memory regions, referred to as excluded memory ranges, and sets
kbuf->mem when a suitable memory region is found.
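
The idea behind the fourth patch can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel implementation: `struct range`, `locate_mem_hole()` and the top-down search order are hypothetical choices for this sketch, and the excluded ranges are assumed to be sorted by start address and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive address range [start, end]. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Try to place size bytes, aligned to align, as high as possible in [lo, hi]. */
static int fit_in_gap(uint64_t lo, uint64_t hi, uint64_t size, uint64_t align,
		      uint64_t *mem)
{
	uint64_t start;

	if (hi < lo || hi - lo < size - 1)
		return -1;			/* gap is too small */
	start = (hi - size + 1) & ~(align - 1);	/* highest aligned start */
	if (start < lo)
		return -1;			/* alignment pushed it out of the gap */
	*mem = start;
	return 0;
}

/*
 * Search [buf_min, buf_max] top-down for a hole of size bytes that avoids
 * every range in excl[]. Returns 0 and stores the start address in *mem,
 * or -1 when no suitable hole exists.
 */
int locate_mem_hole(uint64_t buf_min, uint64_t buf_max, uint64_t size,
		    uint64_t align, const struct range *excl, size_t nr_excl,
		    uint64_t *mem)
{
	uint64_t hi = buf_max;
	size_t i;

	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */
	if (size == 0 || buf_min > buf_max)
		return -1;

	for (i = nr_excl; i > 0; i--) {
		const struct range *r = &excl[i - 1];

		if (r->start > hi)
			continue;	/* entirely above the current window */
		if (r->end < buf_min)
			break;		/* this and all lower ranges are below it */
		/* Gap between the top of this excluded range and hi. */
		if (r->end < hi &&
		    fit_in_gap(r->end + 1, hi, size, align, mem) == 0)
			return 0;
		if (r->start <= buf_min)
			return -1;	/* nothing usable is left below */
		hi = r->start - 1;
	}
	return fit_in_gap(buf_min, hi, size, align, mem);
}
```

On success the caller would store the result where the series stores it, in kbuf->mem. Searching top-down is a choice made for this sketch only.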

The fifth patch moves walk_drmem_lmbs() out of .init section with
a few changes to reuse it for setting up kdump kernel's usable memory
ranges. The next patch uses walk_drmem_lmbs() to look up the LMBs
and set linux,drconf-usable-memory & linux,usable-memory properties
in order to restrict kdump kernel's memory usage.

The seventh patch updates purgatory to set up r8 & r9 with opal base
and opal entry addresses respectively to aid kernels built with
CONFIG_PPC_EARLY_DEBUG_OPAL enabled. The next patch sets up the backup
region as a kexec segment while loading kdump kernel and teaches
purgatory to copy data from source to destination.

Patch 09 builds the elfcore header for the running kernel & passes
the info to kdump kernel via "elfcorehdr=" parameter to export as
/proc/vmcore file. The next patch sets up the memory reserve map
for the kexec kernel and also claims kdump support for kdump as
all the necessary changes are added.

The last patch fixes a lookup issue for `kexec -l -s` case when
memory is reserved for crashkernel.

Tested the changes successfully on P8 and P9 LPARs, a couple of OpenPOWER
boxes (one with secureboot enabled), a KVM guest and a simulator.

v4 -> v5:
* Dropped patches 07/12 & 08/12 and updated purgatory to do everything
   in assembly.


Hello Pingfan,

Sorry, I missed out on responding to this.



I guess you achieve this by carefully selecting instructions to avoid
relocation issues, right?


Yes. No far branching or reference to data from elsewhere.

Thanks
Hari


[PATCH v3 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-30 Thread Vaibhav Jain
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS hcall.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by the
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported, then papr_pdsm_fuel_gauge() allocates an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats(), which issues the HCALL to PHYP. The returned value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with the extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'.
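
For illustration, a consumer of the payload would be expected to check the extension flag before trusting the new field. The sketch below is a userspace approximation under stated assumptions: `struct health_payload` is a simplified stand-in and NOT the real layout of 'struct nd_papr_pdsm_health', and `fuel_gauge_or_unknown()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value as defined by this patch in papr_pdsm.h. */
#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1

/* Simplified stand-in for the health payload; not the real ABI layout. */
struct health_payload {
	uint32_t extension_flags;
	uint16_t dimm_health;
	uint16_t dimm_fuel_gauge;	/* only meaningful when the flag is set */
};

/*
 * Return the life remaining (0-100), or -1 when the kernel did not
 * populate the field, e.g. because performance stats are unsupported.
 */
int fuel_gauge_or_unknown(const struct health_payload *h)
{
	assert(h != NULL);
	if (!(h->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID))
		return -1;
	return h->dimm_fuel_gauge;
}
```

Gating on the flag is what lets an older consumer, or a kernel that cannot fetch the stat, coexist with the extended payload without breaking the ABI.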

Signed-off-by: Vaibhav Jain 
---
Changelog:

v3:
* Updated papr_pdsm_fuel_gauge() to use the updated
  drc_pmem_query_stats() function.

Resend:
None

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_pdsm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 
 arch/powerpc/platforms/pseries/papr_scm.c | 51 ++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL  2
 #define PAPR_PDSM_DIMM_FATAL 3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked : Contents of the dimm cant be modified until CEC reboot
  * dimm_encrypted  : Contents of dimm are encrypted.
  * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+   __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 29cab86141d8..837a21083268 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+				union nd_pdsm_payload *payload)
+{
+	int rc, size;
+	u64 statval;
+	struct papr_scm_perf_stat *stat;
+	struct papr_scm_perf_stats *stats;
+
+	/* Silently fail if fetching performance metrics isn't supported */
+	if (!p->stat_buffer_len)
+		return 0;
+
+	/* Allocate request buffer enough to hold single performance stat */
+	size = sizeof(struct papr_scm_perf_stats) +
+		sizeof(struct papr_scm_perf_stat);
+
+	stats = kzalloc(size, GFP_KERNEL);
+	if (!stats)
+		return -ENOMEM;
+
+	stat = &stats->scm_statistic[0];
+	memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
+	stat->stat_val = 0;
+
+	/* Fetch the fuel gauge and populate it in payload */
+	rc = drc_pmem_query_stats(p, stats, 1);
+	if (rc < 0) {
+		dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+		goto free_stats;
+	}
+
+	statval = be64_to_cpu(stat->stat_val);
+	dev_dbg(&p->pdev->dev,
+		"Fetched fuel-gauge %llu", statval);
+	payload->health.extension_flags |=
+		PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+	payload->health.dimm_fuel_gauge = statval;
+
+	rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+	kfree(stats);
+	return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -558,6 +603,10 @@ static int papr_pdsm_health(struct 

Re: [PATCH 1/2] spi: mpc512x-psc: Use the framework .set_cs()

2020-07-30 Thread Mark Brown
On Wed, Jul 29, 2020 at 11:48:16PM +0200, Linus Walleij wrote:
> The mpc512x-psc is rolling its own chip select control code,
> but the SPI master framework can handle this. It was also
> evaluating the CS status for each transfer but the CS change
> should be per-message not per-transfer.

No, CS change is per transfer.




Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10

2020-07-30 Thread Michael Ellerman
Segher Boessenkool  writes:
> On Wed, Jul 29, 2020 at 03:44:56PM -0400, Vladis Dronov wrote:
>> > > Certain warnings are emitted for powerpc code when building with a gcc-10
>> > > toolset:
>> > > 
>> > > WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch in
>> > > reference from the function remove_pmd_table() to the function
>> > > .meminit.text:split_kernel_mapping()
>> > > The function remove_pmd_table() references
>> > > the function __meminit split_kernel_mapping().
>> > > This is often because remove_pmd_table lacks a __meminit
>> > > annotation or the annotation of split_kernel_mapping is wrong.
>> > > 
>> > > Add the appropriate __init and __meminit annotations to make modpost not
>> > > complain. In all the cases there are just a single callsite from another
>> > > __init or __meminit function:
>> > > 
>> > > __meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
>> > > __init prom_init() -> setup_secure_guest()
>> > > __init xive_spapr_init() -> xive_spapr_disabled()
>> > 
>> > So what changed?  These functions were inlined with older compilers, but
>> > not anymore?
>> 
>> Yes, exactly. Gcc-10 does not inline them anymore. If this is because of my
>> build system, this can happen to others also.
>> 
>> The same thing was fixed by Linus in e99332e7b4cd ("gcc-10: mark more
>> functions __init to avoid section mismatch warnings").
>
> It sounds like this is part of "-finline-functions was retuned" on
> ?  So everyone should see it
> (no matter what config or build system), and it is a good thing too :-)

I haven't seen it in my GCC 10 builds, so there must be some other
subtlety. Probably it depends on details of the .config.

cheers


[PATCH v3 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-07-30 Thread Vaibhav Jain
Update papr_scm.c to query dimm performance statistics from PHYP via
the H_SCM_PERFORMANCE_STATS hcall and export them to user-space as the PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provides sysfs ABI documentation for the stats being reported and
their meanings.

During NVDIMM probe time in papr_scm_nvdimm_init(), a special variant
of the H_SCM_PERFORMANCE_STATS hcall is issued to check whether collection of
performance statistics is supported. If successful, PHYP
returns the maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses the value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally, the statistics reported in the
buffer are formatted into the sysfs access function's output buffer.
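
The request-buffer layout described above can be mirrored in userspace to show the size arithmetic and the big-endian handling. This is a sketch under assumptions rather than the kernel code: the structs use byte arrays in place of the kernel's __be32/__be64 types, `alloc_stats_request()` and `stat_value()` are hypothetical helper names, and pre-filling the header with the eye catcher and version is assumed here, not shown in the excerpt.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One statistic: 8-byte identifier followed by a big-endian 64-bit value. */
struct perf_stat {
	uint8_t stat_id[8];
	uint8_t stat_val[8];
};

/* Request/response header followed by num_statistics entries. */
struct perf_stats {
	uint8_t eye_catcher[8];
	uint8_t stats_version[4];	/* big-endian */
	uint8_t num_statistics[4];	/* big-endian */
	struct perf_stat scm_statistic[];
};

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = v >> 24;
	p[1] = v >> 16;
	p[2] = v >> 8;
	p[3] = v;
}

/* Decode a statistic value from big-endian to host order. */
uint64_t stat_value(const struct perf_stat *st)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | st->stat_val[i];
	return v;
}

/* Allocate a zeroed request sized for num_stats entries; caller frees it. */
struct perf_stats *alloc_stats_request(uint32_t num_stats)
{
	size_t size = sizeof(struct perf_stats) +
		      (size_t)num_stats * sizeof(struct perf_stat);
	struct perf_stats *s = calloc(1, size);

	if (!s)
		return NULL;
	memcpy(s->eye_catcher, "SCMSTATS", sizeof(s->eye_catcher));
	put_be32(s->stats_version, 0x1);
	put_be32(s->num_statistics, num_stats);
	return s;
}
```

A request for a single stat therefore needs one header plus one entry, which matches the size computed in papr_pdsm_fuel_gauge() in patch 2/2.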

Signed-off-by: Vaibhav Jain 
---
Changelog:

v3:
* Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
  args to the function. Instead 'buff_size' is calculated from
  'num_stats' and instead of populating 'R4' in arg 'out' the value is
  returned from the function in case 'R4' represents
  'max-buffer-size'. [ Aneesh ]

Resend:
None

v2:
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from big endian to cpu endian happens later rather than
just after it's fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructured some code to handle error case first [ Ira ]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
 arch/powerpc/platforms/pseries/papr_scm.c | 150 ++
 2 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
  NVDIMM have been scrubbed.
* "locked"  : Indicating that NVDIMM contents cant
  be modified until next power cycle.
+
+What:  /sys/bus/nd/devices/nmemX/papr/perf_stats
+Date:  May, 2020
+KernelVersion: v5.9
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:
+   (RO) Report various performance stats related to papr-scm NVDIMM
+   device.  Each stat is reported on a new line with each line
+   composed of a stat-identifier followed by it value. Below are
+   currently known dimm performance stats which are reported:
+
+   * "CtlResCt" : Controller Reset Count
+   * "CtlResTm" : Controller Reset Elapsed Time
+   * "PonSecs " : Power-on Seconds
+   * "MemLife " : Life Remaining
+   * "CritRscU" : Critical Resource Utilization
+   * "HostLCnt" : Host Load Count
+   * "HostSCnt" : Host Store Count
+   * "HostSDur" : Host Store Duration
+   * "HostLDur" : Host Load Duration
+   * "MedRCnt " : Media Read Count
+   * "MedWCnt " : Media Write Count
+   * "MedRDur " : Media Read Duration
+   * "MedWDur " : Media Write Duration
+   * "CchRHCnt" : Cache Read Hit Count
+   * "CchWHCnt" : Cache Write Hit Count
+   * "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 8fd441d32487..29cab86141d8 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -64,6 +64,26 @@
PAPR_PMEM_HEALTH_FATAL |\
PAPR_PMEM_HEALTH_UNHEALTHY)
 
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+   u8 stat_id[8];
+   __be64 stat_val;
+} __packed;
+
+/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
+struct papr_scm_perf_stats {
+   u8 eye_catcher[8];
+   /* Should be PAPR_SCM_PERF_STATS_VERSION */
+   __be32 stats_version;
+   /* Number of stats following */
+   __be32 num_statistics;

question about work on CMA integration into DMA

2020-07-30 Thread Maksym Kokhan
Hello!

I am working on a driver that needs to allocate a big contiguous
memory block (~10 MB) and has to work on multiple platforms (x86, arm,
arm64, mips, powerpc). CMA is the most appropriate approach in this case,
but I have hit an unexpected problem: the CMA
subsystem is not integrated into the DMA subsystem for powerpc,
so I cannot request memory from the CMA area from my kernel module.
The question is: is there any work in progress on CMA to DMA
integration?  Or has it been decided not to do such work at all?
Also, is there any legitimate way to allocate a big contiguous memory block
from a kernel module on powerpc?

Thanks for your help,
Max


[PATCH v3 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric

2020-07-30 Thread Vaibhav Jain
Changes since v2[1]:

* Updated drc_pmem_query_stats() to reduce the number of input args
  to the function based on suggestions from Aneesh.

[1] https://lore.kernel.org/linux-nvdimm/20200726122030.31529-1-vaib...@linux.ibm.com
---

This small patchset implements kernel-side support for reporting the
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL-side changes, the output
should look like:

$ sudo ndctl list -DH
[
  {
    "dev":"nmem0",
    "health":{
      "health_state":"ok",
      "life_used_percentage":0,
      "shutdown_state":"clean"
    }
  }
]

PHYP supports the H_SCM_PERFORMANCE_STATS hcall, through which an LPAR can
fetch various performance stats including the 'fuel_gauge' percentage for
an NVDIMM. The 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as a percentage, and 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
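
The conversion itself is a one-liner; a minimal sketch, assuming the 0-100 range documented for 'dimm_fuel_gauge' in patch 2/2 (the function name `life_used_percentage()` is hypothetical and only mirrors the NDCTL output key):

```c
#include <assert.h>

/* Convert the fuel gauge (life remaining, 0-100) into life used (0-100). */
int life_used_percentage(unsigned int fuel_gauge)
{
	assert(fuel_gauge <= 100);	/* documented range of the metric */
	return 100 - (int)fuel_gauge;
}
```

A brand-new DIMM reporting a fuel gauge of 100 thus shows up as "life_used_percentage":0, as in the sample output above.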

Structure of the patchset
=========================
The first patch implements the scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch the performance stats
catalogue. The patch also implements support for the 'perf_stats' sysfs
attribute to report the full catalogue of performance stats supported
by PHYP.

The second and final patch implements support for sending the fuel-gauge
value to libndctl by extending the PAPR_PDSM_HEALTH pdsm payload with a new
field named 'dimm_fuel_gauge'.

Vaibhav Jain (2):
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h |   9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 199 ++
 3 files changed, 235 insertions(+)

-- 
2.26.2



Documentation/powerpc: Ultravisor API

2020-07-30 Thread Julia Lawall
The file Documentation/powerpc/ultravisor.rst contains:

Only valid value(s) in ``flags`` are:

* H_PAGE_IN_SHARED which indicates that the page is to be shared
  with the Ultravisor.

* H_PAGE_IN_NONSHARED indicates that the UV is not anymore
  interested in the page. Applicable if the page is a shared page.

The flag H_PAGE_IN_SHARED exists in the Linux kernel
(arch/powerpc/include/asm/hvcall.h), but the flag H_PAGE_IN_NONSHARED does
not.  Should the documentation be changed in some way?

julia


Re: [PATCH 04/15] arm64: numa: simplify dummy_numa_init()

2020-07-30 Thread Catalin Marinas
On Tue, Jul 28, 2020 at 08:11:42AM +0300, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> dummy_numa_init() loops over memblock.memory and passes nid=0 to
> numa_add_memblk() which essentially wraps memblock_set_node(). However,
> memblock_set_node() can cope with entire memory span itself, so the loop
> over memblock.memory regions is redundant.
> 
> Replace the loop with a single call to memblock_set_node() to the entire
> memory.
> 
> Signed-off-by: Mike Rapoport 

Acked-by: Catalin Marinas 


Re: [PATCH 06/15] powerpc: fadamp: simplify fadump_reserve_crash_area()

2020-07-30 Thread Michael Ellerman
Mike Rapoport  writes:
> From: Mike Rapoport 
>
> fadump_reserve_crash_area() reserves memory from a specified base address
> till the end of the RAM.
>
> Replace iteration through the memblock.memory with a single call to
> memblock_reserve() with appropriate  that will take care of proper memory
                                     ^
                                     parameters?
> reservation.
>
> Signed-off-by: Mike Rapoport 
> ---
>  arch/powerpc/kernel/fadump.c | 20 +---
>  1 file changed, 1 insertion(+), 19 deletions(-)

I think this looks OK to me, but I don't have a setup to test it easily.
I've added Hari to Cc who might be able to.

But I'll give you an ack in the hope that it works :)

Acked-by: Michael Ellerman 


> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 78ab9a6ee6ac..2446a61e3c25 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -1658,25 +1658,7 @@ int __init fadump_reserve_mem(void)
>  /* Preserve everything above the base address */
>  static void __init fadump_reserve_crash_area(u64 base)
>  {
> - struct memblock_region *reg;
> - u64 mstart, msize;
> -
> - for_each_memblock(memory, reg) {
> - mstart = reg->base;
> - msize  = reg->size;
> -
> - if ((mstart + msize) < base)
> - continue;
> -
> - if (mstart < base) {
> - msize -= (base - mstart);
> - mstart = base;
> - }
> -
> - pr_info("Reserving %lluMB of memory at %#016llx for preserving crash data",
> - (msize >> 20), mstart);
> - memblock_reserve(mstart, msize);
> - }
> + memblock_reserve(base, memblock_end_of_DRAM() - base);
>  }
>  
>  unsigned long __init arch_reserved_kernel_pages(void)
> -- 
> 2.26.2


[PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()

2020-07-30 Thread Ram Pai
Observed the following oops while stress-testing with multiple
secure VMs on a distro kernel. However, this issue theoretically exists in
the 5.5 kernel and later.

This issue occurs when the total number of requested device-PFNs exceeds
the total number of available device-PFNs.  PFN migration fails to
allocate a device-pfn, which causes migrate_vma_finalize() to trigger
kvmppc_uvmem_page_free() on a page that is not associated with any
device-pfn.  kvmppc_uvmem_page_free() blindly tries to access the
contents of the private data, which can be NULL, leading to the following
kernel fault.

 --
 Unable to handle kernel paging request for data at address 0x0011
 Faulting instruction address: 0xc0080e36e110
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE SMP NR_CPUS=2048 NUMA PowerNV

 MSR:  9280b033 
 CR: 24424822  XER: 
 CFAR: c0e3d764 DAR: 0011 DSISR: 4000 IRQMASK: 0
 GPR00: c0080e36e0a4 c01f1d59f610 c0080e38a400 
 GPR04: c01fa500 fffe  c000201fffeaf300
 GPR08: 01f0  0f80 c0080e373608
 GPR12: c0e3d710 c000201fffeaf300 0001 7fef8736
 GPR16: 7fff97db4410 c000201c3b66a578  
 GPR20: 000119db9ad0 000a fffc 0001
 GPR24: c000201c3b66 c01f1d59f7a0 c04cffb0 0001
 GPR28:  c00a001ff003e000 c0080e386150 0f80
 NIP [c0080e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
 LR [c0080e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
 Call Trace:
 [c0512010] free_devmap_managed_page+0xd0/0x100
 [c03f71d0] put_devmap_managed_page+0xa0/0xc0
 [c04d24bc] migrate_vma_finalize+0x32c/0x410
 [c0080e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
 [c0080e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
 [c0080e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
 [c0080e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
 [c0080e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
 [c0080e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
 [c0080e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
 [c0080e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
 [c05451d0] do_vfs_ioctl+0xe0/0xaa0
 [c0545d64] sys_ioctl+0xc4/0x160
 [c000b408] system_call+0x5c/0x70
 Instruction dump:
 a12d1174 2f89 409e0158 a1271172 3929 b1271172 7c2004ac 3920
 913e0140 3920 e87d0010 f93d0010 <89230011> e8c3 e9030008 2f89
 --

 Fix the oops by returning early from kvmppc_uvmem_page_free() when the
 page has no private data.

Fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
Signed-off-by: Ram Pai 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 2806983..f4002bf 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
 {
unsigned long pfn = page_to_pfn(page) -
(kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
-   struct kvmppc_uvmem_page_pvt *pvt;
+   struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
+
+   if (!pvt)
+   return;
 
    spin_lock(&kvmppc_uvmem_bitmap_lock);
    bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
    spin_unlock(&kvmppc_uvmem_bitmap_lock);
 
-   pvt = page->zone_device_data;
page->zone_device_data = NULL;
if (pvt->remove_gfn)
kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
-- 
1.8.3.1



Re: OF: Can't handle multiple dma-ranges with different offsets

2020-07-30 Thread Chris Packham

On 23/07/20 10:11 am, Chris Packham wrote:
>
> On 22/07/20 4:19 pm, Chris Packham wrote:
>> Hi,
>>
>> I've just fired up linux kernel v5.7 on a p2040 based system and I'm 
>> getting the following new warning
>>
>> OF: Can't handle multiple dma-ranges with different offsets on 
>> node(/pcie@ffe202000)
>> OF: Can't handle multiple dma-ranges with different offsets on 
>> node(/pcie@ffe202000)
>>
>> The warning itself was added in commit 9d55bebd9816 ("of/address: 
>> Support multiple 'dma-ranges' entries") but I gather it's pointing 
>> out something about the dts. My boards dts is based heavily on 
>> p2041rdb.dts and the relevant pci2 section is identical (reproduced 
>> below for reference).
>>
>>     pci2: pcie@ffe202000 {
>>         reg = <0xf 0xfe202000 0 0x1000>;
>>         ranges = <0x0200 0 0xe000 0xc 0x4000 0 0x2000
>>               0x0100 0 0x 0xf 0xf802 0 0x0001>;
>>         pcie@0 {
>>             ranges = <0x0200 0 0xe000
>>                   0x0200 0 0xe000
>>                   0 0x2000
>>
>>                   0x0100 0 0x
>>                   0x0100 0 0x
>>                   0 0x0001>;
>>         };
>>     };
>>
>> I haven't noticed any ill effect (aside from the scary message). I'm 
>> not sure if there's something missing in the dts or in the code that 
>> checks the ranges. Any guidance would be appreciated.
>
> I've also just checked the T2080RDB on v5.7.9 which shows a similar issue
>
> OF: Can't handle multiple dma-ranges with different offsets on 
> node(/pcie@ffe25)
> OF: Can't handle multiple dma-ranges with different offsets on 
> node(/pcie@ffe25)
> pcieport :00:00.0: Invalid size 0xf9 for dma-range
> pcieport :00:00.0: AER: enabled with IRQ 21
> OF: Can't handle multiple dma-ranges with different offsets on 
> node(/pcie@ffe27)
> OF: Can't handle multiple dma-ranges with different offsets on 
> node(/pcie@ffe27)
> pcieport 0001:00:00.0: Invalid size 0xf9 for dma-range
> pcieport 0001:00:00.0: AER: enabled with IRQ 23

I've been doing a bit more digging. The dma-ranges property is not in 
the dts/dtb. It's actually inserted by u-boot via ft_fsl_pci_setup().

Here's some output from my T2080RDB

[root@linuxbox ~]# xxd -g4 /sys/firmware/devicetree/base/pcie@ffe24/dma-ranges
000: 0200  df07 000f  
010: fe00  00f9 4200  B...
020:      
030:  df07 4300 0010  C...
040:    0001  
050:  

I'm still wondering how best to deal with this. Hopefully without 
needing to deploy a u-boot update.


Re: [PATCH v2] powerpc/vio: drop bus_type from parent device

2020-07-30 Thread Michael Ellerman
Greg KH  writes:
> On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote:
>> [ Added Peter & Greg to Cc ]
>> 
>> Thadeu Lima de Souza Cascardo  writes:
>> > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
>> > code if writing /sys/.../uevent fails") started returning failure when
>> > writing to /sys/devices/vio/uevent.
>> >
>> > This causes an early udevadm trigger to fail. On some installer versions of
>> > Ubuntu, this will cause init to exit, thus panicking the system very early
>> > during boot.
>> >
>> > Removing the bus_type from the parent device will remove some of the extra
>> > empty files from /sys/devices/vio/, but will keep the rest of the layout
>> > for vio devices, keeping them under /sys/devices/vio/.
>> 
>> What exactly does it change?
>> 
>> I'm finding it hard to evaluate if this change is going to cause a
>> regression somehow.
>> 
>> I'm also not clear on why removing the bus type is correct, apart from
>> whether it fixes the bug you're seeing.
>> 
>> > It has been tested that uevents for vio devices don't change after this
>> > fix, they still contain MODALIAS.
>> >
>> > Signed-off-by: Thadeu Lima de Souza Cascardo 
>> > Fixes: df44b479654f ("kobject: return error code if writing 
>> > /sys/.../uevent fails")
>> 
>> AFAICS there haven't been any other fixes for that commit. Do we know
>> why it is only vio that was affected? (possibly because it's a fake bus
>> to begin with?)
>
> So there was an error previously, the core was ignoring it, and now it
> isn't and to fix that you want to remove describing what bus a device is
> on?
>
> Huh???

Right.

Not to mention there are existing unfixed kernels out there, so whatever
userspace is crashing will need to be fixed for those anyway.

>> > diff --git a/arch/powerpc/platforms/pseries/vio.c 
>> > b/arch/powerpc/platforms/pseries/vio.c
>> > index 37f1f25ba804..a94dab3972a0 100644
>> > --- a/arch/powerpc/platforms/pseries/vio.c
>> > +++ b/arch/powerpc/platforms/pseries/vio.c
>> > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake 
>> > "parent" device */
>> >.name = "vio",
>> >.type = "",
>> >.dev.init_name = "vio",
>> > -  .dev.bus = &vio_bus_type,
>> >  };
>
> Wait, a static 'struct device'?  You all are playing with fire there.
> That's a reference counted object, and should never be declared like
> that at all.

Since 2005 :)

AC33c9bcf1 ("[PATCH] ppc64: tidy up vio devices fake parent")


> I see you register it, but never unregister it, why?  Why is it even
> needed?

I don't remember, if I ever knew.

The code says:

/*
 * The fake parent of all vio devices, just to give us
 * a nice directory
 */
err = device_register(&vio_bus_device.dev);


But I suspect that may no longer be true.

ie. the devices show up in /sys/bus/vio/devices because they have
dev.bus = vio_bus_type, the fake parent doesn't seem to determine the
location.

> And if you remove the bus type of it, it will show up in a different
> part of sysfs, so I think this patch will show a user-visible change,
> right?

Yes I think so. But because it's a fake device to begin with that's
possibly OK.

I think we really need to get to the bottom of whether we need that
device at all, it seems like it might be left over cruft from the
ancient past.

I'll try and find time to work it out.

cheers


Re: [PATCH 1/2 v2] powerpc/dma: Define map/unmap mmio resource callbacks

2020-07-30 Thread Oliver O'Halloran
On Thu, Apr 30, 2020 at 11:15 PM Max Gurtovoy  wrote:
>
> Define the map_resource/unmap_resource callbacks for the dma_iommu_ops
> used by several powerpc platforms. The map_resource callback is called
> when trying to map a mmio resource through the dma_map_resource()
> driver API.
>
> For now, the callback returns an invalid address for devices using
> translations, but will "direct" map the resource when in bypass
> mode. Previous behavior for dma_map_resource() was to always return an
> invalid address.
>
> We also call an optional platform-specific controller op in
> case some setup is needed for the platform.

Hey Max,

Sorry for not getting to this sooner. Fred has been dutifully nagging
me to look at it, but people are constantly throwing stuff at me so
it's slipped through the cracks.

Anyway, the changes here are fine IMO. The only real suggestion I have
is that we might want to move the direct / bypass mode check out of
the arch/powerpc/kernel/dma-iommu.c and into the PHB specific function
in pci_controller_ops. I don't see any real reason p2p support should
be limited to devices using bypass mode since the data path is the
same for translated and untranslated DMAs. We do need to impose that
restriction for OPAL / PowerNV IODA PHBs because the implementation of
opal_pci_set_p2p() has the side effect of forcing the TVE into
no-translate mode. However, that's a platform issue so the restriction
should be imposed in platform code.

I'd like to fix that, but I'd prefer to do it as a follow up change
since I need to have a think about how to fix the firmware bits.

Reviewed-by: Oliver O'Halloran 


Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-30 Thread Valentin Schneider


(+Cc Morten)

On 29/07/20 07:13, Srikar Dronamraju wrote:
> * Valentin Schneider  [2020-07-28 16:03:11]:
>
> Hi Valentin,
>
> Thanks for looking into the patches.
>
>> On 27/07/20 06:32, Srikar Dronamraju wrote:
>> > Add percpu coregroup maps and masks to create coregroup domain.
>> > If a coregroup doesn't exist, the coregroup domain will be degenerated
>> > in favour of SMT/CACHE domain.
>> >
>>
>> So there's at least one arm64 platform out there with the same "pairs of
>> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
>> default scheduler topology (SMT/MC/DIE). Each pair of cores gets its MC
>> domain, and the whole system is covered by DIE.
>>
>> Now arguably it's not a perfect representation; DIE doesn't have
>> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
>> will impact all callsites using cpus_share_cache(): in the eMAG case, only
>> pairs of cores will be seen as sharing cache, even though *all* cores share
>> the same L3.
>>
>
> Okay, it's good to know that we have a chip which is similar to P9 in
> topology.
>
>> I'm trying to paint a picture of what the P9 topology looks like (the one
>> you showcase in your cover letter) to see if there are any similarities;
>> from what I gather in [1], wikichips and your cover letter, with P9 you can
>> have something like this in a single DIE (somewhat unsure about L3 setup;
>> it looks to be distributed?)
>>
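>>  +---------------------------------------------------------------------+
>>  |                                  L3                                 |
>>  +---------------+-+---------------+-+---------------+-+---------------+
>>  |       L2      | |       L2      | |       L2      | |       L2      |
>>  +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
>>  |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
>>  +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>>  |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
>>  +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+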
>>
>> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>>
>> NUMA     [                                                   ...
>> DIE      [                                             ]
>> MC       [         ] [         ] [         ] [         ]
>> BIGCORE  [         ] [         ] [         ] [         ]
>> SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
>>          00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31
>>
>
> What you have summed up is exactly what a P9 topology looks like. I don't
> think I could have explained it better than this.
>

Yay!

>> This however has MC == BIGCORE; what makes it so you can have different spans
>> for these two domains? If it's not too much to ask, I'd love to have a P9
>> topology diagram.
>>
>> [1]: 20200722081822.gg9...@linux.vnet.ibm.com
>
> At this time the current topology would be good enough, i.e. BIGCORE would
> always be equal to a MC. However, in future we could have chips that have
> fewer or more CPUs in the llc than in a BIGCORE, or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>

Right, that one's fair enough.

> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>

Did you test this? WRT load balance we do try to balance "load" over the
different domain spans, so if you represent quads as their own MC domain,
you would AFAICT end up spreading tasks over the quads (rather than packing
them) when balancing at e.g. DIE level. The desired behaviour might be
hackable with some more ASYM_PACKING, but I'm not sure I should be
suggesting that :-)

> I have found that some latency-sensitive benchmarks benefit from
> having a grouping at quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>

IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
binding thing.

That's also where things get interesting (for me) because I experienced
something similar on another arm64 platform (ThunderX1). This was more
about cache bandwidth than cache latency, but IMO it's in the same bag of
fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
gave up on it given the TX1 was the only (arm64) platform where I could get
both significant and reproducible results.

Now, if you folks are seeing this on completely different hardware and have
"real" workloads that truly 

[powerpc:fixes-test] BUILD SUCCESS 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74

2020-07-30 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
fixes-test
branch HEAD: 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74  powerpc/64s/hash: Fix 
hash_preload running with interrupts enabled

elapsed time: 4429m

configs tested: 102
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
sh r7785rp_defconfig
mips tb0226_defconfig
mips loongson3_defconfig
um kunit_defconfig
nds32 alldefconfig
arm imx_v4_v5_defconfig
mips gcw0_defconfig
mips fuloong2e_defconfig
arm pxa255-idp_defconfig
s390 defconfig
arm prima2_defconfig
arm footbridge_defconfig
mips nlm_xlr_defconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k defconfig
m68k allmodconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
nios2 allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
powerpc defconfig
x86_64   randconfig-a005-20200727
x86_64   randconfig-a004-20200727
x86_64   randconfig-a003-20200727
x86_64   randconfig-a006-20200727
x86_64   randconfig-a002-20200727
x86_64   randconfig-a001-20200727
i386 randconfig-a003-20200728
i386 randconfig-a004-20200728
i386 randconfig-a005-20200728
i386 randconfig-a002-20200728
i386 randconfig-a006-20200728
i386 randconfig-a001-20200728
i386 randconfig-a003-20200727
i386 randconfig-a005-20200727
i386 randconfig-a004-20200727
i386 randconfig-a006-20200727
i386 randconfig-a002-20200727
i386 randconfig-a001-20200727
x86_64   randconfig-a014-20200728
x86_64   randconfig-a012-20200728
x86_64   randconfig-a015-20200728
x86_64   randconfig-a016-20200728
x86_64   randconfig-a013-20200728
x86_64   randconfig-a011-20200728
i386 randconfig-a016-20200728
i386 randconfig-a012-20200728
i386 randconfig-a013-20200728
i386 randconfig-a014-20200728
i386 randconfig-a011-20200728
i386 randconfig-a015-20200728
i386 randconfig-a016-20200727
i386 randconfig-a013-20200727
i386 randconfig-a012-20200727
i386 randconfig-a015-20200727
i386 randconfig-a011-20200727
i386 randconfig-a014-20200727
i386 randconfig-a016-20200730
i386 randconfig-a012-20200730
i386 randconfig-a014-20200730
i386 randconfig-a015-20200730
i386 randconfig-a011-20200730
i386 randconfig-a013-20200730
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64  defconfig
x86_64

Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10

2020-07-30 Thread Vladis Dronov
Hello, Michael,

- Original Message -
> From: "Michael Ellerman" 
> Subject: Re: [PATCH] powerpc: fix function annotations to avoid section 
> mismatch warnings with gcc-10
> 
...
> >> > So what changed?  These functions were inlined with older compilers, but
> >> > not anymore?
> >> 
> >> Yes, exactly. Gcc-10 does not inline them anymore. If this is because of
> >> my build system, this can happen to others also.
> >> 
> >> The same thing was fixed by Linus in e99332e7b4cd ("gcc-10: mark more
> >> functions __init to avoid section mismatch warnings").
> >
> > It sounds like this is part of "-finline-functions was retuned" on
> > ?  So everyone should see it
> > (no matter what config or build system), and it is a good thing too :-)
> 
> I haven't seen it in my GCC 10 builds, so there must be some other
> subtlety. Probably it depends on details of the .config.
> 

I've just hit this while building the latest upstream for ppc64le with a
derivative of the RHEL-8 config. It could well be down to a compiler/linker
setting, like -O2 versus -O3.

> cheers

Best regards,
Vladis Dronov | Red Hat, Inc. | The Core Kernel | Senior Software Engineer



Re: [PATCH v2] powerpc/vio: drop bus_type from parent device

2020-07-30 Thread Thadeu Lima de Souza Cascardo
On Thu, Jul 30, 2020 at 07:37:16AM +0200, Greg KH wrote:
> On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote:
> > [ Added Peter & Greg to Cc ]
> > 
> > Thadeu Lima de Souza Cascardo  writes:
> > > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
> > > code if writing /sys/.../uevent fails") started returning failure when
> > > writing to /sys/devices/vio/uevent.
> > >
> > > This causes an early udevadm trigger to fail. On some installer versions 
> > > of
> > > Ubuntu, this will cause init to exit, thus panicking the system very early
> > > during boot.
> > >
> > > Removing the bus_type from the parent device will remove some of the extra
> > > empty files from /sys/devices/vio/, but will keep the rest of the layout
> > > for vio devices, keeping them under /sys/devices/vio/.
> > 
> > What exactly does it change?
> > 
> > I'm finding it hard to evaluate if this change is going to cause a
> > regression somehow.
> > 
> > I'm also not clear on why removing the bus type is correct, apart from
> > whether it fixes the bug you're seeing.
> > 
> > > It has been tested that uevents for vio devices don't change after this
> > > fix, they still contain MODALIAS.
> > >
> > > Signed-off-by: Thadeu Lima de Souza Cascardo 
> > > Fixes: df44b479654f ("kobject: return error code if writing 
> > > /sys/.../uevent fails")
> > 
> > AFAICS there haven't been any other fixes for that commit. Do we know
> > why it is only vio that was affected? (possibly because it's a fake bus
> > to begin with?)
> 
> So there was an error previously, the core was ignoring it, and now it
> isn't and to fix that you want to remove describing what bus a device is
> on?
> 
> Huh???
> 
> > 
> > cheers
> > 
> > > diff --git a/arch/powerpc/platforms/pseries/vio.c 
> > > b/arch/powerpc/platforms/pseries/vio.c
> > > index 37f1f25ba804..a94dab3972a0 100644
> > > --- a/arch/powerpc/platforms/pseries/vio.c
> > > +++ b/arch/powerpc/platforms/pseries/vio.c
> > > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake 
> > > "parent" device */
> > >   .name = "vio",
> > >   .type = "",
> > >   .dev.init_name = "vio",
> > > - .dev.bus = &vio_bus_type,
> > >  };
> 
> Wait, a static 'struct device'?  You all are playing with fire there.
> That's a reference counted object, and should never be declared like
> that at all.
> 
> I see you register it, but never unregister it, why?  Why is it even
> needed?
> 
> And if you remove the bus type of it, it will show up in a different
> part of sysfs, so I think this patch will show a user-visible change,
> right?
> 
> thanks,
> 
> greg k-h

As the comment says, it's a "fake parent device". There is a user-visible
change, which is removing some attributes from the object, but it's still
showing up on the same path.

Returning an error code like df44b479654f does is also a user visible change
and it breaks installer images that panic early on boot.

I could investigate an alternative here, which would be to not fail when
writing to uevent for this specific fake device.

Cascardo.


[powerpc:next] BUILD SUCCESS cf1ae052e073c7ef6cf1a783a6427f7228253bd3

2020-07-30 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next
branch HEAD: cf1ae052e073c7ef6cf1a783a6427f7228253bd3  powerpc/powernv/sriov: 
Remove unused but set variable 'phb'

elapsed time: 1486m

configs tested: 54
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
nios2 allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a016-20200730
i386 randconfig-a012-20200730
i386 randconfig-a014-20200730
i386 randconfig-a015-20200730
i386 randconfig-a011-20200730
i386 randconfig-a013-20200730
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH -next] PCI: rpadlpar: Make some functions static

2020-07-30 Thread Bjorn Helgaas
On Tue, Jul 21, 2020 at 11:17:35PM +0800, Wei Yongjun wrote:
> The sparse tool reports build warnings as follows:
> 
> drivers/pci/hotplug/rpadlpar_core.c:355:5: warning:
>  symbol 'dlpar_remove_pci_slot' was not declared. Should it be static?
> drivers/pci/hotplug/rpadlpar_core.c:461:12: warning:
>  symbol 'rpadlpar_io_init' was not declared. Should it be static?
> drivers/pci/hotplug/rpadlpar_core.c:473:6: warning:
>  symbol 'rpadlpar_io_exit' was not declared. Should it be static?
> 
> Those functions are not used outside of this file, so mark them
> static.
> Also mark rpadlpar_io_exit() as __exit.
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Wei Yongjun 

Applied to pci/hotplug for v5.9, thanks!

> ---
>  drivers/pci/hotplug/rpadlpar_core.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> b/drivers/pci/hotplug/rpadlpar_core.c
> index c5eb509c72f0..f979b7098acf 100644
> --- a/drivers/pci/hotplug/rpadlpar_core.c
> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> @@ -352,7 +352,7 @@ static int dlpar_remove_vio_slot(char *drc_name, struct 
> device_node *dn)
>   * -ENODEV   Not a valid drc_name
>   * -EIO  Internal PCI Error
>   */
> -int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
> +static int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
>  {
>   struct pci_bus *bus;
>   struct slot *slot;
> @@ -458,7 +458,7 @@ static inline int is_dlpar_capable(void)
>   return (int) (rc != RTAS_UNKNOWN_SERVICE);
>  }
>  
> -int __init rpadlpar_io_init(void)
> +static int __init rpadlpar_io_init(void)
>  {
>  
>   if (!is_dlpar_capable()) {
> @@ -470,7 +470,7 @@ int __init rpadlpar_io_init(void)
>   return dlpar_sysfs_init();
>  }
>  
> -void rpadlpar_io_exit(void)
> +static void __exit rpadlpar_io_exit(void)
>  {
>   dlpar_sysfs_exit();
>  }
> 


Re: [PATCH v4 00/10] Coregroup support on Powerpc

2020-07-30 Thread Srikar Dronamraju
* Srikar Dronamraju  [2020-07-27 11:02:20]:

> Changelog v3 ->v4:
> v3: 
> https://lore.kernel.org/lkml/20200723085116.4731-1-sri...@linux.vnet.ibm.com/t/#u
>

Here is a summary of some of the testing done with the coregroup v4 patchset.
It includes ebizzy, schbench, perf bench sched pipe and topology verification.
On the left side are results from the powerpc/next tree and on the right are the
results with the patchset applied.  Topological verification clearly shows that
there is no change in topology with and without the patches on all 3 classes
of systems that were tested.

Left column: powerpc/next. Right column: powerpc/next + Coregroup Support v4 patchset.

Power 9 PowerNV (2 Node/ 160 Cpu System)
----------------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds; higher throughput is better)
                  powerpc/next   + coregroup v4
N                          100              100
Min                     993884           910470
Max                    1276090          1279820
Median                 1173476          1171095
Avg                    1165914          1162091
Stddev               54867.201         67363.28

schbench (latency, hence lower is better)
Latency percentiles (usec)
                  powerpc/next   + coregroup v4
50.0th                     455              454
75.0th                     533              543
90.0th                     683              701
95.0th                     743              737
*99.0th                    815              805
99.5th                     839              835
99.9th                     913              893
min                          0                0
max                       1011             2833

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
                  powerpc/next   + coregroup v4
Total time [sec]         6.083            6.303
usecs/op              6.083576         6.303318
ops/sec                 164377           158646

Power 9 LPAR (2 Node/ 128 Cpu System)
-------------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds; higher throughput is better)
                  powerpc/next   + coregroup v4
N                          100              100
Min                    1058029           943264
Max                    1295393          1287619
Median                 1200414          1180522
Avg                  1188306.7        1168473.2
Stddev               56786.538        64469.955

schbench (latency, hence lower is better)
Latency percentiles (usec)
                  powerpc/next   + coregroup v4
50.0000th                   34               39
75.0000th                   46               52
90.0000th                   53               68
95.0000th                   56               77
*99.0000th                  61               89
99.5000th                   63               94
99.9000th                   81              169
min                          0                0
max                       8405            23674

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
                  powerpc/next   + coregroup v4
Total time [sec]         8.768            5.217
usecs/op              8.768400         5.217625
ops/sec                 114045           191658

Power 8 LPAR (8 Node/ 256 Cpu System)
-------------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds; higher throughput is better)
                  powerpc/next   + coregroup v4
N                          100              100
Min                    1267615          1175357
Max                    1965234          1924262
Median                 1707423          1691104
Avg                  1689137.6        1664792.1
Stddev               144363.29         145876.4

schbench (latency hence 

Re: Documentation/powerpc: Ultravisor API

2020-07-30 Thread Ram Pai
On Thu, Jul 30, 2020 at 12:35:38PM +0200, Julia Lawall wrote:
> The file Documentation/powerpc/ultravisor.rst contains:
> 
> Only valid value(s) in ``flags`` are:
> 
> * H_PAGE_IN_SHARED which indicates that the page is to be shared
> with the Ultravisor.
> 
> * H_PAGE_IN_NONSHARED indicates that the UV is not anymore
>   interested in the page. Applicable if the page is a shared page.
> 
> The flag H_PAGE_IN_SHARED exists in the Linux kernel
> (arch/powerpc/include/asm/hvcall.h), but the flag H_PAGE_IN_NONSHARED does
> not.  Should the documentation be changed in some way?

Currently the code assumes H_PAGE_IN_NONSHARED as !H_PAGE_IN_SHARED.

We need to patch the kernel to explicitly define the flag.
I will submit a patch towards this.

Thanks,
RP
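
To make the current behaviour concrete, here is a small sketch of how "non-shared" falls out as the absence of the shared bit. The value 0x1 for H_PAGE_IN_SHARED is assumed here for illustration, and the helper names are hypothetical; none of this is the patch mentioned above.

```c
#include <stdbool.h>

/* Assumed value for illustration; see arch/powerpc/include/asm/hvcall.h. */
#define H_PAGE_IN_SHARED 0x1UL

/* Hypothetical helper: true when the caller asked for a shared page. */
static bool page_in_is_shared(unsigned long flags)
{
	return (flags & H_PAGE_IN_SHARED) != 0;
}

/*
 * Hypothetical helper mirroring the implicit convention described in the
 * thread: with no explicit NONSHARED flag, non-shared means "not shared".
 */
static bool page_in_is_nonshared(unsigned long flags)
{
	return !page_in_is_shared(flags);
}

/* Hypothetical validity check: reject any bit other than the shared one. */
static bool page_in_flags_valid(unsigned long flags)
{
	return (flags & ~H_PAGE_IN_SHARED) == 0;
}
```

With an implicit convention like this, a flags value of 0 is indistinguishable from an explicit "non-shared" request, which is one reason the documentation and the header can drift apart.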


Re: [PATCH 1/9] powerpc/configs: Drop old symbols from ppc6xx_defconfig

2020-07-30 Thread Michael Ellerman
On Fri, 24 Jul 2020 23:17:20 +1000, Michael Ellerman wrote:
> ppc6xx_defconfig refers to quite a few symbols that no longer exist,
> as reported by scripts/checkkconfigsymbols.py, remove them.

Applied to powerpc/next.

[1/9] powerpc/configs: Drop old symbols from ppc6xx_defconfig
  https://git.kernel.org/powerpc/c/fbb44c9a08ef994109947c5439e649b18ad509ac
[2/9] powerpc/configs: Remove dead symbols
  https://git.kernel.org/powerpc/c/0fcce25b7743d634cc1ddce83382f51333933f76
[3/9] powerpc/52xx: Fix comment about CONFIG_BDI*
  https://git.kernel.org/powerpc/c/8cdcde5f76a42d53a50d1fc9e1fbfc9b90102323
[4/9] powerpc/64e: Drop dead BOOK3E_MMU_TLB_STATS code
  https://git.kernel.org/powerpc/c/07e571ea59eef518730f983f4203651ea413f2cf
[5/9] powerpc/32s: Fix CONFIG_BOOK3S_601 uses
  https://git.kernel.org/powerpc/c/df4d4ef22446b3a789a4efd74d34f2ec1e24deb2
[6/9] powerpc/32s: Remove TAUException wart in traps.c
  https://git.kernel.org/powerpc/c/69eeff022433b54390a359c629f6457d7d1a8e94
[7/9] powerpc/boot: Fix CONFIG_PPC_MPC52XX references
  https://git.kernel.org/powerpc/c/e5eff89657e72a9050d95fde146b54c7dc165981
[8/9] powerpc/kvm: Use correct CONFIG symbol in comment
  https://git.kernel.org/powerpc/c/157dad8678ad910ef7579c3f8ba93cc2940b014b
[9/9] powerpc: Drop old comment about CONFIG_POWER
  https://git.kernel.org/powerpc/c/ee36d867b2fefeb6fb6661b27e62e29c9ca5e7e5

cheers


[PATCHv4 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-07-30 Thread Pingfan Liu
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [  0.0 %] /
   Checking for memory holes : [100.0 %] |  
 Excluding unnecessary pages   : [100.0 %] \
   Copying data  : [  0.3 %] -  
eta: 38s[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 
psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 
psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 
7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   
$CORE_COLLECTOR /proc/vmcore 
$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
  After analysis, it turns out that in the current implementation, when
hot-removing an lmb, the KOBJ_REMOVE event is ejected before the dt is
updated, because __remove_memory() is called before drmem_update_dt().
So in the kdump kernel, when read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install an hpte, it fails with -2 because the
pfn does not exist. Finally, low_hash_fault() raises SIGBUS to the process,
which is observed as the "Bus error" above.

From the viewpoint of listener and publisher, the publisher notifies the
listener before the data is ready.  This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before it is
updated. In the capture kernel, makedumpfile then accesses memory based on
the stale dt info and hits a SIGBUS error due to a non-existent lmb.

* Fix *
This bug was introduced by commit 063b8b1251fd
("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
request"), which tried to combine all the dt updating into one.

The fix must not introduce quadratic runtime complexity through the
following call pattern:
  dlpar_memory_add_by_count
    for_each_drmem_lmb             <--
      dlpar_add_lmb
        drmem_update_dt(_v1|_v2)
          for_each_drmem_lmb       <--
The dt should therefore still be updated only once, just before the last
memory online/offline event is ejected to user space. Achieve this by
tracking the number of lmbs added or removed.
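
The counting idea can be sketched in isolation as follows. This is a userspace mock under stated assumptions, not the patch itself: mock_update_dt(), mock_remove_lmb() and remove_by_count() are hypothetical stand-ins for drmem_update_dt(), dlpar_remove_lmb() and dlpar_memory_remove_by_count().

```c
#include <stdbool.h>

static int dt_updates;               /* times the mock DT was rewritten */
static int events_sent;              /* mock uevents emitted so far */
static bool dt_fresh_at_last_event;  /* was the DT current for the last event? */

/* Stand-in for drmem_update_dt(), which walks every LMB in the real code. */
static void mock_update_dt(void)
{
	dt_updates++;
}

/* Stand-in for removing one LMB: optional DT refresh first, then the event. */
static void mock_remove_lmb(bool dt_update)
{
	if (dt_update)
		mock_update_dt();
	events_sent++;
	dt_fresh_at_last_event = dt_update;
}

/* Remove 'count' LMBs, asking for a DT refresh only on the final one. */
static void remove_by_count(int count)
{
	for (int removed = 0; removed < count; removed++)
		mock_remove_lmb(removed == count - 1);
}
```

Calling remove_by_count(6) emits six events but rewrites the DT once, and that single rewrite happens before the final event, so the cost stays linear in the number of LMBs.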

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
v3 -> v4: resolve a quadratic runtime complexity issue.
  This series is applied on next-test branch
 arch/powerpc/platforms/pseries/hotplug-memory.c | 88 ++---
 1 file changed, 66 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1a3ac3b..e07d5b1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -350,13 +350,13 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
return true;
 }
 
-static int dlpar_add_lmb(struct drmem_lmb *);
+static int dlpar_add_lmb(struct drmem_lmb *lmb, bool dt_update);
 
-static int dlpar_remove_lmb(struct drmem_lmb *lmb)
+static int dlpar_remove_lmb(struct drmem_lmb *lmb, bool dt_update)
 {
unsigned long block_sz;
phys_addr_t base_addr;
-   int rc, nid;
+   int rc, ret, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -372,6 +372,11 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   if (dt_update) {
+   ret = drmem_update_dt();
+   if (ret)
+   pr_warn("%s fail to update dt, but continue\n", 
__func__);
+   }
 
__remove_memory(nid, base_addr, block_sz);
 
@@ -387,6 +392,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
int lmbs_removed = 0;
int lmbs_available = 0;
int rc;
+   bool dt_update = false;
 
pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove);
 
@@ -409,7 +415,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
}
 
for_each_drmem_lmb(lmb) {
-   rc = dlpar_remove_lmb(lmb);
+   rc = dlpar_remove_lmb(lmb, dt_update);
  

[powerpc:next-test] BUILD SUCCESS 2e6bd221d96fcfd9bd1eed5cd9c008e7959daed7

2020-07-30 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next-test
branch HEAD: 2e6bd221d96fcfd9bd1eed5cd9c008e7959daed7  powerpc/kexec_file: 
Enable early kernel OPAL calls

elapsed time: 1395m

configs tested: 52
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
m68k allyesconfig
nds32 defconfig
nios2 allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
c6x  allyesconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a016-20200730
i386 randconfig-a012-20200730
i386 randconfig-a014-20200730
i386 randconfig-a015-20200730
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-30 Thread Nathan Lynch
Michael Ellerman  writes:
> Nathan Lynch  writes:
>> Laurent Dufour  writes:
>>> On 28/07/2020 at 19:37, Nathan Lynch wrote:
>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>> unfortunately lookups take the form of linear searches. As long as
>>>> this is the case, traversals have the potential to monopolize the CPU
>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>> they explicitly yield.
>>>>
>>>> Rather than placing cond_resched() calls within various
>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>> expression of the loop macro itself so users can't omit it.
>>>
>>> Is that not too much to call cond_resched() on every LMB?
>>>
>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>
>> Everything done within for_each_drmem_lmb is relatively heavyweight
>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>> of milliseconds. I don't think cond_resched() is an expensive check in
>> this context.
>
> Hmm, mostly.
>
> But there are quite a few cases like drmem_update_dt_v1():
>
>   for_each_drmem_lmb(lmb) {
>   dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>   dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>   dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>   dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>
>   dr_cell++;
>   }
>
> Which will compile to a pretty tight loop at the moment.
>
> Or drmem_update_dt_v2() which has two loops over all lmbs.
>
> And although the actual TIF check is cheap the function call to do it is
> not free.
>
> So I worry this is going to make some of those long loops take even
> longer.

That's fair, and I was wrong - some of the loop bodies are relatively
simple, not doing allocations or taking locks, etc.

One way to deal with this is to keep for_each_drmem_lmb() as-is and add a new
iterator that can reschedule, e.g. for_each_drmem_lmb_slow().

On the other hand... it's probably not too strong to say that the
drmem/hotplug code is in crisis with respect to correctness and
algorithmic complexity, so those are my overriding concerns right
now. Yes, this change will pessimize loops that are reinitializing the
entire drmem_lmb array on every DLPAR operation, but:

1. it doesn't make any user of for_each_drmem_lmb() less correct;
2. why is this code doing that in the first place, other than to
   accommodate a poor data structure choice?

The duration of the system calls where this code runs is measured in
minutes or hours on large configurations because of all the behaviors
that are at best O(n) with the amount of memory assigned to the
partition. For simplicity's sake I'd rather defer lower-level
performance considerations like this until the drmem data structures'
awful lookup properties are fixed -- hopefully in the 5.10 timeframe.

Thoughts?
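
The trade-off discussed in this thread (yield on every LMB versus yield every N entries) can be illustrated with a small user-space sketch. This is not the kernel code: struct lmb, maybe_yield(), lmb_next(), for_each_lmb() and the interval of 16 are all hypothetical stand-ins chosen for illustration, with maybe_yield() taking the place of cond_resched(). The yield check lives in the iteration expression, so loop bodies cannot omit it, but it only fires once per interval.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct drmem_lmb; only what the sketch needs. */
struct lmb {
	unsigned long base_addr;
};

/* Counts how often the loop offered to yield; stands in for cond_resched(). */
static unsigned long yield_calls;

static void maybe_yield(void)
{
	yield_calls++;
}

/* Hypothetical interval: offer to yield once every 16 entries. */
#define LMB_YIELD_INTERVAL 16

/* Advance to the next entry; yield when the new index is a multiple of the interval. */
static struct lmb *lmb_next(struct lmb *cur, const struct lmb *start)
{
	cur++;
	if ((size_t)(cur - start) % LMB_YIELD_INTERVAL == 0)
		maybe_yield();
	return cur;
}

/* The yield check sits in the iteration expression, not in the loop body. */
#define for_each_lmb(lmb, start, end) \
	for ((lmb) = (start); (lmb) < (end); (lmb) = lmb_next((lmb), (start)))

/* Example traversal with a cheap loop body; returns the number of entries visited. */
static size_t walk_lmbs(struct lmb *start, size_t n)
{
	struct lmb *lmb;
	size_t visited = 0;

	assert(start != NULL);
	for_each_lmb(lmb, start, start + n)
		visited++;
	return visited;
}
```

With an interval like this, a tight loop body pays for the function call only on a fraction of iterations, at the cost of a modulo on each step.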


Re: [PATCH] powerpc/mm: Limit resize_hpt_for_hotplug() call to hash guests only

2020-07-30 Thread Michael Ellerman
On Mon, 27 Jul 2020 15:27:04 +0530, Bharata B Rao wrote:
> During memory hotplug and unplug, resize_hpt_for_hotplug() gets called
> for both hash and radix guests but it should be called only for hash
> guests. Though the call does nothing in the radix guest case, it is
> cleaner to push this call into hash specific memory hotplug routines.

Applied to powerpc/next.

[1/1] powerpc/mm: Limit resize_hpt_for_hotplug() call to hash guests only
  https://git.kernel.org/powerpc/c/55548a86ebde2b3691b6a84baef1b02933408994

cheers


Re: [PATCH] selftests/powerpc: Squash spurious errors due to device removal

2020-07-30 Thread Michael Ellerman
On Mon, 27 Jul 2020 11:01:27 +1000, Oliver O'Halloran wrote:
> For drivers that don't have the error handling callbacks we implement
> recovery by removing the device and re-probing it. This causes the sysfs
> directory for the PCI device to be removed which causes the following
> spurious error to be printed when checking the PE state:
> 
> Breaking 0005:03:00.0...
> ./eeh-basic.sh: line 13: can't open /sys/bus/pci/devices/0005:03:00.0/eeh_pe_state: no such file
> 0005:03:00.0, waited 0/60
> 0005:03:00.0, waited 1/60
> 0005:03:00.0, waited 2/60
> 0005:03:00.0, waited 3/60
> 0005:03:00.0, waited 4/60
> 0005:03:00.0, waited 5/60
> 0005:03:00.0, waited 6/60
> 0005:03:00.0, waited 7/60
> 0005:03:00.0, Recovered after 8 seconds
> 
> [...]

Applied to powerpc/next.

[1/1] selftests/powerpc: Squash spurious errors due to device removal
  https://git.kernel.org/powerpc/c/5f8cf6475828b600ff6d000e580c961ac839cc61

cheers


Re: [PATCH] powerpc/fadump: Fix build error with CONFIG_PRESERVE_FA_DUMP=y

2020-07-30 Thread Michael Ellerman
On Mon, 27 Jul 2020 17:03:41 +1000, Michael Ellerman wrote:
> skiroot_defconfig fails:
> 
> arch/powerpc/kernel/fadump.c:48:17: error: ‘cpus_in_fadump’ defined but not used
>    48 | static atomic_t cpus_in_fadump;
> 
> Fix it by moving the definition into the #ifdef where it's used.

Applied to powerpc/next.

[1/1] powerpc/fadump: Fix build error with CONFIG_PRESERVE_FA_DUMP=y
  https://git.kernel.org/powerpc/c/5f987caec521cbb00d4ba2dc641ac8074626b762

cheers


Re: [PATCH v3] powerpc xmon: use `dcbf` inplace of `dcbi` instruction for 64bit Book3S

2020-07-30 Thread Michael Ellerman
On Mon, 30 Mar 2020 13:29:54 +0530, Balamuruhan S wrote:
> Data Cache Block Invalidate (dcbi) instruction was implemented back in
> PowerPC architecture version 2.03. But as per Power Processor Users Manual
> it is obsolete and not supported by POWER8/POWER9 core. An attempt to use
> this illegal instruction results in a hypervisor emulation assistance
> interrupt. So, ifdef out the option `i` in xmon for 64bit Book3S.
> 
> 0:mon> fi
> cpu 0x0: Vector: 700 (Program Check) at [c3be74a0]
> pc: c0102030: cacheflush+0x180/0x1a0
> lr: c0101f3c: cacheflush+0x8c/0x1a0
> sp: c3be7730
>msr: 80081033
>   current = 0xc35e5c00
>   paca= 0xc191   irqmask: 0x03   irq_happened: 0x01
> pid   = 1025, comm = bash
> Linux version 5.6.0-rc5-g5aa19adac (root@ltc-wspoon6) (gcc version 7.4.0
> (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #1 SMP Tue Mar 10 04:38:41 CDT 2020
> cpu 0x0: Exception 700 (Program Check) in xmon, returning to main loop
> [c3be7c50] c084abb0 __handle_sysrq+0xf0/0x2a0
> [c3be7d00] c084b3c0 write_sysrq_trigger+0xb0/0xe0
> [c3be7d30] c04d1edc proc_reg_write+0x8c/0x130
> [c3be7d60] c040dc7c __vfs_write+0x3c/0x70
> [c3be7d80] c0410e70 vfs_write+0xd0/0x210
> [c3be7dd0] c041126c ksys_write+0xdc/0x130
> [c3be7e20] c000b9d0 system_call+0x5c/0x68
> --- Exception: c01 (System Call) at 7fffa345e420
> SP (70b08ab0) is in userspace

Applied to powerpc/next.

[1/1] powerpc/xmon: Use `dcbf` inplace of `dcbi` instruction for 64bit Book3S
  https://git.kernel.org/powerpc/c/81a413259a224f0d1783c41a74f18864d4f3d67e

cheers


Re: [PATCH -next] powerpc: use for_each_child_of_node() macro

2020-07-30 Thread Michael Ellerman
On Tue, 28 Jul 2020 10:28:07 +0800, Qinglang Miao wrote:
> Use for_each_child_of_node() macro instead of open coding it.

Applied to powerpc/next.

[1/1] powerpc: use for_each_child_of_node() macro
  https://git.kernel.org/powerpc/c/b6ac59d39a348af29477d7bfdc3ba23526e3f4ea

cheers


Re: [PATCH v3 0/3] powerpc/pseries: IPI doorbell improvements

2020-07-30 Thread Michael Ellerman
On Sun, 26 Jul 2020 13:51:52 +1000, Nicholas Piggin wrote:
> Since v2:
> - Fixed ppc32 compile error
> - Tested-by from Cedric
> 
> Nicholas Piggin (3):
>   powerpc: inline doorbell sending functions
>   powerpc/pseries: Use doorbells even if XIVE is available
>   powerpc/pseries: Add KVM guest doorbell restrictions
> 
> [...]

Applied to powerpc/next.

[1/3] powerpc: Inline doorbell sending functions
  https://git.kernel.org/powerpc/c/1f0ce497433f8944045ee1baae218e31a0d295ee
[2/3] powerpc/pseries: Use doorbells even if XIVE is available
  https://git.kernel.org/powerpc/c/5b06d1679f2fe874ef49ea11324cd893ec9e2da8
[3/3] powerpc/pseries: Add KVM guest doorbell restrictions
  https://git.kernel.org/powerpc/c/107c55005fbd5243ee31fb13b6f166cde9e3ade1

cheers


Re: [PATCH v2 0/6] Improvements to pkey tests

2020-07-30 Thread Michael Ellerman
On Mon, 27 Jul 2020 09:30:34 +0530, Sandipan Das wrote:
> Based on recent bugs found in the pkey infrastructure, this
> improves the test for execute-disabled pkeys and adds a new
> test for detecting inconsistencies with the pkey reported by
> the signal information upon getting a fault.
> 
> Previous versions can be found at:
> v1: 
> https://lore.kernel.org/linuxppc-dev/cover.1594897099.git.sandi...@linux.ibm.com/
> 
> [...]

Applied to powerpc/next.

[1/6] selftests/powerpc: Move pkey helpers to headers
  https://git.kernel.org/powerpc/c/128d3d0210076232b7d54c361082c8ee17e4b669
[2/6] selftests/powerpc: Add pkey helpers for rights
  https://git.kernel.org/powerpc/c/264d7fccc4711328a19f07e6bd57aee4c68803aa
[3/6] selftests/powerpc: Harden test for execute-disabled pkeys
  https://git.kernel.org/powerpc/c/03634bbf5d8a6f2d97e6150a1b8ff03675badac3
[4/6] selftests/powerpc: Add helper to exit on failure
  https://git.kernel.org/powerpc/c/ec599482245d08002725cc1b353e4963fa26
[5/6] selftests/powerpc: Add wrapper for gettid
  https://git.kernel.org/powerpc/c/743f3544fffb9662aaf550c8358a8c1b6fcae707
[6/6] selftests/powerpc: Add test for pkey siginfo verification
  https://git.kernel.org/powerpc/c/c27f2fd1705a7e19ef2dc2b986c0d1cde3c3dbe7

cheers


[PATCHv4 1/2] powerpc/pseries: group lmb operation and memblock's

2020-07-30 Thread Pingfan Liu
This patch prepares for the incoming patch which swaps the order of
the KOBJ_ADD/REMOVE uevent and the dt update.

The dt update should come after the lmb operations, and before
__remove_memory()/__add_memory().  Accordingly, group all lmb operations
before the memblock ones.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
v3 -> v4: improve commit log
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..1a3ac3b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
-	int rc;
+	phys_addr_t base_addr;
+	int rc, nid;
 
 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	if (rc)
 		return rc;
 
+	base_addr = lmb->base_addr;
+	nid = lmb->nid;
 	block_sz = pseries_memory_block_size();
 
-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-	/* Update memory regions for memory remove */
-	memblock_remove(lmb->base_addr, block_sz);
-
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+	__remove_memory(nid, base_addr, block_sz);
+
+	/* Update memory regions for memory remove */
+	memblock_remove(base_addr, block_sz);
+
 	return 0;
 }
 
@@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	}
 
 	lmb_set_nid(lmb);
+	lmb->flags |= DRCONF_MEM_ASSIGNED;
+
 	block_sz = memory_block_size_bytes();
 
 	/* Add the memory */
@@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+		int nid = lmb->nid;
+		phys_addr_t base_addr = lmb->base_addr;
+
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
-	} else {
-		lmb->flags |= DRCONF_MEM_ASSIGNED;
+		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+		__remove_memory(nid, base_addr, block_sz);
 	}
 
 	return rc;
-- 
2.7.5
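
The reordering in this patch relies on copying the fields that the later calls need before the lmb bookkeeping resets them. A minimal user-space sketch of that pattern follows. The names struct ex_lmb, EX_ASSIGNED, ex_remove_memory() and ex_remove_lmb() are hypothetical, and resetting nid to -1 is an assumption made for illustration rather than a statement about what lmb_clear_nid() does.

```c
#include <assert.h>
#include <stddef.h>

#define EX_ASSIGNED 0x8u	/* hypothetical flag value */

/* Hypothetical stand-in for struct drmem_lmb. */
struct ex_lmb {
	unsigned long base_addr;
	int nid;
	unsigned int flags;
};

/* Records the arguments of the last removal so the ordering can be checked. */
static int removed_nid;
static unsigned long removed_base;

/* Stub standing in for the memory removal step. */
static void ex_remove_memory(int nid, unsigned long base_addr)
{
	removed_nid = nid;
	removed_base = base_addr;
}

static void ex_remove_lmb(struct ex_lmb *lmb)
{
	unsigned long base_addr;
	int nid;

	assert(lmb != NULL);

	/* Save what the removal step needs before the bookkeeping below. */
	base_addr = lmb->base_addr;
	nid = lmb->nid;

	/* Group the lmb bookkeeping first (assumed to reset nid). */
	lmb->nid = -1;
	lmb->flags &= ~EX_ASSIGNED;

	/* Memory removal comes last and uses the saved values. */
	ex_remove_memory(nid, base_addr);
}
```

Without the saved copies, the removal step would see the already-reset nid.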



Re: [PATCH v2] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE

2020-07-30 Thread Michael Ellerman
On Mon, 27 Jul 2020 14:29:08 +0530, Aneesh Kumar K.V wrote:
> This adds a kernel command line option that can be used to disable GTSE
> support. Disabling GTSE implies kernel will make hcalls to invalidate
> TLB entries.
> 
> This was done so that we can do VM migration between configs that
> enable/disable GTSE support via hypervisor. To migrate a VM from a system
> that supports GTSE to a system that doesn't, we can boot the guest with
> radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
> invalidates.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE
  https://git.kernel.org/powerpc/c/bf6b7661f41615c0815fce0a3f27acb5fc005470

cheers


Re: [PATCH 1/2] powerpc/hugetlb/cma: Allocate gigantic hugetlb pages using CMA

2020-07-30 Thread Michael Ellerman
On Mon, 13 Jul 2020 20:37:48 +0530, Aneesh Kumar K.V wrote:
> commit: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
> added support for allocating gigantic hugepages using CMA. This patch
> enables the same for powerpc

Applied to powerpc/next.

[1/2] powerpc/hugetlb/cma: Allocate gigantic hugetlb pages using CMA
  https://git.kernel.org/powerpc/c/ef26b76d1af61b90eb0dd3da58ad4f97d8e028f8
[2/2] powerpc/kvm/cma: Improve kernel log during boot
  https://git.kernel.org/powerpc/c/a5a8b258da7861009240b57687dfef47af91b406

cheers


Re: [PATCH v2 1/5] selftests/powerpc: Add test of stack expansion logic

2020-07-30 Thread Michael Ellerman
On Fri, 24 Jul 2020 19:25:24 +1000, Michael Ellerman wrote:
> We have custom stack expansion checks that it turns out are extremely
> badly tested and contain bugs, surprise. So add some tests that
> exercise the code and capture the current boundary conditions.
> 
> The signal test currently fails on 64-bit kernels because the 2048
> byte allowance for the signal frame is too small, we will fix that in
> a subsequent patch.

Applied to powerpc/next.

[1/5] selftests/powerpc: Add test of stack expansion logic
  https://git.kernel.org/powerpc/c/c9938a9dac95be7650218cdd8e9d1f882e7b5691
[2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame
  https://git.kernel.org/powerpc/c/63dee5df43a31f3844efabc58972f0a206ca4534
[3/5] selftests/powerpc: Update the stack expansion test
  https://git.kernel.org/powerpc/c/9ee571d84bf8cfdd587a1acbf3490ca90fc40c9d
[4/5] powerpc/mm: Remove custom stack expansion checking
  https://git.kernel.org/powerpc/c/773b3e53df5b84e73bf64998e4019f50a6662ad1
[5/5] selftests/powerpc: Remove powerpc special cases from stack expansion test
  https://git.kernel.org/powerpc/c/73da08f6966b81feb429af4fb3229da4cf21d6d9

cheers


[PATCH] soc: fsl: Remove bogus packed attributes from qman.h

2020-07-30 Thread Herbert Xu
There are two __packed attributes in qman.h that are both unnecessary
and causing compiler warnings because they're conflicting with
explicit alignment requirements set on members within the structure.

This patch removes them both.

Signed-off-by: Herbert Xu 

diff --git a/include/soc/fsl/qman.h b/include/soc/fsl/qman.h
index cfe00e08e85b..d81ff185dc0b 100644
--- a/include/soc/fsl/qman.h
+++ b/include/soc/fsl/qman.h
@@ -256,7 +256,7 @@ struct qm_dqrr_entry {
 	__be32 context_b;
 	struct qm_fd fd;
 	u8 __reserved4[32];
-} __packed;
+};
 #define QM_DQRR_VERB_VBIT		0x80
 #define QM_DQRR_VERB_MASK		0x7f	/* where the verb contains; */
 #define QM_DQRR_VERB_FRAME_DEQUEUE	0x60	/* "this format" */
@@ -289,7 +289,7 @@ union qm_mr_entry {
 		__be32 tag;
 		struct qm_fd fd;
 		u8 __reserved1[32];
-	} __packed ern;
+	} ern;
 	struct {
 		u8 verb;
 		u8 fqs;		/* Frame Queue Status */
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
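
As a rough illustration of why a packed attribute can be unnecessary on a structure like these, the sketch below lays out naturally aligned fields followed by a member with an explicit alignment requirement and checks that the compiler inserts no padding. The types struct fd_like and struct entry_natural and their field sizes are hypothetical and are not the qman.h definitions. The layout check assumes a typical ABI where uint16_t and uint32_t are aligned to their own size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical member type that carries an explicit alignment requirement. */
struct fd_like {
	uint8_t bytes[16];
} __attribute__((aligned(8)));

/* No packed attribute: every field already sits on its natural boundary. */
struct entry_natural {
	uint8_t verb;		/* offset 0 */
	uint8_t stat;		/* offset 1 */
	uint16_t seqnum;	/* offset 2 */
	uint32_t tag;		/* offset 4 */
	struct fd_like fd;	/* offset 8, honours aligned(8) */
};

/* The aligned member lands where a packed layout would also put it. */
static_assert(offsetof(struct entry_natural, fd) == 8,
	      "fd is expected directly after the 8 bytes of header fields");

/* Padding bytes are the struct size minus the sum of the member sizes. */
static size_t entry_padding_bytes(void)
{
	size_t members = sizeof(uint8_t) + sizeof(uint8_t) + sizeof(uint16_t) +
			 sizeof(uint32_t) + sizeof(struct fd_like);

	return sizeof(struct entry_natural) - members;
}
```

When the padding is already zero, packing changes nothing about the field offsets and only conflicts with the alignment the member type asks for.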


Re: [PATCH] powerpc: Fix MMCRA_BHRB_DISABLE define to work with binutils version < 2.28

2020-07-30 Thread Michael Ellerman
On Wed, 29 Jul 2020 00:16:54 -0400, Athira Rajeev wrote:
> commit 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to
> DT CPU features") defines MMCRA_BHRB_DISABLE as `0x20UL`.
> Binutils version less than 2.28 doesn't support UL suffix.
> 
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
> expected: ')'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of 
> line, first unrecognized character is `L'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
> expected: ')'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
> expected: ')'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of 
> line, first unrecognized character is `L'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
> expected: ')'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
> expected: ')'
> linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: operand out of 
> range (0x0020 is not between 0x8000 and 
> 0x)
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/perf: Fix MMCRA_BHRB_DISABLE define for binutils < 2.28
  https://git.kernel.org/powerpc/c/443359aebce0e17148251c0e316801fe69aa7d33

cheers
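
One common way to keep a constant usable from both C and assembly is to attach the UL suffix only when the header is compiled as C. The sketch below shows that pattern under stated assumptions: EXAMPLE_UL and EXAMPLE_FLAG are hypothetical names, the value 0x20 is only illustrative, and this is not claimed to be the exact change that was applied.

```c
#include <assert.h>

/*
 * Assembly sources are assumed to be preprocessed with __ASSEMBLY__ defined,
 * so they see a bare number that older assemblers accept. C sources get the
 * UL suffix pasted on.
 */
#ifdef __ASSEMBLY__
#define EXAMPLE_UL(x)	x
#else
#define EXAMPLE_UL(x)	x##UL
#endif

/* Hypothetical constant; the value is illustrative only. */
#define EXAMPLE_FLAG	EXAMPLE_UL(0x20)

#ifndef __ASSEMBLY__
/* In C the constant has type unsigned long. */
static_assert(sizeof(EXAMPLE_FLAG) == sizeof(unsigned long),
	      "EXAMPLE_FLAG should be an unsigned long constant in C");

static unsigned long example_flag(void)
{
	return EXAMPLE_FLAG;
}
#endif
```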


Re: [PATCH][next] powerpc: Use fallthrough pseudo-keyword

2020-07-30 Thread Michael Ellerman
On Mon, 27 Jul 2020 17:42:01 -0500, Gustavo A. R. Silva wrote:
> Replace the existing /* fall through */ comments and their variants with
> the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
> fall-through markings where applicable.
> 
> [1] 
> https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Applied to powerpc/next.

[1/1] powerpc: Use fallthrough pseudo-keyword
  https://git.kernel.org/powerpc/c/5e66a0cb5fbdc76f9ad86a1e8f43256dbad29ef7

cheers
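
For readers unfamiliar with the pseudo-keyword, here is a small self-contained sketch of how it reads in a switch statement. The local macro definition exists only so the example builds outside the kernel tree, and count_steps() is a hypothetical function. The attribute-based definition assumes a compiler that provides __has_attribute, with a no-op fallback otherwise.

```c
#include <assert.h>

/* Local definition so the sketch builds outside the kernel tree. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)  /* fallthrough */
#endif

/* Hypothetical example: each level also performs the steps of the lower levels. */
static int count_steps(int level)
{
	int steps = 0;

	assert(level >= 0);
	switch (level) {
	case 2:
		steps++;
		fallthrough;	/* deliberate: level 2 includes level 1 */
	case 1:
		steps++;
		fallthrough;	/* deliberate: level 1 includes level 0 */
	case 0:
		steps++;
		break;
	default:
		break;
	}
	return steps;
}
```

Unlike a comment, the statement form is visible to the compiler, so an implicit fall-through elsewhere can still be flagged.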


Re: [PATCH] powerpc/build: vdso linker warning for orphan sections

2020-07-30 Thread Michael Ellerman
On Tue, 3 Mar 2020 11:27:48 +1000, Nicholas Piggin wrote:
> 


Applied to powerpc/next.

[1/1] powerpc/build: vdso linker warning for orphan sections
  https://git.kernel.org/powerpc/c/f2af201002a8bc22500c04cc474ea480bf361351

cheers


Re: [PATCH -next] powerpc/powernv/sriov: Remove unused but set variable 'phb'

2020-07-30 Thread Michael Ellerman
On Tue, 28 Jul 2020 01:11:12 +0800, Wei Yongjun wrote:
> GCC reports the following warning:
> 
> arch/powerpc/platforms/powernv/pci-sriov.c:602:25: warning:
>  variable 'phb' set but not used [-Wunused-but-set-variable]
>   602 |  struct pnv_phb*phb;
>   | ^~~
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/powernv/sriov: Remove unused but set variable 'phb'
  https://git.kernel.org/powerpc/c/cf1ae052e073c7ef6cf1a783a6427f7228253bd3

cheers


Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc

2020-07-30 Thread Jiri Olsa
On Thu, Jul 30, 2020 at 01:24:40PM +0530, Athira Rajeev wrote:
> 
> 
> > On 27-Jul-2020, at 10:46 PM, Athira Rajeev  
> > wrote:
> > 
> > Patch set to add support for perf extended register capability in
> > powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to
> > indicate the PMU which support extended registers. The generic code
> > define the mask of extended registers as 0 for non supported architectures.
> > 
> > Patches 1 and 2 are the kernel side changes needed to include
> > base support for extended regs in powerpc and in power10.
> > Patches 3 and 4 are the perf tools side changes needed to support the
> > extended registers.
> > 
> 
> Hi Arnaldo, Jiri
> 
> please let me know if you have any comments/suggestions on this patch series 
> to add support for perf extended regs.

hi,
can't really tell for powerpc, but in general
perf tool changes look ok

jirka



Re: [PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall

2020-07-30 Thread Bharata B Rao
On Thu, Jul 30, 2020 at 04:21:01PM -0700, Ram Pai wrote:
> H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
> way in which a page will be treated.  H_PAGE_IN_NONSHARED indicates
> that the page will be shared with the Secure VM, and H_PAGE_IN_SHARED
> indicates that the page will not be shared but its contents will
> be copied.

Looks like you got the definitions of shared and non-shared interchanged.

> 
> However H_PAGE_IN_NONSHARED is not defined in the header file, though
> it is defined and documented in the API captured in
> Documentation/powerpc/ultravisor.rst
> 
> Define H_PAGE_IN_NONSHARED in the header file.

What is the use of defining this? Is this used directly in any place?
Or, are you planning to introduce such a usage?

Regards,
Bharata.


Re: [PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()

2020-07-30 Thread Bharata B Rao
On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> Observed the following oops while stress-testing, using multiple
> secureVM on a distro kernel. However this issue theoretically exists in
> 5.5 kernel and later.
> 
> This issue occurs when the total number of requested device-PFNs exceeds
> the total number of available device-PFNs.  PFN migration fails to
> allocate a device-pfn, which causes migrate_vma_finalize() to trigger
> kvmppc_uvmem_page_free() on a page, that is not associated with any
> device-pfn.  kvmppc_uvmem_page_free() blindly tries to access the
> contents of the private data which can be null, leading to the following
> kernel fault.
> 
>  --
>  Unable to handle kernel paging request for data at address 0x0011
>  Faulting instruction address: 0xc0080e36e110
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE SMP NR_CPUS=2048 NUMA PowerNV
> 
>  MSR:  9280b033 
>CR: 24424822  XER: 
>  CFAR: c0e3d764 DAR: 0011 DSISR: 4000 IRQMASK: 0
>  GPR00: c0080e36e0a4 c01f1d59f610 c0080e38a400 
>  GPR04: c01fa500 fffe  c000201fffeaf300
>  GPR08: 01f0  0f80 c0080e373608
>  GPR12: c0e3d710 c000201fffeaf300 0001 7fef8736
>  GPR16: 7fff97db4410 c000201c3b66a578  
>  GPR20: 000119db9ad0 000a fffc 0001
>  GPR24: c000201c3b66 c01f1d59f7a0 c04cffb0 0001
>  GPR28:  c00a001ff003e000 c0080e386150 0f80
>  NIP [c0080e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
>  LR [c0080e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
>  Call Trace:
>  [c0512010] free_devmap_managed_page+0xd0/0x100
>  [c03f71d0] put_devmap_managed_page+0xa0/0xc0
>  [c04d24bc] migrate_vma_finalize+0x32c/0x410
>  [c0080e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
>  [c0080e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
>  [c0080e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
>  [c0080e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
>  [c0080e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
>  [c0080e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
>  [c0080e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
>  [c0080e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
>  [c05451d0] do_vfs_ioctl+0xe0/0xaa0
>  [c0545d64] sys_ioctl+0xc4/0x160
>  [c000b408] system_call+0x5c/0x70
>  Instruction dump:
>  a12d1174 2f89 409e0158 a1271172 3929 b1271172 7c2004ac 3920
>  913e0140 3920 e87d0010 f93d0010 <89230011> e8c3 e9030008 2f89
>  --
> 
>  Fix the oops..
> 
> fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 2806983..f4002bf 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
>  {
>   unsigned long pfn = page_to_pfn(page) -
>   (kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
> - struct kvmppc_uvmem_page_pvt *pvt;
> + struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
> +
> + if (!pvt)
> + return;
>  
>   spin_lock(&kvmppc_uvmem_bitmap_lock);
>   bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
>   spin_unlock(&kvmppc_uvmem_bitmap_lock);
>  
> - pvt = page->zone_device_data;
>   page->zone_device_data = NULL;
>   if (pvt->remove_gfn)
>   kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);

In our case, device pages that are in use are always associated with a valid
pvt member. See kvmppc_uvmem_get_page() which returns failure if it
runs out of device pfns and that will result in proper failure of
page-in calls.

For the case where we run out of device pfns, migrate_vma_finalize() will
restore the original PTE and will not replace the PTE with device private PTE.

Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
called for non-device-private pages.

This could be a use-after-free case possibly arising out of the new state
changes in HV. If so, this fix will only mask the bug and not address the
original problem.

Regards,
Bharata.


[PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall

2020-07-30 Thread Ram Pai
H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
way in which a page will be treated.  H_PAGE_IN_NONSHARED indicates
that the page will be shared with the Secure VM, and H_PAGE_IN_SHARED
indicates that the page will not be shared but its contents will
be copied.

However H_PAGE_IN_NONSHARED is not defined in the header file, though
it is defined and documented in the API captured in
Documentation/powerpc/ultravisor.rst

Define H_PAGE_IN_NONSHARED in the header file.

Reported-by: Julia Lawall 
Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/hvcall.h  | 4 +++-
 arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index e90c073..43e3f8d 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -343,7 +343,9 @@
 #define H_COPY_TOFROM_GUEST	0xF80C
 
 /* Flags for H_SVM_PAGE_IN */
-#define H_PAGE_IN_SHARED        0x1
+#define H_PAGE_IN_NONSHARED	0x0  /* Page is not shared with the UV */
+#define H_PAGE_IN_SHARED	0x1  /* Page is shared with UV */
+#define H_PAGE_IN_MASK		0x1
 
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN		0xEF00
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 2dde0fb..2806983 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -947,12 +947,13 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 	if (page_shift != PAGE_SHIFT)
 		return H_P3;
 
-	if (flags & ~H_PAGE_IN_SHARED)
+	if (flags & ~H_PAGE_IN_MASK)
 		return H_P2;
 
 	if (flags & H_PAGE_IN_SHARED)
 		return kvmppc_share_page(kvm, gpa, page_shift);
 
+	/* handle H_PAGE_IN_NONSHARED */
 	ret = H_PARAMETER;
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	mmap_read_lock(kvm->mm);
-- 
1.8.3.1

-- 
Ram Pai
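
The flag handling in the patch above can be exercised in isolation with a small user-space sketch. The three H_PAGE_IN_* values are taken from the patch. The enum ex_result, its values and classify_page_in() are hypothetical names that stand in for the hcall return codes and the real handler.

```c
#include <assert.h>

/* Values as in the patch above. */
#define H_PAGE_IN_NONSHARED	0x0
#define H_PAGE_IN_SHARED	0x1
#define H_PAGE_IN_MASK		0x1

/* Every defined flag must be covered by the mask. */
static_assert((H_PAGE_IN_SHARED & ~H_PAGE_IN_MASK) == 0,
	      "H_PAGE_IN_SHARED must lie inside H_PAGE_IN_MASK");

/* Hypothetical outcomes standing in for the real return paths. */
enum ex_result { EX_BAD_FLAGS, EX_SHARE_PAGE, EX_MIGRATE_PAGE };

static enum ex_result classify_page_in(unsigned long flags)
{
	/* Reject any bit outside the mask of known flags. */
	if (flags & ~(unsigned long)H_PAGE_IN_MASK)
		return EX_BAD_FLAGS;

	if (flags & H_PAGE_IN_SHARED)
		return EX_SHARE_PAGE;

	/* Remaining case: flags == H_PAGE_IN_NONSHARED */
	return EX_MIGRATE_PAGE;
}
```

Because H_PAGE_IN_NONSHARED is zero, it cannot be tested with a bitwise AND; it is simply the case left over once the mask check and the shared check have both passed.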