Re: [PATCH v3 02/27] mm/memory_hotplug: Allow check_hotplug_memory_addressable to be called from drivers

2020-02-20 Thread Andrew Donnellan

On 21/2/20 2:26 pm, Alastair D'Silva wrote:

From: Alastair D'Silva 

When setting up OpenCAPI connected persistent memory, the range check may
not be performed until quite late (or perhaps not at all, if the user does
not establish a DAX device).

This patch makes the range check callable so we can perform the check while
probing the OpenCAPI SCM device.

Signed-off-by: Alastair D'Silva 


Reviewed-by: Andrew Donnellan 


--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



[PATCH v2 8/8] perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric events

2020-02-20 Thread Kajol Jain
The hv_24x7 feature in IBM® POWER9™ processor-based servers provides the
facility to continuously collect large numbers of hardware performance
metrics efficiently and accurately.
This patch adds an hv_24x7 metric file for different socket/chip
resources.
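
As a rough illustration of how the Memory_RD_BW_Chip metric in this file
could be evaluated: the expression sums four hv_24x7 read-dispatch
counters for one chip. The sketch assumes that the "ScaleUnit" string
"1.6e-2MB" means "multiply the sum by 1.6e-2 and label the result MB".
The struct and function names are hypothetical and are not part of perf.

```c
#include <stdint.h>

/* Hypothetical container for the four counters named in MetricExpr. */
struct mcs_rd_counts {
	uint64_t mcs01_port01;	/* PM_MCS01_128B_RD_DISP_PORT01 */
	uint64_t mcs01_port23;	/* PM_MCS01_128B_RD_DISP_PORT23 */
	uint64_t mcs23_port01;	/* PM_MCS23_128B_RD_DISP_PORT01 */
	uint64_t mcs23_port23;	/* PM_MCS23_128B_RD_DISP_PORT23 */
};

/*
 * Sum the four counters and apply the scale factor taken from the
 * "ScaleUnit" string (assumed semantics: value * 1.6e-2, unit MB).
 */
double memory_rd_bw_chip_mb(const struct mcs_rd_counts *c)
{
	uint64_t sum = c->mcs01_port01 + c->mcs01_port23 +
		       c->mcs23_port01 + c->mcs23_port23;

	return (double)sum * 1.6e-2;
}
```

The write metric would follow the same pattern with the WR_DISP counters.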

Result:

power9 platform:

command:# ./perf stat --metric-only -M Memory_RD_BW_Chip -C 0
   -I 1000 sleep 1

time MB   Memory_RD_BW_Chip_0 MB   Memory_RD_BW_Chip_1 MB
1.000192635  0.4  0.0
1.001695883  0.0  0.0

Signed-off-by: Kajol Jain 
---
 .../arch/powerpc/power9/nest_metrics.json | 19 +++
 1 file changed, 19 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json
new file mode 100644
index ..ac38f5540ac6
--- /dev/null
+++ b/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json
@@ -0,0 +1,19 @@
+[
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_RD_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS01_128B_RD_DISP_PORT23\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_RD_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_RD_DISP_PORT23\\,chip\\=?@)",
+"MetricName": "Memory_RD_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_WR_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS01_128B_WR_DISP_PORT23\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_WR_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_WR_DISP_PORT23\\,chip\\=?@ )",
+"MetricName": "Memory_WR_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_PB_CYC\\,chip\\=?@ )",
+"MetricName": "PowerBUS_Frequency",
+"ScaleUnit": "2.5e-7GHz"
+}
+]
-- 
2.18.1



[PATCH v2 7/8] tools/perf: Enable Hz/hz prinitg for --metric-only option

2020-02-20 Thread Kajol Jain
Commit 54b5091606c18 ("perf stat: Implement --metric-only mode")
added the function 'valid_only_metric()', which drops a metric if "Hz"
or "hz" is part of its "ScaleUnit". This patch enables such metrics,
since hv_24x7 supports a couple of frequency events.
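
A standalone sketch of the check as it looks after this patch, based
only on the diff below. The function name carries a "_sketch" suffix to
make clear that it is not the perf function itself.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Mirrors valid_only_metric() after the patch: a unit is rejected only
 * if it is missing, is a rate ("/sec"), or is "CPUs utilized".
 * Units containing "Hz" or "hz" are now accepted.
 */
bool valid_only_metric_sketch(const char *unit)
{
	if (!unit)
		return false;
	if (strstr(unit, "/sec") || strstr(unit, "CPUs utilized"))
		return false;
	return true;
}
```

With this version a unit such as "GHz" passes the check, while "M/sec"
is still filtered out.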

Signed-off-by: Kajol Jain 
---
 tools/perf/util/stat-display.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index bc31fccc0057..22dcdfbb9e10 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -236,8 +236,6 @@ static bool valid_only_metric(const char *unit)
if (!unit)
return false;
if (strstr(unit, "/sec") ||
-   strstr(unit, "hz") ||
-   strstr(unit, "Hz") ||
strstr(unit, "CPUs utilized"))
return false;
return true;
-- 
2.18.1



[PATCH v2 6/8] perf/tools: Enhance JSON/metric infrastructure to handle "?"

2020-02-20 Thread Kajol Jain
This patch enhances the current metric infrastructure to handle "?" in the
metric expression. The "?" can be used for parameters whose value is not
known when the metric events are created and which can be replaced with
the proper value at runtime. It also adds the flexibility to create
multiple events out of a single metric event added in a JSON file.
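
A minimal userspace sketch of the "?" substitution idea, independent of
the perf parser. The function name and the bounded-buffer interface are
assumptions made for illustration; the number is formatted into a
NUL-terminated stack buffer so that no heap allocation is needed.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy expr into dst, replacing every '?' with the decimal form of
 * param. Returns 0 on success and -1 if dst is too small.
 */
int expand_runtime_param(const char *expr, int param, char *dst, size_t dstsz)
{
	char num[16];
	size_t used = 0;
	int numlen = snprintf(num, sizeof(num), "%d", param);

	if (numlen < 0 || (size_t)numlen >= sizeof(num))
		return -1;

	for (; *expr; expr++) {
		if (*expr == '?') {
			/* Keep one byte free for the terminating NUL. */
			if (used + (size_t)numlen >= dstsz)
				return -1;
			memcpy(dst + used, num, (size_t)numlen);
			used += (size_t)numlen;
		} else {
			if (used + 1 >= dstsz)
				return -1;
			dst[used++] = *expr;
		}
	}
	if (used >= dstsz)
		return -1;
	dst[used] = '\0';
	return 0;
}
```

Calling it once per parameter value (0, 1, ...) yields one concrete
event string per chip from a single template.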

The patch adds the function 'arch_get_runtimeparam', an arch-specific
function that returns the number of metric events that need to be
created. By default it returns 1.
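
A hedged sketch of how such a count could be read from a sysfs file such
as the "sockets" file used on powerpc, falling back to 1 when the file
is missing or unusable. The helper names are hypothetical. Unlike a bare
fread() followed by strtol(), the buffer is explicitly NUL-terminated
and the file is always closed.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal count; return 1 for empty, invalid, or non-positive input. */
int parse_runtime_count(const char *text)
{
	char *end;
	long val;

	if (!text)
		return 1;
	val = strtol(text, &end, 10);
	if (end == text || val <= 0 || val > INT_MAX)
		return 1;
	return (int)val;
}

/* Read the count from a file; return 1 if the file cannot be opened. */
int read_runtime_count(const char *path)
{
	char buf[16];
	size_t n;
	FILE *file = fopen(path, "r");

	if (!file)
		return 1;
	n = fread(buf, 1, sizeof(buf) - 1, file);
	fclose(file);
	buf[n] = '\0';	/* strtol() needs a terminated string */
	return parse_runtime_count(buf);
}
```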

A loop is added to the function 'metricgroup__add_metric', which creates
multiple events at runtime depending on the return value of
'arch_get_runtimeparam' and merges those events into 'group_list'.

This infrastructure is needed for the hv_24x7 socket/chip level events.
"hv_24x7" chip level events need the specific chip-id for which the
data is requested. The function 'arch_get_runtimeparam' is implemented
in header.c and extracts the number of sockets from the sysfs file
"sockets" under "/sys/devices/hv_24x7/interface/".

Signed-off-by: Kajol Jain 
---
 tools/perf/arch/powerpc/util/header.c |  40 +
 tools/perf/util/expr.h|   1 +
 tools/perf/util/expr.y|  17 +++-
 tools/perf/util/metricgroup.c | 112 --
 tools/perf/util/metricgroup.h |   1 +
 tools/perf/util/stat-shadow.c |   5 ++
 6 files changed, 134 insertions(+), 42 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/header.c 
b/tools/perf/arch/powerpc/util/header.c
index 3b4cdfc5efd6..28425edb901c 100644
--- a/tools/perf/arch/powerpc/util/header.c
+++ b/tools/perf/arch/powerpc/util/header.c
@@ -7,6 +7,11 @@
 #include 
 #include 
 #include "header.h"
+#include "metricgroup.h"
+#include "evlist.h"
+#include 
+#include "pmu.h"
+#include 
 
 #define mfspr(rn)   ({unsigned long rval; \
 asm volatile("mfspr %0," __stringify(rn) \
@@ -16,6 +21,8 @@
 #define PVR_VER(pvr)(((pvr) >>  16) & 0xFFFF) /* Version field */
 #define PVR_REV(pvr)(((pvr) >>   0) & 0xFFFF) /* Revison field */
 
+#define SOCKETS_INFO_FILE_PATH "/devices/hv_24x7/interface/"
+
 int
 get_cpuid(char *buffer, size_t sz)
 {
@@ -44,3 +51,36 @@ get_cpuid_str(struct perf_pmu *pmu __maybe_unused)
 
return bufp;
 }
+
+int arch_get_runtimeparam(void)
+{
+   int count = 0;
+   DIR *dir;
+   char path[PATH_MAX];
+   const char *sysfs = sysfs__mountpoint();
+   char filename[] = "sockets";
+   FILE *file;
+   char buf[16], *num;
+   int data;
+
+   if (!sysfs)
+   goto out;
+   snprintf(path, PATH_MAX,
+"%s" SOCKETS_INFO_FILE_PATH, sysfs);
+   dir = opendir(path);
+   if (!dir)
+   goto out;
+   strcat(path, filename);
+   file = fopen(path, "r");
+   if (!file)
+   goto out;
+
+   data = fread(buf, 1, sizeof(buf), file);
+   if (data == 0)
+   goto out;
+   count = strtol(buf, &num, 10);
+out:
+   if (!count)
+   count = 1;
+   return count;
+}
diff --git a/tools/perf/util/expr.h b/tools/perf/util/expr.h
index 046160831f90..85ebea68b0c5 100644
--- a/tools/perf/util/expr.h
+++ b/tools/perf/util/expr.h
@@ -15,6 +15,7 @@ struct parse_ctx {
struct parse_id ids[MAX_PARSE_ID];
 };
 
+extern int expr__runtimeparam;
 void expr__ctx_init(struct parse_ctx *ctx);
 void expr__add_id(struct parse_ctx *ctx, const char *id, double val);
 #ifndef IN_EXPR_Y
diff --git a/tools/perf/util/expr.y b/tools/perf/util/expr.y
index 7d226241f1d7..8d1d51451873 100644
--- a/tools/perf/util/expr.y
+++ b/tools/perf/util/expr.y
@@ -37,6 +37,8 @@
 %type  expr if_expr
 
 %{
+int expr__runtimeparam;
+
 static int expr__lex(YYSTYPE *res, const char **pp);
 
 static void expr__error(double *final_val __maybe_unused,
@@ -102,7 +104,7 @@ static int expr__symbol(YYSTYPE *res, const char *p, const 
char **pp)
if (*p == '#')
*dst++ = *p++;
 
-   while (isalnum(*p) || *p == '_' || *p == '.' || *p == ':' || *p == '@' 
|| *p == '\\') {
+   while (isalnum(*p) || *p == '_' || *p == '.' || *p == ':' || *p == '@' 
|| *p == '\\' || *p == '?') {
if (p - s >= MAXIDLEN)
return -1;
/*
@@ -113,6 +115,19 @@ static int expr__symbol(YYSTYPE *res, const char *p, const 
char **pp)
*dst++ = '/';
else if (*p == '\\')
*dst++ = *++p;
+   else if (*p == '?') {
+   int size = snprintf(NULL, 0, "%d", expr__runtimeparam);
+   char * paramval = (char *)malloc(size);
+   int i = 0;
+   if(!paramval)
+   *dst++ = '0';
+   else {
+   sprintf(paramval, "%d", expr__runtimeparam);
+   while(i < size)
+   *dst++ = paramval[i++];
+   

[PATCH v2 5/8] powerpc/hv-24x7: Update post_mobility_fixup() to handle migration

2020-02-20 Thread Kajol Jain
The function 'read_sys_info_pseries()' was added to get system parameter
values such as the number of sockets and chips per socket.
It gets these details via an rtas_call with the token
"PROCESSOR_MODULE_INFO".

In case an LPAR migrates from one system to another, system
parameter details such as chips per socket or number of sockets might
change. They therefore need to be re-initialized; otherwise, these
values correspond to the previous system.
This patch adds a call to 'read_sys_info_pseries()' from
'post_mobility_fixup()' to re-init the physsockets and physchips values.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/platforms/pseries/mobility.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index b571285f6c14..226accd6218b 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -371,6 +371,18 @@ void post_mobility_fixup(void)
/* Possibly switch to a new RFI flush type */
pseries_setup_rfi_flush();
 
+   /*
+* In case an LPAR migrates from one system to another, system
+* parameter details such as chips per socket and number of sockets
+* might change. They need to be re-initialized; otherwise these
+* values correspond to the previous system.
+* Here, add a call to read_sys_info_pseries(), declared in
+* platforms/pseries/pseries.h, to re-init the physsockets and
+* physchips values.
+*/
+   if (IS_ENABLED(CONFIG_HV_PERF_CTRS) && IS_ENABLED(CONFIG_PPC_RTAS))
+   read_sys_info_pseries();
+
return;
 }
 
-- 
2.18.1



[PATCH v2 4/8] Documentation/ABI: Add ABI documentation for chips and sockets

2020-02-20 Thread Kajol Jain
Add documentation for the following sysfs files:
/sys/devices/hv_24x7/interface/chips,
/sys/devices/hv_24x7/interface/sockets

Signed-off-by: Kajol Jain 
---
 .../testing/sysfs-bus-event_source-devices-hv_24x7 | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 
b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index ec27c6c9e737..e26cb1770c61 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -22,6 +22,20 @@ Description:
Exposes the "version" field of the 24x7 catalog. This is also
extractable from the provided binary "catalog" sysfs entry.
 
+What:  /sys/devices/hv_24x7/interface/sockets
+Date:  December 2019
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs interface exposes the number of sockets present in 
the
+   system.
+
+What:  /sys/devices/hv_24x7/interface/chips
+Date:  December 2019
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs interface exposes the number of chips per socket
+   present in the system.
+
 What:  /sys/bus/event_source/devices/hv_24x7/event_descs/
 Date:  February 2014
 Contact:   Linux on PowerPC Developer List 
-- 
2.18.1



[PATCH v2 3/8] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show processor details

2020-02-20 Thread Kajol Jain
To expose system-dependent parameters such as the total number of
sockets and the number of chips per socket, this patch adds two sysfs
files. "sockets" and "chips" are added to /sys/devices/hv_24x7/interface/
of the "hv_24x7" pmu.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-24x7.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 4248a9d1e2ed..9e486ec7269f 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -454,6 +454,20 @@ static ssize_t device_show_string(struct device *dev,
return sprintf(buf, "%s\n", (char *)d->var);
 }
 
+#ifdef CONFIG_PPC_RTAS
+static ssize_t sockets_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%d\n", physsockets);
+}
+
+static ssize_t chips_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+   return sprintf(buf, "%d\n", physchips);
+}
+#endif
+
 static struct attribute *device_str_attr_create_(char *name, char *str)
 {
struct dev_ext_attribute *attr = kzalloc(sizeof(*attr), GFP_KERNEL);
@@ -1100,6 +1114,10 @@ PAGE_0_ATTR(catalog_len, "%lld\n",
(unsigned long long)be32_to_cpu(page_0->length) * 4096);
 static BIN_ATTR_RO(catalog, 0/* real length varies */);
 static DEVICE_ATTR_RO(domains);
+#ifdef CONFIG_PPC_RTAS
+static DEVICE_ATTR_RO(sockets);
+static DEVICE_ATTR_RO(chips);
+#endif
 
 static struct bin_attribute *if_bin_attrs[] = {
&bin_attr_catalog,
@@ -1110,6 +1128,10 @@ static struct attribute *if_attrs[] = {
&dev_attr_catalog_len.attr,
&dev_attr_catalog_version.attr,
&dev_attr_domains.attr,
+#ifdef CONFIG_PPC_RTAS
+   &dev_attr_sockets.attr,
+   &dev_attr_chips.attr,
+#endif
NULL,
 };
 
-- 
2.18.1



[PATCH v2 2/8] powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor details

2020-02-20 Thread Kajol Jain
For hv_24x7 socket/chip level events, the specific chip-id for which
the data is requested should be added as part of the pmu events.
But the number of chips/sockets in the system is not exposed.

This patch implements read_sys_info_pseries() to get system
parameter values such as the number of sockets and chips per socket.
An rtas_call with the token "PROCESSOR_MODULE_INFO"
is used to get these values.

A subsequent patch exports these values via sysfs.

The patch also makes these parameters default to 1.
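
The buffer handling in the diff below can be sketched in userspace as
follows. The layout (16-bit big-endian fields at byte offsets 0, 2, 4
and 6 holding length, type count, sockets and chips) is taken from the
offsets used in the patch; the struct and function names are
hypothetical. The sketch reads bytes as unsigned char so the result does
not depend on whether plain char is signed on the build platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for the two values the driver keeps. */
struct module_info {
	uint16_t sockets;
	uint16_t chips;
};

/* Read a 16-bit big-endian value at the given byte offset. */
static uint16_t be16_at(const unsigned char *buf, size_t offset)
{
	return (uint16_t)((buf[offset] << 8) | buf[offset + 1]);
}

/*
 * Defaults to 1 socket and 1 chip, as the patch does, and overrides
 * them only when the length field is at least 6 and the type count
 * is nonzero. Returns 0 when values were parsed, -1 otherwise.
 */
int parse_module_info(const unsigned char *buf, size_t buflen,
		      struct module_info *out)
{
	out->sockets = 1;
	out->chips = 1;

	if (buflen < 8)
		return -1;
	if (be16_at(buf, 0) < 6 || be16_at(buf, 2) == 0)
		return -1;

	out->sockets = be16_at(buf, 4);
	out->chips = be16_at(buf, 6);
	return 0;
}
```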

Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-24x7.c  | 72 
 arch/powerpc/platforms/pseries/pseries.h |  3 +
 2 files changed, 75 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 6dbbf70232aa..4248a9d1e2ed 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -20,6 +20,11 @@
 #include 
 #include 
 
+#ifdef CONFIG_PPC_RTAS
+#include 
+#include <../../platforms/pseries/pseries.h>
+#endif
+
 #include "hv-24x7.h"
 #include "hv-24x7-catalog.h"
 #include "hv-common.h"
@@ -57,6 +62,69 @@ static bool is_physical_domain(unsigned domain)
}
 }
 
+#ifdef CONFIG_PPC_RTAS
+#define PROCESSOR_MODULE_INFO   43
+#define PROCESSOR_MAX_LENGTH   (8 * 1024)
+
+static int strbe16toh(const char *buf, int offset)
+{
+   return (buf[offset] << 8) + buf[offset + 1];
+}
+
+static u32 physsockets;/* Physical sockets */
+static u32 physchips;  /* Physical chips */
+
+/*
+ * The function read_sys_info_pseries() makes an rtas_call which requires
+ * a data buffer of size 8K. As the standard 'rtas_data_buf' is of size
+ * 4K, we add a new local buffer 'rtas_local_data_buf'.
+ */
+char rtas_local_data_buf[PROCESSOR_MAX_LENGTH] __cacheline_aligned;
+
+/*
+ * read_sys_info_pseries()
+ * Retrieve the number of sockets and chips per socket details
+ * through the get-system-parameter rtas call.
+ */
+void read_sys_info_pseries(void)
+{
+   int call_status, len, ntypes;
+
+   /*
+* Making system parameter: chips and sockets default to 1.
+*/
+   physsockets = 1;
+   physchips = 1;
+   memset(rtas_local_data_buf, 0, PROCESSOR_MAX_LENGTH);
+   spin_lock(&rtas_data_buf_lock);
+
+   call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+   NULL,
+   PROCESSOR_MODULE_INFO,
+   __pa(rtas_local_data_buf),
+   PROCESSOR_MAX_LENGTH);
+
+   spin_unlock(&rtas_data_buf_lock);
+
+   if (call_status != 0) {
+   pr_info("%s %s Error calling get-system-parameter (0x%x)\n",
+   __FILE__, __func__, call_status);
+   } else {
+   rtas_local_data_buf[PROCESSOR_MAX_LENGTH - 1] = '\0';
+   len = strbe16toh(rtas_local_data_buf, 0);
+   if (len < 6)
+   return;
+
+   ntypes = strbe16toh(rtas_local_data_buf, 2);
+
+   if (!ntypes)
+   return;
+   physsockets = strbe16toh(rtas_local_data_buf, 4);
+   physchips = strbe16toh(rtas_local_data_buf, 6);
+   }
+}
+#endif /* CONFIG_PPC_RTAS */
+
 /* Domains for which more than one result element are returned for each event. 
*/
 static bool domain_needs_aggregation(unsigned int domain)
 {
@@ -1615,6 +1683,10 @@ static int hv_24x7_init(void)
if (r)
return r;
 
+#ifdef CONFIG_PPC_RTAS
+   read_sys_info_pseries();
+#endif
+
return 0;
 }
 
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 13fa370a87e4..1727559ce304 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -19,6 +19,9 @@ extern void request_event_sources_irqs(struct device_node *np,
 struct pt_regs;
 
 extern int pSeries_system_reset_exception(struct pt_regs *regs);
+#ifdef CONFIG_PPC_RTAS
+extern void read_sys_info_pseries(void);
+#endif
 extern int pSeries_machine_check_exception(struct pt_regs *regs);
 extern long pseries_machine_check_realmode(struct pt_regs *regs);
 
-- 
2.18.1



[PATCH v2 1/8] powerpc/perf/hv-24x7: Fix inconsistent output values incase multiple hv-24x7 events run

2020-02-20 Thread Kajol Jain
Commit 2b206ee6b0df ("powerpc/perf/hv-24x7: Display change in counter
values") made the code print the _change_ in the counter value rather than
the raw value for 24x7 counters. In case of transactions, the event count
is set to 0 at the beginning of the transaction. It also sets
the event's prev_count to the raw value at the time of initialization.
Because the event count is set to 0, we see some weird behaviour
whenever we run multiple 24x7 events at a time.

For example:

command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/,
   hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}"
   -C 0 -I 1000 sleep 100

 1.000121704                        120 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 1.000121704                          5 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 2.000357733                          8 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 2.000357733                         10 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 3.000495215 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 3.000495215 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 4.000641884                         56 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 4.000641884 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 5.000791887 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/

These large values appear when we use -I.

Because we set event_count to 0, in the interval case the overall
event_count does not come in incremental order, since a new delta may be
smaller than the previous count. As a result, when we print intervals we
get a negative value, which produces these large values.

Rather than setting event_count to 0, this patch changes local64_set to
local64_add in the function 'h_24x7_event_read'.
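
The effect can be reproduced in isolation. The sketch below assumes the
per-interval value is computed as an unsigned 64-bit difference between
two cumulative counts; the function name is hypothetical. When the newer
count is smaller than the older one, the subtraction wraps around and
yields a value close to 2^64 instead of a small negative number.

```c
#include <stdint.h>

/*
 * Difference between two cumulative counts as an unsigned 64-bit
 * value. If new_total < prev_total the result wraps modulo 2^64.
 */
uint64_t interval_delta(uint64_t prev_total, uint64_t new_total)
{
	return new_total - prev_total;
}
```

For example, going from a count of 8 to a count of 5 gives 2^64 - 3
rather than -3, which is the kind of huge value seen in the output above.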

With this patch, on the power9 platform:

command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/,
   hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}"
   -C 0 -I 1000 sleep 100

 1.000117685     93 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 1.000117685      1 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 2.000349331     98 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 2.000349331      2 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 3.000495900    131 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 3.000495900      4 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 4.000645920    204 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 4.000645920     61 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 4.284169997     22 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/

Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-24x7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 573e0b309c0c..6dbbf70232aa 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1409,7 +1409,7 @@ static void h_24x7_event_read(struct perf_event *event)
 * that would require issuing a hcall, which would then
 * defeat the purpose of using the txn interface.
 */
-   local64_set(&event->count, 0);
+   local64_add(0, &event->count);
}
 
put_cpu_var(hv_24x7_reqb);
-- 
2.18.1



[PATCH v2 0/8] powerpc/perf: Add json file metric support for the hv_24x7 socket/chip level events

2020-02-20 Thread Kajol Jain
The hv_24x7 feature in IBM® POWER9™ processor-based servers provides the
facility to continuously collect large numbers of hardware performance
metrics efficiently and accurately.

The first patch of the patchset fixes inconsistent results we get when
we run multiple 24x7 events.

The patchset adds JSON file metric support for the hv_24x7 socket/chip
level events. "hv_24x7" pmu interface events need system-dependent
parameters such as socket/chip/core. For example, hv_24x7 chip level
events need the specific chip-id for which the data is requested to be
added as part of the pmu events.

So, to enable JSON file support for the "hv_24x7" interface, the patchset
exposes the total number of sockets and chips per socket in sysfs
files (sockets, chips) under "/sys/devices/hv_24x7/interface/".

To get the number of sockets and chips per socket, the patchset adds an
rtas call with the token "PROCESSOR_MODULE_INFO". The patchset also
handles the partition migration case by re-initializing these
system-dependent parameters through a call in post_mobility_fixup()
(mobility.c).

Patches 6 and 8 of the patchset handle the perf tool plumbing needed to
replace the "?" character in the metric expression with the proper value,
and add the hv_24x7 JSON metric file for different socket/chip resources.

The patchset also enables Hz/hz printing for the --metric-only option to
print metric data for bus frequency.

Changelog:

v1 -> v2
- Rename hv-24x7 metric json file as nest_metrics.json

Kajol Jain (8):
  powerpc/perf/hv-24x7: Fix inconsistent output values incase multiple
hv-24x7 events run
  powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor
details
  powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show
processor details
  Documentation/ABI: Add ABI documentation for chips and sockets
  powerpc/hv-24x7: Update post_mobility_fixup() to handle migration
  perf/tools: Enhance JSON/metric infrastructure to handle "?"
  tools/perf: Enable Hz/hz prinitg for --metric-only option
  perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric
events

 .../sysfs-bus-event_source-devices-hv_24x7|  14 +++
 arch/powerpc/perf/hv-24x7.c   |  96 ++-
 arch/powerpc/platforms/pseries/mobility.c |  12 ++
 arch/powerpc/platforms/pseries/pseries.h  |   3 +
 tools/perf/arch/powerpc/util/header.c |  40 +++
 .../arch/powerpc/power9/nest_metrics.json |  19 +++
 tools/perf/util/expr.h|   1 +
 tools/perf/util/expr.y|  17 ++-
 tools/perf/util/metricgroup.c | 112 +++---
 tools/perf/util/metricgroup.h |   1 +
 tools/perf/util/stat-display.c|   2 -
 tools/perf/util/stat-shadow.c |   5 +
 12 files changed, 277 insertions(+), 45 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json

-- 
2.18.1



Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-20 Thread Andrew Donnellan

On 21/2/20 2:26 pm, Alastair D'Silva wrote:

From: Alastair D'Silva 

Function declarations don't need externs; remove the existing ones
so they are consistent with newer code.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH v3 27/27] MAINTAINERS: Add myself & nvdimm/ocxl to ocxl

2020-02-20 Thread Andrew Donnellan

On 21/2/20 2:27 pm, Alastair D'Silva wrote:

From: Alastair D'Silva 

The OpenCAPI Persistent Memory driver will be maintained as part of
the ppc tree.

I'm also adding myself as an author of the driver & contributor to
the generic ocxl driver.

Signed-off-by: Alastair D'Silva 


You need to update the title of this patch :)


---
  MAINTAINERS | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index f8670989ec91..3fb9a9f576a7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12064,13 +12064,16 @@ F:tools/objtool/
  OCXL (Open Coherent Accelerator Processor Interface OpenCAPI) DRIVER
  M:Frederic Barrat 
  M:Andrew Donnellan 
+M: Alastair D'Silva 
  L:linuxppc-dev@lists.ozlabs.org
  S:Supported
  F:arch/powerpc/platforms/powernv/ocxl.c
+F: arch/powerpc/platforms/powernv/pmem/*
  F:arch/powerpc/include/asm/pnv-ocxl.h
  F:drivers/misc/ocxl/
  F:include/misc/ocxl*
  F:include/uapi/misc/ocxl.h
+F: include/uapi/nvdimm/ocxl-pmem.h
  F:Documentation/userspace-api/accelerators/ocxl.rst


Should this be part of the ocxl entry or a separate entry? I guess I 
don't care too much either way.


--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



[PATCH v2 5/5] Documentation: Document sysfs interfaces purr, spurr, idle_purr, idle_spurr

2020-02-20 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Add documentation for the following sysfs interfaces:
/sys/devices/system/cpu/cpuX/purr
/sys/devices/system/cpu/cpuX/spurr
/sys/devices/system/cpu/cpuX/idle_purr
/sys/devices/system/cpu/cpuX/idle_spurr

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 2e0e3b4..799dc737a 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -580,3 +580,42 @@ Description:   Secure Virtual Machine
If 1, it means the system is using the Protected Execution
Facility in POWER9 and newer processors. i.e., it is a Secure
Virtual Machine.
+
+What:  /sys/devices/system/cpu/cpuX/purr
+Date:  Apr 2005
+Contact:   Linux for PowerPC mailing list 
+Description:   PURR ticks for this CPU since the system boot.
+
+   The Processor Utilization Resources Register (PURR) is
+   a 64-bit counter which provides an estimate of the
+   resources used by the CPU thread. The contents of this
+   register increase monotonically. This sysfs interface
+   exposes the number of PURR ticks for cpuX.
+
+What:  /sys/devices/system/cpu/cpuX/spurr
+Date:  Dec 2006
+Contact:   Linux for PowerPC mailing list 
+Description:   SPURR ticks for this CPU since the system boot.
+
+   The Scaled Processor Utilization Resources Register
+   (SPURR) is a 64-bit counter that provides a frequency
+   invariant estimate of the resources used by the CPU
+   thread. The contents of this register increase
+   monotonically. This sysfs interface exposes the number
+   of SPURR ticks for cpuX.
+
+What:  /sys/devices/system/cpu/cpuX/idle_purr
+Date:  Nov 2019
+Contact:   Linux for PowerPC mailing list 
+Description:   PURR ticks for cpuX when it was idle.
+
+   This sysfs interface exposes the number of PURR ticks
+   for cpuX when it was idle.
+
+What:  /sys/devices/system/cpu/cpuX/idle_spurr
+Date:  Nov 2019
+Contact:   Linux for PowerPC mailing list 
+Description:   SPURR ticks for cpuX when it was idle.
+
+   This sysfs interface exposes the number of SPURR ticks
+   for cpuX when it was idle.
-- 
1.9.4



[PATCH v2 3/5] powerpc/pseries: Account for SPURR ticks on idle CPUs

2020-02-20 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

Via idle_loop_prolog(), idle_loop_epilog(), we track the idle PURR
ticks in the VPA variable "wait_state_cycles". This patch extends the
support to account for the idle SPURR ticks. It also provides an
accessor function to accurately read the idle SPURR ticks.
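
The snapshot-and-accumulate scheme can be modelled in userspace as shown
below. This is only a sketch of the idea in the diff: the SPURR value is
passed in as a plain argument instead of being read with mfspr(), the
per-CPU variables are replaced by a struct, and all names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU idle SPURR state. */
struct idle_spurr_acct {
	uint64_t entry_snap;	/* SPURR value at idle entry */
	uint64_t idle_cycles;	/* accumulated idle SPURR ticks */
	bool idle;		/* true between idle entry and idle exit */
};

/* Idle entry: remember the current SPURR value. */
void idle_enter(struct idle_spurr_acct *a, uint64_t spurr_now)
{
	a->entry_snap = spurr_now;
	a->idle = true;
}

/* Idle exit: add the ticks spent in this idle period. */
void idle_exit(struct idle_spurr_acct *a, uint64_t spurr_now)
{
	a->idle_cycles += spurr_now - a->entry_snap;
	a->idle = false;
}

/*
 * Accessor: if called while still idle, fold in the ticks so far and
 * take a fresh snapshot so they are not counted twice at idle exit.
 */
uint64_t read_idle_spurr(struct idle_spurr_acct *a, uint64_t spurr_now)
{
	if (a->idle) {
		a->idle_cycles += spurr_now - a->entry_snap;
		a->entry_snap = spurr_now;
	}
	return a->idle_cycles;
}
```

Entering idle at tick 100, reading at 150 and leaving at 180 accumulates
50 ticks at the read and a further 30 at exit, 80 in total.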

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 33 +
 arch/powerpc/platforms/pseries/setup.c |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
index 126a217..db82fc1 100644
--- a/arch/powerpc/include/asm/idle.h
+++ b/arch/powerpc/include/asm/idle.h
@@ -2,13 +2,20 @@
 #define _ASM_POWERPC_IDLE_H
 #include 
 
+DECLARE_PER_CPU(u64, idle_spurr_cycles);
 DECLARE_PER_CPU(u64, idle_entry_purr_snap);
+DECLARE_PER_CPU(u64, idle_entry_spurr_snap);
 
 static inline void snapshot_purr_idle_entry(void)
 {
*this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
 }
 
+static inline void snapshot_spurr_idle_entry(void)
+{
+   *this_cpu_ptr(&idle_entry_spurr_snap) = mfspr(SPRN_SPURR);
+}
+
 static inline void update_idle_purr_accounting(void)
 {
u64 wait_cycles;
@@ -19,10 +26,19 @@ static inline void update_idle_purr_accounting(void)
get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
 }
 
+static inline void update_idle_spurr_accounting(void)
+{
+   u64 *idle_spurr_cycles_ptr = this_cpu_ptr(&idle_spurr_cycles);
+   u64 in_spurr = *this_cpu_ptr(&idle_entry_spurr_snap);
+
+   *idle_spurr_cycles_ptr += mfspr(SPRN_SPURR) - in_spurr;
+}
+
 static inline void idle_loop_prolog(void)
 {
ppc64_runlatch_off();
snapshot_purr_idle_entry();
+   snapshot_spurr_idle_entry();
/*
 * Indicate to the HV that we are idle. Now would be
 * a good time to find other work to dispatch.
@@ -33,6 +49,7 @@ static inline void idle_loop_prolog(void)
 static inline void idle_loop_epilog(void)
 {
update_idle_purr_accounting();
+   update_idle_spurr_accounting();
get_lppaca()->idle = 0;
ppc64_runlatch_on();
 }
@@ -52,4 +69,20 @@ static inline u64 read_this_idle_purr(void)
 
return be64_to_cpu(get_lppaca()->wait_state_cycles);
 }
+
+static inline u64 read_this_idle_spurr(void)
+{
+   /*
+* If we are reading from an idle context, update the
+* idle-spurr cycles corresponding to the last idle period.
+* Since the idle context is not yet over, take a fresh
+* snapshot of the idle-spurr.
+*/
+   if (get_lppaca()->idle == 1) {
+   update_idle_spurr_accounting();
+   snapshot_spurr_idle_entry();
+   }
+
+   return *this_cpu_ptr(&idle_spurr_cycles);
+}
 #endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index e9f2cefa..5ef5c82 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -318,7 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+DEFINE_PER_CPU(u64, idle_spurr_cycles);
 DEFINE_PER_CPU(u64, idle_entry_purr_snap);
+DEFINE_PER_CPU(u64, idle_entry_spurr_snap);
 static void pseries_lpar_idle(void)
 {
/*
-- 
1.9.4



[PATCH v2 0/5] Track and expose idle PURR and SPURR ticks

2020-02-20 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This is the second version of the patches to track and expose idle
PURR and SPURR ticks. These patches are required by tools such as
lparstat to compute system utilization for capacity planning purposes.

v1 can be found here: https://lore.kernel.org/patchwork/cover/1159341/

The key changes from v1 are

- The sysfs reads of idle PURR and SPURR now send an
  smp_call_function to the target CPU in order to read the most
  recent value of idle PURR and SPURR. This is required if the
  target CPU was idle for a long duration, in which case the
  cycles corresponding to its latest idle duration would not be
  updated in the variable tracking idle PURR/SPURR. Thus merely
  reading the variable would not reflect the most accurate idle
  PURR/SPURR ticks.

- Ensured that even when idle PURR/SPURR values are read in an
  interrupt context in-between idle_loop_prolog() and
  idle_loop_epilog(), we return the value that includes the cycles
  spent in the most recent idle period.

- The sysfs files for idle_purr and idle_spurr are created only
  when the FW_FEATURE_LPAR is enabled (the earlier version was
  checking for FW_FEATURE_SPLPAR)

Motivation:
===
On PSeries LPARs, data center planners desire a more accurate
view of system utilization per resource such as CPU to plan the system
capacity requirements better. Such accuracy can be obtained by reading
PURR/SPURR registers for CPU resource utilization.

Tools such as lparstat which are used to compute the utilization need
to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
counters are already exposed through sysfs.  We already account for
PURR ticks when we go to idle so that we can update the VPA area. This
patchset extends support to account for SPURR ticks when idle, and
expose both via per-cpu sysfs files.
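
As a sketch of how a tool might combine the total and idle tick
counters: the formula below (busy = total delta minus idle delta, over
one sampling interval) is an assumption made for illustration and is not
taken from lparstat. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Percentage of ticks in a sampling interval that were not idle.
 * Assumed formula: 100 * (total_delta - idle_delta) / total_delta.
 * Returns 0.0 if no ticks elapsed or the inputs are inconsistent.
 */
double busy_percent(uint64_t total_start, uint64_t total_end,
		    uint64_t idle_start, uint64_t idle_end)
{
	uint64_t total = total_end - total_start;
	uint64_t idle = idle_end - idle_start;

	if (total == 0 || idle > total)
		return 0.0;
	return 100.0 * (double)(total - idle) / (double)total;
}
```

The start and end values would come from two reads of the per-cpu sysfs
files (for example "purr" and "idle_purr") taken one interval apart.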

These patches are required for an enhancement to the lparstat utility
that computes the CPU utilization based on PURR and SPURR, which can be
found here:
https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4

With the patches, when lparstat is run on an LPAR running CPU-Hogs,
=
$sudo ./src/lparstat -E 1 3
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 
---Actual---                 -Normalized-
%busy  %idle   Frequency     %busy  %idle
------ ------  ------------- ------ ------
 99.99   0.00  3.35GHz[111%] 110.99   0.00
100.00   0.00  3.35GHz[111%] 111.00   0.00
100.00   0.00  3.35GHz[111%] 111.00   0.00
=

When lparstat is run on an LPAR that is idle,
=
$ sudo ./src/lparstat -E 1 3
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=2 mem=4834176 kB cpus=0 ent=2.00 
---Actual---                 -Normalized-
%busy  %idle   Frequency     %busy  %idle
------ ------  ------------- ------ ------
  0.09  99.91  2.11GHz[ 70%]   0.11  69.90
  0.32  99.68  2.17GHz[ 72%]   0.25  71.75
  0.56  99.44  2.18GHz[ 72%]   0.42  71.58
=

Gautham R. Shenoy (5):
  powerpc: Move idle_loop_prolog()/epilog() functions to header file
  powerpc/idle: Add accessor function to always read latest idle PURR
  powerpc/pseries: Account for SPURR ticks on idle CPUs
  powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
  Documentation: Document sysfs interfaces purr, spurr, idle_purr,
idle_spurr

 Documentation/ABI/testing/sysfs-devices-system-cpu | 39 ++
 arch/powerpc/include/asm/idle.h| 88 ++
 arch/powerpc/kernel/sysfs.c| 54 -
 arch/powerpc/platforms/pseries/setup.c |  8 +-
 drivers/cpuidle/cpuidle-pseries.c  | 39 ++
 5 files changed, 191 insertions(+), 37 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle.h

-- 
1.9.4



[PATCH v2 4/5] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-02-20 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files "purr" and "spurr". This patch adds support for exposing
the idle PURR and SPURR ticks via new per-cpu sysfs files named
"idle_purr" and "idle_spurr".

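For readers who want to consume the new files from userspace, here is a
minimal sketch of parsing their contents. It assumes only what the patch
below shows: each file prints a single hexadecimal value followed by a
newline ("%llx\n"). The helper name is hypothetical; the files are created
with mode 0400, so reading them is expected to need root.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse the text of one counter file, e.g. "1a2b3c\n", as printed by
 * the "%llx\n" format in idle_purr_show()/idle_spurr_show().
 * Returns 0 and stores the value in *out on success, -1 on malformed input.
 */
int parse_hex_counter(const char *text, uint64_t *out)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(text, &end, 16);
    if (errno != 0 || end == text)
        return -1;              /* overflow, or no hex digits at all */
    if (*end != '\n' && *end != '\0')
        return -1;              /* trailing garbage after the number */
    *out = (uint64_t)v;
    return 0;
}
```

A caller would read the file into a small buffer with fgets() and pass
the buffer to this helper.
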
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/sysfs.c | 54 ++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 80a676d..5b4b450 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cacheinfo.h"
@@ -733,6 +734,42 @@ static void create_svm_file(void)
 }
 #endif /* CONFIG_PPC_SVM */
 
+static void read_idle_purr(void *val)
+{
+   u64 *ret = (u64 *)val;
+
+   *ret = read_this_idle_purr();
+}
+
+static ssize_t idle_purr_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   u64 val;
+
+   smp_call_function_single(cpu->dev.id, read_idle_purr, &val, 1);
+   return sprintf(buf, "%llx\n", val);
+}
+static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL);
+
+static void read_idle_spurr(void *val)
+{
+   u64 *ret = (u64 *)val;
+
+   *ret = read_this_idle_spurr();
+}
+
+static ssize_t idle_spurr_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   u64 val;
+
+   smp_call_function_single(cpu->dev.id, read_idle_spurr, &val, 1);
+   return sprintf(buf, "%llx\n", val);
+}
+static DEVICE_ATTR(idle_spurr, 0400, idle_spurr_show, NULL);
+
 static int register_cpu_online(unsigned int cpu)
 {
struct cpu *c = &per_cpu(cpu_devices, cpu);
@@ -794,10 +831,15 @@ static int register_cpu_online(unsigned int cpu)
if (!firmware_has_feature(FW_FEATURE_LPAR))
add_write_permission_dev_attr(&dev_attr_purr);
device_create_file(s, &dev_attr_purr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_create_file(s, &dev_attr_idle_purr);
}
 
-   if (cpu_has_feature(CPU_FTR_SPURR))
+   if (cpu_has_feature(CPU_FTR_SPURR)) {
device_create_file(s, &dev_attr_spurr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_create_file(s, &dev_attr_idle_spurr);
+   }
 
if (cpu_has_feature(CPU_FTR_DSCR))
device_create_file(s, &dev_attr_dscr);
@@ -879,11 +921,17 @@ static int unregister_cpu_online(unsigned int cpu)
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, &dev_attr_mmcra);
 
-   if (cpu_has_feature(CPU_FTR_PURR))
+   if (cpu_has_feature(CPU_FTR_PURR)) {
device_remove_file(s, &dev_attr_purr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_remove_file(s, &dev_attr_idle_purr);
+   }
 
-   if (cpu_has_feature(CPU_FTR_SPURR))
+   if (cpu_has_feature(CPU_FTR_SPURR)) {
device_remove_file(s, &dev_attr_spurr);
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   device_remove_file(s, &dev_attr_idle_spurr);
+   }
 
if (cpu_has_feature(CPU_FTR_DSCR))
device_remove_file(s, &dev_attr_dscr);
-- 
1.9.4



[PATCH v2 1/5] powerpc: Move idle_loop_prolog()/epilog() functions to header file

2020-02-20 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently, prior to entering an idle state on a Linux guest, the
pseries cpuidle driver implements the idle_loop_prolog() and
idle_loop_epilog() functions, which ensure that idle_purr is correctly
computed and that the hypervisor is informed that the CPU cycles have
been donated.

These prolog and epilog functions are also required in the default
idle call, i.e. pseries_lpar_idle(). Hence move these accessor
functions to a common header file and call them from
pseries_lpar_idle(). Since the existing header files such as
asm/processor.h have enough clutter, create a new header file
asm/idle.h.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 27 +++
 arch/powerpc/platforms/pseries/setup.c |  7 +--
 drivers/cpuidle/cpuidle-pseries.c  | 24 +---
 3 files changed, 33 insertions(+), 25 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle.h

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
new file mode 100644
index 000..f32a7d8
--- /dev/null
+++ b/arch/powerpc/include/asm/idle.h
@@ -0,0 +1,27 @@
+#ifndef _ASM_POWERPC_IDLE_H
+#define _ASM_POWERPC_IDLE_H
+#include 
+
+static inline void idle_loop_prolog(unsigned long *in_purr)
+{
+   ppc64_runlatch_off();
+   *in_purr = mfspr(SPRN_PURR);
+   /*
+* Indicate to the HV that we are idle. Now would be
+* a good time to find other work to dispatch.
+*/
+   get_lppaca()->idle = 1;
+}
+
+static inline void idle_loop_epilog(unsigned long in_purr)
+{
+   u64 wait_cycles;
+
+   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
+   wait_cycles += mfspr(SPRN_PURR) - in_purr;
+   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+   get_lppaca()->idle = 0;
+
+   ppc64_runlatch_on();
+}
+#endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 0c8421d..ffd4d59 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -319,6 +320,8 @@ static int alloc_dispatch_log_kmem_cache(void)
 
 static void pseries_lpar_idle(void)
 {
+   unsigned long in_purr;
+
/*
 * Default handler to go into low thread priority and possibly
 * low power mode by ceding processor to hypervisor
@@ -328,7 +331,7 @@ static void pseries_lpar_idle(void)
return;
 
/* Indicate to hypervisor that we are idle. */
-   get_lppaca()->idle = 1;
+   idle_loop_prolog(&in_purr);
 
/*
 * Yield the processor to the hypervisor.  We return if
@@ -339,7 +342,7 @@ static void pseries_lpar_idle(void)
 */
cede_processor();
 
-   get_lppaca()->idle = 0;
+   idle_loop_epilog(in_purr);
 }
 
 /*
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..fc9dee9c 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct cpuidle_driver pseries_idle_driver = {
@@ -31,29 +32,6 @@ struct cpuidle_driver pseries_idle_driver = {
 static u64 snooze_timeout __read_mostly;
 static bool snooze_timeout_en __read_mostly;
 
-static inline void idle_loop_prolog(unsigned long *in_purr)
-{
-   ppc64_runlatch_off();
-   *in_purr = mfspr(SPRN_PURR);
-   /*
-* Indicate to the HV that we are idle. Now would be
-* a good time to find other work to dispatch.
-*/
-   get_lppaca()->idle = 1;
-}
-
-static inline void idle_loop_epilog(unsigned long in_purr)
-{
-   u64 wait_cycles;
-
-   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
-   wait_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
-   get_lppaca()->idle = 0;
-
-   ppc64_runlatch_on();
-}
-
 static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
-- 
1.9.4



[PATCH v2 2/5] powerpc/idle: Add accessor function to always read latest idle PURR

2020-02-20 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently, when a CPU goes idle, we take a snapshot of PURR via
idle_loop_prolog(), which is used at CPU idle exit to compute the
idle PURR cycles via the function idle_loop_epilog(). Thus, the value
of the idle PURR cycles read before idle_loop_prolog() or after
idle_loop_epilog() is always correct.

However, if we were to read the idle PURR cycles from an interrupt
context between idle_loop_prolog() and idle_loop_epilog() (this will
be done in a future patch), then, the value of the idle PURR thus read
will not include the cycles spent in the most recent idle period.

This patch addresses the issue by providing an accessor function to read
the idle PURR such that it includes the cycles spent in the most
recent idle period, if we read it between idle_loop_prolog() and
idle_loop_epilog(). In order to achieve it, the patch saves the
snapshot of PURR in idle_loop_prolog() in a per-cpu variable, instead
of on the stack, so that it can be accessed from an interrupt context.
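
The accounting scheme described above can be modelled in plain userspace
C to see why a mid-idle read does not double count. This is only a sketch:
the struct and function names are hypothetical, and a plain field stands
in for the PURR register, the per-cpu snapshot and the lppaca fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of one CPU's idle accounting (all names hypothetical). */
struct idle_model {
    uint64_t purr;         /* stands in for mfspr(SPRN_PURR) */
    uint64_t entry_snap;   /* stands in for the per-cpu PURR snapshot */
    uint64_t wait_cycles;  /* stands in for lppaca wait_state_cycles */
    bool idle;             /* stands in for the lppaca idle flag */
};

/* Add the cycles elapsed since the last snapshot to the idle total. */
static void model_account(struct idle_model *m)
{
    m->wait_cycles += m->purr - m->entry_snap;
}

/* Idle entry: remember where PURR stood when the idle period began. */
void model_prolog(struct idle_model *m)
{
    m->entry_snap = m->purr;
    m->idle = true;
}

/* Idle exit: account whatever has elapsed since the latest snapshot. */
void model_epilog(struct idle_model *m)
{
    model_account(m);
    m->idle = false;
}

/*
 * Read from "interrupt context": if the CPU is mid-idle, fold in the
 * cycles so far and take a fresh snapshot, so that the later epilog
 * only adds the remainder of the idle period.
 */
uint64_t model_read_idle(struct idle_model *m)
{
    if (m->idle) {
        model_account(m);
        m->entry_snap = m->purr;
    }
    return m->wait_cycles;
}
```

Re-taking the snapshot inside the read is what keeps the total exact: the
cycles before the read are added once by the read, and the cycles after
it are added once by the epilog.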

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/idle.h| 46 +++---
 arch/powerpc/platforms/pseries/setup.c |  7 +++---
 drivers/cpuidle/cpuidle-pseries.c  | 15 +--
 3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/idle.h b/arch/powerpc/include/asm/idle.h
index f32a7d8..126a217 100644
--- a/arch/powerpc/include/asm/idle.h
+++ b/arch/powerpc/include/asm/idle.h
@@ -2,10 +2,27 @@
 #define _ASM_POWERPC_IDLE_H
 #include 
 
-static inline void idle_loop_prolog(unsigned long *in_purr)
+DECLARE_PER_CPU(u64, idle_entry_purr_snap);
+
+static inline void snapshot_purr_idle_entry(void)
+{
+   *this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
+}
+
+static inline void update_idle_purr_accounting(void)
+{
+   u64 wait_cycles;
+   u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap);
+
+   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
+   wait_cycles += mfspr(SPRN_PURR) - in_purr;
+   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+}
+
+static inline void idle_loop_prolog(void)
 {
ppc64_runlatch_off();
-   *in_purr = mfspr(SPRN_PURR);
+   snapshot_purr_idle_entry();
/*
 * Indicate to the HV that we are idle. Now would be
 * a good time to find other work to dispatch.
@@ -13,15 +30,26 @@ static inline void idle_loop_prolog(unsigned long *in_purr)
get_lppaca()->idle = 1;
 }
 
-static inline void idle_loop_epilog(unsigned long in_purr)
+static inline void idle_loop_epilog(void)
 {
-   u64 wait_cycles;
-
-   wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
-   wait_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
+   update_idle_purr_accounting();
get_lppaca()->idle = 0;
-
ppc64_runlatch_on();
 }
+
+static inline u64 read_this_idle_purr(void)
+{
+   /*
+* If we are reading from an idle context, update the
+* idle-purr cycles corresponding to the last idle period.
+* Since the idle context is not yet over, take a fresh
+* snapshot of the idle-purr.
+*/
+   if (unlikely(get_lppaca()->idle == 1)) {
+   update_idle_purr_accounting();
+   snapshot_purr_idle_entry();
+   }
+
+   return be64_to_cpu(get_lppaca()->wait_state_cycles);
+}
 #endif
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index ffd4d59..e9f2cefa 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -318,10 +318,9 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+DEFINE_PER_CPU(u64, idle_entry_purr_snap);
 static void pseries_lpar_idle(void)
 {
-   unsigned long in_purr;
-
/*
 * Default handler to go into low thread priority and possibly
 * low power mode by ceding processor to hypervisor
@@ -331,7 +330,7 @@ static void pseries_lpar_idle(void)
return;
 
/* Indicate to hypervisor that we are idle. */
-   idle_loop_prolog(&in_purr);
+   idle_loop_prolog();
 
/*
 * Yield the processor to the hypervisor.  We return if
@@ -342,7 +341,7 @@ static void pseries_lpar_idle(void)
 */
cede_processor();
 
-   idle_loop_epilog(in_purr);
+   idle_loop_epilog();
 }
 
 /*
diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index fc9dee9c..98d3832 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -36,12 +36,11 @@ static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
-   unsigned long in_purr;
u64 snooze_exit_time;
 
set_thread_flag(TIF_POLLING_NRFLAG);
 
-   

[PATCH v3 03/27] powerpc: Map & release OpenCAPI LPC memory

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch adds platform support to map & release LPC memory.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/include/asm/pnv-ocxl.h   |  4 +++
 arch/powerpc/platforms/powernv/ocxl.c | 43 +++
 2 files changed, 47 insertions(+)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
b/arch/powerpc/include/asm/pnv-ocxl.h
index 7de82647e761..0b2a6707e555 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -32,5 +32,9 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void 
*platform_data, int pe_handle)
 
 extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
 extern void pnv_ocxl_free_xive_irq(u32 irq);
+#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
+void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
+#endif
 
 #endif /* _ASM_PNV_OCXL_H */
diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c8..f2edbcc67361 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -475,6 +475,49 @@ void pnv_ocxl_spa_release(void *platform_data)
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
 
+#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
+{
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   u32 bdfn = pci_dev_id(pdev);
+   __be64 base_addr_be64;
+   u64 base_addr;
+   int rc;
+
+   rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size, &base_addr_be64);
+   if (rc) {
+   dev_warn(&pdev->dev,
+"OPAL could not allocate LPC memory, rc=%d\n", rc);
+   return 0;
+   }
+
+   base_addr = be64_to_cpu(base_addr_be64);
+
+   rc = check_hotplug_memory_addressable(base_addr >> PAGE_SHIFT,
+ size >> PAGE_SHIFT);
+   if (rc)
+   return 0;
+
+   return base_addr;
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
+
+void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
+{
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   u32 bdfn = pci_dev_id(pdev);
+   int rc;
+
+   rc = opal_npu_mem_release(phb->opal_id, bdfn);
+   if (rc)
+   dev_warn(&pdev->dev,
+"OPAL reported rc=%d when releasing LPC memory\n", rc);
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
+#endif
+
 int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
 {
struct spa_data *data = (struct spa_data *) platform_data;
-- 
2.24.1



[PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

Function declarations don't need externs; remove the existing ones
so they are consistent with newer code.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/include/asm/pnv-ocxl.h | 32 ++---
 include/misc/ocxl.h |  6 +++---
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
b/arch/powerpc/include/asm/pnv-ocxl.h
index 0b2a6707e555..b23c99bc0c84 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -9,29 +9,27 @@
 #define PNV_OCXL_TL_BITS_PER_RATE   4
 #define PNV_OCXL_TL_RATE_BUF_SIZE   ((PNV_OCXL_TL_MAX_TEMPLATE+1) * 
PNV_OCXL_TL_BITS_PER_RATE / 8)
 
-extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled,
-   u16 *supported);
-extern int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
+int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled, u16 
*supported);
+int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
 
-extern int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
+int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
char *rate_buf, int rate_buf_size);
-extern int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
+int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
uint64_t rate_buf_phys, int rate_buf_size);
 
-extern int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq);
-extern void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
-   void __iomem *tfc, void __iomem *pe_handle);
-extern int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
-   void __iomem **dar, void __iomem **tfc,
-   void __iomem **pe_handle);
+int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq);
+void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
+void __iomem *tfc, void __iomem *pe_handle);
+int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
+ void __iomem **dar, void __iomem **tfc,
+ void __iomem **pe_handle);
 
-extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask,
-   void **platform_data);
-extern void pnv_ocxl_spa_release(void *platform_data);
-extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int 
pe_handle);
+int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask, void 
**platform_data);
+void pnv_ocxl_spa_release(void *platform_data);
+int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
 
-extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
-extern void pnv_ocxl_free_xive_irq(u32 irq);
+int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
+void pnv_ocxl_free_xive_irq(u32 irq);
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
 void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 06dd5839e438..0a762e387418 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -173,7 +173,7 @@ int ocxl_context_detach(struct ocxl_context *ctx);
  *
  * Returns 0 on success, negative on failure
  */
-extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
+int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
 
 /**
  * Frees an IRQ associated with an AFU context
@@ -182,7 +182,7 @@ extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int 
*irq_id);
  *
  * Returns 0 on success, negative on failure
  */
-extern int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id);
+int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id);
 
 /**
  * Gets the address of the trigger page for an IRQ
@@ -193,7 +193,7 @@ extern int ocxl_afu_irq_free(struct ocxl_context *ctx, int 
irq_id);
  *
  * returns the trigger page address, or 0 if the IRQ is not valid
  */
-extern u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id);
+u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id);
 
 /**
  * Provide a callback to be called when an IRQ is triggered
-- 
2.24.1



[PATCH v3 16/27] powerpc/powernv/pmem: Register a character device for userspace to interact with

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch introduces a character device (/dev/ocxl-scmX) which further
patches will use to interact with userspace.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 116 +-
 .../platforms/powernv/pmem/ocxl_internal.h|   2 +
 2 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index b8bd7e703b19..63109a870d2c 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "ocxl_internal.h"
@@ -339,6 +340,9 @@ static void free_ocxlpmem(struct ocxlpmem *ocxlpmem)
 
free_minor(ocxlpmem);
 
+   if (ocxlpmem->cdev.owner)
+   cdev_del(&ocxlpmem->cdev);
+
if (ocxlpmem->metadata_addr)
devm_memunmap(&ocxlpmem->dev, ocxlpmem->metadata_addr);
 
@@ -396,6 +400,70 @@ static int ocxlpmem_register(struct ocxlpmem *ocxlpmem)
return device_register(&ocxlpmem->dev);
 }
 
+static void ocxlpmem_put(struct ocxlpmem *ocxlpmem)
+{
+   put_device(&ocxlpmem->dev);
+}
+
+static struct ocxlpmem *ocxlpmem_get(struct ocxlpmem *ocxlpmem)
+{
+   return (get_device(&ocxlpmem->dev) == NULL) ? NULL : ocxlpmem;
+}
+
+static struct ocxlpmem *find_and_get_ocxlpmem(dev_t devno)
+{
+   struct ocxlpmem *ocxlpmem;
+   int minor = MINOR(devno);
+   /*
+* We don't declare an RCU critical section here, as our AFU
+* is protected by a reference counter on the device. By the time the
+* minor number of a device is removed from the idr, the ref count of
+* the device is already at 0, so no user API will access that AFU and
+* this function can't return it.
+*/
+   ocxlpmem = idr_find(_idr, minor);
+   if (ocxlpmem)
+   ocxlpmem_get(ocxlpmem);
+   return ocxlpmem;
+}
+
+static int file_open(struct inode *inode, struct file *file)
+{
+   struct ocxlpmem *ocxlpmem;
+
+   ocxlpmem = find_and_get_ocxlpmem(inode->i_rdev);
+   if (!ocxlpmem)
+   return -ENODEV;
+
+   file->private_data = ocxlpmem;
+   return 0;
+}
+
+static int file_release(struct inode *inode, struct file *file)
+{
+   struct ocxlpmem *ocxlpmem = file->private_data;
+
+   ocxlpmem_put(ocxlpmem);
+   return 0;
+}
+
+static const struct file_operations fops = {
+   .owner  = THIS_MODULE,
+   .open   = file_open,
+   .release= file_release,
+};
+
+/**
+ * create_cdev() - Create the chardev in /dev for the device
+ * @ocxlpmem: the SCM metadata
+ * Return: 0 on success, negative on failure
+ */
+static int create_cdev(struct ocxlpmem *ocxlpmem)
+{
+   cdev_init(&ocxlpmem->cdev, &fops);
+   return cdev_add(&ocxlpmem->cdev, ocxlpmem->dev.devt, 1);
+}
+
 /**
  * ocxlpmem_remove() - Free an OpenCAPI persistent memory device
  * @pdev: the PCI device information struct
@@ -572,6 +640,11 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err;
}
 
+   if (create_cdev(ocxlpmem)) {
+   dev_err(>dev, "Could not create character device\n");
+   goto err;
+   }
+
elapsed = 0;
timeout = ocxlpmem->readiness_timeout + 
ocxlpmem->memory_available_timeout;
while (!is_usable(ocxlpmem, false)) {
@@ -613,20 +686,59 @@ static struct pci_driver pci_driver = {
.shutdown = ocxlpmem_remove,
 };
 
+static int file_init(void)
+{
+   int rc;
+
+   mutex_init(_idr_lock);
+   idr_init(_idr);
+
+   rc = alloc_chrdev_region(&ocxlpmem_dev, 0, NUM_MINORS, "ocxl-pmem");
+   if (rc) {
+   idr_destroy(_idr);
+   pr_err("Unable to allocate OpenCAPI persistent memory major 
number: %d\n", rc);
+   return rc;
+   }
+
+   ocxlpmem_class = class_create(THIS_MODULE, "ocxl-pmem");
+   if (IS_ERR(ocxlpmem_class)) {
+   idr_destroy(_idr);
+   pr_err("Unable to create ocxl-pmem class\n");
+   unregister_chrdev_region(ocxlpmem_dev, NUM_MINORS);
+   return PTR_ERR(ocxlpmem_class);
+   }
+
+   return 0;
+}
+
+static void file_exit(void)
+{
+   class_destroy(ocxlpmem_class);
+   unregister_chrdev_region(ocxlpmem_dev, NUM_MINORS);
+   idr_destroy(_idr);
+}
+
 static int __init ocxlpmem_init(void)
 {
-   int rc = 0;
+   int rc;
 
-   rc = pci_register_driver(&pci_driver);
+   rc = file_init();
if (rc)
return rc;
 
+   rc = pci_register_driver(&pci_driver);
+   if (rc) {
+   file_exit();
+   return rc;
+   }
+
return 0;
 }
 
 static void ocxlpmem_exit(void)
 {
pci_unregister_driver(&pci_driver);
+   file_exit();
 }
 
 module_init(ocxlpmem_init);
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h 

[PATCH v3 12/27] powerpc/powernv/pmem: Add register addresses & status values to the header

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

These values have been taken from the device specifications.

Signed-off-by: Alastair D'Silva 
---
 .../platforms/powernv/pmem/ocxl_internal.h| 72 +++
 1 file changed, 72 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h 
b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
index 0faf3740e9b8..9cf3e42750e7 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
@@ -8,6 +8,78 @@
 
 #define LABEL_AREA_SIZE (1UL << PA_SECTION_SHIFT)
 
+#define GLOBAL_MMIO_CHI         0x000
+#define GLOBAL_MMIO_CHIC        0x008
+#define GLOBAL_MMIO_CHIE        0x010
+#define GLOBAL_MMIO_CHIEC       0x018
+#define GLOBAL_MMIO_HCI         0x020
+#define GLOBAL_MMIO_HCIC        0x028
+#define GLOBAL_MMIO_IMA0_OHP    0x040
+#define GLOBAL_MMIO_IMA0_CFP    0x048
+#define GLOBAL_MMIO_IMA1_OHP    0x050
+#define GLOBAL_MMIO_IMA1_CFP    0x058
+#define GLOBAL_MMIO_ACMA_CREQO  0x100
+#define GLOBAL_MMIO_ACMA_CRSPO  0x104
+#define GLOBAL_MMIO_ACMA_CDBO   0x108
+#define GLOBAL_MMIO_ACMA_CDBS   0x10c
+#define GLOBAL_MMIO_NSCMA_CREQO 0x120
+#define GLOBAL_MMIO_NSCMA_CRSPO 0x124
+#define GLOBAL_MMIO_NSCMA_CDBO  0x128
+#define GLOBAL_MMIO_NSCMA_CDBS  0x12c
+#define GLOBAL_MMIO_CSTS        0x140
+#define GLOBAL_MMIO_FWVER       0x148
+#define GLOBAL_MMIO_CCAP0       0x160
+#define GLOBAL_MMIO_CCAP1       0x168
+
+#define GLOBAL_MMIO_CHI_ACRA    BIT_ULL(0)
+#define GLOBAL_MMIO_CHI_NSCRA   BIT_ULL(1)
+#define GLOBAL_MMIO_CHI_CRDY    BIT_ULL(4)
+#define GLOBAL_MMIO_CHI_CFFS    BIT_ULL(5)
+#define GLOBAL_MMIO_CHI_MA      BIT_ULL(6)
+#define GLOBAL_MMIO_CHI_ELA     BIT_ULL(7)
+#define GLOBAL_MMIO_CHI_CDA     BIT_ULL(8)
+#define GLOBAL_MMIO_CHI_CHFS    BIT_ULL(9)
+
+#define GLOBAL_MMIO_CHI_ALL     (GLOBAL_MMIO_CHI_ACRA | \
+                                 GLOBAL_MMIO_CHI_NSCRA | \
+                                 GLOBAL_MMIO_CHI_CRDY | \
+                                 GLOBAL_MMIO_CHI_CFFS | \
+                                 GLOBAL_MMIO_CHI_MA | \
+                                 GLOBAL_MMIO_CHI_ELA | \
+                                 GLOBAL_MMIO_CHI_CDA | \
+                                 GLOBAL_MMIO_CHI_CHFS)
+
+#define GLOBAL_MMIO_HCI_ACRW                      BIT_ULL(0)
+#define GLOBAL_MMIO_HCI_NSCRW                     BIT_ULL(1)
+#define GLOBAL_MMIO_HCI_AFU_RESET                 BIT_ULL(2)
+#define GLOBAL_MMIO_HCI_FW_DEBUG                  BIT_ULL(3)
+#define GLOBAL_MMIO_HCI_CONTROLLER_DUMP           BIT_ULL(4)
+#define GLOBAL_MMIO_HCI_CONTROLLER_DUMP_COLLECTED BIT_ULL(5)
+#define GLOBAL_MMIO_HCI_REQ_HEALTH_PERF           BIT_ULL(6)
+
+#define ADMIN_COMMAND_HEARTBEAT        0x00u
+#define ADMIN_COMMAND_SHUTDOWN         0x01u
+#define ADMIN_COMMAND_FW_UPDATE        0x02u
+#define ADMIN_COMMAND_FW_DEBUG         0x03u
+#define ADMIN_COMMAND_ERRLOG           0x04u
+#define ADMIN_COMMAND_SMART            0x05u
+#define ADMIN_COMMAND_CONTROLLER_STATS 0x06u
+#define ADMIN_COMMAND_CONTROLLER_DUMP  0x07u
+#define ADMIN_COMMAND_CMD_CAPS         0x08u
+#define ADMIN_COMMAND_MAX              0x08u
+
+#define STATUS_SUCCESS           0x00
+#define STATUS_MEM_UNAVAILABLE   0x20
+#define STATUS_BAD_OPCODE        0x50
+#define STATUS_BAD_REQUEST_PARM  0x51
+#define STATUS_BAD_DATA_PARM     0x52
+#define STATUS_DEBUG_BLOCKED     0x70
+#define STATUS_FAIL              0xFF
+
+#define STATUS_FW_UPDATE_BLOCKED 0x21
+#define STATUS_FW_ARG_INVALID    0x51
+#define STATUS_FW_INVALID        0x52
+
 struct ocxlpmem_function0 {
struct pci_dev *pdev;
struct ocxl_fn *ocxl_fn;
-- 
2.24.1



[PATCH v3 21/27] powerpc/powernv/pmem: Add an IOCTL to request controller health & perf data

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

When health & performance data is requested from the controller,
it responds with an error log containing the requested information.

This patch allows the request to be issued via an IOCTL.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c | 16 
 include/uapi/nvdimm/ocxl-pmem.h|  1 +
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index e46696d3cc36..081883a8247a 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -1000,6 +1000,18 @@ static int ioctl_event_check(struct ocxlpmem *ocxlpmem, 
u64 __user *uarg)
return rc;
 }
 
+/**
+ * req_controller_health_perf() - Request controller health & performance data
+ * @ocxlpmem: the device metadata
+ * Return: 0 on success, negative on failure
+ */
+int req_controller_health_perf(struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+ OCXL_LITTLE_ENDIAN,
+ GLOBAL_MMIO_HCI_REQ_HEALTH_PERF);
+}
+
 static long file_ioctl(struct file *file, unsigned int cmd, unsigned long args)
 {
struct ocxlpmem *ocxlpmem = file->private_data;
@@ -1037,6 +1049,10 @@ static long file_ioctl(struct file *file, unsigned int 
cmd, unsigned long args)
case IOCTL_OCXL_PMEM_EVENT_CHECK:
rc = ioctl_event_check(ocxlpmem, (u64 __user *)args);
break;
+
+   case IOCTL_OCXL_PMEM_REQUEST_HEALTH:
+   rc = req_controller_health_perf(ocxlpmem);
+   break;
}
 
return rc;
diff --git a/include/uapi/nvdimm/ocxl-pmem.h b/include/uapi/nvdimm/ocxl-pmem.h
index 988eb0bc413d..0d03abb44001 100644
--- a/include/uapi/nvdimm/ocxl-pmem.h
+++ b/include/uapi/nvdimm/ocxl-pmem.h
@@ -90,5 +90,6 @@ struct ioctl_ocxl_pmem_eventfd {
 #define IOCTL_OCXL_PMEM_CONTROLLER_STATS   _IO(OCXL_PMEM_MAGIC, 0x05)
 #define IOCTL_OCXL_PMEM_EVENTFD    _IOW(OCXL_PMEM_MAGIC, 0x06, struct ioctl_ocxl_pmem_eventfd)
 #define IOCTL_OCXL_PMEM_EVENT_CHECK    _IOR(OCXL_PMEM_MAGIC, 0x07, __u64)
+#define IOCTL_OCXL_PMEM_REQUEST_HEALTH _IO(OCXL_PMEM_MAGIC, 0x08)
 
 #endif /* _UAPI_OCXL_SCM_H */
-- 
2.24.1



[PATCH v3 19/27] powerpc/powernv/pmem: Add an IOCTL to report controller statistics

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

The controller can report a number of statistics that are useful
in evaluating the performance and reliability of the card.

This patch exposes this information via an IOCTL.
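
The response header check performed in the patch below can be sketched
in isolation as follows. The patch shows the identifier being taken from
the top 16 bits and compared against 0x4353 ('CS'); the low-32-bit length
mask used here is an assumption, since the mask is truncated in the
archived copy of the diff. The function and macro names are hypothetical.

```c
#include <stdint.h>

#define STATS_ID_CS 0x4353u  /* ASCII 'C','S', the identifier checked in the patch */

/*
 * Split the 64-bit response header into its identifier (top 16 bits)
 * and payload length (assumed here to be the low 32 bits).
 * Returns 0 and stores the length on success, -1 on a bad identifier.
 */
int stats_header_parse(uint64_t header, uint32_t *length)
{
    uint16_t id = (uint16_t)(header >> 48);

    if (id != STATS_ID_CS)
        return -1;
    *length = (uint32_t)(header & 0xffffffffu);  /* assumed mask */
    return 0;
}
```

With this layout, a header of 0x4353000000000140 would yield the length
0x140 that the ioctl handler expects.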

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c | 185 +
 include/uapi/nvdimm/ocxl-pmem.h|  17 ++
 2 files changed, 202 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 2cabafe1fc58..009d4fd29e7d 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -758,6 +758,186 @@ static int ioctl_controller_dump_complete(struct ocxlpmem 
*ocxlpmem)
GLOBAL_MMIO_HCI_CONTROLLER_DUMP_COLLECTED);
 }
 
+/**
+ * controller_stats_header_parse() - Parse the first 64 bits of the controller stats admin command response
+ * @ocxlpmem: the device metadata
+ * @length: out, returns the number of bytes in the response (excluding the 64 bit header)
+ */
+static int controller_stats_header_parse(struct ocxlpmem *ocxlpmem,
+   u32 *length)
+{
+   int rc;
+   u64 val;
+
+   u16 data_identifier;
+   u32 data_length;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   data_identifier = val >> 48;
+   data_length = val & 0x;
+
+   if (data_identifier != 0x4353) { // 'CS'
+   dev_err(&ocxlpmem->dev,
+   "Bad data identifier for controller stats, expected 'CS', got '%-.*s'\n",
+   2, (char *)&data_identifier);
+   return -EINVAL;
+   }
+
+   *length = data_length;
+   return 0;
+}
+
+static int ioctl_controller_stats(struct ocxlpmem *ocxlpmem,
+ struct ioctl_ocxl_pmem_controller_stats 
__user *uarg)
+{
+   struct ioctl_ocxl_pmem_controller_stats args;
+   u32 length;
+   int rc;
+   u64 val;
+
+   memset(&args, '\0', sizeof(args));
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_CONTROLLER_STATS);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 
0x08,
+ OCXL_LITTLE_ENDIAN, 0);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+
+   rc = admin_command_complete_timeout(ocxlpmem,
+   ADMIN_COMMAND_CONTROLLER_STATS);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Controller stats timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem,
+   "Unexpected status from controller stats", rc);
+   goto out;
+   }
+
+   rc = controller_stats_header_parse(ocxlpmem, &length);
+   if (rc)
+   goto out;
+
+   if (length != 0x140)
+   warn_status(ocxlpmem,
+   "Unexpected length for controller stats data, 
expected 0x140, got 0x%x",
+   length);
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08 
+ 0x08,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   goto out;
+
+   args.reset_count = val >> 32;
+   args.reset_uptime = val & 0x;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08 
+ 0x10,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   goto out;
+
+   args.power_on_uptime = val >> 32;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08 
+ 0x40 + 0x08,
+OCXL_LITTLE_ENDIAN, _load_count);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08 
+ 0x40 + 0x10,
+OCXL_LITTLE_ENDIAN, 
_store_count);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08 
+ 0x40 + 0x18,
+OCXL_LITTLE_ENDIAN, 
_read_count);
+   if (rc)
+   goto out;
+
+   rc = 

[PATCH v3 22/27] powerpc/powernv/pmem: Implement the heartbeat command

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

The heartbeat admin command is a simple admin command that exercises
the communication mechanisms within the controller.

This patch issues a heartbeat command to the card during init to ensure
we can communicate with the card's controller.
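
The heartbeat follows the same request / execute / wait / read response / acknowledge sequence as the other admin commands. Below is a rough sketch of that sequence with the hardware replaced by a fake. The admin_ops table, run_admin_command() and fake_ctrl are all hypothetical names introduced here, and the locking done by the real driver is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical operations table standing in for the driver's admin_*
 * helpers; none of these names exist in the patch.
 */
struct admin_ops {
	int (*request)(void *ctx, uint8_t op_code);
	int (*execute)(void *ctx);
	int (*wait_complete)(void *ctx, uint8_t op_code);
	int (*response)(void *ctx);	/* negative error or controller status */
	int (*handled)(void *ctx);
};

/* Run one admin command; returns a negative error or the controller status. */
static int run_admin_command(const struct admin_ops *ops, void *ctx,
			     uint8_t op_code)
{
	int rc;

	rc = ops->request(ctx, op_code);
	if (rc)
		return rc;

	rc = ops->execute(ctx);
	if (rc)
		return rc;

	rc = ops->wait_complete(ctx, op_code);
	if (rc < 0)
		return rc;

	rc = ops->response(ctx);
	if (rc < 0)
		return rc;

	/* Acknowledge the response whatever the status was. */
	(void)ops->handled(ctx);
	return rc;
}

/* Fake controller used only to exercise the sequence above. */
struct fake_ctrl {
	int time_out;	/* non-zero: wait_complete reports a timeout */
	int steps;	/* number of operations invoked so far */
};

static int fake_request(void *ctx, uint8_t op_code)
{
	(void)op_code;
	((struct fake_ctrl *)ctx)->steps++;
	return 0;
}

static int fake_execute(void *ctx)
{
	((struct fake_ctrl *)ctx)->steps++;
	return 0;
}

static int fake_wait(void *ctx, uint8_t op_code)
{
	struct fake_ctrl *c = ctx;

	(void)op_code;
	c->steps++;
	return c->time_out ? -1 : 0;
}

static int fake_response(void *ctx)
{
	((struct fake_ctrl *)ctx)->steps++;
	return 0;	/* pretend the controller reported success */
}

static int fake_handled(void *ctx)
{
	((struct fake_ctrl *)ctx)->steps++;
	return 0;
}

static const struct admin_ops fake_ops = {
	fake_request, fake_execute, fake_wait, fake_response, fake_handled,
};
```

With a responsive fake all five steps run; when the wait step times out the sequence stops after three, mirroring the early `goto out` in heartbeat().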

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c | 43 ++
 1 file changed, 43 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 081883a8247a..e01f6f9fc180 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -306,6 +306,44 @@ static bool is_usable(const struct ocxlpmem *ocxlpmem, 
bool verbose)
return true;
 }
 
+/**
+ * heartbeat() - Issue a heartbeat command to the controller
+ * @ocxlpmem: the device metadata
+ * Return: 0 if the controller responded correctly, negative on error
+ */
+static int heartbeat(struct ocxlpmem *ocxlpmem)
+{
+   int rc;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_HEARTBEAT);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem, ADMIN_COMMAND_HEARTBEAT);
+   if (rc < 0) {
+   dev_err(&ocxlpmem->dev, "Heartbeat timeout\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS)
+   warn_status(ocxlpmem, "Unexpected status from heartbeat", rc);
+
+   (void)admin_response_handled(ocxlpmem);
+
+out:
+   mutex_unlock(&ocxlpmem->admin_command.lock);
+   return rc;
+}
+
 /**
  * allocate_minor() - Allocate a minor number to use for an OpenCAPI pmem 
device
  * @ocxlpmem: the device metadata
@@ -1458,6 +1496,11 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err;
}
 
+   if (heartbeat(ocxlpmem)) {
+   dev_err(&pdev->dev, "Heartbeat failed\n");
+   goto err;
+   }
+
elapsed = 0;
timeout = ocxlpmem->readiness_timeout + 
ocxlpmem->memory_available_timeout;
while (!is_usable(ocxlpmem, false)) {
-- 
2.24.1



[PATCH v3 25/27] powerpc/powernv/pmem: Expose the serial number in sysfs

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This information will be used by ndctl in userspace to help users identify
the device.
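
As an illustration of how a userspace tool might consume the attribute, here is a small sketch that reads one unsigned decimal value from a file, matching the "%llu\n" format emitted by serial_show() in the diff below. The function name read_serial() is made up, and the sysfs path is left to the caller because the device's location under /sys is not stated in this patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_serial(const char *path, unsigned long long *serial)
{
	FILE *f = fopen(path, "r");
	int parsed;

	if (!f)
		return -1;

	parsed = (fscanf(f, "%llu", serial) == 1);
	fclose(f);

	return parsed ? 0 : -1;
}
```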

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/Makefile  |  2 +-
 arch/powerpc/platforms/powernv/pmem/ocxl.c|  5 +++
 .../platforms/powernv/pmem/ocxl_internal.h|  6 +++
 .../platforms/powernv/pmem/ocxl_sysfs.c   | 37 +++
 4 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c

diff --git a/arch/powerpc/platforms/powernv/pmem/Makefile 
b/arch/powerpc/platforms/powernv/pmem/Makefile
index 4ceda25907d4..d02870806f30 100644
--- a/arch/powerpc/platforms/powernv/pmem/Makefile
+++ b/arch/powerpc/platforms/powernv/pmem/Makefile
@@ -4,4 +4,4 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
 
 obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
 
-ocxlpmem-y := ocxl.o ocxl_internal.o
+ocxlpmem-y := ocxl.o ocxl_internal.o ocxl_sysfs.o
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 5cd1b6d78dd6..ec73713d05ad 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -1878,6 +1878,11 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err;
}
 
+   if (ocxlpmem_sysfs_add(ocxlpmem)) {
+   dev_err(&pdev->dev, "Could not create sysfs entries\n");
+   goto err;
+   }
+
elapsed = 0;
timeout = ocxlpmem->readiness_timeout + 
ocxlpmem->memory_available_timeout;
while (!is_usable(ocxlpmem, false)) {
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h 
b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
index 0eb7a35d24ae..12304ceace61 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
@@ -246,3 +246,9 @@ int ns_response_handled(const struct ocxlpmem *ocxlpmem);
  */
 void warn_status(const struct ocxlpmem *ocxlpmem, const char *message,
 u8 status);
+
+/**
+ * ocxlpmem_sysfs_add() - Create sysfs entries for an OpenCAPI persistent 
memory device
+ * @ocxlpmem: the device metadata
+ */
+int ocxlpmem_sysfs_add(struct ocxlpmem *ocxlpmem);
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c
new file mode 100644
index ..7829e4bc887d
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0+
+// Copyright 2018 IBM Corp.
+
+#include 
+#include 
+#include 
+#include 
+#include "ocxl_internal.h"
+
+static ssize_t serial_show(struct device *device, struct device_attribute 
*attr,
+  char *buf)
+{
+   struct ocxlpmem *ocxlpmem = container_of(device, struct ocxlpmem, dev);
+   const struct ocxl_fn_config *fn_config = 
ocxl_function_config(ocxlpmem->ocxl_fn);
+
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", fn_config->serial);
+}
+
+static struct device_attribute attrs[] = {
+   __ATTR_RO(serial),
+};
+
+int ocxlpmem_sysfs_add(struct ocxlpmem *ocxlpmem)
+{
+   int i, rc;
+
+   for (i = 0; i < ARRAY_SIZE(attrs); i++) {
+           rc = device_create_file(&ocxlpmem->dev, &attrs[i]);
+           if (rc) {
+                   for (; --i >= 0;)
+                           device_remove_file(&ocxlpmem->dev, &attrs[i]);
+
+   return rc;
+   }
+   }
+   return 0;
+}
-- 
2.24.1



[PATCH v3 23/27] powerpc/powernv/pmem: Add debug IOCTLs

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

These IOCTLs provide low level access to the card to aid in debugging
controller/FPGA firmware.
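
The debug ioctl rejects user buffers that are not a multiple of 8 bytes or that would not fit in the admin command data area. A standalone sketch of that validation follows; the function name check_debug_buf_size() and the 32-bit parameter types are assumptions, since the uapi struct is not shown in this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Returns 0 if a buffer of buf_size bytes may be copied into a data area
 * of data_size bytes using 64-bit writes, -EINVAL otherwise.
 */
static int check_debug_buf_size(uint32_t buf_size, uint32_t data_size)
{
	if (buf_size & 0x07)		/* not a multiple of 8 */
		return -EINVAL;

	if (buf_size > data_size)	/* would overrun the data area */
		return -EINVAL;

	return 0;
}
```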

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/Kconfig |   6 +
 arch/powerpc/platforms/powernv/pmem/ocxl.c  | 249 
 include/uapi/nvdimm/ocxl-pmem.h |  32 +++
 3 files changed, 287 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/Kconfig 
b/arch/powerpc/platforms/powernv/pmem/Kconfig
index c5d927520920..3f44429d70c9 100644
--- a/arch/powerpc/platforms/powernv/pmem/Kconfig
+++ b/arch/powerpc/platforms/powernv/pmem/Kconfig
@@ -12,4 +12,10 @@ config OCXL_PMEM
 
  Select N if unsure.
 
+config OCXL_PMEM_DEBUG
+   bool "OpenCAPI Persistent Memory debugging"
+   depends on OCXL_PMEM
+   help
+ Enables low level IOCTLs for OpenCAPI Persistent Memory firmware 
development
+
 endif
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index e01f6f9fc180..d4ce5e9e0521 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -1050,6 +1050,235 @@ int req_controller_health_perf(struct ocxlpmem 
*ocxlpmem)
  GLOBAL_MMIO_HCI_REQ_HEALTH_PERF);
 }
 
+#ifdef CONFIG_OCXL_PMEM_DEBUG
+/**
+ * enable_fwdebug() - Enable FW debug on the controller
+ * @ocxlpmem: the device metadata
+ * Return: 0 on success, negative on failure
+ */
+static int enable_fwdebug(const struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+ OCXL_LITTLE_ENDIAN,
+ GLOBAL_MMIO_HCI_FW_DEBUG);
+}
+
+/**
+ * disable_fwdebug() - Disable FW debug on the controller
+ * @ocxlpmem: the device metadata
+ * Return: 0 on success, negative on failure
+ */
+static int disable_fwdebug(const struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCIC,
+ OCXL_LITTLE_ENDIAN,
+ GLOBAL_MMIO_HCI_FW_DEBUG);
+}
+
+static int ioctl_fwdebug(struct ocxlpmem *ocxlpmem,
+struct ioctl_ocxl_pmem_fwdebug __user *uarg)
+{
+   struct ioctl_ocxl_pmem_fwdebug args;
+   u64 val;
+   int i;
+   int rc;
+
+   if (copy_from_user(&args, uarg, sizeof(args)))
+   return -EFAULT;
+
+   // Buffer size must be a multiple of 8
+   if ((args.buf_size & 0x07))
+   return -EINVAL;
+
+   if (args.buf_size > ocxlpmem->admin_command.data_size)
+   return -EINVAL;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = enable_fwdebug(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_FW_DEBUG);
+   if (rc)
+   goto out;
+
+   // Write DebugAction & FunctionCode
+   val = ((u64)args.debug_action << 56) | ((u64)args.function_code << 40);
+
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 
0x08,
+ OCXL_LITTLE_ENDIAN, val);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 
0x10,
+ OCXL_LITTLE_ENDIAN, 
args.debug_parameter_1);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 
0x18,
+ OCXL_LITTLE_ENDIAN, 
args.debug_parameter_2);
+   if (rc)
+   goto out;
+
+   for (i = 0x20; i < 0x38; i += 0x08) {
+           rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+                                         ocxlpmem->admin_command.request_offset + i,
+                                         OCXL_LITTLE_ENDIAN, 0);
+           if (rc)
+                   goto out;
+   }
+
+
+   // Populate admin command buffer
+   if (args.buf_size) {
+   for (i = 0; i < args.buf_size; i += sizeof(u64)) {
+   u64 val;
+
+   if (copy_from_user(, [i], sizeof(u64))) {
+           rc = -EFAULT;
+           goto out;
+   }
+
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ 
ocxlpmem->admin_command.data_offset + i,
+ OCXL_HOST_ENDIAN, val);
+   if (rc)
+   goto out;
+   }
+   }
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem,
+   

[PATCH v3 20/27] powerpc/powernv/pmem: Forward events to userspace

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

Some of the interrupts that the card generates are better handled
by the userspace daemon, in particular:
Controller Hardware/Firmware Fatal
Controller Dump Available
Error Log available

This patch allows a userspace application to register an eventfd with
the driver via IOCTL_OCXL_PMEM_EVENTFD to receive notifications of these
interrupts.

Userspace can then identify which events have occurred by calling
IOCTL_OCXL_PMEM_EVENT_CHECK and checking the result against the
IOCTL_OCXL_PMEM_EVENT_* masks.
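
On the userspace side, waiting on the eventfd is an ordinary 8-byte read. The sketch below shows only that part; registering the descriptor with the driver is omitted because it needs the device, and wait_for_events() is a made-up name. Each eventfd_signal(ctx, 1) in the handlers below adds one to the counter, so a single read may cover several interrupts.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Block until the eventfd counter is non-zero, then return its value in
 * *count. In the default (non-semaphore) mode the read resets the counter.
 * Returns 0 on success, -1 on a short or failed read.
 */
static int wait_for_events(int efd, uint64_t *count)
{
	ssize_t n = read(efd, count, sizeof(*count));

	return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

After the read returns, a daemon would call the event-check ioctl to find out which conditions are pending.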

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 216 ++
 .../platforms/powernv/pmem/ocxl_internal.h|   5 +
 include/uapi/nvdimm/ocxl-pmem.h   |  16 ++
 3 files changed, 237 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 009d4fd29e7d..e46696d3cc36 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -335,11 +336,22 @@ static void free_ocxlpmem(struct ocxlpmem *ocxlpmem)
 {
int rc;
 
+   // Disable doorbells
+   (void)ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIEC,
+OCXL_LITTLE_ENDIAN,
+GLOBAL_MMIO_CHI_ALL);
+
if (ocxlpmem->nvdimm_bus)
nvdimm_bus_unregister(ocxlpmem->nvdimm_bus);
 
free_minor(ocxlpmem);
 
+   if (ocxlpmem->irq_addr[1])
+   iounmap(ocxlpmem->irq_addr[1]);
+
+   if (ocxlpmem->irq_addr[0])
+   iounmap(ocxlpmem->irq_addr[0]);
+
if (ocxlpmem->cdev.owner)
                cdev_del(&ocxlpmem->cdev);
 
@@ -443,6 +455,11 @@ static int file_release(struct inode *inode, struct file 
*file)
 {
struct ocxlpmem *ocxlpmem = file->private_data;
 
+   if (ocxlpmem->ev_ctx) {
+   eventfd_ctx_put(ocxlpmem->ev_ctx);
+   ocxlpmem->ev_ctx = NULL;
+   }
+
ocxlpmem_put(ocxlpmem);
return 0;
 }
@@ -938,6 +955,51 @@ static int ioctl_controller_stats(struct ocxlpmem 
*ocxlpmem,
return rc;
 }
 
+static int ioctl_eventfd(struct ocxlpmem *ocxlpmem,
+struct ioctl_ocxl_pmem_eventfd __user *uarg)
+{
+   struct ioctl_ocxl_pmem_eventfd args;
+
+   if (copy_from_user(&args, uarg, sizeof(args)))
+   return -EFAULT;
+
+   if (ocxlpmem->ev_ctx)
+   return -EINVAL;
+
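+   ocxlpmem->ev_ctx = eventfd_ctx_fdget(args.eventfd);
+   if (IS_ERR(ocxlpmem->ev_ctx)) {
+           /* eventfd_ctx_fdget() returns an ERR_PTR, never NULL */
+           int rc = PTR_ERR(ocxlpmem->ev_ctx);
+
+           ocxlpmem->ev_ctx = NULL;
+           return rc;
+   }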
+
+   return 0;
+}
+
+static int ioctl_event_check(struct ocxlpmem *ocxlpmem, u64 __user *uarg)
+{
+   u64 val = 0;
+   int rc;
+   u64 chi = 0;
+
+   rc = ocxlpmem_chi(ocxlpmem, &chi);
+   if (rc < 0)
+   return rc;
+
+   if (chi & GLOBAL_MMIO_CHI_ELA)
+   val |= IOCTL_OCXL_PMEM_EVENT_ERROR_LOG_AVAILABLE;
+
+   if (chi & GLOBAL_MMIO_CHI_CDA)
+   val |= IOCTL_OCXL_PMEM_EVENT_CONTROLLER_DUMP_AVAILABLE;
+
+   if (chi & GLOBAL_MMIO_CHI_CFFS)
+   val |= IOCTL_OCXL_PMEM_EVENT_FIRMWARE_FATAL;
+
+   if (chi & GLOBAL_MMIO_CHI_CHFS)
+   val |= IOCTL_OCXL_PMEM_EVENT_HARDWARE_FATAL;
+
+   /* copy_to_user() returns the number of bytes left uncopied */
+   if (copy_to_user(uarg, &val, sizeof(val)))
+           return -EFAULT;
+
+   return 0;
+}
+
 static long file_ioctl(struct file *file, unsigned int cmd, unsigned long args)
 {
struct ocxlpmem *ocxlpmem = file->private_data;
@@ -966,6 +1028,15 @@ static long file_ioctl(struct file *file, unsigned int 
cmd, unsigned long args)
rc = ioctl_controller_stats(ocxlpmem,
(struct 
ioctl_ocxl_pmem_controller_stats __user *)args);
break;
+
+   case IOCTL_OCXL_PMEM_EVENTFD:
+   rc = ioctl_eventfd(ocxlpmem,
+  (struct ioctl_ocxl_pmem_eventfd __user 
*)args);
+   break;
+
+   case IOCTL_OCXL_PMEM_EVENT_CHECK:
+   rc = ioctl_event_check(ocxlpmem, (u64 __user *)args);
+   break;
}
 
return rc;
@@ -1107,6 +1178,146 @@ static void dump_error_log(struct ocxlpmem *ocxlpmem)
kfree(buf);
 }
 
+static irqreturn_t imn0_handler(void *private)
+{
+   struct ocxlpmem *ocxlpmem = private;
+   u64 chi = 0;
+
+   (void)ocxlpmem_chi(ocxlpmem, &chi);
+
+   if (chi & GLOBAL_MMIO_CHI_ELA) {
+   dev_warn(&ocxlpmem->dev, "Error log is available\n");
+
+   if (ocxlpmem->ev_ctx)
+   eventfd_signal(ocxlpmem->ev_ctx, 1);
+   }
+
+   if (chi & GLOBAL_MMIO_CHI_CDA) {
+   dev_warn(&ocxlpmem->dev, "Controller dump is available\n");
+
+   if (ocxlpmem->ev_ctx)
+   eventfd_signal(ocxlpmem->ev_ctx, 1);
+   }
+
+
+   return IRQ_HANDLED;
+}
+
+static irqreturn_t imn1_handler(void *private)
+{
+ 

[PATCH v3 10/27] powerpc: Add driver for OpenCAPI Persistent Memory

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This driver exposes LPC memory on OpenCAPI pmem cards
as an NVDIMM, allowing the existing nvdimm infrastructure
to be used.

Namespace metadata is stored on the media itself, so
scm_reserve_metadata() maps 1 section's worth of PMEM storage
at the start to hold this. The rest of the PMEM range is registered
with libnvdimm as an nvdimm. scm_ndctl_config_read/write/size() provide
callbacks to libnvdimm to access the metadata.
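
The config read/write callbacks guard the label area with an offset + length check. As a sketch only, the same check can be written so that the sum cannot wrap around. The helper name label_range_ok() is hypothetical, and the 32-bit field widths are an assumption about the ndctl command headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when the byte range [offset, offset + length) lies inside an area
 * of area_size bytes. The subtraction form avoids overflow of the sum.
 */
static bool label_range_ok(uint32_t offset, uint32_t length,
			   uint32_t area_size)
{
	return offset <= area_size && length <= area_size - offset;
}
```

A request with offset 0xFFFFFFF0 and length 0x20 wraps to a small sum in 32-bit arithmetic, but is rejected by this form.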

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/Kconfig|   3 +
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/pmem/Kconfig   |  15 +
 arch/powerpc/platforms/powernv/pmem/Makefile  |   7 +
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 473 ++
 .../platforms/powernv/pmem/ocxl_internal.h|  28 ++
 6 files changed, 527 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/pmem/Kconfig
 create mode 100644 arch/powerpc/platforms/powernv/pmem/Makefile
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl.c
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_internal.h

diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..fc8976af0e52 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -50,3 +50,6 @@ config PPC_VAS
 config SCOM_DEBUGFS
bool "Expose SCOM controllers via debugfs"
depends on DEBUG_FS
+
+source "arch/powerpc/platforms/powernv/pmem/Kconfig"
+
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index c0f8120045c3..0bbd72988b6f 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -21,3 +21,4 @@ obj-$(CONFIG_PPC_VAS) += vas.o vas-window.o vas-debug.o
 obj-$(CONFIG_OCXL_BASE)+= ocxl.o
 obj-$(CONFIG_SCOM_DEBUGFS) += opal-xscom.o
 obj-$(CONFIG_PPC_SECURE_BOOT) += opal-secvar.o
+obj-$(CONFIG_LIBNVDIMM) += pmem/
diff --git a/arch/powerpc/platforms/powernv/pmem/Kconfig 
b/arch/powerpc/platforms/powernv/pmem/Kconfig
new file mode 100644
index ..c5d927520920
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pmem/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+if LIBNVDIMM
+
+config OCXL_PMEM
+   tristate "OpenCAPI Persistent Memory"
+   depends on LIBNVDIMM && PPC_POWERNV && PCI && EEH && ZONE_DEVICE && OCXL
+   help
+ Exposes devices that implement the OpenCAPI Storage Class Memory
+ specification as persistent memory regions. You may also want
+ DEV_DAX, DEV_DAX_PMEM & FS_DAX if you plan on using DAX devices
+ stacked on top of this driver.
+
+ Select N if unsure.
+
+endif
diff --git a/arch/powerpc/platforms/powernv/pmem/Makefile 
b/arch/powerpc/platforms/powernv/pmem/Makefile
new file mode 100644
index ..1c55c4193175
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pmem/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+ccflags-$(CONFIG_PPC_WERROR)   += -Werror
+
+obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
+
+ocxlpmem-y := ocxl.o
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
new file mode 100644
index ..3c4eeb5dcc0f
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -0,0 +1,473 @@
+// SPDX-License-Id
+// Copyright 2019 IBM Corp.
+
+/*
+ * A driver for OpenCAPI devices that implement the Storage Class
+ * Memory specification.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "ocxl_internal.h"
+
+
+static const struct pci_device_id ocxlpmem_pci_tbl[] = {
+   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
+   { }
+};
+
+MODULE_DEVICE_TABLE(pci, ocxlpmem_pci_tbl);
+
+#define NUM_MINORS 256 // Total to reserve
+
+static dev_t ocxlpmem_dev;
+static struct class *ocxlpmem_class;
+static struct mutex minors_idr_lock;
+static struct idr minors_idr;
+
+/**
+ * ndctl_config_write() - Handle a ND_CMD_SET_CONFIG_DATA command from ndctl
+ * @ocxlpmem: the device metadata
+ * @command: the incoming data to write
+ * Return: 0 on success, negative on failure
+ */
+static int ndctl_config_write(struct ocxlpmem *ocxlpmem,
+ struct nd_cmd_set_config_hdr *command)
+{
+   if (command->in_offset + command->in_length > LABEL_AREA_SIZE)
+   return -EINVAL;
+
+   memcpy_flushcache(ocxlpmem->metadata_addr + command->in_offset, 
command->in_buf,
+ command->in_length);
+
+   return 0;
+}
+
+/**
+ * ndctl_config_read() - Handle a ND_CMD_GET_CONFIG_DATA command from ndctl
+ * @ocxlpmem: the device metadata
+ * @command: the read request
+ * Return: 0 on success, negative on failure
+ */
+static int ndctl_config_read(struct ocxlpmem *ocxlpmem,
+struct nd_cmd_get_config_data_hdr *command)
+{
+   if (command->in_offset + command->in_length > 

[PATCH v3 17/27] powerpc/powernv/pmem: Implement the Read Error Log command

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

The read error log command extracts information from the controller's
internal error log.

This patch exposes this information in 2 ways:
- During probe, if an error occurs & a log is available, print it to the
  console
- After probe, make the error log available to userspace via an IOCTL.
  Userspace is notified of pending error logs in a later patch
  ("powerpc/powernv/pmem: Forward events to userspace")

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 269 ++
 .../platforms/powernv/pmem/ocxl_internal.h|   1 +
 include/uapi/nvdimm/ocxl-pmem.h   |  46 +++
 3 files changed, 316 insertions(+)
 create mode 100644 include/uapi/nvdimm/ocxl-pmem.h

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 63109a870d2c..2b64504f9129 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -447,10 +447,219 @@ static int file_release(struct inode *inode, struct file 
*file)
return 0;
 }
 
+/**
+ * error_log_header_parse() - Parse the first 64 bits of the error log command 
response
+ * @ocxlpmem: the device metadata
+ * @length: out, returns the number of bytes in the response (excluding the 64 
bit header)
+ */
+static int error_log_header_parse(struct ocxlpmem *ocxlpmem, u16 *length)
+{
+   int rc;
+   u64 val;
+
+   u16 data_identifier;
+   u32 data_length;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   data_identifier = val >> 48;
+   data_length = val & 0x;
+
+   if (data_identifier != 0x454C) { // 'EL'
+   dev_err(&ocxlpmem->dev,
+           "Bad data identifier for error log data, expected 'EL', got '%2s' (%#x), data_length=%u\n",
+           (char *)&data_identifier,
+           (unsigned int)data_identifier, data_length);
+   return -EINVAL;
+   }
+
+   *length = data_length;
+   return 0;
+}
+
+static int error_log_offset_0x08(struct ocxlpmem *ocxlpmem,
+u32 *log_identifier, u32 *program_ref_code)
+{
+   int rc;
+   u64 val;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   *log_identifier = val >> 32;
+   *program_ref_code = val & 0x;
+
+   return 0;
+}
+
+static int read_error_log(struct ocxlpmem *ocxlpmem,
+ struct ioctl_ocxl_pmem_error_log *log, bool 
buf_is_user)
+{
+   u64 val;
+   u16 user_buf_length;
+   u16 buf_length;
+   u16 i;
+   int rc;
+
+   if (log->buf_size % 8)
+   return -EINVAL;
+
+   rc = ocxlpmem_chi(ocxlpmem, &val);
+   if (rc)
+           return rc; /* the admin command lock is not held yet */
+
+   if (!(val & GLOBAL_MMIO_CHI_ELA))
+   return -EAGAIN;
+
+   user_buf_length = log->buf_size;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_ERRLOG);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem, ADMIN_COMMAND_ERRLOG);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Read error log timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem, "Unexpected status from retrieve error 
log", rc);
+   goto out;
+   }
+
+
+   rc = error_log_header_parse(ocxlpmem, &log->buf_size);
+   if (rc)
+   goto out;
+   // log->buf_size now contains the returned buffer size, not the user 
size
+
+   rc = error_log_offset_0x08(ocxlpmem, &log->log_identifier,
+                              &log->program_reference_code);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x10,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   goto out;
+
+   log->error_log_type = val >> 56;
+   log->action_flags = (log->error_log_type == 
OCXL_PMEM_ERROR_LOG_TYPE_GENERAL) ?
+   (val >> 32) & 0xFF : 0;
+   log->power_on_seconds = val & 0x;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x18,
+OCXL_LITTLE_ENDIAN, &log->timestamp);
+   

[PATCH v3 08/27] ocxl: Emit a log message showing how much LPC memory was detected

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch emits a message showing how much LPC memory & special purpose
memory was detected on an OCXL device.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/config.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index a62e3d7db2bf..701ae6216abf 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev,
afu->special_purpose_mem_size =
total_mem_size - lpc_mem_size;
}
+
+   dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and special purpose memory of %#llx bytes\n",
+            afu->lpc_mem_size, afu->special_purpose_mem_size);
+
return 0;
 }
 
-- 
2.24.1



[PATCH v3 27/27] MAINTAINERS: Add myself & nvdimm/ocxl to ocxl

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

The OpenCAPI Persistent Memory driver will be maintained as part of
the ppc tree.

I'm also adding myself as an author of the driver & contributor to
the generic ocxl driver.

Signed-off-by: Alastair D'Silva 
---
 MAINTAINERS | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index f8670989ec91..3fb9a9f576a7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12064,13 +12064,16 @@ F:tools/objtool/
 OCXL (Open Coherent Accelerator Processor Interface OpenCAPI) DRIVER
 M: Frederic Barrat 
 M: Andrew Donnellan 
+M: Alastair D'Silva 
 L: linuxppc-dev@lists.ozlabs.org
 S: Supported
 F: arch/powerpc/platforms/powernv/ocxl.c
+F: arch/powerpc/platforms/powernv/pmem/*
 F: arch/powerpc/include/asm/pnv-ocxl.h
 F: drivers/misc/ocxl/
 F: include/misc/ocxl*
 F: include/uapi/misc/ocxl.h
+F: include/uapi/nvdimm/ocxl-pmem.h
 F: Documentation/userspace-api/accelerators/ocxl.rst
 
 OMAP AUDIO SUPPORT
-- 
2.24.1



[PATCH v3 14/27] powerpc/powernv/pmem: Add support for Admin commands

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch requests the metadata required to issue admin commands, as well
as some helper functions to construct and check the completion of the
commands.
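
To make the request layout concrete, here is a small sketch of what scm_command_request() in this patch writes to the device: the op code in the low byte of the first 64-bit word, the command ID shifted up by 16 bits, and the remaining seven words cleared. build_request() is a hypothetical helper that fills a local array instead of doing MMIO writes, and the 16-bit width of the ID is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* COMMAND_REQUEST_SIZE / sizeof(u64) in the patch */
#define REQUEST_WORDS 8

/* Fill an 8-word request: word 0 holds op code and ID, the rest are zero. */
static void build_request(uint64_t req[REQUEST_WORDS], uint8_t op_code,
			  uint16_t id)
{
	memset(req, 0, REQUEST_WORDS * sizeof(req[0]));
	req[0] = (uint64_t)op_code | ((uint64_t)id << 16);
}
```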

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c|  65 
 .../platforms/powernv/pmem/ocxl_internal.c| 153 ++
 .../platforms/powernv/pmem/ocxl_internal.h|  61 +++
 3 files changed, 279 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 431212c9f0cc..4e782d22605b 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -216,6 +216,58 @@ static int register_lpc_mem(struct ocxlpmem *ocxlpmem)
return 0;
 }
 
+/**
+ * extract_command_metadata() - Extract command data from MMIO & save it for 
further use
+ * @ocxlpmem: the device metadata
+ * @offset: The base address of the command data structures (address of CREQO)
+ * @command_metadata: A pointer to the command metadata to populate
+ * Return: 0 on success, negative on failure
+ */
+static int extract_command_metadata(struct ocxlpmem *ocxlpmem, u32 offset,
+   struct command_metadata 
*command_metadata)
+{
+   int rc;
+   u64 tmp;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, offset, OCXL_LITTLE_ENDIAN,
+                                &tmp);
+   if (rc)
+   return rc;
+
+   command_metadata->request_offset = tmp >> 32;
+   command_metadata->response_offset = tmp & 0x;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, offset + 8, OCXL_LITTLE_ENDIAN,
+                                &tmp);
+   if (rc)
+   return rc;
+
+   command_metadata->data_offset = tmp >> 32;
+   command_metadata->data_size = tmp & 0x;
+
+   command_metadata->id = 0;
+
+   return 0;
+}
+
+/**
+ * setup_command_metadata() - Set up the command metadata
+ * @ocxlpmem: the device metadata
+ */
+static int setup_command_metadata(struct ocxlpmem *ocxlpmem)
+{
+   int rc;
+
+   mutex_init(&ocxlpmem->admin_command.lock);
+
+   rc = extract_command_metadata(ocxlpmem, GLOBAL_MMIO_ACMA_CREQO,
+ &ocxlpmem->admin_command);
+   if (rc)
+   return rc;
+
+   return 0;
+}
+
 /**
  * is_usable() - Is a controller usable?
  * @ocxlpmem: the device metadata
@@ -456,6 +508,14 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
}
ocxlpmem->pdev = pdev;
 
+   ocxlpmem->timeouts[ADMIN_COMMAND_ERRLOG] = 2000; // ms
+   ocxlpmem->timeouts[ADMIN_COMMAND_HEARTBEAT] = 100; // ms
+   ocxlpmem->timeouts[ADMIN_COMMAND_SMART] = 100; // ms
+   ocxlpmem->timeouts[ADMIN_COMMAND_CONTROLLER_DUMP] = 1000; // ms
+   ocxlpmem->timeouts[ADMIN_COMMAND_CONTROLLER_STATS] = 100; // ms
+   ocxlpmem->timeouts[ADMIN_COMMAND_SHUTDOWN] = 1000; // ms
+   ocxlpmem->timeouts[ADMIN_COMMAND_FW_UPDATE] = 16000; // ms
+
pci_set_drvdata(pdev, ocxlpmem);
 
ocxlpmem->ocxl_fn = ocxl_function_open(pdev);
@@ -501,6 +561,11 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err;
}
 
+   if (setup_command_metadata(ocxlpmem)) {
+   dev_err(&pdev->dev, "Could not read OCXL command metadata\n");
+   goto err;
+   }
+
elapsed = 0;
timeout = ocxlpmem->readiness_timeout + 
ocxlpmem->memory_available_timeout;
while (!is_usable(ocxlpmem, false)) {
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
index 617ca943b1b8..583f48023025 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
@@ -17,3 +17,156 @@ int ocxlpmem_chi(const struct ocxlpmem *ocxlpmem, u64 *chi)
 
return 0;
 }
+
+#define COMMAND_REQUEST_SIZE (8 * sizeof(u64))
+static int scm_command_request(const struct ocxlpmem *ocxlpmem,
+  struct command_metadata *cmd, u8 op_code)
+{
+   u64 val = op_code;
+   int rc;
+   u8 i;
+
+   cmd->op_code = op_code;
+   cmd->id++;
+
+   val |= ((u64)cmd->id) << 16;
+
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu, cmd->request_offset,
+ OCXL_LITTLE_ENDIAN, val);
+   if (rc)
+   return rc;
+
+   for (i = sizeof(u64); i < COMMAND_REQUEST_SIZE; i += sizeof(u64)) {
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ cmd->request_offset + i,
+ OCXL_LITTLE_ENDIAN, 0);
+   if (rc)
+   return rc;
+   }
+
+   return 0;
+}
+
+int admin_command_request(struct ocxlpmem *ocxlpmem, u8 op_code)
+{
+   u64 val;
+   int rc = 

[PATCH v3 15/27] powerpc/powernv/pmem: Add support for near storage commands

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

Similar to the previous patch, this adds support for near storage commands.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c|  6 +++
 .../platforms/powernv/pmem/ocxl_internal.c| 41 +++
 .../platforms/powernv/pmem/ocxl_internal.h| 37 +
 3 files changed, 84 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 4e782d22605b..b8bd7e703b19 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -259,12 +259,18 @@ static int setup_command_metadata(struct ocxlpmem 
*ocxlpmem)
int rc;
 
        mutex_init(&ocxlpmem->admin_command.lock);
+   mutex_init(&ocxlpmem->ns_command.lock);
 
rc = extract_command_metadata(ocxlpmem, GLOBAL_MMIO_ACMA_CREQO,
  &ocxlpmem->admin_command);
if (rc)
return rc;
 
+   rc = extract_command_metadata(ocxlpmem, GLOBAL_MMIO_NSCMA_CREQO,
+ &ocxlpmem->ns_command);
+   if (rc)
+   return rc;
+
return 0;
 }
 
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
index 583f48023025..3e0b133feddf 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
@@ -133,6 +133,47 @@ int admin_response_handled(const struct ocxlpmem *ocxlpmem)
  OCXL_LITTLE_ENDIAN, GLOBAL_MMIO_CHI_ACRA);
 }
 
+int ns_command_request(struct ocxlpmem *ocxlpmem, u8 op_code)
+{
+   u64 val;
+   int rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHI,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   if (!(val & GLOBAL_MMIO_CHI_NSCRA))
+   return -EBUSY;
+
+   return scm_command_request(ocxlpmem, &ocxlpmem->ns_command, op_code);
+}
+
+int ns_response(const struct ocxlpmem *ocxlpmem)
+{
+   return command_response(ocxlpmem, &ocxlpmem->ns_command);
+}
+
+int ns_command_execute(const struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+ OCXL_LITTLE_ENDIAN, GLOBAL_MMIO_HCI_NSCRW);
+}
+
+bool ns_command_complete(const struct ocxlpmem *ocxlpmem)
+{
+   u64 val = 0;
+   int rc = ocxlpmem_chi(ocxlpmem, &val);
+
+   WARN_ON(rc);
+
+   return (val & GLOBAL_MMIO_CHI_NSCRA) != 0;
+}
+
+int ns_response_handled(const struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIC,
+ OCXL_LITTLE_ENDIAN, GLOBAL_MMIO_CHI_NSCRA);
+}
+
 void warn_status(const struct ocxlpmem *ocxlpmem, const char *message,
 u8 status)
 {
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h 
b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
index 2fef68c71271..28e2020f6355 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
@@ -107,6 +107,7 @@ struct ocxlpmem {
struct ocxl_context *ocxl_context;
void *metadata_addr;
struct command_metadata admin_command;
+   struct command_metadata ns_command;
struct resource pmem_res;
struct nd_region *nd_region;
char fw_version[8+1];
@@ -175,6 +176,42 @@ int admin_command_complete_timeout(const struct ocxlpmem 
*ocxlpmem,
  */
 int admin_response_handled(const struct ocxlpmem *ocxlpmem);
 
+/**
+ * ns_command_request() - Issue a near storage command request
+ * @ocxlpmem: the device metadata
+ * @op_code: The op-code for the command
+ * Returns an identifier for the command, or negative on error
+ */
+int ns_command_request(struct ocxlpmem *ocxlpmem, u8 op_code);
+
+/**
+ * ns_response() - Validate a near storage response
+ * @ocxlpmem: the device metadata
+ * Returns the status code of the command, or negative on error
+ */
+int ns_response(const struct ocxlpmem *ocxlpmem);
+
+/**
+ * ns_command_execute() - Notify the controller to start processing a pending near storage command
+ * @ocxlpmem: the device metadata
+ * Returns 0 on success, negative on error
+ */
+int ns_command_execute(const struct ocxlpmem *ocxlpmem);
+
+/**
+ * ns_command_complete() - Has the near storage command completed
+ * @ocxlpmem: the device metadata
+ * Returns true if the previous near storage command has completed
+ */
+bool ns_command_complete(const struct ocxlpmem *ocxlpmem);
+
+/**
+ * ns_response_handled() - Notify the controller that the near storage response has been handled
+ * @ocxlpmem: the device metadata
+ * Returns 0 on success, negative on failure
+ */
+int ns_response_handled(const struct ocxlpmem *ocxlpmem);
+
 /**
  * warn_status() - Emit a kernel warning showing a command status.
  * @ocxlpmem: the device metadata
-- 
2.24.1
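
For readers following the command path, the first 64-bit request word built by scm_command_request() can be illustrated in userspace. This is only a sketch under assumptions: struct command_sketch, build_request() and REQUEST_WORDS are hypothetical names, the id field is assumed to be 16 bits wide, and the driver's little-endian MMIO writes are replaced by a plain array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Eight 64-bit words, mirroring COMMAND_REQUEST_SIZE (8 * sizeof(u64)) */
#define REQUEST_WORDS 8

/* Hypothetical stand-in for the driver's struct command_metadata */
struct command_sketch {
	uint8_t op_code;
	uint16_t id;	/* assumption: the command id is 16 bits wide */
};

/*
 * Fill a request buffer the way scm_command_request() fills the request
 * area: word 0 carries the op-code in the low bits and the incremented
 * command id shifted left by 16, the remaining words are zeroed.
 */
static void build_request(struct command_sketch *cmd, uint8_t op_code,
			  uint64_t request[REQUEST_WORDS])
{
	assert(cmd && request);

	cmd->op_code = op_code;
	cmd->id++;

	memset(request, 0, REQUEST_WORDS * sizeof(request[0]));
	request[0] = (uint64_t)op_code | ((uint64_t)cmd->id << 16);
}
```

With an id of 4 before the call and op-code 0x06, word 0 becomes 0x50006 and the other seven words are zero.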



[PATCH v3 18/27] powerpc/powernv/pmem: Add controller dump IOCTLs

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch adds IOCTLs to allow userspace to request & fetch dumps
of the internal controller state.

This is useful during debugging or when a fatal error on the controller
has occurred.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c | 132 +
 include/uapi/nvdimm/ocxl-pmem.h|  15 +++
 2 files changed, 147 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 2b64504f9129..2cabafe1fc58 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -640,6 +640,124 @@ static int ioctl_error_log(struct ocxlpmem *ocxlpmem,
return 0;
 }
 
+static int ioctl_controller_dump_data(struct ocxlpmem *ocxlpmem,
+   struct ioctl_ocxl_pmem_controller_dump_data __user *uarg)
+{
+   struct ioctl_ocxl_pmem_controller_dump_data args;
+   u16 i;
+   u64 val;
+   int rc;
+
+   if (copy_from_user(&args, uarg, sizeof(args)))
+   return -EFAULT;
+
+   if (args.buf_size % 8)
+   return -EINVAL;
+
+   if (args.buf_size > ocxlpmem->admin_command.data_size)
+   return -EINVAL;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_CONTROLLER_DUMP);
+   if (rc)
+   goto out;
+
+   val = ((u64)args.offset) << 32;
+   val |= args.buf_size;
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 0x08,
+ OCXL_LITTLE_ENDIAN, val);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem,
+   ADMIN_COMMAND_CONTROLLER_DUMP);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Controller dump timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem,
+   "Unexpected status from controller dump",
+   rc);
+   goto out;
+   }
+
+   for (i = 0; i < args.buf_size; i += 8) {
+   u64 val;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + i,
+OCXL_HOST_ENDIAN, &val);
+   if (rc)
+   goto out;
+
+   if (copy_to_user(&args.buf[i], &val, sizeof(u64))) {
+   rc = -EFAULT;
+   goto out;
+   }
+   }
+
+   if (copy_to_user(uarg, &args, sizeof(args))) {
+   rc = -EFAULT;
+   goto out;
+   }
+
+   rc = admin_response_handled(ocxlpmem);
+   if (rc)
+   goto out;
+
+out:
+   mutex_unlock(&ocxlpmem->admin_command.lock);
+   return rc;
+}
+
+int request_controller_dump(struct ocxlpmem *ocxlpmem)
+{
+   int rc;
+   u64 busy = 1;
+
+   rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIC,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_CHI_CDA);
+
+
+   rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_HCI_CONTROLLER_DUMP);
+   if (rc)
+   return rc;
+
+   while (busy) {
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+GLOBAL_MMIO_HCI,
+OCXL_LITTLE_ENDIAN, &busy);
+   if (rc)
+   return rc;
+
+   busy &= GLOBAL_MMIO_HCI_CONTROLLER_DUMP;
+   cond_resched();
+   }
+
+   return 0;
+}
+
+static int ioctl_controller_dump_complete(struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_HCI_CONTROLLER_DUMP_COLLECTED);
+}
+
 static long file_ioctl(struct file *file, unsigned int cmd, unsigned long args)
 {
struct ocxlpmem *ocxlpmem = file->private_data;
@@ -650,7 +768,21 @@ static long file_ioctl(struct file *file, unsigned int 
cmd, unsigned long args)
rc = ioctl_error_log(ocxlpmem,
 (struct ioctl_ocxl_pmem_error_log __user 
*)args);
break;
+
+   case IOCTL_OCXL_PMEM_CONTROLLER_DUMP:
+   rc = request_controller_dump(ocxlpmem);
+   break;
+
+   case IOCTL_OCXL_PMEM_CONTROLLER_DUMP_DATA:
+   

[PATCH v3 11/27] powerpc: Enable the OpenCAPI Persistent Memory driver for powernv_defconfig

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch enables the OpenCAPI Persistent Memory driver, as well
as DAX support, for the 'powernv' platform.

DAX is not a strict requirement for the functioning of the driver, but it
is likely that a user will want to create a DAX device on top of their
persistent memory device.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/configs/powernv_defconfig | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig 
b/arch/powerpc/configs/powernv_defconfig
index 71749377d164..921d77bbd3d2 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -348,3 +348,8 @@ CONFIG_KVM_BOOK3S_64=m
 CONFIG_KVM_BOOK3S_64_HV=m
 CONFIG_VHOST_NET=m
 CONFIG_PRINTK_TIME=y
+CONFIG_ZONE_DEVICE=y
+CONFIG_OCXL_PMEM=m
+CONFIG_DEV_DAX=m
+CONFIG_DEV_DAX_PMEM=m
+CONFIG_FS_DAX=y
-- 
2.24.1



[PATCH v3 24/27] powerpc/powernv/pmem: Expose SMART data via ndctl

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch retrieves proprietary formatted SMART data and makes it
available via ndctl. A later contribution will be made to ndctl to
parse this data.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 128 ++
 .../platforms/powernv/pmem/ocxl_internal.h|  18 +++
 include/uapi/linux/ndctl.h|   1 +
 3 files changed, 147 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index d4ce5e9e0521..5cd1b6d78dd6 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -81,6 +81,129 @@ static int ndctl_config_size(struct nd_cmd_get_config_size 
*command)
return 0;
 }
 
+/**
+ * smart_header_parse() - Parse the first 64 bits of the SMART admin command response
+ * @ocxlpmem: the device metadata
+ * @length: out, returns the number of bytes in the response (excluding the 64 bit header)
+ */
+static int smart_header_parse(struct ocxlpmem *ocxlpmem, u32 *length)
+{
+   int rc;
+   u64 val;
+
+   u16 data_identifier;
+   u32 data_length;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   data_identifier = val >> 48;
+   data_length = val & 0x;
+
+   if (data_identifier != 0x534D) { // 'SM'
+   dev_err(&ocxlpmem->dev,
+   "Bad data identifier for smart data, expected 'SM', got '%-.*s'\n",
+   2, (char *)&data_identifier);
+   return -EINVAL;
+   }
+
+   *length = data_length;
+   return 0;
+}
+
+static int ndctl_smart(struct ocxlpmem *ocxlpmem, struct nd_cmd_pkg *pkg)
+{
+   u32 length, i;
+   struct nd_ocxl_smart *out;
+   int rc;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_SMART);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem, ADMIN_COMMAND_SMART);
+   if (rc < 0) {
+   dev_err(&ocxlpmem->dev, "SMART timeout\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem, "Unexpected status from SMART", rc);
+   goto out;
+   }
+
+   rc = smart_header_parse(ocxlpmem, &length);
+   if (rc)
+   goto out;
+
+   pkg->nd_fw_size = length;
+
+   length = min(length, pkg->nd_size_out); // bytes
+   out = (struct nd_ocxl_smart *)pkg->nd_payload;
+   // Each SMART attribute is 2 * 64 bits
+   out->count = length / (2 * sizeof(u64)); // attributes
+
+   for (i = 0; i < length; i += sizeof(u64)) {
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + sizeof(u64) + i,
+OCXL_LITTLE_ENDIAN,
+&out->attribs[i/sizeof(u64)]);
+   if (rc)
+   goto out;
+   }
+
+   rc = admin_response_handled(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = 0;
+   goto out;
+
+out:
+   mutex_unlock(&ocxlpmem->admin_command.lock);
+   return rc;
+}
+
+static int ndctl_call(struct ocxlpmem *ocxlpmem, void *buf, unsigned int buf_len)
+{
+   struct nd_cmd_pkg *pkg = buf;
+
+   if (buf_len < sizeof(struct nd_cmd_pkg)) {
+   dev_err(&ocxlpmem->dev, "Invalid ND_CALL size=%u\n", buf_len);
+   return -EINVAL;
+   }
+
+   if (pkg->nd_family != NVDIMM_FAMILY_OCXL) {
+   dev_err(&ocxlpmem->dev, "Invalid ND_CALL family=0x%llx\n", pkg->nd_family);
+   return -EINVAL;
+   }
+
+   switch (pkg->nd_command) {
+   case ND_CMD_OCXL_SMART:
+   ndctl_smart(ocxlpmem, pkg);
+   break;
+
+   default:
+   dev_err(&ocxlpmem->dev, "Invalid ND_CALL command=0x%llx\n", pkg->nd_command);
+   return -EINVAL;
+   }
+
+
+   return 0;
+}
+
 static int ndctl(struct nvdimm_bus_descriptor *nd_desc,
 struct nvdimm *nvdimm,
 unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc)
@@ -88,6 +211,10 @@ static int ndctl(struct nvdimm_bus_descriptor *nd_desc,
struct ocxlpmem *ocxlpmem = container_of(nd_desc, struct ocxlpmem, 
bus_desc);
 
switch (cmd) {
+   case ND_CMD_CALL:
+   *cmd_rc = ndctl_call(ocxlpmem, buf, buf_len);
+   return 0;
+
case ND_CMD_GET_CONFIG_SIZE:
*cmd_rc = ndctl_config_size(buf);
return 0;
@@ -171,6 +298,7 @@ 

[PATCH v3 13/27] powerpc/powernv/pmem: Read the capability registers & wait for device ready

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch reads timeouts & firmware version from the controller, and
uses those timeouts to wait for the controller to report that it is ready
before handing the memory over to libnvdimm.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/Makefile  |  2 +-
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 92 +++
 .../platforms/powernv/pmem/ocxl_internal.c| 19 
 .../platforms/powernv/pmem/ocxl_internal.h| 24 +
 4 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_internal.c

diff --git a/arch/powerpc/platforms/powernv/pmem/Makefile 
b/arch/powerpc/platforms/powernv/pmem/Makefile
index 1c55c4193175..4ceda25907d4 100644
--- a/arch/powerpc/platforms/powernv/pmem/Makefile
+++ b/arch/powerpc/platforms/powernv/pmem/Makefile
@@ -4,4 +4,4 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
 
 obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
 
-ocxlpmem-y := ocxl.o
+ocxlpmem-y := ocxl.o ocxl_internal.o
diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 3c4eeb5dcc0f..431212c9f0cc 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -8,6 +8,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -215,6 +216,36 @@ static int register_lpc_mem(struct ocxlpmem *ocxlpmem)
return 0;
 }
 
+/**
+ * is_usable() - Is a controller usable?
+ * @ocxlpmem: the device metadata
+ * @verbose: True to log errors
+ * Return: true if the controller is usable
+ */
+static bool is_usable(const struct ocxlpmem *ocxlpmem, bool verbose)
+{
+   u64 chi = 0;
+   int rc = ocxlpmem_chi(ocxlpmem, &chi);
+
+   if (rc < 0)
+   return false;
+
+   if (!(chi & GLOBAL_MMIO_CHI_CRDY)) {
+   if (verbose)
+   dev_err(&ocxlpmem->dev, "controller is not ready.\n");
+   return false;
+   }
+
+   if (!(chi & GLOBAL_MMIO_CHI_MA)) {
+   if (verbose)
+   dev_err(&ocxlpmem->dev,
+   "controller does not have memory available.\n");
+   return false;
+   }
+
+   return true;
+}
+
 /**
  * allocate_minor() - Allocate a minor number to use for an OpenCAPI pmem 
device
  * @ocxlpmem: the device metadata
@@ -328,6 +359,48 @@ static void ocxlpmem_remove(struct pci_dev *pdev)
}
 }
 
+/**
+ * read_device_metadata() - Retrieve config information from the AFU and save 
it for future use
+ * @ocxlpmem: the device metadata
+ * Return: 0 on success, negative on failure
+ */
+static int read_device_metadata(struct ocxlpmem *ocxlpmem)
+{
+   u64 val;
+   int rc;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP0,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   ocxlpmem->scm_revision = val & 0x;
+   ocxlpmem->read_latency = (val >> 32) & 0xFF;
+   ocxlpmem->readiness_timeout = (val >> 48) & 0x0F;
+   ocxlpmem->memory_available_timeout = val >> 52;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP1,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   ocxlpmem->max_controller_dump_size = val & 0x;
+
+   // Extract firmware version text
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_FWVER,
+OCXL_HOST_ENDIAN, (u64 *)ocxlpmem->fw_version);
+   if (rc)
+   return rc;
+
+   ocxlpmem->fw_version[8] = '\0';
+
+   dev_info(&ocxlpmem->dev,
+"Firmware version '%s' SCM revision %d:%d\n", ocxlpmem->fw_version,
+ocxlpmem->scm_revision >> 4, ocxlpmem->scm_revision & 0x0F);
+
+   return 0;
+}
+
 /**
  * probe_function0() - Set up function 0 for an OpenCAPI persistent memory 
device
  * This is important as it enables templates higher than 0 across all other 
functions,
@@ -368,6 +441,7 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 {
struct ocxlpmem *ocxlpmem;
int rc;
+   u16 elapsed, timeout;
 
if (PCI_FUNC(pdev->devfn) == 0)
return probe_function0(pdev);
@@ -422,6 +496,24 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err;
}
 
+   if (read_device_metadata(ocxlpmem)) {
+   dev_err(&ocxlpmem->dev, "Could not read metadata\n");
+   goto err;
+   }
+
+   elapsed = 0;
+   timeout = ocxlpmem->readiness_timeout + ocxlpmem->memory_available_timeout;
+   while (!is_usable(ocxlpmem, false)) {
+   if (elapsed++ > timeout) {
+   dev_warn(&ocxlpmem->dev, "OpenCAPI Persistent Memory ready timeout.\n");
+   (void)is_usable(ocxlpmem, true);
+   rc = -ENXIO;
+  

[PATCH v3 05/27] ocxl: Address kernel doc errors & warnings

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch addresses warnings and errors from the kernel doc scripts for
the OpenCAPI driver.

It also makes minor tweaks to make the docs more consistent.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/config.c| 24 
 drivers/misc/ocxl/ocxl_internal.h |  9 +--
 include/misc/ocxl.h   | 96 ---
 3 files changed, 55 insertions(+), 74 deletions(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index c8e19bfb5ef9..a62e3d7db2bf 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -273,16 +273,16 @@ static int read_afu_info(struct pci_dev *dev, struct 
ocxl_fn_config *fn,
 }
 
 /**
- * Read the template version from the AFU
- * dev: the device for the AFU
- * fn: the AFU offsets
- * len: outputs the template length
- * version: outputs the major<<8,minor version
+ * read_template_version() - Read the template version from the AFU
+ * @dev: the device for the AFU
+ * @fn: the AFU offsets
+ * @len: outputs the template length
+ * @version: outputs the major<<8,minor version
  *
  * Returns 0 on success, negative on failure
  */
 static int read_template_version(struct pci_dev *dev, struct ocxl_fn_config 
*fn,
-   u16 *len, u16 *version)
+u16 *len, u16 *version)
 {
u32 val32;
u8 major, minor;
@@ -476,16 +476,16 @@ static int validate_afu(struct pci_dev *dev, struct 
ocxl_afu_config *afu)
 }
 
 /**
- * Populate AFU metadata regarding LPC memory
- * dev: the device for the AFU
- * fn: the AFU offsets
- * afu: the AFU struct to populate the LPC metadata into
+ * read_afu_lpc_memory_info() - Populate AFU metadata regarding LPC memory
+ * @dev: the device for the AFU
+ * @fn: the AFU offsets
+ * @afu: the AFU struct to populate the LPC metadata into
  *
  * Returns 0 on success, negative on failure
  */
 static int read_afu_lpc_memory_info(struct pci_dev *dev,
-   struct ocxl_fn_config *fn,
-   struct ocxl_afu_config *afu)
+   struct ocxl_fn_config *fn,
+   struct ocxl_afu_config *afu)
 {
int rc;
u32 val32;
diff --git a/drivers/misc/ocxl/ocxl_internal.h 
b/drivers/misc/ocxl/ocxl_internal.h
index 345bf843a38e..198e4e4bc51d 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -122,11 +122,12 @@ int ocxl_config_check_afu_index(struct pci_dev *dev,
struct ocxl_fn_config *fn, int afu_idx);
 
 /**
- * Update values within a Process Element
+ * ocxl_link_update_pe() - Update values within a Process Element
+ * @link_handle: the link handle associated with the process element
+ * @pasid: the PASID for the AFU context
+ * @tid: the new thread id for the process element
  *
- * link_handle: the link handle associated with the process element
- * pasid: the PASID for the AFU context
- * tid: the new thread id for the process element
+ * Returns 0 on success
  */
 int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid);
 
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 0a762e387418..357ef1aadbc0 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -62,8 +62,7 @@ struct ocxl_context;
 // Device detection & initialisation
 
 /**
- * Open an OpenCAPI function on an OpenCAPI device
- *
+ * ocxl_function_open() - Open an OpenCAPI function on an OpenCAPI device
  * @dev: The PCI device that contains the function
  *
  * Returns an opaque pointer to the function, or an error pointer (check with 
IS_ERR)
@@ -71,8 +70,7 @@ struct ocxl_context;
 struct ocxl_fn *ocxl_function_open(struct pci_dev *dev);
 
 /**
- * Get the list of AFUs associated with a PCI function device
- *
+ * ocxl_function_afu_list() - Get the list of AFUs associated with a PCI 
function device
  * Returns a list of struct ocxl_afu *
  *
  * @fn: The OpenCAPI function containing the AFUs
@@ -80,8 +78,7 @@ struct ocxl_fn *ocxl_function_open(struct pci_dev *dev);
 struct list_head *ocxl_function_afu_list(struct ocxl_fn *fn);
 
 /**
- * Fetch an AFU instance from an OpenCAPI function
- *
+ * ocxl_function_fetch_afu() - Fetch an AFU instance from an OpenCAPI function
  * @fn: The OpenCAPI function to get the AFU from
  * @afu_idx: The index of the AFU to get
  *
@@ -92,23 +89,20 @@ struct list_head *ocxl_function_afu_list(struct ocxl_fn 
*fn);
 struct ocxl_afu *ocxl_function_fetch_afu(struct ocxl_fn *fn, u8 afu_idx);
 
 /**
- * Take a reference to an AFU
- *
+ * ocxl_afu_get() - Take a reference to an AFU
  * @afu: The AFU to increment the reference count on
  */
 void ocxl_afu_get(struct ocxl_afu *afu);
 
 /**
- * Release a reference to an AFU
- *
+ * ocxl_afu_put() - Release a reference to an AFU
  * @afu: The AFU to decrement the reference count on
  */
 void ocxl_afu_put(struct ocxl_afu *afu);
 
 
 /**
- * Get the configuration information for an 

[PATCH v3 26/27] powerpc/powernv/pmem: Expose the firmware version in sysfs

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This information will be used by ndctl in userspace to help users identify
the device.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c
index 7829e4bc887d..84b23cc3e8b7 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c
@@ -16,8 +16,17 @@ static ssize_t serial_show(struct device *device, struct 
device_attribute *attr,
return scnprintf(buf, PAGE_SIZE, "%llu\n", fn_config->serial);
 }
 
+static ssize_t fw_version_show(struct device *device,
+  struct device_attribute *attr, char *buf)
+{
+   struct ocxlpmem *ocxlpmem = container_of(device, struct ocxlpmem, dev);
+
+   return scnprintf(buf, PAGE_SIZE, "%s\n", ocxlpmem->fw_version);
+}
+
 static struct device_attribute attrs[] = {
__ATTR_RO(serial),
+   __ATTR_RO(fw_version),
 };
 
 int ocxlpmem_sysfs_add(struct ocxlpmem *ocxlpmem)
-- 
2.24.1
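
Since the attribute is meant to be consumed from userspace, a small reader can illustrate what a tool would see. This is a sketch, not how ndctl does it: read_attr_line() is a hypothetical helper, and the sysfs path of the attribute is not stated in this patch, so the caller has to supply it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a sysfs-style attribute file into buf and strip
 * the trailing newline that scnprintf("%s\n", ...) appends in the driver.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int read_attr_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path && buf && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Cut the string at the first newline, if there is one */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

The firmware version buffer in the driver is 8 characters plus a terminator, so a 16-byte buffer on the caller's side is enough for the text and the newline.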



[PATCH v3 07/27] ocxl: Add functions to map/unmap LPC memory

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

Add functions to map/unmap LPC memory

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/core.c  | 51 +++
 drivers/misc/ocxl/ocxl_internal.h |  3 ++
 include/misc/ocxl.h   | 21 +
 3 files changed, 75 insertions(+)

diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
index 2531c6cf19a0..75ff14e3882a 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -210,6 +210,56 @@ static void unmap_mmio_areas(struct ocxl_afu *afu)
release_fn_bar(afu->fn, afu->config.global_mmio_bar);
 }
 
+int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu)
+{
+   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
+
+   if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) 
== 0)
+   return 0;
+
+   afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev);
+   if (afu->lpc_base_addr == 0)
+   return -EINVAL;
+
+   if (afu->config.lpc_mem_size > 0) {
+   afu->lpc_res.start = afu->lpc_base_addr + 
afu->config.lpc_mem_offset;
+   afu->lpc_res.end = afu->lpc_res.start + 
afu->config.lpc_mem_size - 1;
+   }
+
+   if (afu->config.special_purpose_mem_size > 0) {
+   afu->special_purpose_res.start = afu->lpc_base_addr +
+
afu->config.special_purpose_mem_offset;
+   afu->special_purpose_res.end = afu->special_purpose_res.start +
+  
afu->config.special_purpose_mem_size - 1;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ocxl_afu_map_lpc_mem);
+
+struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
+{
+   return &afu->lpc_res;
+}
+EXPORT_SYMBOL_GPL(ocxl_afu_lpc_mem);
+
+static void unmap_lpc_mem(struct ocxl_afu *afu)
+{
+   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
+
+   if (afu->lpc_res.start || afu->special_purpose_res.start) {
+   void *link = afu->fn->link;
+
+   // only release the link when the last consumer calls release
+   ocxl_link_lpc_release(link, dev);
+
+   afu->lpc_res.start = 0;
+   afu->lpc_res.end = 0;
+   afu->special_purpose_res.start = 0;
+   afu->special_purpose_res.end = 0;
+   }
+}
+
 static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev)
 {
int rc;
@@ -251,6 +301,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, 
struct pci_dev *dev)
 
 static void deconfigure_afu(struct ocxl_afu *afu)
 {
+   unmap_lpc_mem(afu);
unmap_mmio_areas(afu);
reclaim_afu_pasid(afu);
reclaim_afu_actag(afu);
diff --git a/drivers/misc/ocxl/ocxl_internal.h 
b/drivers/misc/ocxl/ocxl_internal.h
index d0c8c4838f42..ce0cac1da416 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -52,6 +52,9 @@ struct ocxl_afu {
void __iomem *global_mmio_ptr;
u64 pp_mmio_start;
void *private;
+   u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
+   struct resource lpc_res;
+   struct resource special_purpose_res;
 };
 
 enum ocxl_context_status {
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 357ef1aadbc0..d8b0b4d46bfb 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -203,6 +203,27 @@ int ocxl_irq_set_handler(struct ocxl_context *ctx, int 
irq_id,
 
 // AFU Metadata
 
+/**
+ * ocxl_afu_map_lpc_mem() - Map the LPC system & special purpose memory for an AFU
+ * Do not call this during device discovery, as there may be multiple
+ * devices on a link, and the memory is mapped for the whole link, not
+ * just one device. It should only be called after all devices have
+ * registered their memory on the link.
+ *
+ * @afu: The AFU that has the LPC memory to map
+ *
+ * Returns 0 on success, negative on failure
+ */
+int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu);
+
+/**
+ * ocxl_afu_lpc_mem() - Get the physical address range of LPC memory for an AFU
+ * @afu: The AFU associated with the LPC memory
+ *
+ * Returns a pointer to the resource struct for the physical address range
+ */
+struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu);
+
 /**
  * ocxl_afu_config() - Get a pointer to the config for an AFU
  * @afu: a pointer to the AFU to get the config for
-- 
2.24.1
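
The address arithmetic in ocxl_afu_map_lpc_mem() can be checked in isolation. The sketch below uses hypothetical names (struct range_sketch, lpc_range()) and plain integers in place of struct resource; it only mirrors the start/end computation, not the link mapping itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/end pair of a struct resource */
struct range_sketch {
	uint64_t start;
	uint64_t end;	/* inclusive, as in struct resource */
};

/*
 * Compute the physical range of a memory region that sits at `offset`
 * inside the link's LPC window starting at `base`. A zero-sized region
 * is left as {0, 0}, matching the patch, which skips empty regions.
 */
static struct range_sketch lpc_range(uint64_t base, uint64_t offset,
				     uint64_t size)
{
	struct range_sketch r = { 0, 0 };

	/* The sketch assumes the sum does not wrap around */
	assert(base + offset >= base);

	if (size > 0) {
		r.start = base + offset;
		r.end = r.start + size - 1;
	}
	return r;
}
```

For example, a base of 0x1000, an offset of 0x100 and a size of 0x200 give the inclusive range 0x1100 to 0x12ff.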



[PATCH v3 09/27] ocxl: Save the device serial number in ocxl_fn

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This patch retrieves the serial number of the card and makes it available
to consumers of the ocxl driver via the ocxl_fn struct.

Signed-off-by: Alastair D'Silva 
Acked-by: Frederic Barrat 
Acked-by: Andrew Donnellan 
---
 drivers/misc/ocxl/config.c | 46 ++
 include/misc/ocxl.h|  1 +
 2 files changed, 47 insertions(+)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 701ae6216abf..ce33fafa7b50 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -71,6 +71,51 @@ static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 
afu_idx)
return 0;
 }
 
+/**
+ * get_function_0() - Find a related PCI device (function 0)
+ * @dev: PCI device to match
+ *
+ * Returns a pointer to the related device, or null if not found
+ */
+static struct pci_dev *get_function_0(struct pci_dev *dev)
+{
+   unsigned int devfn = PCI_DEVFN(PCI_SLOT(dev->devfn), 0);
+
+   return pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
+  dev->bus->number, devfn);
+}
+
+static void read_serial(struct pci_dev *dev, struct ocxl_fn_config *fn)
+{
+   u32 low, high;
+   int pos;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DSN);
+   if (pos) {
+   pci_read_config_dword(dev, pos + 0x04, &low);
+   pci_read_config_dword(dev, pos + 0x08, &high);
+
+   fn->serial = low | ((u64)high) << 32;
+
+   return;
+   }
+
+   if (PCI_FUNC(dev->devfn) != 0) {
+   struct pci_dev *related = get_function_0(dev);
+
+   if (!related) {
+   fn->serial = 0;
+   return;
+   }
+
+   read_serial(related, fn);
+   pci_dev_put(related);
+   return;
+   }
+
+   fn->serial = 0;
+}
+
 static void read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn)
 {
u16 val;
@@ -208,6 +253,7 @@ int ocxl_config_read_function(struct pci_dev *dev, struct 
ocxl_fn_config *fn)
int rc;
 
read_pasid(dev, fn);
+   read_serial(dev, fn);
 
rc = read_dvsec_tl(dev, fn);
if (rc) {
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index d8b0b4d46bfb..b8514dc64bd0 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -46,6 +46,7 @@ struct ocxl_fn_config {
int dvsec_afu_info_pos; /* offset of the AFU information DVSEC */
s8 max_pasid_log;
s8 max_afu_index;
+   u64 serial;
 };
 
 enum ocxl_endian {
-- 
2.24.1
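
The way read_serial() assembles the 64-bit serial number from the two 32-bit config space reads can be shown standalone. combine_serial() and combine_serial_example() are hypothetical names used only for this sketch; note that in the patch's expression `low | ((u64)high) << 32` the shift binds tighter than the OR, so the result is the same as below.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the lower dword (read at capability offset 0x04 in the patch)
 * and the upper dword (offset 0x08) into one 64-bit serial number.
 */
static uint64_t combine_serial(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/* Usage example: the high dword lands in bits 63:32 */
static void combine_serial_example(void)
{
	assert(combine_serial(0x89abcdefu, 0x01234567u) ==
	       0x0123456789abcdefULL);
}
```

The cast before the shift matters: shifting a 32-bit value by 32 without widening it first would be undefined in C.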



[PATCH v3 06/27] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

Tally up the LPC memory on an OpenCAPI link & allow it to be mapped

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/core.c  | 10 ++
 drivers/misc/ocxl/link.c  | 53 +++
 drivers/misc/ocxl/ocxl_internal.h | 33 +++
 3 files changed, 96 insertions(+)

diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
index b7a09b21ab36..2531c6cf19a0 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -230,8 +230,18 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, 
struct pci_dev *dev)
if (rc)
goto err_free_pasid;
 
+   if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) {
+   rc = ocxl_link_add_lpc_mem(afu->fn->link, 
afu->config.lpc_mem_offset,
+  afu->config.lpc_mem_size +
+  
afu->config.special_purpose_mem_size);
+   if (rc)
+   goto err_free_mmio;
+   }
+
return 0;
 
+err_free_mmio:
+   unmap_mmio_areas(afu);
 err_free_pasid:
reclaim_afu_pasid(afu);
 err_free_actag:
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 58d111afd9f6..1e039cc5ebe5 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -84,6 +84,11 @@ struct ocxl_link {
int dev;
atomic_t irq_available;
struct spa *spa;
+   struct mutex lpc_mem_lock; /* protects lpc_mem & lpc_mem_sz */
+   u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */
+   u64 lpc_mem;
+   int lpc_consumers;
+
void *platform_data;
 };
 static struct list_head links_list = LIST_HEAD_INIT(links_list);
@@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, 
struct ocxl_link **out_l
if (rc)
goto err_spa;
 
+   mutex_init(&link->lpc_mem_lock);
+
/* platform specific hook */
rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
&link->platform_data);
@@ -711,3 +718,49 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
atomic_inc(&link->irq_available);
 }
 EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
+
+int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size)
+{
+	struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+	// Check for overflow
+	if (offset > (offset + size))
+		return -EINVAL;
+
+	mutex_lock(&link->lpc_mem_lock);
+	link->lpc_mem_sz = max(link->lpc_mem_sz, offset + size);
+
+	mutex_unlock(&link->lpc_mem_lock);
+
+	return 0;
+}
+
+u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
+{
+	struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+	mutex_lock(&link->lpc_mem_lock);
+
+	if(!link->lpc_mem)
+		link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, link->lpc_mem_sz);
+
+	if(link->lpc_mem)
+		link->lpc_consumers++;
+	mutex_unlock(&link->lpc_mem_lock);
+
+	return link->lpc_mem;
+}
+
+void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
+{
+	struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+	mutex_lock(&link->lpc_mem_lock);
+	WARN_ON(--link->lpc_consumers < 0);
+	if (link->lpc_consumers == 0) {
+		pnv_ocxl_platform_lpc_release(pdev);
+		link->lpc_mem = 0;
+	}
+
+	mutex_unlock(&link->lpc_mem_lock);
+}
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
index 198e4e4bc51d..d0c8c4838f42 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -142,4 +142,37 @@ int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 offset);
 u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id);
 void ocxl_afu_irq_free_all(struct ocxl_context *ctx);
 
+/**
+ * ocxl_link_add_lpc_mem() - Increment the amount of memory required by an OpenCAPI link
+ *
+ * @link_handle: The OpenCAPI link handle
+ * @offset: The offset of the memory to add
+ * @size: The amount of memory to increment by
+ *
+ * Returns 0 on success, negative on overflow
+ */
+int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size);
+
+/**
+ * ocxl_link_lpc_map() - Map the LPC memory for an OpenCAPI device
+ * Since LPC memory belongs to a link, the whole LPC memory available
+ * on the link must be mapped in order to make it accessible to a device.
+ * @link_handle: The OpenCAPI link handle
+ * @pdev: A device that is on the link
+ *
+ * Returns the address of the mapped LPC memory, or 0 on error
+ */
+u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev);
+
+/**
+ * ocxl_link_lpc_release() - Release the LPC memory device for an OpenCAPI device
+ *
+ * Offlines LPC memory on an OpenCAPI link for a device. If this is the
+ * last device on the link to release the memory, unmap it from the link.
+ *
+ * @link_handle: The 

[PATCH v3 00/27] Add support for OpenCAPI Persistent Memory devices

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

This series adds support for OpenCAPI Persistent Memory devices, exposing
them as nvdimms so that we can make use of the existing infrastructure.

Alastair D'Silva (27):
  powerpc: Add OPAL calls for LPC memory alloc/release
  mm/memory_hotplug: Allow check_hotplug_memory_addressable to be called
from drivers
  powerpc: Map & release OpenCAPI LPC memory
  ocxl: Remove unnecessary externs
  ocxl: Address kernel doc errors & warnings
  ocxl: Tally up the LPC memory on a link & allow it to be mapped
  ocxl: Add functions to map/unmap LPC memory
  ocxl: Emit a log message showing how much LPC memory was detected
  ocxl: Save the device serial number in ocxl_fn
  powerpc: Add driver for OpenCAPI Persistent Memory
  powerpc: Enable the OpenCAPI Persistent Memory driver for
powernv_defconfig
  powerpc/powernv/pmem: Add register addresses & status values to the
header
  powerpc/powernv/pmem: Read the capability registers & wait for device
ready
  powerpc/powernv/pmem: Add support for Admin commands
  powerpc/powernv/pmem: Add support for near storage commands
  powerpc/powernv/pmem: Register a character device for userspace to
interact with
  powerpc/powernv/pmem: Implement the Read Error Log command
  powerpc/powernv/pmem: Add controller dump IOCTLs
  powerpc/powernv/pmem: Add an IOCTL to report controller statistics
  powerpc/powernv/pmem: Forward events to userspace
  powerpc/powernv/pmem: Add an IOCTL to request controller health & perf
data
  powerpc/powernv/pmem: Implement the heartbeat command
  powerpc/powernv/pmem: Add debug IOCTLs
  powerpc/powernv/pmem: Expose SMART data via ndctl
  powerpc/powernv/pmem: Expose the serial number in sysfs
  powerpc/powernv/pmem: Expose the firmware version in sysfs
  MAINTAINERS: Add myself & nvdimm/ocxl to ocxl

 MAINTAINERS   |3 +
 arch/powerpc/configs/powernv_defconfig|5 +
 arch/powerpc/include/asm/opal-api.h   |2 +
 arch/powerpc/include/asm/opal.h   |3 +
 arch/powerpc/include/asm/pnv-ocxl.h   |   40 +-
 arch/powerpc/platforms/powernv/Kconfig|3 +
 arch/powerpc/platforms/powernv/Makefile   |1 +
 arch/powerpc/platforms/powernv/ocxl.c |   43 +
 arch/powerpc/platforms/powernv/opal-call.c|2 +
 arch/powerpc/platforms/powernv/pmem/Kconfig   |   21 +
 arch/powerpc/platforms/powernv/pmem/Makefile  |7 +
 arch/powerpc/platforms/powernv/pmem/ocxl.c| 1991 +
 .../platforms/powernv/pmem/ocxl_internal.c|  213 ++
 .../platforms/powernv/pmem/ocxl_internal.h|  254 +++
 .../platforms/powernv/pmem/ocxl_sysfs.c   |   46 +
 drivers/misc/ocxl/config.c|   74 +-
 drivers/misc/ocxl/core.c  |   61 +
 drivers/misc/ocxl/link.c  |   53 +
 drivers/misc/ocxl/ocxl_internal.h |   45 +-
 include/linux/memory_hotplug.h|5 +
 include/misc/ocxl.h   |  122 +-
 include/uapi/linux/ndctl.h|1 +
 include/uapi/nvdimm/ocxl-pmem.h   |  127 ++
 mm/memory_hotplug.c   |4 +-
 24 files changed, 3029 insertions(+), 97 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pmem/Kconfig
 create mode 100644 arch/powerpc/platforms/powernv/pmem/Makefile
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl.c
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_internal.h
 create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_sysfs.c
 create mode 100644 include/uapi/nvdimm/ocxl-pmem.h

-- 
2.24.1



[PATCH v3 01/27] powerpc: Add OPAL calls for LPC memory alloc/release

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

Add OPAL calls for LPC memory alloc/release

Signed-off-by: Alastair D'Silva 
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
---
 arch/powerpc/include/asm/opal-api.h| 2 ++
 arch/powerpc/include/asm/opal.h| 3 +++
 arch/powerpc/platforms/powernv/opal-call.c | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index c1f25a760eb1..9298e603001b 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,6 +208,8 @@
 #define OPAL_HANDLE_HMI2           166
 #define OPAL_NX_COPROC_INIT        167
 #define OPAL_XIVE_GET_VP_STATE     170
+#define OPAL_NPU_MEM_ALLOC         171
+#define OPAL_NPU_MEM_RELEASE       172
 #define OPAL_MPIPL_UPDATE          173
 #define OPAL_MPIPL_REGISTER_TAG    174
 #define OPAL_MPIPL_QUERY_TAG       175
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9986ac34b8e2..8f7727e0f9ce 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -39,6 +39,9 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t bdfn,
uint64_t PE_handle);
 int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
uint64_t rate_phys, uint32_t size);
+int64_t opal_npu_mem_alloc(uint64_t phb_id, uint32_t bdfn,
+   uint64_t size, uint64_t *bar);
+int64_t opal_npu_mem_release(uint64_t phb_id, uint32_t bdfn);
 
 int64_t opal_console_write(int64_t term_number, __be64 *length,
   const uint8_t *buffer);
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index 5cd0f52d258f..f26e58b72c04 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -287,6 +287,8 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,        OPAL_SENSOR_READ_U64);
 OPAL_CALL(opal_sensor_group_enable,    OPAL_SENSOR_GROUP_ENABLE);
 OPAL_CALL(opal_nx_coproc_init,         OPAL_NX_COPROC_INIT);
+OPAL_CALL(opal_npu_mem_alloc,          OPAL_NPU_MEM_ALLOC);
+OPAL_CALL(opal_npu_mem_release,        OPAL_NPU_MEM_RELEASE);
 OPAL_CALL(opal_mpipl_update,           OPAL_MPIPL_UPDATE);
 OPAL_CALL(opal_mpipl_register_tag,     OPAL_MPIPL_REGISTER_TAG);
 OPAL_CALL(opal_mpipl_query_tag,        OPAL_MPIPL_QUERY_TAG);
-- 
2.24.1



[PATCH v3 02/27] mm/memory_hotplug: Allow check_hotplug_memory_addressable to be called from drivers

2020-02-20 Thread Alastair D'Silva
From: Alastair D'Silva 

When setting up OpenCAPI connected persistent memory, the range check may
not be performed until quite late (or perhaps not at all, if the user does
not establish a DAX device).

This patch makes the range check callable so we can perform the check while
probing the OpenCAPI SCM device.

Signed-off-by: Alastair D'Silva 
---
 include/linux/memory_hotplug.h | 5 +
 mm/memory_hotplug.c| 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f4d59155f3d4..34a69aecc45e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -337,6 +337,11 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void set_zone_contiguous(struct zone *zone);
 extern void clear_zone_contiguous(struct zone *zone);
 
+#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+int check_hotplug_memory_addressable(unsigned long pfn,
+   unsigned long nr_pages);
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..14945f033594 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -276,8 +276,8 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
return 0;
 }
 
-static int check_hotplug_memory_addressable(unsigned long pfn,
-   unsigned long nr_pages)
+int check_hotplug_memory_addressable(unsigned long pfn,
+unsigned long nr_pages)
 {
const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
 
-- 
2.24.1



Re: [PATCH] evh_bytechan: fix out of bounds accesses

2020-02-20 Thread Stephen Rothwell
Hi all,

On Thu, 16 Jan 2020 11:37:14 +1100 Stephen Rothwell  
wrote:
>
> On Wed, 15 Jan 2020 14:01:35 -0600 Scott Wood  wrote:
> >
> > On Thu, 2020-01-16 at 06:42 +1100, Stephen Rothwell wrote:  
> > > Hi Timur,
> > > 
> > > On Wed, 15 Jan 2020 07:25:45 -0600 Timur Tabi  wrote:   
> > >  
> > > > On 1/14/20 12:31 AM, Stephen Rothwell wrote:
> > > > > +/**
> > > > > + * ev_byte_channel_send - send characters to a byte stream
> > > > > + * @handle: byte stream handle
> > > > > + * @count: (input) num of chars to send, (output) num chars sent
> > > > > + * @bp: pointer to chars to send
> > > > > + *
> > > > > + * Returns 0 for success, or an error code.
> > > > > + */
> > > > > +static unsigned int ev_byte_channel_send(unsigned int handle,
> > > > > + unsigned int *count, const char *bp)  
> > > > 
> > > > Well, now you've moved this into the .c file and it is no longer 
> > > > available to other callers.  Anything wrong with keeping it in the .h
> > > > file?
> > > 
> > > There are currently no other callers - are there likely to be in the
> > > future?  Even if there are, is it time critical enough that it needs to
> > > be inlined everywhere?
> > 
> > It's not performance critical and there aren't likely to be other users --
> > just a matter of what's cleaner.  FWIW I'd rather see the original patch,
> > that keeps the raw asm hcall stuff as simple wrappers in one place.  
> 
> And I don't mind either way :-)
> 
> I just want to get rid of the warnings.

Any progress with this?
-- 
Cheers,
Stephen Rothwell




Re: [PATCH] selftest/lkdtm: Don't pollute 'git status'

2020-02-20 Thread Kees Cook
On Thu, Feb 06, 2020 at 08:11:39AM +, Christophe Leroy wrote:
> Commit 46d1a0f03d66 ("selftests/lkdtm: Add tests for LKDTM targets")
> added generation of lkdtm test scripts.
> 
> Ignore those generated scripts when performing 'git status'
> 
> Fixes: 46d1a0f03d66 ("selftests/lkdtm: Add tests for LKDTM targets")
> Signed-off-by: Christophe Leroy 

Ah! Yes, a very good idea. Thanks!

Reviewed-by: Kees Cook 

-Kees

> ---
>  .gitignore | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/.gitignore b/.gitignore
> index b849a72d69d5..bb05dce58f8e 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -100,6 +100,10 @@ modules.order
>  /include/ksym/
>  /arch/*/include/generated/
>  
> +# Generated lkdtm tests
> +/tools/testing/selftests/lkdtm/*.sh
> +!/tools/testing/selftests/lkdtm/run.sh
> +
>  # stgit generated dirs
>  patches-*
>  
> -- 
> 2.25.0
> 

-- 
Kees Cook


Re: [PATCH kernel 5/5] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB

2020-02-20 Thread Alex Williamson
On Tue, 18 Feb 2020 18:36:50 +1100
Alexey Kardashevskiy  wrote:

> So far the only option for a big 64bit DMA window was a window located
> at 0x800... (1<<59) which creates problems for devices
> supporting smaller DMA masks.
> 
> This exploits a POWER9 PHB option to allow the second DMA window to map
> at 0 and advertises it with a 4GB offset to avoid overlap with
> the default 32bit window.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  include/uapi/linux/vfio.h   |  2 ++
>  drivers/vfio/vfio_iommu_spapr_tce.c | 10 --
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..c7f89d47335a 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -831,9 +831,11 @@ struct vfio_iommu_spapr_tce_info {
>   __u32 argsz;
>   __u32 flags;
>  #define VFIO_IOMMU_SPAPR_INFO_DDW       (1 << 0)   /* DDW supported */
> +#define VFIO_IOMMU_SPAPR_INFO_DDW_START (1 << 1)   /* DDW offset */
>   __u32 dma32_window_start;   /* 32 bit window start (bytes) */
>   __u32 dma32_window_size;    /* 32 bit window size (bytes) */
>   struct vfio_iommu_spapr_tce_ddw_info ddw;
> + __u64 dma64_window_start;
>  };
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 16b3adc508db..4f22be3c4aa2 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -691,7 +691,7 @@ static long tce_iommu_create_window(struct tce_container *container,
>   container->tables[num] = tbl;
>  
>   /* Return start address assigned by platform in create_table() */
> - *start_addr = tbl->it_offset << tbl->it_page_shift;
> + *start_addr = tbl->it_dmaoff << tbl->it_page_shift;
>  
>   return 0;
>  
> @@ -842,7 +842,13 @@ static long tce_iommu_ioctl(void *iommu_data,
>   info.ddw.levels = table_group->max_levels;
>   }
>  
> - ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info, ddw);
> + ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> + dma64_window_start);

This breaks existing users, now they no longer get the ddw struct
unless their argsz also includes the new dma64 window field.

> +
> + if (info.argsz >= ddwsz) {
> + info.flags |= VFIO_IOMMU_SPAPR_INFO_DDW_START;
> + info.dma64_window_start = table_group->tce64_start;
> + }

This is inconsistent with ddw where we set the flag regardless of
argsz, but obviously only provide the field to the user if they've
provided room for it.  Thanks,

Alex

>  
>   if (info.argsz >= ddwsz)
>   minsz = ddwsz;
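
For reference, the two review comments above can be illustrated with a
small userspace sketch of the argsz/minsz pattern. Everything here is
hypothetical (struct demo_info, the DEMO_* flags, demo_get_info() and
demo_offsetofend() are made-up names, not the VFIO uapi); it only shows
one way to keep older callers working: each extension is gated on its
own size threshold, and the capability flags are set regardless of
argsz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel's offsetofend() helper. */
#define demo_offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

#define DEMO_FLAG_DDW       (1u << 0)	/* hypothetical: extension 1 supported */
#define DEMO_FLAG_DDW_START (1u << 1)	/* hypothetical: extension 2 supported */

/* Hypothetical info struct that grew twice over time. */
struct demo_info {
	uint32_t argsz;
	uint32_t flags;
	uint32_t base_field;	/* original layout ends here */
	uint32_t ddw_field;	/* first extension */
	uint64_t dma64_start;	/* second extension */
};

/*
 * Fill *info for a caller that passed user_argsz bytes.
 * Returns the number of bytes that would be copied back,
 * or 0 if user_argsz is too small for the original layout.
 */
static inline size_t demo_get_info(struct demo_info *info, uint32_t user_argsz)
{
	size_t minsz = demo_offsetofend(struct demo_info, base_field);
	const size_t ddwsz = demo_offsetofend(struct demo_info, ddw_field);
	const size_t dma64sz = demo_offsetofend(struct demo_info, dma64_start);

	assert(info != NULL);
	if (user_argsz < minsz)
		return 0;

	memset(info, 0, sizeof(*info));
	info->argsz = user_argsz;
	info->base_field = 1;

	/* Flags describe capabilities, so they do not depend on argsz. */
	info->flags |= DEMO_FLAG_DDW | DEMO_FLAG_DDW_START;

	/* Each extension has its own threshold; older callers still get ddw. */
	if (user_argsz >= ddwsz) {
		info->ddw_field = 2;
		minsz = ddwsz;
	}
	if (user_argsz >= dma64sz) {
		info->dma64_start = 4ULL << 30;	/* 4GB offset, as in the description */
		minsz = dma64sz;
	}
	return minsz;
}
```

A caller built against the older layout (argsz covering only ddw_field)
still receives the ddw data, and can tell from the flags that a newer
field exists even though it did not provide room for it.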



Re: [PATCH] powerpc/8xx: Fix clearing of bits 20-23 in ITLB miss

2020-02-20 Thread Leonardo Bras
On Tue, 2020-02-11 at 01:28 -0300, Leonardo Bras wrote:
> Looks a valid change.
> rlwimi  r10, r10, 0, 0x0f00 means: 
> r10 = ((r10 << 0) & 0x0f00) | (r10 & ~0x0f00) which ends up being
> r10 = r10 
> 
> On ISA, rlwinm is recommended for clearing high order bits.
> rlwinm  r10, r10, 0, ~0x0f00 means:
> r10 = (r10 << 0) & ~0x0f00
> 
> Which does exactly what the comments suggests.
> 
> FWIW:
> Reviwed-by: Leonardo Bras 


Sorry, I just realized the above was not very clear on my part.

What I meant to say was:
I think your change is correct, as it correctly fixes this line.

I would suggest adding the text below to your commit message, making
it easier to understand why rlwimi is not the right instruction to clear
bits 20-23, and why rlwinm is.

The current instruction can be translated to C as:
rlwimi  r10, r10, 0, 0x0f00
r10 = ((r10 << 0) & 0x0f00) | (r10 & ~0x0f00)   ->
r10 = (r10 & 0x0f00) | (r10 & ~0x0f00)  ->
r10 = r10

The new proposed instruction can be translated to C as:
rlwinm  r10, r10, 0, ~0x0f00->
r10 = (r10 << 0) & ~0x0f00

Which clears bits 20-23, as the comment in the code states.
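
To make the difference concrete, here is a minimal C sketch of the two
instructions' semantics. The helper names (rotl32, rlwimi, rlwinm) and
the explicit 32-bit mask argument are illustrative assumptions: the real
instructions encode the mask as MB/ME bit positions, and none of this is
code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits (0 <= sh < 32). */
static inline uint32_t rotl32(uint32_t x, unsigned int sh)
{
	assert(sh < 32);
	return sh ? (x << sh) | (x >> (32 - sh)) : x;
}

/* rlwimi: insert the rotated source into ra under the mask. */
static inline uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned int sh,
			      uint32_t mask)
{
	return (rotl32(rs, sh) & mask) | (ra & ~mask);
}

/* rlwinm: AND the rotated source with the mask. */
static inline uint32_t rlwinm(uint32_t rs, unsigned int sh, uint32_t mask)
{
	return rotl32(rs, sh) & mask;
}
```

With the same register as source and destination and a zero shift,
rlwimi(r10, r10, 0, 0x0f00) returns r10 unchanged, while
rlwinm(r10, 0, ~0x0f00) returns r10 with bits 20-23 cleared.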

Best regards,

Leonardo Bras






Re: [PATCH] KVM: PPC: Book3S HV: Treat TM-related invalid form instructions on P9 like the valid ones

2020-02-20 Thread Gustavo Romero

Hi Leonardo,

Thanks a lot for the review.

On 02/20/2020 02:51 PM, Leonardo Bras wrote:

+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+


I could not see where this is used.


This is used by pr_warn_ratelimited() below so the module name is printed before
the message, for instance:

[531454.670909] kvm_hv: Unrecognized TM-related instruction 0x7c00075c for emulation
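
As a rough userspace illustration of how the prefix ends up in the
message, the sketch below reuses the pr_fmt() definition from the patch
and relies on adjacent string literals being concatenated at compile
time. KBUILD_MODNAME is normally supplied by the kernel build system, so
defining it by hand here is an assumption, and demo_warn_instr() and
demo_has_module_prefix() are made-up helpers standing in for
pr_warn_ratelimited().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumption: stand-in for the value the kernel build system provides. */
#define KBUILD_MODNAME "kvm_hv"

/* Same definition as in the patch: prepend the module name to each format. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Hypothetical helper: format the warning into buf instead of the kernel log. */
static inline int demo_warn_instr(char *buf, size_t size, unsigned int instr)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size,
			pr_fmt("Unrecognized TM-related instruction %#x for emulation"),
			instr);
}

/* Returns 1 if msg starts with "<module name>: ", 0 otherwise. */
static inline int demo_has_module_prefix(const char *msg)
{
	const char *prefix = KBUILD_MODNAME ": ";

	return strncmp(msg, prefix, strlen(prefix)) == 0;
}
```

In the kernel the pr_*() macros wrap their format argument in pr_fmt(),
which is why the define has to appear before the printk headers are
included.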



  #include 

  #include 
@@ -44,7 +46,18 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
u64 newmsr, bescr;
int ra, rs;

-   switch (instr & 0xfc0007ff) {
+   /*
+* rfid, rfebb, and mtmsrd encode bit 31 = 0 since it's a reserved bit
+* in these instructions, so masking bit 31 out doesn't change these
+* instructions. For treclaim., tsr., and trechkpt. instructions if bit
+* 31 = 0 then they are per ISA invalid forms, however P9 UM, in section
+* 4.6.10 Book II Invalid Forms, informs specifically that ignoring bit
+* 31 is an acceptable way to handle these invalid forms that have
+* bit 31 = 0. Moreover, for emulation purposes both forms (w/ and wo/
+* bit 31 set) can generate a softpatch interrupt. Hence both forms
+* are handled below for these instructions so they behave the same way.
+*/
+   switch (instr & PO_XOP_OPCODE_MASK) {




-   case PPC_INST_TRECHKPT:
+   /* ignore bit 31, see comment above */
+   case (PPC_INST_TRECHKPT & PO_XOP_OPCODE_MASK):
/* XXX do we need to check for PR=0 here? */
/* check for TM disabled in the HFSCR or MSR */
if (!(vcpu->arch.hfscr & HFSCR_TM)) {
@@ -208,6 +224,8 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
}



Seems good, using the same flag to mask out bit 31 of these macros.
They are used only in a few places, and I think removing the macro bit
would be ok, but I think your way is better to keep it documented.

I just noticed that there is a similar function that uses PPC_INST_TSR:
kvmhv_p9_tm_emulation_early @ arch/powerpc/kvm/book3s_hv_tm_builtin.c.
Wouldn't it need to be changed as well?


oh! you're right, I forgot that one. I'll send a v3.



/* What should we do here? We didn't recognize the instruction */
-   WARN_ON_ONCE(1);
+   kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
+   pr_warn_ratelimited("Unrecognized TM-related instruction %#x for 
emulation", instr);
+
return RESUME_GUEST;
  }


I suppose this is the right thing to do, but I think it would be better
to give this change its own patch.

What do you think?


I think it's sufficiently self-contained and trivial to be in the same file and
to be in a single commit.


Best regards,
Gustavo


Re: [PATCH] selftest/lkdtm: Don't pollute 'git status'

2020-02-20 Thread Shuah Khan

On 2/20/20 7:58 AM, Christophe Leroy wrote:

ping

On 02/06/2020 08:11 AM, Christophe Leroy wrote:

Commit 46d1a0f03d66 ("selftests/lkdtm: Add tests for LKDTM targets")
added generation of lkdtm test scripts.

Ignore those generated scripts when performing 'git status'

Fixes: 46d1a0f03d66 ("selftests/lkdtm: Add tests for LKDTM targets")
Signed-off-by: Christophe Leroy 


Without this, 'git status' now reports the following crap and real 
problems are drowned in the middle, that's annoying.




I will pull this in. Please cc linux-kselftest mailing list in the
future.

thanks,
-- Shuah



[powerpc:fixes] BUILD SUCCESS 9eb425b2e04e0e3006adffea5bf5f227a896f128

2020-02-20 Thread kbuild test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  fixes
branch HEAD: 9eb425b2e04e0e3006adffea5bf5f227a896f128  powerpc/entry: Fix an #if which should be an #ifdef in entry_32.S

elapsed time: 1724m

configs tested: 227
configs skipped: 148

The following configs have been built successfully.
More configs may be tested in the coming days.

arm         allmodconfig
arm         allnoconfig
arm         allyesconfig
arm64       allmodconfig
arm64       allnoconfig
arm64       allyesconfig
arm         at91_dt_defconfig
arm         efm32_defconfig
arm         exynos_defconfig
arm         multi_v5_defconfig
arm         multi_v7_defconfig
arm         shmobile_defconfig
arm         sunxi_defconfig
arm64       defconfig
sparc       allyesconfig
sparc64     allmodconfig
s390        allmodconfig
sparc64     allnoconfig
i386        allnoconfig
xtensa      common_defconfig
s390        debug_defconfig
sh          titan_defconfig
ia64        allyesconfig
riscv       nommu_virt_defconfig
h8300       h8300h-sim_defconfig
nios2       3c120_defconfig
h8300       h8s-sim_defconfig
riscv       allyesconfig
m68k        allmodconfig
openrisc    or1ksim_defconfig
um          defconfig
ia64        allnoconfig
openrisc    simple_smp_defconfig
s390        defconfig
sh          sh7785lcr_32bit_defconfig
ia64        alldefconfig
alpha       defconfig
nds32       defconfig
riscv       defconfig
sparc64     defconfig
i386        alldefconfig
i386        allyesconfig
i386        defconfig
ia64        allmodconfig
ia64        defconfig
c6x         allyesconfig
c6x         evmc6678_defconfig
nios2       10m50_defconfig
xtensa      iss_defconfig
csky        defconfig
nds32       allnoconfig
h8300       edosk2674_defconfig
m68k        m5475evb_defconfig
m68k        multi_defconfig
m68k        sun3_defconfig
arc         allyesconfig
arc         defconfig
microblaze  mmu_defconfig
microblaze  nommu_defconfig
powerpc     allnoconfig
powerpc     defconfig
powerpc     ppc64_defconfig
powerpc     rhel-kconfig
mips        32r2_defconfig
mips        64r6el_defconfig
mips        allmodconfig
mips        allnoconfig
mips        allyesconfig
mips        fuloong2e_defconfig
mips        malta_kvm_defconfig
parisc      allnoconfig
parisc      allyesconfig
parisc      generic-32bit_defconfig
parisc      generic-64bit_defconfig
x86_64      randconfig-a001-20200220
x86_64      randconfig-a002-20200220
x86_64      randconfig-a003-20200220
i386        randconfig-a001-20200220
i386        randconfig-a002-20200220
i386        randconfig-a003-20200220
x86_64      randconfig-a001-20200219
x86_64      randconfig-a002-20200219
x86_64      randconfig-a003-20200219
i386        randconfig-a001-20200219
i386        randconfig-a002-20200219
i386        randconfig-a003-20200219
alpha       randconfig-a001-20200220
m68k        randconfig-a001-20200220
mips        randconfig-a001-20200220
nds32       randconfig-a001-20200220
parisc      randconfig-a001-20200220
riscv       randconfig-a001-20200220
c6x         randconfig-a001-20200220
h8300       randconfig-a001-20200220
microblaze  randconfig-a001-20200220
nios2       randconfig-a001-20200220
sparc64     randconfig-a001-20200220
c6x         randconfig-a001-20200219
h8300       randconfig-a001-20200219
microblaze  randconfig-a001-20200219
nios2

Re: [PATCH] KVM: PPC: Book3S HV: Treat TM-related invalid form instructions on P9 like the valid ones

2020-02-20 Thread Leonardo Bras
Hello Gustavo, comments inline:

On Tue, 2020-02-18 at 16:13 -0500, Gustavo Romero wrote:

> diff --git a/arch/powerpc/kvm/book3s_hv_tm.c b/arch/powerpc/kvm/book3s_hv_tm.c
> index 0db937497169..cc90b8b82329 100644
> --- a/arch/powerpc/kvm/book3s_hv_tm.c
> +++ b/arch/powerpc/kvm/book3s_hv_tm.c
> @@ -3,6 +3,8 @@
>   * Copyright 2017 Paul Mackerras, IBM Corp. 
>   */
> 
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +

I could not see where this is used.

>  #include 
> 
>  #include 
> @@ -44,7 +46,18 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
>   u64 newmsr, bescr;
>   int ra, rs;
> 
> - switch (instr & 0xfc0007ff) {
> + /*
> +  * rfid, rfebb, and mtmsrd encode bit 31 = 0 since it's a reserved bit
> +  * in these instructions, so masking bit 31 out doesn't change these
> +  * instructions. For treclaim., tsr., and trechkpt. instructions if bit
> +  * 31 = 0 then they are per ISA invalid forms, however P9 UM, in section
> +  * 4.6.10 Book II Invalid Forms, informs specifically that ignoring bit
> +  * 31 is an acceptable way to handle these invalid forms that have
> +  * bit 31 = 0. Moreover, for emulation purposes both forms (w/ and wo/
> +  * bit 31 set) can generate a softpatch interrupt. Hence both forms
> +  * are handled below for these instructions so they behave the same way.
> +  */
> + switch (instr & PO_XOP_OPCODE_MASK) {
> 

> - case PPC_INST_TRECHKPT:
> + /* ignore bit 31, see comment above */
> + case (PPC_INST_TRECHKPT & PO_XOP_OPCODE_MASK):
>   /* XXX do we need to check for PR=0 here? */
>   /* check for TM disabled in the HFSCR or MSR */
>   if (!(vcpu->arch.hfscr & HFSCR_TM)) {
> @@ -208,6 +224,8 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
>   }
> 

Seems good, using the same flag to mask out bit 31 of these macros.
They are used only in a few places, and I think removing the macro bit
would be ok, but I think your way is better to keep it documented. 
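
To illustrate the masking, here is a small sketch of how ignoring bit 31
makes both forms match the same case. The DEMO_* constants are
assumptions for illustration: 0xfc0007ff is the mask visible in the old
code, DEMO_PO_XOP_MASK is assumed to be that mask with the lowest-order
bit (ISA bit 31) cleared, and the treclaim. encoding is assumed rather
than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_OLD_MASK      0xfc0007ffu	/* mask used before the patch */
#define DEMO_PO_XOP_MASK   0xfc0007feu	/* assumed: old mask without bit 31 */
#define DEMO_INST_TRECLAIM 0x7c00075du	/* assumed encoding of treclaim. */

/* The new mask differs from the old one only in the lowest-order bit. */
static_assert((DEMO_OLD_MASK & ~1u) == DEMO_PO_XOP_MASK,
	      "new mask must only drop bit 31");

/* Old behaviour: only the form with bit 31 set is recognized. */
static inline bool demo_is_treclaim_old(uint32_t instr)
{
	return (instr & DEMO_OLD_MASK) == DEMO_INST_TRECLAIM;
}

/* New behaviour: bit 31 is ignored on both sides of the comparison. */
static inline bool demo_is_treclaim(uint32_t instr)
{
	return (instr & DEMO_PO_XOP_MASK) ==
	       (DEMO_INST_TRECLAIM & DEMO_PO_XOP_MASK);
}
```

Under these assumptions the value 0x7c00075c is rejected by the old
comparison but accepted by the new one, while 0x7c00075d is accepted by
both.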

I just noticed that there is a similar function that uses PPC_INST_TSR:
kvmhv_p9_tm_emulation_early @ arch/powerpc/kvm/book3s_hv_tm_builtin.c.
Wouldn't it need to be changed as well?

>   /* What should we do here? We didn't recognize the instruction */
> - WARN_ON_ONCE(1);
> + kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
> + pr_warn_ratelimited("Unrecognized TM-related instruction %#x for 
> emulation", instr);
> +
>   return RESUME_GUEST;
>  }

I suppose this is the right thing to do, but I think it would be better
to give this change its own patch.

What do you think?

Best regards,
Leonardo Bras





Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Christophe Leroy




Le 20/02/2020 à 18:34, Radu Rendec a écrit :

On 02/20/2020 at 11:25 AM Christophe Leroy  wrote:

Le 20/02/2020 à 17:02, Radu Rendec a écrit :

On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:

On 02/19/2020 10:39 PM, Radu Rendec wrote:

On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:

Interesting.

0x900 is the address of the timer interrupt.

Would the MCE occur just after the timer interrupt ?


I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

   bad_addr_base = ioremap(0xf000, 0x100);
   x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).


Can you try to see how much time there is between your read and the MCE ?
The below should allow it, you'll see first value in r13 and the other
in r14 (mce.c is your test code)

Also provide the timebase frequency as reported in /proc/cpuinfo


I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.

# cat /proc/cpuinfo
processor   : 0
cpu : e300c4
clock   : 800.04MHz
revision: 1.1 (pvr 8086 1011)
bogomips: 200.00
timebase: 1

The difference between r14 and r13 is 0x20000b. Assuming TB is
incremented with 'timebase' frequency, that means 20.97 milliseconds
(although the e300 manual says TB is "incremented once every four core
input clock cycles").


I wouldn't be surprised that the internal CPU clock be twice the input
clock.

So that's long enough to surely get a timer interrupt during every bad
access.
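
As a sanity check on the arithmetic quoted above, here is a small C
sketch. It assumes a 100 MHz timebase, which is only an assumption: the
'timebase' value in the cpuinfo output looks truncated, and 100 MHz is
simply the frequency at which the measured delta matches the quoted
20.97 milliseconds. The helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Timebase ticks elapsed between two 32-bit reads; wraps correctly. */
static inline uint32_t tb_delta(uint32_t before, uint32_t after)
{
	return after - before;
}

/* Convert a tick count to whole microseconds at tb_hz ticks per second. */
static inline uint64_t tb_ticks_to_us(uint32_t ticks, uint32_t tb_hz)
{
	assert(tb_hz != 0);
	return (uint64_t)ticks * 1000000u / tb_hz;
}
```

With r13 = 0xda8e0f91 and r14 = 0xdaae0f9c the delta is 0x20000b
(2097163 ticks), which at the assumed 100 MHz is 20971 microseconds.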

Now we have to understand why SRR0 contains the address of the timer
exception entry and not the address of the bad access.

The value of SRR1 confirms that it comes from 0x900 as MSR[IR] and [DR]
are cleared when interrupts are enabled.

Maybe you should file a support case at NXP. They are usually quite
professional at responding.


I already did (quite some time ago), but it started off as "why does the
MCE occur in the first place". That part has already been figured out,
but unfortunately I don't have a viable solution to it. Like you said,
now the focus has shifted to understanding why the SRR0 value is not
what we expect.

I asked them the question about SRR0 as soon as you helped me get back
on track and figured out there's nothing wrong with the Linux MCE
handler and the NIP value comes from SRR0. What they came up with is
basically this paragraph in the e300 core manual (section 5.5.2):

| Note that the e300 core makes no attempt to force recoverability on a
| machine check; however, it does guarantee that the machine check
| interrupt is always taken immediately upon request, with a nonpredicted
| address saved in SRR0, regardless of the current machine state.

... and with an emphasis on "nonpredicted". To be honest, I am a bit
disappointed with their response and I believe in this context what
"unpredicted" means is that the address that is saved to SRR0 is a
"real" address rather than the result of branch prediction. The support
folks were probably thinking "unpredictable". But that's another word
and the difference is quite subtle :)

I updated the case and added information about the interrupts and the
timing. Let's see what they come up with this time.



Yes now the point is to understand why it starts processing the timer 
interrupt at 0x900 (with IR and DR cleared as observed in SRR1) just 
before taking the Machine Check.


Although the execution of the decrementer interrupt is queued for after 
the completion of the failing memory access, I'd expect the Machine 
Check to take priority.


Note that I have never observed such a behaviour on MPC8321 which has an 
e300c2 core.


Christophe


Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Radu Rendec
On 02/20/2020 at 11:25 AM Christophe Leroy  wrote:
> Le 20/02/2020 à 17:02, Radu Rendec a écrit :
> > On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:
> >> On 02/19/2020 10:39 PM, Radu Rendec wrote:
> >>> On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:
> > Interesting.
> >
> > 0x900 is the address of the timer interrupt.
> >
> > Would the MCE occur just after the timer interrupt ?
> >>>
> >>> I doubt that. I'm using a small test module to artificially trigger the
> >>> MCE. Basically it's just this (the full code is in my original post):
> >>>
> >>>   bad_addr_base = ioremap(0xf000, 0x100);
> >>>   x = ioread32(bad_addr_base);
> >>>
> >>> I find it hard to believe that every time I load the module the lwbrx
> >>> instruction that triggers the MCE is executed exactly after the timer
> >>> interrupt (or that the timer interrupt always occurs close to the lwbrx
> >>> instruction).
> >>
> >> Can you try to see how much time there is between your read and the MCE ?
> >> The below should allow it, you'll see first value in r13 and the other
> >> in r14 (mce.c is your test code)
> >>
> >> Also provide the timebase frequency as reported in /proc/cpuinfo
> >
> > I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.
> >
> > # cat /proc/cpuinfo
> > processor   : 0
> > cpu : e300c4
> > clock   : 800.04MHz
> > revision: 1.1 (pvr 8086 1011)
> > bogomips: 200.00
> > timebase: 1
> >
> > The difference between r14 and r13 is 0x20000b. Assuming TB is
> > incremented with 'timebase' frequency, that means 20.97 milliseconds
> > (although the e300 manual says TB is "incremented once every four core
> > input clock cycles").
>
> I wouldn't be surprised that the internal CPU clock be twice the input
> clock.
>
> So that's long enough to surely get a timer interrupt during every bad
> access.
>
> Now we have to understand why SRR0 contains the address of the timer
> exception entry and not the address of the bad access.
>
> The value of SRR1 confirms that it comes from 0x900 as MSR[IR] and [DR]
> are cleared when interrupts are enabled.
>
> Maybe you should file a support case at NXP. They are usually quite
> professional at responding.

I already did (quite some time ago), but it started off as "why does the
MCE occur in the first place". That part has already been figured out,
but unfortunately I don't have a viable solution to it. Like you said,
now the focus has shifted to understanding why the SRR0 value is not
what we expect.

I asked them the question about SRR0 as soon as you helped me get back
on track and figured out there's nothing wrong with the Linux MCE
handler and the NIP value comes from SRR0. What they came up with is
basically this paragraph in the e300 core manual (section 5.5.2):

| Note that the e300 core makes no attempt to force recoverability on a
| machine check; however, it does guarantee that the machine check
| interrupt is always taken immediately upon request, with a nonpredicted
| address saved in SRR0, regardless of the current machine state.

... and with an emphasis on "nonpredicted". To be honest, I am a bit
disappointed with their response and I believe in this context what
"nonpredicted" means is that the address that is saved to SRR0 is a
"real" address rather than the result of branch prediction. The support
folks were probably thinking "unpredictable". But that's another word
and the difference is quite subtle :)

I updated the case and added information about the interrupts and the
timing. Let's see what they come up with this time.

Best regards,
Radu


Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Christophe Leroy




Le 20/02/2020 à 17:02, Radu Rendec a écrit :

On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:

On 02/19/2020 10:39 PM, Radu Rendec wrote:

On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:

Interesting.

0x900 is the address of the timer interrupt.

Would the MCE occur just after the timer interrupt ?


I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

  bad_addr_base = ioremap(0xf000, 0x100);
  x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).


Can you try to see how much time there is between your read and the MCE ?
The below should allow it, you'll see first value in r13 and the other
in r14 (mce.c is your test code)

Also provide the timebase frequency as reported in /proc/cpuinfo


I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.

# cat /proc/cpuinfo
processor   : 0
cpu : e300c4
clock   : 800.04MHz
revision: 1.1 (pvr 8086 1011)
bogomips: 200.00
timebase: 1

The difference between r14 and r13 is 0x20000b. Assuming TB is
incremented with 'timebase' frequency, that means 20.97 milliseconds
(although the e300 manual says TB is "incremented once every four core
input clock cycles").


I wouldn't be surprised that the internal CPU clock be twice the input 
clock.


So that's long enough to surely get a timer interrupt during every bad 
access.


Now we have to understand why SRR0 contains the address of the timer 
exception entry and not the address of the bad access.


The value of SRR1 confirms that it comes from 0x900 as MSR[IR] and [DR] 
are cleared when interrupts are enabled.


Maybe you should file a support case at NXP. They are usually quite 
professional at responding.


Christophe


Re: [PATCH AUTOSEL 5.5 096/542] powerpc/powernv/ioda: Fix ref count for devices with their own PE

2020-02-20 Thread Sasha Levin

On Mon, Feb 17, 2020 at 09:49:41AM +0100, Frederic Barrat wrote:



Le 14/02/2020 à 16:41, Sasha Levin a écrit :

From: Frederic Barrat 

[ Upstream commit 05dd7da76986937fb288b4213b1fa10dbe0d1b33 ]



Hi,

Upstream commit 05dd7da76986937fb288b4213b1fa10dbe0d1b33 doesn't 
really need to go to stable (any of 4.19, 5.4 and 5.5). While it's 
probably safe, the patch replaces a refcount leak by another one, 
which makes sense as part of the full series merged in 5.6-rc1, but 
isn't terribly useful standalone on the current stable branches.


I'll drop it, thank you.

--
Thanks,
Sasha


Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Radu Rendec
On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:
> On 02/19/2020 10:39 PM, Radu Rendec wrote:
> > On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:
> >>> Interesting.
> >>>
> >>> 0x900 is the address of the timer interrupt.
> >>>
> >>> Would the MCE occur just after the timer interrupt ?
> >
> > I doubt that. I'm using a small test module to artificially trigger the
> > MCE. Basically it's just this (the full code is in my original post):
> >
> >  bad_addr_base = ioremap(0xf000, 0x100);
> >  x = ioread32(bad_addr_base);
> >
> > I find it hard to believe that every time I load the module the lwbrx
> > instruction that triggers the MCE is executed exactly after the timer
> > interrupt (or that the timer interrupt always occurs close to the lwbrx
> > instruction).
>
> Can you try to see how much time there is between your read and the MCE ?
> The below should allow it, you'll see first value in r13 and the other
> in r14 (mce.c is your test code)
>
> Also provide the timebase frequency as reported in /proc/cpuinfo

I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.

# cat /proc/cpuinfo
processor   : 0
cpu : e300c4
clock   : 800.04MHz
revision: 1.1 (pvr 8086 1011)
bogomips: 200.00
timebase: 1

The difference between r14 and r13 is 0x20000b. Assuming TB is
incremented with 'timebase' frequency, that means 20.97 milliseconds
(although the e300 manual says TB is "incremented once every four core
input clock cycles").

I repeated the test twice and the absolute values were of course very
different, but r14-r13 was 0x20000c and 0x200011, so it seems to be
quite consistent (within just a few clock cycles).

Just for the fun of it, I repeated the test once more, but with
interrupts disabled. The difference was 0x200014. FWIW, I disabled
interrupts before sampling TB in r13.

> And what's the reason given in the Oops message for the machine check ?
> Is that "Caused by (from SRR1=49030): Transfer error ack signal" or
> something else ?

When interrupts are enabled:
Caused by (from SRR1=41000): Transfer error ack signal

When interrupts are disabled:
Caused by (from SRR1=41030): Transfer error ack signal

> >
> >> Do you use the local bus monitoring driver ?
> >
> > I don't. In fact, I'm not even aware of it. What driver is that?
>
> CONFIG_FSL_LBC

OK, it seems I'm actually using it. I haven't enabled it explicitly, but
it's automatically pulled by CONFIG_MTD_NAND_FSL_ELBC as a prerequisite.

I looked at the code in arch/powerpc/sysdev/fsl_lbc.c and it's quite
small. Most of the code is in fsl_lbc_ctrl_irq, which I guess is
supposed to print a message if/when the LBC catches an error. I've never
seen any of those messages being printed.

Best regards,
Radu


Re: [PATCH] selftest/lkdtm: Don't pollute 'git status'

2020-02-20 Thread Christophe Leroy

ping

On 02/06/2020 08:11 AM, Christophe Leroy wrote:

Commit 46d1a0f03d66 ("selftests/lkdtm: Add tests for LKDTM targets")
added generation of lkdtm test scripts.

Ignore those generated scripts when performing 'git status'

Fixes: 46d1a0f03d66 ("selftests/lkdtm: Add tests for LKDTM targets")
Signed-off-by: Christophe Leroy 


Without this, 'git status' now reports the following crap and real 
problems are drowned in the middle, that's annoying.


On branch saf3000-5.6
Untracked files:
  (use "git add ..." to include in what will be committed)
tools/testing/selftests/lkdtm/ACCESS_NULL.sh
tools/testing/selftests/lkdtm/ACCESS_USERSPACE.sh
tools/testing/selftests/lkdtm/ATOMIC_TIMING.sh
tools/testing/selftests/lkdtm/BUG.sh
tools/testing/selftests/lkdtm/CFI_FORWARD_PROTO.sh
tools/testing/selftests/lkdtm/CORRUPT_LIST_ADD.sh
tools/testing/selftests/lkdtm/CORRUPT_LIST_DEL.sh
tools/testing/selftests/lkdtm/CORRUPT_STACK.sh
tools/testing/selftests/lkdtm/CORRUPT_STACK_STRONG.sh
tools/testing/selftests/lkdtm/CORRUPT_USER_DS.sh
tools/testing/selftests/lkdtm/DOUBLE_FAULT.sh
tools/testing/selftests/lkdtm/EXCEPTION.sh
tools/testing/selftests/lkdtm/EXEC_DATA.sh
tools/testing/selftests/lkdtm/EXEC_KMALLOC.sh
tools/testing/selftests/lkdtm/EXEC_NULL.sh
tools/testing/selftests/lkdtm/EXEC_RODATA.sh
tools/testing/selftests/lkdtm/EXEC_STACK.sh
tools/testing/selftests/lkdtm/EXEC_USERSPACE.sh
tools/testing/selftests/lkdtm/EXEC_VMALLOC.sh
tools/testing/selftests/lkdtm/EXHAUST_STACK.sh
tools/testing/selftests/lkdtm/HARDLOCKUP.sh
tools/testing/selftests/lkdtm/HUNG_TASK.sh
tools/testing/selftests/lkdtm/LOOP.sh
tools/testing/selftests/lkdtm/OVERWRITE_ALLOCATION.sh
tools/testing/selftests/lkdtm/PANIC.sh
tools/testing/selftests/lkdtm/READ_AFTER_FREE.sh
tools/testing/selftests/lkdtm/READ_BUDDY_AFTER_FREE.sh
tools/testing/selftests/lkdtm/REFCOUNT_ADD_NOT_ZERO_OVERFLOW.sh
tools/testing/selftests/lkdtm/REFCOUNT_ADD_NOT_ZERO_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_ADD_OVERFLOW.sh
tools/testing/selftests/lkdtm/REFCOUNT_ADD_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_ADD_ZERO.sh
tools/testing/selftests/lkdtm/REFCOUNT_DEC_AND_TEST_NEGATIVE.sh
tools/testing/selftests/lkdtm/REFCOUNT_DEC_AND_TEST_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_DEC_NEGATIVE.sh
tools/testing/selftests/lkdtm/REFCOUNT_DEC_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_DEC_ZERO.sh
tools/testing/selftests/lkdtm/REFCOUNT_INC_NOT_ZERO_OVERFLOW.sh
tools/testing/selftests/lkdtm/REFCOUNT_INC_NOT_ZERO_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_INC_OVERFLOW.sh
tools/testing/selftests/lkdtm/REFCOUNT_INC_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_INC_ZERO.sh
tools/testing/selftests/lkdtm/REFCOUNT_SUB_AND_TEST_NEGATIVE.sh
tools/testing/selftests/lkdtm/REFCOUNT_SUB_AND_TEST_SATURATED.sh
tools/testing/selftests/lkdtm/REFCOUNT_TIMING.sh
tools/testing/selftests/lkdtm/SLAB_FREE_CROSS.sh
tools/testing/selftests/lkdtm/SLAB_FREE_DOUBLE.sh
tools/testing/selftests/lkdtm/SLAB_FREE_PAGE.sh
tools/testing/selftests/lkdtm/SOFTLOCKUP.sh
tools/testing/selftests/lkdtm/SPINLOCKUP.sh
tools/testing/selftests/lkdtm/STACKLEAK_ERASING.sh
tools/testing/selftests/lkdtm/STACK_GUARD_PAGE_LEADING.sh
tools/testing/selftests/lkdtm/STACK_GUARD_PAGE_TRAILING.sh
tools/testing/selftests/lkdtm/UNALIGNED_LOAD_STORE_WRITE.sh
tools/testing/selftests/lkdtm/UNSET_SMEP.sh
tools/testing/selftests/lkdtm/USERCOPY_HEAP_SIZE_FROM.sh
tools/testing/selftests/lkdtm/USERCOPY_HEAP_SIZE_TO.sh
tools/testing/selftests/lkdtm/USERCOPY_HEAP_WHITELIST_FROM.sh
tools/testing/selftests/lkdtm/USERCOPY_HEAP_WHITELIST_TO.sh
tools/testing/selftests/lkdtm/USERCOPY_KERNEL.sh
tools/testing/selftests/lkdtm/USERCOPY_KERNEL_DS.sh
tools/testing/selftests/lkdtm/USERCOPY_STACK_BEYOND.sh
tools/testing/selftests/lkdtm/USERCOPY_STACK_FRAME_FROM.sh
tools/testing/selftests/lkdtm/USERCOPY_STACK_FRAME_TO.sh
tools/testing/selftests/lkdtm/WARNING.sh
tools/testing/selftests/lkdtm/WARNING_MESSAGE.sh
tools/testing/selftests/lkdtm/WRITE_AFTER_FREE.sh
tools/testing/selftests/lkdtm/WRITE_BUDDY_AFTER_FREE.sh
tools/testing/selftests/lkdtm/WRITE_KERN.sh
tools/testing/selftests/lkdtm/WRITE_RO.sh
tools/testing/selftests/lkdtm/WRITE_RO_AFTER_INIT.sh

nothing added to commit but untracked files present (use "git add" to track)


Thanks
Christophe



---
  .gitignore | 4 
  1 file changed, 4 insertions(+)

diff --git a/.gitignore b/.gitignore
index 

Re: [PATCH v3 6/6] powerpc/fsl_booke/kaslr: rename kaslr-booke32.rst to kaslr-booke.rst and add 64bit part

2020-02-20 Thread Christophe Leroy




Le 06/02/2020 à 03:58, Jason Yan a écrit :

Now we support both 32 and 64 bit KASLR for fsl booke. Add documentation for
the 64 bit part and rename kaslr-booke32.rst to kaslr-booke.rst.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
  .../{kaslr-booke32.rst => kaslr-booke.rst}| 35 ---
  1 file changed, 31 insertions(+), 4 deletions(-)
  rename Documentation/powerpc/{kaslr-booke32.rst => kaslr-booke.rst} (59%)


Also update Documentation/powerpc/index.rst ?

Christophe


Re: [PATCH v3 5/6] powerpc/fsl_booke/64: clear the original kernel if randomized

2020-02-20 Thread Christophe Leroy




Le 06/02/2020 à 03:58, Jason Yan a écrit :

The original kernel still exists in the memory, clear it now.


No such problem with PPC32 ? Or is that common ?

Christophe



Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
  arch/powerpc/mm/nohash/kaslr_booke.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c 
b/arch/powerpc/mm/nohash/kaslr_booke.c
index c6f5c1db1394..ed1277059368 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -378,8 +378,10 @@ notrace void __init kaslr_early_init(void *dt_ptr, 
phys_addr_t size)
unsigned int *__kaslr_offset = (unsigned int *)(KERNELBASE + 0x58);
unsigned int *__run_at_load = (unsigned int *)(KERNELBASE + 0x5c);
  
-	if (*__run_at_load == 1)

+   if (*__run_at_load == 1) {
+   kaslr_late_init();
return;
+   }
  
  	/* Setup flat device-tree pointer */

initial_boot_params = dt_ptr;



Re: [PATCH v3 3/6] powerpc/fsl_booke/64: implement KASLR for fsl_booke64

2020-02-20 Thread Christophe Leroy




Le 06/02/2020 à 03:58, Jason Yan a écrit :

The implementation for Freescale BookE64 is similar to BookE32. One
difference is that Freescale BookE64 sets up a TLB mapping of 1G during
booting. Another difference is that ppc64 needs the kernel to be
64K-aligned. So we can randomize the kernel in this 1G mapping and make
it 64K-aligned. This saves the code needed to create another TLB mapping
at early boot. The disadvantage is that we only have about 1G/64K = 16384
slots to put the kernel in.

To support secondary cpu boot up, a variable __kaslr_offset was added in
first_256B section. This can help secondary cpu get the kaslr offset
before the 1:1 mapping has been setup.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
  arch/powerpc/Kconfig |  2 +-
  arch/powerpc/kernel/exceptions-64e.S | 10 +
  arch/powerpc/kernel/head_64.S|  7 ++
  arch/powerpc/kernel/setup_64.c   |  4 +++-
  arch/powerpc/mm/mmu_decl.h   | 16 +++---
  arch/powerpc/mm/nohash/kaslr_booke.c | 33 +---
  6 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c150a9d49343..754aeb96bb1c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -568,7 +568,7 @@ config RELOCATABLE
  
  config RANDOMIZE_BASE

bool "Randomize the address of the kernel image"
-   depends on (FSL_BOOKE && FLATMEM && PPC32)
+   depends on (PPC_FSL_BOOK3E && FLATMEM)
depends on RELOCATABLE
help
  Randomizes the virtual address at which the kernel image is
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 1b9b174bee86..c1c05b8684ca 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1378,6 +1378,7 @@ skpinv:   addir6,r6,1 /* 
Increment */
  1:mflrr6
addir6,r6,(2f - 1b)
tovirt(r6,r6)
+   add r6,r6,r19
lis r7,MSR_KERNEL@h
ori r7,r7,MSR_KERNEL@l
mtspr   SPRN_SRR0,r6
@@ -1400,6 +1401,7 @@ skpinv:   addir6,r6,1 /* 
Increment */
  
  	/* We translate LR and return */

tovirt(r8,r8)
+   add r8,r8,r19
mtlrr8
blr
  
@@ -1528,6 +1530,7 @@ a2_tlbinit_code_end:

   */
  _GLOBAL(start_initialization_book3e)
mflrr28
+   li  r19, 0
  
  	/* First, we need to setup some initial TLBs to map the kernel

 * text, data and bss at PAGE_OFFSET. We don't have a real mode
@@ -1570,6 +1573,12 @@ _GLOBAL(book3e_secondary_core_init)
cmplwi  r4,0
bne 2f
  
+	li	r19, 0

+#ifdef CONFIG_RANDOMIZE_BASE
+   LOAD_REG_ADDR_PIC(r19, __kaslr_offset)
+   lwz r19,0(r19)
+   rlwinm  r19,r19,0,0,5
+#endif
/* Setup TLB for this core */
bl  initial_tlb_book3e
  
@@ -1602,6 +1611,7 @@ _GLOBAL(book3e_secondary_core_init)

lis r3,PAGE_OFFSET@highest
sldir3,r3,32
or  r28,r28,r3
+   add r28,r28,r19
  1:mtlrr28
blr
  
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S

index ad79fddb974d..744624140fb8 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -104,6 +104,13 @@ __secondary_hold_acknowledge:
.8byte  0x0
  
  #ifdef CONFIG_RELOCATABLE

+#ifdef CONFIG_RANDOMIZE_BASE
+   . = 0x58
+   .globl  __kaslr_offset
+__kaslr_offset:
+DEFINE_FIXED_SYMBOL(__kaslr_offset)
+   .long   0
+#endif
/* This flag is set to 1 by a loader if the kernel should run
 * at the loaded address instead of the linked address.  This
 * is used by kexec-tools to keep the the kdump kernel in the
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6104917a282d..a16b970a8d1a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -66,7 +66,7 @@
  #include 
  #include 
  #include 
-


Why remove this new line which clearly separates things in asm/ and 
things in local dir ?



+#include 
  #include "setup.h"
  
  int spinning_secondaries;

@@ -300,6 +300,8 @@ void __init early_setup(unsigned long dt_ptr)
/* Enable early debugging if any specified (see udbg.h) */
udbg_early_init();
  
+	kaslr_early_init(__va(dt_ptr), 0);

+
udbg_printf(" -> %s(), dt_ptr: 0x%lx\n", __func__, dt_ptr);
  
  	/*

diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 3e1c85c7d10b..bbd721d1e3d7 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -147,14 +147,6 @@ void reloc_kernel_entry(void *fdt, long addr);
  extern void loadcam_entry(unsigned int index);
  extern void loadcam_multi(int first_idx, int num, int tmp_idx);
  
-#ifdef 

Re: [PATCH v3 2/6] powerpc/fsl_booke/64: introduce reloc_kernel_entry() helper

2020-02-20 Thread Christophe Leroy




Le 06/02/2020 à 03:58, Jason Yan a écrit :

Like the 32bit code, we introduce reloc_kernel_entry() helper to prepare
for the KASLR 64bit version. And move the C declaration of this function
out of CONFIG_PPC32 and use long instead of int for the parameter 'addr'.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 


Reviewed-by: Christophe Leroy 



---
  arch/powerpc/kernel/exceptions-64e.S | 13 +
  arch/powerpc/mm/mmu_decl.h   |  3 ++-
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index e4076e3c072d..1b9b174bee86 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1679,3 +1679,16 @@ _GLOBAL(setup_ehv_ivors)
  _GLOBAL(setup_lrat_ivor)
SET_IVOR(42, 0x340) /* LRAT Error */
blr
+
+/*
+ * Return to the start of the relocated kernel and run again
+ * r3 - virtual address of fdt
+ * r4 - entry of the kernel
+ */
+_GLOBAL(reloc_kernel_entry)
+   mfmsr   r7
+   rlwinm  r7, r7, 0, ~(MSR_IS | MSR_DS)
+
+   mtspr   SPRN_SRR0,r4
+   mtspr   SPRN_SRR1,r7
+   rfi
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 8e99649c24fc..3e1c85c7d10b 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -140,9 +140,10 @@ extern void adjust_total_lowmem(void);
  extern int switch_to_as1(void);
  extern void restore_to_as0(int esel, int offset, void *dt_ptr, int bootcpu);
  void create_kaslr_tlb_entry(int entry, unsigned long virt, phys_addr_t phys);
-void reloc_kernel_entry(void *fdt, int addr);
  extern int is_second_reloc;
  #endif
+
+void reloc_kernel_entry(void *fdt, long addr);
  extern void loadcam_entry(unsigned int index);
  extern void loadcam_multi(int first_idx, int num, int tmp_idx);
  



Re: [PATCH v3 1/6] powerpc/fsl_booke/kaslr: refactor kaslr_legal_offset() and kaslr_early_init()

2020-02-20 Thread Christophe Leroy




Le 06/02/2020 à 03:58, Jason Yan a écrit :

Some code refactor in kaslr_legal_offset() and kaslr_early_init(). No
functional change. This is a preparation for KASLR fsl_booke64.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
  arch/powerpc/mm/nohash/kaslr_booke.c | 40 ++--
  1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c 
b/arch/powerpc/mm/nohash/kaslr_booke.c
index 4a75f2d9bf0e..07b036e98353 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -25,6 +25,7 @@ struct regions {
unsigned long pa_start;
unsigned long pa_end;
unsigned long kernel_size;
+   unsigned long linear_sz;
unsigned long dtb_start;
unsigned long dtb_end;
unsigned long initrd_start;
@@ -260,11 +261,23 @@ static __init void get_cell_sizes(const void *fdt, int 
node, int *addr_cells,
*size_cells = fdt32_to_cpu(*prop);
  }
  
-static unsigned long __init kaslr_legal_offset(void *dt_ptr, unsigned long index,

-  unsigned long offset)
+static unsigned long __init kaslr_legal_offset(void *dt_ptr, unsigned long 
random)
  {
unsigned long koffset = 0;
unsigned long start;
+   unsigned long index;
+   unsigned long offset;
+
+   /*
+* Decide which 64M we want to start
+* Only use the low 8 bits of the random seed
+*/
+   index = random & 0xFF;
+   index %= regions.linear_sz / SZ_64M;
+
+   /* Decide offset inside 64M */
+   offset = random % (SZ_64M - regions.kernel_size);
+   offset = round_down(offset, SZ_16K);
  
  	while ((long)index >= 0) {

offset = memstart_addr + index * SZ_64M + offset;
@@ -289,10 +302,9 @@ static inline __init bool kaslr_disabled(void)
  static unsigned long __init kaslr_choose_location(void *dt_ptr, phys_addr_t 
size,
  unsigned long kernel_sz)
  {
-   unsigned long offset, random;
+   unsigned long random;
unsigned long ram, linear_sz;
u64 seed;
-   unsigned long index;
  
  	kaslr_get_cmdline(dt_ptr);

if (kaslr_disabled())
@@ -333,22 +345,12 @@ static unsigned long __init kaslr_choose_location(void 
*dt_ptr, phys_addr_t size
regions.dtb_start = __pa(dt_ptr);
regions.dtb_end = __pa(dt_ptr) + fdt_totalsize(dt_ptr);
regions.kernel_size = kernel_sz;
+   regions.linear_sz = linear_sz;
  
  	get_initrd_range(dt_ptr);

get_crash_kernel(dt_ptr, ram);
  
-	/*

-* Decide which 64M we want to start
-* Only use the low 8 bits of the random seed
-*/
-   index = random & 0xFF;
-   index %= linear_sz / SZ_64M;
-
-   /* Decide offset inside 64M */
-   offset = random % (SZ_64M - kernel_sz);
-   offset = round_down(offset, SZ_16K);
-
-   return kaslr_legal_offset(dt_ptr, index, offset);
+   return kaslr_legal_offset(dt_ptr, random);
  }
  
  /*

@@ -358,8 +360,6 @@ static unsigned long __init kaslr_choose_location(void 
*dt_ptr, phys_addr_t size
   */
  notrace void __init kaslr_early_init(void *dt_ptr, phys_addr_t size)
  {
-   unsigned long tlb_virt;
-   phys_addr_t tlb_phys;
unsigned long offset;
unsigned long kernel_sz;
  
@@ -375,8 +375,8 @@ notrace void __init kaslr_early_init(void *dt_ptr, phys_addr_t size)

is_second_reloc = 1;
  
  	if (offset >= SZ_64M) {

-   tlb_virt = round_down(kernstart_virt_addr, SZ_64M);
-   tlb_phys = round_down(kernstart_addr, SZ_64M);
+   unsigned long tlb_virt = round_down(kernstart_virt_addr, 
SZ_64M);
+   phys_addr_t tlb_phys = round_down(kernstart_addr, SZ_64M);


That looks like cleanup unrelated to the patch itself.

  
  		/* Create kernel map to relocate in */

create_kaslr_tlb_entry(1, tlb_virt, tlb_phys);



Christophe


Re: [PATCH v3 1/5] powerpc: Rename current_stack_pointer() to current_stack_frame()

2020-02-20 Thread Christophe Leroy




Le 20/02/2020 à 12:51, Michael Ellerman a écrit :

current_stack_pointer(), which was called __get_SP(), used to just
return the value in r1.

But that caused problems in some cases, so it was turned into a
function in commit bfe9a2cfe91a ("powerpc: Reimplement __get_SP() as a
function not a define").

Because it's a function in a separate compilation unit to all its
callers, it has the effect of causing a stack frame to be created, and
then returns the address of that frame. This is good in some cases
like those described in the above commit, but in other cases it's
overkill, we just need to know what stack page we're on.

On some other arches current_stack_pointer is just a register global
giving the stack pointer, and we'd like to do that too. So rename our
current_stack_pointer() to current_stack_frame() to make that
possible.

Signed-off-by: Michael Ellerman 


LGTM

I was afraid to do that and risk invisible conflicts hence bugs by 
reusing the same name for different purpose, but that's the best 
solution for sure.


Reviewed-by: Christophe Leroy 


---
  arch/powerpc/include/asm/perf_event.h | 2 +-
  arch/powerpc/include/asm/reg.h| 2 +-
  arch/powerpc/kernel/irq.c | 4 ++--
  arch/powerpc/kernel/misc.S| 4 ++--
  arch/powerpc/kernel/process.c | 2 +-
  arch/powerpc/kernel/stacktrace.c  | 6 +++---
  6 files changed, 10 insertions(+), 10 deletions(-)

v3: New.

diff --git a/arch/powerpc/include/asm/perf_event.h 
b/arch/powerpc/include/asm/perf_event.h
index 7426d7a90e1e..eed3954082fa 100644
--- a/arch/powerpc/include/asm/perf_event.h
+++ b/arch/powerpc/include/asm/perf_event.h
@@ -32,7 +32,7 @@
do {\
(regs)->result = 0;  \
(regs)->nip = __ip;  \
-   (regs)->gpr[1] = current_stack_pointer();\
+   (regs)->gpr[1] = current_stack_frame();  \
asm volatile("mfmsr %0" : "=r" ((regs)->msr));   \
} while (0)
  
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h

index 1aa46dff0957..1b1ffdba6097 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1448,7 +1448,7 @@ static inline void mtsrin(u32 val, u32 idx)
  
  #define proc_trap()	asm volatile("trap")
  
-extern unsigned long current_stack_pointer(void);

+extern unsigned long current_stack_frame(void);
  
  extern unsigned long scom970_read(unsigned int address);

  extern void scom970_write(unsigned int address, unsigned long value);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 5c9b11878555..02118c18434d 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -602,7 +602,7 @@ static inline void check_stack_overflow(void)
  #ifdef CONFIG_DEBUG_STACKOVERFLOW
long sp;
  
-	sp = current_stack_pointer() & (THREAD_SIZE-1);

+   sp = current_stack_frame() & (THREAD_SIZE-1);
  
  	/* check for stack overflow: is there less than 2KB free? */

if (unlikely(sp < 2048)) {
@@ -647,7 +647,7 @@ void do_IRQ(struct pt_regs *regs)
void *cursp, *irqsp, *sirqsp;
  
  	/* Switch to the irq stack to handle this */

-   cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
+   cursp = (void *)(current_stack_frame() & ~(THREAD_SIZE - 1));
irqsp = hardirq_ctx[raw_smp_processor_id()];
sirqsp = softirq_ctx[raw_smp_processor_id()];
  
diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S

index 974f65f79a8e..65f9f731c229 100644
--- a/arch/powerpc/kernel/misc.S
+++ b/arch/powerpc/kernel/misc.S
@@ -110,7 +110,7 @@ _GLOBAL(longjmp)
li  r3, 1
blr
  
-_GLOBAL(current_stack_pointer)

+_GLOBAL(current_stack_frame)
PPC_LL  r3,0(r1)
blr
-EXPORT_SYMBOL(current_stack_pointer)
+EXPORT_SYMBOL(current_stack_frame)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e730b8e522b0..110db94cdf3c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2051,7 +2051,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
sp = (unsigned long) stack;
if (sp == 0) {
if (tsk == current)
-   sp = current_stack_pointer();
+   sp = current_stack_frame();
else
sp = tsk->thread.ksp;
}
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2a46cfed5fd..c477b8585a29 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -57,7 +57,7 @@ void save_stack_trace(struct stack_trace *trace)
  {
unsigned long sp;
  
-	sp = current_stack_pointer();

+   sp = current_stack_frame();
  
  	save_context_stack(trace, sp, current, 1);

  }
@@ -71,7 +71,7 @@ void save_stack_trace_tsk(struct task_struct *tsk, 

Re: [PATCH v5 01/10] capabilities: introduce CAP_PERFMON to kernel and user space

2020-02-20 Thread Alexey Budankov


On 07.02.2020 16:39, Alexey Budankov wrote:
> 
> On 07.02.2020 14:38, Thomas Gleixner wrote:
>> Alexey Budankov  writes:
>>> On 22.01.2020 17:25, Alexey Budankov wrote:
>>>> On 22.01.2020 17:07, Stephen Smalley wrote:
>> It keeps the implementation simple and readable. The implementation is 
>> more
>> performant in the sense of calling the API - one capable() call for 
>> CAP_PERFMON
>> privileged process.
>>
>> Yes, it bloats audit log for CAP_SYS_ADMIN privileged and unprivileged 
>> processes,
>> but this bloating also advertises and leverages using more secure 
>> CAP_PERFMON
>> based approach to use perf_event_open system call.
>
> I can live with that.  We just need to document that when you see
> both a CAP_PERFMON and a CAP_SYS_ADMIN audit message for a process,
> try only allowing CAP_PERFMON first and see if that resolves the
> issue.  We have a similar issue with CAP_DAC_READ_SEARCH versus
> CAP_DAC_OVERRIDE.

>>>> perf security [1] document can be updated, at least, to align and document 
>>>> this audit logging specifics.
>>>
>>> And I plan to update the document right after this patch set is accepted.
>>> Feel free to let me know of the places in the kernel docs that also
>>> require update w.r.t CAP_PERFMON extension.
>>
>> The documentation update wants be part of the patch set and not planned
>> to be done _after_ the patch set is merged.
> 
> Well, accepted. It is going to make patches #11 and beyond.

Patches #11 and #12 of v7 [1] contain information on CAP_PERFMON intention and 
usage.
Patch for man-pages [2] extends perf_event_open.2 documentation.

Thanks,
Alexey

---
[1] 
https://lore.kernel.org/lkml/c8de937a-0b3a-7147-f5ef-69f467e87...@linux.intel.com/
[2] 
https://lore.kernel.org/lkml/18d1083d-efe5-f5f8-c531-d142c0e5c...@linux.intel.com/



[Bug 206525] BUG: KASAN: stack-out-of-bounds in test_bit+0x30/0x44 (kernel 5.6-rc1)

2020-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206525

--- Comment #6 from Nikolay Aleksandrov (niko...@cumulusnetworks.com) ---
Note that the bug wasn't introduced by my commit, but instead has been there
since:
 commit 4f520900522f
 Author: Richard Guy Briggs 
 Date:   Tue Apr 22 21:31:54 2014 -0400

netlink: have netlink per-protocol bind function return an error code.

which moved the ngroups test_bit() to a local variable. My commit only exposed
the bug since it added the 33rd group. I'm currently preparing a fix and will
post it to netdev after verifying and testing it.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH v3 5/5] powerpc/irq: Use current_stack_pointer in do_IRQ()

2020-02-20 Thread Michael Ellerman
From: Christophe Leroy 

Until commit 7306e83ccf5c ("powerpc: Don't use CURRENT_THREAD_INFO to
find the stack"), the current stack base address was obtained by
calling current_thread_info(). That inline function was simply masking
out the value of r1.

In that commit, it was changed to using current_stack_pointer() (since
renamed current_stack_frame()), which is a heavier function as it is
an outline assembly function which cannot be inlined and which reads
the content of the stack at 0(r1).

Convert to using current_stack_pointer for getting r1 and masking out
its value to obtain the base address of the stack pointer as before.

Fixes: 7306e83ccf5c ("powerpc: Don't use CURRENT_THREAD_INFO to find the stack")
Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/a37e699e7ab897742c07b6838a08af33bc9217e3.1579849665.git.christophe.le...@c-s.fr
---
 arch/powerpc/kernel/irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 46d5852fb00a..1bed18b7229e 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -648,7 +648,7 @@ void do_IRQ(struct pt_regs *regs)
void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
-   cursp = (void *)(current_stack_frame() & ~(THREAD_SIZE - 1));
+   cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
irqsp = hardirq_ctx[raw_smp_processor_id()];
sirqsp = softirq_ctx[raw_smp_processor_id()];
 
-- 
2.21.1

v3: s/get_sp()/current_stack_pointer/
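
The masking used in do_IRQ() above can be tried out in userspace. The
following is only a sketch: the demo_* names and the 16 KiB stack size
are assumptions made for illustration, not values taken from the patch.
It relies on stacks being THREAD_SIZE-aligned and THREAD_SIZE being a
power of two, so clearing the low bits of any address inside a stack
yields that stack's base.

```c
#include <stdint.h>

/* Assumed for illustration: 16 KiB stacks. Must be a power of two. */
#define DEMO_THREAD_SIZE ((uintptr_t)16384)

/* Round a stack pointer down to the base of its aligned stack. */
static uintptr_t demo_stack_base(uintptr_t sp)
{
	return sp & ~(DEMO_THREAD_SIZE - 1);
}

/* Two addresses are on the same stack iff they share the same base. */
static int demo_same_stack(uintptr_t a, uintptr_t b)
{
	return demo_stack_base(a) == demo_stack_base(b);
}
```

Comparing bases like this is one way to tell whether the current stack
is already one of the irq stacks, which is presumably what cursp, irqsp
and sirqsp are used for in do_IRQ().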


[PATCH v3 4/5] powerpc/irq: use IS_ENABLED() in check_stack_overflow()

2020-02-20 Thread Michael Ellerman
From: Christophe Leroy 

Instead of #ifdef, use IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW).
This enables GCC to check the code for validity even when the option
is not selected.

Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/98855694e9e8993673af08cc2e97e16e0cf50f4a.1579849665.git.christophe.le...@c-s.fr
---
 arch/powerpc/kernel/irq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index c7d6f5cdffdb..46d5852fb00a 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -599,9 +599,11 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 
 static inline void check_stack_overflow(void)
 {
-#ifdef CONFIG_DEBUG_STACKOVERFLOW
long sp;
 
+   if (!IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW))
+   return;
+
sp = current_stack_pointer & (THREAD_SIZE - 1);
 
/* check for stack overflow: is there less than 2KB free? */
@@ -609,7 +611,6 @@ static inline void check_stack_overflow(void)
pr_err("do_IRQ: stack overflow: %ld\n", sp);
dump_stack();
}
-#endif
 }
 
 void __do_irq(struct pt_regs *regs)
-- 
2.21.1
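
For readers unfamiliar with the pattern, here is a userspace sketch of
why the IS_ENABLED() form still lets the compiler see the guarded code.
The DEMO_* macros are a simplified stand-in modelled on the kernel's
preprocessor trick, not the real <linux/kconfig.h> (which also handles
=m options). The option names and the 16 KiB stack size are
hypothetical.

```c
/* Simplified stand-in for IS_ENABLED(): 1 if the macro is defined to 1. */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_DEFINED3(arg1_or_junk) DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED2(val) DEMO_IS_DEFINED3(DEMO_PLACEHOLDER_##val)
#define DEMO_IS_ENABLED(option) DEMO_IS_DEFINED2(option)

#define CONFIG_DEMO_STACKOVERFLOW 1	/* pretend the option is selected */
/* CONFIG_DEMO_UNSET is deliberately left undefined. */

#define DEMO_THREAD_SIZE 16384UL	/* assumed stack size */

/* Returns 1 if fewer than 2KB remain below sp on its stack. */
static int demo_stack_is_low(unsigned long sp)
{
	/* Constant condition: the body is compiled and then optimised out. */
	if (!DEMO_IS_ENABLED(CONFIG_DEMO_STACKOVERFLOW))
		return 0;

	return (sp & (DEMO_THREAD_SIZE - 1)) < 2048;
}

/* An undefined option expands to 0 rather than breaking the build. */
static int demo_unset_is_enabled(void)
{
	return DEMO_IS_ENABLED(CONFIG_DEMO_UNSET);
}
```

With #ifdef, a typo inside the disabled block goes unnoticed until
someone selects the option. With the constant-condition form it is a
compile error in every configuration.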



[PATCH v3 3/5] powerpc/irq: Use current_stack_pointer in check_stack_overflow()

2020-02-20 Thread Michael Ellerman
From: Christophe Leroy 

The purpose of check_stack_overflow() is to verify that the stack has
not overflowed.

To really know whether the stack pointer is still within boundaries,
the check must be done directly on the value of r1.

So use current_stack_pointer, which returns the current value of r1,
rather than current_stack_frame() which causes a frame to be created
and then returns that value.

Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/435e0030e942507766cbef5bc95f906262d2ccf2.1579849665.git.christophe.le...@c-s.fr
---
 arch/powerpc/kernel/irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 02118c18434d..c7d6f5cdffdb 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -602,7 +602,7 @@ static inline void check_stack_overflow(void)
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
long sp;
 
-   sp = current_stack_frame() & (THREAD_SIZE-1);
+   sp = current_stack_pointer & (THREAD_SIZE - 1);
 
/* check for stack overflow: is there less than 2KB free? */
if (unlikely(sp < 2048)) {
-- 
2.21.1

v3: s/get_sp()/current_stack_pointer/


[PATCH v3 2/5] powerpc: Add current_stack_pointer as a register global

2020-02-20 Thread Michael Ellerman
From: Christophe Leroy 

current_stack_frame() doesn't return the stack pointer, but the
caller's stack frame. See commit bfe9a2cfe91a ("powerpc: Reimplement
__get_SP() as a function not a define") and commit
acf620ecf56c ("powerpc: Rename __get_SP() to current_stack_pointer()")
for details.

In some cases this is overkill or incorrect, as it doesn't return the
current value of r1.

So add a current_stack_pointer register global to get the value of r1
directly.

Signed-off-by: Christophe Leroy 
[mpe: Split out of other patch, tweak change log]
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/435e0030e942507766cbef5bc95f906262d2ccf2.1579849665.git.christophe.le...@c-s.fr
---
 arch/powerpc/include/asm/reg.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1b1ffdba6097..da5cab038e25 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1450,6 +1450,8 @@ static inline void mtsrin(u32 val, u32 idx)
 
 extern unsigned long current_stack_frame(void);
 
+register unsigned long current_stack_pointer asm("r1");
+
 extern unsigned long scom970_read(unsigned int address);
 extern void scom970_write(unsigned int address, unsigned long value);
 
-- 
2.21.1

v3: Split out, and use current_stack_pointer not get_sp()


[PATCH v3 1/5] powerpc: Rename current_stack_pointer() to current_stack_frame()

2020-02-20 Thread Michael Ellerman
current_stack_pointer(), which was called __get_SP(), used to just
return the value in r1.

But that caused problems in some cases, so it was turned into a
function in commit bfe9a2cfe91a ("powerpc: Reimplement __get_SP() as a
function not a define").

Because it's a function in a separate compilation unit to all its
callers, it has the effect of causing a stack frame to be created, and
then returns the address of that frame. This is good in some cases
like those described in the above commit, but in other cases it's
overkill: we just need to know what stack page we're on.

On some other arches current_stack_pointer is just a register global
giving the stack pointer, and we'd like to do that too. So rename our
current_stack_pointer() to current_stack_frame() to make that
possible.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/perf_event.h | 2 +-
 arch/powerpc/include/asm/reg.h| 2 +-
 arch/powerpc/kernel/irq.c | 4 ++--
 arch/powerpc/kernel/misc.S| 4 ++--
 arch/powerpc/kernel/process.c | 2 +-
 arch/powerpc/kernel/stacktrace.c  | 6 +++---
 6 files changed, 10 insertions(+), 10 deletions(-)

v3: New.

diff --git a/arch/powerpc/include/asm/perf_event.h 
b/arch/powerpc/include/asm/perf_event.h
index 7426d7a90e1e..eed3954082fa 100644
--- a/arch/powerpc/include/asm/perf_event.h
+++ b/arch/powerpc/include/asm/perf_event.h
@@ -32,7 +32,7 @@
do {\
(regs)->result = 0; \
(regs)->nip = __ip; \
-   (regs)->gpr[1] = current_stack_pointer();   \
+   (regs)->gpr[1] = current_stack_frame(); \
asm volatile("mfmsr %0" : "=r" ((regs)->msr));  \
} while (0)
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1aa46dff0957..1b1ffdba6097 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1448,7 +1448,7 @@ static inline void mtsrin(u32 val, u32 idx)
 
 #define proc_trap()asm volatile("trap")
 
-extern unsigned long current_stack_pointer(void);
+extern unsigned long current_stack_frame(void);
 
 extern unsigned long scom970_read(unsigned int address);
 extern void scom970_write(unsigned int address, unsigned long value);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 5c9b11878555..02118c18434d 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -602,7 +602,7 @@ static inline void check_stack_overflow(void)
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
long sp;
 
-   sp = current_stack_pointer() & (THREAD_SIZE-1);
+   sp = current_stack_frame() & (THREAD_SIZE-1);
 
/* check for stack overflow: is there less than 2KB free? */
if (unlikely(sp < 2048)) {
@@ -647,7 +647,7 @@ void do_IRQ(struct pt_regs *regs)
void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
-   cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
+   cursp = (void *)(current_stack_frame() & ~(THREAD_SIZE - 1));
irqsp = hardirq_ctx[raw_smp_processor_id()];
sirqsp = softirq_ctx[raw_smp_processor_id()];
 
diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
index 974f65f79a8e..65f9f731c229 100644
--- a/arch/powerpc/kernel/misc.S
+++ b/arch/powerpc/kernel/misc.S
@@ -110,7 +110,7 @@ _GLOBAL(longjmp)
li  r3, 1
blr
 
-_GLOBAL(current_stack_pointer)
+_GLOBAL(current_stack_frame)
PPC_LL  r3,0(r1)
blr
-EXPORT_SYMBOL(current_stack_pointer)
+EXPORT_SYMBOL(current_stack_frame)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e730b8e522b0..110db94cdf3c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2051,7 +2051,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
sp = (unsigned long) stack;
if (sp == 0) {
if (tsk == current)
-   sp = current_stack_pointer();
+   sp = current_stack_frame();
else
sp = tsk->thread.ksp;
}
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2a46cfed5fd..c477b8585a29 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -57,7 +57,7 @@ void save_stack_trace(struct stack_trace *trace)
 {
unsigned long sp;
 
-   sp = current_stack_pointer();
+   sp = current_stack_frame();
 
save_context_stack(trace, sp, current, 1);
 }
@@ -71,7 +71,7 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct 
stack_trace *trace)
return;
 
if (tsk == current)
-   sp = current_stack_pointer();
+   sp = current_stack_frame();
else
sp = tsk->thread.ksp;
 
@@ -131,7 +131,7 @@ static int 

[PATCH] powerpc: Include .BTF section

2020-02-20 Thread Naveen N. Rao
Selecting CONFIG_DEBUG_INFO_BTF results in the below warning from ld:
  ld: warning: orphan section `.BTF' from `.btf.vmlinux.bin.o' being placed in 
section `.BTF'

Include .BTF section in vmlinux explicitly to fix the same.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/vmlinux.lds.S | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index b4c89a1acebb..a32d478a7f41 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -303,6 +303,12 @@ SECTIONS
*(.branch_lt)
}
 
+#ifdef CONFIG_DEBUG_INFO_BTF
+   .BTF : AT(ADDR(.BTF) - LOAD_OFFSET) {
+   *(.BTF)
+   }
+#endif
+
.opd : AT(ADDR(.opd) - LOAD_OFFSET) {
__start_opd = .;
KEEP(*(.opd))
-- 
2.24.1



[PATCH 5/8] powerpc/uapi: Introduce uapi header 'papr_scm_dsm.h' for papr_scm DSMs

2020-02-20 Thread Vaibhav Jain
Define and add a new uapi header for papr_scm describing device
specific methods (DSMs) and structs for libndctl. PAPR-SCM specific
implementation in libndctl will use these commands/structs to interact
with papr_scm kernel module. Currently only DSMs to retrieve health and
performance statistics information of a dimm are defined.

DSM Envelope
============

The ioctl ND_CMD_CALL transfers data between user-space and kernel via
'envelopes'. An envelope consists of a header and a user-defined payload
section. The primary structure describing this envelope is 'struct
nd_papr_scm_cmd_pkg', which expects a payload at the end of the envelope
pointed to by 'nd_papr_scm_cmd_pkg.payload_offset'. Currently two
payloads are defined: 'struct nd_papr_scm_dimm_health_stat' and 'struct
nd_papr_scm_perf_stats'. These can be used to retrieve dimm-health and
performance stats respectively.

The header is defined as 'struct nd_cmd_pkg', which in turn is
wrapped in a user-defined struct called 'struct
nd_papr_scm_cmd_pkg'. This relationship is illustrated below:

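  64-Bytes                  8-Bytes
 +-------------+-------------------+--------------------------+
 | nd_family   |                   |                          |
 |             |                   |                          |
 | nd_size_out | cmd_status        |                          |
 |             |                   |                          |
 | nd_size_in  | payload_version   |         PAYLOAD          |
 |             |                   |                          |
 | nd_command  | payload_offset    |                          |
 |             |        |          |                          |
 | nd_fw_size  |        +--------> |                          |
 +-------------+-------------------+--------------------------+
 \ nd_cmd_pkg /                   /                          /
  \----------/                   /                          /
   \    nd_papr_scm_cmd_pkg     /                          /
    \--------------------------/                          /
     \                     Envelope                      /
      \-------------------------------------------------/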

Important fields to note in above illustration are:

* 'nd_command'  : DSM command sent by libndctl
* 'nd_family'   : Id for newly introduced DSM family NVDIMM_FAMILY_PAPR_SCM
* 'nd_fw_size'  : Number of bytes that kernel wanted to copy to the
  payload but may not have copied due to limited size of the envelope.
* 'nd_size_in/out' : Number of bytes that kernel needs to copy from
  user-space (in) and copy-back to user-space (out).
* 'cmd_status'  : Out variable indicating any error encountered while
  servicing the DSM.
* 'payload_version': Version number associated with the payload.
* 'payload_offset': Offset of the payload from start of the envelope.

libnvdimm enforces a hard limit of 256 bytes on the envelope size,
which leaves around 184 bytes for the envelope payload (ignoring any
padding that the compiler may silently introduce).
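
The size arithmetic above can be checked with a small userspace
sketch. The demo_* structs below only mirror the field names from the
illustration. The field widths, the field order within each struct and
the 'reserved' words padding the header to 64 bytes are assumptions,
not the definitions from the uapi headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the 64-byte header (assumed field widths). */
struct demo_nd_cmd_pkg {
	uint64_t nd_family;
	uint64_t nd_command;
	uint32_t nd_size_in;
	uint32_t nd_size_out;
	uint32_t reserved[9];		/* assumed padding up to 64 bytes */
	uint32_t nd_fw_size;
};

/* Stand-in for the wrapper: header plus 8 bytes of bookkeeping. */
struct demo_papr_scm_cmd_pkg {
	struct demo_nd_cmd_pkg hdr;
	int32_t  cmd_status;
	uint16_t payload_version;
	uint16_t payload_offset;
	uint8_t  payload[];		/* payload starts right after */
};

#define DEMO_ENVELOPE_MAX 256u		/* hard limit quoted above */

/* Bytes left for the payload once header and bookkeeping are counted. */
static size_t demo_payload_budget(void)
{
	return DEMO_ENVELOPE_MAX -
	       offsetof(struct demo_papr_scm_cmd_pkg, payload);
}
```

Under these assumptions the payload starts at offset 72, so
demo_payload_budget() gives 256 - 64 - 8 = 184 bytes, matching the
figure quoted above.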

Envelope Payload Layout
=======================

The layout of the DSM payload is defined by various structs defined in
'papr_scm_dsm.h'. Definitions of these structs are shared between
papr_scm and libndctl so that the contents of the payload can be
interpreted. This patch-set introduces two such structs, namely
'nd_papr_scm_dimm_health_stat' and 'nd_papr_scm_perf_stats', that can
be used to exchange dimm health and performance stats between papr_scm
and libndctl.

During servicing of a DSM the papr_scm module will read input args
from the payload field by casting its contents to an appropriate
struct pointer based on the DSM command. Similarly the output of
servicing the DSM command will be copied to the payload field using
the same struct.

Payload Version
===============

The structs associated with each DSM can evolve over time by adding
more data, and the definitions of these structs known to papr_scm and
libndctl may differ. Hence a version number is associated with each
iteration of the struct. This version number is exchanged between
papr_scm <-> libndctl via the 'payload_version' field of the DSM envelope.

When libndctl sends an envelope to papr_scm it populates the
'payload_version' field with the version number of the struct it has
copied and/or expects in the payload area. The papr_scm module, when
servicing the DSM envelope, checks the 'payload_version', if required
changes it to a different version number that it knows about, and then
uses the DSM struct associated with the new version number to process
the DSM (i.e. read the args and copy the results to the payload
area). Libndctl, on receiving the envelope back from papr_scm, again
checks the 'payload_version' field and based on it uses the appropriate
version of the DSM struct to parse the results.
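
The kernel side of this exchange can be sketched in userspace as
follows. Everything named demo_* is hypothetical, including the two
struct layouts and the choice of 2 as the newest version. The sketch
also assumes that the older layout is a strict prefix of the newer one,
so copying the first N bytes of the newest struct yields a valid older
struct.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HEALTH_VERSION 2u	/* newest layout this side knows (assumed) */

/* Hypothetical layouts: v1 is a prefix of v2. */
struct demo_health_v1 { uint8_t unarmed; uint8_t bad_shutdown; };
struct demo_health_v2 { uint8_t unarmed; uint8_t bad_shutdown; uint8_t health; };

/* Map payload version to bytes to copy; 0 marks an unsupported version. */
static const size_t demo_copysizes[DEMO_HEALTH_VERSION + 1] = {
	[1] = sizeof(struct demo_health_v1),
	[DEMO_HEALTH_VERSION] = sizeof(struct demo_health_v2),
};

/*
 * Clamp *version down to the newest version known here, then copy the
 * matching prefix of 'cached' into 'out'. Returns the number of bytes
 * copied, or -1 if the version is unsupported or 'out' is too small.
 */
static int demo_fill_payload(uint16_t *version, void *out, size_t out_len,
			     const struct demo_health_v2 *cached)
{
	size_t n;

	if (*version > DEMO_HEALTH_VERSION)
		*version = DEMO_HEALTH_VERSION;

	n = demo_copysizes[*version];
	if (n == 0 || out_len < n)
		return -1;

	memcpy(out, cached, n);
	return (int)n;
}
```

The caller reads back *version afterwards to learn which layout was
actually written, which corresponds to libndctl re-checking
'payload_version' on the returned envelope.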

The above scheme of exchanging differently versioned DSM structs between
libndctl and papr_scm should work as long as the following two assumptions
hold:

Let T(X) = { 

[PATCH 8/8] powerpc/papr_scm: Implement support for DSM_PAPR_SCM_HEALTH

2020-02-20 Thread Vaibhav Jain
The DSM 'DSM_PAPR_SCM_HEALTH' should return a 'struct
nd_papr_scm_dimm_health_stat' containing information on dimm health back
to user space in response to ND_CMD_CALL. We implement this DSM by
adding a new function papr_scm_get_health() that queries the
DIMM health information and then copies this information to the package
payload whose layout is defined by 'struct papr_scm_ndctl_health'.

The patch also handles cases where future versions of 'struct
papr_scm_ndctl_health' may want to return more health
information. Such payload envelopes will contain appropriate version
information in 'struct nd_papr_scm_cmd_pkg.payload_version'. The patch
takes care of only returning the sub-data corresponding to the payload
version requested. Please see the comments in papr_scm_get_health()
for how this is done.

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 73 +++
 1 file changed, 73 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 29f38246c59f..bf81acb0bf3f 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -415,6 +415,74 @@ static int cmd_to_func(struct nvdimm *nvdimm, unsigned int 
cmd, void *buf,
return pkg->hdr.nd_command;
 }
 
+/*
+ * Fetch the DIMM health info and populate it in provided papr_scm package.
+ * Since the caller can request a different version of payload and each new
+ * version of struct nd_papr_scm_dimm_health_stat is a proper-subset of
+ * previous version hence we return a subset of the cached 'struct
+ * nd_papr_scm_dimm_health_stat' depending on the payload version requested.
+ */
+static int papr_scm_get_health(struct papr_scm_priv *p,
+  struct nd_papr_scm_cmd_pkg *pkg)
+{
+   int rc;
+   size_t copysize;
+   /* Map version to number of bytes to be copied to payload */
+   const size_t copysizes[] = {
+   [1] =
+   sizeof(struct nd_papr_scm_dimm_health_stat_v1),
+
+   /* This should always be present */
+   [ND_PAPR_SCM_DIMM_HEALTH_VERSION] =
+   sizeof(struct nd_papr_scm_dimm_health_stat),
+   };
+
+   rc = drc_pmem_query_health(p);
+   if (rc)
+   goto out;
+   /*
+* If the requested payload version is greater than one we know
+* about, return the payload version we know about and let
+* caller/userspace handle the mess.
+*/
+   if (pkg->payload_version > ND_PAPR_SCM_DIMM_HEALTH_VERSION)
+   pkg->payload_version = ND_PAPR_SCM_DIMM_HEALTH_VERSION;
+
+   copysize = copysizes[pkg->payload_version];
+   if (!copysize) {
+   dev_dbg(&p->pdev->dev, "%s Unsupported payload version=0x%x\n",
+   __func__, pkg->payload_version);
+   rc = -ENOSPC;
+   goto out;
+   }
+
+   if (pkg->hdr.nd_size_out < copysize) {
+   dev_dbg(&p->pdev->dev, "%s Payload not large enough\n",
+   __func__);
+   dev_dbg(&p->pdev->dev, "%s Expected %lu, available %u\n",
+   __func__, copysize, pkg->hdr.nd_size_out);
+   rc = -ENOSPC;
+   goto out;
+   }
+
+   dev_dbg(&p->pdev->dev, "%s Copying payload size=%lu version=0x%x\n",
+   __func__, copysize, pkg->payload_version);
+
+   /* Copy a subset of health struct based on copysize */
+   memcpy(papr_scm_pcmd_to_payload(pkg), &p->health, copysize);
+   pkg->hdr.nd_fw_size = copysize;
+
+out:
+   /*
+* Put the error in out package and return success from function
+* so that errors if any are propagated back to userspace.
+*/
+   pkg->cmd_status = rc;
+   dev_dbg(&p->pdev->dev, "%s completion code = %d\n", __func__, rc);
+
+   return 0;
+}
+
 int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm 
*nvdimm,
unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc)
 {
@@ -460,6 +528,11 @@ int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, 
struct nvdimm *nvdimm,
*cmd_rc = 0;
break;
 
+   case DSM_PAPR_SCM_HEALTH:
+   call_pkg = nd_to_papr_cmd_pkg(buf);
+   *cmd_rc = papr_scm_get_health(p, call_pkg);
+   break;
+
default:
dev_dbg(&p->pdev->dev, "Unknown command = %d\n", cmd_in);
*cmd_rc = -EINVAL;
-- 
2.24.1



[PATCH 7/8] powerpc/papr_scm: Re-implement 'papr_flags' using 'nd_papr_scm_dimm_health_stat'

2020-02-20 Thread Vaibhav Jain
Previous commit [1] introduced 'struct nd_papr_scm_dimm_health_stat' for
communicating health status of an nvdimm to libndctl. This struct
however can also be used to cache the nvdimm health information in
'struct papr_scm_priv' instead of two '__be64' values. The benefit of this
re-factoring will be apparent once libndctl is able to request nvdimm
health stats, at which point we can simply memcpy this struct over to
the user-space provided payload envelope.

Hence this patch introduces a new member 'struct papr_scm_priv.health'
that caches the health information of a dimm. This member is populated
inside drc_pmem_query_health() which checks for the various bit flags
returned from H_SCM_HEALTH and sets them in this struct. We also
re-factor 'papr_flags' sysfs attribute show function papr_flags_show()
to use the flags in 'struct papr_scm_priv.health' to return
appropriate status strings pertaining to dimm health.

This patch shouldn't introduce any behavioral change.

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 61 ---
 1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index d5eea2f38fa6..29f38246c59f 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -47,8 +47,7 @@ struct papr_scm_priv {
struct mutex dimm_mutex;
 
/* Health information for the dimm */
-   __be64 health_bitmap;
-   __be64 health_bitmap_valid;
+   struct nd_papr_scm_dimm_health_stat health;
 
/* length of the stat buffer as expected by phyp */
size_t len_stat_buffer;
@@ -205,6 +204,7 @@ static int drc_pmem_query_health(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
int64_t rc;
+   __be64 health;
 
rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index);
if (rc != H_SUCCESS) {
@@ -219,13 +219,41 @@ static int drc_pmem_query_health(struct papr_scm_priv *p)
return rc;
 
/* Store the retrieved health information in dimm platform data */
-   p->health_bitmap = ret[0];
-   p->health_bitmap_valid = ret[1];
+   health = ret[0] & ret[1];
 
dev_dbg(&p->pdev->dev,
"Queried dimm health info. Bitmap:0x%016llx Mask:0x%016llx\n",
-   be64_to_cpu(p->health_bitmap),
-   be64_to_cpu(p->health_bitmap_valid));
+   be64_to_cpu(ret[0]),
+   be64_to_cpu(ret[1]));
+
+   memset(&p->health, 0, sizeof(p->health));
+
+   /* Check for various masks in bitmap and set the buffer */
+   if (health & PAPR_SCM_DIMM_UNARMED_MASK)
+   p->health.dimm_unarmed = true;
+
+   if (health & PAPR_SCM_DIMM_BAD_SHUTDOWN_MASK)
+   p->health.dimm_bad_shutdown = true;
+
+   if (health & PAPR_SCM_DIMM_BAD_RESTORE_MASK)
+   p->health.dimm_bad_restore = true;
+
+   if (health & PAPR_SCM_DIMM_ENCRYPTED)
+   p->health.dimm_encrypted = true;
+
+   if (health & PAPR_SCM_DIMM_SCRUBBED_AND_LOCKED) {
+   p->health.dimm_locked = true;
+   p->health.dimm_scrubbed = true;
+   }
+
+   if (health & PAPR_SCM_DIMM_HEALTH_UNHEALTHY)
+   p->health.dimm_health = DSM_PAPR_SCM_DIMM_UNHEALTHY;
+
+   if (health & PAPR_SCM_DIMM_HEALTH_CRITICAL)
+   p->health.dimm_health = DSM_PAPR_SCM_DIMM_CRITICAL;
+
+   if (health & PAPR_SCM_DIMM_HEALTH_FATAL)
+   p->health.dimm_health = DSM_PAPR_SCM_DIMM_FATAL;
 
mutex_unlock(&p->dimm_mutex);
return 0;
@@ -513,7 +541,6 @@ static ssize_t papr_flags_show(struct device *dev,
 {
struct nvdimm *dimm = to_nvdimm(dev);
struct papr_scm_priv *p = nvdimm_provider_data(dimm);
-   __be64 health;
int rc;
 
rc = drc_pmem_query_health(p);
@@ -525,26 +552,26 @@ static ssize_t papr_flags_show(struct device *dev,
if (rc)
return rc;
 
-   health = p->health_bitmap & p->health_bitmap_valid;
-
-   /* Check for various masks in bitmap and set the buffer */
-   if (health & PAPR_SCM_DIMM_UNARMED_MASK)
+   if (p->health.dimm_unarmed)
rc += sprintf(buf, "not_armed ");
 
-   if (health & PAPR_SCM_DIMM_BAD_SHUTDOWN_MASK)
+   if (p->health.dimm_bad_shutdown)
rc += sprintf(buf + rc, "save_fail ");
 
-   if (health & PAPR_SCM_DIMM_BAD_RESTORE_MASK)
+   if (p->health.dimm_bad_restore)
rc += sprintf(buf + rc, "restore_fail ");
 
-   if (health & PAPR_SCM_DIMM_ENCRYPTED)
+   if (p->health.dimm_encrypted)
rc += sprintf(buf + rc, "encrypted ");
 
-   if (health & PAPR_SCM_DIMM_SMART_EVENT_MASK)
+   if (p->health.dimm_health)
rc += sprintf(buf + rc, "smart_notify ");
 
-   if (health & PAPR_SCM_DIMM_SCRUBBED_AND_LOCKED)
-   rc += 

[PATCH 6/8] powerpc/papr_scm: Add support for handling PAPR DSM commands

2020-02-20 Thread Vaibhav Jain
Implement support for handling PAPR DSM commands in papr_scm
module. We advertise support for ND_CMD_CALL for the dimm command mask
and implement necessary scaffolding in the module to handle ND_CMD_CALL
ioctl and DSM commands that we receive.

The layout of the DSM commands as we expect from libnvdimm/libndctl is
defined in 'struct nd_pkg_papr_scm' which contains a 'struct
nd_cmd_pkg' as header. This header is used to communicate the DSM
command via 'nd_pkg_papr_scm->nd_command' and the size of the payload
that needs to be sent/received for servicing the DSM.

The PAPR DSM commands are assigned indexes starting from 0x1 to
prevent them from overlapping ND_CMD_* values, which also makes handling
dimm commands in papr_scm_ndctl() easier via a simplified switch-case
block. For this a new function cmd_to_func() is implemented that reads
the args to papr_scm_ndctl(), performs sanity tests on them and
converts them to PAPR DSM commands which can then be handled via the
switch-case block.
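
The dispatch idea can be sketched in userspace as below. All DEMO_*
values are placeholders picked for illustration (in particular the
0x10000 base for the sub-command range is an assumption), and the
envelope validation done by the real cmd_to_func() is left out. What
the sketch shows is that keeping the sub-command numbers in a range
disjoint from the plain command numbers allows one switch statement to
handle both.

```c
/* Placeholder command numbers, standing in for ND_CMD_* values. */
#define DEMO_CMD_GET_CONFIG_SIZE 4u
#define DEMO_CMD_GET_CONFIG_DATA 5u
#define DEMO_CMD_SET_CONFIG_DATA 6u
#define DEMO_CMD_CALL            10u

#define DEMO_CMD_MASK \
	((1ul << DEMO_CMD_GET_CONFIG_SIZE) | \
	 (1ul << DEMO_CMD_GET_CONFIG_DATA) | \
	 (1ul << DEMO_CMD_SET_CONFIG_DATA) | \
	 (1ul << DEMO_CMD_CALL))

/* Hypothetical sub-command range, kept clear of the values above. */
#define DEMO_DSM_MIN    0x10000ull
#define DEMO_DSM_HEALTH 0x10001ull
#define DEMO_DSM_MAX    0x10002ull

/*
 * Fold (cmd, sub_cmd) into one value suitable for a switch statement.
 * Returns -1 for commands outside the advertised mask.
 */
static long demo_cmd_to_func(unsigned int cmd, unsigned long long sub_cmd)
{
	if (cmd >= 8 * sizeof(unsigned long) ||
	    !(DEMO_CMD_MASK & (1ul << cmd)))
		return -1;

	if (cmd != DEMO_CMD_CALL)
		return (long)cmd;		/* plain command */

	if (sub_cmd <= DEMO_DSM_MIN || sub_cmd >= DEMO_DSM_MAX)
		return (long)DEMO_CMD_CALL;	/* unknown sub-command */

	return (long)sub_cmd;
}
```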

Signed-off-by: Vaibhav Jain 
---
Change-log:
* Added a 'reserved' field in 'struct nd_pkg_papr_scm' to ensure
  'payload' falls on a 8-Byte aligned boundary.
---
 arch/powerpc/platforms/pseries/papr_scm.c | 87 +--
 1 file changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 28143a681aa2..d5eea2f38fa6 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -15,13 +15,15 @@
 
 #include 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
 #define PAPR_SCM_DIMM_CMD_MASK \
((1ul << ND_CMD_GET_CONFIG_SIZE) | \
 (1ul << ND_CMD_GET_CONFIG_DATA) | \
-(1ul << ND_CMD_SET_CONFIG_DATA))
+(1ul << ND_CMD_SET_CONFIG_DATA) | \
+(1ul << ND_CMD_CALL))
 
 struct papr_scm_priv {
struct platform_device *pdev;
@@ -330,19 +332,82 @@ static int papr_scm_meta_set(struct papr_scm_priv *p,
return 0;
 }
 
+/*
+ * Validate the input to dimm-control function and return papr_scm specific
+ * commands. This does sanity validation to ND_CMD_CALL sub-command packages.
+ */
+static int cmd_to_func(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+  unsigned int buf_len)
+{
+   unsigned long cmd_mask = PAPR_SCM_DIMM_CMD_MASK;
+   struct nd_papr_scm_cmd_pkg *pkg = nd_to_papr_cmd_pkg(buf);
+
+   /* Only dimm-specific calls are supported atm */
+   if (!nvdimm)
+   return -EINVAL;
+
+   if (!test_bit(cmd, &cmd_mask)) {
+   pr_debug("%s: Unsupported cmd=%u\n", __func__, cmd);
+   return -EINVAL;
+   } else if (cmd != ND_CMD_CALL) {
+   return cmd;
+   }
+
+   /* cmd == ND_CMD_CALL so verify the envelope package */
+
+   if (!buf || buf_len < sizeof(struct nd_papr_scm_cmd_pkg)) {
+   pr_debug("%s: Invalid pkg size=%u\n", __func__, buf_len);
+   return -EINVAL;
+   }
+
+   if (pkg->hdr.nd_family != NVDIMM_FAMILY_PAPR_SCM) {
+   pr_debug("%s: Invalid pkg family=0x%llx\n", __func__,
+pkg->hdr.nd_family);
+   return -EINVAL;
+
+   }
+
+   if (pkg->hdr.nd_command <= DSM_PAPR_SCM_MIN ||
+   pkg->hdr.nd_command >= DSM_PAPR_SCM_MAX) {
+
+   /* for unknown subcommands return ND_CMD_CALL */
+   pr_debug("%s: Unknown sub-command=0x%llx\n", __func__,
+pkg->hdr.nd_command);
+   return ND_CMD_CALL;
+   }
+
+   /* We expect a payload with all DSM commands */
+   if (papr_scm_pcmd_to_payload(pkg) == NULL) {
+   pr_debug("%s: Empty payload for sub-command=0x%llx\n", __func__,
+pkg->hdr.nd_command);
+   return -EINVAL;
+   }
+
+   /* Return the DSM_PAPR_SCM_* command */
+   return pkg->hdr.nd_command;
+}
+
 int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm 
*nvdimm,
unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc)
 {
struct nd_cmd_get_config_size *get_size_hdr;
struct papr_scm_priv *p;
+   struct nd_papr_scm_cmd_pkg *call_pkg = NULL;
+   int cmd_in, rc;
 
-   /* Only dimm-specific calls are supported atm */
-   if (!nvdimm)
-   return -EINVAL;
+   /* Use a local variable in case cmd_rc pointer is NULL */
+   if (cmd_rc == NULL)
+   cmd_rc = &rc;
+
+   cmd_in = cmd_to_func(nvdimm, cmd, buf, buf_len);
+   if (cmd_in < 0) {
+   pr_debug("%s: Invalid cmd=%u. Err=%d\n", __func__, cmd, cmd_in);
+   return cmd_in;
+   }
 
p = nvdimm_provider_data(nvdimm);
 
-   switch (cmd) {
+   switch (cmd_in) {
case ND_CMD_GET_CONFIG_SIZE:
get_size_hdr = buf;
 
@@ -360,13 +425,21 @@ int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, 
struct nvdimm *nvdimm,

[PATCH 4/8] UAPI: ndctl: Introduce NVDIMM_FAMILY_PAPR_SCM as a new NVDIMM DSM family

2020-02-20 Thread Vaibhav Jain
Add PAPR-scm family of DSM command-set to the white list of NVDIMM
command sets.

Signed-off-by: Vaibhav Jain 
---
 include/uapi/linux/ndctl.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index de5d90212409..99fb60600ef8 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -244,6 +244,7 @@ struct nd_cmd_pkg {
 #define NVDIMM_FAMILY_HPE2 2
 #define NVDIMM_FAMILY_MSFT 3
 #define NVDIMM_FAMILY_HYPERV 4
+#define NVDIMM_FAMILY_PAPR_SCM 5
 
 #define ND_IOCTL_CALL  _IOWR(ND_IOCTL, ND_CMD_CALL,\
struct nd_cmd_pkg)
-- 
2.24.1



[PATCH 3/8] powerpc/papr_scm: Fetch dimm performance stats from PHYP

2020-02-20 Thread Vaibhav Jain
Implement support for fetching dimm performance metrics via the
H_SCM_PERFORMANCE_STATS hcall as documented in Ref[1]. The hcall
returns a structure as described in Ref[1], defined here as the newly
introduced 'struct papr_scm_perf_stats'. The struct has a header
followed by key-value pairs of performance attributes. The 'key' part
is an 8-byte char array naming the attribute, encoded as a __be64
integer. This makes the output buffer format for the hcall
self-describing and easy to interpret.
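
To make the key encoding concrete, here is a userspace sketch of
turning such a 64-bit key back into its 8-character name. The function
name and the sample key "CtrlTemp" are made up for illustration. The
input is assumed to be the value already converted to CPU byte order,
so the first character sits in the most significant byte.

```c
#include <stdint.h>

/*
 * Unpack an 8-character name from a 64-bit statistic id, most
 * significant byte first. 'name' must have room for 9 bytes.
 */
static void demo_stat_id_to_name(uint64_t id, char name[9])
{
	int i;

	for (i = 0; i < 8; i++)
		name[i] = (char)((id >> (56 - 8 * i)) & 0xff);
	name[8] = '\0';
}
```

For example, the id 0x4374726C54656D70 unpacks to the string
"CtrlTemp" ('C' is 0x43, 't' is 0x74, and so on).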

This patch implements functionality to fetch these performance stats
and report them via an nvdimm sysfs attribute named
'papr_perf_stats'.

A new function drc_pmem_query_stats() is implemented that issues the
H_SCM_PERFORMANCE_STATS hcall, requesting PHYP to store performance stats
in a pre-allocated 'struct papr_scm_perf_stats' buffer. During nvdimm
initialization in papr_scm_nvdimm_init() this function is called with
an empty buffer to learn the max buffer size needed for issuing the
H_SCM_PERFORMANCE_STATS hcall. The buffer size retrieved is stored in the
newly introduced 'struct papr_scm_priv.len_stat_buffer' for later
retrieval.

[1]: commit 58b278f568f0 ("powerpc: Provide initial documentation for
PAPR hcalls")

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 107 ++
 1 file changed, 107 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index aaf2e4ab1f75..28143a681aa2 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -47,6 +47,9 @@ struct papr_scm_priv {
/* Health information for the dimm */
__be64 health_bitmap;
__be64 health_bitmap_valid;
+
+   /* length of the stat buffer as expected by phyp */
+   size_t len_stat_buffer;
 };
 
 static int drc_pmem_bind(struct papr_scm_priv *p)
@@ -152,6 +155,50 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
return drc_pmem_bind(p);
 }
 
+static int drc_pmem_query_stats(struct papr_scm_priv *p,
+   struct papr_scm_perf_stats *stats,
+   size_t size, uint64_t *out)
+{
+   unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   int64_t rc;
+
+   /* In case of no out buffer ignore the size */
+   if (!stats)
+   size = 0;
+
+   /*
+* Do the HCALL asking PHYP for info and if R4 was requested
+* return its value in 'out' variable.
+*/
+   rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
+__pa(stats), size);
+   if (out)
+   *out =  be64_to_cpu(ret[0]);
+
+   switch (rc) {
+   case H_SUCCESS:
+   /* Handle the case where size of stat buffer was requested */
+   if (size != 0)
+   dev_dbg(&p->pdev->dev,
+   "Performance stats returned %d stats\n",
+   be32_to_cpu(stats->num_statistics));
+   else
+   dev_dbg(&p->pdev->dev,
+   "Performance stats size %lld\n",
+   be64_to_cpu(ret[0]));
+   return 0;
+   case H_PARTIAL:
+   dev_err(&p->pdev->dev,
+"Unknown performance stats, Err:0x%016llX\n",
+   be64_to_cpu(ret[0]));
+   return -ENOENT;
+   default:
+   dev_err(&p->pdev->dev,
+"Failed to query performance stats, Err:%lld\n", rc);
+   return -ENXIO;
+   }
+}
+
 static int drc_pmem_query_health(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -341,6 +388,53 @@ static inline int papr_scm_node(int node)
return min_node;
 }
 
+static ssize_t papr_perf_stats_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nvdimm *dimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+   struct papr_scm_perf_stats *retbuffer;
+   struct papr_scm_perf_stat *stat;
+   uint64_t statid, val;
+   int rc, i;
+
+   if (!p->len_stat_buffer)
+   return -ENOENT;
+
+   /* Return buffer for phyp where stats are written */
+   retbuffer = kzalloc(p->len_stat_buffer, GFP_KERNEL);
+   if (!retbuffer)
+   return -ENOMEM;
+
+   /* Setup the buffer */
+   memcpy(retbuffer->eye_catcher, PAPR_SCM_PERF_STATS_EYECATCHER,
+  sizeof(retbuffer->eye_catcher));
+   retbuffer->stats_version = cpu_to_be32(0x1);
+   retbuffer->num_statistics = 0;
+
+   rc = drc_pmem_query_stats(p, retbuffer, p->len_stat_buffer, NULL);
+   if (rc)
+   goto out;
+
+   /*
+* Go through the returned output buffer and print stats and values.
+* Since statistic_id is essentially a char string of 8 bytes encoded
+   

[PATCH 2/8] powerpc/papr_scm: Provide support for fetching dimm health information

2020-02-20 Thread Vaibhav Jain
Implement support for fetching dimm health information via
H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair of
64-bit big-endian integers which are then stored in 'struct
papr_scm_priv' and subsequently exposed to userspace via dimm
attribute 'papr_flags'.

The 'papr_flags' sysfs attribute reports space-separated string flags
indicating the various health states an nvdimm can be in. These are:

* "not_armed"   : Indicating that nvdimm contents won't survive a power
  cycle.
* "save_fail"   : Indicating that nvdimm contents couldn't be flushed
  during last shutdown event.
* "restore_fail": Indicating that nvdimm contents couldn't be restored
  during dimm initialization.
* "encrypted"   : Dimm contents are encrypted.
* "smart_notify": There is a health event for the nvdimm.
* "scrubbed"    : Indicating that contents of the nvdimm have been
  scrubbed.
* "locked"      : Indicating that nvdimm contents can't be modified
  until next power cycle.
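
As a rough illustration of how such flags can be derived, the sketch below
masks a health bitmap with its validity mask and emits one flag string per
set bit. This is a userspace sketch, not the kernel code: health_to_flags()
and the flag_names table are hypothetical names, only a subset of the health
bits is covered, and the bitmap is assumed to be in host byte order already.

```c
#include <stdint.h>
#include <stdio.h>

/* PowerPC numbers bits from the most significant end: bit 0 is the MSB. */
#define PPC_BIT(bit)		(1ULL << (63 - (bit)))

/* Subset of the health bits, host byte order (illustrative only). */
#define DIMM_UNARMED		PPC_BIT(0)
#define DIMM_SHUTDOWN_DIRTY	PPC_BIT(1)
#define DIMM_EMPTY		PPC_BIT(3)
#define DIMM_ENCRYPTED		PPC_BIT(8)

struct flag_name {
	uint64_t mask;
	const char *name;
};

/* Hypothetical lookup table mapping a health bit to its flag string. */
static const struct flag_name flag_names[] = {
	{ DIMM_UNARMED,		"not_armed" },
	{ DIMM_SHUTDOWN_DIRTY,	"save_fail" },
	{ DIMM_EMPTY,		"restore_fail" },
	{ DIMM_ENCRYPTED,	"encrypted" },
};

/*
 * Hypothetical helper: write the space-separated flags for 'bitmap' into
 * 'buf' and return the number of characters written. Only bits that are
 * also set in 'valid' are reported.
 */
static size_t health_to_flags(uint64_t bitmap, uint64_t valid,
			      char *buf, size_t len)
{
	uint64_t health = bitmap & valid;
	size_t used = 0;
	size_t i;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		int n;

		if (!(health & flag_names[i].mask))
			continue;

		n = snprintf(buf + used, len - used, "%s ",
			     flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			/* Out of space: drop the partial flag and stop. */
			buf[used] = '\0';
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

A table keeps the bit-to-string mapping in one place, so adding a flag does
not need another if-block; whether that suits the driver is a design choice
left to the patch author.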

[1]: commit 58b278f568f0 ("powerpc: Provide initial documentation for
PAPR hcalls")

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 105 +-
 1 file changed, 103 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..aaf2e4ab1f75 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -14,6 +14,7 @@
 #include 
 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -39,6 +40,13 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+
+   /* Protect dimm data from concurrent access */
+   struct mutex dimm_mutex;
+
+   /* Health information for the dimm */
+   __be64 health_bitmap;
+   __be64 health_bitmap_valid;
 };
 
 static int drc_pmem_bind(struct papr_scm_priv *p)
@@ -144,6 +152,35 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
return drc_pmem_bind(p);
 }
 
+static int drc_pmem_query_health(struct papr_scm_priv *p)
+{
+   unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   int64_t rc;
+
+   rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index);
+   if (rc != H_SUCCESS) {
+   dev_err(&p->pdev->dev,
+"Failed to query health information, Err:%lld\n", rc);
+   return -ENXIO;
+   }
+
+   /* Protect modifications to papr_scm_priv with the mutex */
+   rc = mutex_lock_interruptible(&p->dimm_mutex);
+   if (rc)
+   return rc;
+
+   /* Store the retrieved health information in dimm platform data */
+   p->health_bitmap = ret[0];
+   p->health_bitmap_valid = ret[1];
+
+   dev_dbg(&p->pdev->dev,
+   "Queried dimm health info. Bitmap:0x%016llx Mask:0x%016llx\n",
+   be64_to_cpu(p->health_bitmap),
+   be64_to_cpu(p->health_bitmap_valid));
+
+   mutex_unlock(&p->dimm_mutex);
+   return 0;
+}
 
 static int papr_scm_meta_get(struct papr_scm_priv *p,
 struct nd_cmd_get_config_data_hdr *hdr)
@@ -304,6 +341,67 @@ static inline int papr_scm_node(int node)
return min_node;
 }
 
+static ssize_t papr_flags_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nvdimm *dimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+   __be64 health;
+   int rc;
+
+   rc = drc_pmem_query_health(p);
+   if (rc)
+   return rc;
+
+   /* Protect against modifications to papr_scm_priv with the mutex */
+   rc = mutex_lock_interruptible(&p->dimm_mutex);
+   if (rc)
+   return rc;
+
+   health = p->health_bitmap & p->health_bitmap_valid;
+
+   /* Check for various masks in bitmap and set the buffer */
+   if (health & PAPR_SCM_DIMM_UNARMED_MASK)
+   rc += sprintf(buf, "not_armed ");
+
+   if (health & PAPR_SCM_DIMM_BAD_SHUTDOWN_MASK)
+   rc += sprintf(buf + rc, "save_fail ");
+
+   if (health & PAPR_SCM_DIMM_BAD_RESTORE_MASK)
+   rc += sprintf(buf + rc, "restore_fail ");
+
+   if (health & PAPR_SCM_DIMM_ENCRYPTED)
+   rc += sprintf(buf + rc, "encrypted ");
+
+   if (health & PAPR_SCM_DIMM_SMART_EVENT_MASK)
+   rc += sprintf(buf + rc, "smart_notify ");
+
+   if (health & PAPR_SCM_DIMM_SCRUBBED_AND_LOCKED)
+   rc += sprintf(buf + rc, "scrubbed locked ");
+
+   if (rc > 0)
+   rc += sprintf(buf + rc, "\n");
+
+   mutex_unlock(&p->dimm_mutex);
+   return rc;
+}
+DEVICE_ATTR_RO(papr_flags);
+
+/* papr_scm specific dimm attributes */
+static struct attribute *papr_scm_nd_attributes[] = {
+   &dev_attr_papr_flags.attr,
+   NULL,
+};
+
+static struct attribute_group 

[PATCH 1/8] powerpc: Add asm header 'papr_scm.h' describing the papr-scm interface

2020-02-20 Thread Vaibhav Jain
Add a new powerpc specific asm header named 'papr_scm.h' that describes
the interface between PHYP and guest kernel running as an LPAR.

The HCALLs specific to managing SCM are described in Ref[1]. The asm
header introduced by this patch however describes the data structures
exchanged between PHYP and kernel during those HCALLs.

Future patches will use these structures to provide support for
retrieving nvdimm health and performance stats in papr_scm kernel
module.
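
To make the layout of the performance stats buffer concrete, here is a small
userspace sketch that mirrors the two structs from this header and computes
the buffer size needed for a given number of statistics. It is only a sketch:
plain fixed-width integers stand in for __be32/__be64, and put_be32(),
perf_stats_size() and perf_stats_init() are hypothetical helpers that do not
exist in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of 'struct papr_scm_perf_stat' (fields are big-endian). */
struct perf_stat {
	uint64_t statistic_id;
	uint64_t statistic_value;
};

/* Userspace mirror of 'struct papr_scm_perf_stats'. */
struct perf_stats {
	uint8_t eye_catcher[8];
	uint32_t stats_version;		/* big-endian, should be 0x01 */
	uint32_t num_statistics;	/* big-endian, stats that follow */
	struct perf_stat scm_statistics[];
};

/* Hypothetical helper: store 'v' at 'dst' in big-endian byte order. */
static void put_be32(void *dst, uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v >> 24), (uint8_t)(v >> 16),
		(uint8_t)(v >> 8), (uint8_t)v,
	};

	memcpy(dst, b, sizeof(b));
}

/* Bytes needed for the header plus 'num_stats' trailing entries. */
static size_t perf_stats_size(uint32_t num_stats)
{
	return sizeof(struct perf_stats) +
	       (size_t)num_stats * sizeof(struct perf_stat);
}

/* Fill in the header the way a caller might before issuing the hcall. */
static void perf_stats_init(struct perf_stats *stats)
{
	/* The eye catcher is exactly 8 bytes, without a trailing NUL. */
	memcpy(stats->eye_catcher, "SCMSTATS", sizeof(stats->eye_catcher));
	put_be32(&stats->stats_version, 0x1);
	put_be32(&stats->num_statistics, 0);
}
```

With this layout the header is 16 bytes and each statistic adds another 16,
so a buffer for three statistics needs 64 bytes.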

[1]: commit 58b278f568f0 ("powerpc: Provide initial documentation for
PAPR hcalls")

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/include/asm/papr_scm.h | 68 +
 1 file changed, 68 insertions(+)
 create mode 100644 arch/powerpc/include/asm/papr_scm.h

diff --git a/arch/powerpc/include/asm/papr_scm.h 
b/arch/powerpc/include/asm/papr_scm.h
new file mode 100644
index ..d893621063f3
--- /dev/null
+++ b/arch/powerpc/include/asm/papr_scm.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Structures and defines needed to manage nvdimms for spapr guests.
+ */
+#ifndef _ASM_POWERPC_PAPR_SCM_H_
+#define _ASM_POWERPC_PAPR_SCM_H_
+
+#include 
+#include 
+#include 
+
+/* DIMM health bitmap indicators */
+/* SCM device is unable to persist memory contents */
+#define PAPR_SCM_DIMM_UNARMED  PPC_BIT(0)
+/* SCM device failed to persist memory contents */
+#define PAPR_SCM_DIMM_SHUTDOWN_DIRTY   PPC_BIT(1)
+/* SCM device contents are persisted from previous IPL */
+#define PAPR_SCM_DIMM_SHUTDOWN_CLEAN   PPC_BIT(2)
+/* SCM device contents are not persisted from previous IPL */
+#define PAPR_SCM_DIMM_EMPTYPPC_BIT(3)
+/* SCM device memory life remaining is critically low */
+#define PAPR_SCM_DIMM_HEALTH_CRITICAL  PPC_BIT(4)
+/* SCM device will be garded off next IPL due to failure */
+#define PAPR_SCM_DIMM_HEALTH_FATAL PPC_BIT(5)
+/* SCM contents cannot persist due to current platform health status */
+#define PAPR_SCM_DIMM_HEALTH_UNHEALTHY PPC_BIT(6)
+/* SCM device is unable to persist memory contents in certain conditions */
+#define PAPR_SCM_DIMM_HEALTH_NON_CRITICAL  PPC_BIT(7)
+/* SCM device is encrypted */
+#define PAPR_SCM_DIMM_ENCRYPTEDPPC_BIT(8)
+/* SCM device has been scrubbed and locked */
+#define PAPR_SCM_DIMM_SCRUBBED_AND_LOCKED  PPC_BIT(9)
+
+/* Bits status indicators for health bitmap indicating unarmed dimm */
+#define PAPR_SCM_DIMM_UNARMED_MASK (PAPR_SCM_DIMM_UNARMED |\
+   PAPR_SCM_DIMM_HEALTH_UNHEALTHY | \
+   PAPR_SCM_DIMM_HEALTH_NON_CRITICAL)
+
+/* Bits status indicators for health bitmap indicating unflushed dimm */
+#define PAPR_SCM_DIMM_BAD_SHUTDOWN_MASK (PAPR_SCM_DIMM_SHUTDOWN_DIRTY)
+
+/* Bits status indicators for health bitmap indicating unrestored dimm */
+#define PAPR_SCM_DIMM_BAD_RESTORE_MASK  (PAPR_SCM_DIMM_EMPTY)
+
+/* Bit status indicators for smart event notification */
+#define PAPR_SCM_DIMM_SMART_EVENT_MASK (PAPR_SCM_DIMM_HEALTH_CRITICAL | \
+  PAPR_SCM_DIMM_HEALTH_FATAL | \
+  PAPR_SCM_DIMM_HEALTH_UNHEALTHY | \
+  PAPR_SCM_DIMM_HEALTH_NON_CRITICAL)
+
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+   __be64 statistic_id;
+   __be64 statistic_value;
+};
+
+/* Struct exchanged between kernel and ndctl reporting drc perf stats */
+struct papr_scm_perf_stats {
+   uint8_t eye_catcher[8];
+   __be32 stats_version;   /* Should be 0x01 */
+   __be32 num_statistics;  /* Number of stats following */
+   /* zero or more performance metrics */
+   struct papr_scm_perf_stat scm_statistics[];
+} __packed;
+
+#endif
-- 
2.24.1



[PATCH 0/7] powerpc/papr_scm: Add support for reporting nvdimm health

2020-02-20 Thread Vaibhav Jain
The PAPR standard[1][3] provides suitable mechanisms to query the health and
performance stats of an NVDIMM via various hcalls as described in Ref[2]. Until
now these stats were neither available nor exposed to user-space tools like
'ndctl'. This is partly due to the PAPR platform not having support for ACPI and
NFIT. Hence 'ndctl' is unable to query and report the dimm health status, and a
user has no way to determine the current health status of an NVDIMM.

To overcome this limitation this patch-set updates the papr_scm kernel module to
query and fetch nvdimm health and performance stats using hcalls described in
Ref[2]. These health and performance stats are then exposed to userspace via
sysfs and Dimm-Specific-Methods(DSM) issued by libndctl.

These changes, coupled with the proposed ndctl changes located at Ref[4], should
provide a way for the user to retrieve NVDIMM health status using ndctl. Below
is a sample output using the proposed kernel + ndctl for a PAPR NVDIMM in an
emulation environment:

 # ndctl list -DH
[
  {
"dev":"nmem0",
"health":{
  "health_state":"fatal",
  "shutdown_state":"dirty"
}
  }
]

PAPR Dimm-Specific-Methods(DSM)
===============================

As the name suggests DSMs are used by vendor specific code in libndctl to
execute certain operations or fetch certain information for NVDIMMS. DSMs
can be sent to papr_scm module via libndctl (userspace) and libnvdimm(kernel)
using the ND_CMD_CALL ioctl which can be handled in the dimm control function
papr_scm_ndctl(). For PAPR this patchset proposes two DSMs defined in the newly
introduced uapi header named 'papr_scm_dsm.h', that directly map to hcalls
provided by PHYP to query NVDIMM health and stats. These DSMs are:

* DSM_PAPR_SCM_HEALTH: Maps to hcall H_SCM_HEALTH and returns dimm health.

* DSM_PAPR_SCM_STATS: Maps to hcall H_SCM_PERFORMANCE_STATS and returns
  dimm performance stats.

P.S: The current patch-set only provides an implementation for servicing
DSM_PAPR_SCM_HEALTH and a future patch will add support for DSM_PAPR_SCM_STATS.

The ioctl ND_CMD_CALL can also transfer data between user-space and kernel via
'envelopes'. The envelope is part of a 'struct nd_cmd_pkg' which in turn is
wrapped in a user defined struct, which in our case is the newly introduced
'struct nd_papr_scm_cmd_pkg'. Apart from the 'envelope header' this struct holds
'payload', 'payload offset', 'payload version' and 'command status'.

The 'payload' field of the envelope holds a struct depending on the DSM method
used and should be one of the structs defined in the newly introduced uapi header
'papr_scm_dsm.h'. This makes it possible for libndctl/kernel to share the same
definitions for these DSM structs.

Earlier Work
============

An earlier RFC patch set titled "powerpc/papr_scm: Implement support for
reporting DIMM health and stats" [5] was proposed which tried to achieve the
same functionality, albeit with a different approach, i.e. the papr_scm module
acted as a pass-through for the DSM calls from libndctl.

This patch-set however departs from that design by decoupling the
libndctl <--> papr_scm and papr_scm <--> phyp interfaces. This provides
more flexibility compared to the earlier approach, where these two interfaces
were coupled with each other.

Structure of the patch-set
==========================

The initial 3 patches of the patch-set add functionality of issuing necessary
HCALLs to PHYP to retrieve the dimm health/performance stats information and
exposing them to user-space via sysfs attributes.

Subsequent patches deal with defining and implementing support for
NVDIMM_FAMILY_PAPR_SCM DSM command family and implementing the payload
versioning scheme as mentioned above.

References:
[1]: "Power Architecture Platform Reference"
  https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
[2]: "[DOC,v2] powerpc: Provide initial documentation for PAPR hcalls"
 https://patchwork.ozlabs.org/patch/1154292/
[3]: "Linux on Power Architecture Platform Reference"
 https://members.openpowerfoundation.org/document/dl/469
[4]: https://github.com/vaibhav92/ndctl/tree/papr_scm_health_v1
[5]: 
https://lore.kernel.org/linuxppc-dev/20200129152844.71286-1-vaib...@linux.ibm.com/

Vaibhav Jain (8):
  powerpc: Add asm header 'papr_scm.h' describing the papr-scm interface
  powerpc/papr_scm: Provide support for fetching dimm health information
  powerpc/papr_scm: Fetch dimm performance stats from PHYP
  UAPI: ndctl: Introduce NVDIMM_FAMILY_PAPR_SCM as a new NVDIMM DSM
family
  powerpc/uapi: Introduce uapi header 'papr_scm_dsm.h' for papr_scm DSMs
  powerpc/papr_scm: Add support for handling PAPR DSM commands
  powerpc/papr_scm: Re-implement 'papr_flags' using
'nd_papr_scm_dimm_health_stat'
  powerpc/papr_scm: Implement support for DSM_PAPR_SCM_HEALTH

 arch/powerpc/include/asm/papr_scm.h  |  68 
 arch/powerpc/include/uapi/asm/papr_scm_dsm.h | 143 +++
 arch/powerpc/platforms/pseries/papr_scm.c| 399 

Re: [PATCH 8/8] perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric events

2020-02-20 Thread maddy




On 2/14/20 4:33 PM, Kajol Jain wrote:

The hv_24×7 feature in IBM® POWER9™ processor-based servers provide the
facility to continuously collect large numbers of hardware performance
metrics efficiently and accurately.
This patch adds hv_24x7 json metric file for different Socket/chip
resources.

Result:

power9 platform:

command:# ./perf stat --metric-only -M Memory_RD_BW_Chip -C 0
-I 1000 sleep 1

time MB   Memory_RD_BW_Chip_0 MB   Memory_RD_BW_Chip_1 MB
1.000192635  0.4  0.0
1.001695883  0.0  0.0

Signed-off-by: Kajol Jain 
---
  .../arch/powerpc/power9/hv_24x7_metrics.json  | 19 +++
  1 file changed, 19 insertions(+)
  create mode 100644 
tools/perf/pmu-events/arch/powerpc/power9/hv_24x7_metrics.json

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/hv_24x7_metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/hv_24x7_metrics.json
new file mode 100644
index ..ac38f5540ac6
--- /dev/null
+++ b/tools/perf/pmu-events/arch/powerpc/power9/hv_24x7_metrics.json


Better to have it as nest_metrics.json instead.  Rest looks fine.

Reviewed-by: Madhavan Srinivasan 


@@ -0,0 +1,19 @@
+[
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_RD_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS01_128B_RD_DISP_PORT23\\,chip\\=?@ + hv_24x7@PM_MCS23_128B_RD_DISP_PORT01\\,chip\\=?@ 
+ hv_24x7@PM_MCS23_128B_RD_DISP_PORT23\\,chip\\=?@)",
+"MetricName": "Memory_RD_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_WR_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS01_128B_WR_DISP_PORT23\\,chip\\=?@ + hv_24x7@PM_MCS23_128B_WR_DISP_PORT01\\,chip\\=?@ 
+ hv_24x7@PM_MCS23_128B_WR_DISP_PORT23\\,chip\\=?@ )",
+"MetricName": "Memory_WR_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_PB_CYC\\,chip\\=?@ )",
+"MetricName": "PowerBUS_Frequency",
+"ScaleUnit": "2.5e-7GHz"
+}
+]




Re: [RESEND PATCH v2 9/9] ath5k: Constify ioreadX() iomem argument (as in generic implementation)

2020-02-20 Thread Jiri Slaby
On 19. 02. 20, 18:50, Krzysztof Kozlowski wrote:
> The ioreadX() helpers have inconsistent interface.  On some architectures
> void *__iomem address argument is a pointer to const, on some not.
> 
> Implementations of ioreadX() do not modify the memory under the address
> so they can be converted to a "const" version for const-safety and
> consistency among architectures.
> 
> Signed-off-by: Krzysztof Kozlowski 
> Acked-by: Kalle Valo 
> ---
>  drivers/net/wireless/ath/ath5k/ahb.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/wireless/ath/ath5k/ahb.c 
> b/drivers/net/wireless/ath/ath5k/ahb.c
> index 2c9cec8b53d9..8bd01df369fb 100644
> --- a/drivers/net/wireless/ath/ath5k/ahb.c
> +++ b/drivers/net/wireless/ath/ath5k/ahb.c
> @@ -138,18 +138,18 @@ static int ath_ahb_probe(struct platform_device *pdev)
>  
>   if (bcfg->devid >= AR5K_SREV_AR2315_R6) {
>   /* Enable WMAC AHB arbitration */
> - reg = ioread32((void __iomem *) AR5K_AR2315_AHB_ARB_CTL);
> + reg = ioread32((const void __iomem *) AR5K_AR2315_AHB_ARB_CTL);

While I understand why the parameter of ioread32 should be const, I
don't see a reason for these casts on the users' side. What does it
bring except longer code to read?
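
To illustrate the question, here is a minimal sketch in plain C. It assumes a
stand-in reader, read32_sketch(), with a const-qualified address parameter
like the proposed ioread32() prototype; the function names and the fake
register are hypothetical. C converts a pointer to non-const into a pointer
to const implicitly, so the caller's pointer type does not need to name const.

```c
#include <stdint.h>

/* Stand-in for ioread32() with a const-qualified address parameter. */
static uint32_t read32_sketch(const void *addr)
{
	return *(const volatile uint32_t *)addr;
}

static uint32_t read_via_plain_pointer(void)
{
	/* Hypothetical register backing store for the sketch. */
	static uint32_t fake_reg = 0x12345678;
	void *reg = &fake_reg;	/* non-const, like a (void __iomem *) cast */

	/* No cast needed: void * converts to const void * implicitly. */
	return read32_sketch(reg);
}
```

In the driver the cast itself is still needed, because the argument is an
address constant rather than a pointer; the sketch only shows that the const
qualifier inside the cast is optional.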

thanks,
-- 
js


[PATCH 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Architectures for which we have hardware walkers of the Linux page table
should flush the TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER support multiple translation modes (hash
and radix), and in the case of POWER only the radix translation mode needs
the above TLBI.  This is because for the hash translation mode the kernel
wants to avoid this extra flush since there are no hardware walkers of the
Linux page table.  With radix translation, the hardware also walks the Linux
page table and with that, the kernel needs to make sure to TLB invalidate the
page walk cache before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.
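
The opt-out mechanism described above can be sketched in a few lines of
standalone C. The generic code supplies a default that always invalidates,
and an architecture overrides it by defining the macro first. The counter and
table_invalidate_sketch() are hypothetical stand-ins for the real TLB flush,
and the default of (true) is taken from the description above rather than
from the diff.

```c
#include <stdbool.h>

/*
 * An architecture header that wants to skip the invalidate would define
 * the macro before the generic header is included, for example:
 *
 *   #define tlb_needs_table_invalidate()	(false)
 */

/* Generic fallback: invalidate unless the architecture said otherwise. */
#ifndef tlb_needs_table_invalidate
#define tlb_needs_table_invalidate()	(true)
#endif

/* Hypothetical counter standing in for the real TLB flush. */
static int invalidate_count;

/* Only issue the (simulated) invalidate when the architecture needs it. */
static void table_invalidate_sketch(void)
{
	if (tlb_needs_table_invalidate())
		invalidate_count++;
}
```

Because the check is a macro, an architecture can also make it a runtime
test, which is what the powerpc hunk does with radix_enabled().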

0ed1325967ab5f in upstream.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fa231130eee1..b6429f53835e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define 

[PATCH 6/6] asm-generic/tlb: avoid potential double flush

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.
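
The effect can be replayed with a small userspace model. This is a sketch
under simplifying assumptions: struct gather_model keeps only the fields that
matter here, the cleared_p* bits are collapsed into one, and ~0UL stands in
for TASK_SIZE. None of these names exist in the kernel.

```c
#include <stdbool.h>

/* Simplified userspace model of the mmu_gather fields involved. */
struct gather_model {
	unsigned long start, end;
	bool fullmm;
	bool freed_tables;
	bool cleared_ptes;	/* stands in for all cleared_p* bits */
	int flushes;		/* number of TLBIs the model issued */
};

/* Mirrors __tlb_reset_range(): for fullmm, 'end' stays non-zero. */
static void reset_range(struct gather_model *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = ~0UL;	/* stand-in for TASK_SIZE */
		tlb->end = 0;
	}
	tlb->freed_tables = false;
	tlb->cleared_ptes = false;
}

/* check_bits selects the new condition; otherwise test tlb->end. */
static void flush_tlbonly(struct gather_model *tlb, bool check_bits)
{
	bool pending = check_bits ?
		(tlb->freed_tables || tlb->cleared_ptes) : (tlb->end != 0);

	if (!pending)
		return;
	tlb->flushes++;
	reset_range(tlb);
}

/* Replays the call chain above for a fullmm teardown. */
static int count_flushes(bool check_bits)
{
	struct gather_model tlb = { .fullmm = true };

	reset_range(&tlb);
	tlb.cleared_ptes = true;	/* entries were unmapped ... */
	tlb.freed_tables = true;	/* ... and page tables freed */

	flush_tlbonly(&tlb, check_bits);	/* #1 */
	flush_tlbonly(&tlb, check_bits);	/* #2, via the table invalidate */
	return tlb.flushes;
}
```

In this model count_flushes(false) issues two flushes and
count_flushes(true) issues one, which matches the behaviour reported above.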

0758cd830494 in upstream.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.24.1



[PATCH 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-02-20 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and adds generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
for other architectures because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate the page walk
cache before we free page table pages.  In order to keep things simple,
enable RCU_TABLE_FREE even for !SMP so that we don't have to fix up the
!SMP case differently in the followup patch.

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

12e4d53f3f04e in upstream.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7f046ff6407..fa231130eee1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 297db665d953..5b4e9fd8990c 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -432,7 +432,6 @@ static 

[PATCH 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

96bc9567cbe1 in upstream.

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a80669209155..f7f046ff6407 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
(UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1



[PATCH 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-02-20 Thread Santosh Sivaraj
From: Will Deacon 

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.
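
As a rough illustration of the approximation described above, here is a
minimal userspace sketch, not the kernel code itself. The struct, the
function names and the shift values (4 KiB pages with an x86-64-like
layout) are assumptions made for the example only.

```c
#include <stdbool.h>

/* Assumed shifts, for illustration only (4 KiB pages, x86-64-like). */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PMD_SHIFT  21
#define MODEL_PUD_SHIFT  30
#define MODEL_P4D_SHIFT  39

/* Hypothetical stand-in for the cleared_* bits in struct mmu_gather. */
struct gather_model {
	bool cleared_ptes;
	bool cleared_pmds;
	bool cleared_puds;
	bool cleared_p4ds;
};

/* Return the shift of the smallest level at which entries were cleared. */
static unsigned int model_unmap_shift(const struct gather_model *g)
{
	if (g->cleared_ptes)
		return MODEL_PAGE_SHIFT;
	if (g->cleared_pmds)
		return MODEL_PMD_SHIFT;
	if (g->cleared_puds)
		return MODEL_PUD_SHIFT;
	if (g->cleared_p4ds)
		return MODEL_P4D_SHIFT;

	/* Nothing recorded: fall back to the base page size. */
	return MODEL_PAGE_SHIFT;
}

/* Smallest granule, in bytes, that an invalidation has to cover. */
static unsigned long model_unmap_size(const struct gather_model *g)
{
	return 1UL << model_unmap_shift(g);
}
```

The checks run from the smallest level upwards, so a range that mixes
base pages and huge pages reports the base page granule, which is the
conservative answer.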

a6d60245d6d9 in upstream

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} while (0)
 
@@ -264,6 +302,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 

[PATCH 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.
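
To make the intent concrete, the following is a minimal userspace
sketch of how architecture code might act on such a bit. It is not
taken from any real architecture; the enum, struct and function names
are hypothetical and only model the idea.

```c
#include <stdbool.h>

/* Hypothetical flush flavours an architecture might distinguish. */
enum flush_kind {
	FLUSH_LEAF_ONLY,           /* only last-level entries changed */
	FLUSH_LEAF_AND_WALK_CACHE, /* directory entries were freed too */
};

/* Hypothetical stand-in for the new bit in struct mmu_gather. */
struct gather_flags {
	bool freed_tables;
};

/* Models the *_free_tlb() macros: record that a table page was freed. */
static void model_free_table(struct gather_flags *g)
{
	g->freed_tables = true;
}

/* Models __tlb_reset_range(): clear the bit for the next batch. */
static void model_reset(struct gather_flags *g)
{
	g->freed_tables = false;
}

/* Pick the cheaper flush unless intermediate levels were touched. */
static enum flush_kind choose_flush(const struct gather_flags *g)
{
	return g->freed_tables ? FLUSH_LEAF_AND_WALK_CACHE : FLUSH_LEAF_ONLY;
}
```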

22a61c3c4f1379 in upstream

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.24.1



[PATCH 0/6] Memory corruption may occur due to incorrect tlb flush

2020-02-20 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

This is only for v4.19 stable.

[1] https://patchwork.kernel.org/cover/11284843/

--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 10 files changed, 122 insertions(+), 44 deletions(-)

-- 
2.24.1



Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Christophe Leroy




On 02/19/2020 10:39 PM, Radu Rendec wrote:

On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:

Radu Rendec wrote:

On 02/19/2020 at 10:11 AM Radu Rendec  wrote:

On 02/18/2020 at 1:08 PM Christophe Leroy  wrote:

On 18/02/2020 at 18:07, Radu Rendec wrote:

The saved NIP seems to be broken inside machine_check_exception() on
MPC8378, running Linux 4.9.191. The value is 0x900 most of the time,
but I have seen other weird values.

I've been able to track down the entry code to head_32.S (vector 0x200),
but I'm not sure where/how the NIP value (where the exception occurred)
is captured.


NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
saved into _NIP(r11) in transfer_to_handler in entry_32.S

Can something clobber r12 at some point ?



I did something even simpler: I added the following

  lis r12,0x1234

... right after

  mfspr r12,SPRN_SRR0

... and now the NIP value I see in the crash dump is 0x1234. This
means r12 is not clobbered and most likely the NIP value I normally see
is the actual SRR0 value.


I apologize for the noise. I just found out accidentally that the saved
NIP value is correct if interrupts are disabled at the time when the
faulty access that triggers the MCE occurs. This seems to happen
consistently.

By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
it's basically enough to wrap ioread32 to get the NIP value right.

Does this make any sense? Maybe it's not a silicon bug after all, or
maybe it is and I just found a workaround. Could this happen on other
PowerPC CPUs as well?


Interesting.

0x900 is the address of the timer interrupt.

Would the MCE occur just after the timer interrupt ?


I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

 bad_addr_base = ioremap(0xf000, 0x100);
 x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).


Can you try to measure how much time elapses between your read and the MCE?
The patch below should allow it: you'll see the first timebase value in r13
and the second in r14 (mce.c is your test code).


Also provide the timebase frequency as reported in /proc/cpuinfo

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 97c887950c3c..0ae6a0a17e26 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -273,6 +273,7 @@ __secondary_hold_acknowledge:
. = 0x200
DO_KVM  0x200
 MachineCheck:
+   mftbl   r14
EXCEPTION_PROLOG_0
 #ifdef CONFIG_VMAP_STACK
li  r11, MSR_KERNEL & ~(MSR_IR | MSR_RI) /* can take DTLB miss */
diff --git a/arch/powerpc/platforms/83xx/mce.c b/arch/powerpc/platforms/83xx/mce.c
index 91c2de6b73ca..0b7e4dcc0cb3 100644
--- a/arch/powerpc/platforms/83xx/mce.c
+++ b/arch/powerpc/platforms/83xx/mce.c
@@ -11,7 +11,7 @@ static int __init test_mce_init(void)
 bad_addr_base = ioremap(0xf000, 0x100);

 if (bad_addr_base) {
-__asm__ __volatile__ ("isync");
+__asm__ __volatile__ ("isync ; mftbl 13");
 x = ioread32(bad_addr_base);
 pr_info("Test: %#0x\n", x);
 } else
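
Once both timebase readings are available, the tick difference can be
converted to time using the timebase frequency from /proc/cpuinfo. A
minimal userspace sketch, assuming two 32-bit lower-timebase values as
read by mftbl; the function name is hypothetical, and the frequency has
to come from your board:

```c
#include <stdint.h>

/*
 * Convert the difference between two lower-timebase readings into
 * nanoseconds. The unsigned subtraction wraps modulo 2^32, so a single
 * rollover of the lower timebase between the readings is handled.
 */
static uint64_t tb_delta_ns(uint32_t tb_before, uint32_t tb_after,
			    uint64_t tb_freq_hz)
{
	uint32_t ticks = tb_after - tb_before;

	return (uint64_t)ticks * 1000000000ULL / tb_freq_hz;
}
```

Here tb_before would be the value seen in r13 and tb_after the value
seen in r14.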






Can you tell me how your I/O buses are configured, etc.?


Nothing special. The device tree is mostly similar to mpc8379_rdb.dts,
but I can provide the actual dts if you think it's relevant.


And what's the value of SERSR after the machine check ?


I'm assuming you're talking about the IPIC SERSR register. I modified
machine_check_exception and added a call to ipic_get_mcp_status, which
seems to read IPIC_SERSR. The value is 0, both with interrupts enabled
and disabled (which makes sense, since disabling/enabling interrupts is
local to the CPU core).


And what's the reason given in the Oops message for the machine check ? 
Is that "Caused by (from SRR1=49030): Transfer error ack signal" or 
something else ?





Do you use the local bus monitoring driver ?


I don't. In fact, I'm not even aware of it. What driver is that?


CONFIG_FSL_LBC

Christophe


[PATCH] powerpc/xive: Enforce load-after-store ordering when StoreEOI is active

2020-02-20 Thread Cédric Le Goater
When an interrupt has been handled, the OS notifies the interrupt
controller with an EOI sequence. On a POWER9 system using the XIVE
interrupt controller, this can be done with a load or a store
operation on the ESB interrupt management page of the interrupt. The
StoreEOI operation has less latency and improves interrupt handling
performance but it was deactivated during the POWER9 DD2.0 timeframe
because of ordering issues. We use the LoadEOI today but we plan to
reactivate StoreEOI in future architectures.

There is usually no need to enforce ordering between ESB load and
store operations as they should lead to the same result. E.g. a store
trigger and a load EOI can be executed in any order. Assuming the
interrupt state is PQ=10, a store trigger followed by a load EOI will
return a Q bit. In the reverse order, it will create a new interrupt
trigger from HW. In both cases, the handler processing interrupts is
notified.

In some cases, the XIVE_ESB_SET_PQ_10 load operation is used to
temporarily disable the interrupt source (mask/unmask). When the
source is reenabled, the OS can detect if interrupts were received
while the source was disabled and reinject them. This process needs
special care when StoreEOI is activated. The ESB load and store
operations should be correctly ordered because a XIVE_ESB_STORE_EOI
operation could leave the source enabled if it has not completed
before the loads.

For those cases, we enforce Load-after-Store ordering with a special
load operation offset. To avoid performance impact, this ordering is
only enforced when really needed, that is when interrupt sources are
temporarily disabled with the XIVE_ESB_SET_PQ_10 load. It should not
be needed for other loads.
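
The rule can be summarised with a small standalone sketch. The two ESB
offsets are the ones used in this patch; the flag value and the helper
name are placeholders for illustration and do not claim to match the
kernel definitions.

```c
#include <stdint.h>

#define XIVE_ESB_SET_PQ_10 0xe00 /* Load */
#define XIVE_ESB_LD_ST_MO  0x40  /* Load-after-store ordering */

/* Placeholder flag value, assumed for this sketch only. */
#define MODEL_FLAG_STORE_EOI 0x01

/*
 * Compute the ESB load offset: only the PQ=10 load on a source using
 * StoreEOI gets the ordering offset; every other load is unchanged.
 */
static uint32_t model_esb_load_offset(uint32_t offset, uint32_t flags)
{
	if (offset == XIVE_ESB_SET_PQ_10 && (flags & MODEL_FLAG_STORE_EOI))
		offset |= XIVE_ESB_LD_ST_MO;

	return offset;
}
```

With these values, a PQ=10 load on a StoreEOI source is issued at
offset 0xe40, while the same load on a LoadEOI source stays at 0xe00.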

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive-regs.h| 8 
 arch/powerpc/kvm/book3s_xive_native.c   | 6 ++
 arch/powerpc/kvm/book3s_xive_template.c | 3 +++
 arch/powerpc/sysdev/xive/common.c   | 3 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 5 +
 5 files changed, 25 insertions(+)

diff --git a/arch/powerpc/include/asm/xive-regs.h b/arch/powerpc/include/asm/xive-regs.h
index f2dfcd50a2d3..b1996fbae59a 100644
--- a/arch/powerpc/include/asm/xive-regs.h
+++ b/arch/powerpc/include/asm/xive-regs.h
@@ -37,6 +37,14 @@
 #define XIVE_ESB_SET_PQ_10 0xe00 /* Load */
 #define XIVE_ESB_SET_PQ_11 0xf00 /* Load */
 
+/*
+ * Load-after-store ordering
+ *
+ * Adding this offset to the load address will enforce
+ * load-after-store ordering. This is required to use StoreEOI.
+ */
+#define XIVE_ESB_LD_ST_MO  0x40 /* Load-after-store ordering */
+
 #define XIVE_ESB_VAL_P 0x2
 #define XIVE_ESB_VAL_Q 0x1
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index d83adb1e1490..c80b6a447efd 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -31,6 +31,12 @@ static u8 xive_vm_esb_load(struct xive_irq_data *xd, u32 offset)
 {
u64 val;
 
+   /*
+* The KVM XIVE native device does not use the XIVE_ESB_SET_PQ_10
+* load operation, so there is no need to enforce load-after-store
+* ordering.
+*/
+
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
 
diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
index a8a900ace1e6..4ad3c0279458 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -58,6 +58,9 @@ static u8 GLUE(X_PFX,esb_load)(struct xive_irq_data *xd, u32 offset)
 {
u64 val;
 
+   if (offset == XIVE_ESB_SET_PQ_10 && xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
+   offset |= XIVE_ESB_LD_ST_MO;
+
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
 
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index f5fadbd2533a..0dc421bb494f 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -202,6 +202,9 @@ static notrace u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
 {
u64 val;
 
+   if (offset == XIVE_ESB_SET_PQ_10 && xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
+   offset |= XIVE_ESB_LD_ST_MO;
+
/* Handle HW errata */
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index e11017897eb0..abe132ff2346 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2911,6 +2911,11 @@ kvm_cede_exit:
beq 4f
li  r0, 0
stb r0, VCPU_CEDED(r9)
+   /*
+* The escalation interrupts are special as we don't EOI them.
+* There is no need to use the load-after-store ordering offset
+* to set PQ to 10 as we won't use StoreEOI.
+*/
li