[PATCH] lkdtm: do not depend on CONFIG_BLOCK

2018-11-08 Thread Christophe Leroy
Most parts of lkdtm don't require CONFIG_BLOCK.

This patch limits dependency to CONFIG_BLOCK in order to give embedded
platforms which don't select CONFIG_BLOCK the opportunity to use LKDTM.

Fixes: fddd9cf82c9f ("make LKDTM depend on BLOCK")
Signed-off-by: Christophe Leroy 
---
 drivers/misc/lkdtm/core.c | 7 ++-
 lib/Kconfig.debug | 1 -
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..bc76756b7eda 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -40,9 +40,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
+#ifdef CONFIG_BLOCK
+#include 
+#endif
+
 #ifdef CONFIG_IDE
 #include 
 #endif
@@ -101,7 +104,9 @@ static struct crashpoint crashpoints[] = {
CRASHPOINT("FS_DEVRW",   "ll_rw_block"),
CRASHPOINT("MEM_SWAPOUT","shrink_inactive_list"),
CRASHPOINT("TIMERADD",   "hrtimer_start"),
+# ifdef CONFIG_BLOCK
CRASHPOINT("SCSI_DISPATCH_CMD",  "scsi_dispatch_cmd"),
+# endif
 # ifdef CONFIG_IDE
CRASHPOINT("IDE_CORE_CP","generic_ide_ioctl"),
 # endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1af29b8224fd..0dd65b4b2ad2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1685,7 +1685,6 @@ if RUNTIME_TESTING_MENU
 config LKDTM
tristate "Linux Kernel Dump Test Tool Module"
depends on DEBUG_FS
-   depends on BLOCK
help
This module enables testing of the different dumping mechanisms by
inducing system failures at predefined crash points.
-- 
2.13.3



Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20

2018-11-08 Thread Frank Rowand
On 11/8/18 10:56 PM, Frank Rowand wrote:
> Hi Rob,
> 
> Please pull the changes to add the overlay validation checks.
> 
> This is the v7 version of the patch series.
> 
> -Frank
> 
> 
> The following changes since commit 651022382c7f8da46cb4872a545ee1da6d097d2a:
> 
>   Linux 4.20-rc1 (2018-11-04 15:37:52 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/frowand/linux.git 
> tags/kfree_validate_v7-for-4.20
> 
> for you to fetch changes up to eeb07c573ec307c53fe2f6ac6d8d11c261f64006:
> 
>   of: unittest: initialize args before calling of_*parse_*() (2018-11-08 
> 22:12:37 -0800)
> 
> 
> Add checks to (1) overlay apply process and (2) memory freeing
> triggered by overlay release.  The checks are intended to detect
> possible memory leaks and invalid overlays.
> 
> The checks revealed bugs in existing code.  Fixed the bugs.
> 
> While fixing bugs, noted other issues, which are fixed in
> separate patches.
> 
> 
> Frank Rowand (17):
>   of: overlay: add tests to validate kfrees from overlay removal
>   of: overlay: add missing of_node_put() after add new node to changeset
>   of: overlay: add missing of_node_get() in __of_attach_node_sysfs
>   powerpc/pseries: add of_node_put() in dlpar_detach_node()
>   of: overlay: use prop add changeset entry for property in new nodes
>   of: overlay: do not duplicate properties from overlay for new nodes
>   of: overlay: reorder fields in struct fragment
>   of: overlay: validate overlay properties #address-cells and #size-cells
>   of: overlay: make all pr_debug() and pr_err() messages unique
>   of: overlay: test case of two fragments adding same node
>   of: overlay: check prevents multiple fragments add or delete same node
>   of: overlay: check prevents multiple fragments touching same property
>   of: unittest: remove unused of_unittest_apply_overlay() argument
>   of: overlay: set node fields from properties when add new overlay node
>   of: unittest: allow base devicetree to have symbol metadata
>   of: unittest: find overlays[] entry by name instead of index
>   of: unittest: initialize args before calling of_*parse_*()
> 
>  arch/powerpc/platforms/pseries/dlpar.c |   2 +
>  drivers/of/dynamic.c   |  59 -
>  drivers/of/kobj.c  |   4 +-
>  drivers/of/overlay.c   | 292 
> -
>  drivers/of/unittest-data/Makefile  |   2 +
>  .../of/unittest-data/overlay_bad_add_dup_node.dts  |  28 ++
>  .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 ++
>  drivers/of/unittest-data/overlay_base.dts  |   1 +
>  drivers/of/unittest.c  |  96 +--
>  include/linux/of.h |  21 +-
>  10 files changed, 432 insertions(+), 97 deletions(-)
>  create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts
>  create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts
> 



[GIT PULL] of: overlay: validation checks, subsequent fixes for v20

2018-11-08 Thread Frank Rowand
Hi Rob,

Please pull the changes to add the overlay validation checks.

This is the v7 version of the patch series.

-Frank


The following changes since commit 651022382c7f8da46cb4872a545ee1da6d097d2a:

  Linux 4.20-rc1 (2018-11-04 15:37:52 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frowand/linux.git 
tags/kfree_validate_v7-for-4.20

for you to fetch changes up to eeb07c573ec307c53fe2f6ac6d8d11c261f64006:

  of: unittest: initialize args before calling of_*parse_*() (2018-11-08 
22:12:37 -0800)


Add checks to (1) overlay apply process and (2) memory freeing
triggered by overlay release.  The checks are intended to detect
possible memory leaks and invalid overlays.

The checks revealed bugs in existing code.  Fixed the bugs.

While fixing bugs, noted other issues, which are fixed in
separate patches.


Frank Rowand (17):
  of: overlay: add tests to validate kfrees from overlay removal
  of: overlay: add missing of_node_put() after add new node to changeset
  of: overlay: add missing of_node_get() in __of_attach_node_sysfs
  powerpc/pseries: add of_node_put() in dlpar_detach_node()
  of: overlay: use prop add changeset entry for property in new nodes
  of: overlay: do not duplicate properties from overlay for new nodes
  of: overlay: reorder fields in struct fragment
  of: overlay: validate overlay properties #address-cells and #size-cells
  of: overlay: make all pr_debug() and pr_err() messages unique
  of: overlay: test case of two fragments adding same node
  of: overlay: check prevents multiple fragments add or delete same node
  of: overlay: check prevents multiple fragments touching same property
  of: unittest: remove unused of_unittest_apply_overlay() argument
  of: overlay: set node fields from properties when add new overlay node
  of: unittest: allow base devicetree to have symbol metadata
  of: unittest: find overlays[] entry by name instead of index
  of: unittest: initialize args before calling of_*parse_*()

 arch/powerpc/platforms/pseries/dlpar.c |   2 +
 drivers/of/dynamic.c   |  59 -
 drivers/of/kobj.c  |   4 +-
 drivers/of/overlay.c   | 292 -
 drivers/of/unittest-data/Makefile  |   2 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  |  28 ++
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 ++
 drivers/of/unittest-data/overlay_base.dts  |   1 +
 drivers/of/unittest.c  |  96 +--
 include/linux/of.h |  21 +-
 10 files changed, 432 insertions(+), 97 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts


[PATCH v7 17/17] of: unittest: initialize args before calling of_*parse_*()

2018-11-08 Thread frowand . list
From: Frank Rowand 

Callers of of_irq_parse_one() blindly use the pointer args.np
without checking whether of_irq_parse_one() had an error and
thus did not set the value of args.np.  Initialize args to
zero so that using the format "%pOF" to show the value of
args.np will show "(null)" when of_irq_parse_one() has an
error.  This prevents the dereference of a random value.

Make the same fix for callers of of_parse_phandle_with_args()
and of_parse_phandle_with_args_map().

Reported-by: Guenter Roeck 
Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index fe01c5869b0f..9a10a48eb6a1 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -379,6 +379,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
for (i = 0; i < 8; i++) {
bool passed = true;
 
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args(np, "phandle-list",
"#phandle-cells", i, );
 
@@ -432,6 +433,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
}
 
/* Check for missing list property */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args(np, "phandle-list-missing",
"#phandle-cells", 0, );
unittest(rc == -ENOENT, "expected:%i got:%i\n", -ENOENT, rc);
@@ -440,6 +442,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
unittest(rc == -ENOENT, "expected:%i got:%i\n", -ENOENT, rc);
 
/* Check for missing cells property */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args(np, "phandle-list",
"#phandle-cells-missing", 0, );
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -448,6 +451,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
/* Check for bad phandle in list */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args(np, "phandle-list-bad-phandle",
"#phandle-cells", 0, );
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -456,6 +460,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
/* Check for incorrectly formed argument list */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args(np, "phandle-list-bad-args",
"#phandle-cells", 1, );
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -506,6 +511,7 @@ static void __init 
of_unittest_parse_phandle_with_args_map(void)
for (i = 0; i < 8; i++) {
bool passed = true;
 
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args_map(np, "phandle-list",
"phandle", i, );
 
@@ -563,21 +569,25 @@ static void __init 
of_unittest_parse_phandle_with_args_map(void)
}
 
/* Check for missing list property */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args_map(np, "phandle-list-missing",
"phandle", 0, );
unittest(rc == -ENOENT, "expected:%i got:%i\n", -ENOENT, rc);
 
/* Check for missing cells,map,mask property */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args_map(np, "phandle-list",
"phandle-missing", 0, );
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
/* Check for bad phandle in list */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args_map(np, "phandle-list-bad-phandle",
"phandle", 0, );
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
/* Check for incorrectly formed argument list */
+   memset(, 0, sizeof(args));
rc = of_parse_phandle_with_args_map(np, "phandle-list-bad-args",
"phandle", 1, );
unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -787,7 +797,7 @@ static void __init of_unittest_parse_interrupts(void)
for (i = 0; i < 4; i++) {
bool passed = true;
 
-   args.args_count = 0;
+   memset(, 0, sizeof(args));
rc = of_irq_parse_one(np, i, );
 
passed &= !rc;
@@ -808,7 +818,7 @@ static void __init of_unittest_parse_interrupts(void)
for (i = 0; i < 4; i++) {
bool passed = true;
 
-   args.args_count = 0;
+   memset(, 0, sizeof(args));

[PATCH v7 16/17] of: unittest: find overlays[] entry by name instead of index

2018-11-08 Thread frowand . list
From: Frank Rowand 

One accessor of overlays[] was using a hard coded index value to
find the correct array entry instead of searching for the entry
containing the correct name.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index d625a91a7f60..fe01c5869b0f 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -2192,7 +2192,7 @@ struct overlay_info {
 OVERLAY_INFO_EXTERN(overlay_bad_phandle);
 OVERLAY_INFO_EXTERN(overlay_bad_symbol);
 
-/* order of entries is hard-coded into users of overlays[] */
+/* entries found by name */
 static struct overlay_info overlays[] = {
OVERLAY_INFO(overlay_base, -),
OVERLAY_INFO(overlay, 0),
@@ -2215,7 +2215,8 @@ struct overlay_info {
OVERLAY_INFO(overlay_bad_add_dup_prop, -EINVAL),
OVERLAY_INFO(overlay_bad_phandle, -EINVAL),
OVERLAY_INFO(overlay_bad_symbol, -EINVAL),
-   {}
+   /* end marker */
+   {.dtb_begin = NULL, .dtb_end = NULL, .expected_result = 0, .name = NULL}
 };
 
 static struct device_node *overlay_base_root;
@@ -2245,6 +2246,19 @@ void __init unittest_unflatten_overlay_base(void)
u32 data_size;
void *new_fdt;
u32 size;
+   int found = 0;
+   const char *overlay_name = "overlay_base";
+
+   for (info = overlays; info && info->name; info++) {
+   if (!strcmp(overlay_name, info->name)) {
+   found = 1;
+   break;
+   }
+   }
+   if (!found) {
+   pr_err("no overlay data for %s\n", overlay_name);
+   return;
+   }
 
info = [0];
 
@@ -2292,11 +2306,10 @@ static int __init overlay_data_apply(const char 
*overlay_name, int *overlay_id)
 {
struct overlay_info *info;
int found = 0;
-   int k;
int ret;
u32 size;
 
-   for (k = 0, info = overlays; info && info->name; info++, k++) {
+   for (info = overlays; info && info->name; info++) {
if (!strcmp(overlay_name, info->name)) {
found = 1;
break;
-- 
Frank Rowand 



[PATCH v7 15/17] of: unittest: allow base devicetree to have symbol metadata

2018-11-08 Thread frowand . list
From: Frank Rowand 

The overlay metadata nodes in the FDT created from testcases.dts
are not handled properly.

The __fixups__ and __local_fixups__ node were added to the live
devicetree, but should not be.

Only the first property in the /__symbols__ node was added to the
live devicetree if the live devicetree already contained a
/__symbols node.  All of the node's properties must be added.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 43 +++
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 14838b21ec6a..d625a91a7f60 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1071,20 +1071,44 @@ static void __init of_unittest_platform_populate(void)
  * of np into dup node (present in live tree) and
  * updates parent of children of np to dup.
  *
- * @np:node already present in live tree
+ * @np:node whose properties are being added to the live tree
  * @dup:   node present in live tree to be updated
  */
 static void update_node_properties(struct device_node *np,
struct device_node *dup)
 {
struct property *prop;
+   struct property *save_next;
struct device_node *child;
-
-   for_each_property_of_node(np, prop)
-   of_add_property(dup, prop);
+   int ret;
 
for_each_child_of_node(np, child)
child->parent = dup;
+
+   /*
+* "unittest internal error: unable to add testdata property"
+*
+*If this message reports a property in node '/__symbols__' then
+*the respective unittest overlay contains a label that has the
+*same name as a label in the live devicetree.  The label will
+*be in the live devicetree only if the devicetree source was
+*compiled with the '-@' option.  If you encounter this error,
+*please consider renaming __all__ of the labels in the unittest
+*overlay dts files with an odd prefix that is unlikely to be
+*used in a real devicetree.
+*/
+
+   /*
+* open code for_each_property_of_node() because of_add_property()
+* sets prop->next to NULL
+*/
+   for (prop = np->properties; prop != NULL; prop = save_next) {
+   save_next = prop->next;
+   ret = of_add_property(dup, prop);
+   if (ret)
+   pr_err("unittest internal error: unable to add testdata 
property %pOF/%s",
+  np, prop->name);
+   }
 }
 
 /**
@@ -1093,18 +1117,23 @@ static void update_node_properties(struct device_node 
*np,
  *
  * @np:Node to attach to live tree
  */
-static int attach_node_and_children(struct device_node *np)
+static void attach_node_and_children(struct device_node *np)
 {
struct device_node *next, *dup, *child;
unsigned long flags;
const char *full_name;
 
full_name = kasprintf(GFP_KERNEL, "%pOF", np);
+
+   if (!strcmp(full_name, "/__local_fixups__") ||
+   !strcmp(full_name, "/__fixups__"))
+   return;
+
dup = of_find_node_by_path(full_name);
kfree(full_name);
if (dup) {
update_node_properties(np, dup);
-   return 0;
+   return;
}
 
child = np->child;
@@ -1125,8 +1154,6 @@ static int attach_node_and_children(struct device_node 
*np)
attach_node_and_children(child);
child = next;
}
-
-   return 0;
 }
 
 /**
-- 
Frank Rowand 



[PATCH v7 14/17] of: overlay: set node fields from properties when add new overlay node

2018-11-08 Thread frowand . list
From: Frank Rowand 

Overlay nodes added by add_changeset_node() do not have the node
fields name, phandle, and type set.

The node passed to __of_attach_node() when the add node changeset
entry is processed does not contain any properties.  The node's
properties are located in add property changeset entries that will
be processed after the add node changeset is applied.

Set the node's fields in the node contained in the add node
changeset entry and do not set them to incorrect values in
add_changeset_node().

A visible symptom that is fixed by this patch is the names of nodes
added by overlays that have an entry in /sys/bus/platform/drivers/*/
will contain the unit-address but the node-name will be ,  for
example, "fc4ab000.".  After applying the patch the name, in
this example, for node restart@fc4ab000 is "fc4ab000.restart".

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/dynamic.c | 27 ++-
 drivers/of/overlay.c | 29 -
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 146681540487..b4e5b90cb314 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -205,15 +205,24 @@ static void __of_attach_node(struct device_node *np)
const __be32 *phandle;
int sz;
 
-   np->name = __of_get_property(np, "name", NULL) ? : "";
-   np->type = __of_get_property(np, "device_type", NULL) ? : "";
-
-   phandle = __of_get_property(np, "phandle", );
-   if (!phandle)
-   phandle = __of_get_property(np, "linux,phandle", );
-   if (IS_ENABLED(CONFIG_PPC_PSERIES) && !phandle)
-   phandle = __of_get_property(np, "ibm,phandle", );
-   np->phandle = (phandle && (sz >= 4)) ? be32_to_cpup(phandle) : 0;
+   if (!of_node_check_flag(np, OF_OVERLAY)) {
+   np->name = __of_get_property(np, "name", NULL);
+   np->type = __of_get_property(np, "device_type", NULL);
+   if (!np->name)
+   np->name = "";
+   if (!np->type)
+   np->type = "";
+
+   phandle = __of_get_property(np, "phandle", );
+   if (!phandle)
+   phandle = __of_get_property(np, "linux,phandle", );
+   if (IS_ENABLED(CONFIG_PPC_PSERIES) && !phandle)
+   phandle = __of_get_property(np, "ibm,phandle", );
+   if (phandle && (sz >= 4))
+   np->phandle = be32_to_cpup(phandle);
+   else
+   np->phandle = 0;
+   }
 
np->child = NULL;
np->sibling = np->parent->child;
diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 184cc2c4a931..2b5ac43a5690 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -307,10 +307,11 @@ static int add_changeset_property(struct 
overlay_changeset *ovcs,
int ret = 0;
bool check_for_non_overlay_node = false;
 
-   if (!of_prop_cmp(overlay_prop->name, "name") ||
-   !of_prop_cmp(overlay_prop->name, "phandle") ||
-   !of_prop_cmp(overlay_prop->name, "linux,phandle"))
-   return 0;
+   if (target->in_livetree)
+   if (!of_prop_cmp(overlay_prop->name, "name") ||
+   !of_prop_cmp(overlay_prop->name, "phandle") ||
+   !of_prop_cmp(overlay_prop->name, "linux,phandle"))
+   return 0;
 
if (target->in_livetree)
prop = of_find_property(target->np, overlay_prop->name, NULL);
@@ -330,6 +331,10 @@ static int add_changeset_property(struct overlay_changeset 
*ovcs,
 
if (!prop) {
check_for_non_overlay_node = true;
+   if (!target->in_livetree) {
+   new_prop->next = target->np->deadprops;
+   target->np->deadprops = new_prop;
+   }
ret = of_changeset_add_property(>cset, target->np,
new_prop);
} else if (!of_prop_cmp(prop->name, "#address-cells")) {
@@ -400,9 +405,10 @@ static int add_changeset_node(struct overlay_changeset 
*ovcs,
struct target *target, struct device_node *node)
 {
const char *node_kbasename;
+   const __be32 *phandle;
struct device_node *tchild;
struct target target_child;
-   int ret = 0;
+   int ret = 0, size;
 
node_kbasename = kbasename(node->full_name);
 
@@ -416,6 +422,19 @@ static int add_changeset_node(struct overlay_changeset 
*ovcs,
return -ENOMEM;
 
tchild->parent = target->np;
+   tchild->name = __of_get_property(node, "name", NULL);
+   tchild->type = __of_get_property(node, "device_type", NULL);
+
+   if (!tchild->name)
+   tchild->name = "";
+   if (!tchild->type)
+   tchild->type = "";

[PATCH v7 13/17] of: unittest: remove unused of_unittest_apply_overlay() argument

2018-11-08 Thread frowand . list
From: Frank Rowand 

Argument unittest_nr is not used in of_unittest_apply_overlay(),
remove it.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index f0139d1e8b63..14838b21ec6a 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1433,8 +1433,7 @@ static void of_unittest_destroy_tracked_overlays(void)
} while (defers > 0);
 }
 
-static int __init of_unittest_apply_overlay(int overlay_nr, int unittest_nr,
-   int *overlay_id)
+static int __init of_unittest_apply_overlay(int overlay_nr, int *overlay_id)
 {
const char *overlay_name;
 
@@ -1467,7 +1466,7 @@ static int __init of_unittest_apply_overlay_check(int 
overlay_nr,
}
 
ovcs_id = 0;
-   ret = of_unittest_apply_overlay(overlay_nr, unittest_nr, _id);
+   ret = of_unittest_apply_overlay(overlay_nr, _id);
if (ret != 0) {
/* of_unittest_apply_overlay already called unittest() */
return ret;
@@ -1503,7 +1502,7 @@ static int __init 
of_unittest_apply_revert_overlay_check(int overlay_nr,
 
/* apply the overlay */
ovcs_id = 0;
-   ret = of_unittest_apply_overlay(overlay_nr, unittest_nr, _id);
+   ret = of_unittest_apply_overlay(overlay_nr, _id);
if (ret != 0) {
/* of_unittest_apply_overlay already called unittest() */
return ret;
-- 
Frank Rowand 



[PATCH v7 12/17] of: overlay: check prevents multiple fragments touching same property

2018-11-08 Thread frowand . list
From: Frank Rowand 

Add test case of two fragments updating the same property.  After
adding the test case, the system hangs at end of boot, after
after slub stack dumps from kfree() in crypto modprobe code.

Multiple overlay fragments adding, modifying, or deleting the same
property is not supported.  Add check to detect the attempt and fail
the overlay apply.

Before this patch, the first fragment error would terminate
processing.  Allow fragment checking to proceed and report all
of the fragment errors before terminating the overlay apply. This
is not a hot path, thus not a performance issue (the error is not
transient and requires fixing the overlay before attempting to
apply it again).

After applying this patch, the devicetree unittest messages will
include:

   OF: overlay: ERROR: multiple fragments add, update, and/or delete property 
/testcase-data-2/substation@100/motor-1/rpm_avail

   ...

   ### dt-test ### end of unittest - 212 passed, 0 failed

The check to detect two fragments updating the same property is
folded into the patch that created the test case to maintain
bisectability.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c   | 118 ++---
 drivers/of/unittest-data/Makefile  |   1 +
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 +
 drivers/of/unittest-data/overlay_base.dts  |   1 +
 drivers/of/unittest.c  |   5 +
 5 files changed, 112 insertions(+), 37 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 8af8115bd36e..184cc2c4a931 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -508,52 +508,96 @@ static int build_changeset_symbols_node(struct 
overlay_changeset *ovcs,
return 0;
 }
 
+static int find_dup_cset_node_entry(struct overlay_changeset *ovcs,
+   struct of_changeset_entry *ce_1)
+{
+   struct of_changeset_entry *ce_2;
+   char *fn_1, *fn_2;
+   int node_path_match;
+
+   if (ce_1->action != OF_RECONFIG_ATTACH_NODE &&
+   ce_1->action != OF_RECONFIG_DETACH_NODE)
+   return 0;
+
+   ce_2 = ce_1;
+   list_for_each_entry_continue(ce_2, >cset.entries, node) {
+   if ((ce_2->action != OF_RECONFIG_ATTACH_NODE &&
+ce_2->action != OF_RECONFIG_DETACH_NODE) ||
+   of_node_cmp(ce_1->np->full_name, ce_2->np->full_name))
+   continue;
+
+   fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+   fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+   node_path_match = !strcmp(fn_1, fn_2);
+   kfree(fn_1);
+   kfree(fn_2);
+   if (node_path_match) {
+   pr_err("ERROR: multiple fragments add and/or delete 
node %pOF\n",
+  ce_1->np);
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
+static int find_dup_cset_prop(struct overlay_changeset *ovcs,
+   struct of_changeset_entry *ce_1)
+{
+   struct of_changeset_entry *ce_2;
+   char *fn_1, *fn_2;
+   int node_path_match;
+
+   if (ce_1->action != OF_RECONFIG_ADD_PROPERTY &&
+   ce_1->action != OF_RECONFIG_REMOVE_PROPERTY &&
+   ce_1->action != OF_RECONFIG_UPDATE_PROPERTY)
+   return 0;
+
+   ce_2 = ce_1;
+   list_for_each_entry_continue(ce_2, >cset.entries, node) {
+   if ((ce_2->action != OF_RECONFIG_ADD_PROPERTY &&
+ce_2->action != OF_RECONFIG_REMOVE_PROPERTY &&
+ce_2->action != OF_RECONFIG_UPDATE_PROPERTY) ||
+   of_node_cmp(ce_1->np->full_name, ce_2->np->full_name))
+   continue;
+
+   fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+   fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+   node_path_match = !strcmp(fn_1, fn_2);
+   kfree(fn_1);
+   kfree(fn_2);
+   if (node_path_match &&
+   !of_prop_cmp(ce_1->prop->name, ce_2->prop->name)) {
+   pr_err("ERROR: multiple fragments add, update, and/or 
delete property %pOF/%s\n",
+  ce_1->np, ce_1->prop->name);
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
 /**
- * check_changeset_dup_add_node() - changeset validation: duplicate add node
+ * changeset_dup_entry_check() - check for duplicate entries
  * @ovcs:  Overlay changeset
  *
- * Check changeset @ovcs->cset for multiple add node entries for the same
- * node.
+ * Check changeset @ovcs->cset for multiple {add or delete} node entries for
+ * the same node or duplicate {add, delete, or update} properties entries
+ * for the same property.
  *
- * Returns 0 on success, -ENOMEM if 

[PATCH v7 11/17] of: overlay: check prevents multiple fragments add or delete same node

2018-11-08 Thread frowand . list
From: Frank Rowand 

Multiple overlay fragments adding or deleting the same node is not
supported.  Replace code comment of such, with check to detect the
attempt and fail the overlay apply.

Devicetree unittest where multiple fragments added the same node was
added in the previous patch in the series.  After applying this patch
the unittest messages will no longer include:

   Duplicate name in motor-1, renamed to "controller#1"
   OF: overlay: of_overlay_apply() err=0
   ### dt-test ### of_overlay_fdt_apply() expected -22, ret=0, 
overlay_bad_add_dup_node
   ### dt-test ### FAIL of_unittest_overlay_high_level():2419 Adding overlay 
'overlay_bad_add_dup_node' failed

   ...

   ### dt-test ### end of unittest - 210 passed, 1 failed

but will instead include:

   OF: overlay: ERROR: multiple overlay fragments add and/or delete node 
/testcase-data-2/substation@100/motor-1/controller

   ...

   ### dt-test ### end of unittest - 211 passed, 0 failed

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---

checkpatch errors "line over 80 characters" and "Too many leading tabs"
are ok, they will be fixed later in this series

 drivers/of/overlay.c | 58 
 1 file changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index c8f88b0836a3..8af8115bd36e 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -392,14 +392,6 @@ static int add_changeset_property(struct overlay_changeset 
*ovcs,
  *   a live devicetree created from Open Firmware.
  *
  * NOTE_2: Multiple mods of created nodes not supported.
- *   If more than one fragment contains a node that does not already exist
- *   in the live tree, then for each fragment of_changeset_attach_node()
- *   will add a changeset entry to add the node.  When the changeset is
- *   applied, __of_attach_node() will attach the node twice (once for
- *   each fragment).  At this point the device tree will be corrupted.
- *
- *   TODO: add integrity check to ensure that multiple fragments do not
- * create the same node.
  *
  * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
  * invalid @overlay.
@@ -517,6 +509,54 @@ static int build_changeset_symbols_node(struct 
overlay_changeset *ovcs,
 }
 
 /**
+ * check_changeset_dup_add_node() - changeset validation: duplicate add node
+ * @ovcs:  Overlay changeset
+ *
+ * Check changeset @ovcs->cset for multiple add node entries for the same
+ * node.
+ *
+ * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
+ * invalid overlay in @ovcs->fragments[].
+ */
+static int check_changeset_dup_add_node(struct overlay_changeset *ovcs)
+{
+   struct of_changeset_entry *ce_1, *ce_2;
+   char *fn_1, *fn_2;
+   int name_match;
+
+   list_for_each_entry(ce_1, >cset.entries, node) {
+
+   if (ce_1->action == OF_RECONFIG_ATTACH_NODE ||
+   ce_1->action == OF_RECONFIG_DETACH_NODE) {
+
+   ce_2 = ce_1;
+   list_for_each_entry_continue(ce_2, >cset.entries, 
node) {
+   if (ce_2->action == OF_RECONFIG_ATTACH_NODE ||
+   ce_2->action == OF_RECONFIG_DETACH_NODE) {
+   /* inexpensive name compare */
+   if (!of_node_cmp(ce_1->np->full_name,
+   ce_2->np->full_name)) {
+   /* expensive full path name 
compare */
+   fn_1 = kasprintf(GFP_KERNEL, 
"%pOF", ce_1->np);
+   fn_2 = kasprintf(GFP_KERNEL, 
"%pOF", ce_2->np);
+   name_match = !strcmp(fn_1, 
fn_2);
+   kfree(fn_1);
+   kfree(fn_2);
+   if (name_match) {
+   pr_err("ERROR: multiple 
overlay fragments add and/or delete node %pOF\n",
+  ce_1->np);
+   return -EINVAL;
+   }
+   }
+   }
+   }
+   }
+   }
+
+   return 0;
+}
+
+/**
  * build_changeset() - populate overlay changeset in @ovcs from 
@ovcs->fragments
  * @ovcs:  Overlay changeset
  *
@@ -571,7 +611,7 @@ static int build_changeset(struct overlay_changeset *ovcs)
}
}
 
-   return 0;
+   return check_changeset_dup_add_node(ovcs);
 }
 
 /*
-- 
Frank Rowand 



[PATCH v7 10/17] of: overlay: test case of two fragments adding same node

2018-11-08 Thread frowand . list
From: Frank Rowand 

Multiple overlay fragments adding or deleting the same node is not
supported.  An attempt to do so results in an incorrect devicetree.
The node name will be munged for the second add.

After adding this patch, the unittest messages will show:

   Duplicate name in motor-1, renamed to "controller#1"
   OF: overlay: of_overlay_apply() err=0
   ### dt-test ### of_overlay_fdt_apply() expected -22, ret=0, 
overlay_bad_add_dup_node
   ### dt-test ### FAIL of_unittest_overlay_high_level():2419 Adding overlay 
'overlay_bad_add_dup_node' failed

   ...

   ### dt-test ### end of unittest - 210 passed, 1 failed

The incorrect (munged) node name "controller#1" can be seen in the
/proc filesystem:

   $ pwd
   /proc/device-tree/testcase-data-2/substation@100/motor-1
   $ ls
   compatiblecontrollercontroller#1  name  phandle   spin
   $ ls controller
   power_bus
   $ ls controller#1
   power_bus_emergency

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/unittest-data/Makefile  |  1 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  | 28 ++
 drivers/of/unittest.c  |  5 
 3 files changed, 34 insertions(+)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts

diff --git a/drivers/of/unittest-data/Makefile 
b/drivers/of/unittest-data/Makefile
index 013d85e694c6..166dbdbfd1c5 100644
--- a/drivers/of/unittest-data/Makefile
+++ b/drivers/of/unittest-data/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_OF_OVERLAY) += overlay.dtb.o \
overlay_12.dtb.o \
overlay_13.dtb.o \
overlay_15.dtb.o \
+   overlay_bad_add_dup_node.dtb.o \
overlay_bad_phandle.dtb.o \
overlay_bad_symbol.dtb.o \
overlay_base.dtb.o
diff --git a/drivers/of/unittest-data/overlay_bad_add_dup_node.dts 
b/drivers/of/unittest-data/overlay_bad_add_dup_node.dts
new file mode 100644
index ..145dfc3b1024
--- /dev/null
+++ b/drivers/of/unittest-data/overlay_bad_add_dup_node.dts
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/dts-v1/;
+/plugin/;
+
+/*
+ * _1/motor-1 and _ctrl_1 are the same node:
+ *   /testcase-data-2/substation@100/motor-1
+ *
+ * Thus the new node "controller" in each fragment will
+ * result in an attempt to add the same node twice.
+ * This will result in an error and the overlay apply
+ * will fail.
+ */
+
+_1 {
+
+   motor-1 {
+   controller {
+   power_bus = < 0x1 0x2 >;
+   };
+   };
+};
+
+_ctrl_1 {
+   controller {
+   power_bus_emergency = < 0x101 0x102 >;
+   };
+};
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 49ae2aa744d6..f82edf829f43 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -2161,6 +2161,7 @@ struct overlay_info {
 OVERLAY_INFO_EXTERN(overlay_12);
 OVERLAY_INFO_EXTERN(overlay_13);
 OVERLAY_INFO_EXTERN(overlay_15);
+OVERLAY_INFO_EXTERN(overlay_bad_add_dup_node);
 OVERLAY_INFO_EXTERN(overlay_bad_phandle);
 OVERLAY_INFO_EXTERN(overlay_bad_symbol);
 
@@ -2183,6 +2184,7 @@ struct overlay_info {
OVERLAY_INFO(overlay_12, 0),
OVERLAY_INFO(overlay_13, 0),
OVERLAY_INFO(overlay_15, 0),
+   OVERLAY_INFO(overlay_bad_add_dup_node, -EINVAL),
OVERLAY_INFO(overlay_bad_phandle, -EINVAL),
OVERLAY_INFO(overlay_bad_symbol, -EINVAL),
{}
@@ -2430,6 +2432,9 @@ static __init void of_unittest_overlay_high_level(void)
unittest(overlay_data_apply("overlay", NULL),
 "Adding overlay 'overlay' failed\n");
 
+   unittest(overlay_data_apply("overlay_bad_add_dup_node", NULL),
+"Adding overlay 'overlay_bad_add_dup_node' failed\n");
+
unittest(overlay_data_apply("overlay_bad_phandle", NULL),
 "Adding overlay 'overlay_bad_phandle' failed\n");
 
-- 
Frank Rowand 



[PATCH v7 09/17] of: overlay: make all pr_debug() and pr_err() messages unique

2018-11-08 Thread frowand . list
From: Frank Rowand 

Make overlay.c debug and error messages unique so that they can be
unambiguously found by grep.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 72bf00adb9c8..c8f88b0836a3 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -507,7 +507,7 @@ static int build_changeset_symbols_node(struct 
overlay_changeset *ovcs,
for_each_property_of_node(overlay_symbols_node, prop) {
ret = add_changeset_property(ovcs, target, prop, 1);
if (ret) {
-   pr_debug("Failed to apply prop @%pOF/%s, err=%d\n",
+   pr_debug("Failed to apply symbols prop @%pOF/%s, 
err=%d\n",
 target->np, prop->name, ret);
return ret;
}
@@ -551,7 +551,8 @@ static int build_changeset(struct overlay_changeset *ovcs)
ret = build_changeset_next_level(ovcs, ,
 fragment->overlay);
if (ret) {
-   pr_debug("apply failed '%pOF'\n", fragment->target);
+   pr_debug("fragment apply failed '%pOF'\n",
+fragment->target);
return ret;
}
}
@@ -564,7 +565,8 @@ static int build_changeset(struct overlay_changeset *ovcs)
ret = build_changeset_symbols_node(ovcs, ,
   fragment->overlay);
if (ret) {
-   pr_debug("apply failed '%pOF'\n", fragment->target);
+   pr_debug("symbols fragment apply failed '%pOF'\n",
+fragment->target);
return ret;
}
}
@@ -873,7 +875,7 @@ static int of_overlay_apply(const void *fdt, struct 
device_node *tree,
 
ret = __of_changeset_apply_notify(>cset);
if (ret)
-   pr_err("overlay changeset entry notify error %d\n", ret);
+   pr_err("overlay apply changeset entry notify error %d\n", ret);
/* notify failure is not fatal, continue */
 
list_add_tail(>ovcs_list, _list);
@@ -1132,7 +1134,7 @@ int of_overlay_remove(int *ovcs_id)
 
ret = __of_changeset_revert_notify(>cset);
if (ret)
-   pr_err("overlay changeset entry notify error %d\n", ret);
+   pr_err("overlay remove changeset entry notify error %d\n", ret);
/* notify failure is not fatal, continue */
 
*ovcs_id = 0;
-- 
Frank Rowand 



[PATCH v7 08/17] of: overlay: validate overlay properties #address-cells and #size-cells

2018-11-08 Thread frowand . list
From: Frank Rowand 

If overlay properties #address-cells or #size-cells are already in
the live devicetree for any given node, then the values in the
overlay must match the values in the live tree.

If the properties are already in the live tree then there is no
need to create a changeset entry to add them since they must
have the same value.  This reduces the memory used by the
changeset and eliminates a possible memory leak.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 32 +---
 include/linux/of.h   |  6 ++
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 15be3da34fef..72bf00adb9c8 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -287,7 +287,12 @@ static struct property *dup_and_fixup_symbol_prop(
  * @target may be either in the live devicetree or in a new subtree that
  * is contained in the changeset.
  *
- * Some special properties are not updated (no error returned).
+ * Some special properties are not added or updated (no error returned):
+ * "name", "phandle", "linux,phandle".
+ *
+ * Properties "#address-cells" and "#size-cells" are not updated if they
+ * are already in the live tree, but if present in the live tree, the values
+ * in the overlay must match the values in the live tree.
  *
  * Update of property in symbols node is not allowed.
  *
@@ -300,6 +305,7 @@ static int add_changeset_property(struct overlay_changeset 
*ovcs,
 {
struct property *new_prop = NULL, *prop;
int ret = 0;
+   bool check_for_non_overlay_node = false;
 
if (!of_prop_cmp(overlay_prop->name, "name") ||
!of_prop_cmp(overlay_prop->name, "phandle") ||
@@ -322,12 +328,32 @@ static int add_changeset_property(struct 
overlay_changeset *ovcs,
if (!new_prop)
return -ENOMEM;
 
-   if (!prop)
+   if (!prop) {
+   check_for_non_overlay_node = true;
ret = of_changeset_add_property(>cset, target->np,
new_prop);
-   else
+   } else if (!of_prop_cmp(prop->name, "#address-cells")) {
+   if (!of_prop_val_eq(prop, new_prop)) {
+   pr_err("ERROR: changing value of #address-cells is not 
allowed in %pOF\n",
+  target->np);
+   ret = -EINVAL;
+   }
+   } else if (!of_prop_cmp(prop->name, "#size-cells")) {
+   if (!of_prop_val_eq(prop, new_prop)) {
+   pr_err("ERROR: changing value of #size-cells is not 
allowed in %pOF\n",
+  target->np);
+   ret = -EINVAL;
+   }
+   } else {
+   check_for_non_overlay_node = true;
ret = of_changeset_update_property(>cset, target->np,
   new_prop);
+   }
+
+   if (check_for_non_overlay_node &&
+   !of_node_check_flag(target->np, OF_OVERLAY))
+   pr_err("WARNING: memory leak will occur if overlay removed, 
property: %pOF/%s\n",
+  target->np, new_prop->name);
 
if (ret) {
kfree(new_prop->name);
diff --git a/include/linux/of.h b/include/linux/of.h
index 664cd5573ae2..18ac8921e90c 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -990,6 +990,12 @@ static inline int of_map_rid(struct device_node *np, u32 
rid,
 #define of_node_cmp(s1, s2)strcasecmp((s1), (s2))
 #endif
 
+static inline int of_prop_val_eq(struct property *p1, struct property *p2)
+{
+   return p1->length == p2->length &&
+  !memcmp(p1->value, p2->value, (size_t)p1->length);
+}
+
 #if defined(CONFIG_OF) && defined(CONFIG_NUMA)
 extern int of_node_to_nid(struct device_node *np);
 #else
-- 
Frank Rowand 



[PATCH v7 07/17] of: overlay: reorder fields in struct fragment

2018-11-08 Thread frowand . list
From: Frank Rowand 

Order the fields of struct fragment in the same order as
struct of_overlay_notify_data.  The order in struct fragment is
not significant.  If both structs are ordered the same then when
examining the data in a debugger or dump the human involved does
not have to remember which context they are examining.

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 9808aae4621a..15be3da34fef 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -49,8 +49,8 @@ struct target {
  * @overlay:   pointer to the __overlay__ node
  */
 struct fragment {
-   struct device_node *target;
struct device_node *overlay;
+   struct device_node *target;
 };
 
 /**
-- 
Frank Rowand 



[PATCH v7 06/17] of: overlay: do not duplicate properties from overlay for new nodes

2018-11-08 Thread frowand . list
From: Frank Rowand 

When allocating a new node, add_changeset_node() was duplicating the
properties from the respective node in the overlay instead of
allocating a node with no properties.

When this patch is applied the errors reported by the devictree
unittest from patch "of: overlay: add tests to validate kfrees from
overlay removal" will no longer occur.  These error messages are of
the form:

   "OF: ERROR: ..."

and the unittest results will change from:

   ### dt-test ### end of unittest - 203 passed, 7 failed

to

   ### dt-test ### end of unittest - 210 passed, 0 failed

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 6fd8e6145e10..9808aae4621a 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -393,7 +393,7 @@ static int add_changeset_node(struct overlay_changeset 
*ovcs,
break;
 
if (!tchild) {
-   tchild = __of_node_dup(node, node_kbasename);
+   tchild = __of_node_dup(NULL, node_kbasename);
if (!tchild)
return -ENOMEM;
 
-- 
Frank Rowand 



[PATCH v7 05/17] of: overlay: use prop add changeset entry for property in new nodes

2018-11-08 Thread frowand . list
From: Frank Rowand 

The changeset entry 'update property' was used for new properties in
an overlay instead of 'add property'.

The decision of whether to use 'update property' was based on whether
the property already exists in the subtree where the node is being
spliced into.  At the top level of creating a changeset describing the
overlay, the target node is in the live devicetree, so checking whether
the property exists in the target node returns the correct result.
As soon as the changeset creation algorithm recurses into a new node,
the target is no longer in the live devicetree, but is instead in the
detached overlay tree, thus all properties are incorrectly found to
already exist in the target.

This fix will expose another devicetree bug that will be fixed
in the following patch in the series.

When this patch is applied the errors reported by the devictree
unittest will change, and the unittest results will change from:

   ### dt-test ### end of unittest - 210 passed, 0 failed

to

   ### dt-test ### end of unittest - 203 passed, 7 failed

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 112 ++-
 1 file changed, 74 insertions(+), 38 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 7613f7d680c7..6fd8e6145e10 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -24,6 +24,26 @@
 #include "of_private.h"
 
 /**
+ * struct target - info about current target node as recursing through overlay
+ * @np:node where current level of overlay will be 
applied
+ * @in_livetree:   @np is a node in the live devicetree
+ *
+ * Used in the algorithm to create the portion of a changeset that describes
+ * an overlay fragment, which is a devicetree subtree.  Initially @np is a node
+ * in the live devicetree where the overlay subtree is targeted to be grafted
+ * into.  When recursing to the next level of the overlay subtree, the target
+ * also recurses to the next level of the live devicetree, as long as overlay
+ * subtree node also exists in the live devicetree.  When a node in the overlay
+ * subtree does not exist at the same level in the live devicetree, target->np
+ * points to a newly allocated node, and all subsequent targets in the subtree
+ * will be newly allocated nodes.
+ */
+struct target {
+   struct device_node *np;
+   bool in_livetree;
+};
+
+/**
  * struct fragment - info about fragment nodes in overlay expanded device tree
  * @target:target of the overlay operation
  * @overlay:   pointer to the __overlay__ node
@@ -72,8 +92,7 @@ static int devicetree_corrupt(void)
 }
 
 static int build_changeset_next_level(struct overlay_changeset *ovcs,
-   struct device_node *target_node,
-   const struct device_node *overlay_node);
+   struct target *target, const struct device_node *overlay_node);
 
 /*
  * of_resolve_phandles() finds the largest phandle in the live tree.
@@ -257,14 +276,17 @@ static struct property *dup_and_fixup_symbol_prop(
 /**
  * add_changeset_property() - add @overlay_prop to overlay changeset
  * @ovcs:  overlay changeset
- * @target_node:   where to place @overlay_prop in live tree
+ * @target:where @overlay_prop will be placed
  * @overlay_prop:  property to add or update, from overlay tree
  * @is_symbols_prop:   1 if @overlay_prop is from node "/__symbols__"
  *
- * If @overlay_prop does not already exist in @target_node, add changeset entry
- * to add @overlay_prop in @target_node, else add changeset entry to update
+ * If @overlay_prop does not already exist in live devicetree, add changeset
+ * entry to add @overlay_prop in @target, else add changeset entry to update
  * value of @overlay_prop.
  *
+ * @target may be either in the live devicetree or in a new subtree that
+ * is contained in the changeset.
+ *
  * Some special properties are not updated (no error returned).
  *
  * Update of property in symbols node is not allowed.
@@ -273,20 +295,22 @@ static struct property *dup_and_fixup_symbol_prop(
  * invalid @overlay.
  */
 static int add_changeset_property(struct overlay_changeset *ovcs,
-   struct device_node *target_node,
-   struct property *overlay_prop,
+   struct target *target, struct property *overlay_prop,
bool is_symbols_prop)
 {
struct property *new_prop = NULL, *prop;
int ret = 0;
 
-   prop = of_find_property(target_node, overlay_prop->name, NULL);
-
if (!of_prop_cmp(overlay_prop->name, "name") ||
!of_prop_cmp(overlay_prop->name, "phandle") ||
!of_prop_cmp(overlay_prop->name, "linux,phandle"))
return 0;
 
+   if (target->in_livetree)
+   prop = of_find_property(target->np, overlay_prop->name, NULL);
+   else
+   prop = NULL;
+
if (is_symbols_prop) {
if (prop)

[PATCH v7 04/17] powerpc/pseries: add of_node_put() in dlpar_detach_node()

2018-11-08 Thread frowand . list
From: Frank Rowand 

The previous commit, "of: overlay: add missing of_node_get() in
__of_attach_node_sysfs" added a missing of_node_get() to
__of_attach_node_sysfs().  This results in a refcount imbalance
for nodes attached with dlpar_attach_node().  The calling sequence
from dlpar_attach_node() to __of_attach_node_sysfs() is:

   dlpar_attach_node()
  of_attach_node()
 __of_attach_node_sysfs()

For more detailed description of the node refcount, see
commit 68baf692c435 ("powerpc/pseries: Fix of_node_put() underflow
during DLPAR remove").

Tested-by: Alan Tull 
Acked-by: Michael Ellerman 
Signed-off-by: Frank Rowand 
---
 arch/powerpc/platforms/pseries/dlpar.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index 7625546caefd..17958043e7f7 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -270,6 +270,8 @@ int dlpar_detach_node(struct device_node *dn)
if (rc)
return rc;
 
+   of_node_put(dn);
+
return 0;
 }
 
-- 
Frank Rowand 



[PATCH v7 03/17] of: overlay: add missing of_node_get() in __of_attach_node_sysfs

2018-11-08 Thread frowand . list
From: Frank Rowand 

There is a matching of_node_put() in __of_detach_node_sysfs()

Remove misleading comment from function header comment for
of_detach_node().

This patch may result in memory leaks from code that directly calls
the dynamic node add and delete functions directly instead of
using changesets.

This commit should result in powerpc systems that dynamically
allocate a node, then later deallocate the node to have a
memory leak when the node is deallocated.

The next commit will fix the leak.

Tested-by: Alan Tull 
Acked-by: Michael Ellerman  (powerpc)
Signed-off-by: Frank Rowand 
---
 drivers/of/dynamic.c | 3 ---
 drivers/of/kobj.c| 4 +++-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 12c3f9a15e94..146681540487 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -272,9 +272,6 @@ void __of_detach_node(struct device_node *np)
 
 /**
  * of_detach_node() - "Unplug" a node from the device tree.
- *
- * The caller must hold a reference to the node.  The memory associated with
- * the node is not freed until its refcount goes to zero.
  */
 int of_detach_node(struct device_node *np)
 {
diff --git a/drivers/of/kobj.c b/drivers/of/kobj.c
index 7a0a18980b98..c72eef988041 100644
--- a/drivers/of/kobj.c
+++ b/drivers/of/kobj.c
@@ -133,6 +133,9 @@ int __of_attach_node_sysfs(struct device_node *np)
}
if (!name)
return -ENOMEM;
+
+   of_node_get(np);
+
rc = kobject_add(>kobj, parent, "%s", name);
kfree(name);
if (rc)
@@ -159,6 +162,5 @@ void __of_detach_node_sysfs(struct device_node *np)
kobject_del(>kobj);
}
 
-   /* finally remove the kobj_init ref */
of_node_put(np);
 }
-- 
Frank Rowand 



[PATCH v7 02/17] of: overlay: add missing of_node_put() after add new node to changeset

2018-11-08 Thread frowand . list
From: Frank Rowand 

The refcount of a newly added overlay node decrements to one
(instead of zero) when the overlay changeset is destroyed.  This
change will cause the final decrement be to zero.

After applying this patch, new validation warnings will be
reported from the devicetree unittest during boot due to
a pre-existing devicetree bug.  The warnings will be similar to:

  OF: ERROR: memory leak before free overlay changeset,  
/testcase-data/overlay-node/test-bus/test-unittest4

This pre-existing devicetree bug will also trigger a WARN_ONCE() from
refcount_sub_and_test_checked() when an overlay changeset is
destroyed without having first been applied.  This scenario occurs
when an error in the overlay is detected during the overlay changeset
creation:

  WARNING: CPU: 0 PID: 1 at lib/refcount.c:187 
refcount_sub_and_test_checked+0xa8/0xbc
  refcount_t: underflow; use-after-free.

  (unwind_backtrace) from (show_stack+0x10/0x14)
  (show_stack) from (dump_stack+0x6c/0x8c)
  (dump_stack) from (__warn+0xdc/0x104)
  (__warn) from (warn_slowpath_fmt+0x44/0x6c)
  (warn_slowpath_fmt) from (refcount_sub_and_test_checked+0xa8/0xbc)
  (refcount_sub_and_test_checked) from (kobject_put+0x24/0x208)
  (kobject_put) from (of_changeset_destroy+0x2c/0xb4)
  (of_changeset_destroy) from (free_overlay_changeset+0x1c/0x9c)
  (free_overlay_changeset) from (of_overlay_remove+0x284/0x2cc)
  (of_overlay_remove) from 
(of_unittest_apply_revert_overlay_check.constprop.4+0xf8/0x1e8)
  (of_unittest_apply_revert_overlay_check.constprop.4) from 
(of_unittest_overlay+0x960/0xed8)
  (of_unittest_overlay) from (of_unittest+0x1cc4/0x2138)
  (of_unittest) from (do_one_initcall+0x4c/0x28c)
  (do_one_initcall) from (kernel_init_freeable+0x29c/0x378)
  (kernel_init_freeable) from (kernel_init+0x8/0x110)
  (kernel_init) from (ret_from_fork+0x14/0x2c)

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index f5fc8859a7ee..7613f7d680c7 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -379,7 +379,9 @@ static int add_changeset_node(struct overlay_changeset 
*ovcs,
if (ret)
return ret;
 
-   return build_changeset_next_level(ovcs, tchild, node);
+   ret = build_changeset_next_level(ovcs, tchild, node);
+   of_node_put(tchild);
+   return ret;
}
 
if (node->phandle && tchild->phandle)
-- 
Frank Rowand 



[PATCH v7 01/17] of: overlay: add tests to validate kfrees from overlay removal

2018-11-08 Thread frowand . list
From: Frank Rowand 

Add checks:
  - attempted kfree due to refcount reaching zero before overlay
is removed
  - properties linked to an overlay node when the node is removed
  - node refcount > one during node removal in a changeset destroy,
if the node was created by the changeset

After applying this patch, several validation warnings will be
reported from the devicetree unittest during boot due to
pre-existing devicetree bugs. The warnings will be similar to:

  OF: ERROR: of_node_release(), unexpected properties in 
/testcase-data/overlay-node/test-bus/test-unittest11
  OF: ERROR: memory leak, expected refcount 1 instead of 2, 
of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay 
node /testcase-data-2/substation@100/
  hvac-medium-2

Tested-by: Alan Tull 
Signed-off-by: Frank Rowand 
---
 drivers/of/dynamic.c | 29 +
 drivers/of/overlay.c |  1 +
 include/linux/of.h   | 15 ++-
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index f4f8ed9b5454..12c3f9a15e94 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -330,6 +330,25 @@ void of_node_release(struct kobject *kobj)
if (!of_node_check_flag(node, OF_DYNAMIC))
return;
 
+   if (of_node_check_flag(node, OF_OVERLAY)) {
+
+   if (!of_node_check_flag(node, OF_OVERLAY_FREE_CSET)) {
+   /* premature refcount of zero, do not free memory */
+   pr_err("ERROR: memory leak before free overlay 
changeset,  %pOF\n",
+  node);
+   return;
+   }
+
+   /*
+* If node->properties non-empty then properties were added
+* to this node either by different overlay that has not
+* yet been removed, or by a non-overlay mechanism.
+*/
+   if (node->properties)
+   pr_err("ERROR: %s(), unexpected properties in %pOF\n",
+  __func__, node);
+   }
+
property_list_free(node->properties);
property_list_free(node->deadprops);
 
@@ -434,6 +453,16 @@ struct device_node *__of_node_dup(const struct device_node 
*np,
 
 static void __of_changeset_entry_destroy(struct of_changeset_entry *ce)
 {
+   if (ce->action == OF_RECONFIG_ATTACH_NODE &&
+   of_node_check_flag(ce->np, OF_OVERLAY)) {
+   if (kref_read(>np->kobj.kref) > 1) {
+   pr_err("ERROR: memory leak, expected refcount 1 instead 
of %d, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach 
overlay node %pOF\n",
+  kref_read(>np->kobj.kref), ce->np);
+   } else {
+   of_node_set_flag(ce->np, OF_OVERLAY_FREE_CSET);
+   }
+   }
+
of_node_put(ce->np);
list_del(>node);
kfree(ce);
diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 42b1f73ac5f6..f5fc8859a7ee 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -373,6 +373,7 @@ static int add_changeset_node(struct overlay_changeset 
*ovcs,
return -ENOMEM;
 
tchild->parent = target_node;
+   of_node_set_flag(tchild, OF_OVERLAY);
 
ret = of_changeset_attach_node(>cset, tchild);
if (ret)
diff --git a/include/linux/of.h b/include/linux/of.h
index a5aee3c438ad..664cd5573ae2 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -138,11 +138,16 @@ static inline void of_node_put(struct device_node *node) 
{ }
 extern struct device_node *of_stdout;
 extern raw_spinlock_t devtree_lock;
 
-/* flag descriptions (need to be visible even when !CONFIG_OF) */
-#define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */
-#define OF_DETACHED2 /* node has been detached from the device tree */
-#define OF_POPULATED   3 /* device already created for the node */
-#define OF_POPULATED_BUS   4 /* of_platform_populate recursed to children 
of this node */
+/*
+ * struct device_node flag descriptions
+ * (need to be visible even when !CONFIG_OF)
+ */
+#define OF_DYNAMIC 1 /* (and properties) allocated via kmalloc */
+#define OF_DETACHED2 /* detached from the device tree */
+#define OF_POPULATED   3 /* device already created */
+#define OF_POPULATED_BUS   4 /* platform bus created for children */
+#define OF_OVERLAY 5 /* allocated for an overlay */
+#define OF_OVERLAY_FREE_CSET   6 /* in overlay cset being freed */
 
 #define OF_BAD_ADDR((u64)-1)
 
-- 
Frank Rowand 



[PATCH v7 00/17] of: overlay: validation checks, subsequent fixes

2018-11-08 Thread frowand . list
From: Frank Rowand 

Add checks to (1) overlay apply process and (2) memory freeing
triggered by overlay release.  The checks are intended to detect
possible memory leaks and invalid overlays.

The checks revealed bugs in existing code.  Fixed the bugs.

While fixing bugs, noted other issues, which are fixed in
separate patches.

FPGA folks:

  I made the validation checks that should result in an
  invalid live devicetree report "ERROR" and cause the overlay apply
  to fail.

  I made the memory leak validation tests report "WARNING" and allow
  the overlay apply to complete successfully.  Please let me know
  if you encounter the warnings.  There are at least two paths
  forward to deal with the cases that trigger the warning: (1) change
  the warning to an error and fail the overlay apply, or (2) find a
  way to detect the potential memory leaks and free the memory
  appropriately.

ALL people:

  The validations do _not_ address another major concern I have with
  releasing overlays, which is use after free errors.

Changes since v6:
  - 03/18 Add acked-by from Michael Ellerman
  - 03/18 Move info from post "---" into header comment
  - 04/18 Add acked-by from Michael Ellerman
  - 03/18 Move info from post "---" into header comment, add more info
  - 07/18 Drop.

Changes since v5:
  - move from 4.19-rc1 to 4.20-rc1
  - all patches: add tested-by Alan Tull
  - 05/18: update for context change from commit a613b26a50136 ("of:
Convert to using %pOFn instead of device_node.name")

Changes since v4:
  - 01/18: make error message format consistent, error first, path last
  - 09/18: create of_prop_val_eq() and change open code to use it
  - 09/18: remove extra blank lines

Changes since v3:
  - 01/18: Add expected value of refcount for destroy cset entry error.  Also
explain the cause of the error.

  - 09/18: for errors of an overlay changing the value of #size-cells or
#address-cells, return -EINVAL so that overlay apply will fail
  - 09/18: for errors of an overlay changing the value of #size-cells or
#address-cells, make the message more direct.
Old message:
  OF: overlay: ERROR: overlay and/or live tree #size-cells invalid in node 
/soc/base_fpga_region
New message:
  OF: overlay: ERROR: changing value of /soc/base_fpga_region/#size-cells 
not allowed

  - 13/18: Update patch comment header to state that this patch modifies the
previous patch to not return immediately on fragment error and
explain this is not a performance issue.
  - 13/18: remove redundant "overlay" from two error messages.  "OF: overlay:"
is already present in pr_fmt()

Changes since v2:

  - 13/18: Use continue to reduce indentation in find_dup_cset_node_entry()
and find_dup_cset_prop()

Changes since v1:

  - move patch 16/16 to 17/18
  - move patch 15/16 to 18/18
  - new patch 15/18
  - new patch 16/18

  - 05/18: add_changeset_node() header comment: incorrect comment for @target

  - 18/18: add same fix for of_parse_phandle_with_args()
  - 18/18: add same fix for of_parse_phandle_with_args_map()

Frank Rowand (17):
  of: overlay: add tests to validate kfrees from overlay removal
  of: overlay: add missing of_node_put() after add new node to changeset
  of: overlay: add missing of_node_get() in __of_attach_node_sysfs
  powerpc/pseries: add of_node_put() in dlpar_detach_node()
  of: overlay: use prop add changeset entry for property in new nodes
  of: overlay: do not duplicate properties from overlay for new nodes
  of: overlay: reorder fields in struct fragment
  of: overlay: validate overlay properties #address-cells and
#size-cells
  of: overlay: make all pr_debug() and pr_err() messages unique
  of: overlay: test case of two fragments adding same node
  of: overlay: check prevents multiple fragments add or delete same node
  of: overlay: check prevents multiple fragments touching same property
  of: unittest: remove unused of_unittest_apply_overlay() argument
  of: overlay: set node fields from properties when add new overlay node
  of: unittest: allow base devicetree to have symbol metadata
  of: unittest: find overlays[] entry by name instead of index
  of: unittest: initialize args before calling of_*parse_*()

 arch/powerpc/platforms/pseries/dlpar.c |   2 +
 drivers/of/dynamic.c   |  59 -
 drivers/of/kobj.c  |   4 +-
 drivers/of/overlay.c   | 292 -
 drivers/of/unittest-data/Makefile  |   2 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  |  28 ++
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 ++
 drivers/of/unittest-data/overlay_base.dts  |   1 +
 drivers/of/unittest.c  |  96 +--
 include/linux/of.h |  21 +-
 10 files changed, 432 insertions(+), 97 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts
 create mode 

Re: [Skiboot] Important details about race condition in EEH/NVMe-issue on ppc64le.

2018-11-08 Thread Oliver
On Fri, Nov 9, 2018 at 2:30 AM Koltsov Dmitriy  wrote:
>
> Hi, Oliver.
>
> Your version of EEH/NVMe-issue looks close to be true.
>
> I've applied simple patch to ./drivers/nvme/host/pci.c file:
>
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "nvme.h"
>
> @@ -2073,11 +2074,14 @@
>
>  static int nvme_pci_enable(struct nvme_dev *dev)
>  {
> +static int num=0;
>  int result = -ENOMEM;
>  struct pci_dev *pdev = to_pci_dev(dev->dev);
> +num=num+1;
>
>  if (pci_enable_device_mem(pdev))
>  return result;
> +msleep(num);
>
>  pci_set_master(pdev);
>
>
> and find out approximate minimal estimation of delay which removes the 
> EEH/NVMe-issue.
> This delay is about 1 millisecond (see in the code above). So, it's a stable 
> phenomena
> that when all nvme disks have minimum mutual interval of 1(one) millisecond 
> (each) to
> enable upstream bridge then EEH/NVMe-issue is not reproducible. Hence, I have 
> the
> hypothesis that due to bar alignment patch (see prev message) the bridge 
> enabling procedure
> lasts significantly more time than in case of alignment=0.

> So, in correspondence with your
> hypothesis about several threads "pinging" PHB, it turns out that one of that 
> threads
> begins to write to PHB, while necessary time is not passed yet (~1 ms).

No, only the first thread ever touches the hardware (PCI config space,
really). The first thread will set an in-memory flag which is seen by
the other threads and taken to mean that the bridge is already
enabled.

> HW error follows and readl() of NVME_REG_CSTS returns 0x in 
> nvme_pci_enable().
> (And when align=0 the PHB enable procedure duration is significantly less 
> than 1ms so
> other threads which don't enable PHB themselves - has no conflict, because by 
> the time
> they tries to write to PHB - PHB is already really enabled).
> I don't see here contradictions in such version of events sequence leading
> to EEH/NVMe-issue.
>
> So, there are two questions:
>
> Q.1. Does my hypothesis about delay=1ms and about difference of PHB enable 
> procedure
> duration in two cases (align=0 and align=64K) looks close to be true ?

*shrug*

Probably? Even if there is a difference in timing I don't see why that
is a problem. You might be taking (or not taking) a page fault due to
the alignment patch, or maybe there's some extra state inside of linux
that needed to be updated for some reason. It could be anything.
Fundamentally, the problem here is that Linux has a race condition.
Debugging the exact set of circumstances required to trigger the race
is sometimes worth it, but I don't think this is one of those cases.

If you honestly believe that there is a problem here then can you
provide us with some actual data rather than speculating about HW
bugs? The Linux ftrace (function_graph is useful) infrastructure
should allow you to do accurate measurements of the whole process and
get a breakdown of what the slow parts are. Once you have a good
overview of what code paths are being exercised in each case then you
can do more detailed measurements of specific functions.


> Q.2. If answer is 'yes' to Q.1 - are there some procedures in FW or HW 
> implementation in
> POWER8 PHB that may "extend" phb enable procedure time execution because of 
> alignment change
> to 64K ? May be in OPAL or on-chip ? Or the only "suspected area" for 
> EEH/NVMe-issue -
> is in linux kernel procedures ?

I don't see any reason to suspect OPAL or the chip here.

Oliver

>
>
>
> Regards,
> Dmitriy.
>


[PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED

2018-11-08 Thread Michael Roth
While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
pending signal in the L0 QEMU process can generate the following
sequence:

  ret0 = kvmppc_pseries_do_hcall()
ret1 = kvmhv_enter_nested_guest()
  ret2 = kvmhv_run_single_vcpu()
  if (ret2 == -EINTR)
return H_INTERRUPT
if (ret1 == H_INTERRUPT)
  kvmppc_set_gpr(vcpu, 3, 0)
  return -EINTR
/* skipped: */
kvmppc_set_gpr(vcpu, 3, ret)
vcpu->arch.hcall_needed = 0
return RESUME_GUEST

which causes an exit to L0 userspace with ret0 == -EINTR.

The intention seems to be to set the hcall return value to 0 (via
VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
once we resume executing the VCPU. However, because we don't set
vcpu->arch.hcall_needed = 0, we do the following once userspace
resumes execution via kvm_arch_vcpu_ioctl_run():

  ...
  } else if (vcpu->arch.hcall_needed) {
int i

kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
for (i = 0; i < 9; ++i)
   kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
vcpu->arch.hcall_needed = 0;

since vcpu->arch.hcall_needed == 1 indicates that userspace should
have handled the hcall and stored the return value in
run->papr_hcall.ret. Since that's not the case here, we can get an
unexpected value in VCPU r3, which can result in
kvmhv_p9_guest_entry() reporting an unexpected trap value when it
returns from H_ENTER_NESTED, causing the following register dump to
console via subsequent call to kvmppc_handle_exit_hv() in L1:

  [  350.612854] vcpu f9564cf8 (0):
  [  350.612915] pc  = c013eb98  msr = 80009033  trap = 1
  [  350.613020] r 0 = c04b9044  r16 = 
  [  350.613075] r 1 = c0007cffba30  r17 = 
  [  350.613120] r 2 = c178c100  r18 = 7fffc24f3b50
  [  350.613166] r 3 = c0007ef52480  r19 = 7fffc24fff58
  [  350.613212] r 4 =   r20 = 0a1e96ece9d0
  [  350.613253] r 5 = 70616d00746f6f72  r21 = 0a1ea117c9b0
  [  350.613295] r 6 = 0020  r22 = 0a1ea1184360
  [  350.613338] r 7 = c000783be440  r23 = 0003
  [  350.613380] r 8 = fffc  r24 = 0a1e96e9e124
  [  350.613423] r 9 = c0007ef52490  r25 = 07ff
  [  350.613469] r10 = 0004  r26 = c0007eb2f7a0
  [  350.613513] r11 = b0616d0009eccdb2  r27 = c0007cffbb10
  [  350.613556] r12 = c04b9000  r28 = c0007d83a2c0
  [  350.613597] r13 = c1b0  r29 = c000783cdf68
  [  350.613639] r14 =   r30 = 
  [  350.613681] r15 =   r31 = c0007cffbbf0
  [  350.613723] ctr = c04b9000  lr  = c04b9044
  [  350.613765] srr0 = 772f954dd48c srr1 = 8280f033
  [  350.613808] sprg0 =  sprg1 = c1b0
  [  350.613859] sprg2 = 772f9565a280 sprg3 = 
  [  350.613911] cr = 88002848  xer = 2004  dsisr = 4200
  [  350.613962] dar = 772f9539
  [  350.614031] fault dar = c00244b278c0 dsisr = 
  [  350.614073] SLB (0 entries):
  [  350.614157] lpcr = 004003d40413 sdr1 =  last_inst = 

  [  350.614252] trap=0x1 | pc=0xc013eb98 | msr=0x80009033

followed by L1's QEMU reporting the following before stopping execution
of the nested guest:

  KVM: unknown exit, hardware reason 1
  NIP c013eb98   LR c04b9044 CTR c04b9000 XER 
2004 CPU#0
  MSR 80009033 HID0   HF 8000 iidx 3 didx 3
  TB   DECR 
  GPR00 c04b9044 c0007cffba30 c178c100 c0007ef52480
  GPR04  70616d00746f6f72 0020 c000783be440
  GPR08 fffc c0007ef52490 0004 b0616d0009eccdb2
  GPR12 c04b9000 c1b0  
  GPR16   7fffc24f3b50 7fffc24fff58
  GPR20 0a1e96ece9d0 0a1ea117c9b0 0a1ea1184360 0003
  GPR24 0a1e96e9e124 07ff c0007eb2f7a0 c0007cffbb10
  GPR28 c0007d83a2c0 c000783cdf68  c0007cffbbf0
  CR 88002848  [ L  L  -  -  E  L  G  L  ] RES 
   SRR0 772f954dd48c  SRR1 8280f033PVR 004e1202 VRSAVE 

  SPRG0  SPRG1 c1b0  SPRG2 772f9565a280  SPRG3 

  SPRG4  SPRG5   SPRG6   SPRG7 

  HSRR0  HSRR1 
   CFAR 
   LPCR 03d40413
   PTCR    DAR 772f9539  DSISR 4200

Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
of H_ENTER_NESTED before we exit to L0 userspace.

Cc: linuxppc-...@ozlabs.org
Cc: David Gibson 
Cc: Paul Mackerras 
Cc: 

RE: [PATCH 5/6] pci: layerscape: Add the EP mode support.

2018-11-08 Thread Xiaowei Bao


-Original Message-
From: Xiaowei Bao 
Sent: 2018年11月6日 14:48
To: 'Kishon Vijay Abraham I' ; bhelg...@google.com; 
robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li 
; lorenzo.pieral...@arm.com; a...@arndb.de; 
gre...@linuxfoundation.org; M.h. Lian ; Mingkai Hu 
; Roy Zang ; 
kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com; 
pombreda...@nexb.com; shawn@rock-chips.com; linux-...@vger.kernel.org; 
devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; 
linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
Cc: Jiafei Pan 
Subject: RE: [PATCH 5/6] pci: layerscape: Add the EP mode support.

Hi Kishon,

-Original Message-
From: Kishon Vijay Abraham I 
Sent: 2018年11月6日 14:07
To: Xiaowei Bao ; bhelg...@google.com; robh...@kernel.org; 
mark.rutl...@arm.com; shawn...@kernel.org; Leo Li ; 
lorenzo.pieral...@arm.com; a...@arndb.de; gre...@linuxfoundation.org; M.h. Lian 
; Mingkai Hu ; Roy Zang 
; kstew...@linuxfoundation.org; 
cyrille.pitc...@free-electrons.com; pombreda...@nexb.com; 
shawn@rock-chips.com; linux-...@vger.kernel.org; 
devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; 
linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
Cc: Jiafei Pan 
Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support.

(Removed Niklas as mails to him is bouncing)

Hi,

Please fix your email client. Refer Documentation/process/email-clients.rst

On 05/11/18 2:45 PM, Xiaowei Bao wrote:
> 
> 
> -Original Message-
> From: Kishon Vijay Abraham I 
> Sent: 2018年11月5日 16:57
> To: Xiaowei Bao ; bhelg...@google.com;
> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li
> ; lorenzo.pieral...@arm.com; a...@arndb.de; 
> gre...@linuxfoundation.org; M.h. Lian ; Mingkai 
> Hu ; Roy Zang ; 
> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com;
> pombreda...@nexb.com; shawn@rock-chips.com; 
> niklas.cas...@axis.com; linux-...@vger.kernel.org; 
> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; 
> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
> Cc: Jiafei Pan 
> Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support.
> 
> Hi,
> 
> On 31/10/18 4:08 PM, Xiaowei Bao wrote:
>>
>>
>> -Original Message-
>> From: Kishon Vijay Abraham I 
>> Sent: 2018年10月31日 12:15
>> To: Xiaowei Bao ; bhelg...@google.com;
>> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li
>> ; lorenzo.pieral...@arm.com; a...@arndb.de; 
>> gre...@linuxfoundation.org; M.h. Lian ; 
>> Mingkai Hu ; Roy Zang ; 
>> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com;
>> pombreda...@nexb.com; shawn@rock-chips.com; 
>> niklas.cas...@axis.com; linux-...@vger.kernel.org; 
>> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; 
>> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
>> Cc: Jiafei Pan 
>> Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support.
>>
>> Hi,
>>
>> On 31/10/18 8:03 AM, Xiaowei Bao wrote:
>>>
>>>
>>> -Original Message-
>>> From: Xiaowei Bao
>>> Sent: 2018年10月26日 17:19
>>> To: 'Kishon Vijay Abraham I' ; bhelg...@google.com;
>>> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo 
>>> robh+Li
>>> ; lorenzo.pieral...@arm.com; a...@arndb.de; 
>>> gre...@linuxfoundation.org; M.h. Lian ; 
>>> Mingkai Hu ; Roy Zang ; 
>>> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com;
>>> pombreda...@nexb.com; shawn@rock-chips.com; 
>>> niklas.cas...@axis.com; linux-...@vger.kernel.org; 
>>> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; 
>>> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
>>> Cc: Jiafei Pan 
>>> Subject: RE: [PATCH 5/6] pci: layerscape: Add the EP mode support.
>>>
>>>
>>>
>>> -Original Message-
>>> From: Kishon Vijay Abraham I 
>>> Sent: 2018年10月26日 13:29
>>> To: Xiaowei Bao ; bhelg...@google.com;
>>> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo 
>>> robh+Li
>>> ; lorenzo.pieral...@arm.com; a...@arndb.de; 
>>> gre...@linuxfoundation.org; M.h. Lian ; 
>>> Mingkai Hu ; Roy Zang ; 
>>> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com;
>>> pombreda...@nexb.com; shawn@rock-chips.com; 
>>> niklas.cas...@axis.com; linux-...@vger.kernel.org; 
>>> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; 
>>> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
>>> Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support.
>>>
>>> Hi,
>>>
>>> On Thursday 25 October 2018 04:39 PM, Xiaowei Bao wrote:
 Add the PCIe EP mode support for layerscape platform.

 Signed-off-by: Xiaowei Bao 
 ---
  drivers/pci/controller/dwc/Makefile|2 +-
  drivers/pci/controller/dwc/pci-layerscape-ep.c |  161
 
  2 files changed, 162 insertions(+), 1 deletions(-)  create mode
 100644 

Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Alex_Gagniuc
On 11/08/2018 04:51 PM, Greg KH wrote:
> On Thu, Nov 08, 2018 at 10:49:08PM +, alex_gagn...@dellteam.com wrote:
>> In the case that we're trying to fix, this code executing is a result of
>> the device being gone, so we can guarantee race-free operation. I agree
>> that there is a race, in the general case. As far as checking the result
>> for all F's, that's not an option when firmware crashes the system as a
>> result of the mmio read/write. It's never pretty when firmware gets
>> involved.
> 
> If you have firmware that crashes the system when you try to read from a
> PCI device that was hot-removed, that is broken firmware and needs to be
> fixed.  The kernel can not work around that as again, you will never win
> that race.

But it's not the firmware that crashes. It's linux as a result of a 
fatal error message from the firmware. And we can't fix that because FFS 
handling requires that the system reboots [1].

If we're going to say that we don't want to support FFS because it's a 
separate code path, and different flow, that's fine. I am myself, not a 
fan of FFS. But if we're going to continue supporting it, I think we'll 
continue to have to resolve these sort of unintended consequences.

Alex

[1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Alex_Gagniuc
On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote:
> 
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for sensitive 
> information.
> 
> 
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
>> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
>>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
 I'm having second thoughts about this.  One thing I'm uncomfortable
 with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
 instead of systematic, in the sense that I don't know how we convince
 ourselves that this (and only this) is the correct place to put it.
>>>
>>> I think my stance always has been that this call is not good at all
>>> because once you call it you never really know if it is still true as
>>> the device could have been removed right afterward.
>>>
>>> So almost any code that relies on it is broken, there is no locking and
>>> it can and will race and you will loose.
>>
>> AIUI, we're not trying to create code to rely on this. This more about
>> reducing reliance on hardware. If the software misses the race once and
>> accesses disconnected device memory, that's usually not a big deal to
>> let hardware sort it out, but the point is not to push our luck.
> 
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone, you can never rely on making the call to the
> function to check if the hardware is there to be still valid any point
> in time after the call returns.

In the case that we're trying to fix, this code executing is a result of 
the device being gone, so we can guarantee race-free operation. I agree 
that there is a race, in the general case. As far as checking the result 
for all F's, that's not an option when firmware crashes the system as a 
result of the mmio read/write. It's never pretty when firmware gets 
involved.

Alex


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Alex_Gagniuc
On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
> 
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for sensitive 
> information.
> 
> 
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen seen the ECRs in the PCIe base spec and ACPI that have 
been floating around the past few months? -- HPX, SFI, CER. Without 
divulging too much before publication, I'm curious on opinions on how 
well (or not well) those flows would work in general, and in linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it. >
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved.  Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
 >
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses.  I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than  complete scrutiny and quality 
control of unquestionable moral integrity :). In theory ~0 can be a 
great indicator that something may be wrong. Though I think it's about 
as ad-hoc as pci_dev_is_disconnected().

I slightly like the idea of wrapping the MMIO accessors. There's still 
memcpy and DMA that cause the same MemRead/Wr PCIe transactions, and the 
same sort of errors in PCIe land, and it would be good to have more 
testing on this. Since this patch is tested and confirmed to fix a known 
failure case, I would keep it, and the look at fixing the problem in a 
more generic way.

BTW, a lot of the problems we're fixing here come courtesy of 
firmware-first error handling. Do we reach a point where we draw a line 
in handling new problems introduced by FFS? So, if something is a 
problem with FFS, but not native handling, do we commit to supporting it?

Alex



Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 02:42:55PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > > instead of systematic, in the sense that I don't know how we convince
> > > > ourselves that this (and only this) is the correct place to put it.
> > > 
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true as
> > > the device could have been removed right afterward.
> > > 
> > > So almost any code that relies on it is broken, there is no locking and
> > > it can and will race and you will loose.
> > 
> > AIUI, we're not trying to create code to rely on this. This more about
> > reducing reliance on hardware. If the software misses the race once and
> > accesses disconnected device memory, that's usually not a big deal to
> > let hardware sort it out, but the point is not to push our luck.
> 
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone, you can never rely on making the call to the
> function to check if the hardware is there to be still valid any point
> in time after the call returns.
> 
> > Surprise hot remove is empirically more reliable the less we interact
> > with hardware and firmware. That shouldn't be necessary, but is just an
> > unfortunate reality.
> 
> You are not "interacting", you are reading/writing to the hardware, as
> you have to do so.  So I really don't understand what you are talking
> about here, sorry.

We're reading hardware memory, yes, but the hardware isn't there.
Something obviously needs to return FF, so we are indirectly interacting
with whatever mechanism handles that. Sometimes that mechanism doesn't
handle it gracefully and instead of having FF to consider, you have a
machine check rebooting your system.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Greg KH
On Thu, Nov 08, 2018 at 10:49:08PM +, alex_gagn...@dellteam.com wrote:
> On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote:
> > 
> > [EXTERNAL EMAIL]
> > Please report any suspicious attachments, links, or requests for sensitive 
> > information.
> > 
> > 
> > On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> >> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> >>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
>  I'm having second thoughts about this.  One thing I'm uncomfortable
>  with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
>  instead of systematic, in the sense that I don't know how we convince
>  ourselves that this (and only this) is the correct place to put it.
> >>>
> >>> I think my stance always has been that this call is not good at all
> >>> because once you call it you never really know if it is still true as
> >>> the device could have been removed right afterward.
> >>>
> >>> So almost any code that relies on it is broken, there is no locking and
> >>> it can and will race and you will loose.
> >>
> >> AIUI, we're not trying to create code to rely on this. This more about
> >> reducing reliance on hardware. If the software misses the race once and
> >> accesses disconnected device memory, that's usually not a big deal to
> >> let hardware sort it out, but the point is not to push our luck.
> > 
> > Then why even care about this call at all?  If you need to really know
> > if the read worked, you have to check the value.  If the value is FF
> > then you have a huge hint that the hardware is now gone.  And you can
> > rely on it being gone, you can never rely on making the call to the
> > function to check if the hardware is there to be still valid any point
> > in time after the call returns.
> 
> In the case that we're trying to fix, this code executing is a result of 
> the device being gone, so we can guarantee race-free operation. I agree 
> that there is a race, in the general case. As far as checking the result 
> for all F's, that's not an option when firmware crashes the system as a 
> result of the mmio read/write. It's never pretty when firmware gets 
> involved.

If you have firmware that crashes the system when you try to read from a
PCI device that was hot-removed, that is broken firmware and needs to be
fixed.  The kernel can not work around that as again, you will never win
that race.

thanks,

greg k-h


Re: [PATCH 6/9] PCI: consolidate PCI config entry in drivers/pci

2018-11-08 Thread Bjorn Helgaas
On Fri, Oct 19, 2018 at 02:09:49PM +0200, Christoph Hellwig wrote:
> There is no good reason to duplicate the PCI menu in every architecture.
> Instead provide a selectable HAVE_PCI symbol that indicates availability
> of PCI support and the handle the rest in drivers/pci.
> 
> Note that for powerpc we now select HAVE_PCI globally instead of the
> convoluted mess of conditional or or non-conditional support per board,
> similar to what we do e.g. on x86.  For alpha PCI is selected for the
> non-jensen configs as it was the default before, and a lot of code does
> not compile without PCI enabled.  On other architectures with limited
> PCI support that wasn't as complicated I've left the selection as-is.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Max Filippov 
> Acked-by: Thomas Gleixner 
> Acked-by: Bjorn Helgaas 

Sounds like Masahiro plans to take this series and I'm fine with that.
Minor typo below, since it sounds like there's another revision coming.

> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -3,6 +3,18 @@
>  # PCI configuration
>  #
>  
> +config HAVE_PCI
> + bool
> +
> +menuconfig PCI
> + bool "PCI support"
> + depends on HAVE_PCI
> +
> + help
> +   This option enables support for the PCI local bus, including
> +   support for PCI-X and the fundations for PCI Express support.

s/fundations/foundations/

> +   Say 'Y' here unless you know what you are doing.
> +
>  source "drivers/pci/pcie/Kconfig"


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Greg Kroah-Hartman
On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > instead of systematic, in the sense that I don't know how we convince
> > > ourselves that this (and only this) is the correct place to put it.
> > 
> > I think my stance always has been that this call is not good at all
> > because once you call it you never really know if it is still true as
> > the device could have been removed right afterward.
> > 
> > So almost any code that relies on it is broken, there is no locking and
> > it can and will race and you will loose.
> 
> AIUI, we're not trying to create code to rely on this. This more about
> reducing reliance on hardware. If the software misses the race once and
> accesses disconnected device memory, that's usually not a big deal to
> let hardware sort it out, but the point is not to push our luck.

Then why even care about this call at all?  If you need to really know
if the read worked, you have to check the value.  If the value is FF
then you have a huge hint that the hardware is now gone.  And you can
rely on it being gone, you can never rely on making the call to the
function to check if the hardware is there to be still valid any point
in time after the call returns.

> Surprise hot remove is empirically more reliable the less we interact
> with hardware and firmware. That shouldn't be necessary, but is just an
> unfortunate reality.

You are not "interacting", you are reading/writing to the hardware, as
you have to do so.  So I really don't understand what you are talking
about here, sorry.

greg k-h


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > I'm having second thoughts about this.  One thing I'm uncomfortable
> > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > instead of systematic, in the sense that I don't know how we convince
> > ourselves that this (and only this) is the correct place to put it.
> 
> I think my stance always has been that this call is not good at all
> because once you call it you never really know if it is still true as
> the device could have been removed right afterward.
> 
> So almost any code that relies on it is broken, there is no locking and
> it can and will race and you will loose.

AIUI, we're not trying to create code to rely on this. This more about
reducing reliance on hardware. If the software misses the race once and
accesses disconnected device memory, that's usually not a big deal to
let hardware sort it out, but the point is not to push our luck.

Surprise hot remove is empirically more reliable the less we interact
with hardware and firmware. That shouldn't be necessary, but is just an
unfortunate reality.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Greg Kroah-Hartman
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]
> 
> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> > On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> > > When a PCI device is gone, we don't want to send IO to it if we can
> > > avoid it. We expose functionality via the irq_chip structure. As
> > > users of that structure may not know about the underlying PCI device,
> > > it's our responsibility to guard against removed devices.
> > > 
> > > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > > .irq_mask/unmask() are not. Guard them for completeness.
> > > 
> > > For example, surprise removal of a PCIe device triggers teardown. This
> > > touches the irq_chips ops some point to disable the interrupts. I/O
> > > generated here can crash the system on firmware-first machines.
> > > Not triggering the IO in the first place greatly reduces the
> > > possibility of the problem occurring.
> > > 
> > > Signed-off-by: Alexandru Gagniuc 
> > 
> > Applied to pci/misc for v4.21, thanks!
> 
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

I think my stance always has been that this call is not good at all
because once you call it you never really know if it is still true as
the device could have been removed right afterward.

So almost any code that relies on it is broken, there is no locking and
it can and will race and you will loose.

I think your patch suffers from this race:

> +static u32 mmio_readl(struct pci_dev *dev, const volatile void __iomem *addr)
> +{
> + u32 val, id;
> +
> + if (pci_dev_is_disconnected(dev))
> + return ~0;

Great, but what happens if I yank the device out right here?

> + val = readl(addr);

This value could now be all FF, if the device is gone, so what did the
check above help with?

> + /*
> +  * If an MMIO read from the device returns ~0 data, that data may
> +  * be valid, or it may indicate a bus error.  If config space is
> +  * readable, assume it's valid data; otherwise, assume a bus error.
> +  */
> + if (val == ~0) {
> + pci_read_config_dword(dev, PCI_VENDOR_ID, );
> + if (id == ~0)
> + pci_dev_set_disconnected(dev, NULL);

So why do the check above for "is disconnected"?  What does this buy us
here, just short-circuiting the readl()?

> + }
> +
> + return val;
> +}
> +
> +static void mmio_writel(struct pci_dev *dev, u32 val,
> + volatile void __iomem *addr)
> +{
> + if (pci_dev_is_disconnected(dev))
> + return;
> +
> + writel(val, addr);

Why even check, what's wrong with always doing the write?

I understand the wish to make this easier, but I think the only way is
that the driver themselves should be checking on their reads.  And they
have to check on all reads, or at least on some subset of reads and be
able to handle 0xff for the other ones without going crazy.

I _think_ the xhci driver does this given that it is hot added/removed
all the time dynamically due to the way that modern laptops are made
where the bios adds/removed the xhci controller when a USB device is
added/removed.

thanks,

greg k-h


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> 
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

You know how the kernel provides ZERO_PAGE, wouldn't it be cool if we
also had a ONES_PAGE and could remap all virtual addresses from a memory
mapped device to that page on an ungraceful disconnect? I do not know
how to accomplish that, so might just be crazy talk... But if it is
possible, that would be a pretty nifty way to solve this problem.


Re: [PATCH 00/36] Devicetree schema

2018-11-08 Thread Rob Herring
On Thu, Nov 8, 2018 at 1:54 PM Marta Rybczynska  wrote:
>
> Rob,
> The patch set does convert the documentation files. Could you explain
> the workflow of verifying a DTS? From what I can understand we can
> validate YAML devicetrees, and the schema files against the
> meta-schemas, but I see no tool for DTS to YAML conversion in your
> tools. Do you use https://github.com/pantoniou/yamldt ?

No, I'm not using that. dtc supports yaml output now. The kernel copy
of dtc has this support in 4.20. You need libyaml headers installed to
enable the support. The output format is intended only for validation
and could possibly change.

The 'dtbs_check' target will do a dts->dt.yaml pass of all targets in
$(dtb-y) with dtc and then run the .dt.yaml files through the schema
validation.

There's also a presentation here:
https://connect.linaro.org/resources/yvr18/yvr18-404/

Rob


Re: [PATCH 00/36] Devicetree schema

2018-11-08 Thread Marta Rybczynska
Rob,
The patch set does convert the documentation files. Could you explain
the workflow of verifying a DTS? From what I can understand we can
validate YAML devicetrees, and the schema files against the
meta-schemas, but I see no tool for DTS to YAML conversion in your
tools. Do you use https://github.com/pantoniou/yamldt ?

Thanks,
Marta


[PATCH -next-akpm 2/3] mm: speed up mremap by 20x on large regions (v5)

2018-11-08 Thread Joel Fernandes
From: "Joel Fernandes (Google)" 

Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which is copying each
pte at a time, and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP system by copying at the PMD level when possible.

The speed up is an order of magnitude on x86 (~20x). On a 1GB mremap,
the mremap completion times drops from 3.4-3.6 milliseconds to 144-160
microseconds.

Before:
Total mremap time for 1GB data: 3521942 nanoseconds.
Total mremap time for 1GB data: 3449229 nanoseconds.
Total mremap time for 1GB data: 3488230 nanoseconds.

After:
Total mremap time for 1GB data: 150279 nanoseconds.
Total mremap time for 1GB data: 144665 nanoseconds.
Total mremap time for 1GB data: 158708 nanoseconds.

Incase THP is enabled, the optimization is mostly skipped except in
certain situations.

Acked-by: Kirill A. Shutemov 
Reviewed-by: William Kucharski 
Signed-off-by: Joel Fernandes (Google) 
---

Note that since the bug fix in [1], we now have to flush the TLB every
PMD move. The above numbers were obtained on x86 with a flush done every
move. For arm64, I previously encountered performance issues doing a
flush everytime we move, however Will Deacon says [2] the performance
should be better now with recent release. Until we can evaluate arm64, I
am dropping the HAVE_MOVE_PMD config enable patch for ARM64 for now. It
can be added back once we finish the performance evaluation. Also of
note is that the speed up on arm64 with this patch but without the TLB
flush every PMD move is around 500x.

[1] https://bugs.chromium.org/p/project-zero/issues/detail?id=1695
[2] https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg140837.html

 arch/Kconfig |  5 +
 mm/mremap.c  | 62 
 2 files changed, 67 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540ffa979..b70c952ac838 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -535,6 +535,11 @@ config HAVE_IRQ_TIME_ACCOUNTING
  Archs need to ensure they use a high enough resolution clock to
  support irq time accounting and then call 
enable_sched_clock_irqtime().
 
+config HAVE_MOVE_PMD
+   bool
+   help
+ Archs that select this are able to move page tables at the PMD level.
+
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
bool
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 7c9ab747f19d..2591e512373a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,50 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, unsigned long old_end,
+ pmd_t *old_pmd, pmd_t *new_pmd)
+{
+   spinlock_t *old_ptl, *new_ptl;
+   struct mm_struct *mm = vma->vm_mm;
+   pmd_t pmd;
+
+   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+   || old_end - old_addr < PMD_SIZE)
+   return false;
+
+   /*
+* The destination pmd shouldn't be established, free_pgtables()
+* should have release it.
+*/
+   if (WARN_ON(!pmd_none(*new_pmd)))
+   return false;
+
+   /*
+* We don't have to worry about the ordering of src and dst
+* ptlocks because exclusive mmap_sem prevents deadlock.
+*/
+   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+   new_ptl = pmd_lockptr(mm, new_pmd);
+   if (new_ptl != old_ptl)
+   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+   /* Clear the pmd */
+   pmd = *old_pmd;
+   pmd_clear(old_pmd);
+
+   VM_BUG_ON(!pmd_none(*new_pmd));
+
+   /* Set the new pmd */
+   set_pmd_at(mm, new_addr, new_pmd, pmd);
+   flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
+   if (new_ptl != old_ptl)
+   spin_unlock(new_ptl);
+   spin_unlock(old_ptl);
+
+   return true;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
@@ -237,7 +281,25 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
+   } else if (extent == PMD_SIZE) {
+#ifdef CONFIG_HAVE_MOVE_PMD
+   /*
+* If the extent is PMD-sized, try to speed the move by
+* moving at the PMD level if possible.
+*/
+   bool moved;
+
+   if (need_rmap_locks)
+   

Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema

2018-11-08 Thread Thomas Petazzoni
Hello,

I'm jumping into the discussion, but I clearly don't have all the
context of the discussion.

On Thu, 8 Nov 2018 15:54:31 +, Robin Murphy wrote:

> >> This seems like a semantic different between the two representations, or am
> >> I missing something here? Specifically, both the introduction of
> >> interrupts-extended and also dropping any mention of using a single per-cpu
> >> interrupt (the single combined case is no longer support by Linux; not sure
> >> if you want to keep it in the binding).  
> > 
> > In regards to no support for the single combined interrupt, it looks
> > like Marvell Armada SoCs at least (armada-375 is what I'm looking at)
> > have only a single interrupt. Though the interrupt gets routed to MPIC
> > which then has a GIC PPI. So it isn't supported or happens to work
> > still since it is a PPI?  
> 
> Well, the description of the MPIC in the Armada XP functional spec says:
> 
> "Interrupt sources ID0–ID28 are private events per CPU. Thus, each 
> processor has a different set of events map interrupts ID0–ID28."
> 
> Odd grammar aside, that would seem to imply that < 3> is a per-cpu 
> interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and 
> not an SPI then there's no issue there.

The Armada XP does not have a GIC at all, but only a MPIC as the
primary interrupt controller.

However the Armada 38x has both a GIC and a MPIC, and indeed the parent
interrupts of the MPIC towards the GIC is:

interrupts = ;

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


[PATCH V5 4/10] KVM/VMX: Add hv tlb range flush support

2018-11-08 Thread ltykernel
From: Lan Tianyu 

This patch is to register tlb_remote_flush_with_range callback with
hv tlb range flush interface.

Signed-off-by: Lan Tianyu 
---
Change since v4:
- Use new function kvm_fill_hv_flush_list_func() to fill flush
   request.
Change since v3:
- Merge Vitaly's don't pass EPT configuration info to
vmx_hv_remote_flush_tlb() fix.
Change since v1:
- Pass flush range with new hyper-v tlb flush struct rather
   than KVM tlb flush struct.
---
 arch/x86/kvm/vmx.c | 69 ++
 1 file changed, 49 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index edbc96cb990a..405dfbde70b2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1567,7 +1567,38 @@ static void check_ept_pointer_match(struct kvm *kvm)
to_kvm_vmx(kvm)->ept_pointers_match = EPT_POINTERS_MATCH;
 }
 
-static int vmx_hv_remote_flush_tlb(struct kvm *kvm)
+int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush,
+   void *data)
+{
+   struct kvm_tlb_range *range = data;
+
+   return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn,
+   range->pages);
+}
+
+static inline int __hv_remote_flush_tlb_with_range(struct kvm *kvm,
+   struct kvm_vcpu *vcpu, struct kvm_tlb_range *range)
+{
+   u64 ept_pointer = to_vmx(vcpu)->ept_pointer;
+
+   /* If ept_pointer is invalid pointer, bypass flush request. */
+   if (!VALID_PAGE(ept_pointer))
+   return 0;
+
+   /*
+* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs address
+* of the base of EPT PML4 table, strip off EPT configuration
+* information.
+*/
+   if (range)
+   return hyperv_flush_guest_mapping_range(ept_pointer & PAGE_MASK,
+   kvm_fill_hv_flush_list_func, (void *)range);
+   else
+   return hyperv_flush_guest_mapping(ept_pointer & PAGE_MASK);
+}
+
+static int hv_remote_flush_tlb_with_range(struct kvm *kvm,
+   struct kvm_tlb_range *range)
 {
struct kvm_vcpu *vcpu;
int ret = -ENOTSUPP, i;
@@ -1577,30 +1608,23 @@ static int vmx_hv_remote_flush_tlb(struct kvm *kvm)
if (to_kvm_vmx(kvm)->ept_pointers_match == EPT_POINTERS_CHECK)
check_ept_pointer_match(kvm);
 
-   /*
-* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs the address of the
-* base of EPT PML4 table, strip off EPT configuration information.
-* If ept_pointer is invalid pointer, bypass the flush request.
-*/
if (to_kvm_vmx(kvm)->ept_pointers_match != EPT_POINTERS_MATCH) {
-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!VALID_PAGE(to_vmx(vcpu)->ept_pointer))
-   return 0;
-
-   ret |= hyperv_flush_guest_mapping(
-   to_vmx(vcpu)->ept_pointer & PAGE_MASK);
-   }
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   ret |= __hv_remote_flush_tlb_with_range(
+   kvm, vcpu, range);
} else {
-   if (!VALID_PAGE(to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer))
-   return 0;
-
-   ret = hyperv_flush_guest_mapping(
-   to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer & PAGE_MASK);
+   ret = __hv_remote_flush_tlb_with_range(kvm,
+   kvm_get_vcpu(kvm, 0), range);
}
 
spin_unlock(_kvm_vmx(kvm)->ept_pointer_lock);
return ret;
 }
+
+static int hv_remote_flush_tlb(struct kvm *kvm)
+{
+   return hv_remote_flush_tlb_with_range(kvm, NULL);
+}
 #else /* !IS_ENABLED(CONFIG_HYPERV) */
 static inline void evmcs_write64(unsigned long field, u64 value) {}
 static inline void evmcs_write32(unsigned long field, u32 value) {}
@@ -7957,8 +7981,11 @@ static __init int hardware_setup(void)
 
 #if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
-   && enable_ept)
-   kvm_x86_ops->tlb_remote_flush = vmx_hv_remote_flush_tlb;
+   && enable_ept) {
+   kvm_x86_ops->tlb_remote_flush = hv_remote_flush_tlb;
+   kvm_x86_ops->tlb_remote_flush_with_range =
+   hv_remote_flush_tlb_with_range;
+   }
 #endif
 
if (!cpu_has_vmx_ple()) {
@@ -11567,6 +11594,8 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
*kvm, unsigned int id)
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
 
+   vmx->ept_pointer = INVALID_PAGE;
+
vmx->msr_ia32_feature_control_valid_bits = FEATURE_CONTROL_LOCKED;
 
/*
-- 
2.14.4



Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-08 Thread Florian Weimer
* Ram Pai:

> Florian,
>
>   I can. But I am struggling to understand the requirement. Why is
>   this needed?  Are we proposing a enhancement to the sys_pkey_alloc(),
>   to be able to allocate keys that are initialied to disable-read
>   only?

Yes, I think that would be a natural consequence.

However, my immediate need comes from the fact that the AMR register can
contain a flag combination that is not possible to represent with the
existing PKEY_DISABLE_WRITE and PKEY_DISABLE_ACCESS flags.  User code
could write to AMR directly, so I cannot rule out that certain flag
combinations exist there.

So I came up with this:

int
pkey_get (int key)
{
  if (key < 0 || key > PKEY_MAX)
{
  __set_errno (EINVAL);
  return -1;
}
  unsigned int index = pkey_index (key);
  unsigned long int amr = pkey_read ();
  unsigned int bits = (amr >> index) & 3;

  /* Translate from AMR values.  PKEY_AMR_READ standing alone is not
 currently representable.  */
  if (bits & PKEY_AMR_READ)
return PKEY_DISABLE_ACCESS;
  else if (bits == PKEY_AMR_WRITE)
return PKEY_DISABLE_WRITE;
  return 0;
}

And this is not ideal.  I would prefer something like this instead:

  switch (bits)
{
  case PKEY_AMR_READ | PKEY_AMR_WRITE:
return PKEY_DISABLE_ACCESS;
  case PKEY_AMR_READ:
return PKEY_DISABLE_READ;
  case PKEY_AMR_WRITE:
return PKEY_DISABLE_WRITE;
  case 0:
return 0;
}

By the way, is the AMR register 64-bit or 32-bit on 32-bit POWER?

Thanks,
Florian


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-08 Thread Ram Pai
On Thu, Nov 08, 2018 at 06:37:41PM +0100, Florian Weimer wrote:
> * Dave Hansen:
> 
> > On 11/8/18 7:01 AM, Florian Weimer wrote:
> >> Ideally, PKEY_DISABLE_READ | PKEY_DISABLE_WRITE and PKEY_DISABLE_READ |
> >> PKEY_DISABLE_ACCESS would be treated as PKEY_DISABLE_ACCESS both, and a
> >> line PKEY_DISABLE_READ would result in an EINVAL failure.
> >
> > Sounds reasonable to me.
> >
> > I don't see any urgency to do this right now.  It could easily go in
> > alongside the ppc patches when those get merged.
> 
> POWER support has already been merged, so we need to do something here
> now, before I can complete the userspace side.
> 
> > The only thing I'd suggest is that we make it something slightly
> > higher than 0x4.  It'll make the code easier to deal with in the
> > kernel if we have the ABI and the hardware mirror each other, and if
> > we pick 0x4 in the ABI for PKEY_DISABLE_READ, it might get messy if
> > the harware choose 0x4 for PKEY_DISABLE_EXECUTE or something.
> > 
> > So, let's make it 0x80 or something on x86 at least.
> 
> I don't have a problem with that if that's what it takes.
> 
> > Also, I'll be happy to review and ack the patch to do this, but I'd
> > expect the ppc guys (hi Ram!) to actually put it together.
> 
> Ram, do you want to write a patch?


Florian,

I can. But I am struggling to understand the requirement. Why is
this needed?  Are we proposing a enhancement to the sys_pkey_alloc(),
to be able to allocate keys that are initialied to disable-read
only?

RP

> 
> I'll promise I finish the glibc support for this. 8-)
> 
> Thanks,
> Florian

-- 
Ram Pai



Re: [PATCH 0/5] Guarded Userspace Access Prevention on Radix

2018-11-08 Thread Benjamin Herrenschmidt
On Thu, 2018-11-08 at 18:52 +0100, Christophe LEROY wrote:
> 
> In signal_32.c and signal_64.c, save_user_regs() calls __put_user() to 
> modify code, then calls flush_icache_range() on user addresses.
> 
> Shouldn't flush_icache_range() be performed with userspace access 
> protection unlocked ?

Thankfully this code is pretty much never used these days...

Russell: To trigger that, you need to disable the VDSO.

This brings back the idea however of having a way to "bulk" open the
gate during the whole signal sequence...

Cheers,
Ben.




Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Bjorn Helgaas
[+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
PCI error recovery in general]

On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> > When a PCI device is gone, we don't want to send IO to it if we can
> > avoid it. We expose functionality via the irq_chip structure. As
> > users of that structure may not know about the underlying PCI device,
> > it's our responsibility to guard against removed devices.
> > 
> > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > .irq_mask/unmask() are not. Guard them for completeness.
> > 
> > For example, surprise removal of a PCIe device triggers teardown. This
> > touches the irq_chips ops some point to disable the interrupts. I/O
> > generated here can crash the system on firmware-first machines.
> > Not triggering the IO in the first place greatly reduces the
> > possibility of the problem occurring.
> > 
> > Signed-off-by: Alexandru Gagniuc 
> 
> Applied to pci/misc for v4.21, thanks!

I'm having second thoughts about this.  One thing I'm uncomfortable
with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
instead of systematic, in the sense that I don't know how we convince
ourselves that this (and only this) is the correct place to put it.

Another is that the only place we call pci_dev_set_disconnected() is
in pciehp and acpiphp, so the only "disconnected" case we catch is if
hotplug happens to be involved.  Every MMIO read from the device is an
opportunity to learn whether it is reachable (a read from an
unreachable device typically returns ~0 data), but we don't do
anything at all with those.

The config accessors already check pci_dev_is_disconnected(), so this
patch is really aimed at MMIO accesses.  I think it would be more
robust if we added wrappers for readl() and writel() so we could
notice read errors and avoid future reads and writes.

Two compiled but untested patches below for your comments.  You can
mostly ignore the first; it's just boring interface changes.  The
important part is the second, which adds the wrappers.

These would be an alternative to the (admittedly much shorter) patch
here.  The wrappers are independent of MSI and could potentially be
exported from the PCI core for use by drivers.  I think it would be
better for drivers to use something like this instead of calling
pci_device_is_present() or pci_dev_is_disconnected() directly.

> > ---
> >  drivers/pci/msi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index f2ef896464b3..f31058fd2260 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 
> > flag)
> >  {
> > struct msi_desc *desc = irq_data_get_msi_desc(data);
> >  
> > +   if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> > +   return;
> > +
> > if (desc->msi_attrib.is_msix) {
> > msix_mask_irq(desc, flag);
> > readl(desc->mask_base); /* Flush write to device */


commit 150346e09edbcaedc520a6d7dec2b16f3a53afa1
Author: Bjorn Helgaas 
Date:   Thu Nov 8 09:17:51 2018 -0600

PCI/MSI: Pass pci_dev into IRQ mask interfaces

Add the struct pci_dev pointer to these interfaces:

  __pci_msix_desc_mask_irq()
  __pci_msi_desc_mask_irq()
  msix_mask_irq
  msi_mask_irq()

The pci_dev pointer is currently unused, so there's no functional change
intended with this patch.  A subsequent patch will use the pointer to
improve error detection.

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9f6f392a4461..56bbee2cf761 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -462,9 +462,9 @@ void arch_teardown_msi_irqs(struct pci_dev *pdev)
if (!msi->irq)
continue;
if (msi->msi_attrib.is_msix)
-   __pci_msix_desc_mask_irq(msi, 1);
+   __pci_msix_desc_mask_irq(pdev, msi, 1);
else
-   __pci_msi_desc_mask_irq(msi, 1, 1);
+   __pci_msi_desc_mask_irq(pdev, msi, 1, 1);
irq_set_msi_desc(msi->irq, NULL);
irq_free_desc(msi->irq);
msi->msg.address_lo = 0;
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index af24ed50a245..d46ae506e296 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -170,7 +170,8 @@ static inline __attribute_const__ u32 msi_mask(unsigned x)
  * reliably as devices without an INTx disable bit will then generate a
  * level IRQ which will never be cleared.
  */
-u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+u32 __pci_msi_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
+   u32 mask, u32 flag)
 {
u32 mask_bits = desc->masked;
 
@@ -179,15 +180,15 @@ u32 

Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-08 Thread Ram Pai
On Thu, Nov 08, 2018 at 01:05:09PM +0100, Florian Weimer wrote:
> Would it be possible to reserve a bit for PKEY_DISABLE_READ?
> 
> I think the POWER implementation can disable read access at the hardware
> level, but not write access, and that cannot be expressed with the
> current PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE bits.

POWER hardware can disable-read and can **also disable-write**
at the hardware level. It can disable-execute aswell at the
hardware level.   For example if the key bits for a given key in the AMR
register is  
0b01  it is read-disable
0b10  it is write-disable

To support access-disable, we make the key value 0b11.

So in case if you want to know if the key is read-disable 'bitwise-and' it
against 0x1.  i.e  (x & 0x1)

RP



Re: [PATCH] powerpc: Add KVM guest defconfig

2018-11-08 Thread Breno Leitao
hi Satheesh,

On 11/08/2018 03:08 AM, sathn...@linux.vnet.ibm.com wrote:
> --- /dev/null
> +++ b/arch/powerpc/configs/guest.config
> @@ -0,0 +1,14 @@
> +CONFIG_VIRTIO_BLK=y
> +CONFIG_VIRTIO_BLK_SCSI=y
> +CONFIG_SCSI_VIRTIO=y
> +CONFIG_VIRTIO_NET=y
> +CONFIG_NET_FAILOVER=y
> +CONFIG_VIRTIO_CONSOLE=y
> +CONFIG_VIRTIO=y
> +CONFIG_VIRTIO_PCI=y
> +CONFIG_KVM_GUEST=y
> +CONFIG_EPAPR_PARAVIRT=y
> +CONFIG_XFS_FS=y

Why a guest kernel needs to have XFS integrated in the core image? I am
wondering if it is a requirement from another CONFIG_ option.

If it is not a strict requirement from another config, I think we can keep it
as defined at ppc64_defconfig, which defines it as module (CONFIG_XFS_FS=m).

Thanks for this patch, very useful.
Breno



[PATCH -next-akpm 3/3] mm: select HAVE_MOVE_PMD in x86 for faster mremap

2018-11-08 Thread Joel Fernandes
From: "Joel Fernandes (Google)" 

Moving page-tables at the PMD-level on x86 is known to be safe. Enable
this option so that we can do fast mremap when possible.

Suggested-by: Kirill A. Shutemov 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 605bec0c228f..05f3667de0d2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -173,6 +173,7 @@ config X86
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
+   select HAVE_MOVE_PMD
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH -next-akpm 1/3] mm: treewide: remove unused address argument from pte_alloc functions (v2)

2018-11-08 Thread Joel Fernandes
From: "Joel Fernandes (Google)" 

This series speeds up mremap(2) syscall by copying page tables at the
PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work.  Also we find that there is no point in passing the 'address' to
pte_alloc since its unused. So this patch therefore removes this
argument tree-wide resulting in a nice negative diff as well. Also
ensuring along the way that the enabled architectures do not do anything
funky with 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64. Build tested on arm64.

The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from  pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

 fn(...
- , T2 E2
 )
 { ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ 
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

 fn(...
-,  E2
 )

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Suggested-by: Kirill A. Shutemov 
Acked-by: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: Julia Lawall 
Cc: Kirill A. Shutemov 
Signed-off-by: Joel Fernandes (Google) 

---
 arch/alpha/include/asm/pgalloc.h |  6 +++---
 arch/arc/include/asm/pgalloc.h   |  5 ++---
 arch/arm/include/asm/pgalloc.h   |  4 ++--
 arch/arm64/include/asm/pgalloc.h |  4 ++--
 arch/hexagon/include/asm/pgalloc.h   |  6 ++
 arch/ia64/include/asm/pgalloc.h  |  5 ++---
 arch/m68k/include/asm/mcf_pgalloc.h  |  8 ++--
 arch/m68k/include/asm/motorola_pgalloc.h |  4 ++--
 arch/m68k/include/asm/sun3_pgalloc.h |  6 ++
 arch/microblaze/include/asm/pgalloc.h| 19 ++-
 arch/microblaze/mm/pgtable.c |  3 +--
 arch/mips/include/asm/pgalloc.h  |  6 ++
 arch/nds32/include/asm/pgalloc.h |  5 ++---
 arch/nios2/include/asm/pgalloc.h |  6 ++
 arch/openrisc/include/asm/pgalloc.h  |  5 ++---
 arch/openrisc/mm/ioremap.c   |  3 +--
 arch/parisc/include/asm/pgalloc.h|  4 ++--
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 ++--
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 +---
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  4 ++--
 arch/powerpc/include/asm/nohash/64/pgalloc.h |  6 ++
 arch/powerpc/mm/pgtable-book3s64.c   |  2 +-
 arch/powerpc/mm/pgtable_32.c |  4 ++--
 arch/riscv/include/asm/pgalloc.h |  6 ++
 arch/s390/include/asm/pgalloc.h  |  4 ++--
 arch/sh/include/asm/pgalloc.h|  6 ++
 arch/sparc/include/asm/pgalloc_32.h  |  5 ++---
 arch/sparc/include/asm/pgalloc_64.h  |  6 ++
 arch/sparc/mm/init_64.c  |  6 ++
 arch/sparc/mm/srmmu.c|  4 ++--
 arch/um/include/asm/pgalloc.h|  4 ++--
 arch/um/kernel/mem.c |  4 ++--
 arch/unicore32/include/asm/pgalloc.h |  4 ++--
 arch/x86/include/asm/pgalloc.h   |  4 ++--
 arch/x86/mm/pgtable.c|  4 ++--
 arch/xtensa/include/asm/pgalloc.h|  8 +++-
 include/linux/mm.h   | 13 ++---
 mm/huge_memory.c |  8 
 mm/kasan/kasan_init.c|  2 +-
 mm/memory.c  | 17 -
 mm/migrate.c |  2 +-
 mm/mremap.c  |  2 +-
 

[PATCH -next-akpm 0/3] Add support for fast mremap

2018-11-08 Thread Joel Fernandes
Hi,
Here is the "fast mremap" series. This just a repost with Kirill's Acked-bys
added and William's Reviewed-by added. Also fixed a UML build error reported
last week. I would like this to be considered for linux -next. The performance
numbers in the series are for testing on x86. The config enablement patch for
arm64 will be posted in the future after testing (see notes below).

List of patches in series:

(1) mm: select HAVE_MOVE_PMD in x86 for faster mremap

(2) mm: speed up mremap by 20x on large regions (v5)
v1->v2: Added support for per-arch enablement (Kirill Shutemov)
v2->v3: Updated commit message to state the optimization may also
run for non-thp type of systems (Daniel Col).
v3->v4: Remove useless pmd_lock check (Kirill Shutemov)
Rebased ontop of Linus's master, updated perf results based
on x86 testing. Added Kirill's Acks.
v4->v5: Added William's Reviewed-by. Fixed arch/um build error
due to set_pmd_at not defined. Rebased on linux-next/akpm.

(3) mm: treewide: remove unused address argument from pte_alloc functions (v2)
v1->v2: fix arch/um/ prototype which was missed in v1 (Anton Ivanov)
update changelog with manual fixups for m68k and microblaze.

not included - (4) mm: select HAVE_MOVE_PMD in arm64 for faster mremap
This patch is dropped since last posting pending further performance
testing on arm64 with new TLB gather updates. See notes in patch
titled "mm: speed up mremap by 500x on large regions" for more
details.

Joel Fernandes (Google) (3):
mm: treewide: remove unused address argument from pte_alloc functions
(v2)
mm: speed up mremap by 20x on large regions (v5)
mm: select HAVE_MOVE_PMD in x86 for faster mremap

arch/Kconfig |  5 ++
arch/alpha/include/asm/pgalloc.h |  6 +-
arch/arc/include/asm/pgalloc.h   |  5 +-
arch/arm/include/asm/pgalloc.h   |  4 +-
arch/arm64/include/asm/pgalloc.h |  4 +-
arch/hexagon/include/asm/pgalloc.h   |  6 +-
arch/ia64/include/asm/pgalloc.h  |  5 +-
arch/m68k/include/asm/mcf_pgalloc.h  |  8 +--
arch/m68k/include/asm/motorola_pgalloc.h |  4 +-
arch/m68k/include/asm/sun3_pgalloc.h |  6 +-
arch/microblaze/include/asm/pgalloc.h| 19 +-
arch/microblaze/mm/pgtable.c |  3 +-
arch/mips/include/asm/pgalloc.h  |  6 +-
arch/nds32/include/asm/pgalloc.h |  5 +-
arch/nios2/include/asm/pgalloc.h |  6 +-
arch/openrisc/include/asm/pgalloc.h  |  5 +-
arch/openrisc/mm/ioremap.c   |  3 +-
arch/parisc/include/asm/pgalloc.h|  4 +-
arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 +-
arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 ++--
arch/powerpc/include/asm/nohash/32/pgalloc.h |  4 +-
arch/powerpc/include/asm/nohash/64/pgalloc.h |  6 +-
arch/powerpc/mm/pgtable-book3s64.c   |  2 +-
arch/powerpc/mm/pgtable_32.c |  4 +-
arch/riscv/include/asm/pgalloc.h |  6 +-
arch/s390/include/asm/pgalloc.h  |  4 +-
arch/sh/include/asm/pgalloc.h|  6 +-
arch/sparc/include/asm/pgalloc_32.h  |  5 +-
arch/sparc/include/asm/pgalloc_64.h  |  6 +-
arch/sparc/mm/init_64.c  |  6 +-
arch/sparc/mm/srmmu.c|  4 +-
arch/um/include/asm/pgalloc.h|  4 +-
arch/um/kernel/mem.c |  4 +-
arch/unicore32/include/asm/pgalloc.h |  4 +-
arch/x86/Kconfig |  1 +
arch/x86/include/asm/pgalloc.h   |  4 +-
arch/x86/mm/pgtable.c|  4 +-
arch/xtensa/include/asm/pgalloc.h|  8 +--
include/linux/mm.h   | 13 ++--
mm/huge_memory.c |  8 +--
mm/kasan/kasan_init.c|  2 +-
mm/memory.c  | 17 +++---
mm/migrate.c |  2 +-
mm/mremap.c  | 64 +++-
mm/userfaultfd.c |  2 +-
virt/kvm/arm/mmu.c   |  2 +-
46 files changed, 165 insertions(+), 147 deletions(-)

--
2.19.1.930.g4563a0d9d0-goog



Re: [PATCH 0/5] Guarded Userspace Access Prevention on Radix

2018-11-08 Thread Christophe LEROY




Le 01/11/2018 à 04:54, Russell Currey a écrit :

On Wed, 2018-10-31 at 17:58 +0100, LEROY Christophe wrote:

Russell Currey  a écrit :


On Fri, 2018-10-26 at 18:29 +0200, LEROY Christophe wrote:

Russell Currey  a écrit :


Guarded Userspace Access Prevention is a security mechanism
that
prevents
the kernel from being able to read and write userspace
addresses
outside of
the allowed paths, most commonly copy_{to/from}_user().

At present, the only CPU that supports this is POWER9, and only
while using
the Radix MMU.  Privileged reads and writes cannot access user
data
when
key 0 of the AMR is set.  This is described in the "Radix Tree
Translation
Storage Protection" section of the POWER ISA as of version 3.0.


It is not right that only power9 can support that.


It's true that not only P9 can support it, but there are more
considerations under hash than radix, implementing this for radix
is a
first step.


I don't know much about hash, but I was talking about the 8xx which
is
a nohash ppc32. I'll see next week if I can do something with it on
top of your serie.


My small brain saw the number 8 and assumed you were talking about
POWER8, I didn't know what 8xx was until now.

Working on a refactor to make things a bit more generic, and removing
the radix name and dependency from the config option.


In signal_32.c and signal_64.c, save_user_regs() calls __put_user() to 
modify code, then calls flush_icache_range() on user addresses.


Shouldn't flush_icache_range() be performed with userspace access 
protection unlocked ?


Christophe


[PATCH] powerpc/xmon: Fix invocation inside lock region

2018-11-08 Thread Breno Leitao
Currently xmon needs to get devtree_lock (through rtas_token()) during its
invocation (at crash time). If there is a crash while devtree_lock is being
held, then xmon tries to get the lock but spins forever and never get into
the interactive debugger, as in the following case:

int *ptr = NULL;
raw_spin_lock_irqsave(_lock, flags);
*ptr = 0xdeadbeef;

This patch avoids calling rtas_token(), thus trying to get the same lock,
at crash time. This new mechanism proposes getting the token at
initialization time (xmon_init()) and just consuming it at crash time.

This would allow xmon to be possible invoked independent of devtree_lock
being held or not.

Signed-off-by: Breno Leitao 
---
 arch/powerpc/xmon/xmon.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 36b8dc47a3c3..b566203d09c5 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -75,6 +75,9 @@ static int xmon_gate;
 #define xmon_owner 0
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_PPC_PSERIES
+static int set_indicator_token = RTAS_UNKNOWN_SERVICE;
+#endif
 static unsigned long in_xmon __read_mostly = 0;
 static int xmon_on = IS_ENABLED(CONFIG_XMON_DEFAULT);
 
@@ -358,7 +361,6 @@ static inline void disable_surveillance(void)
 #ifdef CONFIG_PPC_PSERIES
/* Since this can't be a module, args should end up below 4GB. */
static struct rtas_args args;
-   int token;
 
/*
 * At this point we have got all the cpus we can into
@@ -367,11 +369,11 @@ static inline void disable_surveillance(void)
 * If we did try to take rtas.lock there would be a
 * real possibility of deadlock.
 */
-   token = rtas_token("set-indicator");
-   if (token == RTAS_UNKNOWN_SERVICE)
+   if (set_indicator_token == RTAS_UNKNOWN_SERVICE)
return;
 
-   rtas_call_unlocked(, token, 3, 1, NULL, SURVEILLANCE_TOKEN, 0, 0);
+   rtas_call_unlocked(, set_indicator_token, 3, 1, NULL,
+  SURVEILLANCE_TOKEN, 0, 0);
 
 #endif /* CONFIG_PPC_PSERIES */
 }
@@ -3688,6 +3690,14 @@ static void xmon_init(int enable)
__debugger_iabr_match = xmon_iabr_match;
__debugger_break_match = xmon_break_match;
__debugger_fault_handler = xmon_fault_handler;
+
+#ifdef CONFIG_PPC_PSERIES
+   /*
+* Get the token here to avoid trying to get a lock
+* during the crash, causing a deadlock.
+*/
+   set_indicator_token = rtas_token("set-indicator");
+#endif
} else {
__debugger = NULL;
__debugger_ipi = NULL;
-- 
2.19.0



Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema

2018-11-08 Thread Robin Murphy

On 08/11/2018 15:59, Thomas Petazzoni wrote:

Hello,

I'm jumping into the discussion, but I clearly don't have all the
context of the discussion.

On Thu, 8 Nov 2018 15:54:31 +, Robin Murphy wrote:


This seems like a semantic different between the two representations, or am
I missing something here? Specifically, both the introduction of
interrupts-extended and also dropping any mention of using a single per-cpu
interrupt (the single combined case is no longer support by Linux; not sure
if you want to keep it in the binding).


In regards to no support for the single combined interrupt, it looks
like Marvell Armada SoCs at least (armada-375 is what I'm looking at)
have only a single interrupt. Though the interrupt gets routed to MPIC
which then has a GIC PPI. So it isn't supported or happens to work
still since it is a PPI?


Well, the description of the MPIC in the Armada XP functional spec says:

"Interrupt sources ID0–ID28 are private events per CPU. Thus, each
processor has a different set of events map interrupts ID0–ID28."

Odd grammar aside, that would seem to imply that < 3> is a per-cpu
interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and
not an SPI then there's no issue there.


The Armada XP does not have a GIC at all, but only a MPIC as the
primary interrupt controller.

However the Armada 38x has both a GIC and a MPIC, and indeed the parent
interrupts of the MPIC towards the GIC is:

interrupts = ;


Yeah, perhaps I should have clarified that the XP manual was the only 
publicly-available one I could find, but I'm inferring from the binding 
and driver that the MPIC in 375/38x still behaves the same.


Robin.


Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema

2018-11-08 Thread Robin Murphy

On 01/11/2018 19:32, Rob Herring wrote:

On Tue, Oct 9, 2018 at 6:57 AM Will Deacon  wrote:


Hi Rob,

On Fri, Oct 05, 2018 at 11:58:25AM -0500, Rob Herring wrote:

Convert ARM PMU binding to DT schema format using json-schema.

Cc: Will Deacon 
Cc: Mark Rutland 
Cc: linux-arm-ker...@lists.infradead.org
Cc: devicet...@vger.kernel.org
Signed-off-by: Rob Herring 
---
  Documentation/devicetree/bindings/arm/pmu.txt | 70 --
  .../devicetree/bindings/arm/pmu.yaml  | 96 +++
  2 files changed, 96 insertions(+), 70 deletions(-)
  delete mode 100644 Documentation/devicetree/bindings/arm/pmu.txt
  create mode 100644 Documentation/devicetree/bindings/arm/pmu.yaml


[...]


-- interrupts : 1 combined interrupt or 1 per core. If the interrupt is a 
per-cpu
-   interrupt (PPI) then 1 interrupt should be specified.


[...]


+  interrupts:
+oneOf:
+  - maxItems: 1
+  - minItems: 2
+maxItems: 8
+description: 1 interrupt per core.
+
+  interrupts-extended:
+$ref: '#/properties/interrupts'


This seems like a semantic different between the two representations, or am
I missing something here? Specifically, both the introduction of
interrupts-extended and also dropping any mention of using a single per-cpu
interrupt (the single combined case is no longer support by Linux; not sure
if you want to keep it in the binding).


In regards to no support for the single combined interrupt, it looks
like Marvell Armada SoCs at least (armada-375 is what I'm looking at)
have only a single interrupt. Though the interrupt gets routed to MPIC
which then has a GIC PPI. So it isn't supported or happens to work
still since it is a PPI?


Well, the description of the MPIC in the Armada XP functional spec says:

"Interrupt sources ID0–ID28 are private events per CPU. Thus, each 
processor has a different set of events map interrupts ID0–ID28."


Odd grammar aside, that would seem to imply that < 3> is a per-cpu 
interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and 
not an SPI then there's no issue there.


Robin.


[Resend PATCH V5 7/10] KVM: Make kvm_set_spte_hva() return int

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

The patch is to make kvm_set_spte_hva() return int and caller can
check return value to determine flush tlb or not.

Signed-off-by: Lan Tianyu 
---
 arch/arm/include/asm/kvm_host.h | 2 +-
 arch/arm64/include/asm/kvm_host.h   | 2 +-
 arch/mips/include/asm/kvm_host.h| 2 +-
 arch/mips/kvm/mmu.c | 3 ++-
 arch/powerpc/include/asm/kvm_host.h | 2 +-
 arch/powerpc/kvm/book3s.c   | 3 ++-
 arch/powerpc/kvm/e500_mmu_host.c| 3 ++-
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/mmu.c  | 3 ++-
 virt/kvm/arm/mmu.c  | 6 --
 10 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 5ca5d9af0c26..fb62b500e590 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -225,7 +225,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 52fbc823ff8c..b3df7b38ba7d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -360,7 +360,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 2c1c53d12179..71c3f21d80d5 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -933,7 +933,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct 
kvm_vcpu *vcpu,
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d8dcdb350405..97e538a8c1be 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -551,7 +551,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, 
gfn_t gfn_end,
   (pte_dirty(old_pte) && !pte_dirty(hva_pte));
 }
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
unsigned long end = hva + PAGE_SIZE;
int ret;
@@ -559,6 +559,7 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, 
pte_t pte)
ret = handle_hva_to_gpa(kvm, hva, end, _set_spte_handler, );
if (ret)
kvm_mips_callbacks->flush_shadow_all(kvm);
+   return 0;
 }
 
 static int kvm_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index fac6f631ed29..ab23379c53a9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -72,7 +72,7 @@ extern int kvm_unmap_hva_range(struct kvm *kvm,
   unsigned long start, unsigned long end);
 extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long 
end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
-extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 #define HPTEG_CACHE_NUM(1 << 15)
 #define HPTEG_HASH_BITS_PTE13
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index fd9893bc7aa1..437613bb609a 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -850,9 +850,10 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
return kvm->arch.kvm_ops->test_age_hva(kvm, hva);
 }
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
kvm->arch.kvm_ops->set_spte_hva(kvm, hva, pte);
+   return 0;
 }
 
 void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/e500_mmu_host.c 

[Resend PATCH V5 6/10] KVM: Replace old tlb flush function with new one to flush a specified range.

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

This patch is to replace kvm_flush_remote_tlbs() with kvm_flush_
remote_tlbs_with_address() in some functions without logic change.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c | 31 +--
 arch/x86/kvm/paging_tmpl.h |  3 ++-
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c1d383532066..af3e9e466014 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1485,8 +1485,12 @@ static bool __drop_large_spte(struct kvm *kvm, u64 
*sptep)
 
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 {
-   if (__drop_large_spte(vcpu->kvm, sptep))
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   if (__drop_large_spte(vcpu->kvm, sptep)) {
+   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+
+   kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+   KVM_PAGES_PER_HPAGE(sp->role.level));
+   }
 }
 
 /*
@@ -1954,7 +1958,8 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 
*spte, gfn_t gfn)
rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp);
 
kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, 0);
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+   KVM_PAGES_PER_HPAGE(sp->role.level));
 }
 
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
@@ -2470,7 +2475,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
account_shadowed(vcpu->kvm, sp);
if (level == PT_PAGE_TABLE_LEVEL &&
  rmap_write_protect(vcpu, gfn))
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 
if (level > PT_PAGE_TABLE_LEVEL && need_sync)
flush |= kvm_sync_pages(vcpu, gfn, _list);
@@ -2590,7 +2595,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
return;
 
drop_parent_pte(child, sptep);
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
}
 }
 
@@ -3014,8 +3019,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep, unsigned pte_access,
ret = RET_PF_EMULATE;
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}
+
if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH || flush)
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn,
+   KVM_PAGES_PER_HPAGE(level));
 
if (unlikely(is_mmio_spte(*sptep)))
ret = RET_PF_EMULATE;
@@ -5681,7 +5688,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 * on PT_WRITABLE_MASK anymore.
 */
if (flush)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
+   memslot->npages);
 }
 
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
@@ -5751,7 +5759,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 * dirty_bitmap.
 */
if (flush)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
+   memslot->npages);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty);
 
@@ -5769,7 +5778,8 @@ void kvm_mmu_slot_largepage_remove_write_access(struct 
kvm *kvm,
lockdep_assert_held(>slots_lock);
 
if (flush)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
+   memslot->npages);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_slot_largepage_remove_write_access);
 
@@ -5786,7 +5796,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 
/* see kvm_mmu_slot_leaf_clear_dirty */
if (flush)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
+   memslot->npages);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7cf2185b7eb5..6bdca39829bc 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -894,7 +894,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, 
hpa_t root_hpa)
pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_with_address(vcpu->kvm,
+   sp->gfn, 
KVM_PAGES_PER_HPAGE(sp->role.level));
 

[Resend PATCH V5 5/10] KVM/MMU: Add tlb flush with range helper function

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

This patch is to add wrapper functions for tlb_remote_flush_with_range
callback and flush tlb directly in kvm_mmu_zap_collapsible_spte().
kvm_mmu_zap_collapsible_spte() returns flush request to the
slot_handle_leaf() and the latter does flush on demand. When
range flush is available, make kvm_mmu_zap_collapsible_spte()
to flush tlb with range directly to avoid returning range back
to slot_handle_leaf().

Signed-off-by: Lan Tianyu 
---
Change since V4:
   Remove flush list interface and flush tlb directly in
   kvm_mmu_zap_collapsible_spte().
Change since V3:
   Remove code of updating "tlbs_dirty"
Change since V2:
   Fix comment in the kvm_flush_remote_tlbs_with_range()
---
 arch/x86/kvm/mmu.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf5f572f2305..c1d383532066 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -264,6 +264,35 @@ static void mmu_spte_set(u64 *sptep, u64 spte);
 static union kvm_mmu_page_role
 kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
 
+
+static inline bool kvm_available_flush_tlb_with_range(void)
+{
+   return kvm_x86_ops->tlb_remote_flush_with_range;
+}
+
+static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
+   struct kvm_tlb_range *range)
+{
+   int ret = -ENOTSUPP;
+
+   if (range && kvm_x86_ops->tlb_remote_flush_with_range)
+   ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range);
+
+   if (ret)
+   kvm_flush_remote_tlbs(kvm);
+}
+
+static void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
+   u64 start_gfn, u64 pages)
+{
+   struct kvm_tlb_range range;
+
+   range.start_gfn = start_gfn;
+   range.pages = pages;
+
+   kvm_flush_remote_tlbs_with_range(kvm, );
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
 {
BUG_ON((mmio_mask & mmio_value) != mmio_value);
@@ -5680,7 +5709,13 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
!kvm_is_reserved_pfn(pfn) &&
PageTransCompoundMap(pfn_to_page(pfn))) {
pte_list_remove(rmap_head, sptep);
-   need_tlb_flush = 1;
+
+   if (kvm_available_flush_tlb_with_range())
+   kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
+   KVM_PAGES_PER_HPAGE(sp->role.level));
+   else
+   need_tlb_flush = 1;
+
goto restart;
}
}
-- 
2.14.4



[Resend PATCH V5 4/10] KVM/VMX: Add hv tlb range flush support

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

This patch is to register tlb_remote_flush_with_range callback with
hv tlb range flush interface.

Signed-off-by: Lan Tianyu 
---
Change since v4:
- Use new function kvm_fill_hv_flush_list_func() to fill flush
   request.
Change since v3:
- Merge Vitaly's don't pass EPT configuration info to
vmx_hv_remote_flush_tlb() fix.
Change since v1:
- Pass flush range with new hyper-v tlb flush struct rather
   than KVM tlb flush struct.
---
 arch/x86/kvm/vmx.c | 69 ++
 1 file changed, 49 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index edbc96cb990a..405dfbde70b2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1567,7 +1567,38 @@ static void check_ept_pointer_match(struct kvm *kvm)
to_kvm_vmx(kvm)->ept_pointers_match = EPT_POINTERS_MATCH;
 }
 
-static int vmx_hv_remote_flush_tlb(struct kvm *kvm)
+int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush,
+   void *data)
+{
+   struct kvm_tlb_range *range = data;
+
+   return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn,
+   range->pages);
+}
+
+static inline int __hv_remote_flush_tlb_with_range(struct kvm *kvm,
+   struct kvm_vcpu *vcpu, struct kvm_tlb_range *range)
+{
+   u64 ept_pointer = to_vmx(vcpu)->ept_pointer;
+
+   /* If ept_pointer is invalid pointer, bypass flush request. */
+   if (!VALID_PAGE(ept_pointer))
+   return 0;
+
+   /*
+* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs address
+* of the base of EPT PML4 table, strip off EPT configuration
+* information.
+*/
+   if (range)
+   return hyperv_flush_guest_mapping_range(ept_pointer & PAGE_MASK,
+   kvm_fill_hv_flush_list_func, (void *)range);
+   else
+   return hyperv_flush_guest_mapping(ept_pointer & PAGE_MASK);
+}
+
+static int hv_remote_flush_tlb_with_range(struct kvm *kvm,
+   struct kvm_tlb_range *range)
 {
struct kvm_vcpu *vcpu;
int ret = -ENOTSUPP, i;
@@ -1577,30 +1608,23 @@ static int vmx_hv_remote_flush_tlb(struct kvm *kvm)
if (to_kvm_vmx(kvm)->ept_pointers_match == EPT_POINTERS_CHECK)
check_ept_pointer_match(kvm);
 
-   /*
-* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs the address of the
-* base of EPT PML4 table, strip off EPT configuration information.
-* If ept_pointer is invalid pointer, bypass the flush request.
-*/
if (to_kvm_vmx(kvm)->ept_pointers_match != EPT_POINTERS_MATCH) {
-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!VALID_PAGE(to_vmx(vcpu)->ept_pointer))
-   return 0;
-
-   ret |= hyperv_flush_guest_mapping(
-   to_vmx(vcpu)->ept_pointer & PAGE_MASK);
-   }
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   ret |= __hv_remote_flush_tlb_with_range(
+   kvm, vcpu, range);
} else {
-   if (!VALID_PAGE(to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer))
-   return 0;
-
-   ret = hyperv_flush_guest_mapping(
-   to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer & PAGE_MASK);
+   ret = __hv_remote_flush_tlb_with_range(kvm,
+   kvm_get_vcpu(kvm, 0), range);
}
 
spin_unlock(_kvm_vmx(kvm)->ept_pointer_lock);
return ret;
 }
+
+static int hv_remote_flush_tlb(struct kvm *kvm)
+{
+   return hv_remote_flush_tlb_with_range(kvm, NULL);
+}
 #else /* !IS_ENABLED(CONFIG_HYPERV) */
 static inline void evmcs_write64(unsigned long field, u64 value) {}
 static inline void evmcs_write32(unsigned long field, u32 value) {}
@@ -7957,8 +7981,11 @@ static __init int hardware_setup(void)
 
 #if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
-   && enable_ept)
-   kvm_x86_ops->tlb_remote_flush = vmx_hv_remote_flush_tlb;
+   && enable_ept) {
+   kvm_x86_ops->tlb_remote_flush = hv_remote_flush_tlb;
+   kvm_x86_ops->tlb_remote_flush_with_range =
+   hv_remote_flush_tlb_with_range;
+   }
 #endif
 
if (!cpu_has_vmx_ple()) {
@@ -11567,6 +11594,8 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
*kvm, unsigned int id)
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
 
+   vmx->ept_pointer = INVALID_PAGE;
+
vmx->msr_ia32_feature_control_valid_bits = FEATURE_CONTROL_LOCKED;
 
/*
-- 
2.14.4



[Resend PATCH V5 3/10] x86/Hyper-v: Add trace in the hyperv_nested_flush_guest_mapping_range()

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

This patch is to trace log in the hyperv_nested_flush_
guest_mapping_range().

Signed-off-by: Lan Tianyu 
---
 arch/x86/hyperv/nested.c|  1 +
 arch/x86/include/asm/trace/hyperv.h | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c
index 3d0f31e46954..dd0a843f766d 100644
--- a/arch/x86/hyperv/nested.c
+++ b/arch/x86/hyperv/nested.c
@@ -130,6 +130,7 @@ int hyperv_flush_guest_mapping_range(u64 as,
else
ret = status;
 fault:
+   trace_hyperv_nested_flush_guest_mapping_range(as, ret);
return ret;
 }
 EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping_range);
diff --git a/arch/x86/include/asm/trace/hyperv.h 
b/arch/x86/include/asm/trace/hyperv.h
index 2e6245a023ef..ace464f09681 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -42,6 +42,20 @@ TRACE_EVENT(hyperv_nested_flush_guest_mapping,
TP_printk("address space %llx ret %d", __entry->as, __entry->ret)
);
 
+TRACE_EVENT(hyperv_nested_flush_guest_mapping_range,
+   TP_PROTO(u64 as, int ret),
+   TP_ARGS(as, ret),
+
+   TP_STRUCT__entry(
+   __field(u64, as)
+   __field(int, ret)
+   ),
+   TP_fast_assign(__entry->as = as;
+  __entry->ret = ret;
+   ),
+   TP_printk("address space %llx ret %d", __entry->as, __entry->ret)
+   );
+
 TRACE_EVENT(hyperv_send_ipi_mask,
TP_PROTO(const struct cpumask *cpus,
 int vector),
-- 
2.14.4



[Resend PATCH V5 2/10] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

Hyper-V provides HvFlushGuestAddressList() hypercall to flush EPT tlb
with specified ranges. This patch is to add the hypercall support.

Reviewed-by:  Michael Kelley 
Signed-off-by: Lan Tianyu 
---
Change sincd v4:
   - Expose function hyperv_fill_flush_guest_mapping_list()
   out of hyperv file
   - Adjust parameter of hyperv_flush_guest_mapping_range()

Change since v2:
  Fix some coding style issues
- Move HV_MAX_FLUSH_PAGES and HV_MAX_FLUSH_REP_COUNT to
hyperv-tlfs.h.
- Calculate HV_MAX_FLUSH_REP_COUNT in the macro definition
- Use HV_MAX_FLUSH_REP_COUNT to define length of gpa_list in
struct hv_guest_mapping_flush_list.

Change since v1:
   Add hyperv tlb flush struct to avoid use kvm tlb flush struct
in the hyperv file.
---
 arch/x86/hyperv/nested.c   | 79 ++
 arch/x86/include/asm/hyperv-tlfs.h | 32 +++
 arch/x86/include/asm/mshyperv.h| 15 
 3 files changed, 126 insertions(+)

diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c
index b8e60cc50461..3d0f31e46954 100644
--- a/arch/x86/hyperv/nested.c
+++ b/arch/x86/hyperv/nested.c
@@ -7,6 +7,7 @@
  *
  * Author : Lan Tianyu 
  */
+#define pr_fmt(fmt)  "Hyper-V: " fmt
 
 
 #include 
@@ -54,3 +55,81 @@ int hyperv_flush_guest_mapping(u64 as)
return ret;
 }
 EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
+
+int hyperv_fill_flush_guest_mapping_list(
+   struct hv_guest_mapping_flush_list *flush,
+   u64 start_gfn, u64 pages)
+{
+   u64 cur = start_gfn;
+   u64 additional_pages;
+   int gpa_n = 0;
+
+   do {
+   /*
+* If flush requests exceed max flush count, go back to
+* flush tlbs without range.
+*/
+   if (gpa_n >= HV_MAX_FLUSH_REP_COUNT)
+   return -ENOSPC;
+
+   additional_pages = min_t(u64, pages, HV_MAX_FLUSH_PAGES) - 1;
+
+   flush->gpa_list[gpa_n].page.additional_pages = additional_pages;
+   flush->gpa_list[gpa_n].page.largepage = false;
+   flush->gpa_list[gpa_n].page.basepfn = cur;
+
+   pages -= additional_pages + 1;
+   cur += additional_pages + 1;
+   gpa_n++;
+   } while (pages > 0);
+
+   return gpa_n;
+}
+EXPORT_SYMBOL_GPL(hyperv_fill_flush_guest_mapping_list);
+
+int hyperv_flush_guest_mapping_range(u64 as,
+   hyperv_fill_flush_list_func fill_flush_list_func, void *data)
+{
+   struct hv_guest_mapping_flush_list **flush_pcpu;
+   struct hv_guest_mapping_flush_list *flush;
+   u64 status = 0;
+   unsigned long flags;
+   int ret = -ENOTSUPP;
+   int gpa_n = 0;
+
+   if (!hv_hypercall_pg || !fill_flush_list_func)
+   goto fault;
+
+   local_irq_save(flags);
+
+   flush_pcpu = (struct hv_guest_mapping_flush_list **)
+   this_cpu_ptr(hyperv_pcpu_input_arg);
+
+   flush = *flush_pcpu;
+   if (unlikely(!flush)) {
+   local_irq_restore(flags);
+   goto fault;
+   }
+
+   flush->address_space = as;
+   flush->flags = 0;
+
+   gpa_n = fill_flush_list_func(flush, data);
+   if (gpa_n < 0) {
+   local_irq_restore(flags);
+   goto fault;
+   }
+
+   status = hv_do_rep_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST,
+gpa_n, 0, flush, NULL);
+
+   local_irq_restore(flags);
+
+   if (!(status & HV_HYPERCALL_RESULT_MASK))
+   ret = 0;
+   else
+   ret = status;
+fault:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping_range);
diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
b/arch/x86/include/asm/hyperv-tlfs.h
index 4139f7650fe5..405a378e1c62 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -10,6 +10,7 @@
 #define _ASM_X86_HYPERV_TLFS_H
 
 #include 
+#include 
 
 /*
  * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
@@ -358,6 +359,7 @@ struct hv_tsc_emulation_status {
 #define HVCALL_POST_MESSAGE0x005c
 #define HVCALL_SIGNAL_EVENT0x005d
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
+#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
 
 #define HV_X64_MSR_VP_ASSIST_PAGE_ENABLE   0x0001
 #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT12
@@ -757,6 +759,36 @@ struct hv_guest_mapping_flush {
u64 flags;
 };
 
+/*
+ *  HV_MAX_FLUSH_PAGES = "additional_pages" + 1. It's limited
+ *  by the bitwidth of "additional_pages" in union hv_gpa_page_range.
+ */
+#define HV_MAX_FLUSH_PAGES (2048)
+
+/* HvFlushGuestPhysicalAddressList hypercall */
+union hv_gpa_page_range {
+   u64 address_space;
+   struct {
+   u64 additional_pages:11;
+   u64 largepage:1;
+  

[Resend PATCH V5 1/10] KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

Add flush range call back in the kvm_x86_ops and platform can use it
to register its associated function. The parameter "kvm_tlb_range"
accepts a single range and flush list which contains a list of ranges.

Signed-off-by: Lan Tianyu 
---
Change since v1:
   Change "end_gfn" to "pages" to aviod confusion as to whether
"end_gfn" is inclusive or exlusive.
---
 arch/x86/include/asm/kvm_host.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55e51ff7e421..c8a65f0a7107 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -439,6 +439,11 @@ struct kvm_mmu {
u64 pdptrs[4]; /* pae */
 };
 
+struct kvm_tlb_range {
+   u64 start_gfn;
+   u64 pages;
+};
+
 enum pmc_type {
KVM_PMC_GP = 0,
KVM_PMC_FIXED,
@@ -1042,6 +1047,8 @@ struct kvm_x86_ops {
 
void (*tlb_flush)(struct kvm_vcpu *vcpu, bool invalidate_gpa);
int  (*tlb_remote_flush)(struct kvm *kvm);
+   int  (*tlb_remote_flush_with_range)(struct kvm *kvm,
+   struct kvm_tlb_range *range);
 
/*
 * Flush any TLB entries associated with the given GVA.
-- 
2.14.4



[PATCH V5 00/10] x86/KVM/Hyper-v: Add HV ept tlb range flush hypercall support in KVM

2018-11-08 Thread lantianyu1986
From: Lan Tianyu 

Sorry. Some patches was blocked and I try to resend via another account.

For nested memory virtualization, Hyper-v doesn't set write-protect
L1 hypervisor EPT page directory and page table node to track changes 
while it relies on guest to tell it changes via HvFlushGuestAddressLlist
hypercall. HvFlushGuestAddressLlist hypercall provides a way to flush
EPT page table with ranges which are specified by L1 hypervisor.

If L1 hypervisor uses INVEPT or HvFlushGuestAddressSpace hypercall to
flush EPT tlb, Hyper-V will invalidate associated EPT shadow page table
and sync L1's EPT table when next EPT page fault is triggered.
HvFlushGuestAddressLlist hypercall helps to avoid such redundant EPT
page fault and synchronization of shadow page table.

This patchset is rebased on the Linux 4.20-rc1 and Patch "KVM/VMX: Check
ept_pointer before flushing ept tlb".(https://www.mail-archive.com/linux
-ker...@vger.kernel.org/msg1798827.html).

Change since v4:
   1) Split flush address and flush list patches. This patchset only 
contains
   flush address patches. Will post flush list patches later.
   2) Expose function hyperv_fill_flush_guest_mapping_list()
   out of hyperv file
   3) Adjust parameter of hyperv_flush_guest_mapping_range()
   4) Reorder patchset and move Hyper-V and VMX changes ahead.

Change since v3:
1) Remove code of updating "tlbs_dirty" in 
kvm_flush_remote_tlbs_with_range()
2) Remove directly tlb flush in the kvm_handle_hva_range()
3) Move tlb flush in kvm_set_pte_rmapp() to 
kvm_mmu_notifier_change_pte()
4) Combine Vitaly's "don't pass EPT configuration info to
vmx_hv_remote_flush_tlb()" fix

Change since v2:
   1) Fix comment in the kvm_flush_remote_tlbs_with_range()
   2) Move HV_MAX_FLUSH_PAGES and HV_MAX_FLUSH_REP_COUNT to
hyperv-tlfs.h.
   3) Calculate HV_MAX_FLUSH_REP_COUNT in the macro definition
   4) Use HV_MAX_FLUSH_REP_COUNT to define length of gpa_list in
struct hv_guest_mapping_flush_list.

Change since v1:
   1) Convert "end_gfn" of struct kvm_tlb_range to "pages" in order
  to avoid confusion as to whether "end_gfn" is inclusive or exlusive.
   2) Add hyperv tlb range struct and replace kvm tlb range struct
  with new struct in order to avoid using kvm struct in the hyperv
  code directly.


Lan Tianyu (10):
  KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
  x86/hyper-v: Add HvFlushGuestAddressList hypercall support
  x86/Hyper-v: Add trace in the
hyperv_nested_flush_guest_mapping_range()
  KVM/VMX: Add hv tlb range flush support
  KVM/MMU: Add tlb flush with range helper function
  KVM: Replace old tlb flush function with new one to flush a specified
range.
  KVM: Make kvm_set_spte_hva() return int
  KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to
kvm_mmu_notifier_change_pte()
  KVM/MMU: Flush tlb directly in the kvm_set_pte_rmapp()
  KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()

 arch/arm/include/asm/kvm_host.h |  2 +-
 arch/arm64/include/asm/kvm_host.h   |  2 +-
 arch/mips/include/asm/kvm_host.h|  2 +-
 arch/mips/kvm/mmu.c |  3 +-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c   |  3 +-
 arch/powerpc/kvm/e500_mmu_host.c|  3 +-
 arch/x86/hyperv/nested.c| 80 +++
 arch/x86/include/asm/hyperv-tlfs.h  | 32 +
 arch/x86/include/asm/kvm_host.h |  9 +++-
 arch/x86/include/asm/mshyperv.h | 15 ++
 arch/x86/include/asm/trace/hyperv.h | 14 ++
 arch/x86/kvm/mmu.c  | 96 +
 arch/x86/kvm/paging_tmpl.h  |  3 +-
 arch/x86/kvm/vmx.c  | 69 ++
 virt/kvm/arm/mmu.c  |  6 ++-
 virt/kvm/kvm_main.c |  5 +-
 17 files changed, 295 insertions(+), 51 deletions(-)

-- 
2.14.4



Re: [PATCH] powerpc: Mark variable `cpumsr` as unused

2018-11-08 Thread Breno Leitao
Hi Mathieu, Christophe

Thanks for spotting and fixing this bug.

On 11/08/2018 05:25 AM, Mathieu Malaterre wrote:
> On Thu, Nov 8, 2018 at 7:09 AM Christophe Leroy  
> wrote:
>>
>>
>>
>> On 11/07/2018 08:26 PM, Mathieu Malaterre wrote:
>>> Add gcc attribute unused for `cpumsr` variable.
>>>
>>> Fix warnings treated as errors with W=1:
>>>
>>>arch/powerpc/kernel/process.c:231:16: error: variable ‘cpumsr’ set but 
>>> not used [-Werror=unused-but-set-variable]
>>>arch/powerpc/kernel/process.c:296:16: error: variable ‘cpumsr’ set but 
>>> not used [-Werror=unused-but-set-variable]
>>>
>>> Signed-off-by: Mathieu Malaterre 
>>
>> I don't think this is the good way to fix that. This problem was
>> introduced by commit 5c784c8414fb ("powerpc/tm: Remove
>> msr_tm_active()"). That commit should be reverted and fixed.
> 
> I see, it makes sense.
> 
>> That commit should have removed the macro and kept the inline function.
> 
> Breno, what do you think ?

Turning this macro into a function might cause the code to be more confused,
since all the other TM states bits are checked using a macro, for example:

 MSR_TM_SUSPENDED Checks if the MSR has Suspended bits set
 MSR_TM_TRANSACTIONAL Checks if the MSR has the transactional bits set
 MSR_TM_RESV  Checks if the MSR has the TM reserved bits set

That said, I understand that it makes sense to have an uniform way to check
for TM bits in MSR, thus having a MSR_TM_ACTIVE macro to check for the active
bits. Using a non-uniform function just to fix this warning seems to be an
overkill. Reverting the patch seems to bring back the old style, which is
having a macro and a function with the same name, where the function just
calls the macro.

Anyway, I think it might have other ways to fix warning, as I can think now:

 1) Avoid setting cpumsr if CONFIG_PPC_TRANSACTIONAL_MEM is not enabled

 2) If !CONFIG_PPC_TRANSACTIONAL_MEM, redefine MSR_TM_ACTIVE(x) to something
as (x & 0) instead of 0.

 3) Avoid double definition of MSR_TM_ACTIVE, i.e, have the same definition
independent of PPC_TRANSACTIONAL_MEM being set or not.

Anyway, I would like to try option 3), which is the hardest one to implement
and validate, but it seems to be the most correct option, once it checks for
a MSR bit configuration, and the caller should have the logic.


Re: [PATCH 35/36] dt-bindings: arm: Convert Xilinx board/soc bindings to json-schema

2018-11-08 Thread Michal Simek
On 05. 10. 18 18:58, Rob Herring wrote:
> Convert Xilinx SoC bindings to DT schema format using json-schema.
> 
> Cc: Mark Rutland 
> Cc: Michal Simek 
> Cc: devicet...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Signed-off-by: Rob Herring 
> ---
>  .../devicetree/bindings/arm/xilinx.txt| 83 ---
>  .../devicetree/bindings/arm/xilinx.yaml   | 81 ++
>  2 files changed, 81 insertions(+), 83 deletions(-)
>  delete mode 100644 Documentation/devicetree/bindings/arm/xilinx.txt
>  create mode 100644 Documentation/devicetree/bindings/arm/xilinx.yaml
> 
> diff --git a/Documentation/devicetree/bindings/arm/xilinx.txt 
> b/Documentation/devicetree/bindings/arm/xilinx.txt
> deleted file mode 100644
> index 26fe5ecc4332..
> --- a/Documentation/devicetree/bindings/arm/xilinx.txt
> +++ /dev/null
> @@ -1,83 +0,0 @@
> -Xilinx Zynq Platforms Device Tree Bindings
> -
> -Boards with Zynq-7000 SOC based on an ARM Cortex A9 processor
> -shall have the following properties.
> -
> -Required root node properties:
> -- compatible = "xlnx,zynq-7000";
> -
> -Additional compatible strings:
> -
> -- Adapteva Parallella board
> -  "adapteva,parallella"
> -
> -- Avnet MicroZed board
> -  "avnet,zynq-microzed"
> -  "xlnx,zynq-microzed"
> -
> -- Avnet ZedBoard board
> -  "avnet,zynq-zed"
> -  "xlnx,zynq-zed"
> -
> -- Digilent Zybo board
> -  "digilent,zynq-zybo"
> -
> -- Digilent Zybo Z7 board
> -  "digilent,zynq-zybo-z7"
> -
> -- Xilinx CC108 internal board
> -  "xlnx,zynq-cc108"
> -
> -- Xilinx ZC702 internal board
> -  "xlnx,zynq-zc702"
> -
> -- Xilinx ZC706 internal board
> -  "xlnx,zynq-zc706"
> -
> -- Xilinx ZC770 internal board, with different FMC cards
> -  "xlnx,zynq-zc770-xm010"
> -  "xlnx,zynq-zc770-xm011"
> -  "xlnx,zynq-zc770-xm012"
> -  "xlnx,zynq-zc770-xm013"
> -
> 
> -
> -Xilinx Zynq UltraScale+ MPSoC Platforms Device Tree Bindings
> -
> -Boards with ZynqMP SOC based on an ARM Cortex A53 processor
> -shall have the following properties.
> -
> -Required root node properties:
> -- compatible = "xlnx,zynqmp";
> -
> -
> -Additional compatible strings:
> -
> -- Xilinx internal board zc1232
> -  "xlnx,zynqmp-zc1232-revA", "xlnx,zynqmp-zc1232"
> -
> -- Xilinx internal board zc1254
> -  "xlnx,zynqmp-zc1254-revA", "xlnx,zynqmp-zc1254"
> -
> -- Xilinx internal board zc1275
> -  "xlnx,zynqmp-zc1275-revA", "xlnx,zynqmp-zc1275"
> -
> -- Xilinx internal board zc1751
> -  "xlnx,zynqmp-zc1751"
> -
> -- Xilinx 96boards compatible board zcu100
> -  "xlnx,zynqmp-zcu100-revC", "xlnx,zynqmp-zcu100"
> -
> -- Xilinx evaluation board zcu102
> -  "xlnx,zynqmp-zcu102-revA", "xlnx,zynqmp-zcu102"
> -  "xlnx,zynqmp-zcu102-revB", "xlnx,zynqmp-zcu102"
> -  "xlnx,zynqmp-zcu102-rev1.0", "xlnx,zynqmp-zcu102"
> -
> -- Xilinx evaluation board zcu104
> -  "xlnx,zynqmp-zcu104-revA", "xlnx,zynqmp-zcu104"
> -
> -- Xilinx evaluation board zcu106
> -  "xlnx,zynqmp-zcu106-revA", "xlnx,zynqmp-zcu106"
> -
> -- Xilinx evaluation board zcu111
> -  "xlnx,zynqmp-zcu111-revA", "xlnx,zynqmp-zcu111"
> diff --git a/Documentation/devicetree/bindings/arm/xilinx.yaml 
> b/Documentation/devicetree/bindings/arm/xilinx.yaml
> new file mode 100644
> index ..dd227bccf015
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/arm/xilinx.yaml
> @@ -0,0 +1,81 @@
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/soc/arm/xilinx.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Xilinx Zynq Platforms Device Tree Bindings
> +
> +maintainers:
> +  - Michal Simek 
> +
> +description: |
> +  Xilinx boards with Zynq-7000 SOC or Zynq UltraScale+ MPSoC
> +
> +properties:
> +  $nodename:
> +const: '/'
> +  compatible:
> +oneOf:
> +  - items:
> +  - enum:
> +  - adapteva,parallella
> +  - digilent,zynq-zybo
> +  - digilent,zynq-zybo-z7
> +  - xlnx,zynq-cc108
> +  - xlnx,zynq-zc702
> +  - xlnx,zynq-zc706
> +  - xlnx,zynq-zc770-xm010
> +  - xlnx,zynq-zc770-xm011
> +  - xlnx,zynq-zc770-xm012
> +  - xlnx,zynq-zc770-xm013
> +  - const: xlnx,zynq-7000
> +
> +  - items:
> +  - const: avnet,zynq-microzed
> +  - const: xlnx,zynq-microzed
> +  - const: xlnx,zynq-7000
> +
> +  - items:
> +  - const: avnet,zynq-zed
> +  - const: xlnx,zynq-zed
> +  - const: xlnx,zynq-7000
> +
> +  - items:
> +  - enum:
> +  - xlnx,zynqmp-zc1751
> +  - const: xlnx,zynqmp
> +
> +  - description: Xilinx internal board zc1232
> +items:
> +  - const: xlnx,zynqmp-zc1232-revA
> +  - const: xlnx,zynqmp-zc1232
> +  - const: xlnx,zynqmp
> +
> +  - description: Xilinx internal board zc1254
> +items:
> +  - const: xlnx,zynqmp-zc1254-revA
> +  - const: 

Re: [PATCH kernel 2/3] vfio_pci: Allow regions to add own capabilities

2018-11-08 Thread David Gibson
On Thu, Nov 08, 2018 at 05:48:58PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 08/11/2018 17:21, David Gibson wrote:
> > On Mon, Oct 15, 2018 at 08:42:32PM +1100, Alexey Kardashevskiy wrote:
> >> VFIO regions already support region capabilities with a limited set of
> >> fields. However the subdriver might have to report to the userspace
> >> additional bits.
> >>
> >> This adds an add_capability() hook to vfio_pci_regops.
> >>
> >> This is aiming Witherspoon POWER9 machines which have multiple
> >> interconnected NVIDIA V100 GPUs with coherent RAM; each GPU's RAM
> >> is mapped to a system bus and to each of GPU internal system bus and
> >> the GPUs use this for DMA routing as DMA trafic can go via any
> >> of many NVLink2 (GPU-GPU or GPU-CPU) or even stay local within a
> >> GPU.
> > 
> > This description doesn't really make clear how per-region capabilities
> > are relevant to these devices.
> 
> 
> I am confused. This patch just adds a hook, and the device specifics are
> explained in the next patch where they are used...

Well, my point is the last paragraph of the commit message appears to
be a rationale for this change in terms of what's needed for the GPU
devices.  But how those described properties of the GPU mean that
region capabilites are useful / necessary isn't made clear.  If it's
not meant to be a rationale, I'm not sure what it's doing there at
all.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH v9 9/9] powerpc: clean stack pointers naming

2018-11-08 Thread Christophe Leroy
Some stack pointers used to also be thread_info pointers
and were called tp. Now that they are only stack pointers,
rename them sp.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c  | 17 +++--
 arch/powerpc/kernel/setup_64.c | 20 ++--
 2 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 62cfccf4af89..754f0efc507b 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -659,21 +659,21 @@ void __do_irq(struct pt_regs *regs)
 void do_IRQ(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
-   void *curtp, *irqtp, *sirqtp;
+   void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
-   curtp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
-   irqtp = hardirq_ctx[raw_smp_processor_id()];
-   sirqtp = softirq_ctx[raw_smp_processor_id()];
+   cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
+   irqsp = hardirq_ctx[raw_smp_processor_id()];
+   sirqsp = softirq_ctx[raw_smp_processor_id()];
 
/* Already there ? */
-   if (unlikely(curtp == irqtp || curtp == sirqtp)) {
+   if (unlikely(cursp == irqsp || cursp == sirqsp)) {
__do_irq(regs);
set_irq_regs(old_regs);
return;
}
/* Switch stack and call */
-   call_do_irq(regs, irqtp);
+   call_do_irq(regs, irqsp);
 
set_irq_regs(old_regs);
 }
@@ -732,10 +732,7 @@ void irq_ctx_init(void)
 
 void do_softirq_own_stack(void)
 {
-   void *irqtp;
-
-   irqtp = softirq_ctx[smp_processor_id()];
-   call_do_softirq(irqtp);
+   call_do_softirq(softirq_ctx[smp_processor_id()]);
 }
 
 irq_hw_number_t virq_to_hw(unsigned int virq)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index be6a2ea250a0..61d159d845aa 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -716,22 +716,22 @@ void __init emergency_stack_init(void)
limit = min(ppc64_bolted_size(), ppc64_rma_size);
 
for_each_possible_cpu(i) {
-   void *ti;
+   void *sp;
 
-   ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
-   paca_ptrs[i]->emergency_sp = ti + THREAD_SIZE;
+   sp = alloc_stack(limit, i);
+   memset(sp, 0, THREAD_SIZE);
+   paca_ptrs[i]->emergency_sp = sp + THREAD_SIZE;
 
 #ifdef CONFIG_PPC_BOOK3S_64
/* emergency stack for NMI exception handling. */
-   ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
-   paca_ptrs[i]->nmi_emergency_sp = ti + THREAD_SIZE;
+   sp = alloc_stack(limit, i);
+   memset(sp, 0, THREAD_SIZE);
+   paca_ptrs[i]->nmi_emergency_sp = sp + THREAD_SIZE;
 
/* emergency stack for machine check exception handling. */
-   ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
-   paca_ptrs[i]->mc_emergency_sp = ti + THREAD_SIZE;
+   sp = alloc_stack(limit, i);
+   memset(sp, 0, THREAD_SIZE);
+   paca_ptrs[i]->mc_emergency_sp = sp + THREAD_SIZE;
 #endif
}
 }
-- 
2.13.3



[PATCH v9 8/9] powerpc/64: Remove CURRENT_THREAD_INFO

2018-11-08 Thread Christophe Leroy
Now that current_thread_info is located at the beginning of 'current'
task struct, CURRENT_THREAD_INFO macro is not really needed any more.

This patch replaces it by loads of the value at PACACURRENT(r13).

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/exception-64s.h   |  4 ++--
 arch/powerpc/include/asm/thread_info.h |  4 
 arch/powerpc/kernel/entry_64.S | 10 +-
 arch/powerpc/kernel/exceptions-64e.S   |  2 +-
 arch/powerpc/kernel/exceptions-64s.S   |  2 +-
 arch/powerpc/kernel/idle_book3e.S  |  2 +-
 arch/powerpc/kernel/idle_power4.S  |  2 +-
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  6 +++---
 8 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 3b4767ed3ec5..dd6a5ae7a769 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -671,7 +671,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 
 #define RUNLATCH_ON\
 BEGIN_FTR_SECTION  \
-   CURRENT_THREAD_INFO(r3, r1);\
+   ld  r3, PACACURRENT(r13);   \
ld  r4,TI_LOCAL_FLAGS(r3);  \
andi.   r0,r4,_TLF_RUNLATCH;\
beqlppc64_runlatch_on_trampoline;   \
@@ -721,7 +721,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CTRL)
 #ifdef CONFIG_PPC_970_NAP
 #define FINISH_NAP \
 BEGIN_FTR_SECTION  \
-   CURRENT_THREAD_INFO(r11, r1);   \
+   ld  r11, PACACURRENT(r13);  \
ld  r9,TI_LOCAL_FLAGS(r11); \
andi.   r10,r9,_TLF_NAPPING;\
bnelpower4_fixup_nap;   \
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index c959b8d66cac..8e1d0195ac36 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -17,10 +17,6 @@
 
 #define THREAD_SIZE(1 << THREAD_SHIFT)
 
-#ifdef CONFIG_PPC64
-#define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(ld dest, 
PACACURRENT(r13))
-#endif
-
 #ifndef __ASSEMBLY__
 #include 
 #include 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 03cbf409c3f8..b017bd3da1ed 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -158,7 +158,7 @@ system_call:/* label this so stack 
traces look sane */
li  r10,IRQS_ENABLED
std r10,SOFTE(r1)
 
-   CURRENT_THREAD_INFO(r11, r1)
+   ld  r11, PACACURRENT(r13)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne .Lsyscall_dotrace   /* does not return */
@@ -205,7 +205,7 @@ system_call:/* label this so stack 
traces look sane */
ld  r3,RESULT(r1)
 #endif
 
-   CURRENT_THREAD_INFO(r12, r1)
+   ld  r12, PACACURRENT(r13)
 
ld  r8,_MSR(r1)
 #ifdef CONFIG_PPC_BOOK3S
@@ -336,7 +336,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 
/* Repopulate r9 and r10 for the syscall path */
addir9,r1,STACK_FRAME_OVERHEAD
-   CURRENT_THREAD_INFO(r10, r1)
+   ld  r10, PACACURRENT(r13)
ld  r10,TI_FLAGS(r10)
 
cmpldi  r0,NR_syscalls
@@ -734,7 +734,7 @@ _GLOBAL(ret_from_except_lite)
mtmsrd  r10,1 /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
 
-   CURRENT_THREAD_INFO(r9, r1)
+   ld  r9, PACACURRENT(r13)
ld  r3,_MSR(r1)
 #ifdef CONFIG_PPC_BOOK3E
ld  r10,PACACURRENT(r13)
@@ -848,7 +848,7 @@ resume_kernel:
 1: bl  preempt_schedule_irq
 
/* Re-test flags and eventually loop */
-   CURRENT_THREAD_INFO(r9, r1)
+   ld  r9, PACACURRENT(r13)
ld  r4,TI_FLAGS(r9)
andi.   r0,r4,_TIF_NEED_RESCHED
bne 1b
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 231d066b4a3d..dfafcd0af009 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -469,7 +469,7 @@ exc_##n##_bad_stack:
\
  * interrupts happen before the wait instruction.
  */
 #define CHECK_NAPPING()
\
-   CURRENT_THREAD_INFO(r11, r1);   \
+   ld  r11, PACACURRENT(r13);  \
ld  r10,TI_LOCAL_FLAGS(r11);\
andi.   r9,r10,_TLF_NAPPING;\
beq+1f; \
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 89d32bb79d5e..1cbe1a78df57 100644

[PATCH v9 7/9] powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU

2018-11-08 Thread Christophe Leroy
Now that thread_info is similar to task_struct, it's address is in r2
so CURRENT_THREAD_INFO() macro is useless. This patch removes it.

At the same time, as the 'cpu' field is not anymore in thread_info,
this patch renames it to TASK_CPU.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Makefile  |  2 +-
 arch/powerpc/include/asm/thread_info.h |  2 --
 arch/powerpc/kernel/asm-offsets.c  |  2 +-
 arch/powerpc/kernel/entry_32.S | 43 --
 arch/powerpc/kernel/epapr_hcalls.S |  5 ++--
 arch/powerpc/kernel/head_fsl_booke.S   |  5 ++--
 arch/powerpc/kernel/idle_6xx.S |  8 +++
 arch/powerpc/kernel/idle_e500.S|  8 +++
 arch/powerpc/kernel/misc_32.S  |  3 +--
 arch/powerpc/mm/hash_low_32.S  | 14 ---
 arch/powerpc/sysdev/6xx-suspend.S  |  5 ++--
 11 files changed, 35 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 0bb705f36a08..f7ee70885a55 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -417,7 +417,7 @@ ifdef CONFIG_SMP
 prepare: task_cpu_prepare
 
 task_cpu_prepare: prepare0
-   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") 
print $$3;}' include/generated/asm-offsets.h))
+   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == 
"TASK_CPU") print $$3;}' include/generated/asm-offsets.h))
 endif
 
 # Check toolchain versions:
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index d91523c2c7d8..c959b8d66cac 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -19,8 +19,6 @@
 
 #ifdef CONFIG_PPC64
 #define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(ld dest, 
PACACURRENT(r13))
-#else
-#define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(mr dest, r2)
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 94ac190a0b16..03439785c2ea 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -96,7 +96,7 @@ int main(void)
 #endif /* CONFIG_PPC64 */
OFFSET(TASK_STACK, task_struct, stack);
 #ifdef CONFIG_SMP
-   OFFSET(TI_CPU, task_struct, cpu);
+   OFFSET(TASK_CPU, task_struct, cpu);
 #endif
 
 #ifdef CONFIG_LIVEPATCH
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index bd3b146e18a3..d0c546ce387e 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -168,8 +168,7 @@ transfer_to_handler:
tophys(r11,r11)
addir11,r11,global_dbcr0@l
 #ifdef CONFIG_SMP
-   CURRENT_THREAD_INFO(r9, r1)
-   lwz r9,TI_CPU(r9)
+   lwz r9,TASK_CPU(r2)
slwir9,r9,3
add r11,r11,r9
 #endif
@@ -180,8 +179,7 @@ transfer_to_handler:
stw r12,4(r11)
 #endif
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9, r9)
+   tophys(r9, r2)
ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
 #endif
 
@@ -195,8 +193,7 @@ transfer_to_handler:
ble-stack_ovf   /* then the kernel stack overflowed */
 5:
 #if defined(CONFIG_6xx) || defined(CONFIG_E500)
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9,r9)   /* check local flags */
+   tophys(r9,r2)   /* check local flags */
lwz r12,TI_LOCAL_FLAGS(r9)
mtcrf   0x01,r12
bt- 31-TLF_NAPPING,4f
@@ -345,8 +342,7 @@ _GLOBAL(DoSyscall)
mtmsr   r11
 1:
 #endif /* CONFIG_TRACE_IRQFLAGS */
-   CURRENT_THREAD_INFO(r10, r1)
-   lwz r11,TI_FLAGS(r10)
+   lwz r11,TI_FLAGS(r2)
andi.   r11,r11,_TIF_SYSCALL_DOTRACE
bne-syscall_dotrace
 syscall_dotrace_cont:
@@ -379,13 +375,12 @@ ret_from_syscall:
lwz r3,GPR3(r1)
 #endif
mr  r6,r3
-   CURRENT_THREAD_INFO(r12, r1)
/* disable interrupts so current_thread_info()->flags can't change */
LOAD_MSR_KERNEL(r10,MSR_KERNEL) /* doesn't include MSR_EE */
/* Note: We don't bother telling lockdep about it */
SYNC
MTMSRD(r10)
-   lwz r9,TI_FLAGS(r12)
+   lwz r9,TI_FLAGS(r2)
li  r8,-MAX_ERRNO
andi.   
r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
bne-syscall_exit_work
@@ -432,8 +427,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
andi.   r4,r8,MSR_PR
beq 3f
-   CURRENT_THREAD_INFO(r4, r1)
-   ACCOUNT_CPU_USER_EXIT(r4, r5, r7)
+   ACCOUNT_CPU_USER_EXIT(r2, r5, r7)
 3:
 #endif
lwz r4,_LINK(r1)
@@ -526,7 +520,7 @@ syscall_exit_work:
/* Clear per-syscall TIF flags if any are set.  */
 
li  r11,_TIF_PERSYSCALL_MASK
-   addir12,r12,TI_FLAGS
+   addir12,r2,TI_FLAGS
 3: lwarx   r8,0,r12

[PATCH v9 6/9] powerpc: 'current_set' is now a table of task_struct pointers

2018-11-08 Thread Christophe Leroy
The table of pointers 'current_set' has been used for retrieving
the stack and current. They used to be thread_info pointers as
they were pointing to the stack and current was taken from the
'task' field of the thread_info.

Now, the pointers of 'current_set' table are now both pointers
to task_struct and pointers to thread_info.

As they are used to get current, and the stack pointer is
retrieved from current's stack field, this patch changes
their type to task_struct, and renames secondary_ti to
secondary_current.

Reviewed-by: Nicholas Piggin 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 ++--
 arch/powerpc/kernel/head_32.S |  6 +++---
 arch/powerpc/kernel/head_44x.S|  4 ++--
 arch/powerpc/kernel/head_fsl_booke.S  |  4 ++--
 arch/powerpc/kernel/smp.c | 10 --
 5 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index ec691d489656..b1b999b22a12 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -23,8 +23,8 @@
 #include 
 
 /* SMP */
-extern struct thread_info *current_set[NR_CPUS];
-extern struct thread_info *secondary_ti;
+extern struct task_struct *current_set[NR_CPUS];
+extern struct task_struct *secondary_current;
 void start_secondary(void *unused);
 
 /* kexec */
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 44dfd73b2a62..ba0341bd5a00 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -842,9 +842,9 @@ __secondary_start:
 #endif /* CONFIG_6xx */
 
/* get current's stack and current */
-   lis r1,secondary_ti@ha
-   tophys(r1,r1)
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   tophys(r2,r2)
+   lwz r2,secondary_current@l(r2)
tophys(r1,r2)
lwz r1,TASK_STACK(r1)
 
diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index 2c7e90f36358..48e4de4dfd0c 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -1021,8 +1021,8 @@ _GLOBAL(start_secondary_47x)
/* Now we can get our task struct and real stack pointer */
 
/* Get current's stack and current */
-   lis r1,secondary_ti@ha
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   lwz r2,secondary_current@l(r2)
lwz r1,TASK_STACK(r2)
 
/* Current stack pointer */
diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index b8a2b789677e..0d27bfff52dd 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -1076,8 +1076,8 @@ __secondary_start:
bl  call_setup_cpu
 
/* get current's stack and current */
-   lis r1,secondary_ti@ha
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   lwz r2,secondary_current@l(r2)
lwz r1,TASK_STACK(r2)
 
/* stack */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index aa4517686f90..a41fa8924004 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,7 +76,7 @@
 static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 #endif
 
-struct thread_info *secondary_ti;
+struct task_struct *secondary_current;
 bool has_big_cores;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
@@ -664,7 +664,7 @@ void smp_send_stop(void)
 }
 #endif /* CONFIG_NMI_IPI */
 
-struct thread_info *current_set[NR_CPUS];
+struct task_struct *current_set[NR_CPUS];
 
 static void smp_store_cpu_info(int id)
 {
@@ -929,7 +929,7 @@ void smp_prepare_boot_cpu(void)
paca_ptrs[boot_cpuid]->__current = current;
 #endif
set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
-   current_set[boot_cpuid] = task_thread_info(current);
+   current_set[boot_cpuid] = current;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -1014,15 +1014,13 @@ static bool secondaries_inhibited(void)
 
 static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
 {
-   struct thread_info *ti = task_thread_info(idle);
-
 #ifdef CONFIG_PPC64
paca_ptrs[cpu]->__current = idle;
paca_ptrs[cpu]->kstack = (unsigned long)task_stack_page(idle) +
 THREAD_SIZE - STACK_FRAME_OVERHEAD;
 #endif
idle->cpu = cpu;
-   secondary_ti = current_set[cpu] = ti;
+   secondary_current = current_set[cpu] = idle;
 }
 
 int __cpu_up(unsigned int cpu, struct task_struct *tidle)
-- 
2.13.3



[PATCH v9 5/9] powerpc: regain entire stack space

2018-11-08 Thread Christophe Leroy
thread_info is not anymore in the stack, so the entire stack
can now be used.

There is also no risk anymore of corrupting task_cpu(p) with a
stack overflow so the patch removes the test.

When doing this, an explicit test for NULL stack pointer is
needed in validate_sp() as it is not anymore implicitely covered
by the sizeof(thread_info) gap.

In the meantime, with the previous patch all pointers to the stacks
are not anymore pointers to thread_info so this patch changes them
to void*

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/irq.h   | 10 +-
 arch/powerpc/include/asm/processor.h |  3 +--
 arch/powerpc/kernel/asm-offsets.c|  1 -
 arch/powerpc/kernel/entry_32.S   | 14 --
 arch/powerpc/kernel/irq.c| 19 +--
 arch/powerpc/kernel/misc_32.S|  6 ++
 arch/powerpc/kernel/process.c| 32 +---
 arch/powerpc/kernel/setup_64.c   |  8 
 8 files changed, 38 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 2efbae8d93be..966ddd4d2414 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -48,9 +48,9 @@ struct pt_regs;
  * Per-cpu stacks for handling critical, debug and machine check
  * level interrupts.
  */
-extern struct thread_info *critirq_ctx[NR_CPUS];
-extern struct thread_info *dbgirq_ctx[NR_CPUS];
-extern struct thread_info *mcheckirq_ctx[NR_CPUS];
+extern void *critirq_ctx[NR_CPUS];
+extern void *dbgirq_ctx[NR_CPUS];
+extern void *mcheckirq_ctx[NR_CPUS];
 extern void exc_lvl_ctx_init(void);
 #else
 #define exc_lvl_ctx_init()
@@ -59,8 +59,8 @@ extern void exc_lvl_ctx_init(void);
 /*
  * Per-cpu stacks for handling hard and soft interrupts.
  */
-extern struct thread_info *hardirq_ctx[NR_CPUS];
-extern struct thread_info *softirq_ctx[NR_CPUS];
+extern void *hardirq_ctx[NR_CPUS];
+extern void *softirq_ctx[NR_CPUS];
 
 extern void irq_ctx_init(void);
 void call_do_softirq(void *sp);
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 15acb282a876..8179b64871ed 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -325,8 +325,7 @@ struct thread_struct {
 #define ARCH_MIN_TASKALIGN 16
 
 #define INIT_SP(sizeof(init_stack) + (unsigned long) 
_stack)
-#define INIT_SP_LIMIT \
-   (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)_stack)
+#define INIT_SP_LIMIT  ((unsigned long)_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 1fb52206c106..94ac190a0b16 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -92,7 +92,6 @@ int main(void)
DEFINE(SIGSEGV, SIGSEGV);
DEFINE(NMI_MASK, NMI_MASK);
 #else
-   DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16));
OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
 #endif /* CONFIG_PPC64 */
OFFSET(TASK_STACK, task_struct, stack);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index fa7a69ffb37a..bd3b146e18a3 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -97,14 +97,11 @@ crit_transfer_to_handler:
mfspr   r0,SPRN_SRR1
stw r0,_SRR1(r11)
 
-   /* set the stack limit to the current stack
-* and set the limit to protect the thread_info
-* struct
-*/
+   /* set the stack limit to the current stack */
mfspr   r8,SPRN_SPRG_THREAD
lwz r0,KSP_LIMIT(r8)
stw r0,SAVED_KSP_LIMIT(r11)
-   rlwimi  r0,r1,0,0,(31-THREAD_SHIFT)
+   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
@@ -121,14 +118,11 @@ crit_transfer_to_handler:
mfspr   r0,SPRN_SRR1
stw r0,crit_srr1@l(0)
 
-   /* set the stack limit to the current stack
-* and set the limit to protect the thread_info
-* struct
-*/
+   /* set the stack limit to the current stack */
mfspr   r8,SPRN_SPRG_THREAD
lwz r0,KSP_LIMIT(r8)
stw r0,saved_ksp_limit@l(0)
-   rlwimi  r0,r1,0,0,(31-THREAD_SHIFT)
+   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 3fdb6b6973cf..62cfccf4af89 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -618,9 +618,8 @@ static inline void check_stack_overflow(void)
sp = current_stack_pointer() & (THREAD_SIZE-1);
 
/* check for stack overflow: is there less than 2KB free? */
-   if (unlikely(sp < (sizeof(struct thread_info) + 2048))) {
-   pr_err("do_IRQ: stack overflow: %ld\n",
-   sp - sizeof(struct thread_info));
+  

[PATCH v9 4/9] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

2018-11-08 Thread Christophe Leroy
This patch activates CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

This has the following consequences:
- thread_info is now located at the beginning of task_struct.
- The 'cpu' field is now in task_struct, and only exists when
CONFIG_SMP is active.
- thread_info doesn't have anymore the 'task' field.

This patch:
- Removes all recopy of thread_info struct when the stack changes.
- Changes the CURRENT_THREAD_INFO() macro to point to current.
- Selects CONFIG_THREAD_INFO_IN_TASK.
- Modifies raw_smp_processor_id() to get ->cpu from current without
including linux/sched.h to avoid circular inclusion and without
including asm/asm-offsets.h to avoid symbol names duplication
between ASM constants and C constants.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/Makefile  |  7 +
 arch/powerpc/include/asm/ptrace.h  |  2 +-
 arch/powerpc/include/asm/smp.h | 17 +++-
 arch/powerpc/include/asm/thread_info.h | 17 ++--
 arch/powerpc/kernel/asm-offsets.c  |  7 +++--
 arch/powerpc/kernel/entry_32.S |  9 +++
 arch/powerpc/kernel/exceptions-64e.S   | 11 
 arch/powerpc/kernel/head_32.S  |  6 ++---
 arch/powerpc/kernel/head_44x.S |  4 +--
 arch/powerpc/kernel/head_64.S  |  1 +
 arch/powerpc/kernel/head_booke.h   |  8 +-
 arch/powerpc/kernel/head_fsl_booke.S   |  7 +++--
 arch/powerpc/kernel/irq.c  | 47 +-
 arch/powerpc/kernel/kgdb.c | 28 
 arch/powerpc/kernel/machine_kexec_64.c |  6 ++---
 arch/powerpc/kernel/setup_64.c | 21 ---
 arch/powerpc/kernel/smp.c  |  2 +-
 arch/powerpc/net/bpf_jit32.h   |  5 ++--
 19 files changed, 52 insertions(+), 154 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8be31261aec8..fd634a293d7f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -238,6 +238,7 @@ config PPC
select RTC_LIB
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
+   select THREAD_INFO_IN_TASK
select VIRT_TO_BUS  if !PPC64
#
# Please keep this list sorted alphabetically.
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 8a2ce14d68d0..0bb705f36a08 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -413,6 +413,13 @@ else
 endif
 endif
 
+ifdef CONFIG_SMP
+prepare: task_cpu_prepare
+
+task_cpu_prepare: prepare0
+   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") 
print $$3;}' include/generated/asm-offsets.h))
+endif
+
 # Check toolchain versions:
 # - gcc-4.6 is the minimum kernel-wide version so nothing required.
 checkbin:
diff --git a/arch/powerpc/include/asm/ptrace.h 
b/arch/powerpc/include/asm/ptrace.h
index f73886a1a7f5..0ab96da68b51 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -156,7 +156,7 @@ extern int ptrace_put_reg(struct task_struct *task, int 
regno,
  unsigned long data);
 
 #define current_pt_regs() \
-   ((struct pt_regs *)((unsigned long)current_thread_info() + THREAD_SIZE) 
- 1)
+   ((struct pt_regs *)((unsigned long)task_stack_page(current) + 
THREAD_SIZE) - 1)
 /*
  * We use the least-significant bit of the trap field to indicate
  * whether we have saved the full set of registers, or only a
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 41695745032c..0de717e16dd6 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -83,7 +83,22 @@ int is_cpu_dead(unsigned int cpu);
 /* 32-bit */
 extern int smp_hw_index[];
 
-#define raw_smp_processor_id() (current_thread_info()->cpu)
+/*
+ * This is particularly ugly: it appears we can't actually get the definition
+ * of task_struct here, but we need access to the CPU this task is running on.
+ * Instead of using task_struct we're using _TASK_CPU which is extracted from
+ * asm-offsets.h by kbuild to get the current processor ID.
+ *
+ * This also needs to be safeguarded when building asm-offsets.s because at
+ * that time _TASK_CPU is not defined yet. It could have been guarded by
+ * _TASK_CPU itself, but we want the build to fail if _TASK_CPU is missing
+ * when building something else than asm-offsets.s
+ */
+#ifdef GENERATING_ASM_OFFSETS
+#define raw_smp_processor_id() (0)
+#else
+#define raw_smp_processor_id() (*(unsigned int *)((void *)current + 
_TASK_CPU))
+#endif
 #define hard_smp_processor_id()(smp_hw_index[smp_processor_id()])
 
 static inline int 

[PATCH v9 3/9] powerpc: Prepare for moving thread_info into task_struct

2018-11-08 Thread Christophe Leroy
This patch cleans the powerpc kernel before activating
CONFIG_THREAD_INFO_IN_TASK:
- The purpose of the pointer given to call_do_softirq() and
call_do_irq() is to point the new stack ==> change it to void* and
rename it 'sp'
- Don't use CURRENT_THREAD_INFO() to locate the stack.
- Fix a few comments.
- Replace current_thread_info()->task by current
- Remove unnecessary casts to thread_info, as they'll become invalid
once thread_info is not in stack anymore.
- Rename THREAD_INFO to TASK_STASK: as it is in fact the offset of the
pointer to the stack in task_struct, this pointer will not be impacted
by the move of THREAD_INFO.
- Makes TASK_STACK available to PPC64. PPC64 will need it to get the
stack pointer from current once the thread_info have been moved.
- Modifies klp_init_thread_info() to take task_struct pointer argument.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/irq.h   |  4 ++--
 arch/powerpc/include/asm/livepatch.h |  7 ---
 arch/powerpc/include/asm/processor.h |  4 ++--
 arch/powerpc/include/asm/reg.h   |  2 +-
 arch/powerpc/kernel/asm-offsets.c|  2 +-
 arch/powerpc/kernel/entry_32.S   |  2 +-
 arch/powerpc/kernel/entry_64.S   |  2 +-
 arch/powerpc/kernel/head_32.S|  4 ++--
 arch/powerpc/kernel/head_40x.S   |  4 ++--
 arch/powerpc/kernel/head_44x.S   |  2 +-
 arch/powerpc/kernel/head_8xx.S   |  2 +-
 arch/powerpc/kernel/head_booke.h |  4 ++--
 arch/powerpc/kernel/head_fsl_booke.S |  4 ++--
 arch/powerpc/kernel/irq.c|  2 +-
 arch/powerpc/kernel/misc_32.S|  4 ++--
 arch/powerpc/kernel/process.c|  8 
 arch/powerpc/kernel/setup-common.c   |  2 +-
 arch/powerpc/kernel/setup_32.c   | 15 +--
 arch/powerpc/kernel/smp.c|  4 +++-
 19 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index ee39ce56b2a2..2efbae8d93be 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -63,8 +63,8 @@ extern struct thread_info *hardirq_ctx[NR_CPUS];
 extern struct thread_info *softirq_ctx[NR_CPUS];
 
 extern void irq_ctx_init(void);
-extern void call_do_softirq(struct thread_info *tp);
-extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp);
+void call_do_softirq(void *sp);
+void call_do_irq(struct pt_regs *regs, void *sp);
 extern void do_IRQ(struct pt_regs *regs);
 extern void __init init_IRQ(void);
 extern void __do_irq(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/livepatch.h 
b/arch/powerpc/include/asm/livepatch.h
index 47a03b9b528b..8a81d10ccc82 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -43,13 +43,14 @@ static inline unsigned long 
klp_get_ftrace_location(unsigned long faddr)
return ftrace_location_range(faddr, faddr + 16);
 }
 
-static inline void klp_init_thread_info(struct thread_info *ti)
+static inline void klp_init_thread_info(struct task_struct *p)
 {
+   struct thread_info *ti = task_thread_info(p);
/* + 1 to account for STACK_END_MAGIC */
-   ti->livepatch_sp = (unsigned long *)(ti + 1) + 1;
+   ti->livepatch_sp = end_of_stack(p) + 1;
 }
 #else
-static void klp_init_thread_info(struct thread_info *ti) { }
+static inline void klp_init_thread_info(struct task_struct *p) { }
 #endif /* CONFIG_LIVEPATCH */
 
 #endif /* _ASM_POWERPC_LIVEPATCH_H */
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 692f7383d461..15acb282a876 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -40,7 +40,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
-#include 
+#include 
 #include 
 #include 
 
@@ -326,7 +326,7 @@ struct thread_struct {
 
 #define INIT_SP(sizeof(init_stack) + (unsigned long) 
_stack)
 #define INIT_SP_LIMIT \
-   (_ALIGN_UP(sizeof(init_thread_info), 16) + (unsigned long) _stack)
+   (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index de52c3166ba4..95b68bdf34df 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1060,7 +1060,7 @@
  * - SPRG9 debug exception scratch
  *
  * All 32-bit:
- * - SPRG3 current thread_info pointer
+ * - SPRG3 current thread_struct physical addr pointer
  *(virtual on BookE, physical on others)
  *
  * 32-bit classic:
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 9ffc72ded73a..b2b52e002a76 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -90,10 +90,10 @@ int main(void)
DEFINE(SIGSEGV, SIGSEGV);
DEFINE(NMI_MASK, NMI_MASK);
 #else
-   OFFSET(THREAD_INFO, task_struct, stack);
DEFINE(THREAD_INFO_GAP, 

[PATCH v9 2/9] powerpc: Only use task_struct 'cpu' field on SMP

2018-11-08 Thread Christophe Leroy
When moving to CONFIG_THREAD_INFO_IN_TASK, the thread_info 'cpu' field
gets moved into task_struct and only defined when CONFIG_SMP is set.

This patch ensures that TI_CPU is only used when CONFIG_SMP is set and
that task_struct 'cpu' field is not used directly out of SMP code.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/head_fsl_booke.S | 2 ++
 arch/powerpc/kernel/misc_32.S| 4 
 arch/powerpc/xmon/xmon.c | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index e2750b856c8f..05b574f416b3 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -243,8 +243,10 @@ set_ivor:
li  r0,0
stwur0,THREAD_SIZE-STACK_FRAME_OVERHEAD(r1)
 
+#ifdef CONFIG_SMP
CURRENT_THREAD_INFO(r22, r1)
stw r24, TI_CPU(r22)
+#endif
 
bl  early_init
 
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 695b24a2d954..2f0fe8bfc078 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -183,10 +183,14 @@ _GLOBAL(low_choose_750fx_pll)
or  r4,r4,r5
mtspr   SPRN_HID1,r4
 
+#ifdef CONFIG_SMP
/* Store new HID1 image */
CURRENT_THREAD_INFO(r6, r1)
lwz r6,TI_CPU(r6)
slwir6,r6,2
+#else
+   li  r6, 0
+#endif
addis   r6,r6,nap_save_hid1@ha
stw r4,nap_save_hid1@l(r6)
 
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 36b8dc47a3c3..4414d7b3dc85 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2995,7 +2995,7 @@ static void show_task(struct task_struct *tsk)
printf("%px %016lx %6d %6d %c %2d %s\n", tsk,
tsk->thread.ksp,
tsk->pid, tsk->parent->pid,
-   state, task_thread_info(tsk)->cpu,
+   state, task_cpu(tsk),
tsk->comm);
 }
 
-- 
2.13.3



[PATCH v9 0/9] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2018-11-08 Thread Christophe Leroy
The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

Changes since v8:
 - Rebased on e589b79e40d9 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
  ==> Main impact was conflicts due to commit 9a8dd708d547 ("memblock: rename 
memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*")

Changes since v7:
 - Rebased on fb6c6ce7907d ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")

Changes since v6:
 - Fixed validate_sp() to exclude NULL sp in 'regain entire stack space' patch 
(early crash with CONFIG_KMEMLEAK)

Changes since v5:
 - Fixed livepatch_sp setup by using end_of_stack() instead of hardcoding
 - Fixed PPC_BPF_LOAD_CPU() macro

Changes since v4:
 - Fixed a build failure on 32bits SMP when include/generated/asm-offsets.h is 
not
 already existing, was due to spaces instead of a tab in the Makefile

Changes since RFC v3: (based on Nick's review)
 - Renamed task_size.h to task_size_user64.h to better relate to what it 
contains.
 - Handling of the isolation of thread_info cpu field inside CONFIG_SMP #ifdefs 
moved to a separate patch.
 - Removed CURRENT_THREAD_INFO macro completely.
 - Added a guard in asm/smp.h to avoid build failure before _TASK_CPU is 
defined.
 - Added a patch at the end to rename 'tp' pointers to 'sp' pointers
 - Renamed 'tp' into 'sp' pointers in preparation patch when relevant
 - Fixed a few commit logs
 - Fixed checkpatch report.

Changes since RFC v2:
 - Removed the modification of names in asm-offsets
 - Created a rule in arch/powerpc/Makefile to append the offset of current->cpu 
in CFLAGS
 - Modified asm/smp.h to use the offset set in CFLAGS
 - Squashed the renaming of THREAD_INFO to TASK_STACK in the preparation patch
 - Moved the modification of current_pt_regs in the patch activating 
CONFIG_THREAD_INFO_IN_TASK

Changes since RFC v1:
 - Removed the first patch which was modifying header inclusion order in timer
 - Modified some names in asm-offsets to avoid conflicts when including 
asm-offsets in C files
 - Modified asm/smp.h to avoid having to include linux/sched.h (using 
asm-offsets instead)
 - Moved some changes from the activation patch to the preparation patch.

Christophe Leroy (9):
  book3s/64: avoid circular header inclusion in mmu-hash.h
  powerpc: Only use task_struct 'cpu' field on SMP
  powerpc: Prepare for moving thread_info into task_struct
  powerpc: Activate CONFIG_THREAD_INFO_IN_TASK
  powerpc: regain entire stack space
  powerpc: 'current_set' is now a table of task_struct pointers
  powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU
  powerpc/64: Remove CURRENT_THREAD_INFO
  powerpc: clean stack pointers naming

 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/Makefile  |  7 +++
 arch/powerpc/include/asm/asm-prototypes.h  |  4 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h  |  2 +-
 arch/powerpc/include/asm/exception-64s.h   |  4 +-
 arch/powerpc/include/asm/irq.h | 14 ++---
 arch/powerpc/include/asm/livepatch.h   |  7 ++-
 arch/powerpc/include/asm/processor.h   | 39 +
 arch/powerpc/include/asm/ptrace.h  |  2 +-
 arch/powerpc/include/asm/reg.h |  2 +-
 arch/powerpc/include/asm/smp.h | 17 +-
 arch/powerpc/include/asm/task_size_user64.h| 42 ++
 arch/powerpc/include/asm/thread_info.h | 19 ---
 arch/powerpc/kernel/asm-offsets.c  | 10 ++--
 arch/powerpc/kernel/entry_32.S | 66 --
 arch/powerpc/kernel/entry_64.S | 12 ++--
 arch/powerpc/kernel/epapr_hcalls.S |  5 +-
 arch/powerpc/kernel/exceptions-64e.S   | 13 +
 arch/powerpc/kernel/exceptions-64s.S   |  2 +-
 arch/powerpc/kernel/head_32.S  | 14 ++---
 arch/powerpc/kernel/head_40x.S |  4 +-
 arch/powerpc/kernel/head_44x.S |  8 +--
 arch/powerpc/kernel/head_64.S  |  1 +
 arch/powerpc/kernel/head_8xx.S |  2 +-
 arch/powerpc/kernel/head_booke.h   | 12 +---
 arch/powerpc/kernel/head_fsl_booke.S   | 16 +++---
 arch/powerpc/kernel/idle_6xx.S |  8 +--
 arch/powerpc/kernel/idle_book3e.S  |  2 +-
 arch/powerpc/kernel/idle_e500.S|  8 +--
 arch/powerpc/kernel/idle_power4.S  |  2 +-
 arch/powerpc/kernel/irq.c  | 77 +-
 arch/powerpc/kernel/kgdb.c | 28 --
 arch/powerpc/kernel/machine_kexec_64.c |  6 +-
 arch/powerpc/kernel/misc_32.S  | 17 +++---
 

[PATCH v9 1/9] book3s/64: avoid circular header inclusion in mmu-hash.h

2018-11-08 Thread Christophe Leroy
When activating CONFIG_THREAD_INFO_IN_TASK, linux/sched.h
includes asm/current.h. This generates a circular dependency.
To avoid that, asm/processor.h shall not be included in mmu-hash.h

In order to do that, this patch moves into a new header called
asm/task_size_user64.h the information from asm/processor.h required
by mmu-hash.h

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/processor.h  | 34 +-
 arch/powerpc/include/asm/task_size_user64.h   | 42 +++
 arch/powerpc/kvm/book3s_hv_hmi.c  |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/include/asm/task_size_user64.h

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 12e522807f9f..b2aba048301e 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -23,7 +23,7 @@
  */
 #include 
 #include 
-#include 
+#include 
 #include 
 
 /*
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index ee58526cb6c2..692f7383d461 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -95,40 +95,8 @@ void release_thread(struct task_struct *);
 #endif
 
 #ifdef CONFIG_PPC64
-/*
- * 64-bit user address space can have multiple limits
- * For now supported values are:
- */
-#define TASK_SIZE_64TB  (0x4000UL)
-#define TASK_SIZE_128TB (0x8000UL)
-#define TASK_SIZE_512TB (0x0002UL)
-#define TASK_SIZE_1PB   (0x0004UL)
-#define TASK_SIZE_2PB   (0x0008UL)
-/*
- * With 52 bits in the address we can support
- * upto 4PB of range.
- */
-#define TASK_SIZE_4PB   (0x0010UL)
 
-/*
- * For now 512TB is only supported with book3s and 64K linux page size.
- */
-#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
-/*
- * Max value currently used:
- */
-#define TASK_SIZE_USER64   TASK_SIZE_4PB
-#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
-#define TASK_CONTEXT_SIZE  TASK_SIZE_512TB
-#else
-#define TASK_SIZE_USER64   TASK_SIZE_64TB
-#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
-/*
- * We don't need to allocate extended context ids for 4K page size, because
- * we limit the max effective address on this config to 64TB.
- */
-#define TASK_CONTEXT_SIZE  TASK_SIZE_64TB
-#endif
+#include 
 
 /*
  * 32-bit user address space is 4GB - 1 page
diff --git a/arch/powerpc/include/asm/task_size_user64.h 
b/arch/powerpc/include/asm/task_size_user64.h
new file mode 100644
index ..a4043075864b
--- /dev/null
+++ b/arch/powerpc/include/asm/task_size_user64.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_TASK_SIZE_USER64_H
+#define _ASM_POWERPC_TASK_SIZE_USER64_H
+
+#ifdef CONFIG_PPC64
+/*
+ * 64-bit user address space can have multiple limits
+ * For now supported values are:
+ */
+#define TASK_SIZE_64TB  (0x4000UL)
+#define TASK_SIZE_128TB (0x8000UL)
+#define TASK_SIZE_512TB (0x0002UL)
+#define TASK_SIZE_1PB   (0x0004UL)
+#define TASK_SIZE_2PB   (0x0008UL)
+/*
+ * With 52 bits in the address we can support
+ * upto 4PB of range.
+ */
+#define TASK_SIZE_4PB   (0x0010UL)
+
+/*
+ * For now 512TB is only supported with book3s and 64K linux page size.
+ */
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
+/*
+ * Max value currently used:
+ */
+#define TASK_SIZE_USER64   TASK_SIZE_4PB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
+#define TASK_CONTEXT_SIZE  TASK_SIZE_512TB
+#else
+#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
+/*
+ * We don't need to allocate extended context ids for 4K page size, because
+ * we limit the max effective address on this config to 64TB.
+ */
+#define TASK_CONTEXT_SIZE  TASK_SIZE_64TB
+#endif
+
+#endif /* CONFIG_PPC64 */
+#endif /* _ASM_POWERPC_TASK_SIZE_USER64_H */
diff --git a/arch/powerpc/kvm/book3s_hv_hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
index e3f738eb1cac..64b5011475c7 100644
--- a/arch/powerpc/kvm/book3s_hv_hmi.c
+++ b/arch/powerpc/kvm/book3s_hv_hmi.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void wait_for_subcore_guest_exit(void)
 {
-- 
2.13.3



[PATCH] powerpc: Add CONFIG_NR_CPUS to ppc64_defconfig

2018-11-08 Thread sathnaga
From: Satheesh Rajendran 

CONFIG_NR_CPUS is not set in ppc64_defconfig, So
it gets default to 32 which is not likely suitable for
powerpc systems configuration, hence defaulting it to 2048
like other powerpc defconfigs.

Signed-off-by: Satheesh Rajendran 
---
 arch/powerpc/configs/ppc64_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/ppc64_defconfig 
b/arch/powerpc/configs/ppc64_defconfig
index f2515674a1e2..8f89ea12dc8d 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -1,4 +1,5 @@
 CONFIG_PPC64=y
+CONFIG_NR_CPUS=2048
 CONFIG_SYSVIPC=y
 CONFIG_POSIX_MQUEUE=y
 CONFIG_NO_HZ=y
-- 
2.17.2



Re: [PATCH 12/36] dt-bindings: arm: Convert cpu binding to json-schema

2018-11-08 Thread Michal Simek
Hi Rob,

On 05. 10. 18 18:58, Rob Herring wrote:
> Convert ARM CPU binding to DT schema format using json-schema.
> 
> Cc: Mark Rutland 
> Cc: Matthias Brugger 
> Cc: devicet...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-media...@lists.infradead.org
> Signed-off-by: Rob Herring 
> ---
>  .../devicetree/bindings/arm/cpus.txt  | 490 -
>  .../devicetree/bindings/arm/cpus.yaml | 503 ++
>  2 files changed, 503 insertions(+), 490 deletions(-)
>  delete mode 100644 Documentation/devicetree/bindings/arm/cpus.txt
>  create mode 100644 Documentation/devicetree/bindings/arm/cpus.yaml
> 
> diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
> b/Documentation/devicetree/bindings/arm/cpus.txt
> deleted file mode 100644
> index b0198a1cf403..
> --- a/Documentation/devicetree/bindings/arm/cpus.txt
> +++ /dev/null
> @@ -1,490 +0,0 @@
> -=
> -ARM CPUs bindings
> -=
> -
> -The device tree allows to describe the layout of CPUs in a system through
> -the "cpus" node, which in turn contains a number of subnodes (ie "cpu")
> -defining properties for every cpu.
> -
> -Bindings for CPU nodes follow the Devicetree Specification, available from:
> -
> -https://www.devicetree.org/specifications/
> -
> -with updates for 32-bit and 64-bit ARM systems provided in this document.
> -
> -
> -Convention used in this document
> -
> -
> -This document follows the conventions described in the Devicetree
> -Specification, with the addition:
> -
> -- square brackets define bitfields, eg reg[7:0] value of the bitfield in
> -  the reg property contained in bits 7 down to 0
> -
> -=
> -cpus and cpu node bindings definition
> -=
> -
> -The ARM architecture, in accordance with the Devicetree Specification,
> -requires the cpus and cpu nodes to be present and contain the properties
> -described below.
> -
> -- cpus node
> -
> - Description: Container of cpu nodes
> -
> - The node name must be "cpus".
> -
> - A cpus node must define the following properties:
> -
> - - #address-cells
> - Usage: required
> - Value type: 
> -
> - Definition depends on ARM architecture version and
> - configuration:
> -
> - # On uniprocessor ARM architectures previous to v7
> -   value must be 1, to enable a simple enumeration
> -   scheme for processors that do not have a HW CPU
> -   identification register.
> - # On 32-bit ARM 11 MPcore, ARM v7 or later systems
> -   value must be 1, that corresponds to CPUID/MPIDR
> -   registers sizes.
> - # On ARM v8 64-bit systems value should be set to 2,
> -   that corresponds to the MPIDR_EL1 register size.
> -   If MPIDR_EL1[63:32] value is equal to 0 on all CPUs
> -   in the system, #address-cells can be set to 1, since
> -   MPIDR_EL1[63:32] bits are not used for CPUs
> -   identification.
> - - #size-cells
> - Usage: required
> - Value type: 
> - Definition: must be set to 0
> -
> -- cpu node
> -
> - Description: Describes a CPU in an ARM based system
> -
> - PROPERTIES
> -
> - - device_type
> - Usage: required
> - Value type: 
> - Definition: must be "cpu"
> - - reg
> - Usage and definition depend on ARM architecture version and
> - configuration:
> -
> - # On uniprocessor ARM architectures previous to v7
> -   this property is required and must be set to 0.
> -
> - # On ARM 11 MPcore based systems this property is
> -   required and matches the CPUID[11:0] register bits.
> -
> -   Bits [11:0] in the reg cell must be set to
> -   bits [11:0] in CPU ID register.
> -
> -   All other bits in the reg cell must be set to 0.
> -
> - # On 32-bit ARM v7 or later systems this property is
> -   required and matches the CPU MPIDR[23:0] register
> -   bits.
> -
> -   Bits [23:0] in the reg cell must be set to
> -   bits [23:0] in MPIDR.
> -
> -   All other bits in the reg cell must be set to 0.
> -
> - # On ARM v8 64-bit systems this property is required
> -   and matches the MPIDR_EL1 register affinity bits.
> -
> -   * If cpus node's #address-cells property is set to 2
> -
> - The first reg cell bits