[PATCH] lkdtm: do not depend on CONFIG_BLOCK
Most parts of lkdtm don't require CONFIG_BLOCK.  This patch limits the
dependency on CONFIG_BLOCK to the parts that need it, in order to give
embedded platforms which don't select CONFIG_BLOCK the opportunity to
use LKDTM.

Fixes: fddd9cf82c9f ("make LKDTM depend on BLOCK")
Signed-off-by: Christophe Leroy
---
 drivers/misc/lkdtm/core.c | 7 ++-
 lib/Kconfig.debug         | 1 -
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..bc76756b7eda 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -40,9 +40,12 @@
 #include
 #include
 #include
-#include <linux/blkdev.h>
 #include

+#ifdef CONFIG_BLOCK
+#include <linux/blkdev.h>
+#endif
+
 #ifdef CONFIG_IDE
 #include <linux/ide.h>
 #endif
@@ -101,7 +104,9 @@ static struct crashpoint crashpoints[] = {
 	CRASHPOINT("FS_DEVRW",		"ll_rw_block"),
 	CRASHPOINT("MEM_SWAPOUT",	"shrink_inactive_list"),
 	CRASHPOINT("TIMERADD",		"hrtimer_start"),
+# ifdef CONFIG_BLOCK
 	CRASHPOINT("SCSI_DISPATCH_CMD",	"scsi_dispatch_cmd"),
+# endif
 # ifdef CONFIG_IDE
 	CRASHPOINT("IDE_CORE_CP",	"generic_ide_ioctl"),
 # endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1af29b8224fd..0dd65b4b2ad2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1685,7 +1685,6 @@ if RUNTIME_TESTING_MENU
 config LKDTM
 	tristate "Linux Kernel Dump Test Tool Module"
 	depends on DEBUG_FS
-	depends on BLOCK
 	help
 	  This module enables testing of the different dumping mechanisms by
 	  inducing system failures at predefined crash points.
-- 
2.13.3
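The crashpoint table in the patch above grows or shrinks with the kernel configuration: entries wrapped in `#ifdef` simply vanish from the array when the config option is off. A minimal userspace sketch of the same pattern (names here are stand-ins, not the real lkdtm table):

```c
#include <stddef.h>

/* Sketch of the crashpoint-table pattern: an entry is compiled in only
 * when the matching config macro is defined, so the array size adapts.
 * Build with -DCONFIG_BLOCK to include the guarded entry. */
struct crashpoint {
	const char *name;
	const char *symbol;
};

static const struct crashpoint crashpoints[] = {
	{ "TIMERADD",		"hrtimer_start" },
#ifdef CONFIG_BLOCK
	{ "SCSI_DISPATCH_CMD",	"scsi_dispatch_cmd" },
#endif
	{ "PANIC",		"panic" },
};

/* sizeof-based count automatically reflects the configured entries */
static size_t crashpoint_count(void)
{
	return sizeof(crashpoints) / sizeof(crashpoints[0]);
}
```

Because the count is derived with `sizeof`, no other code needs to know which entries were configured out; this is why the patch can guard a single table row without touching the iteration code.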
Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20
On 11/8/18 10:56 PM, Frank Rowand wrote:
> Hi Rob,
>
> Please pull the changes to add the overlay validation checks.
>
> This is the v7 version of the patch series.
>
> -Frank
>
> [ ... ]
[GIT PULL] of: overlay: validation checks, subsequent fixes for v20
Hi Rob,

Please pull the changes to add the overlay validation checks.

This is the v7 version of the patch series.

-Frank


The following changes since commit 651022382c7f8da46cb4872a545ee1da6d097d2a:

  Linux 4.20-rc1 (2018-11-04 15:37:52 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frowand/linux.git tags/kfree_validate_v7-for-4.20

for you to fetch changes up to eeb07c573ec307c53fe2f6ac6d8d11c261f64006:

  of: unittest: initialize args before calling of_*parse_*() (2018-11-08 22:12:37 -0800)

----------------------------------------------------------------
Add checks to (1) overlay apply process and (2) memory freeing
triggered by overlay release.  The checks are intended to detect
possible memory leaks and invalid overlays.

The checks revealed bugs in existing code.  Fixed the bugs.

While fixing bugs, noted other issues, which are fixed in
separate patches.

----------------------------------------------------------------
Frank Rowand (17):
      of: overlay: add tests to validate kfrees from overlay removal
      of: overlay: add missing of_node_put() after add new node to changeset
      of: overlay: add missing of_node_get() in __of_attach_node_sysfs
      powerpc/pseries: add of_node_put() in dlpar_detach_node()
      of: overlay: use prop add changeset entry for property in new nodes
      of: overlay: do not duplicate properties from overlay for new nodes
      of: overlay: reorder fields in struct fragment
      of: overlay: validate overlay properties #address-cells and #size-cells
      of: overlay: make all pr_debug() and pr_err() messages unique
      of: overlay: test case of two fragments adding same node
      of: overlay: check prevents multiple fragments add or delete same node
      of: overlay: check prevents multiple fragments touching same property
      of: unittest: remove unused of_unittest_apply_overlay() argument
      of: overlay: set node fields from properties when add new overlay node
      of: unittest: allow base devicetree to have symbol metadata
      of: unittest: find overlays[] entry by name instead of index
      of: unittest: initialize args before calling of_*parse_*()

 arch/powerpc/platforms/pseries/dlpar.c             |   2 +
 drivers/of/dynamic.c                               |  59 -
 drivers/of/kobj.c                                  |   4 +-
 drivers/of/overlay.c                               | 292 -
 drivers/of/unittest-data/Makefile                  |   2 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  |  28 ++
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 ++
 drivers/of/unittest-data/overlay_base.dts          |   1 +
 drivers/of/unittest.c                              |  96 +--
 include/linux/of.h                                 |  21 +-
 10 files changed, 432 insertions(+), 97 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts
[PATCH v7 17/17] of: unittest: initialize args before calling of_*parse_*()
From: Frank Rowand

Callers of of_irq_parse_one() blindly use the pointer args.np without
checking whether of_irq_parse_one() had an error and thus did not set
the value of args.np.

Initialize args to zero so that using the format "%pOF" to show the
value of args.np will show "(null)" when of_irq_parse_one() has an
error.  This prevents the dereference of a random value.

Make the same fix for callers of of_parse_phandle_with_args() and
of_parse_phandle_with_args_map().

Reported-by: Guenter Roeck
Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/unittest.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index fe01c5869b0f..9a10a48eb6a1 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -379,6 +379,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
 	for (i = 0; i < 8; i++) {
 		bool passed = true;
 
+		memset(&args, 0, sizeof(args));
 		rc = of_parse_phandle_with_args(np, "phandle-list",
 						"#phandle-cells", i, &args);
@@ -432,6 +433,7 @@ static void __init of_unittest_parse_phandle_with_args(void)
 	}
 
 	/* Check for missing list property */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args(np, "phandle-list-missing",
 					"#phandle-cells", 0, &args);
 	unittest(rc == -ENOENT, "expected:%i got:%i\n", -ENOENT, rc);
@@ -440,6 +442,7 @@
 	unittest(rc == -ENOENT, "expected:%i got:%i\n", -ENOENT, rc);
 
 	/* Check for missing cells property */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args(np, "phandle-list",
 					"#phandle-cells-missing", 0, &args);
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -448,6 +451,7 @@
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
 	/* Check for bad phandle in list */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args(np, "phandle-list-bad-phandle",
 					"#phandle-cells", 0, &args);
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -456,6 +460,7 @@
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
 	/* Check for incorrectly formed argument list */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args(np, "phandle-list-bad-args",
 					"#phandle-cells", 1, &args);
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -506,6 +511,7 @@ static void __init of_unittest_parse_phandle_with_args_map(void)
 	for (i = 0; i < 8; i++) {
 		bool passed = true;
 
+		memset(&args, 0, sizeof(args));
 		rc = of_parse_phandle_with_args_map(np, "phandle-list",
 						    "phandle", i, &args);
@@ -563,21 +569,25 @@ static void __init of_unittest_parse_phandle_with_args_map(void)
 	}
 
 	/* Check for missing list property */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args_map(np, "phandle-list-missing",
 					    "phandle", 0, &args);
 	unittest(rc == -ENOENT, "expected:%i got:%i\n", -ENOENT, rc);
 
 	/* Check for missing cells,map,mask property */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args_map(np, "phandle-list",
 					    "phandle-missing", 0, &args);
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
 	/* Check for bad phandle in list */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args_map(np, "phandle-list-bad-phandle",
 					    "phandle", 0, &args);
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
 
 	/* Check for incorrectly formed argument list */
+	memset(&args, 0, sizeof(args));
 	rc = of_parse_phandle_with_args_map(np, "phandle-list-bad-args",
 					    "phandle", 1, &args);
 	unittest(rc == -EINVAL, "expected:%i got:%i\n", -EINVAL, rc);
@@ -787,7 +797,7 @@ static void __init of_unittest_parse_interrupts(void)
 	for (i = 0; i < 4; i++) {
 		bool passed = true;
 
-		args.args_count = 0;
+		memset(&args, 0, sizeof(args));
 		rc = of_irq_parse_one(np, i, &args);
 
 		passed &= !rc;
@@ -808,7 +818,7 @@ static void __init of_unittest_parse_interrupts(void)
 	for (i = 0; i < 4; i++) {
 		bool passed = true;
 
-		args.args_count = 0;
+		memset(&args, 0, sizeof(args));
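The point of the patch is a caller-side contract: the parse functions do not write the output struct on failure, so the caller must zero it first if it is going to look at the fields unconditionally. A userspace sketch of that pattern, with hypothetical stand-in names (`fake_args`, `fake_parse_one` are not kernel APIs):

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct of_phandle_args and of_irq_parse_one(). */
struct fake_args {
	void *np;	/* node pointer the parser may leave unset on error */
	int args_count;
};

/* Simulated parser: on error it returns -EINVAL and never writes *out,
 * exactly the behavior the commit message describes. */
static int fake_parse_one(int should_fail, struct fake_args *out)
{
	if (should_fail)
		return -22;	/* -EINVAL */
	out->np = (void *)0x1;
	out->args_count = 1;
	return 0;
}

/* Caller pattern the patch enforces: zero args before parsing, so that
 * args.np is NULL after a parse error (and "%pOF" would print "(null)")
 * instead of holding uninitialized stack garbage. */
static void *parse_and_get_np(int should_fail)
{
	struct fake_args args;

	memset(&args, 0, sizeof(args));		/* the fix added by this patch */
	(void)fake_parse_one(should_fail, &args);
	return args.np;				/* NULL on error, never garbage */
}
```

Without the `memset`, `args.np` on the error path is whatever happened to be on the stack, which is why the unittests previously dereferenced a random value.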
[PATCH v7 16/17] of: unittest: find overlays[] entry by name instead of index
From: Frank Rowand

One accessor of overlays[] was using a hard coded index value to find
the correct array entry instead of searching for the entry containing
the correct name.

Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/unittest.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index d625a91a7f60..fe01c5869b0f 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -2192,7 +2192,7 @@ struct overlay_info {
 OVERLAY_INFO_EXTERN(overlay_bad_phandle);
 OVERLAY_INFO_EXTERN(overlay_bad_symbol);
 
-/* order of entries is hard-coded into users of overlays[] */
+/* entries found by name */
 static struct overlay_info overlays[] = {
 	OVERLAY_INFO(overlay_base, -9999),
 	OVERLAY_INFO(overlay, 0),
@@ -2215,7 +2215,8 @@ struct overlay_info {
 	OVERLAY_INFO(overlay_bad_add_dup_prop, -EINVAL),
 	OVERLAY_INFO(overlay_bad_phandle, -EINVAL),
 	OVERLAY_INFO(overlay_bad_symbol, -EINVAL),
-	{}
+	/* end marker */
+	{.dtb_begin = NULL, .dtb_end = NULL, .expected_result = 0, .name = NULL}
 };
 
 static struct device_node *overlay_base_root;
@@ -2245,6 +2246,19 @@ void __init unittest_unflatten_overlay_base(void)
 	u32 data_size;
 	void *new_fdt;
 	u32 size;
+	int found = 0;
+	const char *overlay_name = "overlay_base";
+
+	for (info = overlays; info && info->name; info++) {
+		if (!strcmp(overlay_name, info->name)) {
+			found = 1;
+			break;
+		}
+	}
+	if (!found) {
+		pr_err("no overlay data for %s\n", overlay_name);
+		return;
+	}
 
 	info = &overlays[0];
@@ -2292,11 +2306,10 @@ static int __init overlay_data_apply(const char *overlay_name, int *overlay_id)
 {
 	struct overlay_info *info;
 	int found = 0;
-	int k;
 	int ret;
 	u32 size;
 
-	for (k = 0, info = overlays; info && info->name; info++, k++) {
+	for (info = overlays; info && info->name; info++) {
 		if (!strcmp(overlay_name, info->name)) {
 			found = 1;
 			break;
-- 
Frank Rowand
[PATCH v7 15/17] of: unittest: allow base devicetree to have symbol metadata
From: Frank Rowand

The overlay metadata nodes in the FDT created from testcases.dts are
not handled properly.  The __fixups__ and __local_fixups__ nodes were
added to the live devicetree, but should not be.  Only the first
property in the /__symbols__ node was added to the live devicetree if
the live devicetree already contained a /__symbols__ node.  All of the
node's properties must be added.

Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/unittest.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 14838b21ec6a..d625a91a7f60 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1071,20 +1071,44 @@ static void __init of_unittest_platform_populate(void)
  *	of np into dup node (present in live tree) and
  *	updates parent of children of np to dup.
  *
- * @np:	node already present in live tree
+ * @np:	node whose properties are being added to the live tree
  * @dup: node present in live tree to be updated
  */
 static void update_node_properties(struct device_node *np,
 					struct device_node *dup)
 {
 	struct property *prop;
+	struct property *save_next;
 	struct device_node *child;
-
-	for_each_property_of_node(np, prop)
-		of_add_property(dup, prop);
+	int ret;
 
 	for_each_child_of_node(np, child)
 		child->parent = dup;
+
+	/*
+	 * "unittest internal error: unable to add testdata property"
+	 *
+	 *    If this message reports a property in node '/__symbols__' then
+	 *    the respective unittest overlay contains a label that has the
+	 *    same name as a label in the live devicetree.  The label will
+	 *    be in the live devicetree only if the devicetree source was
+	 *    compiled with the '-@' option.  If you encounter this error,
+	 *    please consider renaming __all__ of the labels in the unittest
+	 *    overlay dts files with an odd prefix that is unlikely to be
+	 *    used in a real devicetree.
+	 */
+
+	/*
+	 * open code for_each_property_of_node() because of_add_property()
+	 * sets prop->next to NULL
+	 */
+	for (prop = np->properties; prop != NULL; prop = save_next) {
+		save_next = prop->next;
+		ret = of_add_property(dup, prop);
+		if (ret)
+			pr_err("unittest internal error: unable to add testdata property %pOF/%s",
+			       np, prop->name);
+	}
 }
 
 /**
@@ -1093,18 +1117,23 @@ static void update_node_properties(struct device_node *np,
  *
  * @np:	Node to attach to live tree
  */
-static int attach_node_and_children(struct device_node *np)
+static void attach_node_and_children(struct device_node *np)
 {
 	struct device_node *next, *dup, *child;
 	unsigned long flags;
 	const char *full_name;
 
 	full_name = kasprintf(GFP_KERNEL, "%pOF", np);
+
+	if (!strcmp(full_name, "/__local_fixups__") ||
+	    !strcmp(full_name, "/__fixups__"))
+		return;
+
 	dup = of_find_node_by_path(full_name);
 	kfree(full_name);
 	if (dup) {
 		update_node_properties(np, dup);
-		return 0;
+		return;
 	}
 
 	child = np->child;
@@ -1125,8 +1154,6 @@ static int attach_node_and_children(struct device_node *np)
 		attach_node_and_children(child);
 		child = next;
 	}
-
-	return 0;
 }
 
 /**
-- 
Frank Rowand
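The open-coded traversal above exists because `of_add_property()` rewrites `prop->next` while linking the property into the destination node, which would break `for_each_property_of_node()` mid-walk. The general pattern, sketched in plain C with hypothetical names (`prop`, `add_prop`, `move_all` are illustrations, not kernel APIs):

```c
#include <stddef.h>

/* Minimal singly linked property list, mimicking struct property. */
struct prop {
	const char *name;
	struct prop *next;
};

/* Stand-in for of_add_property(): like the real function, it links the
 * property into the destination list and overwrites prop->next itself,
 * so the source list's chain through this node is destroyed. */
static int add_prop(struct prop **dst, struct prop *p)
{
	p->next = *dst;
	*dst = p;
	return 0;
}

/* Pattern from the patch: open-code the traversal and save the next
 * pointer BEFORE handing the node to the callee that clobbers it. */
static int move_all(struct prop **dst, struct prop *src)
{
	struct prop *save_next;
	struct prop *p;
	int moved = 0;

	for (p = src; p != NULL; p = save_next) {
		save_next = p->next;	/* must read before add_prop() */
		add_prop(dst, p);
		moved++;
	}
	return moved;
}
```

Reading `p->next` after `add_prop(dst, p)` would follow the destination list instead of the source list; with the kernel's real `of_add_property()` (which sets `prop->next = NULL` on the tail) the walk would simply stop after the first property, which matches the "only the first property was added" symptom in the commit message.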
[PATCH v7 14/17] of: overlay: set node fields from properties when add new overlay node
From: Frank Rowand

Overlay nodes added by add_changeset_node() do not have the node
fields name, phandle, and type set.  The node passed to
__of_attach_node() when the add node changeset entry is processed
does not contain any properties.  The node's properties are located
in add property changeset entries that will be processed after the
add node changeset is applied.

Set the node's fields in the node contained in the add node changeset
entry and do not set them to incorrect values in add_changeset_node().

A visible symptom that is fixed by this patch is the names of nodes
added by overlays that have an entry in /sys/bus/platform/drivers/*/
will contain the unit-address but the node-name will be <NULL>, for
example, "fc4ab000.<NULL>".  After applying the patch the name, in
this example, for node restart@fc4ab000 is "fc4ab000.restart".

Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/dynamic.c | 27 ++++++++++++++++-----------
 drivers/of/overlay.c | 29 +++++++++++++++++++++--------
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 146681540487..b4e5b90cb314 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -205,15 +205,24 @@ static void __of_attach_node(struct device_node *np)
 	const __be32 *phandle;
 	int sz;
 
-	np->name = __of_get_property(np, "name", NULL) ? : "<NULL>";
-	np->type = __of_get_property(np, "device_type", NULL) ? : "<NULL>";
-
-	phandle = __of_get_property(np, "phandle", &sz);
-	if (!phandle)
-		phandle = __of_get_property(np, "linux,phandle", &sz);
-	if (IS_ENABLED(CONFIG_PPC_PSERIES) && !phandle)
-		phandle = __of_get_property(np, "ibm,phandle", &sz);
-	np->phandle = (phandle && (sz >= 4)) ? be32_to_cpup(phandle) : 0;
+	if (!of_node_check_flag(np, OF_OVERLAY)) {
+		np->name = __of_get_property(np, "name", NULL);
+		np->type = __of_get_property(np, "device_type", NULL);
+		if (!np->name)
+			np->name = "<NULL>";
+		if (!np->type)
+			np->type = "<NULL>";
+
+		phandle = __of_get_property(np, "phandle", &sz);
+		if (!phandle)
+			phandle = __of_get_property(np, "linux,phandle", &sz);
+		if (IS_ENABLED(CONFIG_PPC_PSERIES) && !phandle)
+			phandle = __of_get_property(np, "ibm,phandle", &sz);
+		if (phandle && (sz >= 4))
+			np->phandle = be32_to_cpup(phandle);
+		else
+			np->phandle = 0;
+	}
 
 	np->child = NULL;
 	np->sibling = np->parent->child;
diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 184cc2c4a931..2b5ac43a5690 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -307,10 +307,11 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
 	int ret = 0;
 	bool check_for_non_overlay_node = false;
 
-	if (!of_prop_cmp(overlay_prop->name, "name") ||
-	    !of_prop_cmp(overlay_prop->name, "phandle") ||
-	    !of_prop_cmp(overlay_prop->name, "linux,phandle"))
-		return 0;
+	if (target->in_livetree)
+		if (!of_prop_cmp(overlay_prop->name, "name") ||
+		    !of_prop_cmp(overlay_prop->name, "phandle") ||
+		    !of_prop_cmp(overlay_prop->name, "linux,phandle"))
+			return 0;
 
 	if (target->in_livetree)
 		prop = of_find_property(target->np, overlay_prop->name, NULL);
@@ -330,6 +331,10 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
 
 	if (!prop) {
 		check_for_non_overlay_node = true;
+		if (!target->in_livetree) {
+			new_prop->next = target->np->deadprops;
+			target->np->deadprops = new_prop;
+		}
 		ret = of_changeset_add_property(&ovcs->cset, target->np,
 						new_prop);
 	} else if (!of_prop_cmp(prop->name, "#address-cells")) {
@@ -400,9 +405,10 @@ static int add_changeset_node(struct overlay_changeset *ovcs,
 		struct target *target, struct device_node *node)
 {
 	const char *node_kbasename;
+	const __be32 *phandle;
 	struct device_node *tchild;
 	struct target target_child;
-	int ret = 0;
+	int ret = 0, size;
 
 	node_kbasename = kbasename(node->full_name);
 
@@ -416,6 +422,19 @@ static int add_changeset_node(struct overlay_changeset *ovcs,
 			return -ENOMEM;
 
 		tchild->parent = target->np;
+		tchild->name = __of_get_property(node, "name", NULL);
+		tchild->type = __of_get_property(node, "device_type", NULL);
+
+		if (!tchild->name)
+			tchild->name = "<NULL>";
+		if (!tchild->type)
+			tchild->type = "<NULL>";
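The phandle handling in the hunk above depends on devicetree property values being stored big-endian, which is what `be32_to_cpup()` converts from. A userspace sketch of that logic (the function names here are illustrative stand-ins for the kernel helpers):

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace sketch of be32_to_cpup(): devicetree property values are
 * stored big-endian, so the phandle bytes must be converted to host
 * byte order before use.  Correct regardless of host endianness. */
static uint32_t be32_to_cpup_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Mirror of the patched logic: only trust the value when the property
 * exists and is at least 4 bytes long, otherwise use phandle 0
 * (0 is not a valid phandle, so it safely means "none"). */
static uint32_t phandle_from_prop(const unsigned char *prop, int sz)
{
	if (prop && sz >= 4)
		return be32_to_cpup_sketch(prop);
	return 0;
}
```

The `sz >= 4` guard matters because a malformed overlay could carry a `phandle` property shorter than one cell; without the check the conversion would read past the property data.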
[PATCH v7 13/17] of: unittest: remove unused of_unittest_apply_overlay() argument
From: Frank Rowand

Argument unittest_nr is not used in of_unittest_apply_overlay(),
remove it.

Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/unittest.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index f0139d1e8b63..14838b21ec6a 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1433,8 +1433,7 @@ static void of_unittest_destroy_tracked_overlays(void)
 	} while (defers > 0);
 }
 
-static int __init of_unittest_apply_overlay(int overlay_nr, int unittest_nr,
-		int *overlay_id)
+static int __init of_unittest_apply_overlay(int overlay_nr, int *overlay_id)
 {
 	const char *overlay_name;
 
@@ -1467,7 +1466,7 @@ static int __init of_unittest_apply_overlay_check(int overlay_nr,
 	}
 
 	ovcs_id = 0;
-	ret = of_unittest_apply_overlay(overlay_nr, unittest_nr, &ovcs_id);
+	ret = of_unittest_apply_overlay(overlay_nr, &ovcs_id);
 	if (ret != 0) {
 		/* of_unittest_apply_overlay already called unittest() */
 		return ret;
@@ -1503,7 +1502,7 @@ static int __init of_unittest_apply_revert_overlay_check(int overlay_nr,
 
 	/* apply the overlay */
 	ovcs_id = 0;
-	ret = of_unittest_apply_overlay(overlay_nr, unittest_nr, &ovcs_id);
+	ret = of_unittest_apply_overlay(overlay_nr, &ovcs_id);
 	if (ret != 0) {
 		/* of_unittest_apply_overlay already called unittest() */
 		return ret;
-- 
Frank Rowand
[PATCH v7 12/17] of: overlay: check prevents multiple fragments touching same property
From: Frank Rowand

Add test case of two fragments updating the same property.  After
adding the test case, the system hangs at end of boot, after slub
stack dumps from kfree() in crypto modprobe code.

Multiple overlay fragments adding, modifying, or deleting the same
property is not supported.  Add check to detect the attempt and fail
the overlay apply.

Before this patch, the first fragment error would terminate
processing.  Allow fragment checking to proceed and report all of the
fragment errors before terminating the overlay apply.  This is not a
hot path, thus not a performance issue (the error is not transient
and requires fixing the overlay before attempting to apply it again).

After applying this patch, the devicetree unittest messages will
include:

   OF: overlay: ERROR: multiple fragments add, update, and/or delete property /testcase-data-2/substation@100/motor-1/rpm_avail

   ...

   ### dt-test ### end of unittest - 212 passed, 0 failed

The check to detect two fragments updating the same property is folded
into the patch that created the test case to maintain bisectability.
Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/overlay.c                               | 118 ++---
 drivers/of/unittest-data/Makefile                  |   1 +
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 +
 drivers/of/unittest-data/overlay_base.dts          |   1 +
 drivers/of/unittest.c                              |   5 +
 5 files changed, 112 insertions(+), 37 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 8af8115bd36e..184cc2c4a931 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -508,52 +508,96 @@ static int build_changeset_symbols_node(struct overlay_changeset *ovcs,
 	return 0;
 }
 
+static int find_dup_cset_node_entry(struct overlay_changeset *ovcs,
+		struct of_changeset_entry *ce_1)
+{
+	struct of_changeset_entry *ce_2;
+	char *fn_1, *fn_2;
+	int node_path_match;
+
+	if (ce_1->action != OF_RECONFIG_ATTACH_NODE &&
+	    ce_1->action != OF_RECONFIG_DETACH_NODE)
+		return 0;
+
+	ce_2 = ce_1;
+	list_for_each_entry_continue(ce_2, &ovcs->cset.entries, node) {
+		if ((ce_2->action != OF_RECONFIG_ATTACH_NODE &&
+		     ce_2->action != OF_RECONFIG_DETACH_NODE) ||
+		    of_node_cmp(ce_1->np->full_name, ce_2->np->full_name))
+			continue;
+
+		fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+		fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+		node_path_match = !strcmp(fn_1, fn_2);
+		kfree(fn_1);
+		kfree(fn_2);
+		if (node_path_match) {
+			pr_err("ERROR: multiple fragments add and/or delete node %pOF\n",
+			       ce_1->np);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int find_dup_cset_prop(struct overlay_changeset *ovcs,
+		struct of_changeset_entry *ce_1)
+{
+	struct of_changeset_entry *ce_2;
+	char *fn_1, *fn_2;
+	int node_path_match;
+
+	if (ce_1->action != OF_RECONFIG_ADD_PROPERTY &&
+	    ce_1->action != OF_RECONFIG_REMOVE_PROPERTY &&
+	    ce_1->action != OF_RECONFIG_UPDATE_PROPERTY)
+		return 0;
+
+	ce_2 = ce_1;
+	list_for_each_entry_continue(ce_2, &ovcs->cset.entries, node) {
+		if ((ce_2->action != OF_RECONFIG_ADD_PROPERTY &&
+		     ce_2->action != OF_RECONFIG_REMOVE_PROPERTY &&
+		     ce_2->action != OF_RECONFIG_UPDATE_PROPERTY) ||
+		    of_node_cmp(ce_1->np->full_name, ce_2->np->full_name))
+			continue;
+
+		fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+		fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+		node_path_match = !strcmp(fn_1, fn_2);
+		kfree(fn_1);
+		kfree(fn_2);
+		if (node_path_match &&
+		    !of_prop_cmp(ce_1->prop->name, ce_2->prop->name)) {
+			pr_err("ERROR: multiple fragments add, update, and/or delete property %pOF/%s\n",
+			       ce_1->np, ce_1->prop->name);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 /**
- * check_changeset_dup_add_node() - changeset validation: duplicate add node
+ * changeset_dup_entry_check() - check for duplicate entries
  * @ovcs:	Overlay changeset
  *
- * Check changeset @ovcs->cset for multiple add node entries for the same
- * node.
+ * Check changeset @ovcs->cset for multiple {add or delete} node entries for
+ * the same node or duplicate {add, delete, or update} properties entries
+ * for the same property.
 *
- * Returns 0 on success, -ENOMEM if
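The duplicate checks above all share one shape: for each changeset entry, scan only the entries that follow it (the kernel uses `list_for_each_entry_continue` for that), so each pair is compared exactly once. A self-contained sketch over an array instead of a kernel list, with hypothetical names (`entry`, `find_dup_node_entry`, `check_dup_entries` are illustrations):

```c
#include <string.h>
#include <stddef.h>

/* Simplified changeset entry: just an action and a node path. */
enum action { ATTACH_NODE, DETACH_NODE, ADD_PROP };

struct entry {
	enum action action;
	const char *path;
};

/* Pattern used by find_dup_cset_node_entry(): for entry i, scan only
 * the entries AFTER it, skipping entries that are not node add/delete
 * actions.  Returns -22 (-EINVAL) on a duplicate node entry. */
static int find_dup_node_entry(const struct entry *entries, size_t n, size_t i)
{
	size_t j;

	if (entries[i].action != ATTACH_NODE && entries[i].action != DETACH_NODE)
		return 0;

	for (j = i + 1; j < n; j++) {
		if (entries[j].action != ATTACH_NODE &&
		    entries[j].action != DETACH_NODE)
			continue;
		if (!strcmp(entries[i].path, entries[j].path))
			return -22;	/* -EINVAL: two fragments touch one node */
	}
	return 0;
}

/* Whole-changeset check, mirroring changeset_dup_entry_check(). */
static int check_dup_entries(const struct entry *entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (find_dup_node_entry(entries, n, i))
			return -22;
	return 0;
}
```

The scan is O(n²) in the number of changeset entries, which the commit message explicitly accepts: overlay apply is not a hot path, and a duplicate is a hard error that requires fixing the overlay source anyway.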
[PATCH v7 11/17] of: overlay: check prevents multiple fragments add or delete same node
From: Frank Rowand

Multiple overlay fragments adding or deleting the same node is not
supported.  Replace the code comment saying so with a check that
detects the attempt and fails the overlay apply.

A devicetree unittest where multiple fragments add the same node was
added in the previous patch in the series.  After applying this patch
the unittest messages will no longer include:

   Duplicate name in motor-1, renamed to "controller#1"
   OF: overlay: of_overlay_apply() err=0
   ### dt-test ### of_overlay_fdt_apply() expected -22, ret=0, overlay_bad_add_dup_node
   ### dt-test ### FAIL of_unittest_overlay_high_level():2419 Adding overlay 'overlay_bad_add_dup_node' failed

   ...

   ### dt-test ### end of unittest - 210 passed, 1 failed

but will instead include:

   OF: overlay: ERROR: multiple overlay fragments add and/or delete node /testcase-data-2/substation@100/motor-1/controller

   ...

   ### dt-test ### end of unittest - 211 passed, 0 failed

Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
checkpatch errors "line over 80 characters" and "Too many leading tabs"
are ok, they will be fixed later in this series

 drivers/of/overlay.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index c8f88b0836a3..8af8115bd36e 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -392,14 +392,6 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
  *	  a live devicetree created from Open Firmware.
  *
  * NOTE_2: Multiple mods of created nodes not supported.
- *	  If more than one fragment contains a node that does not already exist
- *	  in the live tree, then for each fragment of_changeset_attach_node()
- *	  will add a changeset entry to add the node.  When the changeset is
- *	  applied, __of_attach_node() will attach the node twice (once for
- *	  each fragment).  At this point the device tree will be corrupted.
- *
- *	  TODO: add integrity check to ensure that multiple fragments do not
- *	  create the same node.
 *
 * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
 * invalid @overlay.
@@ -517,6 +509,54 @@ static int build_changeset_symbols_node(struct overlay_changeset *ovcs,
 }
 
 /**
+ * check_changeset_dup_add_node() - changeset validation: duplicate add node
+ * @ovcs:	Overlay changeset
+ *
+ * Check changeset @ovcs->cset for multiple add node entries for the same
+ * node.
+ *
+ * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
+ * invalid overlay in @ovcs->fragments[].
+ */
+static int check_changeset_dup_add_node(struct overlay_changeset *ovcs)
+{
+	struct of_changeset_entry *ce_1, *ce_2;
+	char *fn_1, *fn_2;
+	int name_match;
+
+	list_for_each_entry(ce_1, &ovcs->cset.entries, node) {
+
+		if (ce_1->action == OF_RECONFIG_ATTACH_NODE ||
+		    ce_1->action == OF_RECONFIG_DETACH_NODE) {
+
+			ce_2 = ce_1;
+			list_for_each_entry_continue(ce_2, &ovcs->cset.entries, node) {
+				if (ce_2->action == OF_RECONFIG_ATTACH_NODE ||
+				    ce_2->action == OF_RECONFIG_DETACH_NODE) {
+					/* inexpensive name compare */
+					if (!of_node_cmp(ce_1->np->full_name,
+					    ce_2->np->full_name)) {
+						/* expensive full path name compare */
+						fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+						fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+						name_match = !strcmp(fn_1, fn_2);
+						kfree(fn_1);
+						kfree(fn_2);
+						if (name_match) {
+							pr_err("ERROR: multiple overlay fragments add and/or delete node %pOF\n",
+							       ce_1->np);
+							return -EINVAL;
+						}
+					}
+				}
+			}
+		}
+	}
+
+	return 0;
+}
+
+/**
 * build_changeset() - populate overlay changeset in @ovcs from @ovcs->fragments
 * @ovcs:	Overlay changeset
 *
@@ -571,7 +611,7 @@ static int build_changeset(struct overlay_changeset *ovcs)
 		}
 	}
 
-	return 0;
+	return check_changeset_dup_add_node(ovcs);
 }
 
 /*
-- 
Frank Rowand
[PATCH v7 10/17] of: overlay: test case of two fragments adding same node
From: Frank Rowand

Multiple overlay fragments adding or deleting the same node is not
supported.  An attempt to do so results in an incorrect devicetree.
The node name will be munged for the second add.

After adding this patch, the unittest messages will show:

   Duplicate name in motor-1, renamed to "controller#1"
   OF: overlay: of_overlay_apply() err=0
   ### dt-test ### of_overlay_fdt_apply() expected -22, ret=0, overlay_bad_add_dup_node
   ### dt-test ### FAIL of_unittest_overlay_high_level():2419 Adding overlay 'overlay_bad_add_dup_node' failed

   ...

   ### dt-test ### end of unittest - 210 passed, 1 failed

The incorrect (munged) node name "controller#1" can be seen in the
/proc filesystem:

   $ pwd
   /proc/device-tree/testcase-data-2/substation@100/motor-1
   $ ls
   compatible    controller    controller#1    name    phandle    spin
   $ ls controller
   power_bus
   $ ls controller#1
   power_bus_emergency

Tested-by: Alan Tull
Signed-off-by: Frank Rowand
---
 drivers/of/unittest-data/Makefile                  |  1 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  | 28 ++
 drivers/of/unittest.c                              |  5 +
 3 files changed, 34 insertions(+)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts

diff --git a/drivers/of/unittest-data/Makefile b/drivers/of/unittest-data/Makefile
index 013d85e694c6..166dbdbfd1c5 100644
--- a/drivers/of/unittest-data/Makefile
+++ b/drivers/of/unittest-data/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_OF_OVERLAY) += overlay.dtb.o \
 			    overlay_12.dtb.o \
 			    overlay_13.dtb.o \
 			    overlay_15.dtb.o \
+			    overlay_bad_add_dup_node.dtb.o \
 			    overlay_bad_phandle.dtb.o \
 			    overlay_bad_symbol.dtb.o \
 			    overlay_base.dtb.o
diff --git a/drivers/of/unittest-data/overlay_bad_add_dup_node.dts b/drivers/of/unittest-data/overlay_bad_add_dup_node.dts
new file mode 100644
index 000000000000..145dfc3b1024
--- /dev/null
+++ b/drivers/of/unittest-data/overlay_bad_add_dup_node.dts
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/dts-v1/;
+/plugin/;
+
+/*
+ * &electric_1/motor-1 and &spin_ctrl_1 are the same node:
+ *   /testcase-data-2/substation@100/motor-1
+ *
+ * Thus the new node "controller" in each fragment will
+ * result in an attempt to add the same node twice.
+ * This will result in an error and the overlay apply
+ * will fail.
+ */
+
+&electric_1 {
+
+	motor-1 {
+		controller {
+			power_bus = < 0x1 0x2 >;
+		};
+	};
+};
+
+&spin_ctrl_1 {
+	controller {
+		power_bus_emergency = < 0x101 0x102 >;
+	};
+};
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 49ae2aa744d6..f82edf829f43 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -2161,6 +2161,7 @@ struct overlay_info {
 OVERLAY_INFO_EXTERN(overlay_12);
 OVERLAY_INFO_EXTERN(overlay_13);
 OVERLAY_INFO_EXTERN(overlay_15);
+OVERLAY_INFO_EXTERN(overlay_bad_add_dup_node);
 OVERLAY_INFO_EXTERN(overlay_bad_phandle);
 OVERLAY_INFO_EXTERN(overlay_bad_symbol);
 
@@ -2183,6 +2184,7 @@ struct overlay_info {
 	OVERLAY_INFO(overlay_12, 0),
 	OVERLAY_INFO(overlay_13, 0),
 	OVERLAY_INFO(overlay_15, 0),
+	OVERLAY_INFO(overlay_bad_add_dup_node, -EINVAL),
 	OVERLAY_INFO(overlay_bad_phandle, -EINVAL),
 	OVERLAY_INFO(overlay_bad_symbol, -EINVAL),
 	{}
@@ -2430,6 +2432,9 @@ static __init void of_unittest_overlay_high_level(void)
 	unittest(overlay_data_apply("overlay", NULL),
 		 "Adding overlay 'overlay' failed\n");
 
+	unittest(overlay_data_apply("overlay_bad_add_dup_node", NULL),
+		 "Adding overlay 'overlay_bad_add_dup_node' failed\n");
+
 	unittest(overlay_data_apply("overlay_bad_phandle", NULL),
 		 "Adding overlay 'overlay_bad_phandle' failed\n");
-- 
Frank Rowand
[PATCH v7 09/17] of: overlay: make all pr_debug() and pr_err() messages unique
From: Frank Rowand Make overlay.c debug and error messages unique so that they can be unambiguously found by grep. Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/overlay.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 72bf00adb9c8..c8f88b0836a3 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -507,7 +507,7 @@ static int build_changeset_symbols_node(struct overlay_changeset *ovcs, for_each_property_of_node(overlay_symbols_node, prop) { ret = add_changeset_property(ovcs, target, prop, 1); if (ret) { - pr_debug("Failed to apply prop @%pOF/%s, err=%d\n", + pr_debug("Failed to apply symbols prop @%pOF/%s, err=%d\n", target->np, prop->name, ret); return ret; } @@ -551,7 +551,8 @@ static int build_changeset(struct overlay_changeset *ovcs) ret = build_changeset_next_level(ovcs, , fragment->overlay); if (ret) { - pr_debug("apply failed '%pOF'\n", fragment->target); + pr_debug("fragment apply failed '%pOF'\n", +fragment->target); return ret; } } @@ -564,7 +565,8 @@ static int build_changeset(struct overlay_changeset *ovcs) ret = build_changeset_symbols_node(ovcs, , fragment->overlay); if (ret) { - pr_debug("apply failed '%pOF'\n", fragment->target); + pr_debug("symbols fragment apply failed '%pOF'\n", +fragment->target); return ret; } } @@ -873,7 +875,7 @@ static int of_overlay_apply(const void *fdt, struct device_node *tree, ret = __of_changeset_apply_notify(>cset); if (ret) - pr_err("overlay changeset entry notify error %d\n", ret); + pr_err("overlay apply changeset entry notify error %d\n", ret); /* notify failure is not fatal, continue */ list_add_tail(>ovcs_list, _list); @@ -1132,7 +1134,7 @@ int of_overlay_remove(int *ovcs_id) ret = __of_changeset_revert_notify(>cset); if (ret) - pr_err("overlay changeset entry notify error %d\n", ret); + pr_err("overlay remove changeset entry notify error %d\n", ret); /* notify failure is not fatal, continue */ *ovcs_id = 0; -- 
Frank Rowand
[PATCH v7 08/17] of: overlay: validate overlay properties #address-cells and #size-cells
From: Frank Rowand If overlay properties #address-cells or #size-cells are already in the live devicetree for any given node, then the values in the overlay must match the values in the live tree. If the properties are already in the live tree then there is no need to create a changeset entry to add them since they must have the same value. This reduces the memory used by the changeset and eliminates a possible memory leak. Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/overlay.c | 32 +--- include/linux/of.h | 6 ++ 2 files changed, 35 insertions(+), 3 deletions(-) diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 15be3da34fef..72bf00adb9c8 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -287,7 +287,12 @@ static struct property *dup_and_fixup_symbol_prop( * @target may be either in the live devicetree or in a new subtree that * is contained in the changeset. * - * Some special properties are not updated (no error returned). + * Some special properties are not added or updated (no error returned): + * "name", "phandle", "linux,phandle". + * + * Properties "#address-cells" and "#size-cells" are not updated if they + * are already in the live tree, but if present in the live tree, the values + * in the overlay must match the values in the live tree. * * Update of property in symbols node is not allowed. 
* @@ -300,6 +305,7 @@ static int add_changeset_property(struct overlay_changeset *ovcs, { struct property *new_prop = NULL, *prop; int ret = 0; + bool check_for_non_overlay_node = false; if (!of_prop_cmp(overlay_prop->name, "name") || !of_prop_cmp(overlay_prop->name, "phandle") || @@ -322,12 +328,32 @@ static int add_changeset_property(struct overlay_changeset *ovcs, if (!new_prop) return -ENOMEM; - if (!prop) + if (!prop) { + check_for_non_overlay_node = true; ret = of_changeset_add_property(>cset, target->np, new_prop); - else + } else if (!of_prop_cmp(prop->name, "#address-cells")) { + if (!of_prop_val_eq(prop, new_prop)) { + pr_err("ERROR: changing value of #address-cells is not allowed in %pOF\n", + target->np); + ret = -EINVAL; + } + } else if (!of_prop_cmp(prop->name, "#size-cells")) { + if (!of_prop_val_eq(prop, new_prop)) { + pr_err("ERROR: changing value of #size-cells is not allowed in %pOF\n", + target->np); + ret = -EINVAL; + } + } else { + check_for_non_overlay_node = true; ret = of_changeset_update_property(>cset, target->np, new_prop); + } + + if (check_for_non_overlay_node && + !of_node_check_flag(target->np, OF_OVERLAY)) + pr_err("WARNING: memory leak will occur if overlay removed, property: %pOF/%s\n", + target->np, new_prop->name); if (ret) { kfree(new_prop->name); diff --git a/include/linux/of.h b/include/linux/of.h index 664cd5573ae2..18ac8921e90c 100644 --- a/include/linux/of.h +++ b/include/linux/of.h @@ -990,6 +990,12 @@ static inline int of_map_rid(struct device_node *np, u32 rid, #define of_node_cmp(s1, s2)strcasecmp((s1), (s2)) #endif +static inline int of_prop_val_eq(struct property *p1, struct property *p2) +{ + return p1->length == p2->length && + !memcmp(p1->value, p2->value, (size_t)p1->length); +} + #if defined(CONFIG_OF) && defined(CONFIG_NUMA) extern int of_node_to_nid(struct device_node *np); #else -- Frank Rowand
[PATCH v7 07/17] of: overlay: reorder fields in struct fragment
From: Frank Rowand Order the fields of struct fragment in the same order as struct of_overlay_notify_data. The order in struct fragment is not significant. If both structs are ordered the same then when examining the data in a debugger or dump the human involved does not have to remember which context they are examining. Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/overlay.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 9808aae4621a..15be3da34fef 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -49,8 +49,8 @@ struct target { * @overlay: pointer to the __overlay__ node */ struct fragment { - struct device_node *target; struct device_node *overlay; + struct device_node *target; }; /** -- Frank Rowand
[PATCH v7 06/17] of: overlay: do not duplicate properties from overlay for new nodes
From: Frank Rowand When allocating a new node, add_changeset_node() was duplicating the properties from the respective node in the overlay instead of allocating a node with no properties. When this patch is applied the errors reported by the devicetree unittest from patch "of: overlay: add tests to validate kfrees from overlay removal" will no longer occur. These error messages are of the form: "OF: ERROR: ..." and the unittest results will change from: ### dt-test ### end of unittest - 203 passed, 7 failed to ### dt-test ### end of unittest - 210 passed, 0 failed Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/overlay.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 6fd8e6145e10..9808aae4621a 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -393,7 +393,7 @@ static int add_changeset_node(struct overlay_changeset *ovcs, break; if (!tchild) { - tchild = __of_node_dup(node, node_kbasename); + tchild = __of_node_dup(NULL, node_kbasename); if (!tchild) return -ENOMEM; -- Frank Rowand
[PATCH v7 05/17] of: overlay: use prop add changeset entry for property in new nodes
From: Frank Rowand The changeset entry 'update property' was used for new properties in an overlay instead of 'add property'. The decision of whether to use 'update property' was based on whether the property already exists in the subtree where the node is being spliced into. At the top level of creating a changeset describing the overlay, the target node is in the live devicetree, so checking whether the property exists in the target node returns the correct result. As soon as the changeset creation algorithm recurses into a new node, the target is no longer in the live devicetree, but is instead in the detached overlay tree, thus all properties are incorrectly found to already exist in the target. This fix will expose another devicetree bug that will be fixed in the following patch in the series. When this patch is applied the errors reported by the devicetree unittest will change, and the unittest results will change from: ### dt-test ### end of unittest - 210 passed, 0 failed to ### dt-test ### end of unittest - 203 passed, 7 failed Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/overlay.c | 112 ++- 1 file changed, 74 insertions(+), 38 deletions(-) diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 7613f7d680c7..6fd8e6145e10 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -24,6 +24,26 @@ #include "of_private.h" /** + * struct target - info about current target node as recursing through overlay + * @np: node where current level of overlay will be applied + * @in_livetree: @np is a node in the live devicetree + * + * Used in the algorithm to create the portion of a changeset that describes + * an overlay fragment, which is a devicetree subtree. Initially @np is a node + * in the live devicetree where the overlay subtree is targeted to be grafted + * into. 
When recursing to the next level of the overlay subtree, the target + * also recurses to the next level of the live devicetree, as long as overlay + * subtree node also exists in the live devicetree. When a node in the overlay + * subtree does not exist at the same level in the live devicetree, target->np + * points to a newly allocated node, and all subsequent targets in the subtree + * will be newly allocated nodes. + */ +struct target { + struct device_node *np; + bool in_livetree; +}; + +/** * struct fragment - info about fragment nodes in overlay expanded device tree * @target:target of the overlay operation * @overlay: pointer to the __overlay__ node @@ -72,8 +92,7 @@ static int devicetree_corrupt(void) } static int build_changeset_next_level(struct overlay_changeset *ovcs, - struct device_node *target_node, - const struct device_node *overlay_node); + struct target *target, const struct device_node *overlay_node); /* * of_resolve_phandles() finds the largest phandle in the live tree. @@ -257,14 +276,17 @@ static struct property *dup_and_fixup_symbol_prop( /** * add_changeset_property() - add @overlay_prop to overlay changeset * @ovcs: overlay changeset - * @target_node: where to place @overlay_prop in live tree + * @target:where @overlay_prop will be placed * @overlay_prop: property to add or update, from overlay tree * @is_symbols_prop: 1 if @overlay_prop is from node "/__symbols__" * - * If @overlay_prop does not already exist in @target_node, add changeset entry - * to add @overlay_prop in @target_node, else add changeset entry to update + * If @overlay_prop does not already exist in live devicetree, add changeset + * entry to add @overlay_prop in @target, else add changeset entry to update * value of @overlay_prop. * + * @target may be either in the live devicetree or in a new subtree that + * is contained in the changeset. + * * Some special properties are not updated (no error returned). * * Update of property in symbols node is not allowed. 
@@ -273,20 +295,22 @@ static struct property *dup_and_fixup_symbol_prop( * invalid @overlay. */ static int add_changeset_property(struct overlay_changeset *ovcs, - struct device_node *target_node, - struct property *overlay_prop, + struct target *target, struct property *overlay_prop, bool is_symbols_prop) { struct property *new_prop = NULL, *prop; int ret = 0; - prop = of_find_property(target_node, overlay_prop->name, NULL); - if (!of_prop_cmp(overlay_prop->name, "name") || !of_prop_cmp(overlay_prop->name, "phandle") || !of_prop_cmp(overlay_prop->name, "linux,phandle")) return 0; + if (target->in_livetree) + prop = of_find_property(target->np, overlay_prop->name, NULL); + else + prop = NULL; + if (is_symbols_prop) { if (prop)
[PATCH v7 04/17] powerpc/pseries: add of_node_put() in dlpar_detach_node()
From: Frank Rowand The previous commit, "of: overlay: add missing of_node_get() in __of_attach_node_sysfs" added a missing of_node_get() to __of_attach_node_sysfs(). This results in a refcount imbalance for nodes attached with dlpar_attach_node(). The calling sequence from dlpar_attach_node() to __of_attach_node_sysfs() is: dlpar_attach_node() of_attach_node() __of_attach_node_sysfs() For more detailed description of the node refcount, see commit 68baf692c435 ("powerpc/pseries: Fix of_node_put() underflow during DLPAR remove"). Tested-by: Alan Tull Acked-by: Michael Ellerman Signed-off-by: Frank Rowand --- arch/powerpc/platforms/pseries/dlpar.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 7625546caefd..17958043e7f7 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -270,6 +270,8 @@ int dlpar_detach_node(struct device_node *dn) if (rc) return rc; + of_node_put(dn); + return 0; } -- Frank Rowand
[PATCH v7 03/17] of: overlay: add missing of_node_get() in __of_attach_node_sysfs
From: Frank Rowand There is a matching of_node_put() in __of_detach_node_sysfs(). Remove misleading comment from function header comment for of_detach_node(). This patch may result in memory leaks from code that calls the dynamic node add and delete functions directly instead of using changesets. This commit should result in powerpc systems that dynamically allocate a node, then later deallocate the node to have a memory leak when the node is deallocated. The next commit will fix the leak. Tested-by: Alan Tull Acked-by: Michael Ellerman (powerpc) Signed-off-by: Frank Rowand --- drivers/of/dynamic.c | 3 --- drivers/of/kobj.c | 4 +++- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c index 12c3f9a15e94..146681540487 100644 --- a/drivers/of/dynamic.c +++ b/drivers/of/dynamic.c @@ -272,9 +272,6 @@ void __of_detach_node(struct device_node *np) /** * of_detach_node() - "Unplug" a node from the device tree. - * - * The caller must hold a reference to the node. The memory associated with - * the node is not freed until its refcount goes to zero. */ int of_detach_node(struct device_node *np) { diff --git a/drivers/of/kobj.c b/drivers/of/kobj.c index 7a0a18980b98..c72eef988041 100644 --- a/drivers/of/kobj.c +++ b/drivers/of/kobj.c @@ -133,6 +133,9 @@ int __of_attach_node_sysfs(struct device_node *np) } if (!name) return -ENOMEM; + + of_node_get(np); + rc = kobject_add(>kobj, parent, "%s", name); kfree(name); if (rc) @@ -159,6 +162,5 @@ void __of_detach_node_sysfs(struct device_node *np) kobject_del(>kobj); } - /* finally remove the kobj_init ref */ of_node_put(np); } -- Frank Rowand
[PATCH v7 02/17] of: overlay: add missing of_node_put() after add new node to changeset
From: Frank Rowand The refcount of a newly added overlay node decrements to one (instead of zero) when the overlay changeset is destroyed. This change will cause the final decrement to be zero. After applying this patch, new validation warnings will be reported from the devicetree unittest during boot due to a pre-existing devicetree bug. The warnings will be similar to: OF: ERROR: memory leak before free overlay changeset, /testcase-data/overlay-node/test-bus/test-unittest4 This pre-existing devicetree bug will also trigger a WARN_ONCE() from refcount_sub_and_test_checked() when an overlay changeset is destroyed without having first been applied. This scenario occurs when an error in the overlay is detected during the overlay changeset creation: WARNING: CPU: 0 PID: 1 at lib/refcount.c:187 refcount_sub_and_test_checked+0xa8/0xbc refcount_t: underflow; use-after-free. (unwind_backtrace) from (show_stack+0x10/0x14) (show_stack) from (dump_stack+0x6c/0x8c) (dump_stack) from (__warn+0xdc/0x104) (__warn) from (warn_slowpath_fmt+0x44/0x6c) (warn_slowpath_fmt) from (refcount_sub_and_test_checked+0xa8/0xbc) (refcount_sub_and_test_checked) from (kobject_put+0x24/0x208) (kobject_put) from (of_changeset_destroy+0x2c/0xb4) (of_changeset_destroy) from (free_overlay_changeset+0x1c/0x9c) (free_overlay_changeset) from (of_overlay_remove+0x284/0x2cc) (of_overlay_remove) from (of_unittest_apply_revert_overlay_check.constprop.4+0xf8/0x1e8) (of_unittest_apply_revert_overlay_check.constprop.4) from (of_unittest_overlay+0x960/0xed8) (of_unittest_overlay) from (of_unittest+0x1cc4/0x2138) (of_unittest) from (do_one_initcall+0x4c/0x28c) (do_one_initcall) from (kernel_init_freeable+0x29c/0x378) (kernel_init_freeable) from (kernel_init+0x8/0x110) (kernel_init) from (ret_from_fork+0x14/0x2c) Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/overlay.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 
f5fc8859a7ee..7613f7d680c7 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -379,7 +379,9 @@ static int add_changeset_node(struct overlay_changeset *ovcs, if (ret) return ret; - return build_changeset_next_level(ovcs, tchild, node); + ret = build_changeset_next_level(ovcs, tchild, node); + of_node_put(tchild); + return ret; } if (node->phandle && tchild->phandle) -- Frank Rowand
[PATCH v7 01/17] of: overlay: add tests to validate kfrees from overlay removal
From: Frank Rowand Add checks: - attempted kfree due to refcount reaching zero before overlay is removed - properties linked to an overlay node when the node is removed - node refcount > one during node removal in a changeset destroy, if the node was created by the changeset After applying this patch, several validation warnings will be reported from the devicetree unittest during boot due to pre-existing devicetree bugs. The warnings will be similar to: OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11 OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/ hvac-medium-2 Tested-by: Alan Tull Signed-off-by: Frank Rowand --- drivers/of/dynamic.c | 29 + drivers/of/overlay.c | 1 + include/linux/of.h | 15 ++- 3 files changed, 40 insertions(+), 5 deletions(-) diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c index f4f8ed9b5454..12c3f9a15e94 100644 --- a/drivers/of/dynamic.c +++ b/drivers/of/dynamic.c @@ -330,6 +330,25 @@ void of_node_release(struct kobject *kobj) if (!of_node_check_flag(node, OF_DYNAMIC)) return; + if (of_node_check_flag(node, OF_OVERLAY)) { + + if (!of_node_check_flag(node, OF_OVERLAY_FREE_CSET)) { + /* premature refcount of zero, do not free memory */ + pr_err("ERROR: memory leak before free overlay changeset, %pOF\n", + node); + return; + } + + /* +* If node->properties non-empty then properties were added +* to this node either by different overlay that has not +* yet been removed, or by a non-overlay mechanism. 
+*/ + if (node->properties) + pr_err("ERROR: %s(), unexpected properties in %pOF\n", + __func__, node); + } + property_list_free(node->properties); property_list_free(node->deadprops); @@ -434,6 +453,16 @@ struct device_node *__of_node_dup(const struct device_node *np, static void __of_changeset_entry_destroy(struct of_changeset_entry *ce) { + if (ce->action == OF_RECONFIG_ATTACH_NODE && + of_node_check_flag(ce->np, OF_OVERLAY)) { + if (kref_read(>np->kobj.kref) > 1) { + pr_err("ERROR: memory leak, expected refcount 1 instead of %d, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node %pOF\n", + kref_read(>np->kobj.kref), ce->np); + } else { + of_node_set_flag(ce->np, OF_OVERLAY_FREE_CSET); + } + } + of_node_put(ce->np); list_del(>node); kfree(ce); diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c index 42b1f73ac5f6..f5fc8859a7ee 100644 --- a/drivers/of/overlay.c +++ b/drivers/of/overlay.c @@ -373,6 +373,7 @@ static int add_changeset_node(struct overlay_changeset *ovcs, return -ENOMEM; tchild->parent = target_node; + of_node_set_flag(tchild, OF_OVERLAY); ret = of_changeset_attach_node(>cset, tchild); if (ret) diff --git a/include/linux/of.h b/include/linux/of.h index a5aee3c438ad..664cd5573ae2 100644 --- a/include/linux/of.h +++ b/include/linux/of.h @@ -138,11 +138,16 @@ static inline void of_node_put(struct device_node *node) { } extern struct device_node *of_stdout; extern raw_spinlock_t devtree_lock; -/* flag descriptions (need to be visible even when !CONFIG_OF) */ -#define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */ -#define OF_DETACHED2 /* node has been detached from the device tree */ -#define OF_POPULATED 3 /* device already created for the node */ -#define OF_POPULATED_BUS 4 /* of_platform_populate recursed to children of this node */ +/* + * struct device_node flag descriptions + * (need to be visible even when !CONFIG_OF) + */ +#define OF_DYNAMIC 1 /* (and properties) allocated via kmalloc */ 
+#define OF_DETACHED2 /* detached from the device tree */ +#define OF_POPULATED 3 /* device already created */ +#define OF_POPULATED_BUS 4 /* platform bus created for children */ +#define OF_OVERLAY 5 /* allocated for an overlay */ +#define OF_OVERLAY_FREE_CSET 6 /* in overlay cset being freed */ #define OF_BAD_ADDR((u64)-1) -- Frank Rowand
[PATCH v7 00/17] of: overlay: validation checks, subsequent fixes
From: Frank Rowand Add checks to (1) overlay apply process and (2) memory freeing triggered by overlay release. The checks are intended to detect possible memory leaks and invalid overlays. The checks revealed bugs in existing code. Fixed the bugs. While fixing bugs, noted other issues, which are fixed in separate patches. FPGA folks: I made the validation checks that should result in an invalid live devicetree report "ERROR" and cause the overlay apply to fail. I made the memory leak validation tests report "WARNING" and allow the overlay apply to complete successfully. Please let me know if you encounter the warnings. There are at least two paths forward to deal with the cases that trigger the warning: (1) change the warning to an error and fail the overlay apply, or (2) find a way to detect the potential memory leaks and free the memory appropriately. ALL people: The validations do _not_ address another major concern I have with releasing overlays, which is use after free errors. Changes since v6: - 03/18 Add acked-by from Michael Ellerman - 03/18 Move info from post "---" into header comment - 04/18 Add acked-by from Michael Ellerman - 03/18 Move info from post "---" into header comment, add more info - 07/18 Drop. Changes since v5: - move from 4.19-rc1 to 4.20-rc1 - all patches: add tested-by Alan Tull - 05/18: update for context change from commit a613b26a50136 ("of: Convert to using %pOFn instead of device_node.name") Changes since v4: - 01/18: make error message format consistent, error first, path last - 09/18: create of_prop_val_eq() and change open code to use it - 09/18: remove extra blank lines Changes since v3: - 01/18: Add expected value of refcount for destroy cset entry error. Also explain the cause of the error. 
- 09/18: for errors of an overlay changing the value of #size-cells or #address-cells, return -EINVAL so that overlay apply will fail - 09/18: for errors of an overlay changing the value of #size-cells or #address-cells, make the message more direct. Old message: OF: overlay: ERROR: overlay and/or live tree #size-cells invalid in node /soc/base_fpga_region New message: OF: overlay: ERROR: changing value of /soc/base_fpga_region/#size-cells not allowed - 13/18: Update patch comment header to state that this patch modifies the previous patch to not return immediately on fragment error and explain this is not a performance issue. - 13/18: remove redundant "overlay" from two error messages. "OF: overlay:" is already present in pr_fmt() Changes since v2: - 13/18: Use continue to reduce indentation in find_dup_cset_node_entry() and find_dup_cset_prop() Changes since v1: - move patch 16/16 to 17/18 - move patch 15/16 to 18/18 - new patch 15/18 - new patch 16/18 - 05/18: add_changeset_node() header comment: incorrect comment for @target - 18/18: add same fix for of_parse_phandle_with_args() - 18/18: add same fix for of_parse_phandle_with_args_map() Frank Rowand (17): of: overlay: add tests to validate kfrees from overlay removal of: overlay: add missing of_node_put() after add new node to changeset of: overlay: add missing of_node_get() in __of_attach_node_sysfs powerpc/pseries: add of_node_put() in dlpar_detach_node() of: overlay: use prop add changeset entry for property in new nodes of: overlay: do not duplicate properties from overlay for new nodes of: overlay: reorder fields in struct fragment of: overlay: validate overlay properties #address-cells and #size-cells of: overlay: make all pr_debug() and pr_err() messages unique of: overlay: test case of two fragments adding same node of: overlay: check prevents multiple fragments add or delete same node of: overlay: check prevents multiple fragments touching same property of: unittest: remove unused 
of_unittest_apply_overlay() argument of: overlay: set node fields from properties when add new overlay node of: unittest: allow base devicetree to have symbol metadata of: unittest: find overlays[] entry by name instead of index of: unittest: initialize args before calling of_*parse_*() arch/powerpc/platforms/pseries/dlpar.c | 2 + drivers/of/dynamic.c | 59 - drivers/of/kobj.c | 4 +- drivers/of/overlay.c | 292 - drivers/of/unittest-data/Makefile | 2 + .../of/unittest-data/overlay_bad_add_dup_node.dts | 28 ++ .../of/unittest-data/overlay_bad_add_dup_prop.dts | 24 ++ drivers/of/unittest-data/overlay_base.dts | 1 + drivers/of/unittest.c | 96 +-- include/linux/of.h | 21 +- 10 files changed, 432 insertions(+), 97 deletions(-) create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts create mode
Re: [Skiboot] Important details about race condition in EEH/NVMe-issue on ppc64le.
On Fri, Nov 9, 2018 at 2:30 AM Koltsov Dmitriy wrote: > > Hi, Oliver. > > Your version of EEH/NVMe-issue looks close to be true. > > I've applied simple patch to ./drivers/nvme/host/pci.c file: > > @@ -29,6 +29,7 @@ > #include > #include > #include > +#include > > #include "nvme.h" > > @@ -2073,11 +2074,14 @@ > > static int nvme_pci_enable(struct nvme_dev *dev) > { > +static int num=0; > int result = -ENOMEM; > struct pci_dev *pdev = to_pci_dev(dev->dev); > +num=num+1; > > if (pci_enable_device_mem(pdev)) > return result; > +msleep(num); > > pci_set_master(pdev); > > > and find out approximate minimal estimation of delay which removes the > EEH/NVMe-issue. > This delay is about 1 millisecond (see in the code above). So, it's a stable > phenomena > that when all nvme disks have minimum mutual interval of 1(one) millisecond > (each) to > enable upstream bridge then EEH/NVMe-issue is not reproducible. Hence, I have > the > hypothesis that due to bar alignment patch (see prev message) the bridge > enabling procedure > lasts significantly more time than in case of alignment=0. > So, in correspondence with your > hypothesis about several threads "pinging" PHB, it turns out that one of that > threads > begins to write to PHB, while necessary time is not passed yet (~1 ms). No, only the first thread ever touches the hardware (PCI config space, really). The first thread will set an in-memory flag which is seen by the other threads and taken to mean that the bridge is already enabled. > HW error follows and readl() of NVME_REG_CSTS returns 0x in > nvme_pci_enable(). > (And when align=0 the PHB enable procedure duration is significantly less > than 1ms so > other threads which don't enable PHB themselves - has no conflict, because by > the time > they tries to write to PHB - PHB is already really enabled). > I don't see here contradictions in such version of events sequence leading > to EEH/NVMe-issue. > > So, there are two questions: > > Q.1. 
Does my hypothesis about delay=1ms and about difference of PHB enable > procedure > duration in two cases (align=0 and align=64K) looks close to be true ? *shrug* Probably? Even if there is a difference in timing I don't see why that is a problem. You might be taking (or not taking) a page fault due to the alignment patch, or maybe there's some extra state inside of linux that needed to be updated for some reason. It could be anything. Fundamentally, the problem here is that Linux has a race condition. Debugging the exact set of circumstances required to trigger the race is sometimes worth it, but I don't think this is one of those cases. If you honestly believe that there is a problem here then can you provide us with some actual data rather than speculating about HW bugs? The Linux ftrace (function_graph is useful) infrastructure should allow you to do accurate measurements of the whole process and get a breakdown of what the slow parts are. Once you have a good overview of what code paths are being exercised in each case then you can do more detailed measurements of specific functions. > Q.2. If answer is 'yes' to Q.1 - are there some procedures in FW or HW > implementation in > POWER8 PHB that may "extend" phb enable procedure time execution because of > alignment change > to 64K ? May be in OPAL or on-chip ? Or the only "suspected area" for > EEH/NVMe-issue - > is in linux kernel procedures ? I don't see any reason to suspect OPAL or the chip here. Oliver > > > > Regards, > Dmitriy. >
[PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED
While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a pending signal in the L0 QEMU process can generate the following sequence:

  ret0 = kvmppc_pseries_do_hcall()
    ret1 = kvmhv_enter_nested_guest()
      ret2 = kvmhv_run_single_vcpu()
        if (ret2 == -EINTR)
          return H_INTERRUPT
      if (ret1 == H_INTERRUPT)
        kvmppc_set_gpr(vcpu, 3, 0)
        return -EINTR
      /* skipped: */
      kvmppc_set_gpr(vcpu, 3, ret)
      vcpu->arch.hcall_needed = 0
      return RESUME_GUEST

which causes an exit to L0 userspace with ret0 == -EINTR. The intention seems to be to set the hcall return value to 0 (via VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED once we resume executing the VCPU. However, because we don't set vcpu->arch.hcall_needed = 0, we do the following once userspace resumes execution via kvm_arch_vcpu_ioctl_run():

  ...
  } else if (vcpu->arch.hcall_needed) {
          int i;

          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                  kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;

since vcpu->arch.hcall_needed == 1 indicates that userspace should have handled the hcall and stored the return value in run->papr_hcall.ret. 
Since that's not the case here, we can get an unexpected value in VCPU r3, which can result in kvmhv_p9_guest_entry() reporting an unexpected trap value when it returns from H_ENTER_NESTED, causing the following register dump to console via subsequent call to kvmppc_handle_exit_hv() in L1: [ 350.612854] vcpu f9564cf8 (0): [ 350.612915] pc = c013eb98 msr = 80009033 trap = 1 [ 350.613020] r 0 = c04b9044 r16 = [ 350.613075] r 1 = c0007cffba30 r17 = [ 350.613120] r 2 = c178c100 r18 = 7fffc24f3b50 [ 350.613166] r 3 = c0007ef52480 r19 = 7fffc24fff58 [ 350.613212] r 4 = r20 = 0a1e96ece9d0 [ 350.613253] r 5 = 70616d00746f6f72 r21 = 0a1ea117c9b0 [ 350.613295] r 6 = 0020 r22 = 0a1ea1184360 [ 350.613338] r 7 = c000783be440 r23 = 0003 [ 350.613380] r 8 = fffc r24 = 0a1e96e9e124 [ 350.613423] r 9 = c0007ef52490 r25 = 07ff [ 350.613469] r10 = 0004 r26 = c0007eb2f7a0 [ 350.613513] r11 = b0616d0009eccdb2 r27 = c0007cffbb10 [ 350.613556] r12 = c04b9000 r28 = c0007d83a2c0 [ 350.613597] r13 = c1b0 r29 = c000783cdf68 [ 350.613639] r14 = r30 = [ 350.613681] r15 = r31 = c0007cffbbf0 [ 350.613723] ctr = c04b9000 lr = c04b9044 [ 350.613765] srr0 = 772f954dd48c srr1 = 8280f033 [ 350.613808] sprg0 = sprg1 = c1b0 [ 350.613859] sprg2 = 772f9565a280 sprg3 = [ 350.613911] cr = 88002848 xer = 2004 dsisr = 4200 [ 350.613962] dar = 772f9539 [ 350.614031] fault dar = c00244b278c0 dsisr = [ 350.614073] SLB (0 entries): [ 350.614157] lpcr = 004003d40413 sdr1 = last_inst = [ 350.614252] trap=0x1 | pc=0xc013eb98 | msr=0x80009033 followed by L1's QEMU reporting the following before stopping execution of the nested guest: KVM: unknown exit, hardware reason 1 NIP c013eb98 LR c04b9044 CTR c04b9000 XER 2004 CPU#0 MSR 80009033 HID0 HF 8000 iidx 3 didx 3 TB DECR GPR00 c04b9044 c0007cffba30 c178c100 c0007ef52480 GPR04 70616d00746f6f72 0020 c000783be440 GPR08 fffc c0007ef52490 0004 b0616d0009eccdb2 GPR12 c04b9000 c1b0 GPR16 7fffc24f3b50 7fffc24fff58 GPR20 0a1e96ece9d0 0a1ea117c9b0 0a1ea1184360 0003 GPR24 
0a1e96e9e124 07ff c0007eb2f7a0 c0007cffbb10
GPR28 c0007d83a2c0 c000783cdf68 c0007cffbbf0
CR 88002848 [ L L - - E L G L ] RES
SRR0 772f954dd48c SRR1 8280f033 PVR 004e1202 VRSAVE
SPRG0 SPRG1 c1b0 SPRG2 772f9565a280 SPRG3
SPRG4 SPRG5 SPRG6 SPRG7
HSRR0 HSRR1 CFAR
LPCR 03d40413 PTCR
DAR 772f9539 DSISR 4200

Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
of H_ENTER_NESTED before we exit to L0 userspace.

Cc: linuxppc-...@ozlabs.org
Cc: David Gibson
Cc: Paul Mackerras
Cc:
RE: [PATCH 5/6] pci: layerscape: Add the EP mode support.
-Original Message-
From: Xiaowei Bao
Sent: November 6, 2018 14:48
To: 'Kishon Vijay Abraham I' ; bhelg...@google.com; robh...@kernel.org;
mark.rutl...@arm.com; shawn...@kernel.org; Leo Li ;
lorenzo.pieral...@arm.com; a...@arndb.de; gre...@linuxfoundation.org;
M.h. Lian ; Mingkai Hu ; Roy Zang ; kstew...@linuxfoundation.org;
cyrille.pitc...@free-electrons.com; pombreda...@nexb.com;
shawn@rock-chips.com; linux-...@vger.kernel.org;
devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
Cc: Jiafei Pan
Subject: RE: [PATCH 5/6] pci: layerscape: Add the EP mode support.

Hi Kishon,

-Original Message-
From: Kishon Vijay Abraham I
Sent: November 6, 2018 14:07
To: Xiaowei Bao ; bhelg...@google.com; robh...@kernel.org;
mark.rutl...@arm.com; shawn...@kernel.org; Leo Li ;
lorenzo.pieral...@arm.com; a...@arndb.de; gre...@linuxfoundation.org;
M.h. Lian ; Mingkai Hu ; Roy Zang ; kstew...@linuxfoundation.org;
cyrille.pitc...@free-electrons.com; pombreda...@nexb.com;
shawn@rock-chips.com; linux-...@vger.kernel.org;
devicet...@vger.kernel.org; linux-ker...@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
Cc: Jiafei Pan
Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support.

(Removed Niklas as mails to him are bouncing)

Hi,

Please fix your email client. Refer Documentation/process/email-clients.rst

On 05/11/18 2:45 PM, Xiaowei Bao wrote:
>
>
> -Original Message-
> From: Kishon Vijay Abraham I
> Sent: November 5, 2018 16:57
> To: Xiaowei Bao ; bhelg...@google.com;
> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li
> ; lorenzo.pieral...@arm.com; a...@arndb.de;
> gre...@linuxfoundation.org; M.h.
Lian ; Mingkai > Hu ; Roy Zang ; > kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com; > pombreda...@nexb.com; shawn@rock-chips.com; > niklas.cas...@axis.com; linux-...@vger.kernel.org; > devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; > linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org > Cc: Jiafei Pan > Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support. > > Hi, > > On 31/10/18 4:08 PM, Xiaowei Bao wrote: >> >> >> -Original Message- >> From: Kishon Vijay Abraham I >> Sent: 2018年10月31日 12:15 >> To: Xiaowei Bao ; bhelg...@google.com; >> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo Li >> ; lorenzo.pieral...@arm.com; a...@arndb.de; >> gre...@linuxfoundation.org; M.h. Lian ; >> Mingkai Hu ; Roy Zang ; >> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com; >> pombreda...@nexb.com; shawn@rock-chips.com; >> niklas.cas...@axis.com; linux-...@vger.kernel.org; >> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; >> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org >> Cc: Jiafei Pan >> Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support. >> >> Hi, >> >> On 31/10/18 8:03 AM, Xiaowei Bao wrote: >>> >>> >>> -Original Message- >>> From: Xiaowei Bao >>> Sent: 2018年10月26日 17:19 >>> To: 'Kishon Vijay Abraham I' ; bhelg...@google.com; >>> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo >>> robh+Li >>> ; lorenzo.pieral...@arm.com; a...@arndb.de; >>> gre...@linuxfoundation.org; M.h. Lian ; >>> Mingkai Hu ; Roy Zang ; >>> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com; >>> pombreda...@nexb.com; shawn@rock-chips.com; >>> niklas.cas...@axis.com; linux-...@vger.kernel.org; >>> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; >>> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org >>> Cc: Jiafei Pan >>> Subject: RE: [PATCH 5/6] pci: layerscape: Add the EP mode support. 
>>> >>> >>> >>> -Original Message- >>> From: Kishon Vijay Abraham I >>> Sent: 2018年10月26日 13:29 >>> To: Xiaowei Bao ; bhelg...@google.com; >>> robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; Leo >>> robh+Li >>> ; lorenzo.pieral...@arm.com; a...@arndb.de; >>> gre...@linuxfoundation.org; M.h. Lian ; >>> Mingkai Hu ; Roy Zang ; >>> kstew...@linuxfoundation.org; cyrille.pitc...@free-electrons.com; >>> pombreda...@nexb.com; shawn@rock-chips.com; >>> niklas.cas...@axis.com; linux-...@vger.kernel.org; >>> devicet...@vger.kernel.org; linux-ker...@vger.kernel.org; >>> linux-arm-ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org >>> Subject: Re: [PATCH 5/6] pci: layerscape: Add the EP mode support. >>> >>> Hi, >>> >>> On Thursday 25 October 2018 04:39 PM, Xiaowei Bao wrote: Add the PCIe EP mode support for layerscape platform. Signed-off-by: Xiaowei Bao --- drivers/pci/controller/dwc/Makefile|2 +- drivers/pci/controller/dwc/pci-layerscape-ep.c | 161 2 files changed, 162 insertions(+), 1 deletions(-) create mode 100644
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On 11/08/2018 04:51 PM, Greg KH wrote:
> On Thu, Nov 08, 2018 at 10:49:08PM +, alex_gagn...@dellteam.com wrote:
>> In the case that we're trying to fix, this code executing is a result of
>> the device being gone, so we can guarantee race-free operation. I agree
>> that there is a race, in the general case. As far as checking the result
>> for all F's, that's not an option when firmware crashes the system as a
>> result of the mmio read/write. It's never pretty when firmware gets
>> involved.
>
> If you have firmware that crashes the system when you try to read from a
> PCI device that was hot-removed, that is broken firmware and needs to be
> fixed. The kernel can not work around that as again, you will never win
> that race.

But it's not the firmware that crashes. It's Linux, as a result of a
fatal error message from the firmware. And we can't fix that because FFS
handling requires that the system reboots [1].

If we're going to say that we don't want to support FFS because it's a
separate code path and a different flow, that's fine. I am myself not a
fan of FFS. But if we're going to continue supporting it, I think we'll
continue to have to resolve these sorts of unintended consequences.

Alex

[1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote: > > [EXTERNAL EMAIL] > Please report any suspicious attachments, links, or requests for sensitive > information. > > > On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote: >> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote: >>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: I'm having second thoughts about this. One thing I'm uncomfortable with is that sprinkling pci_dev_is_disconnected() around feels ad hoc instead of systematic, in the sense that I don't know how we convince ourselves that this (and only this) is the correct place to put it. >>> >>> I think my stance always has been that this call is not good at all >>> because once you call it you never really know if it is still true as >>> the device could have been removed right afterward. >>> >>> So almost any code that relies on it is broken, there is no locking and >>> it can and will race and you will loose. >> >> AIUI, we're not trying to create code to rely on this. This more about >> reducing reliance on hardware. If the software misses the race once and >> accesses disconnected device memory, that's usually not a big deal to >> let hardware sort it out, but the point is not to push our luck. > > Then why even care about this call at all? If you need to really know > if the read worked, you have to check the value. If the value is FF > then you have a huge hint that the hardware is now gone. And you can > rely on it being gone, you can never rely on making the call to the > function to check if the hardware is there to be still valid any point > in time after the call returns. In the case that we're trying to fix, this code executing is a result of the device being gone, so we can guarantee race-free operation. I agree that there is a race, in the general case. As far as checking the result for all F's, that's not an option when firmware crashes the system as a result of the mmio read/write. 
It's never pretty when firmware gets involved. Alex
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On 11/08/2018 02:09 PM, Bjorn Helgaas wrote: > > [EXTERNAL EMAIL] > Please report any suspicious attachments, links, or requests for sensitive > information. > > > [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about > PCI error recovery in general] Has anyone seen seen the ECRs in the PCIe base spec and ACPI that have been floating around the past few months? -- HPX, SFI, CER. Without divulging too much before publication, I'm curious on opinions on how well (or not well) those flows would work in general, and in linux. > On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote: > I'm having second thoughts about this. One thing I'm uncomfortable > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > instead of systematic, in the sense that I don't know how we convince > ourselves that this (and only this) is the correct place to put it. > > Another is that the only place we call pci_dev_set_disconnected() is > in pciehp and acpiphp, so the only "disconnected" case we catch is if > hotplug happens to be involved. Every MMIO read from the device is an > opportunity to learn whether it is reachable (a read from an > unreachable device typically returns ~0 data), but we don't do > anything at all with those. > > The config accessors already check pci_dev_is_disconnected(), so this > patch is really aimed at MMIO accesses. I think it would be more > robust if we added wrappers for readl() and writel() so we could > notice read errors and avoid future reads and writes. I wouldn't expect anything less than complete scrutiny and quality control of unquestionable moral integrity :). In theory ~0 can be a great indicator that something may be wrong. Though I think it's about as ad-hoc as pci_dev_is_disconnected(). I slightly like the idea of wrapping the MMIO accessors. 
There's still memcpy and DMA that cause the same MemRead/Wr PCIe
transactions, and the same sort of errors in PCIe land, and it would be
good to have more testing on this. Since this patch is tested and
confirmed to fix a known failure case, I would keep it, and then look at
fixing the problem in a more generic way.

BTW, a lot of the problems we're fixing here come courtesy of
firmware-first error handling. Do we reach a point where we draw a line
in handling new problems introduced by FFS? So, if something is a
problem with FFS, but not native handling, do we commit to supporting
it?

Alex
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On Thu, Nov 08, 2018 at 02:42:55PM -0800, Greg Kroah-Hartman wrote: > On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote: > > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote: > > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: > > > > I'm having second thoughts about this. One thing I'm uncomfortable > > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > > > > instead of systematic, in the sense that I don't know how we convince > > > > ourselves that this (and only this) is the correct place to put it. > > > > > > I think my stance always has been that this call is not good at all > > > because once you call it you never really know if it is still true as > > > the device could have been removed right afterward. > > > > > > So almost any code that relies on it is broken, there is no locking and > > > it can and will race and you will loose. > > > > AIUI, we're not trying to create code to rely on this. This more about > > reducing reliance on hardware. If the software misses the race once and > > accesses disconnected device memory, that's usually not a big deal to > > let hardware sort it out, but the point is not to push our luck. > > Then why even care about this call at all? If you need to really know > if the read worked, you have to check the value. If the value is FF > then you have a huge hint that the hardware is now gone. And you can > rely on it being gone, you can never rely on making the call to the > function to check if the hardware is there to be still valid any point > in time after the call returns. > > > Surprise hot remove is empirically more reliable the less we interact > > with hardware and firmware. That shouldn't be necessary, but is just an > > unfortunate reality. > > You are not "interacting", you are reading/writing to the hardware, as > you have to do so. So I really don't understand what you are talking > about here, sorry. 
We're reading hardware memory, yes, but the hardware isn't there. Something obviously needs to return FF, so we are indirectly interacting with whatever mechanism handles that. Sometimes that mechanism doesn't handle it gracefully and instead of having FF to consider, you have a machine check rebooting your system.
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On Thu, Nov 08, 2018 at 10:49:08PM +, alex_gagn...@dellteam.com wrote: > On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote: > > > > [EXTERNAL EMAIL] > > Please report any suspicious attachments, links, or requests for sensitive > > information. > > > > > > On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote: > >> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote: > >>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: > I'm having second thoughts about this. One thing I'm uncomfortable > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > instead of systematic, in the sense that I don't know how we convince > ourselves that this (and only this) is the correct place to put it. > >>> > >>> I think my stance always has been that this call is not good at all > >>> because once you call it you never really know if it is still true as > >>> the device could have been removed right afterward. > >>> > >>> So almost any code that relies on it is broken, there is no locking and > >>> it can and will race and you will loose. > >> > >> AIUI, we're not trying to create code to rely on this. This more about > >> reducing reliance on hardware. If the software misses the race once and > >> accesses disconnected device memory, that's usually not a big deal to > >> let hardware sort it out, but the point is not to push our luck. > > > > Then why even care about this call at all? If you need to really know > > if the read worked, you have to check the value. If the value is FF > > then you have a huge hint that the hardware is now gone. And you can > > rely on it being gone, you can never rely on making the call to the > > function to check if the hardware is there to be still valid any point > > in time after the call returns. > > In the case that we're trying to fix, this code executing is a result of > the device being gone, so we can guarantee race-free operation. I agree > that there is a race, in the general case. 
As far as checking the result > for all F's, that's not an option when firmware crashes the system as a > result of the mmio read/write. It's never pretty when firmware gets > involved. If you have firmware that crashes the system when you try to read from a PCI device that was hot-removed, that is broken firmware and needs to be fixed. The kernel can not work around that as again, you will never win that race. thanks, greg k-h
Re: [PATCH 6/9] PCI: consolidate PCI config entry in drivers/pci
On Fri, Oct 19, 2018 at 02:09:49PM +0200, Christoph Hellwig wrote: > There is no good reason to duplicate the PCI menu in every architecture. > Instead provide a selectable HAVE_PCI symbol that indicates availability > of PCI support and the handle the rest in drivers/pci. > > Note that for powerpc we now select HAVE_PCI globally instead of the > convoluted mess of conditional or or non-conditional support per board, > similar to what we do e.g. on x86. For alpha PCI is selected for the > non-jensen configs as it was the default before, and a lot of code does > not compile without PCI enabled. On other architectures with limited > PCI support that wasn't as complicated I've left the selection as-is. > > Signed-off-by: Christoph Hellwig > Acked-by: Max Filippov > Acked-by: Thomas Gleixner > Acked-by: Bjorn Helgaas Sounds like Masahiro plans to take this series and I'm fine with that. Minor typo below, since it sounds like there's another revision coming. > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -3,6 +3,18 @@ > # PCI configuration > # > > +config HAVE_PCI > + bool > + > +menuconfig PCI > + bool "PCI support" > + depends on HAVE_PCI > + > + help > + This option enables support for the PCI local bus, including > + support for PCI-X and the fundations for PCI Express support. s/fundations/foundations/ > + Say 'Y' here unless you know what you are doing. > + > source "drivers/pci/pcie/Kconfig"
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote: > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote: > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: > > > I'm having second thoughts about this. One thing I'm uncomfortable > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > > > instead of systematic, in the sense that I don't know how we convince > > > ourselves that this (and only this) is the correct place to put it. > > > > I think my stance always has been that this call is not good at all > > because once you call it you never really know if it is still true as > > the device could have been removed right afterward. > > > > So almost any code that relies on it is broken, there is no locking and > > it can and will race and you will loose. > > AIUI, we're not trying to create code to rely on this. This more about > reducing reliance on hardware. If the software misses the race once and > accesses disconnected device memory, that's usually not a big deal to > let hardware sort it out, but the point is not to push our luck. Then why even care about this call at all? If you need to really know if the read worked, you have to check the value. If the value is FF then you have a huge hint that the hardware is now gone. And you can rely on it being gone, you can never rely on making the call to the function to check if the hardware is there to be still valid any point in time after the call returns. > Surprise hot remove is empirically more reliable the less we interact > with hardware and firmware. That shouldn't be necessary, but is just an > unfortunate reality. You are not "interacting", you are reading/writing to the hardware, as you have to do so. So I really don't understand what you are talking about here, sorry. greg k-h
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote: > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: > > I'm having second thoughts about this. One thing I'm uncomfortable > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > > instead of systematic, in the sense that I don't know how we convince > > ourselves that this (and only this) is the correct place to put it. > > I think my stance always has been that this call is not good at all > because once you call it you never really know if it is still true as > the device could have been removed right afterward. > > So almost any code that relies on it is broken, there is no locking and > it can and will race and you will loose. AIUI, we're not trying to create code to rely on this. This more about reducing reliance on hardware. If the software misses the race once and accesses disconnected device memory, that's usually not a big deal to let hardware sort it out, but the point is not to push our luck. Surprise hot remove is empirically more reliable the less we interact with hardware and firmware. That shouldn't be necessary, but is just an unfortunate reality.
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: > [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about > PCI error recovery in general] > > On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote: > > On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote: > > > When a PCI device is gone, we don't want to send IO to it if we can > > > avoid it. We expose functionality via the irq_chip structure. As > > > users of that structure may not know about the underlying PCI device, > > > it's our responsibility to guard against removed devices. > > > > > > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg(). > > > .irq_mask/unmask() are not. Guard them for completeness. > > > > > > For example, surprise removal of a PCIe device triggers teardown. This > > > touches the irq_chips ops some point to disable the interrupts. I/O > > > generated here can crash the system on firmware-first machines. > > > Not triggering the IO in the first place greatly reduces the > > > possibility of the problem occurring. > > > > > > Signed-off-by: Alexandru Gagniuc > > > > Applied to pci/misc for v4.21, thanks! > > I'm having second thoughts about this. One thing I'm uncomfortable > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > instead of systematic, in the sense that I don't know how we convince > ourselves that this (and only this) is the correct place to put it. I think my stance always has been that this call is not good at all because once you call it you never really know if it is still true as the device could have been removed right afterward. So almost any code that relies on it is broken, there is no locking and it can and will race and you will loose. 
I think your patch suffers from this race:

> +static u32 mmio_readl(struct pci_dev *dev, const volatile void __iomem *addr)
> +{
> +	u32 val, id;
> +
> +	if (pci_dev_is_disconnected(dev))
> +		return ~0;

Great, but what happens if I yank the device out right here?

> +	val = readl(addr);

This value could now be all FF, if the device is gone, so what did the
check above help with?

> +	/*
> +	 * If an MMIO read from the device returns ~0 data, that data may
> +	 * be valid, or it may indicate a bus error. If config space is
> +	 * readable, assume it's valid data; otherwise, assume a bus error.
> +	 */
> +	if (val == ~0) {
> +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +		if (id == ~0)
> +			pci_dev_set_disconnected(dev, NULL);

So why do the check above for "is disconnected"? What does this buy us
here, just short-circuiting the readl()?

> +	}
> +
> +	return val;
> +}
> +
> +static void mmio_writel(struct pci_dev *dev, u32 val,
> +			volatile void __iomem *addr)
> +{
> +	if (pci_dev_is_disconnected(dev))
> +		return;
> +
> +	writel(val, addr);

Why even check, what's wrong with always doing the write?

I understand the wish to make this easier, but I think the only way is
that the drivers themselves should be checking on their reads. And they
have to check on all reads, or at least on some subset of reads, and be
able to handle 0xff for the other ones without going crazy.

I _think_ the xhci driver does this, given that it is hot added/removed
all the time dynamically due to the way that modern laptops are made,
where the BIOS adds/removes the xhci controller when a USB device is
added/removed.

thanks,

greg k-h
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote: > > I'm having second thoughts about this. One thing I'm uncomfortable > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc > instead of systematic, in the sense that I don't know how we convince > ourselves that this (and only this) is the correct place to put it. You know how the kernel provides ZERO_PAGE, wouldn't it be cool if we also had a ONES_PAGE and could remap all virtual addresses from a memory mapped device to that page on an ungraceful disconnect? I do not know how to accomplish that, so might just be crazy talk... But if it is possible, that would be a pretty nifty way to solve this problem.
Re: [PATCH 00/36] Devicetree schema
On Thu, Nov 8, 2018 at 1:54 PM Marta Rybczynska wrote: > > Rob, > The patch set does convert the documentation files. Could you explain > the workflow of verifying a DTS? From what I can understand we can > validate YAML devicetrees, and the schema files against the > meta-schemas, but I see no tool for DTS to YAML conversion in your > tools. Do you use https://github.com/pantoniou/yamldt ? No, I'm not using that. dtc supports yaml output now. The kernel copy of dtc has this support in 4.20. You need libyaml headers installed to enable the support. The output format is intended only for validation and could possibly change. The 'dtbs_check' target will do a dts->dt.yaml pass of all targets in $(dtb-y) with dtc and then run the .dt.yaml files through the schema validation. There's also a presentation here: https://connect.linaro.org/resources/yvr18/yvr18-404/ Rob
Re: [PATCH 00/36] Devicetree schema
Rob, The patch set does convert the documentation files. Could you explain the workflow of verifying a DTS? From what I can understand we can validate YAML devicetrees, and the schema files against the meta-schemas, but I see no tool for DTS to YAML conversion in your tools. Do you use https://github.com/pantoniou/yamldt ? Thanks, Marta
[PATCH -next-akpm 2/3] mm: speed up mremap by 20x on large regions (v5)
From: "Joel Fernandes (Google)"

Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which is copying each
pte at a time, and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when
possible.

The speed up is an order of magnitude on x86 (~20x). On a 1GB mremap,
the mremap completion time drops from 3.4-3.6 milliseconds to 144-160
microseconds.

Before:
Total mremap time for 1GB data: 3521942 nanoseconds.
Total mremap time for 1GB data: 3449229 nanoseconds.
Total mremap time for 1GB data: 3488230 nanoseconds.

After:
Total mremap time for 1GB data: 150279 nanoseconds.
Total mremap time for 1GB data: 144665 nanoseconds.
Total mremap time for 1GB data: 158708 nanoseconds.

In case THP is enabled, the optimization is mostly skipped except in
certain situations.

Acked-by: Kirill A. Shutemov
Reviewed-by: William Kucharski
Signed-off-by: Joel Fernandes (Google)
---
Note that since the bug fix in [1], we now have to flush the TLB every
PMD move. The above numbers were obtained on x86 with a flush done every
move. For arm64, I previously encountered performance issues doing a
flush every time we move; however, Will Deacon says [2] the performance
should be better now with a recent release. Until we can evaluate arm64,
I am dropping the HAVE_MOVE_PMD config enable patch for ARM64 for now.
It can be added back once we finish the performance evaluation. Also of
note is that the speed up on arm64 with this patch, but without the TLB
flush every PMD move, is around 500x.
[1] https://bugs.chromium.org/p/project-zero/issues/detail?id=1695 [2] https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg140837.html arch/Kconfig | 5 + mm/mremap.c | 62 2 files changed, 67 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index e1e540ffa979..b70c952ac838 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -535,6 +535,11 @@ config HAVE_IRQ_TIME_ACCOUNTING Archs need to ensure they use a high enough resolution clock to support irq time accounting and then call enable_sched_clock_irqtime(). +config HAVE_MOVE_PMD + bool + help + Archs that select this are able to move page tables at the PMD level. + config HAVE_ARCH_TRANSPARENT_HUGEPAGE bool diff --git a/mm/mremap.c b/mm/mremap.c index 7c9ab747f19d..2591e512373a 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -191,6 +191,50 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, drop_rmap_locks(vma); } +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, + unsigned long new_addr, unsigned long old_end, + pmd_t *old_pmd, pmd_t *new_pmd) +{ + spinlock_t *old_ptl, *new_ptl; + struct mm_struct *mm = vma->vm_mm; + pmd_t pmd; + + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK) + || old_end - old_addr < PMD_SIZE) + return false; + + /* +* The destination pmd shouldn't be established, free_pgtables() +* should have release it. +*/ + if (WARN_ON(!pmd_none(*new_pmd))) + return false; + + /* +* We don't have to worry about the ordering of src and dst +* ptlocks because exclusive mmap_sem prevents deadlock. 
+*/ + old_ptl = pmd_lock(vma->vm_mm, old_pmd); + new_ptl = pmd_lockptr(mm, new_pmd); + if (new_ptl != old_ptl) + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + + /* Clear the pmd */ + pmd = *old_pmd; + pmd_clear(old_pmd); + + VM_BUG_ON(!pmd_none(*new_pmd)); + + /* Set the new pmd */ + set_pmd_at(mm, new_addr, new_pmd, pmd); + flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); + if (new_ptl != old_ptl) + spin_unlock(new_ptl); + spin_unlock(old_ptl); + + return true; +} + unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, @@ -237,7 +281,25 @@ unsigned long move_page_tables(struct vm_area_struct *vma, split_huge_pmd(vma, old_pmd, old_addr); if (pmd_trans_unstable(old_pmd)) continue; + } else if (extent == PMD_SIZE) { +#ifdef CONFIG_HAVE_MOVE_PMD + /* +* If the extent is PMD-sized, try to speed the move by +* moving at the PMD level if possible. +*/ + bool moved; + + if (need_rmap_locks) +
Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema
Hello, I'm jumping into the discussion, but I clearly don't have all the context of the discussion. On Thu, 8 Nov 2018 15:54:31 +, Robin Murphy wrote: > >> This seems like a semantic different between the two representations, or am > >> I missing something here? Specifically, both the introduction of > >> interrupts-extended and also dropping any mention of using a single per-cpu > >> interrupt (the single combined case is no longer support by Linux; not sure > >> if you want to keep it in the binding). > > > > In regards to no support for the single combined interrupt, it looks > > like Marvell Armada SoCs at least (armada-375 is what I'm looking at) > > have only a single interrupt. Though the interrupt gets routed to MPIC > > which then has a GIC PPI. So it isn't supported or happens to work > > still since it is a PPI? > > Well, the description of the MPIC in the Armada XP functional spec says: > > "Interrupt sources ID0–ID28 are private events per CPU. Thus, each > processor has a different set of events map interrupts ID0–ID28." > > Odd grammar aside, that would seem to imply that < 3> is a per-cpu > interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and > not an SPI then there's no issue there. The Armada XP does not have a GIC at all, but only a MPIC as the primary interrupt controller. However the Armada 38x has both a GIC and a MPIC, and indeed the parent interrupts of the MPIC towards the GIC is: interrupts = ; Best regards, Thomas -- Thomas Petazzoni, CTO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
[PATCH V5 4/10] KVM/VMX: Add hv tlb range flush support
From: Lan Tianyu This patch is to register tlb_remote_flush_with_range callback with hv tlb range flush interface. Signed-off-by: Lan Tianyu --- Change since v4: - Use new function kvm_fill_hv_flush_list_func() to fill flush request. Change since v3: - Merge Vitaly's don't pass EPT configuration info to vmx_hv_remote_flush_tlb() fix. Change since v1: - Pass flush range with new hyper-v tlb flush struct rather than KVM tlb flush struct. --- arch/x86/kvm/vmx.c | 69 ++ 1 file changed, 49 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index edbc96cb990a..405dfbde70b2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1567,7 +1567,38 @@ static void check_ept_pointer_match(struct kvm *kvm) to_kvm_vmx(kvm)->ept_pointers_match = EPT_POINTERS_MATCH; } -static int vmx_hv_remote_flush_tlb(struct kvm *kvm) +int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush, + void *data) +{ + struct kvm_tlb_range *range = data; + + return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn, + range->pages); +} + +static inline int __hv_remote_flush_tlb_with_range(struct kvm *kvm, + struct kvm_vcpu *vcpu, struct kvm_tlb_range *range) +{ + u64 ept_pointer = to_vmx(vcpu)->ept_pointer; + + /* If ept_pointer is invalid pointer, bypass flush request. */ + if (!VALID_PAGE(ept_pointer)) + return 0; + + /* +* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs address +* of the base of EPT PML4 table, strip off EPT configuration +* information. 
+*/ + if (range) + return hyperv_flush_guest_mapping_range(ept_pointer & PAGE_MASK, + kvm_fill_hv_flush_list_func, (void *)range); + else + return hyperv_flush_guest_mapping(ept_pointer & PAGE_MASK); +} + +static int hv_remote_flush_tlb_with_range(struct kvm *kvm, + struct kvm_tlb_range *range) { struct kvm_vcpu *vcpu; int ret = -ENOTSUPP, i; @@ -1577,30 +1608,23 @@ static int vmx_hv_remote_flush_tlb(struct kvm *kvm) if (to_kvm_vmx(kvm)->ept_pointers_match == EPT_POINTERS_CHECK) check_ept_pointer_match(kvm); - /* -* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs the address of the -* base of EPT PML4 table, strip off EPT configuration information. -* If ept_pointer is invalid pointer, bypass the flush request. -*/ if (to_kvm_vmx(kvm)->ept_pointers_match != EPT_POINTERS_MATCH) { - kvm_for_each_vcpu(i, vcpu, kvm) { - if (!VALID_PAGE(to_vmx(vcpu)->ept_pointer)) - return 0; - - ret |= hyperv_flush_guest_mapping( - to_vmx(vcpu)->ept_pointer & PAGE_MASK); - } + kvm_for_each_vcpu(i, vcpu, kvm) + ret |= __hv_remote_flush_tlb_with_range( + kvm, vcpu, range); } else { - if (!VALID_PAGE(to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer)) - return 0; - - ret = hyperv_flush_guest_mapping( - to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer & PAGE_MASK); + ret = __hv_remote_flush_tlb_with_range(kvm, + kvm_get_vcpu(kvm, 0), range); } spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock); return ret; } + +static int hv_remote_flush_tlb(struct kvm *kvm) +{ + return hv_remote_flush_tlb_with_range(kvm, NULL); +} #else /* !IS_ENABLED(CONFIG_HYPERV) */ static inline void evmcs_write64(unsigned long field, u64 value) {} static inline void evmcs_write32(unsigned long field, u32 value) {} @@ -7957,8 +7981,11 @@ static __init int hardware_setup(void) #if IS_ENABLED(CONFIG_HYPERV) if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH - && enable_ept) { + kvm_x86_ops->tlb_remote_flush = hv_remote_flush_tlb; + 
kvm_x86_ops->tlb_remote_flush_with_range = + hv_remote_flush_tlb_with_range; + } #endif if (!cpu_has_vmx_ple()) { @@ -11567,6 +11594,8 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) vmx->nested.posted_intr_nv = -1; vmx->nested.current_vmptr = -1ull; + vmx->ept_pointer = INVALID_PAGE; + vmx->msr_ia32_feature_control_valid_bits = FEATURE_CONTROL_LOCKED; /* -- 2.14.4
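The shape of the change above is a common one: the ranged operation does all the work, and the whole-TLB callback becomes a thin wrapper that passes a NULL range, as hv_remote_flush_tlb() does. A toy sketch of that delegation pattern — `flush_with_range()`, `flush_all()` and `flushed_pages` are invented for illustration, not the KVM API:

```c
#include <stddef.h>

/* Invented stand-ins for the KVM range-flush plumbing. */
struct tlb_range {
	unsigned long start_gfn;
	unsigned long pages;
};

static int flushed_pages;	/* what the last "hypercall" covered; -1 = all */

static int flush_with_range(const struct tlb_range *range)
{
	/* A NULL range means "flush everything", as in hv_remote_flush_tlb(). */
	flushed_pages = range ? (int)range->pages : -1;
	return 0;
}

/* The whole-TLB callback is just the ranged one with a NULL range. */
static int flush_all(void)
{
	return flush_with_range(NULL);
}
```

Keeping a single implementation with an optional range avoids duplicating the vcpu iteration and EPT-pointer checks across two nearly identical functions, which is what the diff above achieves.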
Re: pkeys: Reserve PKEY_DISABLE_READ
* Ram Pai: > Florian, > > I can. But I am struggling to understand the requirement. Why is > this needed? Are we proposing an enhancement to sys_pkey_alloc(), > to be able to allocate keys that are initialized to disable-read > only? Yes, I think that would be a natural consequence. However, my immediate need comes from the fact that the AMR register can contain a flag combination that is not possible to represent with the existing PKEY_DISABLE_WRITE and PKEY_DISABLE_ACCESS flags. User code could write to AMR directly, so I cannot rule out that certain flag combinations exist there. So I came up with this: int pkey_get (int key) { if (key < 0 || key > PKEY_MAX) { __set_errno (EINVAL); return -1; } unsigned int index = pkey_index (key); unsigned long int amr = pkey_read (); unsigned int bits = (amr >> index) & 3; /* Translate from AMR values. PKEY_AMR_READ standing alone is not currently representable. */ if (bits & PKEY_AMR_READ) return PKEY_DISABLE_ACCESS; else if (bits == PKEY_AMR_WRITE) return PKEY_DISABLE_WRITE; return 0; } And this is not ideal. I would prefer something like this instead: switch (bits) { case PKEY_AMR_READ | PKEY_AMR_WRITE: return PKEY_DISABLE_ACCESS; case PKEY_AMR_READ: return PKEY_DISABLE_READ; case PKEY_AMR_WRITE: return PKEY_DISABLE_WRITE; case 0: return 0; } By the way, is the AMR register 64-bit or 32-bit on 32-bit POWER? Thanks, Florian
Re: pkeys: Reserve PKEY_DISABLE_READ
On Thu, Nov 08, 2018 at 06:37:41PM +0100, Florian Weimer wrote: > * Dave Hansen: > > > On 11/8/18 7:01 AM, Florian Weimer wrote: > >> Ideally, PKEY_DISABLE_READ | PKEY_DISABLE_WRITE and PKEY_DISABLE_READ | > >> PKEY_DISABLE_ACCESS would be treated as PKEY_DISABLE_ACCESS both, and a > >> lone PKEY_DISABLE_READ would result in an EINVAL failure. > > > > Sounds reasonable to me. > > > > I don't see any urgency to do this right now. It could easily go in > > alongside the ppc patches when those get merged. > > POWER support has already been merged, so we need to do something here > now, before I can complete the userspace side. > > > The only thing I'd suggest is that we make it something slightly > > higher than 0x4. It'll make the code easier to deal with in the > > kernel if we have the ABI and the hardware mirror each other, and if > > we pick 0x4 in the ABI for PKEY_DISABLE_READ, it might get messy if > > the hardware chose 0x4 for PKEY_DISABLE_EXECUTE or something. > > > > So, let's make it 0x80 or something on x86 at least. > > I don't have a problem with that if that's what it takes. > > > Also, I'll be happy to review and ack the patch to do this, but I'd > > expect the ppc guys (hi Ram!) to actually put it together. > > Ram, do you want to write a patch? Florian, I can. But I am struggling to understand the requirement. Why is this needed? Are we proposing an enhancement to sys_pkey_alloc(), to be able to allocate keys that are initialized to disable-read only? RP > > I'll promise I finish the glibc support for this. 8-) > > Thanks, > Florian -- Ram Pai
Re: [PATCH 0/5] Guarded Userspace Access Prevention on Radix
On Thu, 2018-11-08 at 18:52 +0100, Christophe LEROY wrote: > > In signal_32.c and signal_64.c, save_user_regs() calls __put_user() to > modify code, then calls flush_icache_range() on user addresses. > > Shouldn't flush_icache_range() be performed with userspace access > protection unlocked ? Thankfully this code is pretty much never used these days... Russell: To trigger that, you need to disable the VDSO. This brings back the idea however of having a way to "bulk" open the gate during the whole signal sequence... Cheers, Ben.
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
[+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about PCI error recovery in general] On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote: > On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote: > > When a PCI device is gone, we don't want to send IO to it if we can > > avoid it. We expose functionality via the irq_chip structure. As > > users of that structure may not know about the underlying PCI device, > > it's our responsibility to guard against removed devices. > > > > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg(). > > .irq_mask/unmask() are not. Guard them for completeness. > > > > For example, surprise removal of a PCIe device triggers teardown. This > > touches the irq_chips ops some point to disable the interrupts. I/O > > generated here can crash the system on firmware-first machines. > > Not triggering the IO in the first place greatly reduces the > > possibility of the problem occurring. > > > > Signed-off-by: Alexandru Gagniuc > > Applied to pci/misc for v4.21, thanks! I'm having second thoughts about this. One thing I'm uncomfortable with is that sprinkling pci_dev_is_disconnected() around feels ad hoc instead of systematic, in the sense that I don't know how we convince ourselves that this (and only this) is the correct place to put it. Another is that the only place we call pci_dev_set_disconnected() is in pciehp and acpiphp, so the only "disconnected" case we catch is if hotplug happens to be involved. Every MMIO read from the device is an opportunity to learn whether it is reachable (a read from an unreachable device typically returns ~0 data), but we don't do anything at all with those. The config accessors already check pci_dev_is_disconnected(), so this patch is really aimed at MMIO accesses. I think it would be more robust if we added wrappers for readl() and writel() so we could notice read errors and avoid future reads and writes. 
Two compiled but untested patches below for your comments. You can mostly ignore the first; it's just boring interface changes. The important part is the second, which adds the wrappers. These would be an alternative to the (admittedly much shorter) patch here. The wrappers are independent of MSI and could potentially be exported from the PCI core for use by drivers. I think it would be better for drivers to use something like this instead of calling pci_device_is_present() or pci_dev_is_disconnected() directly. > > --- > > drivers/pci/msi.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > index f2ef896464b3..f31058fd2260 100644 > > --- a/drivers/pci/msi.c > > +++ b/drivers/pci/msi.c > > @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 > > flag) > > { > > struct msi_desc *desc = irq_data_get_msi_desc(data); > > > > + if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc))) > > + return; > > + > > if (desc->msi_attrib.is_msix) { > > msix_mask_irq(desc, flag); > > readl(desc->mask_base); /* Flush write to device */ commit 150346e09edbcaedc520a6d7dec2b16f3a53afa1 Author: Bjorn Helgaas Date: Thu Nov 8 09:17:51 2018 -0600 PCI/MSI: Pass pci_dev into IRQ mask interfaces Add the struct pci_dev pointer to these interfaces: __pci_msix_desc_mask_irq() __pci_msi_desc_mask_irq() msix_mask_irq msi_mask_irq() The pci_dev pointer is currently unused, so there's no functional change intended with this patch. A subsequent patch will use the pointer to improve error detection. 
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c index 9f6f392a4461..56bbee2cf761 100644 --- a/arch/s390/pci/pci.c +++ b/arch/s390/pci/pci.c @@ -462,9 +462,9 @@ void arch_teardown_msi_irqs(struct pci_dev *pdev) if (!msi->irq) continue; if (msi->msi_attrib.is_msix) - __pci_msix_desc_mask_irq(msi, 1); + __pci_msix_desc_mask_irq(pdev, msi, 1); else - __pci_msi_desc_mask_irq(msi, 1, 1); + __pci_msi_desc_mask_irq(pdev, msi, 1, 1); irq_set_msi_desc(msi->irq, NULL); irq_free_desc(msi->irq); msi->msg.address_lo = 0; diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index af24ed50a245..d46ae506e296 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -170,7 +170,8 @@ static inline __attribute_const__ u32 msi_mask(unsigned x) * reliably as devices without an INTx disable bit will then generate a * level IRQ which will never be cleared. */ -u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag) +u32 __pci_msi_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc, + u32 mask, u32 flag) { u32 mask_bits = desc->masked; @@ -179,15 +180,15 @@ u32
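For reference, the wrapper idea Bjorn describes can be sketched in plain C. Everything here — `fake_dev`, `fake_readl()`, `checked_readl()` — is a hypothetical userspace model, not the proposed kernel interface. It also carries the thread's known caveat: a register that legitimately reads as all-ones is indistinguishable from a removed device, so a real implementation needs extra confirmation before latching "disconnected":

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model: a reachable flag plus one 32-bit register. */
struct fake_dev {
	bool disconnected;
	uint32_t reg;
};

/* Stand-in for readl(): an unreachable device reads back as all-ones. */
static uint32_t fake_readl(struct fake_dev *dev)
{
	return dev->disconnected ? 0xffffffffu : dev->reg;
}

/* Sketch of the wrapper: notice ~0 reads, latch "disconnected", and
 * short-circuit all later accesses. A register that legitimately holds
 * ~0 would be a false positive here; a real wrapper would have to
 * confirm (e.g. via a config read) before giving up on the device. */
static bool checked_readl(struct fake_dev *dev, uint32_t *val)
{
	if (dev->disconnected)
		return false;			/* avoid the access entirely */

	uint32_t v = fake_readl(dev);
	if (v == 0xffffffffu) {			/* device likely unreachable */
		dev->disconnected = true;
		return false;
	}
	*val = v;
	return true;
}
```

The point of returning a success/failure flag instead of the raw value is that callers learn about the error at the access site, rather than silently consuming ~0 data — which is the systematic behavior Bjorn is after.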
Re: pkeys: Reserve PKEY_DISABLE_READ
On Thu, Nov 08, 2018 at 01:05:09PM +0100, Florian Weimer wrote: > Would it be possible to reserve a bit for PKEY_DISABLE_READ? > > I think the POWER implementation can disable read access at the hardware > level, but not write access, and that cannot be expressed with the > current PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE bits. POWER hardware can disable-read and can **also disable-write** at the hardware level. It can disable-execute as well at the hardware level. For example, if the key bits for a given key in the AMR register are 0b01, it is read-disable; if they are 0b10, it is write-disable. To support access-disable, we make the key value 0b11. So if you want to know whether the key is read-disable, bitwise-AND it against 0x1, i.e. (x & 0x1) RP
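Ram's description of the per-key AMR encoding maps directly onto the switch Florian would prefer. A small self-contained helper, illustrative only: the first two flag values match the existing uapi pkeys ABI, `PKEY_DISABLE_READ` uses the 0x4 proposal still under discussion (Dave suggested 0x80 instead), and `amr_to_pkey_flags()` is an invented name:

```c
/* Flag values: the first two match the uapi pkeys ABI; the third is
 * the 0x4 proposal under discussion (Dave suggested 0x80 instead). */
#define PKEY_DISABLE_ACCESS	0x1
#define PKEY_DISABLE_WRITE	0x2
#define PKEY_DISABLE_READ	0x4

/* Per-key AMR encoding as described above: bit 0 set = read-disable,
 * bit 1 set = write-disable, both bits set = access-disable. */
static int amr_to_pkey_flags(unsigned int amr_bits)
{
	switch (amr_bits & 0x3) {
	case 0x3: return PKEY_DISABLE_ACCESS;
	case 0x1: return PKEY_DISABLE_READ;
	case 0x2: return PKEY_DISABLE_WRITE;
	default:  return 0;
	}
}
```

With a dedicated read bit, every 2-bit AMR combination has a distinct userspace representation, which is exactly the gap Florian's current pkey_get() has to paper over.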
Re: [PATCH] powerpc: Add KVM guest defconfig
hi Satheesh, On 11/08/2018 03:08 AM, sathn...@linux.vnet.ibm.com wrote: > --- /dev/null > +++ b/arch/powerpc/configs/guest.config > @@ -0,0 +1,14 @@ > +CONFIG_VIRTIO_BLK=y > +CONFIG_VIRTIO_BLK_SCSI=y > +CONFIG_SCSI_VIRTIO=y > +CONFIG_VIRTIO_NET=y > +CONFIG_NET_FAILOVER=y > +CONFIG_VIRTIO_CONSOLE=y > +CONFIG_VIRTIO=y > +CONFIG_VIRTIO_PCI=y > +CONFIG_KVM_GUEST=y > +CONFIG_EPAPR_PARAVIRT=y > +CONFIG_XFS_FS=y Why does a guest kernel need to have XFS integrated into the core image? I am wondering whether it is a requirement from another CONFIG_ option. If it is not a strict requirement from another config, I think we can keep it as defined in ppc64_defconfig, which defines it as a module (CONFIG_XFS_FS=m). Thanks for this patch, very useful. Breno
[PATCH -next-akpm 3/3] mm: select HAVE_MOVE_PMD in x86 for faster mremap
From: "Joel Fernandes (Google)" Moving page-tables at the PMD-level on x86 is known to be safe. Enable this option so that we can do fast mremap when possible. Suggested-by: Kirill A. Shutemov Acked-by: Kirill A. Shutemov Signed-off-by: Joel Fernandes (Google) --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 605bec0c228f..05f3667de0d2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -173,6 +173,7 @@ config X86 select HAVE_MEMBLOCK_NODE_MAP select HAVE_MIXED_BREAKPOINTS_REGS select HAVE_MOD_ARCH_SPECIFIC + select HAVE_MOVE_PMD select HAVE_NMI select HAVE_OPROFILE select HAVE_OPTPROBES -- 2.19.1.930.g4563a0d9d0-goog
[PATCH -next-akpm 1/3] mm: treewide: remove unused address argument from pte_alloc functions (v2)
From: "Joel Fernandes (Google)" This series speeds up mremap(2) syscall by copying page tables at the PMD level even for non-THP systems. There is concern that the extra 'address' argument that mremap passes to pte_alloc may do something subtle architecture related in the future that may make the scheme not work. Also we find that there is no point in passing the 'address' to pte_alloc since its unused. So this patch therefore removes this argument tree-wide resulting in a nice negative diff as well. Also ensuring along the way that the enabled architectures do not do anything funky with 'address' argument that goes unnoticed by the optimization. Build and boot tested on x86-64. Build tested on arm64. The changes were obtained by applying the following Coccinelle script. (thanks Julia for answering all Coccinelle questions!). Following fix ups were done manually: * Removal of address argument from pte_fragment_alloc * Removal of pte_alloc_one_fast definitions from m68k and microblaze. // Options: --include-headers --no-includes // Note: I split the 'identifier fn' line, so if you are manually // running it, please unsplit it so it runs for you. virtual patch @pte_alloc_func_def depends on patch exists@ identifier E2; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; type T2; @@ fn(... - , T2 E2 ) { ... 
} @pte_alloc_func_proto_noarg depends on patch exists@ type T1, T2, T3, T4; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ ( - T3 fn(T1, T2); + T3 fn(T1); | - T3 fn(T1, T2, T4); + T3 fn(T1, T2); ) @pte_alloc_func_proto depends on patch exists@ identifier E1, E2, E4; type T1, T2, T3, T4; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ ( - T3 fn(T1 E1, T2 E2); + T3 fn(T1 E1); | - T3 fn(T1 E1, T2 E2, T4 E4); + T3 fn(T1 E1, T2 E2); ) @pte_alloc_func_call depends on patch exists@ expression E2; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ fn(... -, E2 ) @pte_alloc_macro depends on patch exists@ identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; identifier a, b, c; expression e; position p; @@ ( - #define fn(a, b, c) e + #define fn(a, b) e | - #define fn(a, b) e + #define fn(a) e ) Suggested-by: Kirill A. Shutemov Acked-by: Kirill A. Shutemov Cc: Michal Hocko Cc: Julia Lawall Cc: Kirill A. 
Shutemov Signed-off-by: Joel Fernandes (Google) --- arch/alpha/include/asm/pgalloc.h | 6 +++--- arch/arc/include/asm/pgalloc.h | 5 ++--- arch/arm/include/asm/pgalloc.h | 4 ++-- arch/arm64/include/asm/pgalloc.h | 4 ++-- arch/hexagon/include/asm/pgalloc.h | 6 ++ arch/ia64/include/asm/pgalloc.h | 5 ++--- arch/m68k/include/asm/mcf_pgalloc.h | 8 ++-- arch/m68k/include/asm/motorola_pgalloc.h | 4 ++-- arch/m68k/include/asm/sun3_pgalloc.h | 6 ++ arch/microblaze/include/asm/pgalloc.h| 19 ++- arch/microblaze/mm/pgtable.c | 3 +-- arch/mips/include/asm/pgalloc.h | 6 ++ arch/nds32/include/asm/pgalloc.h | 5 ++--- arch/nios2/include/asm/pgalloc.h | 6 ++ arch/openrisc/include/asm/pgalloc.h | 5 ++--- arch/openrisc/mm/ioremap.c | 3 +-- arch/parisc/include/asm/pgalloc.h| 4 ++-- arch/powerpc/include/asm/book3s/32/pgalloc.h | 4 ++-- arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 +--- arch/powerpc/include/asm/nohash/32/pgalloc.h | 4 ++-- arch/powerpc/include/asm/nohash/64/pgalloc.h | 6 ++ arch/powerpc/mm/pgtable-book3s64.c | 2 +- arch/powerpc/mm/pgtable_32.c | 4 ++-- arch/riscv/include/asm/pgalloc.h | 6 ++ arch/s390/include/asm/pgalloc.h | 4 ++-- arch/sh/include/asm/pgalloc.h| 6 ++ arch/sparc/include/asm/pgalloc_32.h | 5 ++--- arch/sparc/include/asm/pgalloc_64.h | 6 ++ arch/sparc/mm/init_64.c | 6 ++ arch/sparc/mm/srmmu.c| 4 ++-- arch/um/include/asm/pgalloc.h| 4 ++-- arch/um/kernel/mem.c | 4 ++-- arch/unicore32/include/asm/pgalloc.h | 4 ++-- arch/x86/include/asm/pgalloc.h | 4 ++-- arch/x86/mm/pgtable.c| 4 ++-- arch/xtensa/include/asm/pgalloc.h| 8 +++- include/linux/mm.h | 13 ++--- mm/huge_memory.c | 8 mm/kasan/kasan_init.c| 2 +- mm/memory.c | 17 - mm/migrate.c | 2 +- mm/mremap.c | 2 +-
[PATCH -next-akpm 0/3] Add support for fast mremap
Hi, Here is the "fast mremap" series. This just a repost with Kirill's Acked-bys added and William's Reviewed-by added. Also fixed a UML build error reported last week. I would like this to be considered for linux -next. The performance numbers in the series are for testing on x86. The config enablement patch for arm64 will be posted in the future after testing (see notes below). List of patches in series: (1) mm: select HAVE_MOVE_PMD in x86 for faster mremap (2) mm: speed up mremap by 20x on large regions (v5) v1->v2: Added support for per-arch enablement (Kirill Shutemov) v2->v3: Updated commit message to state the optimization may also run for non-thp type of systems (Daniel Col). v3->v4: Remove useless pmd_lock check (Kirill Shutemov) Rebased ontop of Linus's master, updated perf results based on x86 testing. Added Kirill's Acks. v4->v5: Added William's Reviewed-by. Fixed arch/um build error due to set_pmd_at not defined. Rebased on linux-next/akpm. (3) mm: treewide: remove unused address argument from pte_alloc functions (v2) v1->v2: fix arch/um/ prototype which was missed in v1 (Anton Ivanov) update changelog with manual fixups for m68k and microblaze. not included - (4) mm: select HAVE_MOVE_PMD in arm64 for faster mremap This patch is dropped since last posting pending further performance testing on arm64 with new TLB gather updates. See notes in patch titled "mm: speed up mremap by 500x on large regions" for more details. 
Joel Fernandes (Google) (3): mm: treewide: remove unused address argument from pte_alloc functions (v2) mm: speed up mremap by 20x on large regions (v5) mm: select HAVE_MOVE_PMD in x86 for faster mremap arch/Kconfig | 5 ++ arch/alpha/include/asm/pgalloc.h | 6 +- arch/arc/include/asm/pgalloc.h | 5 +- arch/arm/include/asm/pgalloc.h | 4 +- arch/arm64/include/asm/pgalloc.h | 4 +- arch/hexagon/include/asm/pgalloc.h | 6 +- arch/ia64/include/asm/pgalloc.h | 5 +- arch/m68k/include/asm/mcf_pgalloc.h | 8 +-- arch/m68k/include/asm/motorola_pgalloc.h | 4 +- arch/m68k/include/asm/sun3_pgalloc.h | 6 +- arch/microblaze/include/asm/pgalloc.h| 19 +- arch/microblaze/mm/pgtable.c | 3 +- arch/mips/include/asm/pgalloc.h | 6 +- arch/nds32/include/asm/pgalloc.h | 5 +- arch/nios2/include/asm/pgalloc.h | 6 +- arch/openrisc/include/asm/pgalloc.h | 5 +- arch/openrisc/mm/ioremap.c | 3 +- arch/parisc/include/asm/pgalloc.h| 4 +- arch/powerpc/include/asm/book3s/32/pgalloc.h | 4 +- arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 ++-- arch/powerpc/include/asm/nohash/32/pgalloc.h | 4 +- arch/powerpc/include/asm/nohash/64/pgalloc.h | 6 +- arch/powerpc/mm/pgtable-book3s64.c | 2 +- arch/powerpc/mm/pgtable_32.c | 4 +- arch/riscv/include/asm/pgalloc.h | 6 +- arch/s390/include/asm/pgalloc.h | 4 +- arch/sh/include/asm/pgalloc.h| 6 +- arch/sparc/include/asm/pgalloc_32.h | 5 +- arch/sparc/include/asm/pgalloc_64.h | 6 +- arch/sparc/mm/init_64.c | 6 +- arch/sparc/mm/srmmu.c| 4 +- arch/um/include/asm/pgalloc.h| 4 +- arch/um/kernel/mem.c | 4 +- arch/unicore32/include/asm/pgalloc.h | 4 +- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgalloc.h | 4 +- arch/x86/mm/pgtable.c| 4 +- arch/xtensa/include/asm/pgalloc.h| 8 +-- include/linux/mm.h | 13 ++-- mm/huge_memory.c | 8 +-- mm/kasan/kasan_init.c| 2 +- mm/memory.c | 17 +++--- mm/migrate.c | 2 +- mm/mremap.c | 64 +++- mm/userfaultfd.c | 2 +- virt/kvm/arm/mmu.c | 2 +- 46 files changed, 165 insertions(+), 147 deletions(-) -- 2.19.1.930.g4563a0d9d0-goog
Re: [PATCH 0/5] Guarded Userspace Access Prevention on Radix
Le 01/11/2018 à 04:54, Russell Currey a écrit : On Wed, 2018-10-31 at 17:58 +0100, LEROY Christophe wrote: Russell Currey a écrit : On Fri, 2018-10-26 at 18:29 +0200, LEROY Christophe wrote: Russell Currey a écrit : Guarded Userspace Access Prevention is a security mechanism that prevents the kernel from being able to read and write userspace addresses outside of the allowed paths, most commonly copy_{to/from}_user(). At present, the only CPU that supports this is POWER9, and only while using the Radix MMU. Privileged reads and writes cannot access user data when key 0 of the AMR is set. This is described in the "Radix Tree Translation Storage Protection" section of the POWER ISA as of version 3.0. It is not right that only power9 can support that. It's true that not only P9 can support it, but there are more considerations under hash than radix, implementing this for radix is a first step. I don't know much about hash, but I was talking about the 8xx which is a nohash ppc32. I'll see next week if I can do something with it on top of your serie. My small brain saw the number 8 and assumed you were talking about POWER8, I didn't know what 8xx was until now. Working on a refactor to make things a bit more generic, and removing the radix name and dependency from the config option. In signal_32.c and signal_64.c, save_user_regs() calls __put_user() to modify code, then calls flush_icache_range() on user addresses. Shouldn't flush_icache_range() be performed with userspace access protection unlocked ? Christophe
[PATCH] powerpc/xmon: Fix invocation inside lock region
Currently xmon needs to get devtree_lock (through rtas_token()) during its invocation (at crash time). If there is a crash while devtree_lock is being held, then xmon tries to get the lock but spins forever and never gets into the interactive debugger, as in the following case: int *ptr = NULL; raw_spin_lock_irqsave(&devtree_lock, flags); *ptr = 0xdeadbeef; This patch avoids calling rtas_token(), and thus trying to take the same lock, at crash time. The new mechanism gets the token at initialization time (xmon_init()) and just consumes it at crash time. This allows xmon to be invoked regardless of whether devtree_lock is held. Signed-off-by: Breno Leitao --- arch/powerpc/xmon/xmon.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 36b8dc47a3c3..b566203d09c5 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -75,6 +75,9 @@ static int xmon_gate; #define xmon_owner 0 #endif /* CONFIG_SMP */ +#ifdef CONFIG_PPC_PSERIES +static int set_indicator_token = RTAS_UNKNOWN_SERVICE; +#endif static unsigned long in_xmon __read_mostly = 0; static int xmon_on = IS_ENABLED(CONFIG_XMON_DEFAULT); @@ -358,7 +361,6 @@ static inline void disable_surveillance(void) #ifdef CONFIG_PPC_PSERIES /* Since this can't be a module, args should end up below 4GB. */ static struct rtas_args args; - int token; /* * At this point we have got all the cpus we can into * xmon, so there is hopefully no other cpu calling RTAS * at the moment. * If we did try to take rtas.lock there would be a * real possibility of deadlock. 
*/ - token = rtas_token("set-indicator"); - if (token == RTAS_UNKNOWN_SERVICE) + if (set_indicator_token == RTAS_UNKNOWN_SERVICE) return; - rtas_call_unlocked(&args, token, 3, 1, NULL, SURVEILLANCE_TOKEN, 0, 0); + rtas_call_unlocked(&args, set_indicator_token, 3, 1, NULL, + SURVEILLANCE_TOKEN, 0, 0); #endif /* CONFIG_PPC_PSERIES */ } @@ -3688,6 +3690,14 @@ static void xmon_init(int enable) __debugger_iabr_match = xmon_iabr_match; __debugger_break_match = xmon_break_match; __debugger_fault_handler = xmon_fault_handler; + +#ifdef CONFIG_PPC_PSERIES + /* +* Get the token here to avoid trying to get a lock +* during the crash, causing a deadlock. +*/ + set_indicator_token = rtas_token("set-indicator"); +#endif } else { __debugger = NULL; __debugger_ipi = NULL; -- 2.19.0
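The pattern in this patch — resolve a lock-protected value once at init time and only consume the cached copy in the crash path — can be sketched generically. Everything below (`lookup_token()`, `debugger_init()`, the int `lock_held` flag) is an illustrative stand-in for rtas_token()/xmon_init()/devtree_lock, not real kernel code:

```c
#include <assert.h>

/* Model: looking up the token takes a lock (as rtas_token() takes
 * devtree_lock), so the crash path must only consume a cached value. */
#define TOKEN_UNKNOWN (-1)

static int lock_held;			/* models devtree_lock being held */
static int cached_token = TOKEN_UNKNOWN;

static int lookup_token(void)		/* models rtas_token() */
{
	assert(!lock_held);		/* would deadlock at crash time */
	return 42;			/* some valid token */
}

static void debugger_init(void)		/* models xmon_init() */
{
	cached_token = lookup_token();	/* safe: lock not held at init */
}

static int crash_path_token(void)	/* models disable_surveillance() */
{
	return cached_token;		/* no locking needed here */
}
```

The trade-off is that the token is resolved before it is needed, so the crash path must still handle `TOKEN_UNKNOWN` (as the patch does by bailing out of disable_surveillance()).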
Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema
On 08/11/2018 15:59, Thomas Petazzoni wrote: Hello, I'm jumping into the discussion, but I clearly don't have all the context of the discussion. On Thu, 8 Nov 2018 15:54:31 +, Robin Murphy wrote: This seems like a semantic different between the two representations, or am I missing something here? Specifically, both the introduction of interrupts-extended and also dropping any mention of using a single per-cpu interrupt (the single combined case is no longer support by Linux; not sure if you want to keep it in the binding). In regards to no support for the single combined interrupt, it looks like Marvell Armada SoCs at least (armada-375 is what I'm looking at) have only a single interrupt. Though the interrupt gets routed to MPIC which then has a GIC PPI. So it isn't supported or happens to work still since it is a PPI? Well, the description of the MPIC in the Armada XP functional spec says: "Interrupt sources ID0–ID28 are private events per CPU. Thus, each processor has a different set of events map interrupts ID0–ID28." Odd grammar aside, that would seem to imply that < 3> is a per-cpu interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and not an SPI then there's no issue there. The Armada XP does not have a GIC at all, but only a MPIC as the primary interrupt controller. However the Armada 38x has both a GIC and a MPIC, and indeed the parent interrupts of the MPIC towards the GIC is: interrupts = ; Yeah, perhaps I should have clarified that the XP manual was the only publicly-available one I could find, but I'm inferring from the binding and driver that the MPIC in 375/38x still behaves the same. Robin.
Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema
On 01/11/2018 19:32, Rob Herring wrote: On Tue, Oct 9, 2018 at 6:57 AM Will Deacon wrote: Hi Rob, On Fri, Oct 05, 2018 at 11:58:25AM -0500, Rob Herring wrote: Convert ARM PMU binding to DT schema format using json-schema. Cc: Will Deacon Cc: Mark Rutland Cc: linux-arm-ker...@lists.infradead.org Cc: devicet...@vger.kernel.org Signed-off-by: Rob Herring --- Documentation/devicetree/bindings/arm/pmu.txt | 70 -- .../devicetree/bindings/arm/pmu.yaml | 96 +++ 2 files changed, 96 insertions(+), 70 deletions(-) delete mode 100644 Documentation/devicetree/bindings/arm/pmu.txt create mode 100644 Documentation/devicetree/bindings/arm/pmu.yaml [...] -- interrupts : 1 combined interrupt or 1 per core. If the interrupt is a per-cpu - interrupt (PPI) then 1 interrupt should be specified. [...] + interrupts: +oneOf: + - maxItems: 1 + - minItems: 2 +maxItems: 8 +description: 1 interrupt per core. + + interrupts-extended: +$ref: '#/properties/interrupts' This seems like a semantic different between the two representations, or am I missing something here? Specifically, both the introduction of interrupts-extended and also dropping any mention of using a single per-cpu interrupt (the single combined case is no longer support by Linux; not sure if you want to keep it in the binding). In regards to no support for the single combined interrupt, it looks like Marvell Armada SoCs at least (armada-375 is what I'm looking at) have only a single interrupt. Though the interrupt gets routed to MPIC which then has a GIC PPI. So it isn't supported or happens to work still since it is a PPI? Well, the description of the MPIC in the Armada XP functional spec says: "Interrupt sources ID0–ID28 are private events per CPU. Thus, each processor has a different set of events map interrupts ID0–ID28." Odd grammar aside, that would seem to imply that < 3> is a per-cpu interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and not an SPI then there's no issue there. Robin.
[Resend PATCH V5 7/10] KVM: Make kvm_set_spte_hva() return int
From: Lan Tianyu The patch is to make kvm_set_spte_hva() return int and caller can check return value to determine flush tlb or not. Signed-off-by: Lan Tianyu --- arch/arm/include/asm/kvm_host.h | 2 +- arch/arm64/include/asm/kvm_host.h | 2 +- arch/mips/include/asm/kvm_host.h| 2 +- arch/mips/kvm/mmu.c | 3 ++- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kvm/book3s.c | 3 ++- arch/powerpc/kvm/e500_mmu_host.c| 3 ++- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/mmu.c | 3 ++- virt/kvm/arm/mmu.c | 6 -- 10 files changed, 17 insertions(+), 11 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 5ca5d9af0c26..fb62b500e590 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -225,7 +225,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end); -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); +int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu); int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 52fbc823ff8c..b3df7b38ba7d 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -360,7 +360,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end); -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); +int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 
2c1c53d12179..71c3f21d80d5 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -933,7 +933,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu, #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end); -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); +int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c index d8dcdb350405..97e538a8c1be 100644 --- a/arch/mips/kvm/mmu.c +++ b/arch/mips/kvm/mmu.c @@ -551,7 +551,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end, (pte_dirty(old_pte) && !pte_dirty(hva_pte)); } -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte) +int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte) { unsigned long end = hva + PAGE_SIZE; int ret; @@ -559,6 +559,7 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte) ret = handle_hva_to_gpa(kvm, hva, end, _set_spte_handler, ); if (ret) kvm_mips_callbacks->flush_shadow_all(kvm); + return 0; } static int kvm_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end, diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index fac6f631ed29..ab23379c53a9 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -72,7 +72,7 @@ extern int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end); extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); -extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); +extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); #define 
HPTEG_CACHE_NUM(1 << 15) #define HPTEG_HASH_BITS_PTE13 diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index fd9893bc7aa1..437613bb609a 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -850,9 +850,10 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva) return kvm->arch.kvm_ops->test_age_hva(kvm, hva); } -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte) +int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte) { kvm->arch.kvm_ops->set_spte_hva(kvm, hva, pte); + return 0; } void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu) diff --git a/arch/powerpc/kvm/e500_mmu_host.c
[Resend PATCH V5 6/10] KVM: Replace old tlb flush function with new one to flush a specified range.
From: Lan Tianyu This patch is to replace kvm_flush_remote_tlbs() with kvm_flush_ remote_tlbs_with_address() in some functions without logic change. Signed-off-by: Lan Tianyu --- arch/x86/kvm/mmu.c | 31 +-- arch/x86/kvm/paging_tmpl.h | 3 ++- 2 files changed, 23 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index c1d383532066..af3e9e466014 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1485,8 +1485,12 @@ static bool __drop_large_spte(struct kvm *kvm, u64 *sptep) static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) { - if (__drop_large_spte(vcpu->kvm, sptep)) - kvm_flush_remote_tlbs(vcpu->kvm); + if (__drop_large_spte(vcpu->kvm, sptep)) { + struct kvm_mmu_page *sp = page_header(__pa(sptep)); + + kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, + KVM_PAGES_PER_HPAGE(sp->role.level)); + } } /* @@ -1954,7 +1958,8 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp); kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, 0); - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, + KVM_PAGES_PER_HPAGE(sp->role.level)); } int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end) @@ -2470,7 +2475,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, account_shadowed(vcpu->kvm, sp); if (level == PT_PAGE_TABLE_LEVEL && rmap_write_protect(vcpu, gfn)) - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1); if (level > PT_PAGE_TABLE_LEVEL && need_sync) flush |= kvm_sync_pages(vcpu, gfn, _list); @@ -2590,7 +2595,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, return; drop_parent_pte(child, sptep); - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1); } } @@ -3014,8 +3019,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access, ret = 
RET_PF_EMULATE; kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); } + if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH || flush) - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, + KVM_PAGES_PER_HPAGE(level)); if (unlikely(is_mmio_spte(*sptep))) ret = RET_PF_EMULATE; @@ -5681,7 +5688,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, * on PT_WRITABLE_MASK anymore. */ if (flush) - kvm_flush_remote_tlbs(kvm); + kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, + memslot->npages); } static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, @@ -5751,7 +5759,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, * dirty_bitmap. */ if (flush) - kvm_flush_remote_tlbs(kvm); + kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, + memslot->npages); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty); @@ -5769,7 +5778,8 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm, lockdep_assert_held(>slots_lock); if (flush) - kvm_flush_remote_tlbs(kvm); + kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, + memslot->npages); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_largepage_remove_write_access); @@ -5786,7 +5796,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm, /* see kvm_mmu_slot_leaf_clear_dirty */ if (flush) - kvm_flush_remote_tlbs(kvm); + kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, + memslot->npages); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty); diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 7cf2185b7eb5..6bdca39829bc 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -894,7 +894,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa) pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t); if (mmu_page_zap_pte(vcpu->kvm, sp, sptep)) - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_flush_remote_tlbs_with_address(vcpu->kvm, + sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
[Resend PATCH V5 5/10] KVM/MMU: Add tlb flush with range helper function
From: Lan Tianyu This patch is to add wrapper functions for tlb_remote_flush_with_range callback and flush tlb directly in kvm_mmu_zap_collapsible_spte(). kvm_mmu_zap_collapsible_spte() returns flush request to the slot_handle_leaf() and the latter does flush on demand. When range flush is available, make kvm_mmu_zap_collapsible_spte() to flush tlb with range directly to avoid returning range back to slot_handle_leaf(). Signed-off-by: Lan Tianyu --- Change since V4: Remove flush list interface and flush tlb directly in kvm_mmu_zap_collapsible_spte(). Change since V3: Remove code of updating "tlbs_dirty" Change since V2: Fix comment in the kvm_flush_remote_tlbs_with_range() --- arch/x86/kvm/mmu.c | 37 - 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cf5f572f2305..c1d383532066 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -264,6 +264,35 @@ static void mmu_spte_set(u64 *sptep, u64 spte); static union kvm_mmu_page_role kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu); + +static inline bool kvm_available_flush_tlb_with_range(void) +{ + return kvm_x86_ops->tlb_remote_flush_with_range; +} + +static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm, + struct kvm_tlb_range *range) +{ + int ret = -ENOTSUPP; + + if (range && kvm_x86_ops->tlb_remote_flush_with_range) + ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range); + + if (ret) + kvm_flush_remote_tlbs(kvm); +} + +static void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, + u64 start_gfn, u64 pages) +{ + struct kvm_tlb_range range; + + range.start_gfn = start_gfn; + range.pages = pages; + + kvm_flush_remote_tlbs_with_range(kvm, ); +} + void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value) { BUG_ON((mmio_mask & mmio_value) != mmio_value); @@ -5680,7 +5709,13 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, !kvm_is_reserved_pfn(pfn) && PageTransCompoundMap(pfn_to_page(pfn))) { pte_list_remove(rmap_head, sptep); - 
need_tlb_flush = 1; + + if (kvm_available_flush_tlb_with_range()) + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, + KVM_PAGES_PER_HPAGE(sp->role.level)); + else + need_tlb_flush = 1; + goto restart; } } -- 2.14.4
[Resend PATCH V5 4/10] KVM/VMX: Add hv tlb range flush support
From: Lan Tianyu This patch is to register tlb_remote_flush_with_range callback with hv tlb range flush interface. Signed-off-by: Lan Tianyu --- Change since v4: - Use new function kvm_fill_hv_flush_list_func() to fill flush request. Change since v3: - Merge Vitaly's don't pass EPT configuration info to vmx_hv_remote_flush_tlb() fix. Change since v1: - Pass flush range with new hyper-v tlb flush struct rather than KVM tlb flush struct. --- arch/x86/kvm/vmx.c | 69 ++ 1 file changed, 49 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index edbc96cb990a..405dfbde70b2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1567,7 +1567,38 @@ static void check_ept_pointer_match(struct kvm *kvm) to_kvm_vmx(kvm)->ept_pointers_match = EPT_POINTERS_MATCH; } -static int vmx_hv_remote_flush_tlb(struct kvm *kvm) +int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush, + void *data) +{ + struct kvm_tlb_range *range = data; + + return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn, + range->pages); +} + +static inline int __hv_remote_flush_tlb_with_range(struct kvm *kvm, + struct kvm_vcpu *vcpu, struct kvm_tlb_range *range) +{ + u64 ept_pointer = to_vmx(vcpu)->ept_pointer; + + /* If ept_pointer is invalid pointer, bypass flush request. */ + if (!VALID_PAGE(ept_pointer)) + return 0; + + /* +* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs address +* of the base of EPT PML4 table, strip off EPT configuration +* information. 
+*/ + if (range) + return hyperv_flush_guest_mapping_range(ept_pointer & PAGE_MASK, + kvm_fill_hv_flush_list_func, (void *)range); + else + return hyperv_flush_guest_mapping(ept_pointer & PAGE_MASK); +} + +static int hv_remote_flush_tlb_with_range(struct kvm *kvm, + struct kvm_tlb_range *range) { struct kvm_vcpu *vcpu; int ret = -ENOTSUPP, i; @@ -1577,30 +1608,23 @@ static int vmx_hv_remote_flush_tlb(struct kvm *kvm) if (to_kvm_vmx(kvm)->ept_pointers_match == EPT_POINTERS_CHECK) check_ept_pointer_match(kvm); - /* -* FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE hypercall needs the address of the -* base of EPT PML4 table, strip off EPT configuration information. -* If ept_pointer is invalid pointer, bypass the flush request. -*/ if (to_kvm_vmx(kvm)->ept_pointers_match != EPT_POINTERS_MATCH) { - kvm_for_each_vcpu(i, vcpu, kvm) { - if (!VALID_PAGE(to_vmx(vcpu)->ept_pointer)) - return 0; - - ret |= hyperv_flush_guest_mapping( - to_vmx(vcpu)->ept_pointer & PAGE_MASK); - } + kvm_for_each_vcpu(i, vcpu, kvm) + ret |= __hv_remote_flush_tlb_with_range( + kvm, vcpu, range); } else { - if (!VALID_PAGE(to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer)) - return 0; - - ret = hyperv_flush_guest_mapping( - to_vmx(kvm_get_vcpu(kvm, 0))->ept_pointer & PAGE_MASK); + ret = __hv_remote_flush_tlb_with_range(kvm, + kvm_get_vcpu(kvm, 0), range); } spin_unlock(_kvm_vmx(kvm)->ept_pointer_lock); return ret; } + +static int hv_remote_flush_tlb(struct kvm *kvm) +{ + return hv_remote_flush_tlb_with_range(kvm, NULL); +} #else /* !IS_ENABLED(CONFIG_HYPERV) */ static inline void evmcs_write64(unsigned long field, u64 value) {} static inline void evmcs_write32(unsigned long field, u32 value) {} @@ -7957,8 +7981,11 @@ static __init int hardware_setup(void) #if IS_ENABLED(CONFIG_HYPERV) if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH - && enable_ept) - kvm_x86_ops->tlb_remote_flush = vmx_hv_remote_flush_tlb; + && enable_ept) { + kvm_x86_ops->tlb_remote_flush = hv_remote_flush_tlb; + 
kvm_x86_ops->tlb_remote_flush_with_range = + hv_remote_flush_tlb_with_range; + } #endif if (!cpu_has_vmx_ple()) { @@ -11567,6 +11594,8 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) vmx->nested.posted_intr_nv = -1; vmx->nested.current_vmptr = -1ull; + vmx->ept_pointer = INVALID_PAGE; + vmx->msr_ia32_feature_control_valid_bits = FEATURE_CONTROL_LOCKED; /* -- 2.14.4
[Resend PATCH V5 3/10] x86/Hyper-v: Add trace in the hyperv_nested_flush_guest_mapping_range()
From: Lan Tianyu This patch is to trace log in the hyperv_nested_flush_ guest_mapping_range(). Signed-off-by: Lan Tianyu --- arch/x86/hyperv/nested.c| 1 + arch/x86/include/asm/trace/hyperv.h | 14 ++ 2 files changed, 15 insertions(+) diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c index 3d0f31e46954..dd0a843f766d 100644 --- a/arch/x86/hyperv/nested.c +++ b/arch/x86/hyperv/nested.c @@ -130,6 +130,7 @@ int hyperv_flush_guest_mapping_range(u64 as, else ret = status; fault: + trace_hyperv_nested_flush_guest_mapping_range(as, ret); return ret; } EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping_range); diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h index 2e6245a023ef..ace464f09681 100644 --- a/arch/x86/include/asm/trace/hyperv.h +++ b/arch/x86/include/asm/trace/hyperv.h @@ -42,6 +42,20 @@ TRACE_EVENT(hyperv_nested_flush_guest_mapping, TP_printk("address space %llx ret %d", __entry->as, __entry->ret) ); +TRACE_EVENT(hyperv_nested_flush_guest_mapping_range, + TP_PROTO(u64 as, int ret), + TP_ARGS(as, ret), + + TP_STRUCT__entry( + __field(u64, as) + __field(int, ret) + ), + TP_fast_assign(__entry->as = as; + __entry->ret = ret; + ), + TP_printk("address space %llx ret %d", __entry->as, __entry->ret) + ); + TRACE_EVENT(hyperv_send_ipi_mask, TP_PROTO(const struct cpumask *cpus, int vector), -- 2.14.4
[Resend PATCH V5 2/10] x86/hyper-v: Add HvFlushGuestAddressList hypercall support
From: Lan Tianyu Hyper-V provides HvFlushGuestAddressList() hypercall to flush EPT tlb with specified ranges. This patch is to add the hypercall support. Reviewed-by: Michael Kelley Signed-off-by: Lan Tianyu --- Change sincd v4: - Expose function hyperv_fill_flush_guest_mapping_list() out of hyperv file - Adjust parameter of hyperv_flush_guest_mapping_range() Change since v2: Fix some coding style issues - Move HV_MAX_FLUSH_PAGES and HV_MAX_FLUSH_REP_COUNT to hyperv-tlfs.h. - Calculate HV_MAX_FLUSH_REP_COUNT in the macro definition - Use HV_MAX_FLUSH_REP_COUNT to define length of gpa_list in struct hv_guest_mapping_flush_list. Change since v1: Add hyperv tlb flush struct to avoid use kvm tlb flush struct in the hyperv file. --- arch/x86/hyperv/nested.c | 79 ++ arch/x86/include/asm/hyperv-tlfs.h | 32 +++ arch/x86/include/asm/mshyperv.h| 15 3 files changed, 126 insertions(+) diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c index b8e60cc50461..3d0f31e46954 100644 --- a/arch/x86/hyperv/nested.c +++ b/arch/x86/hyperv/nested.c @@ -7,6 +7,7 @@ * * Author : Lan Tianyu */ +#define pr_fmt(fmt) "Hyper-V: " fmt #include @@ -54,3 +55,81 @@ int hyperv_flush_guest_mapping(u64 as) return ret; } EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping); + +int hyperv_fill_flush_guest_mapping_list( + struct hv_guest_mapping_flush_list *flush, + u64 start_gfn, u64 pages) +{ + u64 cur = start_gfn; + u64 additional_pages; + int gpa_n = 0; + + do { + /* +* If flush requests exceed max flush count, go back to +* flush tlbs without range. 
+*/ + if (gpa_n >= HV_MAX_FLUSH_REP_COUNT) + return -ENOSPC; + + additional_pages = min_t(u64, pages, HV_MAX_FLUSH_PAGES) - 1; + + flush->gpa_list[gpa_n].page.additional_pages = additional_pages; + flush->gpa_list[gpa_n].page.largepage = false; + flush->gpa_list[gpa_n].page.basepfn = cur; + + pages -= additional_pages + 1; + cur += additional_pages + 1; + gpa_n++; + } while (pages > 0); + + return gpa_n; +} +EXPORT_SYMBOL_GPL(hyperv_fill_flush_guest_mapping_list); + +int hyperv_flush_guest_mapping_range(u64 as, + hyperv_fill_flush_list_func fill_flush_list_func, void *data) +{ + struct hv_guest_mapping_flush_list **flush_pcpu; + struct hv_guest_mapping_flush_list *flush; + u64 status = 0; + unsigned long flags; + int ret = -ENOTSUPP; + int gpa_n = 0; + + if (!hv_hypercall_pg || !fill_flush_list_func) + goto fault; + + local_irq_save(flags); + + flush_pcpu = (struct hv_guest_mapping_flush_list **) + this_cpu_ptr(hyperv_pcpu_input_arg); + + flush = *flush_pcpu; + if (unlikely(!flush)) { + local_irq_restore(flags); + goto fault; + } + + flush->address_space = as; + flush->flags = 0; + + gpa_n = fill_flush_list_func(flush, data); + if (gpa_n < 0) { + local_irq_restore(flags); + goto fault; + } + + status = hv_do_rep_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST, +gpa_n, 0, flush, NULL); + + local_irq_restore(flags); + + if (!(status & HV_HYPERCALL_RESULT_MASK)) + ret = 0; + else + ret = status; +fault: + return ret; +} +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping_range); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 4139f7650fe5..405a378e1c62 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -10,6 +10,7 @@ #define _ASM_X86_HYPERV_TLFS_H #include +#include /* * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent @@ -358,6 +359,7 @@ struct hv_tsc_emulation_status { #define HVCALL_POST_MESSAGE0x005c #define HVCALL_SIGNAL_EVENT0x005d #define 
HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af +#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0 #define HV_X64_MSR_VP_ASSIST_PAGE_ENABLE 0x0001 #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT12 @@ -757,6 +759,36 @@ struct hv_guest_mapping_flush { u64 flags; }; +/* + * HV_MAX_FLUSH_PAGES = "additional_pages" + 1. It's limited + * by the bitwidth of "additional_pages" in union hv_gpa_page_range. + */ +#define HV_MAX_FLUSH_PAGES (2048) + +/* HvFlushGuestPhysicalAddressList hypercall */ +union hv_gpa_page_range { + u64 address_space; + struct { + u64 additional_pages:11; + u64 largepage:1; +
[Resend PATCH V5 1/10] KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
From: Lan Tianyu

Add a flush-range callback to kvm_x86_ops so that platforms can use it to
register their associated function. The parameter "kvm_tlb_range" accepts
a single range, and a flush list contains a list of ranges.

Signed-off-by: Lan Tianyu
---
Change since v1:
       Change "end_gfn" to "pages" to avoid confusion as to whether
       "end_gfn" is inclusive or exclusive.
---
 arch/x86/include/asm/kvm_host.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55e51ff7e421..c8a65f0a7107 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -439,6 +439,11 @@ struct kvm_mmu {
 	u64 pdptrs[4]; /* pae */
 };
 
+struct kvm_tlb_range {
+	u64 start_gfn;
+	u64 pages;
+};
+
 enum pmc_type {
 	KVM_PMC_GP = 0,
 	KVM_PMC_FIXED,
@@ -1042,6 +1047,8 @@ struct kvm_x86_ops {
 
 	void (*tlb_flush)(struct kvm_vcpu *vcpu, bool invalidate_gpa);
 	int  (*tlb_remote_flush)(struct kvm *kvm);
+	int  (*tlb_remote_flush_with_range)(struct kvm *kvm,
+			struct kvm_tlb_range *range);
 
 	/*
 	 * Flush any TLB entries associated with the given GVA.
-- 
2.14.4
[PATCH V5 00/10] x86/KVM/Hyper-v: Add HV ept tlb range flush hypercall support in KVM
From: Lan Tianyu

Sorry, some patches were blocked, so I am resending them via another account.

For nested memory virtualization, Hyper-V doesn't write-protect the L1
hypervisor's EPT page directory and page table nodes to track changes;
instead it relies on the guest to report changes via the
HvFlushGuestAddressList hypercall. The HvFlushGuestAddressList hypercall
provides a way to flush the EPT page table with ranges specified by the
L1 hypervisor.

If the L1 hypervisor uses INVEPT or the HvFlushGuestAddressSpace hypercall
to flush the EPT tlb, Hyper-V will invalidate the associated EPT shadow
page table and sync L1's EPT table when the next EPT page fault is
triggered. The HvFlushGuestAddressList hypercall helps to avoid such
redundant EPT page faults and shadow page table synchronization.

This patchset is rebased on Linux 4.20-rc1 and the patch "KVM/VMX: Check
ept_pointer before flushing ept tlb"
(https://www.mail-archive.com/linux -ker...@vger.kernel.org/msg1798827.html).

Change since v4:
       1) Split flush address and flush list patches. This patchset only
          contains flush address patches. Will post flush list patches later.
       2) Expose function hyperv_fill_flush_guest_mapping_list()
          out of hyperv file
       3) Adjust parameter of hyperv_flush_guest_mapping_range()
       4) Reorder patchset and move Hyper-V and VMX changes ahead.

Change since v3:
       1) Remove code of updating "tlbs_dirty" in
          kvm_flush_remote_tlbs_with_range()
       2) Remove direct tlb flush in the kvm_handle_hva_range()
       3) Move tlb flush in kvm_set_pte_rmapp() to
          kvm_mmu_notifier_change_pte()
       4) Combine Vitaly's "don't pass EPT configuration info to
          vmx_hv_remote_flush_tlb()" fix

Change since v2:
       1) Fix comment in the kvm_flush_remote_tlbs_with_range()
       2) Move HV_MAX_FLUSH_PAGES and HV_MAX_FLUSH_REP_COUNT to
          hyperv-tlfs.h.
       3) Calculate HV_MAX_FLUSH_REP_COUNT in the macro definition
       4) Use HV_MAX_FLUSH_REP_COUNT to define length of gpa_list in
          struct hv_guest_mapping_flush_list.
Change since v1:
       1) Convert "end_gfn" of struct kvm_tlb_range to "pages" in order
          to avoid confusion as to whether "end_gfn" is inclusive or
          exclusive.
       2) Add hyperv tlb range struct and replace kvm tlb range struct
          with the new struct in order to avoid using the kvm struct in
          the hyperv code directly.

Lan Tianyu (10):
  KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
  x86/hyper-v: Add HvFlushGuestAddressList hypercall support
  x86/Hyper-v: Add trace in the hyperv_nested_flush_guest_mapping_range()
  KVM/VMX: Add hv tlb range flush support
  KVM/MMU: Add tlb flush with range helper function
  KVM: Replace old tlb flush function with new one to flush a specified
    range.
  KVM: Make kvm_set_spte_hva() return int
  KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to
    kvm_mmu_notifier_change_pte()
  KVM/MMU: Flush tlb directly in the kvm_set_pte_rmapp()
  KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()

 arch/arm/include/asm/kvm_host.h     |  2 +-
 arch/arm64/include/asm/kvm_host.h   |  2 +-
 arch/mips/include/asm/kvm_host.h    |  2 +-
 arch/mips/kvm/mmu.c                 |  3 +-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  3 +-
 arch/powerpc/kvm/e500_mmu_host.c    |  3 +-
 arch/x86/hyperv/nested.c            | 80 +++
 arch/x86/include/asm/hyperv-tlfs.h  | 32 +
 arch/x86/include/asm/kvm_host.h     |  9 +++-
 arch/x86/include/asm/mshyperv.h     | 15 ++++++
 arch/x86/include/asm/trace/hyperv.h | 14 ++++++
 arch/x86/kvm/mmu.c                  | 96 +
 arch/x86/kvm/paging_tmpl.h          |  3 +-
 arch/x86/kvm/vmx.c                  | 69 ++
 virt/kvm/arm/mmu.c                  |  6 ++-
 virt/kvm/kvm_main.c                 |  5 +-
 17 files changed, 295 insertions(+), 51 deletions(-)

-- 
2.14.4
Re: [PATCH] powerpc: Mark variable `cpumsr` as unused
Hi Mathieu, Christophe

Thanks for spotting and fixing this bug.

On 11/08/2018 05:25 AM, Mathieu Malaterre wrote:
> On Thu, Nov 8, 2018 at 7:09 AM Christophe Leroy wrote:
>>
>> On 11/07/2018 08:26 PM, Mathieu Malaterre wrote:
>>> Add gcc attribute unused for `cpumsr` variable.
>>>
>>> Fix warnings treated as errors with W=1:
>>>
>>>    arch/powerpc/kernel/process.c:231:16: error: variable ‘cpumsr’ set but
>>> not used [-Werror=unused-but-set-variable]
>>>    arch/powerpc/kernel/process.c:296:16: error: variable ‘cpumsr’ set but
>>> not used [-Werror=unused-but-set-variable]
>>>
>>> Signed-off-by: Mathieu Malaterre
>>
>> I don't think this is the good way to fix that. This problem was
>> introduced by commit 5c784c8414fb ("powerpc/tm: Remove
>> msr_tm_active()"). That commit should be reverted and fixed.
>
> I see, it makes sense.
>
>> That commit should have removed the macro and kept the inline function.
>
> Breno, what do you think ?

Turning this macro into a function might make the code more confusing,
since all the other TM state bits are checked using a macro, for example:

 MSR_TM_SUSPENDED      Checks if the MSR has the suspended bits set
 MSR_TM_TRANSACTIONAL  Checks if the MSR has the transactional bits set
 MSR_TM_RESV           Checks if the MSR has the TM reserved bits set

That said, I understand that it makes sense to have a uniform way to
check for TM bits in the MSR, thus having a MSR_TM_ACTIVE macro to check
for the active bits. Using a non-uniform function just to fix this
warning seems to be overkill.

Reverting the patch seems to bring back the old style, which is having a
macro and a function with the same name, where the function just calls
the macro.

Anyway, I think there might be other ways to fix this warning, as I can
think of now:

 1) Avoid setting cpumsr if CONFIG_PPC_TRANSACTIONAL_MEM is not enabled

 2) If !CONFIG_PPC_TRANSACTIONAL_MEM, redefine MSR_TM_ACTIVE(x) to
    something like (x & 0) instead of 0.
 3) Avoid double definition of MSR_TM_ACTIVE, i.e., have the same
    definition independent of whether PPC_TRANSACTIONAL_MEM is set or not.

Anyway, I would like to try option 3), which is the hardest one to
implement and validate, but it seems to be the most correct option, since
it checks for an MSR bit configuration, and the caller should have the
logic.
Re: [PATCH 35/36] dt-bindings: arm: Convert Xilinx board/soc bindings to json-schema
On 05. 10. 18 18:58, Rob Herring wrote: > Convert Xilinx SoC bindings to DT schema format using json-schema. > > Cc: Mark Rutland > Cc: Michal Simek > Cc: devicet...@vger.kernel.org > Cc: linux-arm-ker...@lists.infradead.org > Signed-off-by: Rob Herring > --- > .../devicetree/bindings/arm/xilinx.txt| 83 --- > .../devicetree/bindings/arm/xilinx.yaml | 81 ++ > 2 files changed, 81 insertions(+), 83 deletions(-) > delete mode 100644 Documentation/devicetree/bindings/arm/xilinx.txt > create mode 100644 Documentation/devicetree/bindings/arm/xilinx.yaml > > diff --git a/Documentation/devicetree/bindings/arm/xilinx.txt > b/Documentation/devicetree/bindings/arm/xilinx.txt > deleted file mode 100644 > index 26fe5ecc4332.. > --- a/Documentation/devicetree/bindings/arm/xilinx.txt > +++ /dev/null > @@ -1,83 +0,0 @@ > -Xilinx Zynq Platforms Device Tree Bindings > - > -Boards with Zynq-7000 SOC based on an ARM Cortex A9 processor > -shall have the following properties. > - > -Required root node properties: > -- compatible = "xlnx,zynq-7000"; > - > -Additional compatible strings: > - > -- Adapteva Parallella board > - "adapteva,parallella" > - > -- Avnet MicroZed board > - "avnet,zynq-microzed" > - "xlnx,zynq-microzed" > - > -- Avnet ZedBoard board > - "avnet,zynq-zed" > - "xlnx,zynq-zed" > - > -- Digilent Zybo board > - "digilent,zynq-zybo" > - > -- Digilent Zybo Z7 board > - "digilent,zynq-zybo-z7" > - > -- Xilinx CC108 internal board > - "xlnx,zynq-cc108" > - > -- Xilinx ZC702 internal board > - "xlnx,zynq-zc702" > - > -- Xilinx ZC706 internal board > - "xlnx,zynq-zc706" > - > -- Xilinx ZC770 internal board, with different FMC cards > - "xlnx,zynq-zc770-xm010" > - "xlnx,zynq-zc770-xm011" > - "xlnx,zynq-zc770-xm012" > - "xlnx,zynq-zc770-xm013" > - > > - > -Xilinx Zynq UltraScale+ MPSoC Platforms Device Tree Bindings > - > -Boards with ZynqMP SOC based on an ARM Cortex A53 processor > -shall have the following properties. 
> - > -Required root node properties: > -- compatible = "xlnx,zynqmp"; > - > - > -Additional compatible strings: > - > -- Xilinx internal board zc1232 > - "xlnx,zynqmp-zc1232-revA", "xlnx,zynqmp-zc1232" > - > -- Xilinx internal board zc1254 > - "xlnx,zynqmp-zc1254-revA", "xlnx,zynqmp-zc1254" > - > -- Xilinx internal board zc1275 > - "xlnx,zynqmp-zc1275-revA", "xlnx,zynqmp-zc1275" > - > -- Xilinx internal board zc1751 > - "xlnx,zynqmp-zc1751" > - > -- Xilinx 96boards compatible board zcu100 > - "xlnx,zynqmp-zcu100-revC", "xlnx,zynqmp-zcu100" > - > -- Xilinx evaluation board zcu102 > - "xlnx,zynqmp-zcu102-revA", "xlnx,zynqmp-zcu102" > - "xlnx,zynqmp-zcu102-revB", "xlnx,zynqmp-zcu102" > - "xlnx,zynqmp-zcu102-rev1.0", "xlnx,zynqmp-zcu102" > - > -- Xilinx evaluation board zcu104 > - "xlnx,zynqmp-zcu104-revA", "xlnx,zynqmp-zcu104" > - > -- Xilinx evaluation board zcu106 > - "xlnx,zynqmp-zcu106-revA", "xlnx,zynqmp-zcu106" > - > -- Xilinx evaluation board zcu111 > - "xlnx,zynqmp-zcu111-revA", "xlnx,zynqmp-zcu111" > diff --git a/Documentation/devicetree/bindings/arm/xilinx.yaml > b/Documentation/devicetree/bindings/arm/xilinx.yaml > new file mode 100644 > index ..dd227bccf015 > --- /dev/null > +++ b/Documentation/devicetree/bindings/arm/xilinx.yaml > @@ -0,0 +1,81 @@ > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/soc/arm/xilinx.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Xilinx Zynq Platforms Device Tree Bindings > + > +maintainers: > + - Michal Simek > + > +description: | > + Xilinx boards with Zynq-7000 SOC or Zynq UltraScale+ MPSoC > + > +properties: > + $nodename: > +const: '/' > + compatible: > +oneOf: > + - items: > + - enum: > + - adapteva,parallella > + - digilent,zynq-zybo > + - digilent,zynq-zybo-z7 > + - xlnx,zynq-cc108 > + - xlnx,zynq-zc702 > + - xlnx,zynq-zc706 > + - xlnx,zynq-zc770-xm010 > + - xlnx,zynq-zc770-xm011 > + - xlnx,zynq-zc770-xm012 > + - xlnx,zynq-zc770-xm013 > + - const: xlnx,zynq-7000 > + > + - items: > 
+ - const: avnet,zynq-microzed > + - const: xlnx,zynq-microzed > + - const: xlnx,zynq-7000 > + > + - items: > + - const: avnet,zynq-zed > + - const: xlnx,zynq-zed > + - const: xlnx,zynq-7000 > + > + - items: > + - enum: > + - xlnx,zynqmp-zc1751 > + - const: xlnx,zynqmp > + > + - description: Xilinx internal board zc1232 > +items: > + - const: xlnx,zynqmp-zc1232-revA > + - const: xlnx,zynqmp-zc1232 > + - const: xlnx,zynqmp > + > + - description: Xilinx internal board zc1254 > +items: > + - const: xlnx,zynqmp-zc1254-revA > + - const:
Re: [PATCH kernel 2/3] vfio_pci: Allow regions to add own capabilities
On Thu, Nov 08, 2018 at 05:48:58PM +1100, Alexey Kardashevskiy wrote:
>
> On 08/11/2018 17:21, David Gibson wrote:
> > On Mon, Oct 15, 2018 at 08:42:32PM +1100, Alexey Kardashevskiy wrote:
> >> VFIO regions already support region capabilities with a limited set of
> >> fields. However the subdriver might have to report additional bits to
> >> the userspace.
> >>
> >> This adds an add_capability() hook to vfio_pci_regops.
> >>
> >> This is aiming at Witherspoon POWER9 machines which have multiple
> >> interconnected NVIDIA V100 GPUs with coherent RAM; each GPU's RAM
> >> is mapped to a system bus and to each GPU's internal system bus, and
> >> the GPUs use this for DMA routing as DMA traffic can go via any
> >> of many NVLink2 links (GPU-GPU or GPU-CPU) or even stay local within a
> >> GPU.
> >
> > This description doesn't really make clear how per-region capabilities
> > are relevant to these devices.
>
> I am confused. This patch just adds a hook, and the device specifics are
> explained in the next patch where they are used...

Well, my point is that the last paragraph of the commit message appears
to be a rationale for this change in terms of what's needed for the GPU
devices. But how those described properties of the GPUs mean that region
capabilities are useful / necessary isn't made clear.

If it's not meant to be a rationale, I'm not sure what it's doing there
at all.

-- 
David Gibson                 | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
                             | _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCH v9 9/9] powerpc: clean stack pointers naming
Some stack pointers used to also be thread_info pointers and were called tp. Now that they are only stack pointers, rename them sp. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/irq.c | 17 +++-- arch/powerpc/kernel/setup_64.c | 20 ++-- 2 files changed, 17 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 62cfccf4af89..754f0efc507b 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -659,21 +659,21 @@ void __do_irq(struct pt_regs *regs) void do_IRQ(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); - void *curtp, *irqtp, *sirqtp; + void *cursp, *irqsp, *sirqsp; /* Switch to the irq stack to handle this */ - curtp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1)); - irqtp = hardirq_ctx[raw_smp_processor_id()]; - sirqtp = softirq_ctx[raw_smp_processor_id()]; + cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1)); + irqsp = hardirq_ctx[raw_smp_processor_id()]; + sirqsp = softirq_ctx[raw_smp_processor_id()]; /* Already there ? 
*/ - if (unlikely(curtp == irqtp || curtp == sirqtp)) { + if (unlikely(cursp == irqsp || cursp == sirqsp)) { __do_irq(regs); set_irq_regs(old_regs); return; } /* Switch stack and call */ - call_do_irq(regs, irqtp); + call_do_irq(regs, irqsp); set_irq_regs(old_regs); } @@ -732,10 +732,7 @@ void irq_ctx_init(void) void do_softirq_own_stack(void) { - void *irqtp; - - irqtp = softirq_ctx[smp_processor_id()]; - call_do_softirq(irqtp); + call_do_softirq(softirq_ctx[smp_processor_id()]); } irq_hw_number_t virq_to_hw(unsigned int virq) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index be6a2ea250a0..61d159d845aa 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -716,22 +716,22 @@ void __init emergency_stack_init(void) limit = min(ppc64_bolted_size(), ppc64_rma_size); for_each_possible_cpu(i) { - void *ti; + void *sp; - ti = alloc_stack(limit, i); - memset(ti, 0, THREAD_SIZE); - paca_ptrs[i]->emergency_sp = ti + THREAD_SIZE; + sp = alloc_stack(limit, i); + memset(sp, 0, THREAD_SIZE); + paca_ptrs[i]->emergency_sp = sp + THREAD_SIZE; #ifdef CONFIG_PPC_BOOK3S_64 /* emergency stack for NMI exception handling. */ - ti = alloc_stack(limit, i); - memset(ti, 0, THREAD_SIZE); - paca_ptrs[i]->nmi_emergency_sp = ti + THREAD_SIZE; + sp = alloc_stack(limit, i); + memset(sp, 0, THREAD_SIZE); + paca_ptrs[i]->nmi_emergency_sp = sp + THREAD_SIZE; /* emergency stack for machine check exception handling. */ - ti = alloc_stack(limit, i); - memset(ti, 0, THREAD_SIZE); - paca_ptrs[i]->mc_emergency_sp = ti + THREAD_SIZE; + sp = alloc_stack(limit, i); + memset(sp, 0, THREAD_SIZE); + paca_ptrs[i]->mc_emergency_sp = sp + THREAD_SIZE; #endif } } -- 2.13.3
[PATCH v9 8/9] powerpc/64: Remove CURRENT_THREAD_INFO
Now that current_thread_info is located at the beginning of 'current' task struct, CURRENT_THREAD_INFO macro is not really needed any more. This patch replaces it by loads of the value at PACACURRENT(r13). Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/exception-64s.h | 4 ++-- arch/powerpc/include/asm/thread_info.h | 4 arch/powerpc/kernel/entry_64.S | 10 +- arch/powerpc/kernel/exceptions-64e.S | 2 +- arch/powerpc/kernel/exceptions-64s.S | 2 +- arch/powerpc/kernel/idle_book3e.S | 2 +- arch/powerpc/kernel/idle_power4.S | 2 +- arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 6 +++--- 8 files changed, 14 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index 3b4767ed3ec5..dd6a5ae7a769 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -671,7 +671,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943) #define RUNLATCH_ON\ BEGIN_FTR_SECTION \ - CURRENT_THREAD_INFO(r3, r1);\ + ld r3, PACACURRENT(r13); \ ld r4,TI_LOCAL_FLAGS(r3); \ andi. r0,r4,_TLF_RUNLATCH;\ beqlppc64_runlatch_on_trampoline; \ @@ -721,7 +721,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CTRL) #ifdef CONFIG_PPC_970_NAP #define FINISH_NAP \ BEGIN_FTR_SECTION \ - CURRENT_THREAD_INFO(r11, r1); \ + ld r11, PACACURRENT(r13); \ ld r9,TI_LOCAL_FLAGS(r11); \ andi. 
r10,r9,_TLF_NAPPING;\ bnelpower4_fixup_nap; \ diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index c959b8d66cac..8e1d0195ac36 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -17,10 +17,6 @@ #define THREAD_SIZE(1 << THREAD_SHIFT) -#ifdef CONFIG_PPC64 -#define CURRENT_THREAD_INFO(dest, sp) stringify_in_c(ld dest, PACACURRENT(r13)) -#endif - #ifndef __ASSEMBLY__ #include #include diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 03cbf409c3f8..b017bd3da1ed 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -158,7 +158,7 @@ system_call:/* label this so stack traces look sane */ li r10,IRQS_ENABLED std r10,SOFTE(r1) - CURRENT_THREAD_INFO(r11, r1) + ld r11, PACACURRENT(r13) ld r10,TI_FLAGS(r11) andi. r11,r10,_TIF_SYSCALL_DOTRACE bne .Lsyscall_dotrace /* does not return */ @@ -205,7 +205,7 @@ system_call:/* label this so stack traces look sane */ ld r3,RESULT(r1) #endif - CURRENT_THREAD_INFO(r12, r1) + ld r12, PACACURRENT(r13) ld r8,_MSR(r1) #ifdef CONFIG_PPC_BOOK3S @@ -336,7 +336,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) /* Repopulate r9 and r10 for the syscall path */ addir9,r1,STACK_FRAME_OVERHEAD - CURRENT_THREAD_INFO(r10, r1) + ld r10, PACACURRENT(r13) ld r10,TI_FLAGS(r10) cmpldi r0,NR_syscalls @@ -734,7 +734,7 @@ _GLOBAL(ret_from_except_lite) mtmsrd r10,1 /* Update machine state */ #endif /* CONFIG_PPC_BOOK3E */ - CURRENT_THREAD_INFO(r9, r1) + ld r9, PACACURRENT(r13) ld r3,_MSR(r1) #ifdef CONFIG_PPC_BOOK3E ld r10,PACACURRENT(r13) @@ -848,7 +848,7 @@ resume_kernel: 1: bl preempt_schedule_irq /* Re-test flags and eventually loop */ - CURRENT_THREAD_INFO(r9, r1) + ld r9, PACACURRENT(r13) ld r4,TI_FLAGS(r9) andi. 
r0,r4,_TIF_NEED_RESCHED bne 1b diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S index 231d066b4a3d..dfafcd0af009 100644 --- a/arch/powerpc/kernel/exceptions-64e.S +++ b/arch/powerpc/kernel/exceptions-64e.S @@ -469,7 +469,7 @@ exc_##n##_bad_stack: \ * interrupts happen before the wait instruction. */ #define CHECK_NAPPING() \ - CURRENT_THREAD_INFO(r11, r1); \ + ld r11, PACACURRENT(r13); \ ld r10,TI_LOCAL_FLAGS(r11);\ andi. r9,r10,_TLF_NAPPING;\ beq+1f; \ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 89d32bb79d5e..1cbe1a78df57 100644
[PATCH v9 7/9] powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU
Now that thread_info is similar to task_struct, its address is in r2 so CURRENT_THREAD_INFO() macro is useless. This patch removes it. At the same time, as the 'cpu' field is no longer in thread_info, this patch renames it to TASK_CPU. Signed-off-by: Christophe Leroy --- arch/powerpc/Makefile | 2 +- arch/powerpc/include/asm/thread_info.h | 2 -- arch/powerpc/kernel/asm-offsets.c | 2 +- arch/powerpc/kernel/entry_32.S | 43 -- arch/powerpc/kernel/epapr_hcalls.S | 5 ++-- arch/powerpc/kernel/head_fsl_booke.S | 5 ++-- arch/powerpc/kernel/idle_6xx.S | 8 +++ arch/powerpc/kernel/idle_e500.S| 8 +++ arch/powerpc/kernel/misc_32.S | 3 +-- arch/powerpc/mm/hash_low_32.S | 14 --- arch/powerpc/sysdev/6xx-suspend.S | 5 ++-- 11 files changed, 35 insertions(+), 62 deletions(-) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 0bb705f36a08..f7ee70885a55 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -417,7 +417,7 @@ ifdef CONFIG_SMP prepare: task_cpu_prepare task_cpu_prepare: prepare0 - $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") print $$3;}' include/generated/asm-offsets.h)) + $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TASK_CPU") print $$3;}' include/generated/asm-offsets.h)) endif # Check toolchain versions: diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index d91523c2c7d8..c959b8d66cac 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -19,8 +19,6 @@ #ifdef CONFIG_PPC64 #define CURRENT_THREAD_INFO(dest, sp) stringify_in_c(ld dest, PACACURRENT(r13)) -#else -#define CURRENT_THREAD_INFO(dest, sp) stringify_in_c(mr dest, r2) #endif #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 94ac190a0b16..03439785c2ea 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -96,7 +96,7 @@ int main(void) #endif /* CONFIG_PPC64 */ 
OFFSET(TASK_STACK, task_struct, stack); #ifdef CONFIG_SMP - OFFSET(TI_CPU, task_struct, cpu); + OFFSET(TASK_CPU, task_struct, cpu); #endif #ifdef CONFIG_LIVEPATCH diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index bd3b146e18a3..d0c546ce387e 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -168,8 +168,7 @@ transfer_to_handler: tophys(r11,r11) addir11,r11,global_dbcr0@l #ifdef CONFIG_SMP - CURRENT_THREAD_INFO(r9, r1) - lwz r9,TI_CPU(r9) + lwz r9,TASK_CPU(r2) slwir9,r9,3 add r11,r11,r9 #endif @@ -180,8 +179,7 @@ transfer_to_handler: stw r12,4(r11) #endif #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE - CURRENT_THREAD_INFO(r9, r1) - tophys(r9, r9) + tophys(r9, r2) ACCOUNT_CPU_USER_ENTRY(r9, r11, r12) #endif @@ -195,8 +193,7 @@ transfer_to_handler: ble-stack_ovf /* then the kernel stack overflowed */ 5: #if defined(CONFIG_6xx) || defined(CONFIG_E500) - CURRENT_THREAD_INFO(r9, r1) - tophys(r9,r9) /* check local flags */ + tophys(r9,r2) /* check local flags */ lwz r12,TI_LOCAL_FLAGS(r9) mtcrf 0x01,r12 bt- 31-TLF_NAPPING,4f @@ -345,8 +342,7 @@ _GLOBAL(DoSyscall) mtmsr r11 1: #endif /* CONFIG_TRACE_IRQFLAGS */ - CURRENT_THREAD_INFO(r10, r1) - lwz r11,TI_FLAGS(r10) + lwz r11,TI_FLAGS(r2) andi. r11,r11,_TIF_SYSCALL_DOTRACE bne-syscall_dotrace syscall_dotrace_cont: @@ -379,13 +375,12 @@ ret_from_syscall: lwz r3,GPR3(r1) #endif mr r6,r3 - CURRENT_THREAD_INFO(r12, r1) /* disable interrupts so current_thread_info()->flags can't change */ LOAD_MSR_KERNEL(r10,MSR_KERNEL) /* doesn't include MSR_EE */ /* Note: We don't bother telling lockdep about it */ SYNC MTMSRD(r10) - lwz r9,TI_FLAGS(r12) + lwz r9,TI_FLAGS(r2) li r8,-MAX_ERRNO andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) bne-syscall_exit_work @@ -432,8 +427,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX) #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE andi. 
r4,r8,MSR_PR beq 3f - CURRENT_THREAD_INFO(r4, r1) - ACCOUNT_CPU_USER_EXIT(r4, r5, r7) + ACCOUNT_CPU_USER_EXIT(r2, r5, r7) 3: #endif lwz r4,_LINK(r1) @@ -526,7 +520,7 @@ syscall_exit_work: /* Clear per-syscall TIF flags if any are set. */ li r11,_TIF_PERSYSCALL_MASK - addir12,r12,TI_FLAGS + addir12,r2,TI_FLAGS 3: lwarx r8,0,r12
[PATCH v9 6/9] powerpc: 'current_set' is now a table of task_struct pointers
The table of pointers 'current_set' has been used for retrieving the stack and current. They used to be thread_info pointers as they were pointing to the stack and current was taken from the 'task' field of the thread_info. Now, the pointers of the 'current_set' table are both pointers to task_struct and pointers to thread_info. As they are used to get current, and the stack pointer is retrieved from current's stack field, this patch changes their type to task_struct, and renames secondary_ti to secondary_current. Reviewed-by: Nicholas Piggin Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/asm-prototypes.h | 4 ++-- arch/powerpc/kernel/head_32.S | 6 +++--- arch/powerpc/kernel/head_44x.S| 4 ++-- arch/powerpc/kernel/head_fsl_booke.S | 4 ++-- arch/powerpc/kernel/smp.c | 10 -- 5 files changed, 13 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h index ec691d489656..b1b999b22a12 100644 --- a/arch/powerpc/include/asm/asm-prototypes.h +++ b/arch/powerpc/include/asm/asm-prototypes.h @@ -23,8 +23,8 @@ #include /* SMP */ -extern struct thread_info *current_set[NR_CPUS]; -extern struct thread_info *secondary_ti; +extern struct task_struct *current_set[NR_CPUS]; +extern struct task_struct *secondary_current; void start_secondary(void *unused); /* kexec */ diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S index 44dfd73b2a62..ba0341bd5a00 100644 --- a/arch/powerpc/kernel/head_32.S +++ b/arch/powerpc/kernel/head_32.S @@ -842,9 +842,9 @@ __secondary_start: #endif /* CONFIG_6xx */ /* get current's stack and current */ - lis r1,secondary_ti@ha - tophys(r1,r1) - lwz r2,secondary_ti@l(r1) + lis r2,secondary_current@ha + tophys(r2,r2) + lwz r2,secondary_current@l(r2) tophys(r1,r2) lwz r1,TASK_STACK(r1) diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S index 2c7e90f36358..48e4de4dfd0c 100644 --- a/arch/powerpc/kernel/head_44x.S +++ 
b/arch/powerpc/kernel/head_44x.S @@ -1021,8 +1021,8 @@ _GLOBAL(start_secondary_47x) /* Now we can get our task struct and real stack pointer */ /* Get current's stack and current */ - lis r1,secondary_ti@ha - lwz r2,secondary_ti@l(r1) + lis r2,secondary_current@ha + lwz r2,secondary_current@l(r2) lwz r1,TASK_STACK(r2) /* Current stack pointer */ diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S index b8a2b789677e..0d27bfff52dd 100644 --- a/arch/powerpc/kernel/head_fsl_booke.S +++ b/arch/powerpc/kernel/head_fsl_booke.S @@ -1076,8 +1076,8 @@ __secondary_start: bl call_setup_cpu /* get current's stack and current */ - lis r1,secondary_ti@ha - lwz r2,secondary_ti@l(r1) + lis r2,secondary_current@ha + lwz r2,secondary_current@l(r2) lwz r1,TASK_STACK(r2) /* stack */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index aa4517686f90..a41fa8924004 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -76,7 +76,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 }; #endif -struct thread_info *secondary_ti; +struct task_struct *secondary_current; bool has_big_cores; DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); @@ -664,7 +664,7 @@ void smp_send_stop(void) } #endif /* CONFIG_NMI_IPI */ -struct thread_info *current_set[NR_CPUS]; +struct task_struct *current_set[NR_CPUS]; static void smp_store_cpu_info(int id) { @@ -929,7 +929,7 @@ void smp_prepare_boot_cpu(void) paca_ptrs[boot_cpuid]->__current = current; #endif set_numa_node(numa_cpu_lookup_table[boot_cpuid]); - current_set[boot_cpuid] = task_thread_info(current); + current_set[boot_cpuid] = current; } #ifdef CONFIG_HOTPLUG_CPU @@ -1014,15 +1014,13 @@ static bool secondaries_inhibited(void) static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle) { - struct thread_info *ti = task_thread_info(idle); - #ifdef CONFIG_PPC64 paca_ptrs[cpu]->__current = idle; paca_ptrs[cpu]->kstack = (unsigned long)task_stack_page(idle) + THREAD_SIZE - 
STACK_FRAME_OVERHEAD; #endif idle->cpu = cpu; - secondary_ti = current_set[cpu] = ti; + secondary_current = current_set[cpu] = idle; } int __cpu_up(unsigned int cpu, struct task_struct *tidle) -- 2.13.3
[PATCH v9 5/9] powerpc: regain entire stack space
thread_info is no longer in the stack, so the entire stack can now be used. There is also no longer any risk of corrupting task_cpu(p) with a stack overflow, so the patch removes the test. When doing this, an explicit test for a NULL stack pointer is needed in validate_sp() as it is no longer implicitly covered by the sizeof(thread_info) gap. In the meantime, with the previous patch all pointers to the stacks are no longer pointers to thread_info, so this patch changes them to void* Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/irq.h | 10 +- arch/powerpc/include/asm/processor.h | 3 +-- arch/powerpc/kernel/asm-offsets.c| 1 - arch/powerpc/kernel/entry_32.S | 14 -- arch/powerpc/kernel/irq.c| 19 +-- arch/powerpc/kernel/misc_32.S| 6 ++ arch/powerpc/kernel/process.c| 32 +--- arch/powerpc/kernel/setup_64.c | 8 8 files changed, 38 insertions(+), 55 deletions(-) diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h index 2efbae8d93be..966ddd4d2414 100644 --- a/arch/powerpc/include/asm/irq.h +++ b/arch/powerpc/include/asm/irq.h @@ -48,9 +48,9 @@ struct pt_regs; * Per-cpu stacks for handling critical, debug and machine check * level interrupts. */ -extern struct thread_info *critirq_ctx[NR_CPUS]; -extern struct thread_info *dbgirq_ctx[NR_CPUS]; -extern struct thread_info *mcheckirq_ctx[NR_CPUS]; +extern void *critirq_ctx[NR_CPUS]; +extern void *dbgirq_ctx[NR_CPUS]; +extern void *mcheckirq_ctx[NR_CPUS]; extern void exc_lvl_ctx_init(void); #else #define exc_lvl_ctx_init() @@ -59,8 +59,8 @@ extern void exc_lvl_ctx_init(void); /* * Per-cpu stacks for handling hard and soft interrupts. 
*/ -extern struct thread_info *hardirq_ctx[NR_CPUS]; -extern struct thread_info *softirq_ctx[NR_CPUS]; +extern void *hardirq_ctx[NR_CPUS]; +extern void *softirq_ctx[NR_CPUS]; extern void irq_ctx_init(void); void call_do_softirq(void *sp); diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 15acb282a876..8179b64871ed 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -325,8 +325,7 @@ struct thread_struct { #define ARCH_MIN_TASKALIGN 16 #define INIT_SP(sizeof(init_stack) + (unsigned long) _stack) -#define INIT_SP_LIMIT \ - (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)_stack) +#define INIT_SP_LIMIT ((unsigned long)_stack) #ifdef CONFIG_SPE #define SPEFSCR_INIT \ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 1fb52206c106..94ac190a0b16 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -92,7 +92,6 @@ int main(void) DEFINE(SIGSEGV, SIGSEGV); DEFINE(NMI_MASK, NMI_MASK); #else - DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16)); OFFSET(KSP_LIMIT, thread_struct, ksp_limit); #endif /* CONFIG_PPC64 */ OFFSET(TASK_STACK, task_struct, stack); diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index fa7a69ffb37a..bd3b146e18a3 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -97,14 +97,11 @@ crit_transfer_to_handler: mfspr r0,SPRN_SRR1 stw r0,_SRR1(r11) - /* set the stack limit to the current stack -* and set the limit to protect the thread_info -* struct -*/ + /* set the stack limit to the current stack */ mfspr r8,SPRN_SPRG_THREAD lwz r0,KSP_LIMIT(r8) stw r0,SAVED_KSP_LIMIT(r11) - rlwimi r0,r1,0,0,(31-THREAD_SHIFT) + rlwinm r0,r1,0,0,(31 - THREAD_SHIFT) stw r0,KSP_LIMIT(r8) /* fall through */ #endif @@ -121,14 +118,11 @@ crit_transfer_to_handler: mfspr r0,SPRN_SRR1 stw r0,crit_srr1@l(0) - /* set the stack limit to the 
current stack -* and set the limit to protect the thread_info -* struct -*/ + /* set the stack limit to the current stack */ mfspr r8,SPRN_SPRG_THREAD lwz r0,KSP_LIMIT(r8) stw r0,saved_ksp_limit@l(0) - rlwimi r0,r1,0,0,(31-THREAD_SHIFT) + rlwinm r0,r1,0,0,(31 - THREAD_SHIFT) stw r0,KSP_LIMIT(r8) /* fall through */ #endif diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 3fdb6b6973cf..62cfccf4af89 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -618,9 +618,8 @@ static inline void check_stack_overflow(void) sp = current_stack_pointer() & (THREAD_SIZE-1); /* check for stack overflow: is there less than 2KB free? */ - if (unlikely(sp < (sizeof(struct thread_info) + 2048))) { - pr_err("do_IRQ: stack overflow: %ld\n", - sp - sizeof(struct thread_info)); +
[PATCH v9 4/9] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK
This patch activates CONFIG_THREAD_INFO_IN_TASK which moves the thread_info into task_struct. Moving thread_info into task_struct has the following advantages: - It protects thread_info from corruption in the case of stack overflows. - Its address is harder to determine if stack addresses are leaked, making a number of attacks more difficult. This has the following consequences: - thread_info is now located at the beginning of task_struct. - The 'cpu' field is now in task_struct, and only exists when CONFIG_SMP is active. - thread_info no longer has the 'task' field. This patch: - Removes all copying of the thread_info struct when the stack changes. - Changes the CURRENT_THREAD_INFO() macro to point to current. - Selects CONFIG_THREAD_INFO_IN_TASK. - Modifies raw_smp_processor_id() to get ->cpu from current without including linux/sched.h to avoid circular inclusion and without including asm/asm-offsets.h to avoid symbol names duplication between ASM constants and C constants. Signed-off-by: Christophe Leroy Reviewed-by: Nicholas Piggin --- arch/powerpc/Kconfig | 1 + arch/powerpc/Makefile | 7 + arch/powerpc/include/asm/ptrace.h | 2 +- arch/powerpc/include/asm/smp.h | 17 +++- arch/powerpc/include/asm/thread_info.h | 17 ++-- arch/powerpc/kernel/asm-offsets.c | 7 +++-- arch/powerpc/kernel/entry_32.S | 9 +++ arch/powerpc/kernel/exceptions-64e.S | 11 arch/powerpc/kernel/head_32.S | 6 ++--- arch/powerpc/kernel/head_44x.S | 4 +-- arch/powerpc/kernel/head_64.S | 1 + arch/powerpc/kernel/head_booke.h | 8 +- arch/powerpc/kernel/head_fsl_booke.S | 7 +++-- arch/powerpc/kernel/irq.c | 47 +- arch/powerpc/kernel/kgdb.c | 28 arch/powerpc/kernel/machine_kexec_64.c | 6 ++--- arch/powerpc/kernel/setup_64.c | 21 --- arch/powerpc/kernel/smp.c | 2 +- arch/powerpc/net/bpf_jit32.h | 5 ++-- 19 files changed, 52 insertions(+), 154 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8be31261aec8..fd634a293d7f 100644 --- a/arch/powerpc/Kconfig +++ 
b/arch/powerpc/Kconfig @@ -238,6 +238,7 @@ config PPC select RTC_LIB select SPARSE_IRQ select SYSCTL_EXCEPTION_TRACE + select THREAD_INFO_IN_TASK select VIRT_TO_BUS if !PPC64 # # Please keep this list sorted alphabetically. diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 8a2ce14d68d0..0bb705f36a08 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -413,6 +413,13 @@ else endif endif +ifdef CONFIG_SMP +prepare: task_cpu_prepare + +task_cpu_prepare: prepare0 + $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") print $$3;}' include/generated/asm-offsets.h)) +endif + # Check toolchain versions: # - gcc-4.6 is the minimum kernel-wide version so nothing required. checkbin: diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h index f73886a1a7f5..0ab96da68b51 100644 --- a/arch/powerpc/include/asm/ptrace.h +++ b/arch/powerpc/include/asm/ptrace.h @@ -156,7 +156,7 @@ extern int ptrace_put_reg(struct task_struct *task, int regno, unsigned long data); #define current_pt_regs() \ - ((struct pt_regs *)((unsigned long)current_thread_info() + THREAD_SIZE) - 1) + ((struct pt_regs *)((unsigned long)task_stack_page(current) + THREAD_SIZE) - 1) /* * We use the least-significant bit of the trap field to indicate * whether we have saved the full set of registers, or only a diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 41695745032c..0de717e16dd6 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -83,7 +83,22 @@ int is_cpu_dead(unsigned int cpu); /* 32-bit */ extern int smp_hw_index[]; -#define raw_smp_processor_id() (current_thread_info()->cpu) +/* + * This is particularly ugly: it appears we can't actually get the definition + * of task_struct here, but we need access to the CPU this task is running on. 
+ * Instead of using task_struct we're using _TASK_CPU which is extracted from + * asm-offsets.h by kbuild to get the current processor ID. + * + * This also needs to be safeguarded when building asm-offsets.s because at + * that time _TASK_CPU is not defined yet. It could have been guarded by + * _TASK_CPU itself, but we want the build to fail if _TASK_CPU is missing + * when building something else than asm-offsets.s + */ +#ifdef GENERATING_ASM_OFFSETS +#define raw_smp_processor_id() (0) +#else +#define raw_smp_processor_id() (*(unsigned int *)((void *)current + _TASK_CPU)) +#endif #define hard_smp_processor_id()(smp_hw_index[smp_processor_id()]) static inline int
[PATCH v9 3/9] powerpc: Prepare for moving thread_info into task_struct
This patch cleans the powerpc kernel before activating CONFIG_THREAD_INFO_IN_TASK: - The purpose of the pointer given to call_do_softirq() and call_do_irq() is to point to the new stack ==> change it to void* and rename it 'sp' - Don't use CURRENT_THREAD_INFO() to locate the stack. - Fix a few comments. - Replace current_thread_info()->task by current - Remove unnecessary casts to thread_info, as they'll become invalid once thread_info is not in stack anymore. - Rename THREAD_INFO to TASK_STACK: as it is in fact the offset of the pointer to the stack in task_struct, this pointer will not be impacted by the move of THREAD_INFO. - Makes TASK_STACK available to PPC64. PPC64 will need it to get the stack pointer from current once the thread_info has been moved. - Modifies klp_init_thread_info() to take a task_struct pointer argument. Signed-off-by: Christophe Leroy Reviewed-by: Nicholas Piggin --- arch/powerpc/include/asm/irq.h | 4 ++-- arch/powerpc/include/asm/livepatch.h | 7 --- arch/powerpc/include/asm/processor.h | 4 ++-- arch/powerpc/include/asm/reg.h | 2 +- arch/powerpc/kernel/asm-offsets.c| 2 +- arch/powerpc/kernel/entry_32.S | 2 +- arch/powerpc/kernel/entry_64.S | 2 +- arch/powerpc/kernel/head_32.S| 4 ++-- arch/powerpc/kernel/head_40x.S | 4 ++-- arch/powerpc/kernel/head_44x.S | 2 +- arch/powerpc/kernel/head_8xx.S | 2 +- arch/powerpc/kernel/head_booke.h | 4 ++-- arch/powerpc/kernel/head_fsl_booke.S | 4 ++-- arch/powerpc/kernel/irq.c| 2 +- arch/powerpc/kernel/misc_32.S| 4 ++-- arch/powerpc/kernel/process.c| 8 arch/powerpc/kernel/setup-common.c | 2 +- arch/powerpc/kernel/setup_32.c | 15 +-- arch/powerpc/kernel/smp.c| 4 +++- 19 files changed, 38 insertions(+), 40 deletions(-) diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h index ee39ce56b2a2..2efbae8d93be 100644 --- a/arch/powerpc/include/asm/irq.h +++ b/arch/powerpc/include/asm/irq.h @@ -63,8 +63,8 @@ extern struct thread_info *hardirq_ctx[NR_CPUS]; extern struct thread_info 
*softirq_ctx[NR_CPUS]; extern void irq_ctx_init(void); -extern void call_do_softirq(struct thread_info *tp); -extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp); +void call_do_softirq(void *sp); +void call_do_irq(struct pt_regs *regs, void *sp); extern void do_IRQ(struct pt_regs *regs); extern void __init init_IRQ(void); extern void __do_irq(struct pt_regs *regs); diff --git a/arch/powerpc/include/asm/livepatch.h b/arch/powerpc/include/asm/livepatch.h index 47a03b9b528b..8a81d10ccc82 100644 --- a/arch/powerpc/include/asm/livepatch.h +++ b/arch/powerpc/include/asm/livepatch.h @@ -43,13 +43,14 @@ static inline unsigned long klp_get_ftrace_location(unsigned long faddr) return ftrace_location_range(faddr, faddr + 16); } -static inline void klp_init_thread_info(struct thread_info *ti) +static inline void klp_init_thread_info(struct task_struct *p) { + struct thread_info *ti = task_thread_info(p); /* + 1 to account for STACK_END_MAGIC */ - ti->livepatch_sp = (unsigned long *)(ti + 1) + 1; + ti->livepatch_sp = end_of_stack(p) + 1; } #else -static void klp_init_thread_info(struct thread_info *ti) { } +static inline void klp_init_thread_info(struct task_struct *p) { } #endif /* CONFIG_LIVEPATCH */ #endif /* _ASM_POWERPC_LIVEPATCH_H */ diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 692f7383d461..15acb282a876 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -40,7 +40,7 @@ #ifndef __ASSEMBLY__ #include -#include +#include #include #include @@ -326,7 +326,7 @@ struct thread_struct { #define INIT_SP(sizeof(init_stack) + (unsigned long) _stack) #define INIT_SP_LIMIT \ - (_ALIGN_UP(sizeof(init_thread_info), 16) + (unsigned long) _stack) + (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)_stack) #ifdef CONFIG_SPE #define SPEFSCR_INIT \ diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index de52c3166ba4..95b68bdf34df 100644 --- 
a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -1060,7 +1060,7 @@ * - SPRG9 debug exception scratch * * All 32-bit: - * - SPRG3 current thread_info pointer + * - SPRG3 current thread_struct physical addr pointer *(virtual on BookE, physical on others) * * 32-bit classic: diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 9ffc72ded73a..b2b52e002a76 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -90,10 +90,10 @@ int main(void) DEFINE(SIGSEGV, SIGSEGV); DEFINE(NMI_MASK, NMI_MASK); #else - OFFSET(THREAD_INFO, task_struct, stack); DEFINE(THREAD_INFO_GAP,
[PATCH v9 2/9] powerpc: Only use task_struct 'cpu' field on SMP
When moving to CONFIG_THREAD_INFO_IN_TASK, the thread_info 'cpu' field gets moved into task_struct and only defined when CONFIG_SMP is set. This patch ensures that TI_CPU is only used when CONFIG_SMP is set and that task_struct 'cpu' field is not used directly out of SMP code. Signed-off-by: Christophe Leroy Reviewed-by: Nicholas Piggin --- arch/powerpc/kernel/head_fsl_booke.S | 2 ++ arch/powerpc/kernel/misc_32.S| 4 arch/powerpc/xmon/xmon.c | 2 +- 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S index e2750b856c8f..05b574f416b3 100644 --- a/arch/powerpc/kernel/head_fsl_booke.S +++ b/arch/powerpc/kernel/head_fsl_booke.S @@ -243,8 +243,10 @@ set_ivor: li r0,0 stwur0,THREAD_SIZE-STACK_FRAME_OVERHEAD(r1) +#ifdef CONFIG_SMP CURRENT_THREAD_INFO(r22, r1) stw r24, TI_CPU(r22) +#endif bl early_init diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S index 695b24a2d954..2f0fe8bfc078 100644 --- a/arch/powerpc/kernel/misc_32.S +++ b/arch/powerpc/kernel/misc_32.S @@ -183,10 +183,14 @@ _GLOBAL(low_choose_750fx_pll) or r4,r4,r5 mtspr SPRN_HID1,r4 +#ifdef CONFIG_SMP /* Store new HID1 image */ CURRENT_THREAD_INFO(r6, r1) lwz r6,TI_CPU(r6) slwir6,r6,2 +#else + li r6, 0 +#endif addis r6,r6,nap_save_hid1@ha stw r4,nap_save_hid1@l(r6) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 36b8dc47a3c3..4414d7b3dc85 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2995,7 +2995,7 @@ static void show_task(struct task_struct *tsk) printf("%px %016lx %6d %6d %c %2d %s\n", tsk, tsk->thread.ksp, tsk->pid, tsk->parent->pid, - state, task_thread_info(tsk)->cpu, + state, task_cpu(tsk), tsk->comm); } -- 2.13.3
[PATCH v9 0/9] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK
The purpose of this series is to activate CONFIG_THREAD_INFO_IN_TASK which moves the thread_info into task_struct. Moving thread_info into task_struct has the following advantages: - It protects thread_info from corruption in the case of stack overflows. - Its address is harder to determine if stack addresses are leaked, making a number of attacks more difficult. Changes since v8: - Rebased on e589b79e40d9 ("Automatic merge of branches 'master', 'next' and 'fixes' into merge") ==> Main impact was conflicts due to commit 9a8dd708d547 ("memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*") Changes since v7: - Rebased on fb6c6ce7907d ("Automatic merge of branches 'master', 'next' and 'fixes' into merge") Changes since v6: - Fixed validate_sp() to exclude NULL sp in 'regain entire stack space' patch (early crash with CONFIG_KMEMLEAK) Changes since v5: - Fixed livepatch_sp setup by using end_of_stack() instead of hardcoding - Fixed PPC_BPF_LOAD_CPU() macro Changes since v4: - Fixed a build failure on 32bits SMP when include/generated/asm-offsets.h does not already exist, which was due to spaces instead of a tab in the Makefile Changes since RFC v3: (based on Nick's review) - Renamed task_size.h to task_size_user64.h to better relate to what it contains. - Handling of the isolation of thread_info cpu field inside CONFIG_SMP #ifdefs moved to a separate patch. - Removed CURRENT_THREAD_INFO macro completely. - Added a guard in asm/smp.h to avoid build failure before _TASK_CPU is defined. - Added a patch at the end to rename 'tp' pointers to 'sp' pointers - Renamed 'tp' into 'sp' pointers in preparation patch when relevant - Fixed a few commit logs - Fixed checkpatch report. 
Changes since RFC v2: - Removed the modification of names in asm-offsets - Created a rule in arch/powerpc/Makefile to append the offset of current->cpu in CFLAGS - Modified asm/smp.h to use the offset set in CFLAGS - Squashed the renaming of THREAD_INFO to TASK_STACK in the preparation patch - Moved the modification of current_pt_regs in the patch activating CONFIG_THREAD_INFO_IN_TASK Changes since RFC v1: - Removed the first patch which was modifying header inclusion order in timer - Modified some names in asm-offsets to avoid conflicts when including asm-offsets in C files - Modified asm/smp.h to avoid having to include linux/sched.h (using asm-offsets instead) - Moved some changes from the activation patch to the preparation patch. Christophe Leroy (9): book3s/64: avoid circular header inclusion in mmu-hash.h powerpc: Only use task_struct 'cpu' field on SMP powerpc: Prepare for moving thread_info into task_struct powerpc: Activate CONFIG_THREAD_INFO_IN_TASK powerpc: regain entire stack space powerpc: 'current_set' is now a table of task_struct pointers powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU powerpc/64: Remove CURRENT_THREAD_INFO powerpc: clean stack pointers naming arch/powerpc/Kconfig | 1 + arch/powerpc/Makefile | 7 +++ arch/powerpc/include/asm/asm-prototypes.h | 4 +- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 2 +- arch/powerpc/include/asm/exception-64s.h | 4 +- arch/powerpc/include/asm/irq.h | 14 ++--- arch/powerpc/include/asm/livepatch.h | 7 ++- arch/powerpc/include/asm/processor.h | 39 + arch/powerpc/include/asm/ptrace.h | 2 +- arch/powerpc/include/asm/reg.h | 2 +- arch/powerpc/include/asm/smp.h | 17 +- arch/powerpc/include/asm/task_size_user64.h| 42 ++ arch/powerpc/include/asm/thread_info.h | 19 --- arch/powerpc/kernel/asm-offsets.c | 10 ++-- arch/powerpc/kernel/entry_32.S | 66 -- arch/powerpc/kernel/entry_64.S | 12 ++-- arch/powerpc/kernel/epapr_hcalls.S | 5 +- arch/powerpc/kernel/exceptions-64e.S | 13 + 
arch/powerpc/kernel/exceptions-64s.S | 2 +- arch/powerpc/kernel/head_32.S | 14 ++--- arch/powerpc/kernel/head_40x.S | 4 +- arch/powerpc/kernel/head_44x.S | 8 +-- arch/powerpc/kernel/head_64.S | 1 + arch/powerpc/kernel/head_8xx.S | 2 +- arch/powerpc/kernel/head_booke.h | 12 +--- arch/powerpc/kernel/head_fsl_booke.S | 16 +++--- arch/powerpc/kernel/idle_6xx.S | 8 +-- arch/powerpc/kernel/idle_book3e.S | 2 +- arch/powerpc/kernel/idle_e500.S| 8 +-- arch/powerpc/kernel/idle_power4.S | 2 +- arch/powerpc/kernel/irq.c | 77 +- arch/powerpc/kernel/kgdb.c | 28 -- arch/powerpc/kernel/machine_kexec_64.c | 6 +- arch/powerpc/kernel/misc_32.S | 17 +++---
[PATCH v9 1/9] book3s/64: avoid circular header inclusion in mmu-hash.h
When activating CONFIG_THREAD_INFO_IN_TASK, linux/sched.h includes
asm/current.h. This generates a circular dependency. To avoid that,
asm/processor.h shall not be included in mmu-hash.h

In order to do that, this patch moves into a new header called
asm/task_size_user64.h the information from asm/processor.h required
by mmu-hash.h

Signed-off-by: Christophe Leroy
Reviewed-by: Nicholas Piggin
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/processor.h          | 34 +-
 arch/powerpc/include/asm/task_size_user64.h   | 42 +++
 arch/powerpc/kvm/book3s_hv_hmi.c              |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/include/asm/task_size_user64.h

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 12e522807f9f..b2aba048301e 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -23,7 +23,7 @@
  */
 #include
 #include
-#include <asm/processor.h>
+#include <asm/task_size_user64.h>
 #include

 /*
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index ee58526cb6c2..692f7383d461 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -95,40 +95,8 @@ void release_thread(struct task_struct *);
 #endif

 #ifdef CONFIG_PPC64
-/*
- * 64-bit user address space can have multiple limits
- * For now supported values are:
- */
-#define TASK_SIZE_64TB  (0x0000400000000000UL)
-#define TASK_SIZE_128TB (0x0000800000000000UL)
-#define TASK_SIZE_512TB (0x0002000000000000UL)
-#define TASK_SIZE_1PB   (0x0004000000000000UL)
-#define TASK_SIZE_2PB   (0x0008000000000000UL)
-/*
- * With 52 bits in the address we can support
- * upto 4PB of range.
- */
-#define TASK_SIZE_4PB   (0x0010000000000000UL)
-/*
- * For now 512TB is only supported with book3s and 64K linux page size.
- */
-#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
-/*
- * Max value currently used:
- */
-#define TASK_SIZE_USER64		TASK_SIZE_4PB
-#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_128TB
-#define TASK_CONTEXT_SIZE		TASK_SIZE_512TB
-#else
-#define TASK_SIZE_USER64		TASK_SIZE_64TB
-#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_64TB
-/*
- * We don't need to allocate extended context ids for 4K page size, because
- * we limit the max effective address on this config to 64TB.
- */
-#define TASK_CONTEXT_SIZE		TASK_SIZE_64TB
-#endif
+#include <asm/task_size_user64.h>

 /*
  * 32-bit user address space is 4GB - 1 page
diff --git a/arch/powerpc/include/asm/task_size_user64.h b/arch/powerpc/include/asm/task_size_user64.h
new file mode 100644
index 000000000000..a4043075864b
--- /dev/null
+++ b/arch/powerpc/include/asm/task_size_user64.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_TASK_SIZE_USER64_H
+#define _ASM_POWERPC_TASK_SIZE_USER64_H
+
+#ifdef CONFIG_PPC64
+/*
+ * 64-bit user address space can have multiple limits
+ * For now supported values are:
+ */
+#define TASK_SIZE_64TB  (0x0000400000000000UL)
+#define TASK_SIZE_128TB (0x0000800000000000UL)
+#define TASK_SIZE_512TB (0x0002000000000000UL)
+#define TASK_SIZE_1PB   (0x0004000000000000UL)
+#define TASK_SIZE_2PB   (0x0008000000000000UL)
+/*
+ * With 52 bits in the address we can support
+ * upto 4PB of range.
+ */
+#define TASK_SIZE_4PB   (0x0010000000000000UL)
+
+/*
+ * For now 512TB is only supported with book3s and 64K linux page size.
+ */
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
+/*
+ * Max value currently used:
+ */
+#define TASK_SIZE_USER64		TASK_SIZE_4PB
+#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_128TB
+#define TASK_CONTEXT_SIZE		TASK_SIZE_512TB
+#else
+#define TASK_SIZE_USER64		TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_64TB
+/*
+ * We don't need to allocate extended context ids for 4K page size, because
+ * we limit the max effective address on this config to 64TB.
+ */
+#define TASK_CONTEXT_SIZE		TASK_SIZE_64TB
+#endif
+
+#endif /* CONFIG_PPC64 */
+#endif /* _ASM_POWERPC_TASK_SIZE_USER64_H */
diff --git a/arch/powerpc/kvm/book3s_hv_hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
index e3f738eb1cac..64b5011475c7 100644
--- a/arch/powerpc/kvm/book3s_hv_hmi.c
+++ b/arch/powerpc/kvm/book3s_hv_hmi.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <asm/processor.h>

 void wait_for_subcore_guest_exit(void)
 {
--
2.13.3
[PATCH] powerpc: Add CONFIG_NR_CPUS to ppc64_defconfig
From: Satheesh Rajendran

CONFIG_NR_CPUS is not set in ppc64_defconfig, so it defaults to 32,
which is unlikely to suit real powerpc system configurations. Default
it to 2048 like the other powerpc defconfigs.

Signed-off-by: Satheesh Rajendran
---
 arch/powerpc/configs/ppc64_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index f2515674a1e2..8f89ea12dc8d 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -1,4 +1,5 @@
 CONFIG_PPC64=y
+CONFIG_NR_CPUS=2048
 CONFIG_SYSVIPC=y
 CONFIG_POSIX_MQUEUE=y
 CONFIG_NO_HZ=y
--
2.17.2
Re: [PATCH 12/36] dt-bindings: arm: Convert cpu binding to json-schema
Hi Rob,

On 05. 10. 18 18:58, Rob Herring wrote:
> Convert ARM CPU binding to DT schema format using json-schema.
>
> Cc: Mark Rutland
> Cc: Matthias Brugger
> Cc: devicet...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-media...@lists.infradead.org
> Signed-off-by: Rob Herring
> ---
>  .../devicetree/bindings/arm/cpus.txt  | 490 -
>  .../devicetree/bindings/arm/cpus.yaml | 503 ++
>  2 files changed, 503 insertions(+), 490 deletions(-)
>  delete mode 100644 Documentation/devicetree/bindings/arm/cpus.txt
>  create mode 100644 Documentation/devicetree/bindings/arm/cpus.yaml
>
> diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt
> deleted file mode 100644
> index b0198a1cf403..
> --- a/Documentation/devicetree/bindings/arm/cpus.txt
> +++ /dev/null
> @@ -1,490 +0,0 @@
> -=
> -ARM CPUs bindings
> -=
> -
> -The device tree allows to describe the layout of CPUs in a system through
> -the "cpus" node, which in turn contains a number of subnodes (ie "cpu")
> -defining properties for every cpu.
> -
> -Bindings for CPU nodes follow the Devicetree Specification, available from:
> -
> -https://www.devicetree.org/specifications/
> -
> -with updates for 32-bit and 64-bit ARM systems provided in this document.
> -
> -
> -Convention used in this document
> -
> -
> -This document follows the conventions described in the Devicetree
> -Specification, with the addition:
> -
> -- square brackets define bitfields, eg reg[7:0] value of the bitfield in
> -  the reg property contained in bits 7 down to 0
> -
> -=
> -cpus and cpu node bindings definition
> -=
> -
> -The ARM architecture, in accordance with the Devicetree Specification,
> -requires the cpus and cpu nodes to be present and contain the properties
> -described below.
> -
> -- cpus node
> -
> -	Description: Container of cpu nodes
> -
> -	The node name must be "cpus".
> -
> -	A cpus node must define the following properties:
> -
> -	- #address-cells
> -		Usage: required
> -		Value type:
> -
> -		Definition depends on ARM architecture version and
> -		configuration:
> -
> -			# On uniprocessor ARM architectures previous to v7
> -			  value must be 1, to enable a simple enumeration
> -			  scheme for processors that do not have a HW CPU
> -			  identification register.
> -			# On 32-bit ARM 11 MPcore, ARM v7 or later systems
> -			  value must be 1, that corresponds to CPUID/MPIDR
> -			  registers sizes.
> -			# On ARM v8 64-bit systems value should be set to 2,
> -			  that corresponds to the MPIDR_EL1 register size.
> -			  If MPIDR_EL1[63:32] value is equal to 0 on all CPUs
> -			  in the system, #address-cells can be set to 1, since
> -			  MPIDR_EL1[63:32] bits are not used for CPUs
> -			  identification.
> -	- #size-cells
> -		Usage: required
> -		Value type:
> -		Definition: must be set to 0
> -
> -- cpu node
> -
> -	Description: Describes a CPU in an ARM based system
> -
> -	PROPERTIES
> -
> -	- device_type
> -		Usage: required
> -		Value type:
> -		Definition: must be "cpu"
> -	- reg
> -		Usage and definition depend on ARM architecture version and
> -		configuration:
> -
> -			# On uniprocessor ARM architectures previous to v7
> -			  this property is required and must be set to 0.
> -
> -			# On ARM 11 MPcore based systems this property is
> -			  required and matches the CPUID[11:0] register bits.
> -
> -			  Bits [11:0] in the reg cell must be set to
> -			  bits [11:0] in CPU ID register.
> -
> -			  All other bits in the reg cell must be set to 0.
> -
> -			# On 32-bit ARM v7 or later systems this property is
> -			  required and matches the CPU MPIDR[23:0] register
> -			  bits.
> -
> -			  Bits [23:0] in the reg cell must be set to
> -			  bits [23:0] in MPIDR.
> -
> -			  All other bits in the reg cell must be set to 0.
> -
> -			# On ARM v8 64-bit systems this property is required
> -			  and matches the MPIDR_EL1 register affinity bits.
> -
> -			  * If cpus node's #address-cells property is set to 2
> -
> -			    The first reg cell bits