Re: [PATCH] powerpc/pseries: Fix build error when NUMA=n

2021-08-27 Thread Laurent Dufour

Le 27/08/2021 à 15:15, Michael Ellerman a écrit :

On Mon, 16 Aug 2021 14:10:32 +1000, Michael Ellerman wrote:

As reported by lkp, if NUMA=n we see a build error:

arch/powerpc/platforms/pseries/hotplug-cpu.c: In function 
'pseries_cpu_hotplug_init':
arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: 
'node_to_cpumask_map' undeclared
 1022 |node_to_cpumask_map[node]);

Use cpumask_of_node() which has an empty stub for NUMA=n, and when
NUMA=y does a lookup from node_to_cpumask_map[].
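
For reference, cpumask_of_node() behaves roughly as in the sketch below; this is
a simplified illustration of the asm/topology.h and asm-generic/topology.h
definitions, not the verbatim kernel headers:

    /* Sketch only: with NUMA=y (powerpc) the macro indexes the per-node
     * cpumask array; with NUMA=n it degenerates to cpu_online_mask, so the
     * caller never references node_to_cpumask_map directly. */
    #ifdef CONFIG_NUMA
    #define cpumask_of_node(node)	((node) == NUMA_NO_NODE ?	\
    					 cpu_all_mask :			\
    					 node_to_cpumask_map[node])
    #else
    #define cpumask_of_node(node)	((void)(node), cpu_online_mask)
    #endif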

[...]


Applied to powerpc/next.

[1/1] powerpc/pseries: Fix build error when NUMA=n
   https://git.kernel.org/powerpc/c/8b893ef190b0c440877de04f767efca4bf4d6af8

cheers



Thanks, Michael, for fixing my bugs!


Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required

2021-08-23 Thread Laurent Dufour

Le 21/08/2021 à 12:25, Srikar Dronamraju a écrit :

Currently, a debug message gets printed every time an attempt is made to
add (remove) a CPU. However, this is redundant if the CPU has already been
added to (removed from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
  arch/powerpc/mm/numa.c | 11 +--
  1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..fbe03f6840e0 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
  {
update_numa_cpu_lookup_table(cpu, node);
  
-	dbg("adding cpu %d to node %d\n", cpu, node);
  
-	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))

+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   dbg("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
  }
  
  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)

@@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
  {
int node = numa_cpu_lookup_table[cpu];
  
-	dbg("removing cpu %lu from node %d\n", cpu, node);

-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   dbg("removing cpu %lu from node %d\n", cpu, node);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);


Would pr_warn() be more appropriate here (or removing the "WARNING" statement)?


}
  }
  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */





[PATCH] powerpc/pseries: Fix update of LPAR security flavor after LPM

2021-08-05 Thread Laurent Dufour
After LPM, when migrating from a system with security mitigation enabled to
a system with mitigation disabled, the security flavor exposed in /proc is
not correctly set back to 0.

Do not assume the value of the security flavor is set to 0 when entering
init_cpu_char_feature_flags(), so that, when it is called after an LPM, the
value is set correctly even if the mitigations are not turned off.

Fixes: 6ce56e1ac380 ("powerpc/pseries: export LPAR security flavor in
lparcfg")

Cc: sta...@vger.kernel.org # 5.13.x
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/setup.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 6b0886668465..0dfaa6ab44cc 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -539,9 +539,10 @@ static void init_cpu_char_feature_flags(struct 
h_cpu_char_result *result)
 * H_CPU_BEHAV_FAVOUR_SECURITY_H could be set only if
 * H_CPU_BEHAV_FAVOUR_SECURITY is.
 */
-   if (!(result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY))
+   if (!(result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY)) {
security_ftr_clear(SEC_FTR_FAVOUR_SECURITY);
-   else if (result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY_H)
+   pseries_security_flavor = 0;
+   } else if (result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY_H)
pseries_security_flavor = 1;
else
pseries_security_flavor = 2;
-- 
2.32.0



Re: [PATCH v5] pseries: prevent free CPU ids to be reused on another node

2021-08-03 Thread Laurent Dufour

Le 03/08/2021 à 18:54, Nathan Lynch a écrit :

Laurent Dufour  writes:

V5:
  - Rework code structure
  - Reintroduce the capability to reuse other node's ids.


OK. While I preferred v4, where we would fail an add rather than allow
CPU IDs to appear to "travel" between nodes, this change is a net
improvement.

Reviewed-by: Nathan Lynch 



Thanks Nathan,

Regarding the reuse of other nodes' free CPU ids, with this patch the kernel does
its best to prevent that. Instead of failing to add new CPUs, I think it's better
to reuse free CPU ids from other nodes; otherwise, only a reboot would allow the
CPU add operation to succeed.


Laurent.


Re: [PATCH v5] pseries/drmem: update LMBs after LPM

2021-08-03 Thread Laurent Dufour

Le 03/08/2021 à 19:32, Nathan Lynch a écrit :

Laurent Dufour  writes:

V5:
  - Reword the commit's description to address Nathan's comments.


Thanks. Still don't like the global variable usage but:

Reviewed-by: Nathan Lynch 



Thanks Nathan,

I don't like the global variable usage either, but I can't see any smarter way to
achieve that.


Laurent.


Re: [PATCH v5] pseries: prevent free CPU ids to be reused on another node

2021-07-19 Thread Laurent Dufour

Hi Michael,

Is there a way to get that patch in 5.14?

Thanks,
Laurent.

Le 29/04/2021 à 19:49, Laurent Dufour a écrit :

When a CPU is hot added, the CPU ids are taken from the available mask, from
the lowest possible set. If that set of values was previously used for CPUs
attached to a different node, this looks to applications as if these CPUs
had migrated from one node to another, which is not expected in real
life.

To prevent this, record the CPU ids used for each node and do not reuse them
on another node. However, to prevent CPU hot plug from failing when a node
runs out of CPU ids, the capability to reuse other nodes' free CPU ids is
kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

If no id set is found, a retry is made without removing the ids used on
the other nodes, to try reusing them. This is the way ids were allocated
prior to this patch.

The effect of this patch can be seen by removing and adding CPUs using the
QEMU monitor. In the following case, the first CPU of node 2 is removed,
then the first one of node 1 is removed too. Later, the first CPU of node 2
is added back. Without this patch, the kernel numbers these CPUs using the
first available CPU ids, which are the ones freed by the second removal.
This leads to CPU ids 16-23 moving from node 1 to node 2. With the patch
applied, CPU ids 32-39 are used since they are the lowest free ones which
have not been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour 
---
V5:
  - Rework code structure
  - Reintroduce the capability to reuse other node's ids.
V4: addressing Nathan's comment
  - Rename the local variable named 'nid' into 'assigned_node'
V3: addressing Nathan's comments
  - Remove the retry feature
  - Reduce the number of local variables (removing 'i')
  - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
  V2: (no functional changes)
  - update the test's output in the commit's description
  - node_recorded_ids_map should be static
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 171 ++-
  1 file changed, 132 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..e1f224320102 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
  /* This version can't take the spinlock, because it never returns */
  static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
  
+/*

+ * Record the CPU ids used on each nodes.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
  static void rtas_stop_self(void)
  {
static struct rtas_args args;
@@ -139,72 +145,148 @@ static void pseries_cpu_die(unsigned int cpu)
paca_ptrs[cpu]->cpu_start = 0;
  }
  
+/**

+ * find_cpu_id_range - found a linear ranger of @nthreads free CPU ids.
+ * @nthreads : the number of threads (cpu ids)
+ * @assigned_node : the node it belongs to or NUMA_NO_NODE if free ids from any
+ *  node can be peek.
+ * @cpu_mask: the returned CPU mask.
+ *
+ * Returns 0 on success.
+ */
+static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
+cpumask_var_t *cpu_mask)
+{
+   cpumask_var_t candidate_mask;
+   unsigned int cpu, node;
+   int rc = -ENOSPC;
+
+   if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   cpumask_clear(*cpu_mask);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, *cpu_mask);
+
+   BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
+
+   /* Get a bitmap of unoccupied slots. */
+   cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   if (assigned_node != NUMA_N
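
(The archive truncates the rest of this hunk.) For readability, here is a rough
sketch of how the remaining logic presumably works, following the commit
message (restrict the search to ids never recorded on other nodes, slide a
window of nthreads ids, and let the caller retry with NUMA_NO_NODE if nothing
is found); an illustration, not the exact patch text:

	if (assigned_node != NUMA_NO_NODE) {
		/* Exclude ids already recorded for the other nodes. */
		for_each_online_node(node) {
			if (node == assigned_node)
				continue;
			cpumask_andnot(candidate_mask, candidate_mask,
				       node_recorded_ids_map[node]);
		}
	}

	/* Slide the nthreads-wide window until it fits in the candidates. */
	while (!cpumask_empty(*cpu_mask)) {
		if (cpumask_subset(*cpu_mask, candidate_mask)) {
			rc = 0;
			break;
		}
		cpumask_shift_left(*cpu_mask, *cpu_mask, nthreads);
	}

	free_cpumask_var(candidate_mask);
	return rc;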

Re: [PATCH v2] ppc64/numa: consider the max numa node for migratable LPAR

2021-07-19 Thread Laurent Dufour

Hi Michael,

Is there a way to get that patch in 5.14?

Thanks,
Laurent.

Le 11/05/2021 à 09:31, Laurent Dufour a écrit :

When an LPAR is migratable, we should consider the maximum possible number
of NUMA nodes instead of the number of NUMA nodes on the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if an LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know whether an LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' when it
is set up to migrate the partition, but that does not mean that the current
partition is migratable.

Without this patch, when an LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case, if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to a node because the kernel has been set to use up
to 2 nodes (the configuration of the departure box). With this patch
applied, the CPU is correctly added to the 3rd node.

Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup")
Reviewed-by: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
V2: Address Srikar's comments
  - Fix the commit message
  - Use pr_info instead printk(KERN_INFO..)
---
  arch/powerpc/mm/numa.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..094a1076fd1f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;
  
@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)

 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
   &prop_length);
@@ -920,6 +925,8 @@ static void __init find_possible_nodes(void)
}
  
  	max_nodes = of_read_number(&domains[min_common_depth], 1);

+   pr_info("Partition configured for %d NUMA nodes.\n", max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);





Re: [PATCH v5] pseries/drmem: update LMBs after LPM

2021-07-19 Thread Laurent Dufour

Hi Michael,

Is there a way to get that patch in 5.14?

Thanks,
Laurent.

Le 17/05/2021 à 11:06, Laurent Dufour a écrit :

After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is handled by the kernel, but the memory's node is not updated because
there is no way to move a memory block between nodes from the Linux kernel
point of view.

If later a memory block is added or removed, drmem_update_dt() is called
and it overwrites the DT node ibm,dynamic-reconfiguration-memory to match
the added or removed LMB. But the LMBs' associativity has not been updated
after the DT node update and thus the node is overwritten by Linux's
topology instead of the hypervisor's one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMBs' associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt(),
because, in that case, the LMB tree has been used to set the DT property
and thus does not need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.

Cc: Nathan Lynch 
Cc: Aneesh Kumar K.V 
Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V5:
  - Reword the commit's description to address Nathan's comments.
V4:
  - Prevent the LMB to be updated back in the case the request came from the
  LMB tree's update.
V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 46 +++
  .../platforms/pseries/hotplug-memory.c|  4 ++
  3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..22197b18d85e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,7 @@ static int n_root_addr_cells, n_root_size_cells;
  
  static struct drmem_lmb_info __drmem_info;

  struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;
  
  u64 drmem_lmb_memory_max(void)

  {
@@ -178,6 +179,11 @@ int drmem_update_dt(void)
if (!memory)
return -1;
  
+	/*

+* Set in_drmem_update to prevent the notifier callback to process the
+* DT property back since the change is coming from the LMB tree.
+*/
+   in_drmem_update = true;
prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
if (prop) {
rc = drmem_update_dt_v1(memory, prop);
@@ -186,6 +192,7 @@ int drmem_update_dt(void)
if (prop)
rc = drmem_update_dt_v2(memory, prop);
}
+   in_drmem_update = false;
  
  	of_node_put(memory);

return rc;
@@ -307,6 +314,45 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   /*
+* Don't update the LMBs if triggered by the update done in
+* drmem_update_dt(), the LMB values have been used to the update the DT
+* property in that case.
+*/
+   if (in_drmem_update)
+   return;
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NUL

Re: [PATCH v5] pseries/drmem: update LMBs after LPM

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 17/05/2021 à 11:06, Laurent Dufour a écrit :

After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is handled by the kernel, but the memory's node is not updated because
there is no way to move a memory block between nodes from the Linux kernel
point of view.

If later a memory block is added or removed, drmem_update_dt() is called
and it overwrites the DT node ibm,dynamic-reconfiguration-memory to match
the added or removed LMB. But the LMBs' associativity has not been updated
after the DT node update and thus the node is overwritten by Linux's
topology instead of the hypervisor's one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMBs' associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt(),
because, in that case, the LMB tree has been used to set the DT property
and thus does not need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.

Cc: Nathan Lynch 
Cc: Aneesh Kumar K.V 
Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V5:
  - Reword the commit's description to address Nathan's comments.
V4:
  - Prevent the LMB to be updated back in the case the request came from the
  LMB tree's update.
V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 46 +++
  .../platforms/pseries/hotplug-memory.c|  4 ++
  3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..22197b18d85e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,7 @@ static int n_root_addr_cells, n_root_size_cells;
  
  static struct drmem_lmb_info __drmem_info;

  struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;
  
  u64 drmem_lmb_memory_max(void)

  {
@@ -178,6 +179,11 @@ int drmem_update_dt(void)
if (!memory)
return -1;
  
+	/*

+* Set in_drmem_update to prevent the notifier callback to process the
+* DT property back since the change is coming from the LMB tree.
+*/
+   in_drmem_update = true;
prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
if (prop) {
rc = drmem_update_dt_v1(memory, prop);
@@ -186,6 +192,7 @@ int drmem_update_dt(void)
if (prop)
rc = drmem_update_dt_v2(memory, prop);
}
+   in_drmem_update = false;
  
  	of_node_put(memory);

return rc;
@@ -307,6 +314,45 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   /*
+* Don't update the LMBs if triggered by the update done in
+* drmem_update_dt(), the LMB values have been used to the update the DT
+* property in that case.
+*/
+   if (in_drmem_update)
+   return;
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NUL

Re: [PATCH v2] ppc64/numa: consider the max numa node for migratable LPAR

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 11/05/2021 à 09:31, Laurent Dufour a écrit :

When an LPAR is migratable, we should consider the maximum possible number
of NUMA nodes instead of the number of NUMA nodes on the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if an LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know whether an LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' when it
is set up to migrate the partition, but that does not mean that the current
partition is migratable.

Without this patch, when an LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case, if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to a node because the kernel has been set to use up
to 2 nodes (the configuration of the departure box). With this patch
applied, the CPU is correctly added to the 3rd node.

Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup")
Reviewed-by: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
V2: Address Srikar's comments
  - Fix the commit message
  - Use pr_info instead printk(KERN_INFO..)
---
  arch/powerpc/mm/numa.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..094a1076fd1f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;
  
@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)

 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
   &prop_length);
@@ -920,6 +925,8 @@ static void __init find_possible_nodes(void)
}
  
  	max_nodes = of_read_number(&domains[min_common_depth], 1);

+   pr_info("Partition configured for %d NUMA nodes.\n", max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);





Re: [PATCH v5] pseries: prevent free CPU ids to be reused on another node

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 29/04/2021 à 19:49, Laurent Dufour a écrit :

When a CPU is hot added, the CPU ids are taken from the available mask, from
the lowest possible set. If that set of values was previously used for CPUs
attached to a different node, this looks to applications as if these CPUs
had migrated from one node to another, which is not expected in real
life.

To prevent this, record the CPU ids used for each node and do not reuse them
on another node. However, to prevent CPU hot plug from failing when a node
runs out of CPU ids, the capability to reuse other nodes' free CPU ids is
kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

If no id set is found, a retry is made without removing the ids used on
the other nodes, to try reusing them. This is the way ids were allocated
prior to this patch.

The effect of this patch can be seen by removing and adding CPUs using the
QEMU monitor. In the following case, the first CPU of node 2 is removed,
then the first one of node 1 is removed too. Later, the first CPU of node 2
is added back. Without this patch, the kernel numbers these CPUs using the
first available CPU ids, which are the ones freed by the second removal.
This leads to CPU ids 16-23 moving from node 1 to node 2. With the patch
applied, CPU ids 32-39 are used since they are the lowest free ones which
have not been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour 
---
V5:
  - Rework code structure
  - Reintroduce the capability to reuse other node's ids.
V4: addressing Nathan's comment
  - Rename the local variable named 'nid' into 'assigned_node'
V3: addressing Nathan's comments
  - Remove the retry feature
  - Reduce the number of local variables (removing 'i')
  - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
  V2: (no functional changes)
  - update the test's output in the commit's description
  - node_recorded_ids_map should be static
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 171 ++-
  1 file changed, 132 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..e1f224320102 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
  /* This version can't take the spinlock, because it never returns */
  static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
  
+/*

+ * Record the CPU ids used on each nodes.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
  static void rtas_stop_self(void)
  {
static struct rtas_args args;
@@ -139,72 +145,148 @@ static void pseries_cpu_die(unsigned int cpu)
paca_ptrs[cpu]->cpu_start = 0;
  }
  
+/**

+ * find_cpu_id_range - found a linear ranger of @nthreads free CPU ids.
+ * @nthreads : the number of threads (cpu ids)
+ * @assigned_node : the node it belongs to or NUMA_NO_NODE if free ids from any
+ *  node can be peek.
+ * @cpu_mask: the returned CPU mask.
+ *
+ * Returns 0 on success.
+ */
+static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
+cpumask_var_t *cpu_mask)
+{
+   cpumask_var_t candidate_mask;
+   unsigned int cpu, node;
+   int rc = -ENOSPC;
+
+   if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   cpumask_clear(*cpu_mask);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, *cpu_mask);
+
+   BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
+
+   /* Get a bitmap of unoccupied slots. */
+   cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   if (assigned_node != NUMA_NO_NODE) {
+   /*
+* 

Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-24 Thread Laurent Dufour

Hi Aneesh,

A little bit of wordsmithing below...

Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
  Documentation/powerpc/associativity.rst   | 135 
  arch/powerpc/include/asm/firmware.h   |   3 +-
  arch/powerpc/include/asm/prom.h   |   1 +
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c| 149 +-
  arch/powerpc/platforms/pseries/firmware.c |   1 +
  6 files changed, 286 insertions(+), 6 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 
property".

   architecture ^


+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more list of 
numbers
+(domainID index) that represents the 1 based ordinal in the associativity 
lists.
+The list of domainID index represnets increasing hierachy of resource grouping.

represents ^


+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node 
id.
+Linux kernel computes NUMA distance between two domains by recursively 
comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain 
ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more list numbers 
representing
+the domainIDs present in the system. The offset of the domainID in this 
property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, 
followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for 
domainID 8 is 1.
+
+"ibm,numa-distance-table" property contains one or more list of numbers 
representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, followed by
+N distance values encoded as with 

Re: [PATCH 3/3] powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()

2021-06-24 Thread Laurent Dufour

Le 22/06/2021 à 15:39, Daniel Henrique Barboza a écrit :

The validation done at the start of dlpar_memory_add_by_ic() is an all
or nothing scenario - if any LMB in the range is marked as RESERVED we
can fail right away.

We can then remove the 'lmbs_available' var and its check against
'lmbs_to_add' since the whole LMB range was already validated in the
previous step.


Reviewed-by: Laurent Dufour 


Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c0a03e1537cb..377d852f5a9a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -796,7 +796,6 @@ static int dlpar_memory_add_by_index(u32 drc_index)
  static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
  {
struct drmem_lmb *lmb, *start_lmb, *end_lmb;
-   int lmbs_available = 0;
int rc;
  
  	pr_info("Attempting to hot-add %u LMB(s) at index %x\n",

@@ -811,15 +810,14 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 
drc_index)
  
  	/* Validate that the LMBs in this range are not reserved */

for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
-   break;
-
-   lmbs_available++;
+   /* Fail immediately if the whole range can't be hot-added */
+   if (lmb->flags & DRCONF_MEM_RESERVED) {
+   pr_err("Memory at %llx (drc index %x) is reserved\n",
+   lmb->base_addr, lmb->drc_index);
+   return -EINVAL;
+   }
}
  
-	if (lmbs_available < lmbs_to_add)

-   return -EINVAL;
-
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
if (lmb->flags & DRCONF_MEM_ASSIGNED)
continue;





Re: [PATCH 2/3] powerpc/pseries: break early in dlpar_memory_add_by_count() loops

2021-06-24 Thread Laurent Dufour

Le 22/06/2021 à 15:39, Daniel Henrique Barboza a écrit :

After a successful dlpar_add_lmb() call the LMB is marked as reserved.
Later on, depending on whether we added enough LMBs or not, we rely on
the marked LMBs to see which ones might need to be removed, and we
remove the reservation of all of them.

These are done in for_each_drmem_lmb() loops without any break
condition. This means that we're going to check all LMBs of the partition
even after going through all the reserved ones.

This patch adds break conditions in both loops to avoid this. The
'lmbs_added' variable was renamed to 'lmbs_reserved', and it's now
being decremented each time a lmb reservation is removed, indicating
if there are still marked LMBs to be processed.


Reviewed-by: Laurent Dufour 


Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 17 -
  1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 28a7fd90232f..c0a03e1537cb 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -673,7 +673,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
  {
struct drmem_lmb *lmb;
int lmbs_available = 0;
-   int lmbs_added = 0;
+   int lmbs_reserved = 0;
int rc;
  
  	pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);

@@ -714,13 +714,12 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 * requested LMBs cannot be added.
 */
drmem_mark_lmb_reserved(lmb);
-
-   lmbs_added++;
-   if (lmbs_added == lmbs_to_add)
+   lmbs_reserved++;
+   if (lmbs_reserved == lmbs_to_add)
break;
}
  
-	if (lmbs_added != lmbs_to_add) {

+   if (lmbs_reserved != lmbs_to_add) {
pr_err("Memory hot-add failed, removing any added LMBs\n");
  
  		for_each_drmem_lmb(lmb) {

@@ -735,6 +734,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
dlpar_release_drc(lmb->drc_index);
  
  			drmem_remove_lmb_reservation(lmb);

+   lmbs_reserved--;
+
+   if (lmbs_reserved == 0)
+   break;
}
rc = -EINVAL;
} else {
@@ -745,6 +748,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
pr_debug("Memory at %llx (drc index %x) was 
hot-added\n",
 lmb->base_addr, lmb->drc_index);
drmem_remove_lmb_reservation(lmb);
+   lmbs_reserved--;
+
+   if (lmbs_reserved == 0)
+   break;
}
rc = 0;
}





Re: [PATCH 1/3] powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()

2021-06-24 Thread Laurent Dufour

Le 22/06/2021 à 15:39, Daniel Henrique Barboza a écrit :

The function is counting reserved LMBs as available to be added, but
they aren't. This will cause the function to miscalculate the available
LMBs and can trigger errors later on when executing dlpar_add_lmb().


Indeed, I'm wondering whether dlpar_add_lmb() would fail in that case, so it's
even better to check for that flag earlier.


Reviewed-by: Laurent Dufour 


Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 36f66556a7c6..28a7fd90232f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -683,6 +683,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
  
  	/* Validate that there are enough LMBs to satisfy the request */

for_each_drmem_lmb(lmb) {
+   if (lmb->flags & DRCONF_MEM_RESERVED)
+   continue;
+
if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
lmbs_available++;
  





[PATCH v5] pseries/drmem: update LMBs after LPM

2021-05-17 Thread Laurent Dufour
After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is handled by the kernel, but the memory's node is not updated because
there is no way to move a memory block between nodes from the Linux kernel
point of view.

If later a memory block is added or removed, drmem_update_dt() is called
and it overwrites the DT node ibm,dynamic-reconfiguration-memory to match
the added or removed LMB. But the LMBs' associativity has not been updated
after the DT node update and thus the node is overwritten by Linux's
topology instead of the hypervisor's one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMBs' associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt(),
because, in that case, the LMB tree has been used to set the DT property
and thus does not need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.

Cc: Nathan Lynch 
Cc: Aneesh Kumar K.V 
Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V5:
 - Reword the commit's description to address Nathan's comments.
V4:
 - Prevent the LMB to be updated back in the case the request came from the
 LMB tree's update.
V3:
 - Check rd->dn->name instead of rd->dn->full_name
V2:
 - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
 introducing a new hook mechanism.
---
 arch/powerpc/include/asm/drmem.h  |  1 +
 arch/powerpc/mm/drmem.c   | 46 +++
 .../platforms/pseries/hotplug-memory.c|  4 ++
 3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
 int __init
 walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..22197b18d85e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,7 @@ static int n_root_addr_cells, n_root_size_cells;
 
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;
 
 u64 drmem_lmb_memory_max(void)
 {
@@ -178,6 +179,11 @@ int drmem_update_dt(void)
if (!memory)
return -1;
 
+   /*
+* Set in_drmem_update to prevent the notifier callback to process the
+* DT property back since the change is coming from the LMB tree.
+*/
+   in_drmem_update = true;
prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
if (prop) {
rc = drmem_update_dt_v1(memory, prop);
@@ -186,6 +192,7 @@ int drmem_update_dt(void)
if (prop)
rc = drmem_update_dt_v2(memory, prop);
}
+   in_drmem_update = false;
 
of_node_put(memory);
return rc;
@@ -307,6 +314,45 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
 }
 
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   /*
+* Don't update the LMBs if triggered by the update done in
+* drmem_update_dt(), the LMB values have been used to the update the DT
+* property in that case.
+*/
+   if (in_drmem_update)
+   return;
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
 #endif
 
 static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
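
(The archive cuts the hotplug-memory.c hunk off here.) Based on the diffstat
(4 added lines) and the V2 changelog entry about relying on
OF_RECONFIG_UPDATE_PROPERTY, the hook is presumably wired into the existing
memory OF reconfig notifier roughly as follows; a sketch under that
assumption, not the verbatim hunk:

	static int pseries_memory_notifier(struct notifier_block *nb,
					   unsigned long action, void *data)
	{
		struct of_reconfig_data *rd = data;
		int err = 0;

		switch (action) {
		case OF_RECONFIG_ATTACH_NODE:
			err = pseries_add_mem_node(rd->dn);
			break;
		case OF_RECONFIG_DETACH_NODE:
			err = pseries_remove_mem_node(rd->dn);
			break;
		case OF_RECONFIG_UPDATE_PROPERTY:
			/* New: refresh the LMBs' aa_index when the hypervisor
			 * rewrites ibm,dynamic-reconfiguration-memory, e.g.
			 * after an LPM. */
			if (!strcmp(rd->dn->name,
				    "ibm,dynamic-reconfiguration-memory"))
				drmem_update_lmbs(rd->prop);
			break;
		}
		return notifier_from_errno(err);
	}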

[PATCH v2] ppc64/numa: consider the max numa node for migratable LPAR

2021-05-11 Thread Laurent Dufour
When an LPAR is migratable, we should consider the maximum possible number
of NUMA nodes instead of the number of NUMA nodes on the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if an LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know whether an LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' when it
is set up to migrate the partition, but that does not mean that the current
partition is migratable.

Without this patch, when an LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case, if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to a node because the kernel has been set to use up
to 2 nodes (the configuration of the departure box). With this patch
applied, the CPU is correctly added to the 3rd node.

Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup")
Reviewed-by: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
V2: Address Srikar's comments
 - Fix the commit message
 - Use pr_info instead printk(KERN_INFO..)
---
 arch/powerpc/mm/numa.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..094a1076fd1f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
 static void __init find_possible_nodes(void)
 {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;
 
@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)
 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
   &prop_length);
@@ -920,6 +925,8 @@ static void __init find_possible_nodes(void)
}
 
	max_nodes = of_read_number(&domains[min_common_depth], 1);
+   pr_info("Partition configured for %d NUMA nodes.\n", max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
-- 
2.31.1



Re: [PATCH] ppc64/numa: consider the max numa node for migratable LPAR

2021-05-10 Thread Laurent Dufour

Le 10/05/2021 à 12:21, Srikar Dronamraju a écrit :

* Laurent Dufour  [2021-04-29 20:19:01]:


When a LPAR is migratable, we should consider the maximum possible NUMA
node instead the number of NUMA node from the actual system.

The DT property 'ibm,current-associativity-domains' is defining the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is being migrated on another box, it may seen up to the nodes defined by
'ibm,max-associativity-domains'. So if a LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know if a LPAR is migratable or
not. The hypervisor is exporting the property 'ibm,migratable-partition' in
the case it set to migrate partition, but that would not mean that the
current partition is migratable.

Without that patch, when a LPAR is started on a 2 nodes box and then
migrated to a 3 nodes box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to the node because the kernel has been set to use




up to 2 nodes (the configuration of the departure node). With that patch
applies, the CPU is correctly added to the 3rd node.


You probably meant, "With this patch applied"

Also you may want to add a fixes tag:


I'll fix "that" and add the fixes tag.


Cc: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/mm/numa.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..673fa6e47850 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;

@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)
 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
   &prop_length);
@@ -920,6 +925,9 @@ static void __init find_possible_nodes(void)
}

	max_nodes = of_read_number(&domains[min_common_depth], 1);
+   printk(KERN_INFO "Partition configured for %d NUMA nodes.\n",
+  max_nodes);
+


Another nit:
you may want to make this pr_info instead of printk


Sure !


for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
--
2.31.1



Otherwise looks good to me.

Reviewed-by: Srikar Dronamraju 


Thanks Srikar, I'll add your review tag in v2.




Re: [PATCH v4] pseries/drmem: update LMBs after LPM

2021-05-05 Thread Laurent Dufour

Le 05/05/2021 à 00:30, Nathan Lynch a écrit :

Hi Laurent,


Hi Nathan,

Thanks for your review.


Bear with me while I work through the commit message:

Laurent Dufour  writes:

After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.


Yes, the RTAS functions ibm,update-nodes and ibm,update-properties,
which the OS invokes after resuming, may bring in updated properties
under the ibm,dynamic-reconfiguration-memory node, including the
ibm,associativity-lookup-arrays property.


This is caught by the kernel,


"Caught" makes me think this is an error condition, as in catching an
exception. I guess "handled" better conveys your meaning?


ok




but the memory's node is updated because
there is no way to move a memory block between nodes.


"The memory's node" refers the ibm,dynamic-reconfiguration-memory DT
node, yes? Or is it referring to Linux's NUMA nodes? ("move a memory
block between nodes" in your statement here refers to Linux's NUMA
nodes, that much is clear to me.)

I am failing to follow the cause->effect relationship stated. True,
changing a block's node assignment while it's in use isn't safe. I don't
see why that implies that "the memory's node is updated"? In fact this
seems contradictory.

This statement makes more sense to me if I change it to "the memory's
node is _not_ updated" -- is this what you intended?


Correct, I dropped the 'not' word here ;)




If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB.


I understand this, but I will expand on it.

dlpar_memory()
  -> dlpar_memory_add_by_count()
       -> dlpar_add_lmb()
            -> update_lmb_associativity_index()
                 ... lmb->aa_index = ...
  -> drmem_update_dt()

update_lmb_associativity_index() retrieves the firmware description of
the new block, and sets the aa_index of the matching entry in the
drmem_info array to the value matching the firmware description.

Then, drmem_update_dt() walks the drmem_info array and synthesizes a new
/ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory-v2 property based
on the recently updated information in that array.


Yes




But the LMB's associativity node has not been updated after the DT
node update and thus the node is overwritten by the Linux's topology
instead of the hypervisor one.


So, an example of the problem is:

1. VM migrates. On resume, ibm,associativity-lookup-arrays is changed
via ibm,update-properties. Entries in the drmem_info array remain
unchanged, with aa_index values that correspond to the source
system's ibm,associativity-lookup-arrays property, now inaccessible.

2. A memory block is added. We look up the new block's entry in the
drmem_info array, and set the aa_index to the value matching the
current ibm,associativity-lookup-arrays.

3. Then, the ibm,associativity-lookup-arrays property is completely
regenerated from the drmem_info array, which reflects a mixture of
information from the source and destination systems.

Do I understand correctly?


Yes





Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt().
Because, in that case, the LMB tree has been used to set the DT property
and thus it doesn't need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.


This strikes me as almost a revert of e978a3ccaa71 ("powerpc/pseries:
remove obsolete memory hotplug DT notifier code").


Not really; it is not identical to reverting e978a3ccaa71: here only the aa_index
of the LMB is updated, everything else is kept in place. I don't try to apply the
memory layout changes, I just update the in-use LMBs' aa_index field.


The only matching point with the code reverted by the commit you mentioned would
be the use of a global variable, in_drmem_update, instead of the previous
rtas_hp_event, to prevent the LMB tree from being updated again during a memory
hotplug event.



I'd rather avoid smuggling through global state information that ought
to be passed in function parameters, if it should be passed around at
all. Despite having (IMO) relatively simple responsibilities, this code
is difficult to change and review; adding this property makes it
worse. If the structure of the code is pushing us toward this kind of
compromise, then the code probably needs more fundamental changes.

I'm probably forgetting something -- can anyone remind me why we need an
array of these:

struct drmem_lmb {
u64 base_addr;
u32 drc_index;
u32 aa_index;
  

Re: [PATCH] powerpc/pseries/dlpar: use rtas_get_sensor()

2021-05-04 Thread Laurent Dufour

Le 04/05/2021 à 04:53, Nathan Lynch a écrit :

Instead of making bare calls to get-sensor-state, use
rtas_get_sensor(), which correctly handles busy and extended delay
statuses.


Reviewed-by: Laurent Dufour 


Fixes: ab519a011caa ("powerpc/pseries: Kernel DLPAR Infrastructure")
Signed-off-by: Nathan Lynch 
---
  arch/powerpc/platforms/pseries/dlpar.c | 9 +++--
  1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index 3ac70790ec7a..b1f01ac0c29e 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -289,8 +289,7 @@ int dlpar_acquire_drc(u32 drc_index)
  {
int dr_status, rc;
  
-	rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status,
-		       DR_ENTITY_SENSE, drc_index);
+	rc = rtas_get_sensor(DR_ENTITY_SENSE, drc_index, &dr_status);
if (rc || dr_status != DR_ENTITY_UNUSABLE)
return -1;
  
@@ -311,8 +310,7 @@ int dlpar_release_drc(u32 drc_index)

  {
int dr_status, rc;
  
-	rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status,
-		       DR_ENTITY_SENSE, drc_index);
+	rc = rtas_get_sensor(DR_ENTITY_SENSE, drc_index, &dr_status);
if (rc || dr_status != DR_ENTITY_PRESENT)
return -1;
  
@@ -333,8 +331,7 @@ int dlpar_unisolate_drc(u32 drc_index)

  {
int dr_status, rc;
  
-	rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status,
-		       DR_ENTITY_SENSE, drc_index);
+	rc = rtas_get_sensor(DR_ENTITY_SENSE, drc_index, &dr_status);
if (rc || dr_status != DR_ENTITY_PRESENT)
return -1;
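
(For context, the busy/extended-delay handling the commit message refers to follows the
usual RTAS retry pattern. The sketch below, with a hypothetical get_sensor_with_retry()
name, is a simplified illustration of that pattern and not the exact rtas_get_sensor()
implementation; rtas_token(), rtas_call() and rtas_busy_delay() are the existing kernel
helpers.)

static int get_sensor_with_retry(int sensor, int index, int *state)
{
	int token = rtas_token("get-sensor-state");
	int rc;

	if (token == RTAS_UNKNOWN_SERVICE)
		return -ENOENT;

	do {
		/* Raw firmware call; may return a busy or 990x extended-delay status. */
		rc = rtas_call(token, 2, 2, state, sensor, index);
	} while (rtas_busy_delay(rc));	/* sleeps and retries while firmware is busy */

	return rc;
}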
  





[PATCH v4] pseries/drmem: update LMBs after LPM

2021-05-04 Thread Laurent Dufour
After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt(),
because in that case the LMB tree has been used to set the DT property
and thus doesn't need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.

Cc: Aneesh Kumar K.V 
Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V4:
 - Prevent the LMB from being updated back in the case the request came from the
 LMB tree's update.
V3:
 - Check rd->dn->name instead of rd->dn->full_name
V2:
 - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
 introducing a new hook mechanism.
---
 arch/powerpc/include/asm/drmem.h  |  1 +
 arch/powerpc/mm/drmem.c   | 46 +++
 .../platforms/pseries/hotplug-memory.c|  4 ++
 3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
 int __init
 walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..22197b18d85e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,7 @@ static int n_root_addr_cells, n_root_size_cells;
 
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;
 
 u64 drmem_lmb_memory_max(void)
 {
@@ -178,6 +179,11 @@ int drmem_update_dt(void)
if (!memory)
return -1;
 
+   /*
+	 * Set in_drmem_update to prevent the notifier callback from processing the
+* DT property back since the change is coming from the LMB tree.
+*/
+   in_drmem_update = true;
prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
if (prop) {
rc = drmem_update_dt_v1(memory, prop);
@@ -186,6 +192,7 @@ int drmem_update_dt(void)
if (prop)
rc = drmem_update_dt_v2(memory, prop);
}
+   in_drmem_update = false;
 
of_node_put(memory);
return rc;
@@ -307,6 +314,45 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
 }
 
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   /*
+* Don't update the LMBs if triggered by the update done in
+	 * drmem_update_dt(), the LMB values have been used to update the DT
+* property in that case.
+*/
+   if (in_drmem_update)
+   return;
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
 #endif
 
 static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powe
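
(The archive truncates the v4 posting at this point. According to the diffstat above,
the hotplug-memory.c change is a four-line addition, presumably the same hunk as in the
v3 patch quoted later in this thread:)

	case OF_RECONFIG_UPDATE_PROPERTY:
		if (!strcmp(rd->dn->name,
			    "ibm,dynamic-reconfiguration-memory"))
			drmem_update_lmbs(rd->prop);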

Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-05-04 Thread Laurent Dufour

Le 03/05/2021 à 22:44, Tyrel Datwyler a écrit :

On 5/3/21 10:28 AM, Laurent Dufour wrote:

Le 01/05/2021 à 01:58, Tyrel Datwyler a écrit :

On 4/30/21 9:13 AM, Laurent Dufour wrote:

Le 29/04/2021 à 21:12, Tyrel Datwyler a écrit :

On 4/29/21 3:27 AM, Aneesh Kumar K.V wrote:

Laurent Dufour  writes:



Snip



As of today I don't have a problem with your patch. This was more of me pointing
out things that I think are currently wrong with our memory hotplug
implementation, and that we need to take a long hard look at it down the road.


I do agree, there are a lot of odd things to address in this area.
If you're ok with that patch, do you mind adding a reviewed-by?



Can you send a v4 with the fix for the duplicate update included?


Of course!
I wrote it last week but left it in the to-be-sent list, my mistake.


Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-05-03 Thread Laurent Dufour

Le 01/05/2021 à 01:58, Tyrel Datwyler a écrit :

On 4/30/21 9:13 AM, Laurent Dufour wrote:

Le 29/04/2021 à 21:12, Tyrel Datwyler a écrit :

On 4/29/21 3:27 AM, Aneesh Kumar K.V wrote:

Laurent Dufour  writes:


After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
   - Check rd->dn->name instead of rd->dn->full_name
V2:
   - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
   introducing a new hook mechanism.
---
   arch/powerpc/include/asm/drmem.h  |  1 +
   arch/powerpc/mm/drmem.c   | 35 +++
   .../platforms/pseries/hotplug-memory.c    |  4 +++
   3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h
b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
   int __init
   walk_drmem_lmbs_early(unsigned long node, void *data,
     int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
   #endif
     static inline void invalidate_lmb_associativity_index(struct drmem_lmb
*lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node,
void *data,
   return ret;
   }
   +/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+  __maybe_unused const __be32 **usm,
+  __maybe_unused void *data)
+{
+    struct drmem_lmb *lmb;
+
+    /*
+ * Brut force there may be better way to fetch the LMB
+ */
+    for_each_drmem_lmb(lmb) {
+    if (lmb->drc_index != updated_lmb->drc_index)
+    continue;
+
+    lmb->aa_index = updated_lmb->aa_index;
+    break;
+    }
+    return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+    if (!strcmp(prop->name, "ibm,dynamic-memory"))
+    __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+    else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+    __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
   #endif
     static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct
notifier_block *nb,
   case OF_RECONFIG_DETACH_NODE:
   err = pseries_remove_mem_node(rd->dn);
   break;
+    case OF_RECONFIG_UPDATE_PROPERTY:
+    if (!strcmp(rd->dn->name,
+    "ibm,dynamic-reconfiguration-memory"))
+    drmem_update_lmbs(rd->prop);
   }
   return notifier_from_errno(err);


How will this interact with DLPAR memory? When we dlpar memory,
ibm,configure-connector is used to fetch the new associativity details
and set drmem_lmb->aa_index correctly there. Once that is done kernel
then call drmem_update_dt() which will result in the above notifier
callback?

IIUC, the call back then will update drmem_lmb->aa_index again?


After digging through some of this code I'm a bit concerned about all the kernel
device tree manipulation around memory DLPAR both with the assoc-lookup-array
prop update and post dynamic-memory prop updating. We build a drmem_info array
of the LMBs from the device-tree at boot. I don't really understand why we are
manipulating the device tree property every time we add/remove an LMB. Not sure
what the reasoning was for writing back in particular the aa_index and flags for each
LMB into the device tree when we already have them in the drmem_info array. On
the other hand the assoc-lookup-array I suppose would need to have an in ke

Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-30 Thread Laurent Dufour

Le 29/04/2021 à 21:12, Tyrel Datwyler a écrit :

On 4/29/21 3:27 AM, Aneesh Kumar K.V wrote:

Laurent Dufour  writes:


After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 35 +++
  .../platforms/pseries/hotplug-memory.c|  4 +++
  3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  
  static int init_drmem_lmb_size(struct device_node *dn)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
case OF_RECONFIG_DETACH_NODE:
err = pseries_remove_mem_node(rd->dn);
break;
+   case OF_RECONFIG_UPDATE_PROPERTY:
+   if (!strcmp(rd->dn->name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_update_lmbs(rd->prop);
}
return notifier_from_errno(err);


How will this interact with DLPAR memory? When we dlpar memory,
ibm,configure-connector is used to fetch the new associativity details
and set drmem_lmb->aa_index correctly there. Once that is done kernel
then call drmem_update_dt() which will result in the above notifier
callback?

IIUC, the call back then will update drmem_lmb->aa_index again?


After digging through some of this code I'm a bit concerned about all the kernel
device tree manipulation around memory DLPAR both with the assoc-lookup-array
prop update and post dynamic-memory prop updating. We build a drmem_info array
of the LMBs from the device-tree at boot. I don't really understand why we are
manipulating the device tree property every time we add/remove an LMB. Not sure
the reasoning was to write back in particular the aa_index and flags for each
LMB into the device tree when we already have them in the drmem_info array. On
the other hand the assoc-lookup-array I

Re: [PATCH] ppc64/numa: consider the max numa node for migratable LPAR

2021-04-30 Thread Laurent Dufour

Le 29/04/2021 à 21:29, Tyrel Datwyler a écrit :

On 4/29/21 11:19 AM, Laurent Dufour wrote:

When a LPAR is migratable, we should consider the maximum possible number of
NUMA nodes instead of the number of NUMA nodes of the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if a LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know if a LPAR is migratable or
not. The hypervisor is exporting the property 'ibm,migratable-partition' in
the case it set to migrate partition, but that would not mean that the
current partition is migratable.


Wording is a little hard to follow for me here. From PAPR the
'ibm,migratable-partition' property presence indicates that the platform
supports the potential migration of the partition. I guess maybe the point is
that all migratable partitions define 'ibm,migratable-partition', but all
partitions that define 'ibm,migratable-partition' are not necessarily 
migratable.


That's what I meant.

Laurent


-Tyrel



Without that patch, when a LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case, if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to a node because the kernel has been set to use
up to 2 nodes (the configuration of the departure node). With that patch
applied, the CPU is correctly added to the 3rd node.

Cc: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/mm/numa.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..673fa6e47850 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;

@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)
 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-	domains = of_get_property(rtas, "ibm,current-associativity-domains",
-			&prop_length);
+	if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+		domains = of_get_property(rtas,
+					  "ibm,current-associativity-domains",
+					  &prop_length);
 	if (!domains) {
 		domains = of_get_property(rtas, "ibm,max-associativity-domains",
 					  &prop_length);
@@ -920,6 +925,9 @@ static void __init find_possible_nodes(void)
 	}
 
 	max_nodes = of_read_number(&domains[min_common_depth], 1);
+   printk(KERN_INFO "Partition configured for %d NUMA nodes.\n",
+  max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);







[PATCH] ppc64/numa: consider the max numa node for migratable LPAR

2021-04-29 Thread Laurent Dufour
When a LPAR is migratable, we should consider the maximum possible number of
NUMA nodes instead of the number of NUMA nodes of the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if a LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know if a LPAR is migratable or
not. The hypervisor is exporting the property 'ibm,migratable-partition' in
the case it set to migrate partition, but that would not mean that the
current partition is migratable.

Without that patch, when a LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case, if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to a node because the kernel has been set to use
up to 2 nodes (the configuration of the departure node). With that patch
applied, the CPU is correctly added to the 3rd node.

Cc: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/mm/numa.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..673fa6e47850 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
 static void __init find_possible_nodes(void)
 {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;
 
@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)
 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-	domains = of_get_property(rtas, "ibm,current-associativity-domains",
-			&prop_length);
+	if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+		domains = of_get_property(rtas,
+					  "ibm,current-associativity-domains",
+					  &prop_length);
 	if (!domains) {
 		domains = of_get_property(rtas, "ibm,max-associativity-domains",
 					  &prop_length);
@@ -920,6 +925,9 @@ static void __init find_possible_nodes(void)
 	}
 
 	max_nodes = of_read_number(&domains[min_common_depth], 1);
+   printk(KERN_INFO "Partition configured for %d NUMA nodes.\n",
+  max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
-- 
2.31.1



[PATCH v5] pseries: prevent free CPU ids to be reused on another node

2021-04-29 Thread Laurent Dufour
When a CPU is hot added, the CPU ids are taken from the available mask, starting
from the lowest possible set. If that set of values was previously used for CPUs
attached to a different node, it looks to applications as if these CPUs had
migrated from one node to another, which is not expected in real life.

To prevent this, it is needed to record the CPU ids used for each node and
not to reuse them on another node. However, to prevent CPU hot plug from
failing in the case the CPU ids are starved on a node, the capability to reuse
other nodes’ free CPU ids is kept. A warning is displayed in such a case
to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then whenever a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU is
hot unplugged, as a reminder that these CPU ids have been used for this node.

If no id set was found, a retry is made without removing the ids used on
the other nodes to try reusing them. This is the way ids have been
allocated prior to this patch.

The effect of this patch can be seen by removing and adding CPUs using the
QEMU monitor. In the following case, the first CPU from node 2 is
removed, then the first one from node 1 is removed too. Later, the
first CPU of node 2 is added back. Without this patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from node 1 to node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour 
---
V5:
 - Rework code structure
 - Reintroduce the capability to reuse other node's ids.
V4: addressing Nathan's comment
 - Rename the local variable named 'nid' into 'assigned_node'
V3: addressing Nathan's comments
 - Remove the retry feature
 - Reduce the number of local variables (removing 'i')
 - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
 V2: (no functional changes)
 - update the test's output in the commit's description
 - node_recorded_ids_map should be static
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 171 ++-
 1 file changed, 132 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..e1f224320102 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
 /* This version can't take the spinlock, because it never returns */
 static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
 
+/*
+ * Record the CPU ids used on each nodes.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
 static void rtas_stop_self(void)
 {
static struct rtas_args args;
@@ -139,72 +145,148 @@ static void pseries_cpu_die(unsigned int cpu)
paca_ptrs[cpu]->cpu_start = 0;
 }
 
+/**
+ * find_cpu_id_range - find a linear range of @nthreads free CPU ids.
+ * @nthreads : the number of threads (cpu ids)
+ * @assigned_node : the node it belongs to or NUMA_NO_NODE if free ids from any
+ *  node can be peek.
+ * @cpu_mask: the returned CPU mask.
+ *
+ * Returns 0 on success.
+ */
+static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
+cpumask_var_t *cpu_mask)
+{
+   cpumask_var_t candidate_mask;
+   unsigned int cpu, node;
+   int rc = -ENOSPC;
+
+	if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   cpumask_clear(*cpu_mask);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, *cpu_mask);
+
+   BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
+
+   /* Get a bitmap of unoccupied slots. */
+   cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   if (assigned_node != NUMA_NO_NODE) {
+   /*
+* Remove free ids previously assigned on the other nodes. We
+* can walk only online nodes because once a node became 

Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-29 Thread Laurent Dufour

Le 29/04/2021 à 12:27, Aneesh Kumar K.V a écrit :

Laurent Dufour  writes:


After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 35 +++
  .../platforms/pseries/hotplug-memory.c|  4 +++
  3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  
  static int init_drmem_lmb_size(struct device_node *dn)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
case OF_RECONFIG_DETACH_NODE:
err = pseries_remove_mem_node(rd->dn);
break;
+   case OF_RECONFIG_UPDATE_PROPERTY:
+   if (!strcmp(rd->dn->name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_update_lmbs(rd->prop);
}
return notifier_from_errno(err);


How will this interact with DLPAR memory? When we dlpar memory,
ibm,configure-connector is used to fetch the new associativity details
and set drmem_lmb->aa_index correctly there. Once that is done kernel
then call drmem_update_dt() which will result in the above notifier
callback?

IIUC, the call back then will update drmem_lmb->aa_index again?


Thanks for pointing this out, Aneesh,

You're right, I missed that callback, and it was quite invisible during my tests
because the value set back in the aa_index was the same.


When drmem_update_dt() is called, there is no need to update the LMB back and
the DT modify notifier should be ignored.


As DLPAR operations are serialized (by lock_device_hotplug()), I'm proposing to
rely on a static boolean variable to skip this notification, like this:


diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index f0a6633132af..3c0130720086 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,
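
(The diff is cut off here by the archive. The complete guard, as posted in the v4 patch
earlier in this archive, boils down to a file-scope flag that drmem_update_dt() holds
while it rewrites the property, and an early return in drmem_update_lmbs(); condensed
sketch below, elisions marked with "...":)

 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;

 int drmem_update_dt(void)
 {
	...
+	/*
+	 * Set in_drmem_update to prevent the notifier callback from processing the
+	 * DT property back since the change is coming from the LMB tree.
+	 */
+	in_drmem_update = true;
	/* existing ibm,dynamic-memory / ibm,dynamic-memory-v2 rewrite */
+	in_drmem_update = false;
	...
 }

 void drmem_update_lmbs(struct property *prop)
 {
+	/*
+	 * Don't update the LMBs if triggered by the update done in
+	 * drmem_update_dt(), the LMB values have been used to update the DT
+	 * property in that case.
+	 */
+	if (in_drmem_update)
+		return;
	...
 }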

Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-29 Thread Laurent Dufour

Le 29/04/2021 à 13:31, Laurent Dufour a écrit :

Le 29/04/2021 à 12:27, Aneesh Kumar K.V a écrit :

Laurent Dufour  writes:


After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 35 +++
  .../platforms/pseries/hotplug-memory.c    |  4 +++
  3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
    int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, 
void *data,

  return ret;
  }
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+  __maybe_unused const __be32 **usm,
+  __maybe_unused void *data)
+{
+    struct drmem_lmb *lmb;
+
+    /*
+ * Brut force there may be better way to fetch the LMB
+ */
+    for_each_drmem_lmb(lmb) {
+    if (lmb->drc_index != updated_lmb->drc_index)
+    continue;
+
+    lmb->aa_index = updated_lmb->aa_index;
+    break;
+    }
+    return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+    if (!strcmp(prop->name, "ibm,dynamic-memory"))
+    __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+    else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+    __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c

index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,

  case OF_RECONFIG_DETACH_NODE:
  err = pseries_remove_mem_node(rd->dn);
  break;
+    case OF_RECONFIG_UPDATE_PROPERTY:
+    if (!strcmp(rd->dn->name,
+    "ibm,dynamic-reconfiguration-memory"))
+    drmem_update_lmbs(rd->prop);
  }
  return notifier_from_errno(err);


How will this interact with DLPAR memory? When we dlpar memory,
ibm,configure-connector is used to fetch the new associativity details
and set drmem_lmb->aa_index correctly there. Once that is done kernel
then call drmem_update_dt() which will result in the above notifier
callback?


When a memory DLPAR operation is done, the in-memory DT property
"ibm,dynamic-memory" or "ibm,dynamic-memory-v2" (if existing) has to be updated
to reflect the added/removed memory part. This is done by calling
drmem_update_dt().


This patch addresses the case where the hypervisor has updated the DT
property mentioned above. In that case, the LMB tree should be updated so that
the aa_index fields match the DT ones. This way, the next time a memory DLPAR
operation is done, the DT properties "ibm,dynamic-memory" or
"ibm,dynamic-memory-v2" will be rebuilt correctly.



IIUC, the call back then will update drmem_lmb->aa_index again?


Oh I missed what you pointed out.
Please ignore my previous answer, I need to double check code.

drmem_update_dt(

Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-29 Thread Laurent Dufour

Le 29/04/2021 à 12:27, Aneesh Kumar K.V a écrit :

Laurent Dufour  writes:


After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 35 +++
  .../platforms/pseries/hotplug-memory.c|  4 +++
  3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  
  static int init_drmem_lmb_size(struct device_node *dn)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
case OF_RECONFIG_DETACH_NODE:
err = pseries_remove_mem_node(rd->dn);
break;
+   case OF_RECONFIG_UPDATE_PROPERTY:
+   if (!strcmp(rd->dn->name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_update_lmbs(rd->prop);
}
return notifier_from_errno(err);


How will this interact with DLPAR memory? When we dlpar memory,
ibm,configure-connector is used to fetch the new associativity details
and set drmem_lmb->aa_index correctly there. Once that is done kernel
then call drmem_update_dt() which will result in the above notifier
callback?


When a memory DLPAR operation is done, the in-memory DT property
"ibm,dynamic-memory" or "ibm,dynamic-memory-v2" (if existing) has to be updated
to reflect the added/removed memory part. This is done by calling drmem_update_dt().


This patch addresses the case where the hypervisor has updated the DT
property mentioned above. In that case, the LMB tree should be updated so that
the aa_index fields match the DT ones. This way, the next time a memory DLPAR
operation is done, the DT properties "ibm,dynamic-memory" or
"ibm,dynamic-memory-v2" will be rebuilt correctly.



IIUC, the call back then will update drmem_lmb->aa_index again?


drmem_upd

[PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-28 Thread Laurent Dufour
After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
 - Check rd->dn->name instead of rd->dn->full_name
V2:
 - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
 introducing a new hook mechanism.
---
 arch/powerpc/include/asm/drmem.h  |  1 +
 arch/powerpc/mm/drmem.c   | 35 +++
 .../platforms/pseries/hotplug-memory.c|  4 +++
 3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
 int __init
 walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
 }
 
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
 #endif
 
 static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
case OF_RECONFIG_DETACH_NODE:
err = pseries_remove_mem_node(rd->dn);
break;
+   case OF_RECONFIG_UPDATE_PROPERTY:
+   if (!strcmp(rd->dn->name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_update_lmbs(rd->prop);
}
return notifier_from_errno(err);
 }
-- 
2.31.1



Re: [PATCH] pseries/drmem: update LMBs after LPM

2021-04-27 Thread Laurent Dufour

Michael, this is a v2 despite the mail's subject, sorry for the mess.


[PATCH] pseries/drmem: update LMBs after LPM

2021-04-27 Thread Laurent Dufour
After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 

Change since V1:
 - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
 introducing a new hook mechanism.
---
 arch/powerpc/include/asm/drmem.h  |  1 +
 arch/powerpc/mm/drmem.c   | 35 +++
 .../platforms/pseries/hotplug-memory.c|  4 +++
 3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
 int __init
 walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
 }
 
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
 #endif
 
 static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..8aabaafc484b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
case OF_RECONFIG_DETACH_NODE:
err = pseries_remove_mem_node(rd->dn);
break;
+   case OF_RECONFIG_UPDATE_PROPERTY:
+   if (!strcmp(rd->dn->full_name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_update_lmbs(rd->prop);
}
return notifier_from_errno(err);
 }
-- 
2.31.1



Re: [PATCH] pseries/drmem: update LMBs after LPM

2021-04-27 Thread Laurent Dufour

Le 27/04/2021 à 19:01, Tyrel Datwyler a écrit :

On 4/27/21 8:01 AM, Laurent Dufour wrote:

After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Signed-off-by: Laurent Dufour 
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 48 +++
  arch/powerpc/platforms/pseries/mobility.c |  9 +
  3 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..55c2c25085b0 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(void);
  #endif

  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..46074bdfdb3c 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,54 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }

+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(void)
+{
+   struct device_node *node;
+   const __be32 *prop;
+
+   node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!node)
+   return;
+
+   prop = of_get_property(node, "ibm,dynamic-memory", NULL);
+   if (prop) {
+   __walk_drmem_v1_lmbs(prop, NULL, NULL, update_lmb);
+   } else {
+   prop = of_get_property(node, "ibm,dynamic-memory-v2", NULL);
+   if (prop)
+   __walk_drmem_v2_lmbs(prop, NULL, NULL, update_lmb);
+   }
+
+   of_node_put(node);
+}
  #endif

  static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index ea4d6a660e0d..c68eccc6e8df 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -25,6 +25,7 @@

  #include 
  #include 
+#include 
  #include "pseries.h"
  #include "../../kernel/cacheinfo.h"

@@ -237,6 +238,7 @@ int pseries_devicetree_update(s32 scope)
__be32 *data;
int update_nodes_token;
int rc;
+   bool drmem_updated = false;

update_nodes_token = rtas_token("ibm,update-nodes");
if (update_nodes_token == RTAS_UNKNOWN_SERVICE)
@@ -271,6 +273,10 @@ int pseries_devicetree_update(s32 scope)
continue;
}

+		if (!strcmp(np->full_name,
+			    "ibm,dynamic-reconfiguration-memory"))
+			drmem_updated = true;


Is there a reason that we can't use the existing pseries_memory_notifier()
callback in pseries/hotplug-memory.c to trigger the drmem_update_lmbs() when
either the ibm,dynamic-memory or ibm,dynamic-memory-v2 properties are updated?


Thanks a lot Tyrel!

That's far more elegant, I'll send a v2 soon.

Laurent.


Something like:

static int pseries_memory_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
 struct of_reconfig_data *rd = data;
 int err = 0;

 switch (action) {
 case OF_RECONFIG_ATTACH_NODE:
 err = pserie
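
(Tyrel's example is cut off by the archive at this point; based on the v2/v3 patches
quoted earlier in this thread, the remainder is presumably along these lines, with the
existing pseries_add_mem_node() handler completing the truncated line:)

		err = pseries_add_mem_node(rd->dn);
		break;
	case OF_RECONFIG_DETACH_NODE:
		err = pseries_remove_mem_node(rd->dn);
		break;
	case OF_RECONFIG_UPDATE_PROPERTY:
		if (!strcmp(rd->dn->name,
			    "ibm,dynamic-reconfiguration-memory"))
			drmem_update_lmbs(rd->prop);
		break;
	}
	return notifier_from_errno(err);
}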

[PATCH] pseries/drmem: update LMBs after LPM

2021-04-27 Thread Laurent Dufour
After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is updated because
there is no way to move a memory block between nodes.

If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update
and thus the node is overwritten by the Linux's topology instead of the
hypervisor one.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/include/asm/drmem.h  |  1 +
 arch/powerpc/mm/drmem.c   | 48 +++
 arch/powerpc/platforms/pseries/mobility.c |  9 +
 3 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..55c2c25085b0 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
 int __init
 walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(void);
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..46074bdfdb3c 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,54 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
 }
 
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   /*
+* Brut force there may be better way to fetch the LMB
+*/
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(void)
+{
+   struct device_node *node;
+   const __be32 *prop;
+
+   node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!node)
+   return;
+
+   prop = of_get_property(node, "ibm,dynamic-memory", NULL);
+   if (prop) {
+   __walk_drmem_v1_lmbs(prop, NULL, NULL, update_lmb);
+   } else {
+   prop = of_get_property(node, "ibm,dynamic-memory-v2", NULL);
+   if (prop)
+   __walk_drmem_v2_lmbs(prop, NULL, NULL, update_lmb);
+   }
+
+   of_node_put(node);
+}
 #endif
 
 static int init_drmem_lmb_size(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index ea4d6a660e0d..c68eccc6e8df 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include 
 #include "pseries.h"
 #include "../../kernel/cacheinfo.h"
 
@@ -237,6 +238,7 @@ int pseries_devicetree_update(s32 scope)
__be32 *data;
int update_nodes_token;
int rc;
+   bool drmem_updated = false;
 
update_nodes_token = rtas_token("ibm,update-nodes");
if (update_nodes_token == RTAS_UNKNOWN_SERVICE)
@@ -271,6 +273,10 @@ int pseries_devicetree_update(s32 scope)
continue;
}
 
+   if (!strcmp(np->full_name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_updated = true;
+
switch (action) {
case DELETE_DT_NODE:
delete_dt_node(np);
@@ -293,6 +299,9 @@ int pseries_devicetree_update(s32 scope)
} while (rc == 1);
 
kfree(rtas_buf);
+
+   if (drmem_updated)
+   drmem_update_lmbs();
return rc;
 }
 
-- 
2.31.1



Re: [PATCH v4] pseries: prevent free CPU ids to be reused on another node

2021-04-20 Thread Laurent Dufour

Le 07/04/2021 à 17:38, Laurent Dufour a écrit :

When a CPU is hot added, the CPU ids are taken from the available mask,
starting from the lowest possible set. If that set of values was previously
used for CPUs attached to a different node, it looks to applications as if
these CPUs have migrated from one node to another, which is not expected in
real life.

To prevent this, the CPU ids used for each node need to be recorded and not
reused on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes’ free CPU ids
is kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Changes since V3, addressing Nathan's comment:
  - Rename the local variable named 'nid' into 'assigned_node'
Changes since V2, addressing Nathan's comments:
  - Remove the retry feature
  - Reduce the number of local variables (removing 'i')
  - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
Changes since V1 (no functional changes):
  - update the test's output in the commit's description
  - node_recorded_ids_map should be static

Signed-off-by: Laurent Dufour 


I did further LPM tests with this patch applied, and not allowing the fallback
of reusing free ids of another node turns out to be too strong a restriction.


It is easy to hit that limitation when an LPAR is running at the maximum number
of CPUs it is configured for and an LPAR migration leads to the activation of a new node.


For instance, consider a dedicated LPAR configured with a max of 32 CPUs (4
cores SMT 8). At boot time, cpu_possible_mask is filled with CPU ids 0-31 in
smp_setup_cpu_maps() by reading the DT property "/rtas/ibm,lrdr-capacity", so
the highest CPU id for this LPAR is 31.


Departure box:
node 0 : CPU 0-31
Arrival box:
node 0 : CPU 0-15
node 1 : CPU 16-31 << need to reuse ids from node 0

Virtualizing the CPU ids would have a big impact, as they are used in several
places in the kernel to index linear tables.


But when the LPAR is migratable (the DT property "ibm,migratable-partition" is
present), we may set the highest CPU id to NR_CPUS (usually 2048), to limit the
cases where a CPU id has to be reused on a different node. Doing this will have
an impact on some data allocations done in the kernel when their size is based
on num_possible_cpus.
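
For illustration only, a rough sketch of that idea (untested, property lookup
and placement assumed) could look like this in smp_setup_cpu_maps():

	/*
	 * Hypothetical sketch: when the partition may migrate, do not cap
	 * the possible CPU ids to the current "ibm,lrdr-capacity" value,
	 * so that nodes activated after an LPM can still get fresh ids.
	 */
	if (of_property_read_bool(of_root, "ibm,migratable-partition"))
		maxcpus = NR_CPUS;

	for (cpu = 0; cpu < maxcpus && cpu < nr_cpu_ids; cpu++)
		set_cpu_possible(cpu, true);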


Any better idea?

Thanks,
Laurent.



[PATCH v4] pseries: prevent free CPU ids to be reused on another node

2021-04-07 Thread Laurent Dufour
When a CPU is hot added, the CPU ids are taken from the available mask,
starting from the lowest possible set. If that set of values was previously
used for CPUs attached to a different node, it looks to applications as if
these CPUs have migrated from one node to another, which is not expected in
real life.

To prevent this, the CPU ids used for each node need to be recorded and not
reused on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes’ free CPU ids
is kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Changes since V3, addressing Nathan's comment:
 - Rename the local variable named 'nid' into 'assigned_node'
Changes since V2, addressing Nathan's comments:
 - Remove the retry feature
 - Reduce the number of local variables (removing 'i')
 - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
Changes since V1 (no functional changes):
 - update the test's output in the commit's description
 - node_recorded_ids_map should be static

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 52 ++--
 1 file changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index ec478f8a98ff..cddf6d2db786 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
 /* This version can't take the spinlock, because it never returns */
 static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
 
+/*
+ * Record the CPU ids used on each nodes.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
 static void rtas_stop_self(void)
 {
static struct rtas_args args;
@@ -151,9 +157,9 @@ static void pseries_cpu_die(unsigned int cpu)
  */
 static int pseries_add_processor(struct device_node *np)
 {
-   unsigned int cpu;
+   unsigned int cpu, node;
cpumask_var_t candidate_mask, tmp;
-   int err = -ENOSPC, len, nthreads, i;
+   int err = -ENOSPC, len, nthreads, assigned_node;
const __be32 *intserv;
 
	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
@@ -163,9 +169,17 @@ static int pseries_add_processor(struct device_node *np)
	zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
	zalloc_cpumask_var(&tmp, GFP_KERNEL);
 
+   /*
+* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
+* node id the added CPU belongs to.
+*/
+   assigned_node = of_node_to_nid(np);
+   if (assigned_node < 0 || !node_possible(assigned_node))
+   assigned_node = first_online_node;
+
nthreads = len / sizeof(u32);
-   for (i = 0; i < nthreads; i++)
-   cpumask_set_cpu(i, tmp);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, tmp);
 
cpu_maps_update_begin();
 
@@ -173,6 +187,19 @@ static int pseries_add_processor(struct device_node *np)
 
/* Get a bitmap of unoccupied slots. */
cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   /*
+* Remove free ids previously assigned on the other nodes. We can walk
+* only online nodes because once a

Re: [PATCH v3] pseries: prevent free CPU ids to be reused on another node

2021-04-07 Thread Laurent Dufour

Le 07/04/2021 à 16:55, Nathan Lynch a écrit :

Laurent Dufour  writes:

Changes since V2, addressing Nathan's comments:
  - Remove the retry feature
  - Reduce the number of local variables (removing 'i')


I was more interested in not having two variables for NUMA nodes in the
function named 'node' and 'nid', hoping at least one of them could have
a more descriptive name. See below.


  static int pseries_add_processor(struct device_node *np)
  {
-   unsigned int cpu;
+   unsigned int cpu, node;
cpumask_var_t candidate_mask, tmp;
-   int err = -ENOSPC, len, nthreads, i;
+   int err = -ENOSPC, len, nthreads, nid;
const __be32 *intserv;
  
  	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);

@@ -163,9 +169,17 @@ static int pseries_add_processor(struct device_node *np)
	zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
	zalloc_cpumask_var(&tmp, GFP_KERNEL);
  
+	/*

+* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
+* node id the added CPU belongs to.
+*/
+   nid = of_node_to_nid(np);
+   if (nid < 0 || !node_possible(nid))
+   nid = first_online_node;
+
nthreads = len / sizeof(u32);
-   for (i = 0; i < nthreads; i++)
-   cpumask_set_cpu(i, tmp);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, tmp);
  
  	cpu_maps_update_begin();
  
@@ -173,6 +187,19 @@ static int pseries_add_processor(struct device_node *np)
  
  	/* Get a bitmap of unoccupied slots. */

cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   /*
+* Remove free ids previously assigned on the other nodes. We can walk
+* only online nodes because once a node becomes online it is never
+* turned offline again.
+*/
+   for_each_online_node(node) {
+   if (node == nid) /* Keep our node's recorded ids */
+   continue;
+   cpumask_andnot(candidate_mask, candidate_mask,
+  node_recorded_ids_map[node]);
+   }
+


e.g. change 'nid' to 'assigned_node' or similar, and I think this
becomes easier to follow.


Fair enough, will send a v4


Otherwise the patch looks fine to me now.





[PATCH v3] pseries: prevent free CPU ids to be reused on another node

2021-04-06 Thread Laurent Dufour
When a CPU is hot added, the CPU ids are taken from the available mask,
starting from the lowest possible set. If that set of values was previously
used for CPUs attached to a different node, it looks to applications as if
these CPUs have migrated from one node to another, which is not expected in
real life.

To prevent this, the CPU ids used for each node need to be recorded and not
reused on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes’ free CPU ids
is kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Changes since V2, addressing Nathan's comments:
 - Remove the retry feature
 - Reduce the number of local variables (removing 'i')
 - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
Changes since V1 (no functional changes):
 - update the test's output in the commit's description
 - node_recorded_ids_map should be static

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 51 ++--
 1 file changed, 46 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index ec478f8a98ff..f3fd4807dc3e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
 /* This version can't take the spinlock, because it never returns */
 static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
 
+/*
+ * Record the CPU ids used on each nodes.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
 static void rtas_stop_self(void)
 {
static struct rtas_args args;
@@ -151,9 +157,9 @@ static void pseries_cpu_die(unsigned int cpu)
  */
 static int pseries_add_processor(struct device_node *np)
 {
-   unsigned int cpu;
+   unsigned int cpu, node;
cpumask_var_t candidate_mask, tmp;
-   int err = -ENOSPC, len, nthreads, i;
+   int err = -ENOSPC, len, nthreads, nid;
const __be32 *intserv;
 
	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
@@ -163,9 +169,17 @@ static int pseries_add_processor(struct device_node *np)
	zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
	zalloc_cpumask_var(&tmp, GFP_KERNEL);
 
+   /*
+* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
+* node id the added CPU belongs to.
+*/
+   nid = of_node_to_nid(np);
+   if (nid < 0 || !node_possible(nid))
+   nid = first_online_node;
+
nthreads = len / sizeof(u32);
-   for (i = 0; i < nthreads; i++)
-   cpumask_set_cpu(i, tmp);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, tmp);
 
cpu_maps_update_begin();
 
@@ -173,6 +187,19 @@ static int pseries_add_processor(struct device_node *np)
 
/* Get a bitmap of unoccupied slots. */
cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   /*
+* Remove free ids previously assigned on the other nodes. We can walk
+* only online nodes because once a node becomes online it is never
+* turned offline again.
+*/
+   for_each_online_node(node) {
+   if (node == nid) /* Keep our 

Re: [PATCH v2] pseries: prevent free CPU ids to be reused on another node

2021-04-02 Thread Laurent Dufour

Thanks Nathan for reviewing this.

Le 02/04/2021 à 15:34, Nathan Lynch a écrit :

Hi Laurent,

Laurent Dufour  writes:

When a CPU is hot added, the CPU ids are taken from the available mask,
starting from the lowest possible set. If that set of values was previously
used for CPUs attached to a different node, it looks to applications as if
these CPUs have migrated from one node to another, which is not expected in
real life.


This seems like a problem that could affect other architectures or
platforms? I guess as long as arch code is responsible for placing new
CPUs in cpu_present_mask, that code will have the responsibility of
ensuring CPU IDs' NUMA assignments remain stable.


Actually, x86 is already handling this issue in its arch-specific code, see
8f54969dc8d6 ("x86/acpi: Introduce persistent storage for cpuid <-> apicid
mapping"). I didn't check the other architectures, but as CPU id allocation is in
the arch part, I believe it is up to each arch to deal with this issue.


Making the CPU id allocation common to all arch is outside the scope of this 
patch.


[...]


The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47


Good demonstration of the problem. CPUs 16-23 "move" from node 1 to
node 2.


Thanks





diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..48c7943b25b0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,8 @@
  /* This version can't take the spinlock, because it never returns */
  static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
  
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];


I guess this should have documentation that it must be
accessed/manipulated with cpu_add_remove_lock held?


I'll add a comment before the declaration to make this clear.




+
  static void rtas_stop_self(void)
  {
static struct rtas_args args;
@@ -151,29 +153,61 @@ static void pseries_cpu_die(unsigned int cpu)
   */
  static int pseries_add_processor(struct device_node *np)
  {
-   unsigned int cpu;
+   unsigned int cpu, node;
cpumask_var_t candidate_mask, tmp;
-   int err = -ENOSPC, len, nthreads, i;
+   int err = -ENOSPC, len, nthreads, i, nid;


 From eight local vars to ten, and the two new variables' names are
"node" and "nid". More distinctive names would help readers.


I agree that's confusing, I'll do some cleanup.





const __be32 *intserv;
+   bool force_reusing = false;
  
  	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);

if (!intserv)
return 0;
  
-	zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);

-   zalloc_cpumask_var(&tmp, GFP_KERNEL);
+   alloc_cpumask_var(&candidate_mask, GFP_KERNEL);
+   alloc_cpumask_var(&tmp, GFP_KERNEL);
+
+   /*
+* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
+* node id the added CPU belongs to.
+*/
+   nid = of_node_to_nid(np);
+   if (nid < 0 || !node_possible(nid))
+   nid = first_online_node;
  
  	nthreads = len / sizeof(u32);

-   for (i = 0; i < nthreads; i++)
-   cpumask_set_cpu(i, tmp);
  
  	cpu_maps_update_begin();
  
  	BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
  
+again:

+   cpumask_clear(candidate_mask);
+   cpumask_clear(tmp);
+   for (i = 0; i < nthreads; i++)
+   cpumask_set_cpu(i, tmp);
+
/* Get a bitmap of unoccupied slots. */
cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   /*
+* Remove free ids previously assi

Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso

2021-03-29 Thread Laurent Dufour

Le 26/03/2021 à 20:17, Dmitry Safonov a écrit :

Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
VVAR page is in front of the VDSO area. As a result it breaks CRIU
(Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
from /proc/../maps points at the ELF/vdso image, rather than at the VVAR data page.
Laurent made a patch to keep CRIU working (by reading the aux vector).
But I think it still makes sense to separate the two mappings into different
VMAs. It will also make ppc64 less "special" for userspace and as
a side-bonus will make the VVAR page un-writable by debuggers (which previously
would COW the page, which can be unexpected).

I opportunistically Cc stable on it: I understand that usually such
stuff isn't a stable material, but that will allow us in CRIU have
one workaround less that is needed just for one release (v5.11) on
one platform (ppc64), which we otherwise have to maintain.
I wouldn't go as far as to say that the commit 511157ab641e is ABI
regression as no other userspace got broken, but I'd really appreciate
if it gets backported to v5.11 after v5.12 is released, so as not
to complicate already non-simple CRIU-vdso code. Thanks!

Cc: Andrei Vagin 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Christophe Leroy 
Cc: Laurent Dufour 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sta...@vger.kernel.org # v5.11
[1]: https://github.com/checkpoint-restore/criu/issues/1417
Signed-off-by: Dmitry Safonov 
Tested-by: Christophe Leroy 


I ran CRIU's test suite and, except for the usual suspects, all the tests passed.

Tested-by: Laurent Dufour 


---
  arch/powerpc/include/asm/mmu_context.h |  2 +-
  arch/powerpc/kernel/vdso.c | 54 +++---
  2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 652ce85f9410..4bc45d3ed8b0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -263,7 +263,7 @@ extern void arch_exit_mmap(struct mm_struct *mm);
  static inline void arch_unmap(struct mm_struct *mm,
  unsigned long start, unsigned long end)
  {
-   unsigned long vdso_base = (unsigned long)mm->context.vdso - PAGE_SIZE;
+   unsigned long vdso_base = (unsigned long)mm->context.vdso;
  
  	if (start <= vdso_base && vdso_base < end)

mm->context.vdso = NULL;
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index e839a906fdf2..b14907209822 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -55,10 +55,10 @@ static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struc
  {
unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
  
-	if (new_size != text_size + PAGE_SIZE)

+   if (new_size != text_size)
return -EINVAL;
  
-	current->mm->context.vdso = (void __user *)new_vma->vm_start + PAGE_SIZE;

+   current->mm->context.vdso = (void __user *)new_vma->vm_start;
  
  	return 0;

  }
@@ -73,6 +73,10 @@ static int vdso64_mremap(const struct vm_special_mapping *sm, struct vm_area_str
	return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start);
  }
  
+static struct vm_special_mapping vvar_spec __ro_after_init = {

+   .name = "[vvar]",
+};
+
  static struct vm_special_mapping vdso32_spec __ro_after_init = {
.name = "[vdso]",
.mremap = vdso32_mremap,
@@ -89,11 +93,11 @@ static struct vm_special_mapping vdso64_spec __ro_after_init = {
   */
  static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
  {
-   struct mm_struct *mm = current->mm;
+   unsigned long vdso_size, vdso_base, mappings_size;
struct vm_special_mapping *vdso_spec;
+   unsigned long vvar_size = PAGE_SIZE;
+   struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   unsigned long vdso_size;
-   unsigned long vdso_base;
  
  	if (is_32bit_task()) {

vdso_spec = _spec;
@@ -110,8 +114,8 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int
vdso_base = 0;
}
  
-	/* Add a page to the vdso size for the data page */

-   vdso_size += PAGE_SIZE;
+   mappings_size = vdso_size + vvar_size;
+   mappings_size += (VDSO_ALIGNMENT - 1) & PAGE_MASK;
  
  	/*

 * pick a base address for the vDSO in process space. We try to put it
@@ -119,9 +123,7 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int
 * and end up putting it elsewhere.
 * Add enough to the size so that the result can be aligned.
 */
-   vdso_base = get_unmapped_area(NULL, vdso_base,
- vdso_size + ((VDSO_ALIGNMENT - 1) & PAGE_MASK),
-   

Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso

2021-03-29 Thread Laurent Dufour

Hi Christophe and Dimitry,

Le 27/03/2021 à 18:43, Dmitry Safonov a écrit :

Hi Christophe,

On 3/27/21 5:19 PM, Christophe Leroy wrote:
[..]

I opportunistically Cc stable on it: I understand that usually such
stuff isn't a stable material, but that will allow us in CRIU have
one workaround less that is needed just for one release (v5.11) on
one platform (ppc64), which we otherwise have to maintain.


Why is that a workaround, and why for one release only? I think the
solution proposed by Laurent to use the aux vector AT_SYSINFO_EHDR should
work with any past and future release.


Yeah, I guess.
Previously, (before v5.11/power) all kernels had ELF start at "[vdso]"
VMA start, now we'll have to carry the offset in the VMA. Probably, not
the worst thing, but as it will be only for v5.11 release it can break,
so needs separate testing.
Kinda life was a bit easier without this additional code.
The assumption that ELF header is at the start of "[vdso]" is perhaps not a good 
one, but using a "[vvar]" section looks more conventional and allows to clearly 
identify the data part. I'd argue for this option.





I wouldn't go as far as to say that the commit 511157ab641e is ABI
regression as no other userspace got broken, but I'd really appreciate
if it gets backported to v5.11 after v5.12 is released, so as not
to complicate already non-simple CRIU-vdso code. Thanks!

Cc: Andrei Vagin 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Christophe Leroy 
Cc: Laurent Dufour 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sta...@vger.kernel.org # v5.11
[1]: https://github.com/checkpoint-restore/criu/issues/1417
Signed-off-by: Dmitry Safonov 
Tested-by: Christophe Leroy 


I tested it with the sigreturn_vdso selftest and it worked, because that
selftest doesn't involve VDSO data.


Thanks again on helping with testing it, I appreciate it!


But if I do a mremap() on the VDSO text vma without remapping VVAR to
keep the same distance between the two vmas, gettimeofday() crashes. The
reason is that the code obtains the address of the data by calculating a
fixed offset from its own address with the below macro, the delta
being resolved at link time:

.macro get_datapage ptr
 bcl    20, 31, .+4
999:
 mflr    \ptr
#if CONFIG_PPC_PAGE_SHIFT > 14
 addis    \ptr, \ptr, (_vdso_datapage - 999b)@ha
#endif
 addi    \ptr, \ptr, (_vdso_datapage - 999b)@l
.endm

So the datapage needs to remain at the same distance from the code at
all times.

I'm wondering how the other architectures manage to have two independent VMAs
and still be able to move one independently of the other.


It's alright as far as I know. If userspace remaps vdso/vvar it should
be aware of this (CRIU keeps this in mind, also old vdso image is dumped
to compare on restore with the one that the host has).


I do agree, playing with the VDSO mapping needs the application to be aware of 
the mapping details, and prior to 83d3f0e90c6c "powerpc/mm: tracking vDSO 
remap", remapping the VDSO was not working on PowerPC and nobody complained...


Laurent.



Re: VDSO ELF header

2021-03-25 Thread Laurent Dufour

Le 25/03/2021 à 17:56, Laurent Dufour a écrit :

Le 25/03/2021 à 17:46, Christophe Leroy a écrit :

Hi Laurent

Le 25/03/2021 à 17:11, Laurent Dufour a écrit :

Hi Christophe,

Since v5.11 and the changes you made to the VDSO code, it no longer exposes
the ELF header at the beginning of the VDSO mapping in user space.


This is confusing CRIU which is checking for this ELF header cookie 
(https://github.com/checkpoint-restore/criu/issues/1417).


How does it do on other architectures ?


Good question, I'll double check the CRIU code.


On x86, there are 2 VDSO entries:
77fcb000-77fce000 r--p  00:00 0  [vvar]
77fce000-77fcf000 r-xp  00:00 0  [vdso]

And the VDSO is starting with the ELF header.







I'm not an expert in the loading and ELF part and, reading the change you made, I
can't identify how this could work now, as I expect the loader to need
that ELF header to do the relocation.


I think the loader is able to find it at the expected place.


Actually, it seems the loader relies on the AUX vector AT_SYSINFO_EHDR. I guess 
CRIU should do the same.
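
For what it's worth, a minimal userspace sketch of that lookup, using the glibc
getauxval() helper (illustrative only), could be:

	#include <elf.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/auxv.h>

	int main(void)
	{
		/* The kernel advertises the vDSO ELF header in the aux vector. */
		unsigned long ehdr = getauxval(AT_SYSINFO_EHDR);

		if (!ehdr)
			return 1;

		/* Sanity check: the address must point at an ELF header. */
		if (!memcmp((void *)ehdr, ELFMAG, SELFMAG))
			printf("vDSO ELF header found at %#lx\n", ehdr);

		return 0;
	}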




 From my investigation it seems that the first bytes of the VDSO area are now 
the vdso_arch_data.


Is the ELF header put somewhere else?
How could the loader process the VDSO without that ELF header?



Like most other architectures, we now have the data section as first page and 
the text section follows. So you will likely find the elf header on the second 
page.


I'm wondering if the data section you're referring to is the vvar section I can
see on x86.





Done in this commit: 
https://github.com/linuxppc/linux/commit/511157ab641eb6bedd00d62673388e78a4f871cf


I'll double check on x86, but anyway, I think CRIU should rely on 
AT_SYSINFO_EHDR and not assume that the ELF header is at the beginning of VDSO 
mapping.


Thanks for your help.
Laurent.





Re: VDSO ELF header

2021-03-25 Thread Laurent Dufour

Le 25/03/2021 à 17:46, Christophe Leroy a écrit :

Hi Laurent

Le 25/03/2021 à 17:11, Laurent Dufour a écrit :

Hi Christophe,

Since v5.11 and the changes you made to the VDSO code, it no longer exposes the
ELF header at the beginning of the VDSO mapping in user space.


This is confusing CRIU which is checking for this ELF header cookie 
(https://github.com/checkpoint-restore/criu/issues/1417).


How does it do on other architectures ?


Good question, I'll double check the CRIU code.





I'm not an expert in the loading and ELF part and, reading the change you made, I
can't identify how this could work now, as I expect the loader to need
that ELF header to do the relocation.


I think the loader is able to find it at the expected place.


Actually, it seems the loader relies on the AUX vector AT_SYSINFO_EHDR. I guess 
CRIU should do the same.




 From my investigation it seems that the first bytes of the VDSO area are now 
the vdso_arch_data.


Is the ELF header put somewhere else?
How could the loader process the VDSO without that ELF header?



Like most other architectures, we now have the data section as first page and 
the text section follows. So you will likely find the elf header on the second 
page.


Done in this commit: 
https://github.com/linuxppc/linux/commit/511157ab641eb6bedd00d62673388e78a4f871cf


I'll double check on x86, but anyway, I think CRIU should rely on 
AT_SYSINFO_EHDR and not assume that the ELF header is at the beginning of VDSO 
mapping.


Thanks for your help.
Laurent.



[PATCH v2] pseries: prevent free CPU ids to be reused on another node

2021-03-25 Thread Laurent Dufour
When a CPU is hot added, the CPU ids are taken from the available mask,
starting from the lowest possible set. If that set of values was previously
used for CPUs attached to a different node, it looks to applications as if
these CPUs have migrated from one node to another, which is not expected in
real life.

To prevent this, the CPU ids used for each node need to be recorded and not
reused on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes’ free CPU ids
is kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

If no id set was found, a retry is made without removing the ids used on
the other nodes to try reusing them. This is the way ids have been
allocated prior to this patch.

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Changes since V1 (no functional changes):
 - update the test's output in the commit's description
 - node_recorded_ids_map should be static

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 83 ++--
 1 file changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..48c7943b25b0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,8 @@
 /* This version can't take the spinlock, because it never returns */
 static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
 
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
 static void rtas_stop_self(void)
 {
static struct rtas_args args;
@@ -151,29 +153,61 @@ static void pseries_cpu_die(unsigned int cpu)
  */
 static int pseries_add_processor(struct device_node *np)
 {
-   unsigned int cpu;
+   unsigned int cpu, node;
cpumask_var_t candidate_mask, tmp;
-   int err = -ENOSPC, len, nthreads, i;
+   int err = -ENOSPC, len, nthreads, i, nid;
const __be32 *intserv;
+   bool force_reusing = false;
 
	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return 0;
 
-   zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
-   zalloc_cpumask_var(&tmp, GFP_KERNEL);
+   alloc_cpumask_var(&candidate_mask, GFP_KERNEL);
+   alloc_cpumask_var(&tmp, GFP_KERNEL);
+
+   /*
+* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
+* node id the added CPU belongs to.
+*/
+   nid = of_node_to_nid(np);
+   if (nid < 0 || !node_possible(nid))
+   nid = first_online_node;
 
nthreads = len / sizeof(u32);
-   for (i = 0; i < nthreads; i++)
-   cpumask_set_cpu(i, tmp);
 
cpu_maps_update_begin();
 
BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
 
+again:
+   cpumask_clear(candidate_mask);
+   cpumask_clear(tmp);
+   for (i = 0; i < nthreads; i++)
+   cpumask_set_cpu(i, tmp);
+
/* Get a bitmap of unoccupied slots. */
cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   /*
+* Remove free ids previously assigned on the other nodes. We can walk
+* only online nodes because once a node becomes online it is never
+* turned offline again.
+*/
+   if (!force_reusing)
+   for_ea

[PATCH] pseries: prevent free CPU ids to be reused on another node

2021-03-18 Thread Laurent Dufour
When a CPU is hot added, the CPU ids are taken from the available mask,
starting from the lowest possible set. If that set of values was previously
used for CPUs attached to a different node, it looks to applications as if
these CPUs have migrated from one node to another, which is not expected in
real life.

To prevent this, the CPU ids used for each node need to be recorded and not
reused on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes’ free CPU ids
is kept. A warning is displayed in such a case to warn the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

If no id set was found, a retry is made without removing the ids used on
the other nodes to try reusing them. This is the way ids have been
allocated prior to this patch.

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first CPU ids available, which are the ones
freed by the second CPU removal (the one from node 1). This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
node 2 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

Unpatched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 83 ++--
 1 file changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..dc5797110d6e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,8 @@
 /* This version can't take the spinlock, because it never returns */
 static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
 
+cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
 static void rtas_stop_self(void)
 {
static struct rtas_args args;
@@ -151,29 +153,61 @@ static void pseries_cpu_die(unsigned int cpu)
  */
 static int pseries_add_processor(struct device_node *np)
 {
-   unsigned int cpu;
+   unsigned int cpu, node;
cpumask_var_t candidate_mask, tmp;
-   int err = -ENOSPC, len, nthreads, i;
+   int err = -ENOSPC, len, nthreads, i, nid;
const __be32 *intserv;
+   bool force_reusing = false;
 
	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return 0;
 
-   zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
-   zalloc_cpumask_var(&tmp, GFP_KERNEL);
+   alloc_cpumask_var(&candidate_mask, GFP_KERNEL);
+   alloc_cpumask_var(&tmp, GFP_KERNEL);
+
+   /*
+* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
+* node id the added CPU belongs to.
+*/
+   nid = of_node_to_nid(np);
+   if (nid < 0 || !node_possible(nid))
+   nid = first_online_node;
 
nthreads = len / sizeof(u32);
-   for (i = 0; i < nthreads; i++)
-   cpumask_set_cpu(i, tmp);
 
cpu_maps_update_begin();
 
BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
 
+again:
+   cpumask_clear(candidate_mask);
+   cpumask_clear(tmp);
+   for (i = 0; i < nthreads; i++)
+   cpumask_set_cpu(i, tmp);
+
/* Get a bitmap of unoccupied slots. */
cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   /*
+* Remove free ids previously assigned on the other nodes. We can walk
+* only online nodes because once a node becomes online it is never
+* turned offline again.
+*/
+   if (!force_reusing)
+   for_each_online_node(node) {
+   if (node == nid) /* Keep our 

[PATCH] cxl: don't manipulate the mm.mm_users field directly

2021-03-10 Thread Laurent Dufour
It is better to rely on the API provided by the MM layer instead of
directly manipulating the mm_users field.

Signed-off-by: Laurent Dufour 
---
 drivers/misc/cxl/fault.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 01153b74334a..60c829113299 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -200,7 +200,7 @@ static struct mm_struct *get_mem_context(struct cxl_context *ctx)
if (ctx->mm == NULL)
return NULL;
 
-   if (!atomic_inc_not_zero(&ctx->mm->mm_users))
+   if (!mmget_not_zero(ctx->mm))
return NULL;
 
return ctx->mm;
-- 
2.30.1



Re: [PATCH] powerpc/pseries: export LPAR security flavor in lparcfg

2021-03-05 Thread Laurent Dufour

Le 05/03/2021 à 12:43, Michael Ellerman a écrit :

Laurent Dufour  writes:

Le 05/03/2021 à 07:23, Michael Ellerman a écrit :

Laurent Dufour  writes:

This is helpful to read the security flavor from inside the LPAR.


We already have /sys/kernel/debug/powerpc/security_features.

Is that not sufficient?


Not really, it only reports whether security mitigations are on or off, but not the
level set through the ASMI menu. Furthermore, reporting it through
/proc/powerpc/lparcfg allows easy processing by the lparstat command (see
below).




Export it like this in /proc/powerpc/lparcfg:

$ grep security_flavor /proc/powerpc/lparcfg
security_flavor=1

Value means:
0 Speculative execution fully enabled
1 Speculative execution controls to mitigate user-to-kernel attacks
2 Speculative execution controls to mitigate user-to-kernel and
user-to-user side-channel attacks


Those strings come from the FSP help, but we have no guarantee it won't
mean something different in future.


I think this is nailed down, those strings came from:
https://www.ibm.com/support/pages/node/715841

Where it is written (regarding AIX):

On an LPAR, one can use lparstat -x to display the current mitigation mode:
0 = Speculative execution fully enabled
1 = Speculative execution controls to mitigate user-to-kernel side-channel 
attacks
2 = Speculative execution controls to mitigate user-to-kernel and user-to-user
side-channel attacks

We have been requested to provide almost the same, which I proposed in
powerpc-utils:
https://groups.google.com/g/powerpc-utils-devel/c/NaKXvdyl_UI/m/wa2stpIDAQAJ


OK. Do you mind sending a v2 with all those details incorporated into
the change log?


Ok will do so.

Thanks


Re: [PATCH] powerpc/pseries: export LPAR security flavor in lparcfg

2021-03-05 Thread Laurent Dufour

Le 05/03/2021 à 07:23, Michael Ellerman a écrit :

Laurent Dufour  writes:

This is helpful to read the security flavor from inside the LPAR.


We already have /sys/kernel/debug/powerpc/security_features.

Is that not sufficient?


Not really, it only reports whether security mitigations are on or off, but not the
level set through the ASMI menu. Furthermore, reporting it through
/proc/powerpc/lparcfg allows easy processing by the lparstat command (see
below).




Export it like this in /proc/powerpc/lparcfg:

$ grep security_flavor /proc/powerpc/lparcfg
security_flavor=1

Value means:
0 Speculative execution fully enabled
1 Speculative execution controls to mitigate user-to-kernel attacks
2 Speculative execution controls to mitigate user-to-kernel and
   user-to-user side-channel attacks


Those strings come from the FSP help, but we have no guarantee it won't
mean something different in future.


I think this is nailed down, those strings came from:
https://www.ibm.com/support/pages/node/715841

Where it is written (regarding AIX):

On an LPAR, one can use lparstat -x to display the current mitigation mode:
0 = Speculative execution fully enabled
1 = Speculative execution controls to mitigate user-to-kernel side-channel 
attacks
2 = Speculative execution controls to mitigate user-to-kernel and user-to-user 
side-channel attacks


We have been requested to provide almost the same, which I proposed in 
powerpc-utils:

https://groups.google.com/g/powerpc-utils-devel/c/NaKXvdyl_UI/m/wa2stpIDAQAJ
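
For illustration, picking the value up from a tool like lparstat could be as
simple as this (hypothetical sketch, not the actual powerpc-utils code):

	#include <stdio.h>

	/* Sketch: return the security_flavor value, or -1 if not found. */
	static int read_security_flavor(void)
	{
		char line[128];
		int flavor = -1;
		FILE *f = fopen("/proc/powerpc/lparcfg", "r");

		if (!f)
			return -1;

		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "security_flavor=%d", &flavor) == 1)
				break;

		fclose(f);
		return flavor;
	}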

Thanks,
Laurent.


[PATCH] powerpc/pseries: export LPAR security flavor in lparcfg

2021-03-04 Thread Laurent Dufour
This is helpful to read the security flavor from inside the LPAR.

Export it like this in /proc/powerpc/lparcfg:

$ grep security_flavor /proc/powerpc/lparcfg
security_flavor=1

Value means:
0 Speculative execution fully enabled
1 Speculative execution controls to mitigate user-to-kernel attacks
2 Speculative execution controls to mitigate user-to-kernel and
  user-to-user side-channel attacks

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/include/asm/hvcall.h| 1 +
 arch/powerpc/platforms/pseries/lparcfg.c | 2 ++
 arch/powerpc/platforms/pseries/pseries.h | 1 +
 arch/powerpc/platforms/pseries/setup.c   | 8 
 4 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index ed6086d57b22..455e188da26d 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -389,6 +389,7 @@
 #define H_CPU_BEHAV_FAVOUR_SECURITY(1ull << 63) // IBM bit 0
 #define H_CPU_BEHAV_L1D_FLUSH_PR   (1ull << 62) // IBM bit 1
 #define H_CPU_BEHAV_BNDS_CHK_SPEC_BAR  (1ull << 61) // IBM bit 2
+#define H_CPU_BEHAV_FAVOUR_SECURITY_H  (1ull << 60) // IBM bit 3
 #define H_CPU_BEHAV_FLUSH_COUNT_CACHE  (1ull << 58) // IBM bit 5
 #define H_CPU_BEHAV_FLUSH_LINK_STACK   (1ull << 57) // IBM bit 6
 
diff --git a/arch/powerpc/platforms/pseries/lparcfg.c b/arch/powerpc/platforms/pseries/lparcfg.c
index e278390ab28d..35f6c4929fbd 100644
--- a/arch/powerpc/platforms/pseries/lparcfg.c
+++ b/arch/powerpc/platforms/pseries/lparcfg.c
@@ -169,6 +169,7 @@ static void show_gpci_data(struct seq_file *m)
kfree(buf);
 }
 
+
 static unsigned h_pic(unsigned long *pool_idle_time,
  unsigned long *num_procs)
 {
@@ -537,6 +538,7 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
parse_em_data(m);
maxmem_data(m);
 
+   seq_printf(m, "security_flavor=%u\n", pseries_security_flavor);
return 0;
 }
 
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 4fe48c04c6c2..a25517dc2515 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -111,6 +111,7 @@ static inline unsigned long cmo_get_page_size(void)
 
 int dlpar_workqueue_init(void);
 
+extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
 void pseries_lpar_read_hblkrm_characteristics(void);
 
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 46e1540abc22..59080413a269 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -85,6 +85,7 @@ EXPORT_SYMBOL(CMO_PageSize);
 
 int fwnmi_active;  /* TRUE if an FWNMI handler is present */
 int ibm_nmi_interlock_token;
+u32 pseries_security_flavor;
 
 static void pSeries_show_cpuinfo(struct seq_file *m)
 {
@@ -534,9 +535,16 @@ static void init_cpu_char_feature_flags(struct h_cpu_char_result *result)
/*
 * The features below are enabled by default, so we instead look to see
 * if firmware has *disabled* them, and clear them if so.
+* H_CPU_BEHAV_FAVOUR_SECURITY_H could be set only if
+* H_CPU_BEHAV_FAVOUR_SECURITY is.
 */
if (!(result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY))
security_ftr_clear(SEC_FTR_FAVOUR_SECURITY);
+   else if (result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY_H)
+   pseries_security_flavor = 1;
+   else
+   pseries_security_flavor = 2;
+
 
if (!(result->behaviour & H_CPU_BEHAV_L1D_FLUSH_PR))
security_ftr_clear(SEC_FTR_L1D_FLUSH_PR);
-- 
2.30.1



Re: [PATCH v12 00/31] Speculative page faults

2020-12-14 Thread Laurent Dufour

Le 14/12/2020 à 03:03, Joel Fernandes a écrit :

On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]

Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for those high-TLP ones,

# launch time of applications(s):

package          version    w/ SPF   w/o SPF   improve(%)
----------------------------------------------------------
Baidu maps       10.13.3    0.887    0.98      9.49
Taobao           8.4.0.35   1.227    1.293     5.10
Meituan          9.12.401   1.107    1.543     28.26
WeChat           7.0.3      2.353    2.68      12.20
Honor of Kings   1.43.1.6   6.63     6.713     1.24


That's great news, thanks for reporting this!



By the way, we have verified our platforms with those patches and
achieved the goal of mass production.


Another good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.


Hi Laurent,

Our targeted hardware belongs to ARM64 multi-core series.


Hello!

I was trying to develop an intuition about why does SPF give improvement for
you on small CPU systems. This is just a high-level theory but:

1. Assume the improvement is because of elimination of "blocking" on
mmap_sem.
Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
places, thus causing blocking on mmap_sem in other paths? If so, is it
feasible to convert such usages to acquiring them in read-mode?


That's correct, and the goal of this series is to try not holding the mmap_sem 
in read mode during page fault processing.


Converting an mmap_sem holder from write to read mode is not so easy, and that
work has already been done in some places. If you think there are areas where
this could be done, you're welcome to send patches fixing that.
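
To make the kind of conversion meant here concrete, an illustrative
(hypothetical) helper that only reads the VMA list can take the semaphore in
read mode instead of write mode; a sketch only, each call site needs its own
analysis to prove nothing is written under the lock:

	/* Before the conversion this would have used down_write()/up_write(). */
	static unsigned long vma_start_of(struct mm_struct *mm, unsigned long addr)
	{
		struct vm_area_struct *vma;
		unsigned long start = 0;

		down_read(&mm->mmap_sem);	/* readers can now run concurrently */
		vma = find_vma(mm, addr);
		if (vma)
			start = vma->vm_start;
		up_read(&mm->mmap_sem);

		return start;
	}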



2. Assume the improvement is because of lesser read-side contention on
mmap_sem.
On small CPU systems, I would not expect reducing cache-line bouncing to give
such a dramatic improvement in performance as you are seeing.


I don't think cache line bouncing reduction is the main source of the
performance improvement; I would rather think this is the lesser part here.
I guess this is mainly because during loading time a lot of page faults are
occurring and thus SPF is reducing the contention on the mmap_sem.



Thanks for any insight on this!

- Joel





[PATCH] powerpc/memhotplug: quieting some DLPAR operations

2020-12-11 Thread Laurent Dufour
When attempting to remove by index a set of LMB a lot of messages are
displayed on the console, even when everything goes fine:

 pseries-hotplug-mem: Attempting to hot-remove LMB, drc index 802d
 Offlined Pages 4096
 pseries-hotplug-mem: Memory at 2d000 was hot-removed

The 2 messages prefixed by "pseries-hotplug-mem" are not really helpful for
the end user, they should be debug outputs.

In case of error, because some of the LMB's pages couldn't be offlined, the
following is displayed on the console:

 pseries-hotplug-mem: Attempting to hot-remove LMB, drc index 803e
 pseries-hotplug-mem: Failed to hot-remove memory at 3e000
 dlpar: Could not handle DLPAR request "memory remove index 0x803e"

Again, the 2 messages prefixed by "pseries-hotplug-mem" are useless, and the
generic DLPAR prefixed message should be enough. Turning the first 2 to the
DEBUG level.

These first 2 changes are mainly triggered by the changes introduced in
drmgr:
https://groups.google.com/g/powerpc-utils-devel/c/Y6ef4NB3EzM/m/9cu5JHRxAQAJ

Also, when adding a bunch of LMBs, a message is displayed in the console per LMB
like these ones:
 pseries-hotplug-mem: Memory at 7e000 (drc index 807e) was hot-added
 pseries-hotplug-mem: Memory at 7f000 (drc index 807f) was hot-added
 pseries-hotplug-mem: Memory at 8 (drc index 8080) was hot-added
 pseries-hotplug-mem: Memory at 81000 (drc index 8081) was hot-added

When adding 1TB of memory and LMB size is 256MB, this leads to 4096
messages to be displayed on the console. These messages are not really
helpful for the end user, so moving them to the DEBUG level.

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 7efe6ec5d14a..8377f1f7c78e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -479,7 +479,7 @@ static int dlpar_memory_remove_by_index(u32 drc_index)
int lmb_found;
int rc;
 
-   pr_info("Attempting to hot-remove LMB, drc index %x\n", drc_index);
+   pr_debug("Attempting to hot-remove LMB, drc index %x\n", drc_index);
 
lmb_found = 0;
for_each_drmem_lmb(lmb) {
@@ -497,10 +497,10 @@ static int dlpar_memory_remove_by_index(u32 drc_index)
rc = -EINVAL;
 
if (rc)
-   pr_info("Failed to hot-remove memory at %llx\n",
-   lmb->base_addr);
+   pr_debug("Failed to hot-remove memory at %llx\n",
+lmb->base_addr);
else
-   pr_info("Memory at %llx was hot-removed\n", lmb->base_addr);
+   pr_debug("Memory at %llx was hot-removed\n", lmb->base_addr);
 
return rc;
 }
@@ -717,8 +717,8 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
if (!drmem_lmb_reserved(lmb))
continue;
 
-   pr_info("Memory at %llx (drc index %x) was hot-added\n",
-   lmb->base_addr, lmb->drc_index);
+   pr_debug("Memory at %llx (drc index %x) was hot-added\n",
+lmb->base_addr, lmb->drc_index);
drmem_remove_lmb_reservation(lmb);
}
rc = 0;
-- 
2.29.2



Re: [PATCH] powerpc/hotplug: assign hot added LMB to the right node

2020-12-04 Thread Laurent Dufour

Le 03/12/2020 à 19:25, Greg KH a écrit :

On Thu, Dec 03, 2020 at 11:15:14AM +0100, Laurent Dufour wrote:

This patch applies to 5.9 and earlier kernels only.

Since 5.10, this has been fortunately fixed by the commit
e5e179aa3a39 ("pseries/drmem: don't cache node id in drmem_lmb struct").


Why can't we just backport that patch instead?  It's almost always
better to do that than to have a one-off patch, as almost always those
have bugs in them.


That's a good option too.
I was thinking that this 5.10 patch was not matching the stable release's 
guidelines since it was targeting performance issue, but since it is also fixing 
this issue, I'm certainly wrong.


So, forget that patch.

Thanks,
Laurent.


[PATCH] powerpc/hotplug: assign hot added LMB to the right node

2020-12-03 Thread Laurent Dufour
This patch applies to 5.9 and earlier kernels only.

Since 5.10, this has been fortunately fixed by the commit
e5e179aa3a39 ("pseries/drmem: don't cache node id in drmem_lmb struct").

When LMBs are added to a running system, the node id assigned to the LMB is
fetched from the temporary DT node provided by the hypervisor.

However, added LMBs are always assigned to the first online node. This is a
mistake, and it happens because hot_add_drconf_scn_to_nid(), called by
lmb_set_nid(), checks for the LMB flag DRCONF_MEM_ASSIGNED, which is
set later in dlpar_add_lmb().

To fix this issue, simply set that flag earlier in dlpar_add_lmb().

Note: this code has been rewritten in 5.10, and thus this fix is meaningless
since that version.

Signed-off-by: Laurent Dufour 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nathan Lynch 
Cc: Scott Cheloha 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sta...@vger.kernel.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index e54dcbd04b2f..92d83915c629 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -663,12 +663,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
return rc;
}
 
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
lmb_set_nid(lmb);
block_sz = memory_block_size_bytes();
 
/* Add the memory */
rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
if (rc) {
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
invalidate_lmb_associativity_index(lmb);
return rc;
}
@@ -676,10 +678,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
rc = dlpar_online_lmb(lmb);
if (rc) {
__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
}
 
return rc;
-- 
2.29.2



Re: [PATCH] x86/mpx: fix recursive munmap() corruption

2020-11-04 Thread Laurent Dufour

Le 03/11/2020 à 22:08, Dmitry Safonov a écrit :

Hi Laurent, Christophe, Michael, all,

On 11/3/20 5:11 PM, Laurent Dufour wrote:

Le 23/10/2020 à 14:28, Christophe Leroy a écrit :

[..]

That seems like it would work for CRIU and make sense in general?


Sorry for the late answer, yes this would make more sense.

Here is a patch doing that.



In your patch, the test seems overkill:

+    if ((start <= vdso_base && vdso_end <= end) ||  /* 1   */
+    (vdso_base <= start && start < vdso_end) || /* 3,4 */
+    (vdso_base < end && end <= vdso_end))   /* 2,3 */
+    mm->context.vdso_base = mm->context.vdso_end = 0;

What about

  if (start < vdso_end && vdso_start < end)
  mm->context.vdso_base = mm->context.vdso_end = 0;

This should cover all cases, or am I missing something ?


And do we really need to store vdso_end in the context ?
I think it should be possible to re-calculate it: the size of the VDSO
should be (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO,
and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO.


Thanks Christophe for the advice.

That covers all the cases, and is indeed similar to Michael's
proposal I missed last year.

I'll send a patch fixing this issue following your proposal.


It's probably not necessary anymore. I've sent patches [1], currently in
akpm, the last one forbids splitting of vm_special_mapping.
So, a user is able munmap() or mremap() vdso as a whole, but not partly.


Hi Dmitry,

That's a good thing too, but I think my patch is still valid in the PowerPC 
code, fixing a bad check, even if some corner cases are handled earlier in the code.



[1]:
https://lore.kernel.org/linux-mm/20201013013416.390574-1-d...@arista.com/

Thanks,
   Dmitry





[PATCH] powerpc/vdso: Fix VDSO unmap check

2020-11-03 Thread Laurent Dufour
The check introduced by the commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO
remap") is wrong and is missing some partial unmaps of the VDSO.

To be complete, the check needs both the base and the end address of the
VDSO. Currently only the base is available in the mm_context of a task, but
the end address can easily be computed because the size of the VDSO is
constant. However, there are two sizes, for 32-bit and 64-bit tasks, and they
are stored in static variables in arch/powerpc/kernel/vdso.c.

Exporting a new function called vdso_pages() to get the number of pages of
the VDSO based on the static variables from arch/powerpc/kernel/vdso.c.

Fixes: 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap")

Signed-off-by: Laurent Dufour 
Reported-by: Thomas Gleixner 
Suggested-by: Christophe Leroy 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
---
 arch/powerpc/include/asm/mmu_context.h | 18 --
 arch/powerpc/kernel/vdso.c | 14 ++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index e02aa793420b..ced80897b7a1 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -259,11 +259,25 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
 
 extern void arch_exit_mmap(struct mm_struct *mm);
 
+extern int vdso_pages(void);
 static inline void arch_unmap(struct mm_struct *mm,
  unsigned long start, unsigned long end)
 {
-   if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
-   mm->context.vdso_base = 0;
+   unsigned long vdso_end;
+
+   if (mm->context.vdso_base) {
+   /*
+* case 1   >  | VDSO|  <
+* case 2   >  |   < |
+* case 3  |  >< |
+* case 4  |  >  |  <
+*/
+   vdso_end = mm->context.vdso_base;
+   vdso_end += vdso_pages() << PAGE_SHIFT;
+
+   if (start < vdso_end && mm->context.vdso_base < end)
+   mm->context.vdso_base = 0;
+   }
 }
 
 #ifdef CONFIG_PPC_MEM_KEYS
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 8dad44262e75..9defa35a1eba 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -117,6 +117,20 @@ struct lib64_elfinfo
unsigned long   text;
 };
 
+/*
+ * Return the number of pages of the VDSO for the current task.
+ */
+int vdso_pages(void)
+{
+   int vdso_pages = vdso32_pages;
+
+#ifdef CONFIG_PPC64
+   if (!is_32bit_task())
+   vdso_pages = vdso64_pages;
+#endif
+
+   return vdso_pages + 1; /* Add the data page */
+}
 
 /*
  * This is called from binfmt_elf, we create the special vma for the
-- 
2.29.2



Re: [PATCH] x86/mpx: fix recursive munmap() corruption

2020-11-03 Thread Laurent Dufour

Le 23/10/2020 à 14:28, Christophe Leroy a écrit :

Hi Laurent

Le 07/05/2019 à 18:35, Laurent Dufour a écrit :

Le 01/05/2019 à 12:32, Michael Ellerman a écrit :

Laurent Dufour  writes:

Le 23/04/2019 à 18:04, Dave Hansen a écrit :

On 4/23/19 4:16 AM, Laurent Dufour wrote:

...

There are 2 assumptions here:
   1. 'start' and 'end' are page aligned (this is guaranteed by 
__do_munmap().
   2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store 
on powerpc)


Are you sure about #2?  The 'vdso64_pages' variable seems rather
unnecessary if the VDSO is only 1 page. ;)


Hum, not so sure now ;)
I got confused, only the header is one page.
The test is working as a best effort, and don't cover the case where
only few pages inside the VDSO are unmmapped (start >
mm->context.vdso_base). This is not what CRIU is doing and so this was
enough for CRIU support.

Michael, do you think there is a need to manage all the possibility
here, since the only user is CRIU and unmapping the VDSO is not a so
good idea for other processes ?


Couldn't we implement the semantic that if any part of the VDSO is
unmapped then vdso_base is set to zero? That should be fairly easy, eg:

if (start < vdso_end && end >= mm->context.vdso_base)
    mm->context.vdso_base = 0;


We might need to add vdso_end to the mm->context, but that should be OK.

That seems like it would work for CRIU and make sense in general?


Sorry for the late answer, yes this would make more sense.

Here is a patch doing that.



In your patch, the test seems overkill:

+    if ((start <= vdso_base && vdso_end <= end) ||  /* 1   */
+    (vdso_base <= start && start < vdso_end) || /* 3,4 */
+    (vdso_base < end && end <= vdso_end))   /* 2,3 */
+    mm->context.vdso_base = mm->context.vdso_end = 0;

What about

 if (start < vdso_end && vdso_start < end)
     mm->context.vdso_base = mm->context.vdso_end = 0;

This should cover all cases, or am I missing something ?


And do we really need to store vdso_end in the context ?
I think it should be possible to re-calculate it: the size of the VDSO should be 
(&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO, and 
(&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO.


Thanks Christophe for the advice.

That covers all the cases, and is indeed similar to Michael's proposal 
I missed last year.


I'll send a patch fixing this issue following your proposal.
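
As a quick sanity check of that single predicate (standalone user-space
illustration, the VDSO range and addresses are made up for the example):

    #include <assert.h>

    /* non-zero iff [start, end) intersects [vdso_base, vdso_end) */
    static int hits_vdso(unsigned long start, unsigned long end,
                         unsigned long vdso_base, unsigned long vdso_end)
    {
        return start < vdso_end && vdso_base < end;
    }

    int main(void)
    {
        /* pretend the VDSO sits at [0x100, 0x200) */
        assert(hits_vdso(0x000, 0x300, 0x100, 0x200));  /* unmap covers it     */
        assert(hits_vdso(0x000, 0x180, 0x100, 0x200));  /* overlaps its start  */
        assert(hits_vdso(0x120, 0x180, 0x100, 0x200));  /* entirely inside     */
        assert(hits_vdso(0x180, 0x300, 0x100, 0x200));  /* overlaps its end    */
        assert(!hits_vdso(0x000, 0x100, 0x100, 0x200)); /* ends right below    */
        assert(!hits_vdso(0x200, 0x300, 0x100, 0x200)); /* starts right above  */
        return 0;
    }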

Cheers,
Laurent.


Re: [PATCH v4] pseries/hotplug-memory: hot-add: skip redundant LMB lookup

2020-09-17 Thread Laurent Dufour
 of the drconf range, hence the
smaller speedup.

Signed-off-by: Scott Cheloha 


Reviewed-by: Laurent Dufour 


---
Changelog:

v1: 
https://lore.kernel.org/linuxppc-dev/20200910175637.2865160-1-chel...@linux.ibm.com/

v2:
- Move prototype for of_drconf_to_nid_single() to topology.h.
   Requested by Michael Ellerman.

v3:
- Send the right patch.  v2 is from the wrong branch, my mistake.

v4:
- Fix checkpatch.pl warnings.  Reported by Laurent Dufour.

  arch/powerpc/include/asm/topology.h | 3 +++
  arch/powerpc/mm/numa.c  | 2 +-
  arch/powerpc/platforms/pseries/hotplug-memory.c | 6 --
  3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..ae19b19f9d44 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -86,6 +86,9 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 
*cpu2_assoc)

  #endif /* CONFIG_NUMA */

+struct drmem_lmb;
+int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+
  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
  extern int find_and_online_cpu_nid(int cpu);
  #else
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..63507b47164d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -430,7 +430,7 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
   * This is like of_node_to_nid_single() for memory represented in the
   * ibm,dynamic-reconfiguration-memory node.
   */
-static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
+int of_drconf_to_nid_single(struct drmem_lmb *lmb)
  {
struct assoc_arrays aa = { .arrays = NULL };
int default_nid = NUMA_NO_NODE;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac4..9a533acf8ad0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -611,8 +611,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)

block_sz = memory_block_size_bytes();

-   /* Find the node id for this address. */
-   nid = memory_add_physaddr_to_nid(lmb->base_addr);
+   /* Find the node id for this LMB.  Fake one if necessary. */
+   nid = of_drconf_to_nid_single(lmb);
+   if (nid < 0 || !node_possible(nid))
+   nid = first_online_node;

/* Add the memory */
rc = __add_memory(nid, lmb->base_addr, block_sz);





Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup

2020-09-16 Thread Laurent Dufour

Le 15/09/2020 à 21:46, Scott Cheloha a écrit :

During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().

This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid.  In dlpar_add_lmb() we get this address from the LMB itself.

In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.

If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup.  The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.

Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.

Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:

Unpatched (12450 seconds):
Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 
126000 LMB(s)
[...]
Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 
(drc index 8002) was hot-added

Patched (7065 seconds):
Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 
126000 LMB(s)
[...]
Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 
(drc index 8002) was hot-added

It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range.  This is because we
are skipping a linear LMB search.

To see the distinction, consider a smaller hot-add test on the same
LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:

Unpatched:
  Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

 104,753.42 msec task-clock#0.992 CPUs utilized 
   ( +-  0.55% )
  4,708  context-switches  #0.045 K/sec 
   ( +-  0.69% )
  2,444  cpu-migrations#0.023 K/sec 
   ( +-  1.25% )
394  page-faults   #0.004 K/sec 
   ( +-  0.22% )
445,902,503,057  cycles#4.257 GHz   
   ( +-  0.55% )  (66.67%)
  8,558,376,740  stalled-cycles-frontend   #1.92% frontend cycles 
idle ( +-  0.88% )  (49.99%)
300,346,181,651  stalled-cycles-backend#   67.36% backend cycles 
idle  ( +-  0.76% )  (50.01%)
258,091,488,691  instructions  #0.58  insn per cycle
   #1.16  stalled cycles 
per insn  ( +-  0.22% )  (66.67%)
 70,568,169,256  branches  #  673.660 M/sec 
   ( +-  0.17% )  (50.01%)
  3,100,725,426  branch-misses #4.39% of all branches   
   ( +-  0.20% )  (49.99%)

105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )

Patched:
  Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

 104,055.69 msec task-clock#0.993 CPUs utilized 
   ( +-  0.32% )
  4,606  context-switches  #0.044 K/sec 
   ( +-  0.20% )
  2,463  cpu-migrations#0.024 K/sec 
   ( +-  0.93% )
394  page-faults   #0.004 K/sec 
   ( +-  0.25% )
442,951,129,921  cycles#4.257 GHz   
   ( +-  0.32% )  (66.66%)
  8,710,413,329  stalled-cycles-frontend   #1.97% frontend cycles 
idle ( +-  0.47% )  (50.06%)
299,656,905,836  stalled-cycles-backend#   67.65% backend cycles 
idle  ( +-  0.39% )  (50.02%)
252,731,168,193  instructions  #0.57  insn per cycle
   #1.19  stalled cycles 
per insn  ( +-  0.20% )  (66.66%)
 68,902,851,121  branches  #  662.173 M/sec 
   ( +-  0.13% )  (49.94%)
  3,100,242,882  branch-misses #4.50% of all branches   
   ( +-  0.15% )  (49.98%)

104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )

This is consistent.  An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first.  On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the 

Re: [PATCH] mm: check for memory's node later during boot

2020-09-08 Thread Laurent Dufour

Le 03/09/2020 à 23:35, Andrew Morton a écrit :

On Wed,  2 Sep 2020 11:09:11 +0200 Laurent Dufour  wrote:


register_mem_sect_under_node() is checking the memory block's node id only
if the system state is "SYSTEM_BOOTING". On PowerPC, the memory blocks are
registered while the system state is "SYSTEM_SCHEDULING", the one before
SYSTEM_RUNNING.

The consequence, on a PowerPC guest with interleaved memory node ranges, is
that some memory blocks can be assigned to multiple nodes in sysfs. This
later prevents some memory hot-plug and hot-unplug operations from
succeeding because links remain. Such a panic is then displayed:

[ cut here ]
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul 
binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP:  c0403f34 LR: c0403f2c CTR: 
REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
MSR:  8282b033   CR: 24000448  XER: 
2004
CFAR: c0846d20 IRQMASK: 0
GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
GPR04: 0227 c004805ae680  0004886f
GPR08: 0226 0003 0002 fffd
GPR12: 88000484 c0001ec96280  
GPR16:   0004 0003
GPR20: c0047814ffe0 c0077c08 0010 c13332c8
GPR24:  c11f6cc0  
GPR28: ffef 0001 00015000 1000
NIP [c0403f34] add_memory_resource+0x244/0x340
LR [c0403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 
(unreliable)
[c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
[c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
[c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
[c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
[c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
[c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
[c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
[c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
[c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
[c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
[c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
---[ end trace 562fd6c109cd0fb2 ]---

To prevent these multiple links, perform the node check for all states
prior to SYSTEM_RUNNING.


Did you consider adding a cc:stable to this fix?


I should have, but now I have to review the fix based on David's comment.



[PATCH] mm: check for memory's node later during boot

2020-09-02 Thread Laurent Dufour
register_mem_sect_under_node() is checking the memory block's node id only
if the system state is "SYSTEM_BOOTING". On PowerPC, the memory blocks are
registered while the system state is "SYSTEM_SCHEDULING", the one before
SYSTEM_RUNNING.

The consequence, on a PowerPC guest with interleaved memory node ranges, is
that some memory blocks can be assigned to multiple nodes in sysfs. This
later prevents some memory hot-plug and hot-unplug operations from
succeeding because links remain. Such a panic is then displayed:

[ cut here ]
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul 
binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP:  c0403f34 LR: c0403f2c CTR: 
REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
MSR:  8282b033   CR: 24000448  XER: 
2004
CFAR: c0846d20 IRQMASK: 0
GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
GPR04: 0227 c004805ae680  0004886f
GPR08: 0226 0003 0002 fffd
GPR12: 88000484 c0001ec96280  
GPR16:   0004 0003
GPR20: c0047814ffe0 c0077c08 0010 c13332c8
GPR24:  c11f6cc0  
GPR28: ffef 0001 00015000 1000
NIP [c0403f34] add_memory_resource+0x244/0x340
LR [c0403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 
(unreliable)
[c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
[c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
[c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
[c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
[c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
[c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
[c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
[c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
[c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
[c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
[c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
---[ end trace 562fd6c109cd0fb2 ]---

To prevent these multiple links, perform the node check for all states
prior to SYSTEM_RUNNING.
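
For reference, the ordering relied upon here comes from the system_states
enum (excerpt from include/linux/kernel.h at the time):

    enum system_states {
        SYSTEM_BOOTING,
        SYSTEM_SCHEDULING,
        SYSTEM_RUNNING,
        /* ... */
    };

so "system_state < SYSTEM_RUNNING" covers both SYSTEM_BOOTING and
SYSTEM_SCHEDULING, while the previous "== SYSTEM_BOOTING" test skipped the
per-pfn node check for blocks registered during SYSTEM_SCHEDULING.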

Signed-off-by: Laurent Dufour 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Andrew Morton 
Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() 
a callback of walk_memory_range()")
---
 drivers/base/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 508b80f6329b..8e9f39b562ef 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -789,7 +789,7 @@ static int register_mem_sect_under_node(struct memory_block 
*mem_blk,
 * case, during hotplug we know that all pages in the memory
 * block belong to the same node.
 */
-   if (system_state == SYSTEM_BOOTING) {
+   if (system_state < SYSTEM_RUNNING) {
page_nid = get_nid_for_pfn(pfn);
if (page_nid < 0)
continue;
-- 
2.28.0



Re: [PATCHv5 1/2] powerpc/pseries: group lmb operation and memblock's

2020-08-27 Thread Laurent Dufour

Le 10/08/2020 à 10:52, Pingfan Liu a écrit :

This patch prepares for the incoming patch which swaps the order of
KOBJ_ADD/REMOVE uevent and dt's updating.

The dt updating should come after lmb operations, and before
__remove_memory()/__add_memory().  Accordingly, grouping all lmb operations
before the memblock's.


I can't find the link between this commit description and the code's changes 
below.



Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: Laurent Dufour 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v4 -> v5: fix the miss of clearing DRCONF_MEM_ASSIGNED in a failure path
  arch/powerpc/platforms/pseries/hotplug-memory.c | 28 +
  1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..46cbcd1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
  static int dlpar_remove_lmb(struct drmem_lmb *lmb)
  {
unsigned long block_sz;
-   int rc;
+   phys_addr_t base_addr;
+   int rc, nid;
  
  	if (!lmb_is_removable(lmb))

return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
if (rc)
return rc;
  
+	base_addr = lmb->base_addr;

+   nid = lmb->nid;
block_sz = pseries_memory_block_size();
  
-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);

-
-   /* Update memory regions for memory remove */
-   memblock_remove(lmb->base_addr, block_sz);
-
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
  
+	__remove_memory(nid, base_addr, block_sz);

+
+   /* Update memory regions for memory remove */
+   memblock_remove(base_addr, block_sz);
+
return 0;
  }
  
@@ -603,22 +606,29 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)

}
  
  	lmb_set_nid(lmb);

+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();
  
  	/* Add the memory */

rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
if (rc) {
invalidate_lmb_associativity_index(lmb);
+   lmb_clear_nid(lmb);
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
return rc;
}
  
  	rc = dlpar_online_lmb(lmb);

if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}
  
  	return rc;






Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-27 Thread Laurent Dufour

Le 10/08/2020 à 10:52, Pingfan Liu a écrit :

A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
  Checking for memory holes : [  0.0 %] /   
Checking for memory holes : [100.0 %] | 
  Excluding unnecessary pages   : [100.0 %] \   
Copying data  : [  0.3 %] - 
 eta: 38s[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 
psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 
psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 
7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   
$CORE_COLLECTOR /proc/vmcore 
$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
   After analysis, it turns out that in the current implementation,
when hot-removing an LMB, the KOBJ_REMOVE event is emitted before the device
tree is updated, as __remove_memory() is called before drmem_update_dt().
So in the kdump kernel, read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install an HPTE, but it fails with -2 due to a
non-existent pfn. Finally, low_hash_fault() raises SIGBUS to the process,
which is observed as "Bus error"

 From a viewpoint of listener and publisher, the publisher notifies the
listener before data is ready.  This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
updating. And in capture kernel, makedumpfile will access the memory based
on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.

* Fix *
This bug is introduced by commit 063b8b1251fd
("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
request"), which tried to combine all the dt updating into one.

To fix this issue, meanwhile not to introduce a quadratic runtime
complexity by the model:
   dlpar_memory_add_by_count
 for_each_drmem_lmb <--
   dlpar_add_lmb
 drmem_update_dt(_v1|_v2)
   for_each_drmem_lmb   <--
The dt should still be only updated once, and just before the last memory
online/offline event is ejected to user space. Achieve this by tracing the
num of lmb added or removed.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: Laurent Dufour 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
   whether dt is updated successfully.
   Fix a condition boundary check bug
v3 -> v4: resolve a quadratic runtime complexity issue.
   This series is applied on next-test branch
  arch/powerpc/platforms/pseries/hotplug-memory.c | 102 +++-
  1 file changed, 80 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 46cbcd1..1567d9f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
return true;
  }
  
-static int dlpar_add_lmb(struct drmem_lmb *);

+enum dt_update_status {
+   DT_NOUPDATE,
+   DT_TOUPDATE,
+   DT_UPDATED,
+};
+
+/* "*dt_update" returns DT_UPDATED if updated */
+static int dlpar_add_lmb(struct drmem_lmb *lmb,
+   enum dt_update_status *dt_update);
  
-static int dlpar_remove_lmb(struct drmem_lmb *lmb)

+static int dlpar_remove_lmb(struct drmem_lmb *lmb,
+   enum dt_update_status *dt_update)
  {
unsigned long block_sz;
phys_addr_t base_addr;
-   int rc, nid;
+   int rc, ret, nid;
  
  	if (!lmb_is_removable(lmb))

return -EINVAL;
@@ -372,6 +381,13 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   if (*dt_update) {


That test is wrong, you should do:
if (*dt_update && *dt_update == DT_TOUPDATE) {

With the current code, the device tree is updated all the time.

Another option would be to pass a valid poin

Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct

2020-08-21 Thread Laurent Dufour

Le 11/08/2020 à 03:51, Scott Cheloha a écrit :

At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.
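
In other words, the remove path can obtain the node id along these lines
(simplified sketch, not the complete patch):

    struct memory_block *mem_block;

    mem_block = lmb_to_memblock(lmb);   /* find_memory_block() under the hood */
    if (!mem_block)
        return -EINVAL;
    /* ... offline the LMB ... */
    __remove_memory(mem_block->nid, lmb->base_addr, block_sz);
    put_device(&mem_block->dev);        /* drop the reference taken by the lookup */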

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c00a4980 LR: c00a4940 CTR: 
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  82009033   CR: 44000248  
XER: 000d
[   80.604431] CFAR: c00a4a38 IRQMASK: 0
[   80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 
c0003cfede30
[   80.604431] GPR04:  c0f4095a 002f 
1000
[   80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 
c00c0002fdfb2001
[   80.604431] GPR12:  c0001e8ec200
[   80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c00a4940] 
hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c0087c10] 
memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c10c4aa0] 
kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c000b648] 
ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 
7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 
7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Signed-off-by: Scott Cheloha 
---
v1:
  - RFC

v2:
  - Adjusted commit message.
  - Miscellaneous cleanup.

v3:
  - Correct issue found by Laurent Dufour :
- Add missing put_device() call in dlpar_remove_lmb() for the
  lmb's associated mem_block.

  arch/powerpc/include/asm/drmem.h  | 21 
  arch/powerpc/mm/drmem.c   |  6 +
  .../platforms/pseries/hotplug-memory.c| 24 ---
  3 files changed, 17 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..34e4e9b257f5 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -13,9 +13,6 @@ struct drmem_lmb {
u32 drc_index;
u32 aa_index;
u32 flags;
-#ifdef CONFIG_MEMORY_HOTPLUG
-   int nid;
-#endif
  };

  struct drmem_lmb_info {
@@ -104,22 +101,4 @@ static inline void 
invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
lmb->aa_index = 0x;
  }

-#ifdef CONFIG_MEMORY_HOTPLUG
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr);
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = -1;
-}
-#else
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-}
-#endif
-
  #endif /* _ASM_POWERPC_LMB_H */
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327cefbc6a..873fcfc7b875 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
if (!drmem_info->lmbs)
return;

-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb(lmb)
read_drconf_v1_cell(lmb, );
-   lmb_set_nid(lmb);
-   }
  }

  static void __init init_drmem_v2_lmbs(const __be32 *prop)
@@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)

lmb->aa_

[PATCH v2] powerpc/drmem: Don't compute the NUMA node for each LMB

2020-08-05 Thread Laurent Dufour
All the LMBs from the same set of the ibm,dynamic-memory-v2 property share
the same NUMA node. Don't compute that node for each one.

Tested on a system with 1022 LMBs spread over 4 NUMA nodes: only 4 calls to
lmb_set_nid() were made instead of 1022.

This should prevent some soft lockups when starting large guests.

The code is only meaningful when CONFIG_MEMORY_HOTPLUG is set; otherwise the
nid field is not present in the drmem_lmb structure.
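
For reference, each cell of the ibm,dynamic-memory-v2 property describes a
whole set of consecutive LMBs sharing a single associativity index, which is
why one node lookup per set is enough (layout as declared in
arch/powerpc/include/asm/drmem.h):

    struct of_drconf_cell_v2 {
        u32 seq_lmbs;   /* number of LMBs in the set */
        u64 base_addr;  /* base address of the first LMB */
        u32 drc_index;  /* DRC index of the first LMB */
        u32 aa_index;   /* associativity index, shared by the whole set */
        u32 flags;      /* flags, shared by the whole set */
    } __packed;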

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/mm/drmem.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index b2eeea39684c..c11b6ec99ea3 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -402,6 +402,9 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
const __be32 *p;
u32 i, j, lmb_sets;
int lmb_index;
+#ifdef CONFIG_MEMORY_HOTPLUG
+   struct drmem_lmb *first = NULL;
+#endif
 
lmb_sets = of_read_number(prop++, 1);
if (lmb_sets == 0)
@@ -426,6 +429,15 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
for (i = 0; i < lmb_sets; i++) {
read_drconf_v2_cell(_cell, );
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+   /*
+* Fetch the NUMA node id for the first set or if the
+* associativity index is different from the previous set.
+*/
+   if (first && dr_cell.aa_index != first->aa_index)
+   first = NULL;
+#endif
+
for (j = 0; j < dr_cell.seq_lmbs; j++) {
lmb = _info->lmbs[lmb_index++];
 
@@ -438,7 +450,18 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
lmb->aa_index = dr_cell.aa_index;
lmb->flags = dr_cell.flags;
 
-   lmb_set_nid(lmb);
+#ifdef CONFIG_MEMORY_HOTPLUG
+   /*
+* All the LMB in the set share the same NUMA
+* associativity property. So read that node only once.
+*/
+   if (!first) {
+   lmb_set_nid(lmb);
+   first = lmb;
+   } else {
+   lmb->nid = first->nid;
+   }
+#endif
}
}
 }
-- 
2.28.0



Re: [PATCH] powerpc/drmem: Don't compute the NUMA node for each LMB

2020-08-05 Thread Laurent Dufour

Le 05/08/2020 à 12:43, kernel test robot a écrit :

Hi Laurent,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on linux/master linus/master v5.8 next-20200804]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Laurent-Dufour/powerpc-drmem-Don-t-compute-the-NUMA-node-for-each-LMB/20200805-173213
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-mpc885_ads_defconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=powerpc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

arch/powerpc/mm/drmem.c: In function 'init_drmem_v2_lmbs':

arch/powerpc/mm/drmem.c:457:8: error: 'struct drmem_lmb' has no member named 
'nid'

  457 | lmb->nid = first->nid;
  |^~
arch/powerpc/mm/drmem.c:457:21: error: 'struct drmem_lmb' has no member 
named 'nid'
  457 | lmb->nid = first->nid;
  | ^~


My mistake, the nid member is only present when CONFIG_MEMORY_HOTPLUG is set.

I'll send a new version fixing this.



vim +457 arch/powerpc/mm/drmem.c

397 
398 static void __init init_drmem_v2_lmbs(const __be32 *prop)
399 {
400 struct drmem_lmb *lmb, *first;
401 struct of_drconf_cell_v2 dr_cell;
402 const __be32 *p;
403 u32 i, j, lmb_sets;
404 int lmb_index;
405 
406 lmb_sets = of_read_number(prop++, 1);
407 if (lmb_sets == 0)
408 return;
409 
410 /* first pass, calculate the number of LMBs */
411 p = prop;
412 for (i = 0; i < lmb_sets; i++) {
413 read_drconf_v2_cell(_cell, );
414 drmem_info->n_lmbs += dr_cell.seq_lmbs;
415 }
416 
417 drmem_info->lmbs = kcalloc(drmem_info->n_lmbs, sizeof(*lmb),
418GFP_KERNEL);
419 if (!drmem_info->lmbs)
420 return;
421 
422 /* second pass, read in the LMB information */
423 lmb_index = 0;
424 p = prop;
425 first = NULL;
426 
427 for (i = 0; i < lmb_sets; i++) {
428 read_drconf_v2_cell(_cell, );
429 
430 /*
431  * Fetch the NUMA node id for the fist set or if the
432  * associativity index is different from the previous 
set.
433  */
434 if (first && dr_cell.aa_index != first->aa_index)
435 first = NULL;
436 
437 for (j = 0; j < dr_cell.seq_lmbs; j++) {
438 lmb = _info->lmbs[lmb_index++];
439 
440 lmb->base_addr = dr_cell.base_addr;
441 dr_cell.base_addr += drmem_info->lmb_size;
442 
443 lmb->drc_index = dr_cell.drc_index;
444 dr_cell.drc_index++;
445 
446 lmb->aa_index = dr_cell.aa_index;
447 lmb->flags = dr_cell.flags;
448 
449 /*
450  * All the LMB in the set share the same NUMA
451  * associativity property. So read that node 
only once.
452  */
453 if (!first) {
454 lmb_set_nid(lmb);
455 first = lmb;
456 } else {
  > 457  lmb->nid = first->nid;
458 }
459 }
460 }
461 }
462 

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org





[PATCH] powerpc/drmem: Don't compute the NUMA node for each LMB

2020-08-05 Thread Laurent Dufour
All the LMBs from the same set of the ibm,dynamic-memory-v2 property share
the same NUMA node. Don't compute that node for each one.

Tested on a system with 1022 LMBs spread over 4 NUMA nodes: only 4 calls to
lmb_set_nid() were made instead of 1022.

This should prevent some soft lockups when starting large guests.

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/mm/drmem.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index b2eeea39684c..3819c523c65b 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -397,7 +397,7 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
 
 static void __init init_drmem_v2_lmbs(const __be32 *prop)
 {
-   struct drmem_lmb *lmb;
+   struct drmem_lmb *lmb, *first;
struct of_drconf_cell_v2 dr_cell;
const __be32 *p;
u32 i, j, lmb_sets;
@@ -422,10 +422,18 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
/* second pass, read in the LMB information */
lmb_index = 0;
p = prop;
+   first = NULL;
 
for (i = 0; i < lmb_sets; i++) {
read_drconf_v2_cell(_cell, );
 
+   /*
+* Fetch the NUMA node id for the first set or if the
+* associativity index is different from the previous set.
+*/
+   if (first && dr_cell.aa_index != first->aa_index)
+   first = NULL;
+
for (j = 0; j < dr_cell.seq_lmbs; j++) {
lmb = _info->lmbs[lmb_index++];
 
@@ -438,7 +446,16 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
lmb->aa_index = dr_cell.aa_index;
lmb->flags = dr_cell.flags;
 
-   lmb_set_nid(lmb);
+   /*
+* All the LMB in the set share the same NUMA
+* associativity property. So read that node only once.
+*/
+   if (!first) {
+   lmb_set_nid(lmb);
+   first = lmb;
+   } else {
+   lmb->nid = first->nid;
+   }
}
}
 }
-- 
2.28.0



Re: [PATCHv4 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-03 Thread Laurent Dufour

Le 30/07/2020 à 15:33, Pingfan Liu a écrit :

A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
  Checking for memory holes : [  0.0 %] /   
Checking for memory holes : [100.0 %] | 
  Excluding unnecessary pages   : [100.0 %] \   
Copying data  : [  0.3 %] - 
 eta: 38s[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 
psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 
psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 
7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   
$CORE_COLLECTOR /proc/vmcore 
$_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
   After analysis, it turns out that in the current implementation,
when hot-removing an LMB, the KOBJ_REMOVE event is emitted before the device
tree is updated, as __remove_memory() is called before drmem_update_dt().
So in the kdump kernel, read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install an HPTE, but it fails with -2 due to a
non-existent pfn. Finally, low_hash_fault() raises SIGBUS to the process,
which is observed as "Bus error"

 From a viewpoint of listener and publisher, the publisher notifies the
listener before data is ready.  This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
updating. And in capture kernel, makedumpfile will access the memory based
on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.

* Fix *
This bug is introduced by commit 063b8b1251fd
("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
request"), which tried to combine all the dt updating into one.

To fix this issue, meanwhile not to introduce a quadratic runtime
complexity by the model:
   dlpar_memory_add_by_count
 for_each_drmem_lmb <--
   dlpar_add_lmb
 drmem_update_dt(_v1|_v2)
   for_each_drmem_lmb   <--
The dt should still be only updated once, and just before the last memory
online/offline event is ejected to user space. Achieve this by tracing the
num of lmb added or removed.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
v3 -> v4: resolve a quadratic runtime complexity issue.
   This series is applied on next-test branch
  arch/powerpc/platforms/pseries/hotplug-memory.c | 88 ++---
  1 file changed, 66 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1a3ac3b..e07d5b1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -350,13 +350,13 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
return true;
  }
  
-static int dlpar_add_lmb(struct drmem_lmb *);

+static int dlpar_add_lmb(struct drmem_lmb *lmb, bool dt_update);
  
-static int dlpar_remove_lmb(struct drmem_lmb *lmb)

+static int dlpar_remove_lmb(struct drmem_lmb *lmb, bool dt_update)
  {
unsigned long block_sz;
phys_addr_t base_addr;
-   int rc, nid;
+   int rc, ret, nid;
  
  	if (!lmb_is_removable(lmb))

return -EINVAL;
@@ -372,6 +372,11 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   if (dt_update) {
+   ret = drmem_update_dt();
+   if (ret)
+   pr_warn("%s fail to update dt, but continue\n", 
__func__);
+   }
  
  	__remove_memory(nid, base_addr, block_sz);
  
@@ -387,6 +392,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)

int lmbs_removed = 0;
int lmbs_available = 0;
int rc;
+   bool dt_update = false;
  
  	pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove);
  
@@ -409,7 +415,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)

}
  
  	for_each_drmem_lmb(lmb) {

-   rc = dlpar_remove_lmb(lmb);
+   

Re: [PATCHv4 1/2] powerpc/pseries: group lmb operation and memblock's

2020-08-03 Thread Laurent Dufour

@@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
}

lmb_set_nid(lmb);
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();

/* Add the memory */


Since the lmb->flags is now set earlier, you should unset it in the case the 
call to __add_memory() fails, something like:


@@ -614,6 +614,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
if (rc) {
invalidate_lmb_associativity_index(lmb);
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
return rc;
}


@@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)

rc = dlpar_online_lmb(lmb);
if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}

return rc;


Re: [PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct

2020-07-29 Thread Laurent Dufour

Hi Scott,

Le 28/07/2020 à 18:53, Scott Cheloha a écrit :

At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c00a4980 LR: c00a4940 CTR: 
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  82009033   CR: 44000248  
XER: 000d
[   80.604431] CFAR: c00a4a38 IRQMASK: 0
[   80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 
c0003cfede30
[   80.604431] GPR04:  c0f4095a 002f 
1000
[   80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 
c00c0002fdfb2001
[   80.604431] GPR12:  c0001e8ec200
[   80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c00a4940] 
hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c0087c10] 
memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c10c4aa0] 
kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c000b648] 
ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 
7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 
7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Signed-off-by: Scott Cheloha 
---
  arch/powerpc/include/asm/drmem.h  | 21 ---
  arch/powerpc/mm/drmem.c   |  6 +-
  .../platforms/pseries/hotplug-memory.c| 19 ++---
  3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..34e4e9b257f5 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -13,9 +13,6 @@ struct drmem_lmb {
u32 drc_index;
u32 aa_index;
u32 flags;
-#ifdef CONFIG_MEMORY_HOTPLUG
-   int nid;
-#endif
  };

  struct drmem_lmb_info {
@@ -104,22 +101,4 @@ static inline void 
invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
lmb->aa_index = 0x;
  }

-#ifdef CONFIG_MEMORY_HOTPLUG
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr);
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = -1;
-}
-#else
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-}
-#endif
-
  #endif /* _ASM_POWERPC_LMB_H */
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327cefbc6a..873fcfc7b875 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
if (!drmem_info->lmbs)
return;

-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb(lmb)
read_drconf_v1_cell(lmb, );
-   lmb_set_nid(lmb);
-   }
  }

  static void __init init_drmem_v2_lmbs(const __be32 *prop)
@@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)

lmb->aa_index = dr_cell.aa_index;
lmb->flags = dr_cell.flags;
-
-   lmb_set_nid(lmb);
}
}
  }
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 

Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-28 Thread Laurent Dufour

Le 28/07/2020 à 19:37, Nathan Lynch a écrit :

The drmem lmb list can have hundreds of thousands of entries, and
unfortunately lookups take the form of linear searches. As long as
this is the case, traversals have the potential to monopolize the CPU
and provoke lockup reports, workqueue stalls, and the like unless
they explicitly yield.

Rather than placing cond_resched() calls within various
for_each_drmem_lmb() loop blocks in the code, put it in the iteration
expression of the loop macro itself so users can't omit it.


Hi Nathan,

Is that not too much to call cond_resched() on every LMB?

Could that be less frequent, every 10, or 100, I don't really know ?
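
Something along these lines, for instance (just a sketch; the name and the
period of 256 are made up for the example):

    #define DRMEM_LMB_RESCHED_PERIOD    256

    static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb,
                                                   const struct drmem_lmb *start)
    {
        ++lmb;
        /* yield at most once every DRMEM_LMB_RESCHED_PERIOD entries */
        if (((lmb - start) % DRMEM_LMB_RESCHED_PERIOD) == 0)
            cond_resched();
        return lmb;
    }

    #define for_each_drmem_lmb_in_range(lmb, start, end)            \
        for ((lmb) = (start); (lmb) < (end); (lmb) = drmem_lmb_next((lmb), (start)))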

Cheers,
Laurent.


Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT 
format")
Signed-off-by: Nathan Lynch 
---
  arch/powerpc/include/asm/drmem.h | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..36d0ed04bda8 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -8,6 +8,8 @@
  #ifndef _ASM_POWERPC_LMB_H
  #define _ASM_POWERPC_LMB_H
  
+#include 

+
  struct drmem_lmb {
u64 base_addr;
u32 drc_index;
@@ -26,8 +28,14 @@ struct drmem_lmb_info {
  
  extern struct drmem_lmb_info *drmem_info;
  
+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)

+{
+   cond_resched();
+   return ++lmb;
+}
+
  #define for_each_drmem_lmb_in_range(lmb, start, end)  \
-   for ((lmb) = (start); (lmb) < (end); (lmb)++)
+   for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
  
  #define for_each_drmem_lmb(lmb)	\

for_each_drmem_lmb_in_range((lmb),  \





[PATCH] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-24 Thread Laurent Dufour
When a secure memslot is dropped, all the pages backed in the secure
device (aka really backed by secure memory by the Ultravisor)
should be paged out to a normal page. Previously, this was
achieved by triggering the page fault mechanism, which calls
kvmppc_svm_page_out() on each page.

This can't work when hot unplugging a memory slot because the memory
slot is flagged as invalid and gfn_to_pfn() is then not trying to access
the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out(), it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot-unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() already holds kvm->arch.uvmem_lock,
__kvmppc_svm_page_out() is called directly.  As
__kvmppc_svm_page_out() needs the vma pointer to migrate the pages,
the VMA is fetched in a lazy way, to avoid calling find_vma() all
the time. In addition, the mmap_sem is held in read mode during
that time, not in write mode, since the virtual memory layout is not
impacted, and kvm->arch.uvmem_lock prevents concurrent operations
on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Ram Pai 
[modified the changelog description]
Signed-off-by: Laurent Dufour 
[modified check on the VMA in kvmppc_uvmem_drop_pages]
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 53 --
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index c772e921f769..5dd3e9acdcab 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -632,35 +632,54 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
 {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   mmap_read_lock(kvm->mm);
+
+   addr = slot->userspace_addr;
+   end = addr + (slot->npages * PAGE_SIZE);
 
-   for (i = free->npages; i; --i, ++gfn) {
-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || addr >= vma->vm_end) {
+   vma = find_vma_intersection(kvm->mm, addr, addr+1);
+   if (!vma) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }
 
mutex_lock(>arch.uvmem_lock);
-   if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, _pfn)) {
+
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, _pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = skip_page_out;
+   pvt->remove_gfn = true;
+
+   if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+ PAGE_SHIFT, kvm, pvt->gpa))
+   pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+  pvt->gpa, addr);
+   } else {
+   /* Remove the shared flag if any */
kvmppc_gfn_remove(gfn, kvm);
-   mutex_unlock(>arch.uvmem_lock);
-   continue;
}
 
-   uvmem_page = pfn_to_page(uvmem_pfn);
-   pvt = uvmem_page->zone_device_data;
-   pvt->skip_page_out = skip_page_out;
-   pvt->remove_gfn = true;
mutex_unlock(>arch.uvmem_lock);
-
-   pfn = gfn_to_pfn(kvm, gfn);
-   if (is_error_noslot_pfn(pfn))
-   continue;
-   kvm_release_pfn_clean(pfn);
}
+
+   mmap_read_unlock(kvm->mm);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0



Re: [PATCH v5 7/7] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-24 Thread Laurent Dufour

Le 24/07/2020 à 05:03, Bharata B Rao a écrit :

On Thu, Jul 23, 2020 at 01:07:24PM -0700, Ram Pai wrote:

From: Laurent Dufour 

When a secure memslot is dropped, all the pages backed in the secure
device (aka really backed by secure memory by the Ultravisor)
should be paged out to a normal page. Previously, this was
achieved by triggering the page fault mechanism which is calling
kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory
slot is flagged as invalid and gfn_to_pfn() is then not trying to access
the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.  As
__kvmppc_svm_page_out needs the vma pointer to migrate the pages,
the VMA is fetched in a lazy way, to not trigger find_vma() all
the time. In addition, the mmap_sem is held in read mode during
that time, not in write mode since the virtual memory layout is not
impacted, and kvm->arch.uvmem_lock prevents concurrent operation
on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Ram Pai 
[modified the changelog description]
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 ++
  1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index c772e92..daffa6e 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -632,35 +632,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
   * fault on them, do fault time migration to replace the device PTEs in
   * QEMU page table with normal PTEs from newly allocated pages.
   */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
  {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   mmap_read_lock(kvm->mm);
+
+   addr = slot->userspace_addr;
+   end = addr + (slot->npages * PAGE_SIZE);
  
-	for (i = free->npages; i; --i, ++gfn) {

-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+   vma = find_vma_intersection(kvm->mm, addr, end);
+   if (!vma ||
+   vma->vm_start > addr || vma->vm_end < end) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }


There is a potential issue with the boundary condition check here
which I discussed with Laurent yesterday. Guess he hasn't gotten around
to look at it yet.


Right, I'm working on that..
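
For reference, the issue is that find_vma_intersection(kvm->mm, addr, end)
combined with the "vma->vm_end < end" check assumes a single VMA covers the
whole memslot, which breaks as soon as the memslot spans several VMAs. A
minimal sketch of the per-page lookup the next revision moves to (this is
what the v5 hunk quoted earlier in the thread does, nothing else assumed):

	/* Look up only the VMA covering the current page, and reuse it
	 * until addr walks past its end. */
	if (!vma || addr >= vma->vm_end) {
		vma = find_vma_intersection(kvm->mm, addr, addr + 1);
		if (!vma) {
			pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
			break;
		}
	}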




Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-23 Thread Laurent Dufour

Le 23/07/2020 à 14:32, Laurent Dufour a écrit :

Le 23/07/2020 à 05:36, Bharata B Rao a écrit :

On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:

When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is held in read mode during that time, not in write
mode since the virtual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
  1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c

index 5a4b02d3f651..ba5c7c77cc3a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -624,35 +624,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,

   * fault on them, do fault time migration to replace the device PTEs in
   * QEMU page table with normal PTEs from newly allocated pages.
   */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
   struct kvm *kvm, bool skip_page_out)
  {
  int i;
  struct kvmppc_uvmem_page_pvt *pvt;
-    unsigned long pfn, uvmem_pfn;
-    unsigned long gfn = free->base_gfn;
+    struct page *uvmem_page;
+    struct vm_area_struct *vma = NULL;
+    unsigned long uvmem_pfn, gfn;
+    unsigned long addr, end;
+
+    mmap_read_lock(kvm->mm);
+
+    addr = slot->userspace_addr;


We typically use gfn_to_hva() for that, but that won't work for a
memslot that is already marked INVALID which is the case here.
I think it is ok to access slot->userspace_addr here of an INVALID
memslot, but just thought of explictly bringing this up.


Which is explicitly mentioned above in the patch's description:

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.




+    end = addr + (slot->npages * PAGE_SIZE);
-    for (i = free->npages; i; --i, ++gfn) {
-    struct page *uvmem_page;
+    gfn = slot->base_gfn;
+    for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+    /* Fetch the VMA if addr is not in the latest fetched one */
+    if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+    vma = find_vma_intersection(kvm->mm, addr, end);
+    if (!vma ||
+    vma->vm_start > addr || vma->vm_end < end) {
+    pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+    break;
+    }
+    }


In Ram's series, kvmppc_memslot_page_merge() also walks the VMAs spanning
the memslot, but it uses a different logic for the same. Why can't these
two cases use the same method to walk the VMAs? Is there anything subtly
different between the two cases?


This is probably doable. At the time I wrote that patch, the 
kvmppc_memslot_page_merge() was not yet introduced AFAIR.


This being said, it'd help a lot to factorize that code... I'll let Ram deal with 
that ;)


Indeed I don't think this is relevant: the loop in kvmppc_memslot_page_merge() 
makes one call (to ksm_madvise) per VMA, while this code makes one call per 
page of the VMA, which is completely different.
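
To make the contrast concrete, a rough sketch of the two loop shapes
(illustrative only — argument lists abbreviated, not the exact code from
Ram's series):

	/* kvmppc_memslot_page_merge(): one ksm_madvise() call per VMA */
	for (vma = find_vma_intersection(kvm->mm, start, end);
	     vma && vma->vm_start < end; vma = vma->vm_next)
		ksm_madvise(vma, vma->vm_start, vma->vm_end,
			    MADV_UNMERGEABLE, &vma->vm_flags);

	/* kvmppc_uvmem_drop_pages(): one page-out call per page */
	for (addr = start; addr < end; addr += PAGE_SIZE)
		__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
				      PAGE_SHIFT, kvm, gpa);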


I don't think merging the two would be a good idea.

Cheers,
Laurent.


Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-23 Thread Laurent Dufour

Le 23/07/2020 à 05:36, Bharata B Rao a écrit :

On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:

When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is held in read mode during that time, not in write
mode since the virtual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
  1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5a4b02d3f651..ba5c7c77cc3a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -624,35 +624,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
   * fault on them, do fault time migration to replace the device PTEs in
   * QEMU page table with normal PTEs from newly allocated pages.
   */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
  {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   mmap_read_lock(kvm->mm);
+
+   addr = slot->userspace_addr;


We typically use gfn_to_hva() for that, but that won't work for a
memslot that is already marked INVALID which is the case here.
I think it is ok to access slot->userspace_addr here of an INVALID
memslot, but just thought of explictly bringing this up.


Which is explicitly mentioned above in the patch's description:

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.




+   end = addr + (slot->npages * PAGE_SIZE);
  
-	for (i = free->npages; i; --i, ++gfn) {

-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+   vma = find_vma_intersection(kvm->mm, addr, end);
+   if (!vma ||
+   vma->vm_start > addr || vma->vm_end < end) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }


In Ram's series, kvmppc_memslot_page_merge() also walks the VMAs spanning
the memslot, but it uses a different logic for the same. Why can't these
two cases use the same method to walk the VMAs? Is there anything subtly
different between the two cases?


This is probably doable. At the time I wrote that patch, the 
kvmppc_memslot_page_merge() was not yet introduced AFAIR.


This being said, it'd help a lot to factorize that code... I'll let Ram deal with 
that ;)


Cheers,
Laurent.




Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-22 Thread Laurent Dufour

Le 21/07/2020 à 23:37, Ram Pai a écrit :

On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:

When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to directly calling it instead of triggering such a mechanism. This

 ^^ call directly instead of triggering..


way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is help in read mode during that time, not in write

  ^^ held


mode since the virual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 


Reviewed-by: Ram Pai 


Thanks for reviewing this series.

Regarding the wordsmithing, Paul, could you manage that when pulling the series?

Thanks,
Laurent.


[PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-21 Thread Laurent Dufour
When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is held in read mode during that time, not in write
mode since the virtual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
 1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5a4b02d3f651..ba5c7c77cc3a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -624,35 +624,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
 {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   mmap_read_lock(kvm->mm);
+
+   addr = slot->userspace_addr;
+   end = addr + (slot->npages * PAGE_SIZE);
 
-   for (i = free->npages; i; --i, ++gfn) {
-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+   vma = find_vma_intersection(kvm->mm, addr, end);
+   if (!vma ||
+   vma->vm_start > addr || vma->vm_end < end) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }
 
		mutex_lock(&kvm->arch.uvmem_lock);
-   if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = skip_page_out;
+   pvt->remove_gfn = true;
+
+   if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+ PAGE_SHIFT, kvm, pvt->gpa))
+   pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+  pvt->gpa, addr);
+   } else {
+   /* Remove the shared flag if any */
kvmppc_gfn_remove(gfn, kvm);
-   mutex_unlock(&kvm->arch.uvmem_lock);
-   continue;
}
 
-   uvmem_page = pfn_to_page(uvmem_pfn);
-   pvt = uvmem_page->zone_device_data;
-   pvt->skip_page_out = skip_page_out;
-   pvt->remove_gfn = true;
	mutex_unlock(&kvm->arch.uvmem_lock);
-
-   pfn = gfn_to_pfn(kvm, gfn);
-   if (is_error_noslot_pfn(pfn))
-   continue;
-   kvm_release_pfn_clean(pfn);
}
+
+   mmap_read_unlock(kvm->mm);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0



[PATCH v2 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up

2020-07-21 Thread Laurent Dufour
kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
so move it upper in this file.

Furthermore it will be interesting to call this function when already
holding the kvm->arch.uvmem_lock, so prefix the original function with __
and remove the locking in it, and introduce a wrapper which call that
function with the lock held.

There is no functional change.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 166 -
 1 file changed, 90 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a2b4d259f8b0..5a4b02d3f651 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -526,6 +526,96 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
return ret;
 }
 
+/*
+ * Provision a new page on HV side and copy over the contents
+ * from secure memory using UV_PAGE_OUT uvcall.
+ * Caller must held kvm->arch.uvmem_lock.
+ */
+static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long page_shift,
+   struct kvm *kvm, unsigned long gpa)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma mig;
+   struct page *dpage, *spage;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   int ret = U_SUCCESS;
+
+   memset(&mig, 0, sizeof(mig));
+   mig.vma = vma;
+   mig.start = start;
+   mig.end = end;
+   mig.src = &src_pfn;
+   mig.dst = &dst_pfn;
+   mig.src_owner = &kvmppc_uvmem_pgmap;
+
+   /* The requested page is already paged-out, nothing to do */
+   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
+   return ret;
+
+   ret = migrate_vma_setup(&mig);
+   if (ret)
+   return -1;
+
+   spage = migrate_pfn_to_page(*mig.src);
+   if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
+   goto out_finalize;
+
+   if (!is_zone_device_page(spage))
+   goto out_finalize;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
+   if (!dpage) {
+   ret = -1;
+   goto out_finalize;
+   }
+
+   lock_page(dpage);
+   pvt = spage->zone_device_data;
+   pfn = page_to_pfn(dpage);
+
+   /*
+* This function is used in two cases:
+* - When HV touches a secure page, for which we do UV_PAGE_OUT
+* - When a secure page is converted to shared page, we *get*
+*   the page to essentially unmap the device page. In this
+*   case we skip page-out.
+*/
+   if (!pvt->skip_page_out)
+   ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
+ gpa, 0, page_shift);
+
+   if (ret == U_SUCCESS)
+   *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+   else {
+   unlock_page(dpage);
+   __free_page(dpage);
+   goto out_finalize;
+   }
+
+   migrate_vma_pages(&mig);
+
+out_finalize:
+   migrate_vma_finalize(&mig);
+   return ret;
+}
+
+static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long page_shift,
+ struct kvm *kvm, unsigned long gpa)
+{
+   int ret;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+
+   return ret;
+}
+
 /*
  * Drop device pages that we maintain for the secure guest
  *
@@ -898,82 +988,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, 
unsigned long gpa,
return ret;
 }
 
-/*
- * Provision a new page on HV side and copy over the contents
- * from secure memory using UV_PAGE_OUT uvcall.
- */
-static int kvmppc_svm_page_out(struct vm_area_struct *vma,
-   unsigned long start,
-   unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
-{
-   unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
-   struct page *dpage, *spage;
-   struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn;
-   int ret = U_SUCCESS;
-
-   memset(&mig, 0, sizeof(mig));
-   mig.vma = vma;
-   mig.start = start;
-   mig.end = end;
-   mig.src = &src_pfn;
-   mig.dst = &dst_pfn;
-   mig.src_owner = &kvmppc_uvmem_pgmap;
-
-   mutex_lock(&kvm->arch.uvmem_lock);
-   /* The requested page is already paged-out, nothing to do */
-   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
-   goto out;
-
-   ret = migrate_vma_setup(&mig);
-   if (ret)
-   goto out;
-
-   spage = migrate_pfn_to_pa

[PATCH v2 0/2] Rework secure memslot dropping

2020-07-21 Thread Laurent Dufour
When doing memory hotplug on a secure VM, the secure pages are not properly
cleaned from the secure device when dropping the memslot.  This silent
error then prevents the SVM from rebooting properly after the following
sequence of commands is run in the Qemu monitor:

device_add pc-dimm,id=dimm1,memdev=mem1
device_del dimm1
device_add pc-dimm,id=dimm1,memdev=mem1

At reboot time, when the kernel is booting again and switching to the
secure mode, the page_in is failing for the pages in the memslot because
the cleanup was not done properly, because the memslot is flagged as
invalid during the hot unplug and thus the page fault mechanism is not
triggered.

To prevent that, instead of relying on the page fault mechanism to trigger
the page out of the secured pages when the memslot is dropped, it seems
simpler to directly call the function doing the page out. This way the
state of the memslot does not interfere with the page out process.

This series applies on top of the Ram's one titled:
"[v4 0/5] Migrate non-migrated pages of a SVM."
https://lore.kernel.org/linuxppc-dev/1594972827-13928-1-git-send-email-linux...@us.ibm.com/

Changes since V1:
 - Rebase on top of Ram's V4 series
 - Address Bharata's comment to use mmap_read_*lock().

Laurent Dufour (2):
  KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
  KVM: PPC: Book3S HV: rework secure mem slot dropping

 arch/powerpc/kvm/book3s_hv_uvmem.c | 220 +
 1 file changed, 127 insertions(+), 93 deletions(-)

-- 
2.27.0



Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-21 Thread Laurent Dufour

Le 20/07/2020 à 22:24, Segher Boessenkool a écrit :

On Mon, Jul 20, 2020 at 03:10:41PM -0500, Segher Boessenkool wrote:

On Mon, Jul 20, 2020 at 11:39:56AM +0200, Laurent Dufour wrote:

Le 16/07/2020 à 10:32, Ram Pai a écrit :

+   if (is_secure_guest()) {\
+   __asm__ __volatile__("mfsprg0 %3;"\
+   "lnia %2;"\
+   "ld %2,12(%2);"   \
+   "mtsprg0 %2;" \
+   "sync;"   \
+   #insn" %0,%y1;"   \
+   "twi 0,%0,0;" \
+   "isync;"  \
+   "mtsprg0 %3"  \
+   : "=r" (ret)  \
+   : "Z" (*addr), "r" (0), "r" (0)   \


I'm wondering if SPRG0 is restored to its original value.
You're using the same register (r0) for parameters 2 and 3, so when doing
lnia %2, you're overwriting the SPRG0 value you saved in r0 just earlier.


It is putting the value 0 in the registers the compiler chooses for
operands 2 and 3.  But operand 3 is written, while the asm says it is an
input.  It needs an earlyclobber as well.


Oh nice, I was not aware that the compiler may choose registers this way.
Good to know, thanks for the explanation.


It may be clearer to use explicit registers for %2 and %3 and to mark them
as modified for the compiler.


That is not a good idea, imnsho.


(The explicit register number part, I mean; operand 2 should be an
output as well, yes.)


Sure if the compiler can choose the registers that's far better.
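
To make that concrete, here is a rough sketch — not the actual patch, just
one way to let the compiler pick the scratch registers and mark them as
early-clobber outputs instead of zero-valued inputs (the names "save" and
"tmp" are illustrative; operand numbers shift accordingly):

	u##size ret;							\
	unsigned long save, tmp;					\
	__asm__ __volatile__("mfsprg0 %1;"   /* save SPRG0 */		\
			     "lnia %2;"					\
			     "ld %2,12(%2);" /* faulting instruction */	\
			     "mtsprg0 %2;"				\
			     "sync;"					\
			     #insn" %0,%y3;"				\
			     "twi 0,%0,0;"				\
			     "isync;"					\
			     "mtsprg0 %1"    /* restore SPRG0 */	\
			     : "=r" (ret), "=&r" (save), "=&r" (tmp)	\
			     : "Z" (*addr)				\
			     : "memory");				\

Making "save" and "tmp" early-clobber outputs ("=&r") tells the compiler
they are written before the inputs are consumed, so it cannot allocate them
to the same register as each other or as the address operand.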

Cheers,
Laurent.


Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-20 Thread Laurent Dufour

Le 16/07/2020 à 10:32, Ram Pai a écrit :

An instruction accessing a mmio address, generates a HDSI fault.  This fault is
appropriately handled by the Hypervisor.  However in the case of secureVMs, the
fault is delivered to the ultravisor.

Unfortunately the Ultravisor has no correct-way to fetch the faulting
instruction. The PEF architecture does not allow Ultravisor to enable MMU
translation. Walking the two level page table to read the instruction can race
with other vcpus modifying the SVM's process scoped page table.

This problem can be correctly solved with some help from the kernel.

Capture the faulting instruction in SPRG0 register, before executing the
faulting instruction. This enables the ultravisor to easily procure the
faulting instruction and emulate it.

Signed-off-by: Ram Pai 
---
  arch/powerpc/include/asm/io.h | 85 ++-
  1 file changed, 75 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 635969b..7ef663d 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -35,6 +35,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #define SIO_CONFIG_RA	0x398

  #define SIO_CONFIG_RD 0x399
@@ -105,34 +106,98 @@
  static inline u##size name(const volatile u##size __iomem *addr)  \
  { \
u##size ret;\
-   __asm__ __volatile__("sync;"#insn" %0,%y1;twi 0,%0,0;isync" \
-   : "=r" (ret) : "Z" (*addr) : "memory");   \
+   if (is_secure_guest()) {\
+   __asm__ __volatile__("mfsprg0 %3;"\
+   "lnia %2;"\
+   "ld %2,12(%2);"   \
+   "mtsprg0 %2;" \
+   "sync;"   \
+   #insn" %0,%y1;"   \
+   "twi 0,%0,0;" \
+   "isync;"  \
+   "mtsprg0 %3"  \
+   : "=r" (ret)  \
+   : "Z" (*addr), "r" (0), "r" (0)   \


I'm wondering if SPRG0 is restored to its original value.
You're using the same register (r0) for parameters 2 and 3, so when doing lnia 
%2, you're overwriting the SPRG0 value you saved in r0 just earlier.


It may be clearer to use explicit registers for %2 and %3 and to mark them as 
modified for the compiler.


This applies to the other macros.

Cheers,
Laurent.


+   : "memory");  \
+   } else {\
+   __asm__ __volatile__("sync;"  \
+   #insn" %0,%y1;"   \
+   "twi 0,%0,0;" \
+   "isync"   \
+   : "=r" (ret) : "Z" (*addr) : "memory");   \
+   }   \
return ret; \
  }
  
  #define DEF_MMIO_OUT_X(name, size, insn)\

  static inline void name(volatile u##size __iomem *addr, u##size val)  \
  { \
-   __asm__ __volatile__("sync;"#insn" %1,%y0"  \
-   : "=Z" (*addr) : "r" (val) : "memory");   \
-   mmiowb_set_pending();   \
+   if (is_secure_guest()) {\
+   __asm__ __volatile__("mfsprg0 %3;"\
+   "lnia %2;"\
+   "ld %2,12(%2);"   \
+   "mtsprg0 %2;" \
+   "sync;"   \
+   #insn" %1,%y0;"   \
+   "mtsprg0 %3"  \
+   : "=Z" (*addr)\
+   : "r" (val), "r" (0), "r" (0) \
+   : "memory");  \
+   } else {\
+   __asm__ __volatile__("sync;"  \
+

Re: [PATCH v2 02/12] powerpc/kexec_file: mark PPC64 specific code

2020-07-10 Thread Laurent Dufour

Le 02/07/2020 à 21:54, Hari Bathini a écrit :

Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
rename purgatory/trampoline.S to purgatory/trampoline_64.S in the
same spirit.


FWIW and despite a minor comment below,

Reviewed-by: Laurent Dufour 


Signed-off-by: Hari Bathini 
---

Changes in v2:
* No changes.


  arch/powerpc/include/asm/kexec.h   |   11 +++
  arch/powerpc/kexec/Makefile|2 -
  arch/powerpc/kexec/elf_64.c|7 +-
  arch/powerpc/kexec/file_load.c |   37 ++
  arch/powerpc/kexec/file_load_64.c  |  108 ++
  arch/powerpc/purgatory/Makefile|4 +
  arch/powerpc/purgatory/trampoline.S|  117 
  arch/powerpc/purgatory/trampoline_64.S |  117 
  8 files changed, 248 insertions(+), 155 deletions(-)
  create mode 100644 arch/powerpc/kexec/file_load_64.c
  delete mode 100644 arch/powerpc/purgatory/trampoline.S
  create mode 100644 arch/powerpc/purgatory/trampoline_64.S

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index c684768..7008ea1 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -114,8 +114,17 @@ int setup_purgatory(struct kimage *image, const void 
*slave_code,
unsigned long fdt_load_addr);
  int setup_new_fdt(const struct kimage *image, void *fdt,
  unsigned long initrd_load_addr, unsigned long initrd_len,
- const char *cmdline);
+ const char *cmdline, int *node);
  int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
+
+#ifdef CONFIG_PPC64
+int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
+ const void *fdt, unsigned long kernel_load_addr,
+ unsigned long fdt_load_addr);
+int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
+   unsigned long initrd_load_addr,
+   unsigned long initrd_len, const char *cmdline);
+#endif /* CONFIG_PPC64 */
  #endif /* CONFIG_KEXEC_FILE */
  
  #else /* !CONFIG_KEXEC_CORE */

diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 86380c6..67c3553 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -7,7 +7,7 @@ obj-y   += core.o crash.o core_$(BITS).o
  
  obj-$(CONFIG_PPC32)		+= relocate_32.o
  
-obj-$(CONFIG_KEXEC_FILE)	+= file_load.o elf_$(BITS).o

+obj-$(CONFIG_KEXEC_FILE)   += file_load.o file_load_$(BITS).o elf_$(BITS).o
  
  ifdef CONFIG_HAVE_IMA_KEXEC

  ifdef CONFIG_IMA
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 3072fd6..23ad04c 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -88,7 +88,8 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
goto out;
}
  
-	ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);

+   ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
+ initrd_len, cmdline);
if (ret)
goto out;
  
@@ -107,8 +108,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,

pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
  
  	slave_code = elf_info.buffer + elf_info.proghdrs[0].p_offset;

-   ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
- fdt_load_addr);
+   ret = setup_purgatory_ppc64(image, slave_code, fdt, kernel_load_addr,
+   fdt_load_addr);
if (ret)
pr_err("Error setting up the purgatory.\n");
  
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c

index 143c917..99a2c4d 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -1,6 +1,6 @@
  // SPDX-License-Identifier: GPL-2.0-only
  /*
- * ppc64 code to implement the kexec_file_load syscall
+ * powerpc code to implement the kexec_file_load syscall
   *
   * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
   * Copyright (C) 2004  IBM Corp.
@@ -16,26 +16,10 @@
  
  #include 

  #include 
-#include 
  #include 
  #include 
  
-#define SLAVE_CODE_SIZE		256

-
-const struct kexec_file_ops * const kexec_file_loaders[] = {
-   &kexec_elf64_ops,
-   NULL
-};
-
-int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
- unsigned long buf_len)
-{
-   /* We don't support crash kernels yet. */
-   if (image->type == KEXEC_TYPE_CRASH)
-   return -EOPNOTSUPP;
-
-   return kexec_image_probe_default(image, buf, buf_len);
-}
+#define SLAVE_CODE_SIZE256 /* First 0x100 bytes */
  
  /**

   * setup_purgatory - initialize the purga

Re: [PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-08 Thread Laurent Dufour

Le 08/07/2020 à 13:25, Bharata B Rao a écrit :

On Fri, Jul 03, 2020 at 05:59:14PM +0200, Laurent Dufour wrote:

When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.


Yes, this appears much simpler.


Thanks Bharata for reviewing this.





Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is held in read mode during that time, not in write
mode since the virtual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
  1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 852cc9ae6a0b..479ddf16d18c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
   * fault on them, do fault time migration to replace the device PTEs in
   * QEMU page table with normal PTEs from newly allocated pages.
   */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
  {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   down_read(&kvm->mm->mmap_sem);


You should be using mmap_read_lock(kvm->mm) with recent kernels.


Absolutely, shame on me, I reviewed Michel's series about that!

Paul, Michael, could you fix that when pulling this patch or should I send a 
whole new series?
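
For reference, the change is mechanical and is what the v2 repost does — the
mmap_sem accessors are simply replaced by the new mmap locking API:

-	down_read(&kvm->mm->mmap_sem);
+	mmap_read_lock(kvm->mm);
 	...
-	up_read(&kvm->mm->mmap_sem);
+	mmap_read_unlock(kvm->mm);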





+
+   addr = slot->userspace_addr;
+   end = addr + (slot->npages * PAGE_SIZE);
  
-	for (i = free->npages; i; --i, ++gfn) {

-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+   vma = find_vma_intersection(kvm->mm, addr, end);
+   if (!vma ||
+   vma->vm_start > addr || vma->vm_end < end) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }


The first find_vma_intersection() was called for the range spanning the
entire memslot, but you have code to check if vma remains valid for the
new addr in each iteration. Guess you wanted to get vma for one page at
a time and use it for subsequent pages until it covers the range?


That's the goal: fetch the VMA once and don't fetch it again until we reach its end boundary.


Re: [PATCH v12 00/31] Speculative page faults

2020-07-06 Thread Laurent Dufour

Le 06/07/2020 à 11:25, Chinwen Chang a écrit :

On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:

Hi Laurent,

I downloaded your script and run it on Intel 2s skylake platform with spf-v12 
patch
serials.

Here attached the output results of this script.

The following comparison result is statistics from the script outputs.

a). Enable THP
 SPF_0  change   SPF_1
will-it-scale.page_fault2.per_thread_ops2664190.8  -11.7%   
2353637.6
will-it-scale.page_fault3.per_thread_ops4480027.2  -14.7%   
3819331.9


b). Disable THP
 SPF_0   change  SPF_1
will-it-scale.page_fault2.per_thread_ops2653260.7   -10%
2385165.8
will-it-scale.page_fault3.per_thread_ops4436330.1   -12.4%  
3886734.2


Thanks,
Haiyan Song


On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:

Le 14/06/2019 à 10:37, Laurent Dufour a écrit :

Please find attached the script I run to get these numbers.
This would be nice if you could give it a try on your victim node and share the 
result.


Sounds that the Intel mail fitering system doesn't like the attached shell 
script.
Please find it there: 
https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44

Thanks,
Laurent.



Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for those high-TLP ones,

# launch time of applications(s):

package   version  w/ SPF  w/o SPF  improve(%)
--
Baidu maps10.13.3  0.887   0.98 9.49
Taobao8.4.0.35 1.227   1.2935.10
Meituan   9.12.401 1.107   1.54328.26
WeChat7.0.32.353   2.68 12.20
Honor of Kings1.43.1.6 6.636.7131.24


That's great news, thanks for reporting this!



By the way, we have verified our platforms with those patches and
achieved the goal of mass production.


Another good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.



[PATCH 0/2] Rework secure memslot dropping

2020-07-03 Thread Laurent Dufour
When doing memory hotplug on a secure VM, the secure pages are not properly
cleaned from the secure device when dropping the memslot.  This silent
error then prevents the SVM from rebooting properly after the following
sequence of commands is run in the Qemu monitor:

device_add pc-dimm,id=dimm1,memdev=mem1
device_del dimm1
device_add pc-dimm,id=dimm1,memdev=mem1

At reboot time, when the kernel is booting again and switching to the
secure mode, the page_in is failing for the pages in the memslot because
the cleanup was not done properly, because the memslot is flagged as
invalid during the hot unplug and thus the page fault mechanism is not
triggered.

To prevent that, instead of relying on the page fault mechanism to trigger
the page out of the secured pages when the memslot is dropped, it seems
simpler to directly call the function doing the page out. This way the
state of the memslot does not interfere with the page out process.

This series applies on top of the Ram's one titled:
"PATCH v3 0/4] Migrate non-migrated pages of a SVM."
https://lore.kernel.org/linuxppc-dev/1592606622-29884-1-git-send-email-linux...@us.ibm.com/#r

Laurent Dufour (2):
  KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
  KVM: PPC: Book3S HV: rework secure mem slot dropping

 arch/powerpc/kvm/book3s_hv_uvmem.c | 220 +
 1 file changed, 127 insertions(+), 93 deletions(-)

-- 
2.27.0



[PATCH 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up

2020-07-03 Thread Laurent Dufour
kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
so move it upper in this file.

Furthermore it will be interesting to call this function when already
holding the kvm->arch.uvmem_lock, so prefix the original function with __
and remove the locking in it, and introduce a wrapper which call that
function with the lock held.

There is no functional change.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 166 -
 1 file changed, 90 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 778a6ea86991..852cc9ae6a0b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -435,6 +435,96 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
return ret;
 }
 
+/*
+ * Provision a new page on HV side and copy over the contents
+ * from secure memory using UV_PAGE_OUT uvcall.
+ * Caller must held kvm->arch.uvmem_lock.
+ */
+static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long page_shift,
+   struct kvm *kvm, unsigned long gpa)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma mig;
+   struct page *dpage, *spage;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   int ret = U_SUCCESS;
+
+   memset(&mig, 0, sizeof(mig));
+   mig.vma = vma;
+   mig.start = start;
+   mig.end = end;
+   mig.src = &src_pfn;
+   mig.dst = &dst_pfn;
+   mig.src_owner = &kvmppc_uvmem_pgmap;
+
+   /* The requested page is already paged-out, nothing to do */
+   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
+   return ret;
+
+   ret = migrate_vma_setup(&mig);
+   if (ret)
+   return -1;
+
+   spage = migrate_pfn_to_page(*mig.src);
+   if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
+   goto out_finalize;
+
+   if (!is_zone_device_page(spage))
+   goto out_finalize;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
+   if (!dpage) {
+   ret = -1;
+   goto out_finalize;
+   }
+
+   lock_page(dpage);
+   pvt = spage->zone_device_data;
+   pfn = page_to_pfn(dpage);
+
+   /*
+* This function is used in two cases:
+* - When HV touches a secure page, for which we do UV_PAGE_OUT
+* - When a secure page is converted to shared page, we *get*
+*   the page to essentially unmap the device page. In this
+*   case we skip page-out.
+*/
+   if (!pvt->skip_page_out)
+   ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
+ gpa, 0, page_shift);
+
+   if (ret == U_SUCCESS)
+   *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+   else {
+   unlock_page(dpage);
+   __free_page(dpage);
+   goto out_finalize;
+   }
+
+   migrate_vma_pages(&mig);
+
+out_finalize:
+   migrate_vma_finalize(&mig);
+   return ret;
+}
+
+static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long page_shift,
+ struct kvm *kvm, unsigned long gpa)
+{
+   int ret;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+
+   return ret;
+}
+
 /*
  * Drop device pages that we maintain for the secure guest
  *
@@ -801,82 +891,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, 
unsigned long gpa,
return ret;
 }
 
-/*
- * Provision a new page on HV side and copy over the contents
- * from secure memory using UV_PAGE_OUT uvcall.
- */
-static int kvmppc_svm_page_out(struct vm_area_struct *vma,
-   unsigned long start,
-   unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
-{
-   unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
-   struct page *dpage, *spage;
-   struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn;
-   int ret = U_SUCCESS;
-
-   memset(&mig, 0, sizeof(mig));
-   mig.vma = vma;
-   mig.start = start;
-   mig.end = end;
-   mig.src = &src_pfn;
-   mig.dst = &dst_pfn;
-   mig.src_owner = &kvmppc_uvmem_pgmap;
-
-   mutex_lock(&kvm->arch.uvmem_lock);
-   /* The requested page is already paged-out, nothing to do */
-   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
-   goto out;
-
-   ret = migrate_vma_setup(&mig);
-   if (ret)
-   goto out;
-
-   spage = migrate_pfn_to_pa

[PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-03 Thread Laurent Dufour
When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is held in read mode during that time, not in write
mode since the virtual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
 1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 852cc9ae6a0b..479ddf16d18c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
 {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   down_read(&kvm->mm->mmap_sem);
+
+   addr = slot->userspace_addr;
+   end = addr + (slot->npages * PAGE_SIZE);
 
-   for (i = free->npages; i; --i, ++gfn) {
-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+   vma = find_vma_intersection(kvm->mm, addr, end);
+   if (!vma ||
+   vma->vm_start > addr || vma->vm_end < end) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }
 
		mutex_lock(&kvm->arch.uvmem_lock);
-   if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = skip_page_out;
+   pvt->remove_gfn = true;
+
+   if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+ PAGE_SHIFT, kvm, pvt->gpa))
+   pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+  pvt->gpa, addr);
+   } else {
+   /* Remove the shared flag if any */
kvmppc_gfn_remove(gfn, kvm);
-   mutex_unlock(&kvm->arch.uvmem_lock);
-   continue;
}
 
-   uvmem_page = pfn_to_page(uvmem_pfn);
-   pvt = uvmem_page->zone_device_data;
-   pvt->skip_page_out = skip_page_out;
-   pvt->remove_gfn = true;
	mutex_unlock(&kvm->arch.uvmem_lock);
-
-   pfn = gfn_to_pfn(kvm, gfn);
-   if (is_error_noslot_pfn(pfn))
-   continue;
-   kvm_release_pfn_clean(pfn);
}
+
+   up_read(&kvm->mm->mmap_sem);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0



Re: [PATCH v3 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-06-29 Thread Laurent Dufour

Le 28/06/2020 à 18:20, Bharata B Rao a écrit :

On Fri, Jun 19, 2020 at 03:43:41PM -0700, Ram Pai wrote:

H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly


As noted in the last iteration, can you reword the above please?
I don't see it as an incorrect assumption, but see it as extension of
scope now :-)


called H_SVM_PAGE_IN for all secure pages. These GFNs continue to be
normal GFNs associated with normal PFNs; when infact, these GFNs should
have been secure GFNs, associated with device PFNs.

Move all the PFNs associated with the SVM's GFNs, to secure-PFNs, in
H_SVM_INIT_DONE. Skip the GFNs that are already Paged-in or Shared
through H_SVM_PAGE_IN, or Paged-in followed by a Paged-out through
UV_PAGE_OUT.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
---
  Documentation/powerpc/ultravisor.rst|   2 +
  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   2 +
  arch/powerpc/kvm/book3s_hv_uvmem.c  | 154 +++-
  3 files changed, 132 insertions(+), 26 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index 363736d..3bc8957 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -933,6 +933,8 @@ Return values
* H_UNSUPPORTED if called from the wrong context (e.g.
from an SVM or before an H_SVM_INIT_START
hypercall).
+   * H_STATE   if the hypervisor could not successfully
+transition the VM to Secure VM.
  
  Description

  ~~~
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 5a9834e..b9cd7eb 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -22,6 +22,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out);
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot);
  #else
  static inline int kvmppc_uvmem_init(void)
  {
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index c8c0290..449e8a7 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -93,6 +93,7 @@
  #include 
  #include 
  #include 
+#include 
  
  static struct dev_pagemap kvmppc_uvmem_pgmap;

  static unsigned long *kvmppc_uvmem_bitmap;
@@ -339,6 +340,21 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
struct kvm *kvm,
return false;
  }
  
+/* return true, if the GFN is a shared-GFN, or a secure-GFN */

+bool kvmppc_gfn_has_transitioned(unsigned long gfn, struct kvm *kvm)
+{
+   struct kvmppc_uvmem_slot *p;
+
	list_for_each_entry(p, &kvm->arch.uvmem_pfns, list) {
+   if (gfn >= p->base_pfn && gfn < p->base_pfn + p->nr_pfns) {
+   unsigned long index = gfn - p->base_pfn;
+
+   return (p->pfns[index] & KVMPPC_GFN_FLAG_MASK);
+   }
+   }
+   return false;
+}
+
  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
  {
struct kvm_memslots *slots;
@@ -379,12 +395,31 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
  
  unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)

  {
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *memslot;
+   int srcu_idx;
+   long ret = H_SUCCESS;
+
if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
return H_UNSUPPORTED;
  
+	/* migrate any unmoved normal pfn to device pfns*/

+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   slots = kvm_memslots(kvm);
+   kvm_for_each_memslot(memslot, slots) {
+   ret = kvmppc_uv_migrate_mem_slot(kvm, memslot);
+   if (ret) {
+   ret = H_STATE;
+   goto out;
+   }
+   }
+
kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_DONE;
pr_info("LPID %d went secure\n", kvm->arch.lpid);
-   return H_SUCCESS;
+
+out:
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   return ret;
  }
  
  /*

@@ -505,12 +540,14 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  }
  
  /*

- * Alloc a PFN from private device memory pool and copy page from normal
- * memory to secure memory using UV_PAGE_IN uvcall.
+ * Alloc a PFN from private device memory pool. If @pagein

Re: [PATCH v2 2/4] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-06-18 Thread Laurent Dufour
  |
  | |   |  | |   |   |
  - --
  | |   |  | |   |   |
  | Normal  | Normal| Transient|Error|Error  |Normal |
  | |   |  | |   |   |
  | Secure  |   Error   | Error|Error|Error  |Normal |
  | |   |  | |   |   |
  |Transient|   N/A | Error|Secure   |Normal |Normal |
  



Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Thiago Jung Bauermann 
Signed-off-by: Ram Pai 
---
  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   6 +-
  arch/powerpc/kvm/book3s_64_mmu_radix.c  |   2 +-
  arch/powerpc/kvm/book3s_hv.c|   2 +-
  arch/powerpc/kvm/book3s_hv_uvmem.c  | 195 +---
  4 files changed, 180 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 5a9834e..f0c5708 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -21,7 +21,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
  int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-struct kvm *kvm, bool skip_page_out);
+struct kvm *kvm, bool skip_page_out,
+bool purge_gfn);
  #else
  static inline int kvmppc_uvmem_init(void)
  {
@@ -75,6 +76,7 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
  
  static inline void

  kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-   struct kvm *kvm, bool skip_page_out) { }
+   struct kvm *kvm, bool skip_page_out,
+   bool purge_gfn) { }
  #endif /* CONFIG_PPC_UV */
  #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 803940d..3448459 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1100,7 +1100,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
unsigned int shift;
  
  	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)

-   kvmppc_uvmem_drop_pages(memslot, kvm, true);
+   kvmppc_uvmem_drop_pages(memslot, kvm, true, false);


When reviewing the v1 of this series, I asked you the question about the fact 
that the call here is made with purge_gfn = false. Your answer was:



This function does not know under what context it is called. Since
its job is to just flush the memslot, it cannot assume anything
about purging the pages in the memslot.


Indeed, in the case of a memory hotplug operation, this function is called to 
wipe the pages from the secure device when they are secured, and in that case 
the purge is required. I also checked the other call to 
kvmppc_radix_flush_memslot() in kvmppc_core_flush_memslot_hv(), and I cannot see 
why purge_gfn should be false there either, especially when the memslot is 
reused as detailed in __kvm_set_memory_region() around the call to 
kvm_arch_flush_shadow_memslot().


I'm sorry to not have ask this earlier, but could you please elaborate on this?

  
  	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)

return;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6717d24..6cf80e5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5482,7 +5482,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
continue;
  
  		kvm_for_each_memslot(memslot, slots) {

-   kvmppc_uvmem_drop_pages(memslot, kvm, true);
+   kvmppc_uvmem_drop_pages(memslot, kvm, true, true);
uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
}
}
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 3599aaa..666d1bb 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -98,7 +98,127 @@
  static unsigned long *kvmppc_uvmem_bitmap;
  static DEFINE_SPINLOCK(kvmppc_uvmem_bitmap_lock);
  
-#define KVMPPC_UVMEM_PFN	(1UL << 63)

+/*
+ * States of a GFN
+ * ---

Re: [PATCH V2] powerpc/pseries/svm: Drop unused align argument in alloc_shared_lppaca() function

2020-06-16 Thread Laurent Dufour

Le 12/06/2020 à 16:29, Satheesh Rajendran a écrit :

Argument "align" in alloc_shared_lppaca() was unused inside the
function. Let's drop it and update code comment for page alignment.

Cc: linux-ker...@vger.kernel.org
Cc: Thiago Jung Bauermann 
Cc: Ram Pai 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Michael Ellerman 
Reviewed-by: Thiago Jung Bauermann 


Reviewed-by: Laurent Dufour 


Signed-off-by: Satheesh Rajendran 
---

V2:
Added reviewed by Thiago.
Dropped align argument as per Michael suggest.
Modified commit msg.

V1: 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200609113909.17236-1-sathn...@linux.vnet.ibm.com/
---
  arch/powerpc/kernel/paca.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 8d96169c597e..a174d64d9b4d 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -57,8 +57,8 @@ static void *__init alloc_paca_data(unsigned long size, 
unsigned long align,
  
  #define LPPACA_SIZE 0x400
  
-static void *__init alloc_shared_lppaca(unsigned long size, unsigned long align,

-   unsigned long limit, int cpu)
+static void *__init alloc_shared_lppaca(unsigned long size, unsigned long 
limit,
+   int cpu)
  {
size_t shared_lppaca_total_size = PAGE_ALIGN(nr_cpu_ids * LPPACA_SIZE);
static unsigned long shared_lppaca_size;
@@ -68,6 +68,12 @@ static void *__init alloc_shared_lppaca(unsigned long size, 
unsigned long align,
if (!shared_lppaca) {
memblock_set_bottom_up(true);
  
+		/* See Documentation/powerpc/ultravisor.rst for more details.
+		 *
+		 * UV/HV data sharing is at PAGE granularity. In order to
+		 * minimize the number of pages shared and maximize the
+		 * use of a page, let's use page alignment.
+		 */
shared_lppaca =
memblock_alloc_try_nid(shared_lppaca_total_size,
   PAGE_SIZE, MEMBLOCK_LOW_LIMIT,
@@ -122,7 +128,7 @@ static struct lppaca * __init new_lppaca(int cpu, unsigned 
long limit)
return NULL;
  
  	if (is_secure_guest())

-   lp = alloc_shared_lppaca(LPPACA_SIZE, 0x400, limit, cpu);
+   lp = alloc_shared_lppaca(LPPACA_SIZE, limit, cpu);
else
lp = alloc_paca_data(LPPACA_SIZE, 0x400, limit, cpu);
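
As a small, hedged illustration of the page-granularity sharing described in
the comment above (values assumed, not from the patch), the amount of memory to
share with the Ultravisor is the page-aligned total of all LPPACAs:

	/* illustration only: assumed 64K PAGE_SIZE and 1K LPPACA_SIZE */
	size_t shared_bytes = PAGE_ALIGN(nr_cpu_ids * LPPACA_SIZE);
	size_t shared_pages = shared_bytes >> PAGE_SHIFT;
	/* e.g. nr_cpu_ids = 100 -> 100K rounds up to 128K -> 2 shared pages */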
  





Re: [PATCH v1 4/4] KVM: PPC: Book3S HV: migrate hot plugged memory

2020-06-15 Thread Laurent Dufour

Le 31/05/2020 à 04:27, Ram Pai a écrit :

From: Laurent Dufour 

When a memory slot is hot plugged to a SVM, GFNs associated with that
memory slot automatically default to secure GFN. Hence migrate the
PFNs associated with these GFNs to device-PFNs.

uv_migrate_mem_slot() is called to achieve that. It will not call
UV_PAGE_IN since this request is ignored by the Ultravisor.
NOTE: the Ultravisor does not trust any page content provided by
the Hypervisor once the VM turns secure.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
(fixed merge conflicts. Modified the commit message)
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/include/asm/kvm_book3s_uvmem.h |  4 
  arch/powerpc/kvm/book3s_hv.c| 11 +++
  arch/powerpc/kvm/book3s_hv_uvmem.c  |  3 +--
  3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index f0c5708..2ec2e5afb 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -23,6 +23,7 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out,
 bool purge_gfn);
+int uv_migrate_mem_slot(struct kvm *kvm, const struct kvm_memory_slot 
*memslot);
  #else
  static inline int kvmppc_uvmem_init(void)
  {
@@ -78,5 +79,8 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
  kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
struct kvm *kvm, bool skip_page_out,
bool purge_gfn) { }
+
+static int uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot);
  #endif /* CONFIG_PPC_UV */
  #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4c62bfe..604d062 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4516,13 +4516,16 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
case KVM_MR_CREATE:
if (kvmppc_uvmem_slot_init(kvm, new))
return;
-   uv_register_mem_slot(kvm->arch.lpid,
-new->base_gfn << PAGE_SHIFT,
-new->npages * PAGE_SIZE,
-0, new->id);
+   if (uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id))
+   return;
+   uv_migrate_mem_slot(kvm, new);
break;
case KVM_MR_DELETE:
uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   kvmppc_uvmem_drop_pages(old, kvm, true, true);


My mistake: kvmppc_radix_flush_memslot(), called just before, already triggers 
the call to kvmppc_uvmem_drop_pages(), so this call is useless.


You should remove it in your v2.


kvmppc_uvmem_slot_free(kvm, old);
break;
default:
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 36dda1d..1fa5f2a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -377,8 +377,7 @@ static int kvmppc_svm_migrate_page(struct vm_area_struct 
*vma,
return ret;
  }
  
-static int uv_migrate_mem_slot(struct kvm *kvm,

-   const struct kvm_memory_slot *memslot)
+int uv_migrate_mem_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot)
  {
unsigned long gfn = memslot->base_gfn;
unsigned long end;
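
A hedged sketch of what the per-memslot migration loop could look like (the
body of uv_migrate_mem_slot() is not quoted above, and both helper names below
are made up for illustration):

int uv_migrate_mem_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot)
{
	unsigned long gfn, end = memslot->base_gfn + memslot->npages;
	int ret = 0;

	for (gfn = memslot->base_gfn; gfn < end; gfn++) {
		/* shared GFNs stay with the Hypervisor and are skipped */
		if (kvmppc_gfn_is_shared(kvm, gfn))
			continue;
		/* move the normal PFN to a device PFN; UV_PAGE_IN is not issued */
		ret = kvmppc_svm_migrate_gfn(kvm, memslot, gfn);
		if (ret)
			break;
	}
	return ret;
}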





Re: [PATCH] powerpc/pseries/svm: Remove unwanted check for shared_lppaca_size

2020-06-09 Thread Laurent Dufour

Le 09/06/2020 à 12:57, Satheesh Rajendran a écrit :

Early secure guest boot hits the crash below when the number of vcpus is
aligned with the page boundary, for a PAGE size of 64k and an LPPACA size
of 1k, i.e. 64, 128, etc., due to the BUG_ON assert firing when
shared_lppaca_total_size equals shared_lppaca_size,

  [0.00] Partition configured for 64 cpus.
  [0.00] CPU maps initialized for 1 thread per core
  [0.00] [ cut here ]
  [0.00] kernel BUG at arch/powerpc/kernel/paca.c:89!
  [0.00] Oops: Exception in kernel mode, sig: 5 [#1]
  [0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries

which is not necessary, let's remove it.



Reviewed-by: Laurent Dufour 


Cc: linux-ker...@vger.kernel.org
Cc: Thiago Jung Bauermann 
Cc: Ram Pai 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Signed-off-by: Satheesh Rajendran 
---
  arch/powerpc/kernel/paca.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 949eceb25..10b7c54a7 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -86,7 +86,7 @@ static void *__init alloc_shared_lppaca(unsigned long size, 
unsigned long align,
 * This is very early in boot, so no harm done if the kernel crashes at
 * this point.
 */
-   BUG_ON(shared_lppaca_size >= shared_lppaca_total_size);
+   BUG_ON(shared_lppaca_size > shared_lppaca_total_size);
  
  	return ptr;

  }
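
To make the boundary condition concrete, a worked example with the values from
the commit message (editorial, assumed 64K pages):

	/* 64 vCPUs, LPPACA_SIZE = 0x400 (1K), PAGE_SIZE = 0x10000 (64K) */
	shared_lppaca_total_size = PAGE_ALIGN(64 * 0x400);	/* = 0x10000 */
	/* after the 64th allocation, shared_lppaca_size = 64 * 0x400 = 0x10000:  */
	/* old check: BUG_ON(0x10000 >= 0x10000) fires on a legitimate layout     */
	/* new check: BUG_ON(0x10000 >  0x10000) does not                         */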





Re: [PATCH] powerpc/pseries/svm: Remove unwanted check for shared_lppaca_size

2020-06-09 Thread Laurent Dufour

Le 09/06/2020 à 07:38, sathn...@linux.vent.ibm.com a écrit :

From: Satheesh Rajendran 

Early secure guest boot hits the crash below when the number of vcpus is
aligned with the page boundary, for a PAGE size of 64k and an LPPACA size
of 1k, i.e. 64, 128, etc., due to the BUG_ON assert firing when
shared_lppaca_total_size equals shared_lppaca_size,

  [0.00] Partition configured for 64 cpus.
  [0.00] CPU maps initialized for 1 thread per core
  [0.00] [ cut here ]
  [0.00] kernel BUG at arch/powerpc/kernel/paca.c:89!
  [0.00] Oops: Exception in kernel mode, sig: 5 [#1]
  [0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries

which is not necessary, let's remove it.


Reviewed-by: Laurent Dufour 


Cc: linuxppc-dev@lists.ozlabs.org
Cc: Thiago Jung Bauermann 
Cc: Ram Pai 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Signed-off-by: Satheesh Rajendran 
---
  arch/powerpc/kernel/paca.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 949eceb25..10b7c54a7 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -86,7 +86,7 @@ static void *__init alloc_shared_lppaca(unsigned long size, 
unsigned long align,
 * This is very early in boot, so no harm done if the kernel crashes at
 * this point.
 */
-   BUG_ON(shared_lppaca_size >= shared_lppaca_total_size);
+   BUG_ON(shared_lppaca_size > shared_lppaca_total_size);
  
  	return ptr;

  }





Re: [PATCH v1 2/4] KVM: PPC: Book3S HV: track shared GFNs of secure VMs

2020-06-05 Thread Laurent Dufour

Le 31/05/2020 à 04:27, Ram Pai a écrit :

During the life of SVM, its GFNs can transition from secure to shared
state and vice-versa. Since the kernel does not track GFNs that are
shared, it is not possible to disambiguate a shared GFN from a GFN whose
PFN has not yet been migrated to a device-PFN.

The ability to identify a shared GFN is needed to skip migrating its PFN
to device PFN. This functionality is leveraged in a subsequent patch.

Add the ability to identify the state of a GFN.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Thiago Jung Bauermann 
Signed-off-by: Ram Pai 
---
  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   6 +-
  arch/powerpc/kvm/book3s_64_mmu_radix.c  |   2 +-
  arch/powerpc/kvm/book3s_hv.c|   2 +-
  arch/powerpc/kvm/book3s_hv_uvmem.c  | 115 ++--
  4 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 5a9834e..f0c5708 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -21,7 +21,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
  int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-struct kvm *kvm, bool skip_page_out);
+struct kvm *kvm, bool skip_page_out,
+bool purge_gfn);
  #else
  static inline int kvmppc_uvmem_init(void)
  {
@@ -75,6 +76,7 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
  
  static inline void

  kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-   struct kvm *kvm, bool skip_page_out) { }
+   struct kvm *kvm, bool skip_page_out,
+   bool purge_gfn) { }
  #endif /* CONFIG_PPC_UV */
  #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 803940d..3448459 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1100,7 +1100,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
unsigned int shift;
  
  	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)

-   kvmppc_uvmem_drop_pages(memslot, kvm, true);
+   kvmppc_uvmem_drop_pages(memslot, kvm, true, false);


Why is purge_gfn false here?
This function is called when dropping a hot plugged memslot.
That being said, when called by kvmppc_core_commit_memory_region_hv(), the mem 
slot is then freed by kvmppc_uvmem_slot_free(), so the shared state will not 
remain for long, but there is a window...


  
  	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)

return;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 103d13e..4c62bfe 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5467,7 +5467,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
continue;
  
  		kvm_for_each_memslot(memslot, slots) {

-   kvmppc_uvmem_drop_pages(memslot, kvm, true);
+   kvmppc_uvmem_drop_pages(memslot, kvm, true, true);
uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
}
}
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index ea4a1f1..2ef1e03 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -99,14 +99,56 @@
  static DEFINE_SPINLOCK(kvmppc_uvmem_bitmap_lock);
  
  #define KVMPPC_UVMEM_PFN	(1UL << 63)

+#define KVMPPC_UVMEM_SHARED(1UL << 62)
+#define KVMPPC_UVMEM_FLAG_MASK (KVMPPC_UVMEM_PFN | KVMPPC_UVMEM_SHARED)
+#define KVMPPC_UVMEM_PFN_MASK  (~KVMPPC_UVMEM_FLAG_MASK)
  
  struct kvmppc_uvmem_slot {

struct list_head list;
unsigned long nr_pfns;
unsigned long base_pfn;
+   /*
+* pfns array has an entry for each GFN of the memory slot.
+*
+* The GFN can be in one of the following states.
+*
+* (a) Secure - The GFN is secure. Only Ultravisor can access it.
+* (b) Shared - The GFN is shared. Both Hypervisor and Ultravisor
+*  can access it.
+* (c) Normal - The GFN is a normal.  Only Hypervisor can access it.
+*
+* Secure GFN is associated with a devicePFN. Its pfn[] has
+* KVMPPC_UVMEM_PFN flag set, and has the value of the device PFN
+* KVMPPC_
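
As an editorial sketch of how such a pfns[] entry could be decoded (only the
KVMPPC_UVMEM_* macros above come from the patch; the helper names are made up):

static bool uvmem_gfn_is_secure(unsigned long entry)
{
	return entry & KVMPPC_UVMEM_PFN;		/* backed by a device PFN */
}

static bool uvmem_gfn_is_shared(unsigned long entry)
{
	return entry & KVMPPC_UVMEM_SHARED;
}

static unsigned long uvmem_gfn_device_pfn(unsigned long entry)
{
	return entry & KVMPPC_UVMEM_PFN_MASK;		/* value bits hold the device PFN */
}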

Re: [PATCH v1 4/4] KVM: PPC: Book3S HV: migrate hot plugged memory

2020-06-02 Thread Laurent Dufour

Le 31/05/2020 à 04:27, Ram Pai a écrit :

From: Laurent Dufour 

When a memory slot is hot plugged to a SVM, GFNs associated with that
memory slot automatically default to secure GFN. Hence migrate the
PFNs associated with these GFNs to device-PFNs.

uv_migrate_mem_slot() is called to achieve that. It will not call
UV_PAGE_IN since this request is ignored by the Ultravisor.
NOTE: the Ultravisor does not trust any page content provided by
the Hypervisor once the VM turns secure.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
(fixed merge conflicts. Modified the commit message)
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/include/asm/kvm_book3s_uvmem.h |  4 
  arch/powerpc/kvm/book3s_hv.c| 11 +++
  arch/powerpc/kvm/book3s_hv_uvmem.c  |  3 +--
  3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index f0c5708..2ec2e5afb 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -23,6 +23,7 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out,
 bool purge_gfn);
+int uv_migrate_mem_slot(struct kvm *kvm, const struct kvm_memory_slot 
*memslot);
  #else
  static inline int kvmppc_uvmem_init(void)
  {
@@ -78,5 +79,8 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
  kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
struct kvm *kvm, bool skip_page_out,
bool purge_gfn) { }
+
+static int uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot);


That line was not part of the patch I sent to you!



  #endif /* CONFIG_PPC_UV */
  #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4c62bfe..604d062 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4516,13 +4516,16 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
case KVM_MR_CREATE:
if (kvmppc_uvmem_slot_init(kvm, new))
return;
-   uv_register_mem_slot(kvm->arch.lpid,
-new->base_gfn << PAGE_SHIFT,
-new->npages * PAGE_SIZE,
-0, new->id);
+   if (uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id))
+   return;
+   uv_migrate_mem_slot(kvm, new);
break;
case KVM_MR_DELETE:
uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   kvmppc_uvmem_drop_pages(old, kvm, true, true);


Again, that line has been changed from the patch I sent to you. The last 'true' 
argument does not belong here.


Is that series really building?


kvmppc_uvmem_slot_free(kvm, old);
break;
default:
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 36dda1d..1fa5f2a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -377,8 +377,7 @@ static int kvmppc_svm_migrate_page(struct vm_area_struct 
*vma,
return ret;
  }
  
-static int uv_migrate_mem_slot(struct kvm *kvm,

-   const struct kvm_memory_slot *memslot)
+int uv_migrate_mem_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot)
  {
unsigned long gfn = memslot->base_gfn;
unsigned long end;





Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-27 Thread Laurent Dufour

Le 27/05/2020 à 06:16, Paul Mackerras a écrit :

On Wed, May 20, 2020 at 07:43:08PM +0200, Laurent Dufour wrote:

The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
reserved to the Ultravisor.

However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.

In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
not set in that case.

Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour 


Thanks, applied to my kvm-ppc-next branch.  I expanded the comment in
the code a little.


Thanks, the comment is more explicit now.

Laurent.


Re: [PATCH] KVM: PPC: Book3S HV: read ibm,secure-memory nodes

2020-05-26 Thread Laurent Dufour

Paul, could you please take that patch?

Le 16/04/2020 à 18:27, Laurent Dufour a écrit :

The newly introduced ibm,secure-memory nodes supersede the
ibm,uv-firmware's property secure-memory-ranges.

Firmware will no longer expose the secure-memory-ranges property, so first
read the new one and, if it is not found, fall back to the older one.

Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 53b88cae3e73..ad950f8996e0 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -735,6 +735,20 @@ static u64 kvmppc_get_secmem_size(void)
const __be32 *prop;
u64 size = 0;

+   /*
+* First try the new ibm,secure-memory nodes which supersede the
+* secure-memory-ranges property.
+* If we find some, no need to read the deprecated one.
+*/
+   for_each_compatible_node(np, NULL, "ibm,secure-memory") {
+   prop = of_get_property(np, "reg", );
+   if (!prop)
+   continue;
+   size += of_read_number(prop + 2, 2);
+   }
+   if (size)
+   return size;
+
np = of_find_compatible_node(NULL, NULL, "ibm,uv-firmware");
if (!np)
goto out;
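
As an editorial note on what of_read_number(prop + 2, 2) extracts here
(assuming the usual #address-cells = <2> and #size-cells = <2> layout, where
each "reg" entry is <addr_hi addr_lo size_hi size_lo>), the 64-bit size is
folded from the two big-endian size cells:

	/* equivalent under the assumed 2+2 cell layout */
	u64 size = ((u64)be32_to_cpu(prop[2]) << 32) | be32_to_cpu(prop[3]);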





[PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-20 Thread Laurent Dufour
The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
reserved to the Ultravisor.

However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.

In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
not set in that case.

Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 93493f0cbfe8..6ad1a3b14300 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1099,9 +1099,12 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmppc_h_svm_init_done(vcpu->kvm);
break;
case H_SVM_INIT_ABORT:
-   ret = H_UNSUPPORTED;
-   if (kvmppc_get_srr1(vcpu) & MSR_S)
-   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
+   /*
+* Even if that call is made by the Ultravisor, the SRR1 value
+* is the guest context one, with the secure bit clear as it has
+* not yet been secured. So we can't check it here.
+*/
+   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
break;
 
default:
-- 
2.26.2
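
For context on the filtering described in the commit message, a minimal sketch
of the check a regular VM hits inside kvmppc_h_svm_init_abort() (sketch only;
the real function does more than this):

	/* a VM that never started the transition to secure mode is rejected */
	if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
		return H_UNSUPPORTED;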



Re: [PATCH] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-20 Thread Laurent Dufour

Le 20/05/2020 à 19:32, Greg Kurz a écrit :

On Wed, 20 May 2020 18:51:10 +0200
Laurent Dufour  wrote:


The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
reserved to the Ultravisor.

However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.

In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
not set in that case.



Why not check vcpu->kvm->arch.secure_guest then?


I don't think that's the right place.



Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 93493f0cbfe8..eb1f96cb7b72 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1099,9 +1099,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmppc_h_svm_init_done(vcpu->kvm);
break;
case H_SVM_INIT_ABORT:
-   ret = H_UNSUPPORTED;
-   if (kvmppc_get_srr1(vcpu) & MSR_S)
-   ret = kvmppc_h_svm_init_abort(vcpu->kvm);


or at least put a comment to explain why H_SVM_INIT_ABORT
doesn't have the same sanity check as the other SVM hcalls.


I agree that might help. I'll send a v2 with a comment there.




+   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
break;
  
  	default:






[PATCH] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-20 Thread Laurent Dufour
The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
reserved to the Ultravisor.

However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.

In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
not set in that case.

Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 93493f0cbfe8..eb1f96cb7b72 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1099,9 +1099,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmppc_h_svm_init_done(vcpu->kvm);
break;
case H_SVM_INIT_ABORT:
-   ret = H_UNSUPPORTED;
-   if (kvmppc_get_srr1(vcpu) & MSR_S)
-   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
+   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
break;
 
default:
-- 
2.26.2


