Re: [PATCH v4 02/13] mm/mempolicy: convert single preferred_node to full nodemask

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:39:59, Feng Tang wrote:
> From: Dave Hansen 
> 
> The NUMA APIs currently allow passing in a "preferred node" as a
> single bit set in a nodemask.  If more than one bit it set, bits
> after the first are ignored.  Internally, this is implemented as
> a single integer: mempolicy->preferred_node.
> 
> This single node is generally OK for location-based NUMA where
> memory being allocated will eventually be operated on by a single
> CPU.  However, in systems with multiple memory types, folks want
> to target a *type* of memory instead of a location.  For instance,
> someone might want some high-bandwidth memory but do not care about
> the CPU next to which it is allocated.  Or, they want a cheap,
> high capacity allocation and want to target all NUMA nodes which
> have persistent memory in volatile mode.  In both of these cases,
> the application wants to target a *set* of nodes, but does not
> want strict MPOL_BIND behavior as that could lead to OOM killer or
> SIGSEGV.
> 
> To get that behavior, a MPOL_PREFERRED mode is desirable, but one
> that honors multiple nodes to be set in the nodemask.
> 
> The first step in that direction is to be able to internally store
> multiple preferred nodes, which is implemented in this patch.
> 
> This should not have any function changes and just switches the
> internal representation of mempolicy->preferred_node from an
> integer to a nodemask called 'mempolicy->preferred_nodes'.
> 
> This is not a pie-in-the-sky dream for an API.  This was a response to a
> specific ask of more than one group at Intel.  Specifically:
> 
> 1. There are existing libraries that target memory types such as
>https://github.com/memkind/memkind.  These are known to suffer
>from SIGSEGV's when memory is low on targeted memory "kinds" that
>span more than one node.  The MCDRAM on a Xeon Phi in "Cluster on
>Die" mode is an example of this.
> 2. Volatile-use persistent memory users want to have a memory policy
>which is targeted at either "cheap and slow" (PMEM) or "expensive and
>fast" (DRAM).  However, they do not want to experience allocation
>failures when the targeted type is unavailable.
> 3. Allocate-then-run.  Generally, we let the process scheduler decide
>on which physical CPU to run a task.  That location provides a
>default allocation policy, and memory availability is not generally
>considered when placing tasks.  For situations where memory is
>valuable and constrained, some users want to allocate memory first,
>*then* allocate close compute resources to the allocation.  This is
>the reverse of the normal (CPU) model.  Accelerators such as GPUs
>that operate on core-mm-managed memory are interested in this model.

This is a very useful background for the feature. The changelog for the
specific patch is rather modest and it would help to add more details
about the change. The mempolicy code is a maze and it is quite easy to
get lost there. I hope we are not going to miss something just by hunting
preferred_node usage...
 
[...]
> @@ -345,22 +345,26 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
>   const nodemask_t *nodes)
>  {
>   nodemask_t tmp;
> + nodemask_t preferred_node;

This is rather harsh. Some distribution kernels use high NODES_SHIFT
(SLES has 10 for x86) so this will consume additional 1K on the stack.
Unless I am missing something this shouldn't be called in deep call
chains but still.

> +
> + /* MPOL_PREFERRED uses only the first node in the mask */
> + preferred_node = nodemask_of_node(first_node(*nodes));
>  
>   if (pol->flags & MPOL_F_STATIC_NODES) {
>   int node = first_node(pol->w.user_nodemask);
>  
>   if (node_isset(node, *nodes)) {
> - pol->v.preferred_node = node;
> + pol->v.preferred_nodes = nodemask_of_node(node);
>   pol->flags &= ~MPOL_F_LOCAL;
>   } else
>   pol->flags |= MPOL_F_LOCAL;
>   } else if (pol->flags & MPOL_F_RELATIVE_NODES) {
>   mpol_relative_nodemask(, >w.user_nodemask, nodes);
> - pol->v.preferred_node = first_node(tmp);
> + pol->v.preferred_nodes = tmp;
>   } else if (!(pol->flags & MPOL_F_LOCAL)) {
> - pol->v.preferred_node = node_remap(pol->v.preferred_node,
> -pol->w.cpuset_mems_allowed,
> -*nodes);
> + nodes_remap(tmp, pol->v.preferred_nodes,
> + pol->w.cpuset_mems_allowed, preferred_node);
> + pol->v.preferred_nodes = tmp;
>   pol->w.cpuset_mems_allowed = *nodes;
>   }

I have to say that I really disliked the original code (becasuse it
fiddles with user provided input behind the back) I got lost here
completely. What the heck is going on?
a) why do we 

[PATCH v4 02/13] mm/mempolicy: convert single preferred_node to full nodemask

2021-03-16 Thread Feng Tang
From: Dave Hansen 

The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask.  If more than one bit it set, bits
after the first are ignored.  Internally, this is implemented as
a single integer: mempolicy->preferred_node.

This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU.  However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location.  For instance,
someone might want some high-bandwidth memory but do not care about
the CPU next to which it is allocated.  Or, they want a cheap,
high capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode.  In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior as that could lead to OOM killer or
SIGSEGV.

To get that behavior, a MPOL_PREFERRED mode is desirable, but one
that honors multiple nodes to be set in the nodemask.

The first step in that direction is to be able to internally store
multiple preferred nodes, which is implemented in this patch.

This should not have any function changes and just switches the
internal representation of mempolicy->preferred_node from an
integer to a nodemask called 'mempolicy->preferred_nodes'.

This is not a pie-in-the-sky dream for an API.  This was a response to a
specific ask of more than one group at Intel.  Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind.  These are known to suffer
   from SIGSEGV's when memory is low on targeted memory "kinds" that
   span more than one node.  The MCDRAM on a Xeon Phi in "Cluster on
   Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive and
   fast" (DRAM).  However, they do not want to experience allocation
   failures when the targeted type is unavailable.
3. Allocate-then-run.  Generally, we let the process scheduler decide
   on which physical CPU to run a task.  That location provides a
   default allocation policy, and memory availability is not generally
   considered when placing tasks.  For situations where memory is
   valuable and constrained, some users want to allocate memory first,
   *then* allocate close compute resources to the allocation.  This is
   the reverse of the normal (CPU) model.  Accelerators such as GPUs
   that operate on core-mm-managed memory are interested in this model.

v2:
Fix spelling errors in commit message. (Ben)
clang-format. (Ben)
Integrated bit from another patch. (Ben)
Update the docs to reflect the internal data structure change (Ben)
Don't advertise MPOL_PREFERRED_MANY in UAPI until we can handle it (Ben)
Added more to the commit message (Dave)

Link: https://lore.kernel.org/r/20200630212517.308045-3-ben.widaw...@intel.com
Co-developed-by: Ben Widawsky 
Signed-off-by: Dave Hansen 
Signed-off-by: Ben Widawsky 
Signed-off-by: Feng Tang 
---
 .../admin-guide/mm/numa_memory_policy.rst  |  6 ++--
 include/linux/mempolicy.h  |  4 +--
 mm/mempolicy.c | 40 --
 3 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst 
b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a..1ad020c 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -205,9 +205,9 @@ MPOL_PREFERRED
of increasing distance from the preferred node based on
information provided by the platform firmware.
 
-   Internally, the Preferred policy uses a single node--the
-   preferred_node member of struct mempolicy.  When the internal
-   mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
+   Internally, the Preferred policy uses a nodemask--the
+   preferred_nodes member of struct mempolicy.  When the internal
+   mode flag MPOL_F_LOCAL is set, the preferred_nodes are ignored
and the policy is interpreted as local allocation.  "Local"
allocation policy can be viewed as a Preferred policy that
starts at the node containing the cpu where the allocation
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f1c74d..23ee105 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -47,8 +47,8 @@ struct mempolicy {
unsigned short mode;/* See MPOL_* above */
unsigned short flags;   /* See set_mempolicy() MPOL_F_* above */
union {
-   shortpreferred_node; /* preferred */
-   nodemask_t   nodes; /* interleave/bind */
+   nodemask_t preferred_nodes; /* preferred */
+   nodemask_t nodes; /* interleave/bind */