Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-14 Thread Morten Rasmussen
On Wed, Mar 07, 2018 at 10:19:50AM -0600, Jeremy Linton wrote:
> Hi,
> 
> On 03/07/2018 07:06 AM, Morten Rasmussen wrote:
> >On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote:
> >>To do this correctly, we should really base that on the cache
> >>topology immediately below the NUMA node (for NUMA in socket)
> >>or below the physical package for normal NUMA configurations.
> >
> >That means we wouldn't support multi-die NUMA nodes?
> 
> You mean a bottom level NUMA domain that crosses multiple sockets/dies?
> That should work. This patch is picking the widest cache layer below the
> smallest of the package or numa grouping. What actually happens depends
> on the topology. Given a case where there are multiple dies in a socket,
> and the numa domain is at the socket level the MC is going to reflect
> the caching topology immediately below the socket. In the case of
> multiple dies, with a cache that crosses them in socket, then the MC is
> basically going to be the socket, otherwise if the widest cache is per
> die, or some narrower grouping (cluster?) then that is what ends up in
> the MC. (this is easier with some pictures)
> >>>
> >>>That is more or less what I meant. I think I got confused with the role
> >>>of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
> >>>cpumask spans exactly the NUMA node, so IIUC we have three scenarios:
> >>>
> >>>1. Multi-die/socket/physical package NUMA node
> >>>Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
> >>>spans multiple multi-package nodes. The MC mask reflects the last-level
> >>>cache within the NUMA node which is most likely per-die or per-cluster
> >>>(inside each die).
> >>>
> >>>2. physical package == NUMA node
> >>>The top non-NUMA (DIE) mask is the same as the core sibling mask.
> >>>If there is cache spanning the entire node, the scheduler topology
> >>>will eliminate a layer (DIE?), so bottom NUMA level would be right on
> >>>top of MC spanning multiple physical packages. If there is no
> >>>node-wide last level cache, DIE is preserved and MC matches the span
> >>>of the last level cache.
> >>>
> >>>3. numa-in-package
> >>>Top non-NUMA (DIE) mask is not reflecting the actual die, but is
> >>>reflecting the NUMA node. MC has a span equal to the largest shared
> >>>cache span smaller than or equal to the NUMA node. If it is
> >>>equal, DIE level is eliminated, otherwise DIE is preserved, but
> >>>doesn't really represent die. Bottom non-NUMA level spans multiple
> >>>in-package NUMA nodes.
> >>>
> >>>As you said, multi-die nodes should work. However, I'm not sure if
> >>>shrinking MC to match a cache could cause us trouble, or if it should
> >>>just be shrunk to be the smaller of the node mask and core siblings.
> >>
> >>Shrinking to the smaller of the numa or package is a fairly trivial change,
> >>I'm good with that change too. I discounted it because there might be an
> >>advantage in case 2 if the internal hardware is actually a case 3 (or just
> >>multiple rings/whatever each with an L3). In those cases the firmware vendor
> >>could play around with whatever representation serves them the best.
> >
> >Agreed. Distributed last level caches and interconnect speeds make it
> >virtually impossible to define MC in a way that works well for everyone
> >based on the topology information we have at hand.
> >
> >>
> >>>Unless you have a node-wide last level cache DIE level won't be
> >>>eliminated in scenario 2 and 3, and could cause trouble. For
> >>>numa-in-package, you can end up with a DIE level inside the node where
> >>>the default flags don't favour aggressive spreading of tasks. The same
> >>>could be the case for per-package nodes (scenario 2).
> >>>
> >>>Don't we end up redefining physical package to be last level cache
> >>>instead of using the PPTT flag for scenario 2 and 3?
> >>
> >>I'm not sure I understand, core_siblings isn't changing (it's still per
> >>package). Only the MC mapping which normally is just core_siblings. For all
> >>intents right now this patch is the same as v6, except for the
> >>numa-in-package where the MC domain is being shrunk to the node siblings.
> >>I'm just trying to set up the code for potential future cases where the LLC
> >>isn't equal to the node or package.
> >
> >Right, core_siblings remains the same. The scheduler topology just looks
> >a bit odd as we can have core_siblings spanning the full true physical
> >package and have DIE level as a subset of that with an MC level where
> >the MC siblings are a much smaller subset of cpus than core_siblings.
> >
> >IOW, it would lead to having one topology used by the scheduler and
> >another used by the users of topology_core_cpumask() (which is not
> >many I think).
> >
> >Is there a good reason for diverging instead of adjusting the
> >core_sibling mask? On x86 the core_siblings mask is defined by the last
> >level cache span so they don't have this issue.

Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-14 Thread Morten Rasmussen
On Thu, Mar 08, 2018 at 09:41:17PM +0100, Brice Goglin wrote:
> 
> > Is there a good reason for diverging instead of adjusting the
> > core_sibling mask? On x86 the core_siblings mask is defined by the last
> > level cache span so they don't have this issue. 
> 
> No. core_siblings is defined as the list of cores that have the same
> physical_package_id (see the doc of sysfs topology files), and LLC can
> be smaller than that.
> Example with E5v3 with cluster-on-die (two L3 per package, core_siblings
> is twice as large as the L3 cpumap):
> https://www.open-mpi.org/projects/hwloc/lstopo/images/2XeonE5v3.v1.11.png
> On AMD EPYC, you even have up to 8 LLC per package.

Right, I missed the fact that on x86 the cpumask reported by
topology_core_cpumask(), which defines the core_siblings exported through
sysfs, is different from the mask used to define the MC level in the
scheduler topology. The sysfs core_siblings is defined by the package_id,
while the MC level is defined by the LLC.

Thanks for pointing this out.

On arm64 MC level and sysfs core_siblings are currently defined using
the same mask, but we can't break sysfs, so using different masks is the
only option.
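
A minimal sketch of what keeping the two masks separate could look like
on arm64 (the llc_sibling field is hypothetical here, not something the
posted patch adds; core_sibling is the existing field):

/* sysfs core_siblings: still spans the physical package */
#define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)

/* MC scheduling level: may be narrower than the package */
const struct cpumask *cpu_coregroup_mask(int cpu)
{
	return &cpu_topology[cpu].llc_sibling;	/* hypothetical LLC-span mask */
}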

Morten 


Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-08 Thread Brice Goglin

> Is there a good reason for diverging instead of adjusting the
> core_sibling mask? On x86 the core_siblings mask is defined by the last
> level cache span so they don't have this issue. 

No. core_siblings is defined as the list of cores that have the same
physical_package_id (see the doc of sysfs topology files), and LLC can
be smaller than that.
Example with E5v3 with cluster-on-die (two L3 per package, core_siblings
is twice as large as the L3 cpumap):
https://www.open-mpi.org/projects/hwloc/lstopo/images/2XeonE5v3.v1.11.png
On AMD EPYC, you even have up to 8 LLC per package.
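
If you want to see this on a live box, a quick sketch that just dumps the
two sysfs masks for cpu0 (assuming the L3 shows up as cache index3, which
is typical but not guaranteed):

#include <stdio.h>

int main(void)
{
	char buf[256];
	const char *files[] = {
		"/sys/devices/system/cpu/cpu0/topology/core_siblings",
		"/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map",
	};

	/* print both masks so the package vs. LLC difference is visible */
	for (int i = 0; i < 2; i++) {
		FILE *f = fopen(files[i], "r");

		if (!f || !fgets(buf, sizeof(buf), f)) {
			perror(files[i]);
			if (f)
				fclose(f);
			continue;
		}
		printf("%s: %s", files[i], buf);
		fclose(f);
	}
	return 0;
}

On a cluster-on-die Xeon like the one above, the first mask comes out
twice as wide as the second.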

Brice



Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-07 Thread Jeremy Linton

Hi,

On 03/07/2018 07:06 AM, Morten Rasmussen wrote:

On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote:

To do this correctly, we should really base that on the cache
topology immediately below the NUMA node (for NUMA in socket)
or below the physical package for normal NUMA configurations.


That means we wouldn't support multi-die NUMA nodes?


You mean a bottom level NUMA domain that crosses multiple sockets/dies? That
should work. This patch is picking the widest cache layer below the smallest
of the package or numa grouping. What actually happens depends on the
topology. Given a case where there are multiple dies in a socket, and the
numa domain is at the socket level the MC is going to reflect the caching
topology immediately below the socket. In the case of multiple dies, with a
cache that crosses them in socket, then the MC is basically going to be the
socket, otherwise if the widest cache is per die, or some narrower grouping
(cluster?) then that is what ends up in the MC. (this is easier with some
pictures)


That is more or less what I meant. I think I got confused with the role
of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
cpumask spans exactly the NUMA node, so IIUC we have three scenarios:

1. Multi-die/socket/physical package NUMA node
Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
spans multiple multi-package nodes. The MC mask reflects the last-level
cache within the NUMA node which is most likely per-die or per-cluster
(inside each die).

2. physical package == NUMA node
The top non-NUMA (DIE) mask is the same as the core sibling mask.
If there is cache spanning the entire node, the scheduler topology
will eliminate a layer (DIE?), so bottom NUMA level would be right on
top of MC spanning multiple physical packages. If there is no
node-wide last level cache, DIE is preserved and MC matches the span
of the last level cache.

3. numa-in-package
Top non-NUMA (DIE) mask is not reflecting the actual die, but is
reflecting the NUMA node. MC has a span equal to the largest shared
cache span smaller than or equal to the NUMA node. If it is
equal, DIE level is eliminated, otherwise DIE is preserved, but
doesn't really represent die. Bottom non-NUMA level spans multiple
in-package NUMA nodes.

As you said, multi-die nodes should work. However, I'm not sure if
shrinking MC to match a cache could cause us trouble, or if it should
just be shrunk to be the smaller of the node mask and core siblings.


Shrinking to the smaller of the numa or package is a fairly trivial change,
I'm good with that change too. I discounted it because there might be an
advantage in case 2 if the internal hardware is actually a case 3 (or just
multiple rings/whatever each with an L3). In those cases the firmware vendor
could play around with whatever representation serves them the best.


Agreed. Distributed last level caches and interconnect speeds make it
virtually impossible to define MC in a way that works well for everyone
based on the topology information we have at hand.




Unless you have a node-wide last level cache DIE level won't be
eliminated in scenario 2 and 3, and could cause trouble. For
numa-in-package, you can end up with a DIE level inside the node where
the default flags don't favour aggressive spreading of tasks. The same
could be the case for per-package nodes (scenario 2).

Don't we end up redefining physical package to be last level cache
instead of using the PPTT flag for scenario 2 and 3?


I'm not sure I understand, core_siblings isn't changing (it's still per
package). Only the MC mapping which normally is just core_siblings. For all
intents right now this patch is the same as v6, except for the
numa-in-package where the MC domain is being shrunk to the node siblings.
I'm just trying to set up the code for potential future cases where the LLC
isn't equal to the node or package.


Right, core_siblings remains the same. The scheduler topology just looks
a bit odd as we can have core_siblings spanning the full true physical
package and have DIE level as a subset of that with an MC level where
the MC siblings are a much smaller subset of cpus than core_siblings.

IOW, it would lead to having one topology used by the scheduler and
another used by the users of topology_core_cpumask() (which is not
many I think).

Is there a good reason for diverging instead of adjusting the
core_sibling mask? On x86 the core_siblings mask is defined by the last
level cache span so they don't have this issue.


I'm overwhelmingly convinced we are doing the right thing WRT the core 
siblings at the moment. It's exported to user space, and the general
understanding is that it's a socket. Even with numa in package/on die, if
you run lscpu, lstopo, etc., they all understand the system topology
correctly doing it this way (AFAIK).


Yes. IIUC, MC is always equal to package or node on x86. They don't ha

Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-07 Thread Morten Rasmussen
On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote:
> To do this correctly, we should really base that on the cache
> topology immediately below the NUMA node (for NUMA in socket)
> or below the physical package for normal NUMA configurations.
> >>>
> >>>That means we wouldn't support multi-die NUMA nodes?
> >>
> >>You mean a bottom level NUMA domain that crosses multiple sockets/dies? That
> >>should work. This patch is picking the widest cache layer below the smallest
> >>of the package or numa grouping. What actually happens depends on the
> >>topology. Given a case where there are multiple dies in a socket, and the
> >>numa domain is at the socket level the MC is going to reflect the caching
> >>topology immediately below the socket. In the case of multiple dies, with a
> >>cache that crosses them in socket, then the MC is basically going to be the
> >>socket, otherwise if the widest cache is per die, or some narrower grouping
> >>(cluster?) then that is what ends up in the MC. (this is easier with some
> >>pictures)
> >
> >That is more or less what I meant. I think I got confused with the role
> >of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
> >cpumask spans exactly the NUMA node, so IIUC we have three scenarios:
> >
> >1. Multi-die/socket/physical package NUMA node
> >Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
> >spans multiple multi-package nodes. The MC mask reflects the last-level
> >cache within the NUMA node which is most likely per-die or per-cluster
> >(inside each die).
> >
> >2. physical package == NUMA node
> >The top non-NUMA (DIE) mask is the same as the core sibling mask.
> >If there is cache spanning the entire node, the scheduler topology
> >will eliminate a layer (DIE?), so bottom NUMA level would be right on
> >top of MC spanning multiple physical packages. If there is no
> >node-wide last level cache, DIE is preserved and MC matches the span
> >of the last level cache.
> >
> >3. numa-in-package
> >Top non-NUMA (DIE) mask is not reflecting the actual die, but is
> >reflecting the NUMA node. MC has a span equal to the largest shared
> >cache span smaller than or equal to the NUMA node. If it is
> >equal, DIE level is eliminated, otherwise DIE is preserved, but
> >doesn't really represent die. Bottom non-NUMA level spans multiple
> >in-package NUMA nodes.
> >
> >As you said, multi-die nodes should work. However, I'm not sure if
> >shrinking MC to match a cache could cause us trouble, or if it should
> >just be shrunk to be the smaller of the node mask and core siblings.
> 
> Shrinking to the smaller of the numa or package is a fairly trivial change,
> I'm good with that change too. I discounted it because there might be an
> advantage in case 2 if the internal hardware is actually a case 3 (or just
> multiple rings/whatever each with an L3). In those cases the firmware vendor
> could play around with whatever representation serves them the best.

Agreed. Distributed last level caches and interconnect speeds make it
virtually impossible to define MC in a way that works well for everyone
based on the topology information we have at hand.
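
For reference, a sketch of the simpler alternative discussed above, i.e.
shrinking MC to the smaller of the NUMA node and the package while
ignoring the cache topology entirely (field names follow the quoted
patch; this is not what the patch does):

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu));

	/* numa-in-package: the node is a subset of the package, use the node */
	if (cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling))
		return node_mask;

	/* conventional NUMA: the package is the narrower (or equal) grouping */
	return &cpu_topology[cpu].core_sibling;
}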

> 
> >Unless you have a node-wide last level cache DIE level won't be
> >eliminated in scenario 2 and 3, and could cause trouble. For
> >numa-in-package, you can end up with a DIE level inside the node where
> >the default flags don't favour aggressive spreading of tasks. The same
> >could be the case for per-package nodes (scenario 2).
> >
> >Don't we end up redefining physical package to be last level cache
> >instead of using the PPTT flag for scenario 2 and 3?
> 
> I'm not sure I understand, core_siblings isn't changing (it's still per
> package). Only the MC mapping which normally is just core_siblings. For all
> intents right now this patch is the same as v6, except for the
> numa-in-package where the MC domain is being shrunk to the node siblings.
> I'm just trying to set up the code for potential future cases where the LLC
> isn't equal to the node or package.

Right, core_siblings remains the same. The scheduler topology just looks
a bit odd as we can have core_siblings spanning the full true physical
package and have DIE level as a subset of that with an MC level where
the MC siblings are a much smaller subset of cpus than core_siblings.

IOW, it would lead to having one topology used by the scheduler and
another used by the users of topology_core_cpumask() (which is not
many I think).

Is there a good reason for diverging instead of adjusting the
core_sibling mask? On x86 the core_siblings mask is defined by the last
level cache span so they don't have this issue. 

> >I think DIE level should be eliminated for scenario 2 and 3 like it is
> >for x86.
> 
> Ok, that is based on the assumption that MC will always be equal to either
> the package or node? If that assumption isn't true, then would you keep it,
> or maybe it doesn't matter?

Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-06 Thread Jeremy Linton

Hi,

On 03/06/2018 10:07 AM, Morten Rasmussen wrote:

On Tue, Feb 27, 2018 at 02:18:47PM -0600, Jeremy Linton wrote:

Hi,


First, thanks for taking a look at this.

On 03/01/2018 09:52 AM, Morten Rasmussen wrote:

Hi Jeremy,

On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote:

Now that we have an accurate view of the physical topology
we need to represent it correctly to the scheduler. In the
case of NUMA in socket, we need to assure that the sched domain
we build for the MC layer isn't larger than the DIE above it.


MC shouldn't be larger than any of the NUMA domains either.


Right, that is one of the things this patch is assuring..




To do this correctly, we should really base that on the cache
topology immediately below the NUMA node (for NUMA in socket)
or below the physical package for normal NUMA configurations.


That means we wouldn't support multi-die NUMA nodes?


You mean a bottom level NUMA domain that crosses multiple sockets/dies? That
should work. This patch is picking the widest cache layer below the smallest
of the package or numa grouping. What actually happens depends on the
topology. Given a case where there are multiple dies in a socket, and the
numa domain is at the socket level the MC is going to reflect the caching
topology immediately below the socket. In the case of multiple dies, with a
cache that crosses them in socket, then the MC is basically going to be the
socket, otherwise if the widest cache is per die, or some narrower grouping
(cluster?) then that is what ends up in the MC. (this is easier with some
pictures)


That is more or less what I meant. I think I got confused with the role
of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
cpumask spans exactly the NUMA node, so IIUC we have three scenarios:

1. Multi-die/socket/physical package NUMA node
Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
spans multiple multi-package nodes. The MC mask reflects the last-level
cache within the NUMA node which is most likely per-die or per-cluster
(inside each die).

2. physical package == NUMA node
The top non-NUMA (DIE) mask is the same as the core sibling mask.
If there is cache spanning the entire node, the scheduler topology
will eliminate a layer (DIE?), so bottom NUMA level would be right on
top of MC spanning multiple physical packages. If there is no
node-wide last level cache, DIE is preserved and MC matches the span
of the last level cache.

3. numa-in-package
Top non-NUMA (DIE) mask is not reflecting the actual die, but is
reflecting the NUMA node. MC has a span equal to the largest shared
cache span smaller than or equal to the NUMA node. If it is
equal, DIE level is eliminated, otherwise DIE is preserved, but
doesn't really represent die. Bottom non-NUMA level spans multiple
in-package NUMA nodes.

As you said, multi-die nodes should work. However, I'm not sure if
shrinking MC to match a cache could cause us trouble, or if it should
just be shrunk to be the smaller of the node mask and core siblings.


Shrinking to the smaller of the numa or package is a fairly trivial
change, I'm good with that change too. I discounted it because there
might be an advantage in case 2 if the internal hardware is actually a
case 3 (or just multiple rings/whatever each with an L3). In those cases
the firmware vendor could play around with whatever representation 
serves them the best.



Unless you have a node-wide last level cache DIE level won't be
eliminated in scenario 2 and 3, and could cause trouble. For
numa-in-package, you can end up with a DIE level inside the node where
the default flags don't favour aggressive spreading of tasks. The same
could be the case for per-package nodes (scenario 2).

Don't we end up redefining physical package to be last level cache
instead of using the PPTT flag for scenario 2 and 3?


I'm not sure I understand, core_siblings isn't changing (it's still per
package). Only the MC mapping which normally is just core_siblings. For 
all intents right now this patch is the same as v6, except for the 
numa-in-package where the MC domain is being shrunk to the node 
siblings. I'm just trying to set up the code for potential future cases
where the LLC isn't equal to the node or package.




I think DIE level should be eliminated for scenario 2 and 3 like it is
for x86.


Ok, that is based on the assumption that MC will always be equal to 
either the package or node? If that assumption isn't true, then would 
you keep it, or maybe it doesn't matter?




[...]


This patch creates a set of early cache_siblings masks, then
when the scheduler requests the coregroup mask we pick the
smaller of the physical package siblings, or the numa siblings
and locate the largest cache which is an entire subset of
those siblings. If we are unable to find a proper subset of
cores then we retain the original behavior and return the
core_sibling list.



Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-06 Thread Morten Rasmussen
On Tue, Feb 27, 2018 at 02:18:47PM -0600, Jeremy Linton wrote:
> Hi,
> 
> 
> First, thanks for taking a look at this.
> 
> On 03/01/2018 09:52 AM, Morten Rasmussen wrote:
> >Hi Jeremy,
> >
> >On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote:
> >>Now that we have an accurate view of the physical topology
> >>we need to represent it correctly to the scheduler. In the
> >>case of NUMA in socket, we need to assure that the sched domain
> >>we build for the MC layer isn't larger than the DIE above it.
> >
> >MC shouldn't be larger than any of the NUMA domains either.
> 
> Right, that is one of the things this patch is assuring..
> 
> >
> >>To do this correctly, we should really base that on the cache
> >>topology immediately below the NUMA node (for NUMA in socket)
> >>or below the physical package for normal NUMA configurations.
> >
> >That means we wouldn't support multi-die NUMA nodes?
> 
> You mean a bottom level NUMA domain that crosses multiple sockets/dies? That
> should work. This patch is picking the widest cache layer below the smallest
> of the package or numa grouping. What actually happens depends on the
> topology. Given a case where there are multiple dies in a socket, and the
> numa domain is at the socket level the MC is going to reflect the caching
> topology immediately below the socket. In the case of multiple dies, with a
> cache that crosses them in socket, then the MC is basically going to be the
> socket, otherwise if the widest cache is per die, or some narrower grouping
> (cluster?) then that is what ends up in the MC. (this is easier with some
> pictures)

That is more or less what I meant. I think I got confused with the role
of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
cpumask spans exactly the NUMA node, so IIUC we have three scenarios:

1. Multi-die/socket/physical package NUMA node
   Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
   spans multiple multi-package nodes. The MC mask reflects the last-level
   cache within the NUMA node which is most likely per-die or per-cluster
   (inside each die).

2. physical package == NUMA node
   The top non-NUMA (DIE) mask is the same as the core sibling mask.
   If there is cache spanning the entire node, the scheduler topology
   will eliminate a layer (DIE?), so bottom NUMA level would be right on
   top of MC spanning multiple physical packages. If there is no
   node-wide last level cache, DIE is preserved and MC matches the span
   of the last level cache.

3. numa-in-package
   Top non-NUMA (DIE) mask is not reflecting the actual die, but is
   reflecting the NUMA node. MC has a span equal to the largest shared
   cache span smaller than or equal to the NUMA node. If it is
   equal, DIE level is eliminated, otherwise DIE is preserved, but
   doesn't really represent die. Bottom non-NUMA level spans multiple
   in-package NUMA nodes.

As you said, multi-die nodes should work. However, I'm not sure if
shrinking MC to match a cache could cause us trouble, or if it should
just be shrunk to be the smaller of the node mask and core siblings.
Unless you have a node-wide last level cache DIE level won't be
eliminated in scenario 2 and 3, and could cause trouble. For
numa-in-package, you can end up with a DIE level inside the node where
the default flags don't favour aggressive spreading of tasks. The same
could be the case for per-package nodes (scenario 2).

Don't we end up redefining physical package to be last level cache
instead of using the PPTT flag for scenario 2 and 3?

I think DIE level should be eliminated for scenario 2 and 3 like it is
for x86.
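
For context, the generic scheduler topology this all maps onto looked
roughly like the following at the time (the DIE mask is cpu_cpu_mask(),
i.e. the span of the NUMA node, which is why DIE tracks the node rather
than the physical die in the scenarios above; a level gets degenerated
when its span matches the level below it):

/* include/linux/topology.h */
static inline const struct cpumask *cpu_cpu_mask(int cpu)
{
	return cpumask_of_node(cpu_to_node(cpu));
}

/* kernel/sched/topology.c */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};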

[...]

> >>This patch creates a set of early cache_siblings masks, then
> >>when the scheduler requests the coregroup mask we pick the
> >>smaller of the physical package siblings, or the numa siblings
> >>and locate the largest cache which is an entire subset of
> >>those siblings. If we are unable to find a proper subset of
> >>cores then we retain the original behavior and return the
> >>core_sibling list.
> >
> >IIUC, for numa-in-package it is a strict requirement that there is a
> >cache that spans the entire NUMA node? For example, having a NUMA node
> >consisting of two clusters with per-cluster caches only wouldn't be
> >supported?
> 
> Everything is supported, the MC is reflecting the cache topology. We just
> use the physical/numa topology to help us pick which layer of cache topology
> lands in the MC. (unless of course we fail to find a PPTT/cache topology, in
> which case we fallback to the old behavior of the core_siblings which can
> reflect the MPIDR/etc).

I see. For this example we would end up with a "DIE" level and two MC
domains inside each node whether we have the PPTT table and cache
topology or not. I'm just wondering if everyone would be happy with
basing MC on last level cache instead of the smaller of physical package
and NUMA node.

> >>+{
> >>+   /* first determine if we are a NUMA in package */
>

Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-01 Thread Jeremy Linton

Hi,


First, thanks for taking a look at this.

On 03/01/2018 09:52 AM, Morten Rasmussen wrote:

Hi Jeremy,

On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote:

Now that we have an accurate view of the physical topology
we need to represent it correctly to the scheduler. In the
case of NUMA in socket, we need to assure that the sched domain
we build for the MC layer isn't larger than the DIE above it.


MC shouldn't be larger than any of the NUMA domains either.


Right, that is one of the things this patch is assuring..




To do this correctly, we should really base that on the cache
topology immediately below the NUMA node (for NUMA in socket)
or below the physical package for normal NUMA configurations.


That means we wouldn't support multi-die NUMA nodes?


You mean a bottom level NUMA domain that crosses multiple sockets/dies? 
That should work. This patch is picking the widest cache layer below the 
smallest of the package or numa grouping. What actually happens depends 
on the topology. Given a case where there are multiple dies in a socket, 
and the numa domain is at the socket level the MC is going to reflect 
the caching topology immediately below the socket. In the case of 
multiple dies, with a cache that crosses them in socket, then the MC is 
basically going to be the socket, otherwise if the widest cache is per 
die, or some narrower grouping (cluster?) then that is what ends up in 
the MC. (this is easier with some pictures)





This patch creates a set of early cache_siblings masks, then
when the scheduler requests the coregroup mask we pick the
smaller of the physical package siblings, or the numa siblings
and locate the largest cache which is an entire subset of
those siblings. If we are unable to find a proper subset of
cores then we retain the original behavior and return the
core_sibling list.


IIUC, for numa-in-package it is a strict requirement that there is a
cache that spans the entire NUMA node? For example, having a NUMA node
consisting of two clusters with per-cluster caches only wouldn't be
supported?


Everything is supported, the MC is reflecting the cache topology. We 
just use the physical/numa topology to help us pick which layer of cache 
topology lands in the MC. (unless of course we fail to find a PPTT/cache 
topology, in which case we fallback to the old behavior of the 
core_siblings which can reflect the MPIDR/etc).






Signed-off-by: Jeremy Linton 
---
  arch/arm64/include/asm/topology.h |  5 +++
  arch/arm64/kernel/topology.c  | 64 +++
  2 files changed, 69 insertions(+)

diff --git a/arch/arm64/include/asm/topology.h 
b/arch/arm64/include/asm/topology.h
index 6b10459e6905..08db3e4e44e1 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -4,12 +4,17 @@
  
  #include 
  
+#define MAX_CACHE_CHECKS 4
+
  struct cpu_topology {
int thread_id;
int core_id;
int package_id;
+   int cache_id[MAX_CACHE_CHECKS];
cpumask_t thread_sibling;
cpumask_t core_sibling;
+   cpumask_t cache_siblings[MAX_CACHE_CHECKS];
+   int cache_level;
  };
  
  extern struct cpu_topology cpu_topology[NR_CPUS];

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index bd1aae438a31..1809dc9d347c 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -212,8 +212,42 @@ static int __init parse_dt_topology(void)
  struct cpu_topology cpu_topology[NR_CPUS];
  EXPORT_SYMBOL_GPL(cpu_topology);
  
+static void find_llc_topology_for_cpu(int cpu)


Isn't it more find core/node siblings? Or is it a requirement that the
last level cache spans exactly one NUMA node? For example, a package
level cache isn't allowed for numa-in-package?


Yes, it's a core siblings group, but more like a
widest_core_siblings_sharing_a_cache_equalorsmaller_than_the_smallest_of_numa_or_package()


LLC is a bit of a misnomer because it's the 'LLC' within the package/px
domain. It's possible there is an LLC grouping larger than whatever we pick,
but we don't care.






+{
+   /* first determine if we are a NUMA in package */
+   const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu));
+   int indx;
+
+   if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) {
+   /* not numa in package, lets use the package siblings */
+   node_mask = &cpu_topology[cpu].core_sibling;
+   }
+
+   /*
+* node_mask should represent the smallest package/numa grouping
+* lets search for the largest cache smaller than the node_mask.
+*/
+   for (indx = 0; indx < MAX_CACHE_CHECKS; indx++) {
+   cpumask_t *cache_sibs = &cpu_topology[cpu].cache_siblings[indx];
+
+   if (cpu_topology[cpu].cache_id[indx] < 0)
+   continue;
+
+   if (cpumask_subset(cache_sibs, node_mask))
+   cpu_topology[cpu].cache_level = indx;

Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings

2018-03-01 Thread Morten Rasmussen
Hi Jeremy,

On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote:
> Now that we have an accurate view of the physical topology
> we need to represent it correctly to the scheduler. In the
> case of NUMA in socket, we need to assure that the sched domain
> we build for the MC layer isn't larger than the DIE above it.

MC shouldn't be larger than any of the NUMA domains either.

> To do this correctly, we should really base that on the cache
> topology immediately below the NUMA node (for NUMA in socket)
> or below the physical package for normal NUMA configurations.

That means we wouldn't support multi-die NUMA nodes?

> This patch creates a set of early cache_siblings masks, then
> when the scheduler requests the coregroup mask we pick the
> smaller of the physical package siblings, or the numa siblings
> and locate the largest cache which is an entire subset of
> those siblings. If we are unable to find a proper subset of
> cores then we retain the original behavior and return the
> core_sibling list.

IIUC, for numa-in-package it is a strict requirement that there is a
> cache that spans the entire NUMA node? For example, having a NUMA node
consisting of two clusters with per-cluster caches only wouldn't be
supported?

> 
> Signed-off-by: Jeremy Linton 
> ---
>  arch/arm64/include/asm/topology.h |  5 +++
>  arch/arm64/kernel/topology.c  | 64 
> +++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/topology.h 
> b/arch/arm64/include/asm/topology.h
> index 6b10459e6905..08db3e4e44e1 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -4,12 +4,17 @@
>  
>  #include 
>  
> +#define MAX_CACHE_CHECKS 4
> +
>  struct cpu_topology {
>   int thread_id;
>   int core_id;
>   int package_id;
> + int cache_id[MAX_CACHE_CHECKS];
>   cpumask_t thread_sibling;
>   cpumask_t core_sibling;
> + cpumask_t cache_siblings[MAX_CACHE_CHECKS];
> + int cache_level;
>  };
>  
>  extern struct cpu_topology cpu_topology[NR_CPUS];
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index bd1aae438a31..1809dc9d347c 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -212,8 +212,42 @@ static int __init parse_dt_topology(void)
>  struct cpu_topology cpu_topology[NR_CPUS];
>  EXPORT_SYMBOL_GPL(cpu_topology);
>  
> +static void find_llc_topology_for_cpu(int cpu)

Isn't it more find core/node siblings? Or is it a requirement that the
last level cache spans exactly one NUMA node? For example, a package
level cache isn't allowed for numa-in-package?

> +{
> + /* first determine if we are a NUMA in package */
> + const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu));
> + int indx;
> +
> + if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) {
> + /* not numa in package, lets use the package siblings */
> + node_mask = &cpu_topology[cpu].core_sibling;
> + }
> +
> + /*
> +  * node_mask should represent the smallest package/numa grouping
> +  * lets search for the largest cache smaller than the node_mask.
> +  */
> + for (indx = 0; indx < MAX_CACHE_CHECKS; indx++) {
> + cpumask_t *cache_sibs = &cpu_topology[cpu].cache_siblings[indx];
> +
> + if (cpu_topology[cpu].cache_id[indx] < 0)
> + continue;
> +
> + if (cpumask_subset(cache_sibs, node_mask))
> + cpu_topology[cpu].cache_level = indx;

I don't think this guarantees that the cache level we found matches exactly
the NUMA node. Taking the two cluster NUMA node example from above, we
would set cache_level to point at the per-cluster cache as it is a
subset of the NUMA node but it would only span half of the node. Or am I
missing something?
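
For illustration only, a hypothetical stricter variant of that check,
which would only accept a cache level spanning the whole node:

	/* hypothetical: require the cache siblings to span the node exactly */
	if (cpumask_equal(cache_sibs, node_mask))
		cpu_topology[cpu].cache_level = indx;

With per-cluster caches only, no level would match, cache_level would
stay -1, and cpu_coregroup_mask() below would fall back to core_sibling.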

> + }
> +}
> +
>  const struct cpumask *cpu_coregroup_mask(int cpu)
>  {
> + int *llc = &cpu_topology[cpu].cache_level;
> +
> + if (*llc == -1)
> + find_llc_topology_for_cpu(cpu);
> +
> + if (*llc != -1)
> + return &cpu_topology[cpu].cache_siblings[*llc];
> +
>   return &cpu_topology[cpu].core_sibling;
>  }
>  
> @@ -221,6 +255,7 @@ static void update_siblings_masks(unsigned int cpuid)
>  {
>   struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
>   int cpu;
> + int idx;
>  
>   /* update core and thread sibling masks */
>   for_each_possible_cpu(cpu) {
> @@ -229,6 +264,16 @@ static void update_siblings_masks(unsigned int cpuid)
>   if (cpuid_topo->package_id != cpu_topo->package_id)
>   continue;
>  
> + for (idx = 0; idx < MAX_CACHE_CHECKS; idx++) {
> + cpumask_t *lsib;
> + int cput_id = cpuid_topo->cache_id[idx];
> +
> + if (cput_id == cpu_topo->cache_id[idx]) {
> + lsib = &cpuid_topo->cache_