Re: [hwloc-users] AMD EPYC topology

2017-12-29 Thread Matthew Scutter
Following up on this: indeed, with a recent kernel the error message goes away.
The poor performance remains, though (only a few percent difference between
4.13 and 4.15-rc5), and I'm at a loss as to whether it's related to MPI or not.
I see oddities such as locking the job to the first 12 cores yielding 100%
greater performance than locking it to the last 12 cores, which I can't explain
but can only suspect is related to some kind of MPI cache-partitioning issue.
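
As a point of reference, here is a minimal sketch of how such a core-locking
comparison can be reproduced with the hwloc C API ("first 12 cores" is taken
to mean hwloc's logical core order, and error handling is mostly omitted):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* OR together the cpusets of the first 12 core objects, in hwloc's
     * logical (topology) order. */
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    for (int i = 0; i < 12; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (core)
            hwloc_bitmap_or(set, set, core->cpuset);
    }

    /* Bind the whole process to that cpuset before starting the workload. */
    if (hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS))
        perror("hwloc_set_cpubind");

    /* ... run the computation here ... */

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}

The "last 12 cores" half of the comparison only changes which core objects
are OR'd into the set.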


On Sat, Dec 30, 2017 at 8:59 AM, Brice Goglin  wrote:

>
>
> On 29/12/2017 at 23:15, Bill Broadley wrote:
> >
> >
> > Very interesting. I was running parallel finite-element code and was seeing
> > great performance compared to Intel in most cases, but on larger runs it
> > was 20x slower.  This would explain it.
> >
> > Do you know which commit, or anything else that might help find any related
> > discussion?  I tried a few Google searches without luck.
> >
> > Is it specific to the 24-core?  The slowdown I described happened on a
> > 32-core single-socket Epyc as well as a dual-socket 24-core AMD Epyc system.
>
> Hello
>
> Yes, it's 24-core specific (that's the only core count that doesn't have
> 8 cores per Zeppelin module).
>
> The commit in Linux git master is 2b83809a5e6d619a780876fcaf68cdc42b50d28c
>
> Brice
>
>
> commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c
> Author: Suravee Suthikulpanit 
> Date:   Mon Jul 31 10:51:59 2017 +0200
>
> x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
>
> For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
> to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
> be contiguous for cores across different L3s (e.g. family17h system
> w/ downcore configuration). This breaks the logic, and results in an
> incorrect L3 shared_cpu_map.
>
> Instead, always use the previously calculated cpu_llc_shared_mask of
> each CPU to derive the L3 shared_cpu_map.
>
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] AMD EPYC topology

2017-12-29 Thread Brice Goglin


On 29/12/2017 at 23:15, Bill Broadley wrote:
>
>
> Very interesting. I was running parallel finite-element code and was seeing
> great performance compared to Intel in most cases, but on larger runs it was
> 20x slower.  This would explain it.
>
> Do you know which commit, or anything else that might help find any related
> discussion?  I tried a few Google searches without luck.
>
> Is it specific to the 24-core?  The slowdown I described happened on a 32-core
> single-socket Epyc as well as a dual-socket 24-core AMD Epyc system.

Hello

Yes, it's 24-core specific (that's the only core count that doesn't have
8 cores per Zeppelin module).
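
To check what the L3 topology looks like on such a machine, here is a minimal
sketch using the hwloc C API (assuming hwloc >= 2.0, where HWLOC_OBJ_L3CACHE
exists) that lists each L3 and the cpuset of PUs sharing it:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
        char *cpus;
        /* Print the size of each L3 and the set of PUs that share it. */
        hwloc_bitmap_asprintf(&cpus, l3->cpuset);
        printf("L3 #%d: %llu KB, cpuset %s\n", i,
               (unsigned long long) (l3->attr->cache.size / 1024), cpus);
        free(cpus);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On a down-cored 24-core socket one would expect each L3 to be shared by 3
cores rather than 4, which is the sharing the kernel reported incorrectly
before the fix below.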

The commit in Linux git master is 2b83809a5e6d619a780876fcaf68cdc42b50d28c

Brice


commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c
Author: Suravee Suthikulpanit 
Date:   Mon Jul 31 10:51:59 2017 +0200

x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask

For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
be contiguous for cores across different L3s (e.g. family17h system
w/ downcore configuration). This breaks the logic, and results in an
incorrect L3 shared_cpu_map.

Instead, always use the previously calculated cpu_llc_shared_mask of
each CPU to derive the L3 shared_cpu_map.
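
A quick way to check whether a given kernel exposes the corrected map is to
read the kernel's view of the L3 sharing in sysfs. A minimal sketch (assuming
index3 is the L3 entry, as is usual on x86):

#include <stdio.h>

int main(void)
{
    char path[128], buf[256];

    /* Walk cpu0, cpu1, ... and print the L3 shared_cpu_list the kernel
     * reports for each one; stop at the first CPU without such a file. */
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d L3 shared with cpus %s", cpu, buf);
        fclose(f);
    }
    return 0;
}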

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] AMD EPYC topology

2017-12-29 Thread Bill Broadley



Very interesting. I was running parallel finite-element code and was seeing
great performance compared to Intel in most cases, but on larger runs it was
20x slower.  This would explain it.

Do you know which commit, or anything else that might help find any related
discussion?  I tried a few Google searches without luck.

Is it specific to the 24-core?  The slowdown I described happened on a 32-core
single-socket Epyc as well as a dual-socket 24-core AMD Epyc system.
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users