Following up on this,
Indeed, with a recent kernel the error message goes away.
The poor performance remains, though (only a few percent difference between
4.13 and 4.15-rc5), and I'm at a loss as to whether it's related to MPI or
not. I see oddities such as locking the job to the first 12 cores yielding
100% greater performance than locking it to the last 12 cores, which I can't
explain but suspect is related to some kind of MPI cache partitioning.
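For reference, pinning of the kind described above is commonly done with hwloc-bind or taskset; a sketch, where the mpirun options and the binary name ./solver are placeholders rather than the actual job:

```shell
# Hypothetical invocations -- ./solver and the mpirun options are placeholders.
# First 12 cores:
#   hwloc-bind core:0-11 -- mpirun -np 12 ./solver
#   taskset -c 0-11 mpirun -np 12 ./solver
# Last 12 cores of the 24-core socket:
#   hwloc-bind core:12-23 -- mpirun -np 12 ./solver
```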
On Sat, Dec 30, 2017 at 8:59 AM, Brice Goglin wrote:
> Le 29/12/2017 à 23:15, Bill Broadley a écrit :
> > Very interesting. I was running parallel finite element code and saw
> > great performance compared to Intel in most cases, but on larger runs
> > it was 20x slower. This would explain it.
> > Do you know which commit, or anything else that might help find any
> > discussion? I tried a few google searches without luck.
> > Is it specific to the 24-core? The slowdown I described happened on a
> > 32-core Epyc single socket as well as a dual-socket 24-core AMD Epyc
> > system.
> Yes, it's 24-core specific (that's the only core count that doesn't have
> 8 cores per Zeppelin module).
> The commit in Linux git master is 2b83809a5e6d619a780876fcaf68cdc42b50d28c
> commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c
> Author: Suravee Suthikulpanit
> Date: Mon Jul 31 10:51:59 2017 +0200
> x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
> For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
> to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
> be contiguous for cores across different L3s (e.g. family17h system
> w/ downcore configuration). This breaks the logic, and results in an
> incorrect L3 shared_cpu_map.
> Instead, always use the previously calculated cpu_llc_shared_mask of
> each CPU to derive the L3 shared_cpu_map.
> hwloc-users mailing list
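One way to see whether a given kernel is affected is to compare the distinct L3 groups reported in sysfs (/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list) before and after the fix. A minimal sketch of parsing that list format, using a sample string rather than a live system:

```python
def parse_cpu_list(s):
    """Parse a sysfs CPU list such as '0-2,24-26' into a sorted list of ints."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

# Sample value for illustration (not from a real machine):
print(parse_cpu_list("0-2,24-26"))  # -> [0, 1, 2, 24, 25, 26]
```

On an affected kernel the groups derived this way would not match the machine's physical L3 domains; on a fixed kernel they should.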