Hello all,

TL;DR: Try the combination of intel/19.1.1.217 and openmpi/3.1.6 on Expanse.

Long version:

I had a similar issue on Expanse during the EUP: great performance on 1 node and much worse on multiple nodes.  I tried many combinations of the MPI implementations and compilers that were available at the time.

Here is what I found:

* intel 19 + intel mpi : great 1-node speed, poor scaling to >1 node, poor weak scaling
* intel 19 + openmpi 4 : same as above
* gcc 10 + openmpi 4 : 40% slower 1-node speed, poor scaling to >1 node, poor weak scaling
* gcc 9 + openmpi 3.1.6 : 40% slower 1-node speed, good scaling to >1 node, acceptable weak scaling
* the intel 19 + openmpi 4 executable, but with the gcc 9 / openmpi 3.1.6 modules loaded at runtime (see the sketch after this list) : great 1-node speed, good scaling to >1 node, acceptable weak scaling
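
To be concrete, that last combination looks roughly like this (the module names/versions and the executable/par-file names below are only placeholders from memory; check "module avail" on Expanse for the exact versions):

# executable built with intel 19 + openmpi 4, run with the gcc 9 / openmpi 3.1.6 stack loaded instead
module purge
module load gcc/9.2.0 openmpi/3.1.6
mpirun -n 256 ./cactus_sim my_simulation.par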

The MPI implementation that worked was openmpi 3.1.6.  At the time it was only available with the gcc 9 compilers, but I contacted support and they installed a build for the intel compilers.  Here is what support said:

"I think I can install openmpi/3.1.6 with intel compilers. I have to go back and 
check but I think the main difference is we are using ibverbs on the openmpi/3.1.6 build 
and ucx on the openmpi/4.0.4. For most codes ucx has been the faster option but in your 
case it seems different. I will let you know once the compilers are in place."

After he installed it, the combination of intel 19 compilers with openmpi 3.1.6 gives acceptable scaling. I am not familiar enough with ucx vs ibverbs to say whether that is the issue with the AMD clusters.  I also see the same issue on Bridges-2, which uses the same AMD nodes as Expanse, and have not been able to get the code to perform well on >1 node.
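
If anyone wants to confirm which transport a given openmpi module was built with, the standard Open MPI tools should be enough (nothing here is Expanse-specific; "./exe" is just a placeholder for your executable):

ompi_info | grep -i -e ucx -e openib    # list the transport components compiled into the loaded openmpi
mpirun --mca pml ucx ./exe              # force the UCX path at runtime
mpirun --mca pml ob1 --mca btl openib,self,vader ./exe   # force the ibverbs path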

At least for Expanse, I'd suggest loading intel 19 and openmpi 3.1.6 and seeing if you get better scaling.  If you do, we could ask support about the differences between the openmpi/3.1.6 and openmpi/4.0.4 configurations to see if there is more to it than ucx vs ibverbs.  The next step would then be seeing if we can replicate this elsewhere (like Bridges-2 or Anvil).

module load intel/19.1.1.217
module load openmpi/3.1.6

Best,
Jim Healy
CCRG Research Associate

On 8/27/21 12:44 PM, Gabriele Bozzola wrote:
Hello,

Last week I opened a PR to add the configuration files
for Expanse to simfactory. Expanse is an example of
the new generation of AMD supercomputers. Others include
Anvil, another of the new XSEDE machines, and Puma,
the newest cluster at The University of Arizona.

I have some experience with Puma and Expanse and
I would like to share some thoughts, some of which come
from interacting with the admins of Expanse. The problem
is that I am finding terrible multi-node performance on both
these machines, and I don't know if this will be a common
thread among new AMD clusters.

These supercomputers have similar characteristics.

First, they have a very high core count per node (typically
128) but low memory per core (typically 2 GB per core).
In these conditions, it is very easy to have a job killed by
the OOM daemon. My suspicion is that it is rank 0 that
goes out of memory, and the entire run is aborted.
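
A possible mitigation, which I have not tested systematically, is to
undersubscribe MPI ranks and ask SLURM for all of a node's memory
instead of the per-core default. The numbers below are only an
example for a 128-core node:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16   # fewer MPI ranks per node
#SBATCH --cpus-per-task=8      # fill the remaining cores with OpenMP threads
#SBATCH --mem=0                # request all of the node's memory rather than the per-core default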

Second, depending on the MPI implementation, MPI collective
operations can be extremely expensive. I was told that
the best implementation is mvapich 2.3.6 (at the moment).
This seems to be due to the high core count.
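
On Expanse I would guess that corresponds to something like

module load mvapich2/2.3.6

though I have not checked the exact module name.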

I found that the code does not scale well. This is possibly
related to the previous point. If your job can fit on a single node,
it will run wonderfully. However, when you perform the same
simulation on two nodes, the code will actually be slower.
This indicates that there's no strong scaling at all from
1 node to 2 (128 to 256 cores, or 32 to 64 MPI ranks).
Using mvapich 2.3.6 improves the situation, but it is still
faster to use fewer nodes.

(My benchmark is a par file I've tested extensively on Frontera)

I am working with Expanse's support staff to see what we can
do, but I wonder if anyone has had a positive experience with
this architecture and has some tips to share.

Gabriele

