[OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Collin Strassburger via users
Hello, I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both versions produce the same error (error code 63) when utilizing more than 100 cores on a single node. The processors I am utilizing are AMD Epyc "Rome" 7742s. The OS is CentOS 8.1. I have tried compiling with bo
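
For readers trying to reproduce this class of failure, a minimal launch test is usually enough, since the error occurs at startup rather than in application code. A sketch (illustrative only; the reporter's actual test case is not shown in the archive):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("rank %d of %d\n", rank, size);   /* one line per launched rank */
        MPI_Finalize();
        return 0;
    }

Launched as, e.g., "mpirun -np 128 ./launch_test" on a single 128-core node; per the report, anything above roughly 100 ranks on one node failed with error 63.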

Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Collin Strassburger via users
:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both versions produce the same error (error code 63) when utilizing more than 100 cores on a single node. The processors I am util

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Collin Strassburger via users
/27/2020 11:29 AM, Collin Strassburger via users wrote: Hello Howard, To remove potential interactions, I have found that the issue persists without UCX and hcoll

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
here: https://www.open-mpi.org/community/help/ Thanks! On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, I had initially thought the same thing about the streams, but I have 2 sockets with 64 cores each. Additionally, I h

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
returns error 63 on AMD 7742 when utilizing 100+ processors per node Does it work with PBS but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, I have done some additional testi

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
ct: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Does it work with PBS but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
sts.open-mpi.org> Cc: Ralph Castain <r...@open-mpi.org> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Does it work with PBS but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM,

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Does it work with PBS but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org>

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
On Behalf Of Ralph Castain via users Sent: Tuesday, January 28, 2020 11:02 AM To: Open MPI Users <users@lists.open-mpi.org> Cc: Ralph Castain <r...@open-mpi.org> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per nod

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Does it work with PBS but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, I hav

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I agree that it is odd that the issue does not appear until after the Mellanox drivers have been installed (and the configure flags set to use them). As requested, here are the results. Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 hostname Output: [Gen2Node3:54
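
The archive truncates the eventual resolution, so the following is a generic diagnostic rather than the fix confirmed in this thread: launch failures that appear only above a certain ranks-per-node count are often a per-user resource-limit symptom, since the launcher consumes several file descriptors and processes per local rank. A quick sketch for checking the limits the launcher actually inherits:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl;
        getrlimit(RLIMIT_NOFILE, &rl);   /* open file descriptors per process */
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
        getrlimit(RLIMIT_NPROC, &rl);    /* processes per user (Linux) */
        printf("processes:  soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
        return 0;
    }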

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Wonderful! I am happy to confirm that this resolves the issue! Many thanks to everyone for their assistance, Collin

Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Collin Strassburger via users
Hello, Just a quick comment on this: is your code written in C/C++ or Fortran? Fortran has trouble writing at a decent speed regardless of the MPI setup and as such should be avoided for file IO (yet I still occasionally see it used). Collin From: users On Behalf Of Dong-In Kang via
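
Whatever the host language, the collective MPI-IO path is the one worth benchmarking. A minimal C sketch of a collective write (illustrative, not the original poster's code; the file name and buffer size are made up):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1 << 20;                    /* 1 Mi doubles per rank */
        double *buf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* Collective write: each rank writes a disjoint block in one call,
           letting the MPI-IO layer aggregate into large contiguous requests. */
        MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }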

Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Collin Strassburger via users
. Cheers, Gilles On April 6, 2020, at 23:22, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, Just a quick comment on this: is your code written in C/C++ or Fortran? Fortran has trouble writing at a decent speed regardless of the MPI setup and as such sho

Re: [OMPI users] slot number calculation when no config files?

2020-06-08 Thread Collin Strassburger via users
Hello David, The slot calculation is based on physical cores rather than logical cores. The 4 CPUs you are seeing there are logical CPUs: since your processor has 2 threads per core, you have two physical cores, yielding a total of 4 logical cores (which is what lscpu reports). On machine

Re: [OMPI users] slot number calculation when no config files?

2020-06-08 Thread Collin Strassburger via users
it instructs mpirun to treat the HWTs as independent CPUs, so you would have 4 slots in this case. On Jun 8, 2020, at 11:28 AM, Collin Strassburger via users wrote: Hello David, The slot calculation is based on physical
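
To make the slot arithmetic concrete for this 2-core, 4-thread machine (illustrative commands; --use-hwthread-cpus is the standard mpirun option referenced above, and the exact error text may vary by version):

    Input:  mpirun -np 4 ./app
            (fails with a "not enough slots available" error: only the 2 physical cores count by default)
    Input:  mpirun --use-hwthread-cpus -np 4 ./app
            (runs: hardware threads are treated as independent CPUs, giving 4 slots)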

Re: [OMPI users] HPL: Error occurred in MPI_Recv

2022-06-09 Thread Collin Strassburger via users
Since it is happening on this cluster and not on others, have you checked the InfiniBand counters to ensure it’s not a bad cable or something along those lines? I believe the command is ibdiag (or something similar). Collin From: users On Behalf Of Bart Willems via users Sent: Thursday, June
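
For reference, the counter-checking utilities on a typical OFED install (named here from memory, as the poster also hedged; availability depends on the installed stack) are along these lines:

    ibqueryerrors        (summarize ports whose error counters exceed thresholds)
    perfquery -a         (dump the performance counters of the local HCA ports)
    ibdiagnet            (full fabric scan; flags bad links and cables)

Symbol-error or link-down counts that keep climbing on one link usually point at exactly the bad-cable scenario suggested above.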