Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Wonderful!  I am happy to confirm that this resolves the issue!

Many thanks to everyone for their assistance,

Collin





Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, that nailed it down - the problem is the number of open file descriptors 
is exceeding your system limit. I suspect the connection to the Mellanox 
drivers is solely due to it also having some descriptors open, and you are just 
close enough to the boundary that it causes you to hit it.

See what you get with "ulimit -a" - you are looking for the line that indicates 
"open files", meaning the max number of open file descriptors you are allowed 
to have. You can also check the system limits with "cat /proc/sys/fs/file-max" 
(this might differ with the flavor of Linux you are using).
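
For example, a minimal sketch of checking and raising the limit from the shell 
that launches mpirun (assuming bash; 65536 is just an example value, and your 
hard limit must allow it):

ulimit -n                   # current soft limit on open file descriptors
cat /proc/sys/fs/file-max   # system-wide maximum number of file handles
ulimit -n 65536             # raise the soft limit for this shell and its children (e.g. mpirun)

To make the change persistent per user, the "nofile" soft/hard entries in 
/etc/security/limits.conf can be raised as well.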

There are a number of solutions - here is an article that explains them: 
https://www.linuxtechi.com/set-ulimit-file-descriptors-limit-linux-servers/

Ralph





Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I agree that it is odd that the issue does not appear until after the Mellanox 
drivers have been installed (and the configure flags set to use them).  As 
requested, here are the results:

Input:  mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 
hostname

Output:
[Gen2Node3:54366] mca: base: components_register: registering framework state 
components
[Gen2Node3:54366] mca: base: components_register: found loaded component orted
[Gen2Node3:54366] mca: base: components_register: component orted has no 
register or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component hnp
[Gen2Node3:54366] mca: base: components_register: component hnp has no register 
or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component tool
[Gen2Node3:54366] mca: base: components_register: component tool has no 
register or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component app
[Gen2Node3:54366] mca: base: components_register: component app has no register 
or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component novm
[Gen2Node3:54366] mca: base: components_register: component novm has no 
register or open function
[Gen2Node3:54366] mca: base: components_open: opening state components
[Gen2Node3:54366] mca: base: components_open: found loaded component orted
[Gen2Node3:54366] mca: base: components_open: component orted open function 
successful
[Gen2Node3:54366] mca: base: components_open: found loaded component hnp
[Gen2Node3:54366] mca: base: components_open: component hnp open function 
successful
[Gen2Node3:54366] mca: base: components_open: found loaded component tool
[Gen2Node3:54366] mca: base: components_open: component tool open function 
successful
[Gen2Node3:54366] mca: base: components_open: found loaded component app
[Gen2Node3:54366] mca: base: components_open: component app open function 
successful
[Gen2Node3:54366] mca: base: components_open: found loaded component novm
[Gen2Node3:54366] mca: base: components_open: component novm open function 
successful
[Gen2Node3:54366] mca:base:select: Auto-selecting state components
[Gen2Node3:54366] mca:base:select:(state) Querying component [orted]
[Gen2Node3:54366] mca:base:select:(state) Querying component [hnp]
[Gen2Node3:54366] mca:base:select:(state) Query of component [hnp] set priority 
to 60
[Gen2Node3:54366] mca:base:select:(state) Querying component [tool]
[Gen2Node3:54366] mca:base:select:(state) Querying component [app]
[Gen2Node3:54366] mca:base:select:(state) Querying component [novm]
[Gen2Node3:54366] mca:base:select:(state) Selected component [hnp]
[Gen2Node3:54366] mca: base: close: component orted closed
[Gen2Node3:54366] mca: base: close: unloading component orted
[Gen2Node3:54366] mca: base: close: component tool closed
[Gen2Node3:54366] mca: base: close: unloading component tool
[Gen2Node3:54366] mca: base: close: component app closed
[Gen2Node3:54366] mca: base: close: unloading component app
[Gen2Node3:54366] mca: base: close: component novm closed
[Gen2Node3:54366] mca: base: close: unloading component novm
[Gen2Node3:54366] ORTE_JOB_STATE_MACHINE:
[Gen2Node3:54366]   State: PENDING INIT cbfunc: DEFINED
[Gen2Node3:54366]   State: INIT_COMPLETE cbfunc: DEFINED
[Gen2Node3:54366]   State: PENDING ALLOCATION cbfunc: DEFINED
[Gen2Node3:54366]   State: ALLOCATION COMPLETE cbfunc: DEFINED
[Gen2Node3:54366]   State: DAEMONS LAUNCHED cbfunc: DEFINED
[Gen2Node3:54366]   State: ALL DAEMONS REPORTED cbfunc: DEFINED
[Gen2Node3:54366]   State: VM READY cbfunc: DEFINED
[Gen2Node3:54366]   State: PENDING MAPPING cbfunc: DEFINED
[Gen2Node3:54366]   State: MAP COMPLETE cbfunc: DEFINED
[Gen2Node3:54366]   State: PENDING FINAL SYSTEM PREP cbfunc: DEFINED
[Gen2Node3:54366]   State: PENDING APP LAUNCH cbfunc: DEFINED
[Gen2Node3:54366]   State: SENDING LAUNCH MSG cbfunc: DEFINED
[Gen2Node3:54366]   State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[Gen2Node3:54366]   State: RUNNING cbfunc: DEFINED
[Gen2Node3:54366]   State: SYNC REGISTERED cbfunc: DEFINED
[Gen2Node3:54366]   State: NORMALLY TERMINATED cbfunc: DEFINED
[Gen2Node3:54366]   State: NOTIFY COMPLETED cbfunc: DEFINED
[Gen2Node3:54366]   State: NOTIFIED cbfunc: DEFINED
[Gen2Node3:54366]   State: ALL JOBS COMPLETE cbfunc: DEFINED
[Gen2Node3:54366]   State: DAEMONS TERMINATED cbfunc: DEFINED
[Gen2Node3:54366]   State: FORCED EXIT cbfunc: DEFINED
[Gen2Node3:54366]   State: REPORT PROGRESS cbfunc: DEFINED
[Gen2Node3:54366] ORTE_PROC_STATE_MACHINE:
[Gen2Node3:54366]   State: RUNNING cbfunc: DEFINED
[Gen2Node3:54366]   State: SYNC REGISTERED cbfunc: DEFINED
[Gen2Node3:54366]   State: IOF COMPLETE cbfunc: DEFINED
[Gen2Node3:54366]   State: WAITPID FIRED cbfunc: DEFINED
[Gen2Node3:54366]   State: NORMALLY TERMINATED cbfunc: DEFINED
[Gen2Node3:54366] mca: base: 

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, debug-daemons isn't going to help as we aren't launching any daemons. 
This is all one node. So try adding "--mca odls_base_verbose 10 --mca 
state_base_verbose 10" to the cmd line and let's see what is going on.

I agree with Josh - neither mpirun nor hostname are invoking the Mellanox 
drivers, so it is hard to see why removing those drivers is allowing this to 
run.



On Jan 28, 2020, at 11:28 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

Same result.  (It works through 102 but not greater than that)

Input: mpirun -np 128 --debug-daemons  --map-by ppr:64:socket  hostname

Output:
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd
[Gen2Node3:54348] [[18008,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:
 Error code: 63
Error name: (null)
Node: Gen2Node3
 when attempting to start process rank 0.
--
128 total processes failed to start
 Collin



Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Same result.  (It works through 102 but not greater than that)

Input: mpirun -np 128 --debug-daemons  --map-by ppr:64:socket  hostname

Output:
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd
[Gen2Node3:54348] [[18008,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start

Collin

From: Joshua Ladd 
Sent: Tuesday, January 28, 2020 2:24 PM
To: Collin Strassburger 
Cc: Open MPI Users ; Ralph Castain 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

OK. Please try:

mpirun -np 128 --debug-daemons  --map-by ppr:64:socket  hostname

Josh

On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname

Output:
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
[Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start

Collin

From: Joshua Ladd mailto:jladd.m...@gmail.com>>
Sent: Tuesday, January 28, 2020 12:48 PM
To: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>
Cc: Open MPI Users mailto:users@lists.open-mpi.org>>; 
Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Sorry, typo, try:

mpirun -np 128 --debug-daemons -mca plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd 
mailto:jladd.m...@gmail.com>> wrote:
And if you try:
mpirun -np 128 --debug-daemons -plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
Input:   mpirun -np 128 --debug-daemons hostname

Output:
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--

Collin

From: Joshua Ladd mailto:jladd.m...@gmail.com>>
Sent: Tuesday, January 28, 2020 12:31 PM
To: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>
Cc: Open MPI Users mailto:users@lists.open-mpi.org>>; 
Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Interesting. Can you try:

mpirun -np 128 --debug-daemons hostname

Josh

On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
In relation to the multi-node attempt, I haven’t set that up yet as the 
per-node configuration doesn’t pass its tests (full node utilization, etc).

Here are the results for the hostname test:
Input: mpirun -np 128 hostname

Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start


Collin


From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd mailto:jladd.m...@gmail.com>>
Cc: Ralph Castain mailto:r...@open-mpi.org>>; Open MPI Users 
mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Josh - if you read thru the thread, you will see that disabling Mellanox/IB 
drivers allows the program to run. It only fails when they are enabled.


On Jan 28, 2020, at 8:49 AM, Joshua Ladd 
mailto:jladd.m...@gmail.com>> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox 
Sof

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
OK. Please try:

mpirun -np 128 --debug-daemons  --map-by ppr:64:socket  hostname

Josh
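
For reference, a binding report can confirm how the 128 ranks actually get 
placed; a minimal sketch using standard mpirun options (--report-bindings, 
--bind-to core) on the same command:

mpirun -np 128 --map-by ppr:64:socket --bind-to core --report-bindings hostname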

On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger <
cstrassbur...@bihrle.com> wrote:

> Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname
>
>
>
> Output:
>
> [Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
>
> [Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
>
> [Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone -
> exiting
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node3
>
>
>
> when attempting to start process rank 0.
>
> --
>
> 128 total processes failed to start
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd 
> *Sent:* Tuesday, January 28, 2020 12:48 PM
> *To:* Collin Strassburger 
> *Cc:* Open MPI Users ; Ralph Castain <
> r...@open-mpi.org>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Sorry, typo, try:
>
>
>
> mpirun -np 128 --debug-daemons -mca plm rsh hostname
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd  wrote:
>
> And if you try:
>
> mpirun -np 128 --debug-daemons -plm rsh hostname
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger <
> cstrassbur...@bihrle.com> wrote:
>
> Input:   mpirun -np 128 --debug-daemons hostname
>
>
>
> Output:
>
> [Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
>
> [Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
>
> [Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone -
> exiting
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node3
>
>
>
> when attempting to start process rank 0.
>
> --
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd 
> *Sent:* Tuesday, January 28, 2020 12:31 PM
> *To:* Collin Strassburger 
> *Cc:* Open MPI Users ; Ralph Castain <
> r...@open-mpi.org>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Interesting. Can you try:
>
>
>
> mpirun -np 128 --debug-daemons hostname
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <
> cstrassbur...@bihrle.com> wrote:
>
> In relation to the multi-node attempt, I haven’t set that up yet as
> the per-node configuration doesn’t pass its tests (full node utilization,
> etc).
>
>
>
> Here are the results for the hostname test:
>
> Input: mpirun -np 128 hostname
>
>
>
> Output:
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node3
>
>
>
> when attempting to start process rank 0.
>
> --
>
> 128 total processes failed to start
>
>
>
>
>
> Collin
>
>
>
>
>
> *From:* users  *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 12:06 PM
> *To:* Joshua Ladd 
> *Cc:* Ralph Castain ; Open MPI Users <
> users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Josh - if you read thru the thread, you will see that disabling
> Mellanox/IB drivers allows the program to run. It only fails when they are
> enabled.
>
>
>
>
>
> On Jan 28, 2020, at 8:49 AM, Joshua Ladd  wrote:
>
>
>
> I don't see how this can be diagnosed as a "problem with the Mellanox
> Software". This is on a single node. What happens when you try to launch on
> more than one node?
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <
> cstrassbur...@bihrle.com> wrote:
>
> Here’s the I/O for these high local core count runs. (“xhpcg” is the
> standard hpcg benchmark)
>

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname

Output:
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
[Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start

Collin

From: Joshua Ladd 
Sent: Tuesday, January 28, 2020 12:48 PM
To: Collin Strassburger 
Cc: Open MPI Users ; Ralph Castain 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Sorry, typo, try:

mpirun -np 128 --debug-daemons -mca plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd 
mailto:jladd.m...@gmail.com>> wrote:
And if you try:
mpirun -np 128 --debug-daemons -plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
Input:   mpirun -np 128 --debug-daemons hostname

Output:
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--

Collin

From: Joshua Ladd mailto:jladd.m...@gmail.com>>
Sent: Tuesday, January 28, 2020 12:31 PM
To: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>
Cc: Open MPI Users mailto:users@lists.open-mpi.org>>; 
Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Interesting. Can you try:

mpirun -np 128 --debug-daemons hostname

Josh

On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
In relation to the multi-node attempt, I haven’t set that up yet as the 
per-node configuration doesn’t pass its tests (full node utilization, etc).

Here are the results for the hostname test:
Input: mpirun -np 128 hostname

Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start


Collin


From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd mailto:jladd.m...@gmail.com>>
Cc: Ralph Castain mailto:r...@open-mpi.org>>; Open MPI Users 
mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Josh - if you read thru the thread, you will see that disabling Mellanox/IB 
drivers allows the program to run. It only fails when they are enabled.


On Jan 28, 2020, at 8:49 AM, Joshua Ladd 
mailto:jladd.m...@gmail.com>> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox 
Software". This is on a single node. What happens when you try to launch on 
more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard 
hpcg benchmark)

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start


Collin

From: Joshua Ladd mailto:jladd.m...@gmail.com>>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>; Ralph Castain 
mailto:r...@open-mpi.org>>; Artem Polyakov 
mailto:art...@mellanox.com>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you send the output of a failed run including your command line.

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users 
mailto:use

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Input:   mpirun -np 128 --debug-daemons hostname

Output:
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - 
exiting
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--

Collin

From: Joshua Ladd 
Sent: Tuesday, January 28, 2020 12:31 PM
To: Collin Strassburger 
Cc: Open MPI Users ; Ralph Castain 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Interesting. Can you try:

mpirun -np 128 --debug-daemons hostname

Josh

On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
In relation to the multi-node attempt, I haven’t set that up yet as the 
per-node configuration doesn’t pass its tests (full node utilization, etc).

Here are the results for the hostname test:
Input: mpirun -np 128 hostname

Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start


Collin


From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd mailto:jladd.m...@gmail.com>>
Cc: Ralph Castain mailto:r...@open-mpi.org>>; Open MPI Users 
mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Josh - if you read thru the thread, you will see that disabling Mellanox/IB 
drivers allows the program to run. It only fails when they are enabled.


On Jan 28, 2020, at 8:49 AM, Joshua Ladd 
mailto:jladd.m...@gmail.com>> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox 
Software". This is on a single node. What happens when you try to launch on 
more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard 
hpcg benchmark)

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start


Collin

From: Joshua Ladd mailto:jladd.m...@gmail.com>>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>; Ralph Castain 
mailto:r...@open-mpi.org>>; Artem Polyakov 
mailto:art...@mellanox.com>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you send the output of a failed run including your command line.

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users 
mailto:users@lists.open-mpi.org>> wrote:
Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:

I just tried that and it does indeed work with pbs and without Mellanox (until 
a reboot makes it complain about Mellanox/IB related defaults as no drivers 
were installed, etc).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx 
--with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.


Collin

From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.


On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users 
mailto:users@lists.open-mpi.org>> wrote:

Hello,

I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.

I am have done two runs with Mellanox and pbs installed.  O

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Interesting. Can you try:

mpirun -np 128 --debug-daemons hostname

Josh

On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <
cstrassbur...@bihrle.com> wrote:

> In relation to the multi-node attempt, I haven’t set that up yet as
> the per-node configuration doesn’t pass its tests (full node utilization,
> etc).
>
>
>
> Here are the results for the hostname test:
>
> Input: mpirun -np 128 hostname
>
>
>
> Output:
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node3
>
>
>
> when attempting to start process rank 0.
>
> --
>
> 128 total processes failed to start
>
>
>
>
>
> Collin
>
>
>
>
>
> *From:* users  *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 12:06 PM
> *To:* Joshua Ladd 
> *Cc:* Ralph Castain ; Open MPI Users <
> users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Josh - if you read thru the thread, you will see that disabling
> Mellanox/IB drivers allows the program to run. It only fails when they are
> enabled.
>
>
>
>
>
> On Jan 28, 2020, at 8:49 AM, Joshua Ladd  wrote:
>
>
>
> I don't see how this can be diagnosed as a "problem with the Mellanox
> Software". This is on a single node. What happens when you try to launch on
> more than one node?
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <
> cstrassbur...@bihrle.com> wrote:
>
> Here’s the I/O for these high local core count runs. (“xhpcg” is the
> standard hpcg benchmark)
>
>
>
> Run command: mpirun -np 128 bin/xhpcg
>
> Output:
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node4
>
>
>
> when attempting to start process rank 0.
>
> ----------------------------------
>
> 128 total processes failed to start
>
>
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd 
> *Sent:* Tuesday, January 28, 2020 11:39 AM
> *To:* Open MPI Users 
> *Cc:* Collin Strassburger ; Ralph Castain <
> r...@open-mpi.org>; Artem Polyakov 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Can you send the output of a failed run including your command line.
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
> users@lists.open-mpi.org> wrote:
>
> Okay, so this is a problem with the Mellanox software - copying Artem.
>
>
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger 
> wrote:
>
>
>
> I just tried that and it does indeed work with pbs and without Mellanox
> (until a reboot makes it complain about Mellanox/IB related defaults as no
> drivers were installed, etc).
>
>
>
> After installing the Mellanox drivers, I used
>
> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
> --with-platform=contrib/platform/mellanox/optimized
>
>
>
> With the new compile it fails on the higher core counts.
>
>
>
>
>
> Collin
>
>
>
> *From:* users  *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 11:02 AM
> *To:* Open MPI Users 
> *Cc:* Ralph Castain 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Does it work with pbs but not Mellanox? Just trying to isolate the problem.
>
>
>
>
>
> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
>
>
> Hello,
>
>
>
> I have done some additional testing and I can say that it works correctly
> with gcc8 and no mellanox or pbs installed.
>
>
>
> I have done two runs with Mellanox and pbs installed.  One run includes
> the actual run options I will be using while the other includes a truncated
> set which still compiles but fails to execute correctly.  As the option
> with the actual run options results in a smaller config log, I am including
> it here.
>
>
>
> Ve

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
In relation to the multi-node attempt, I haven’t set that up yet as the 
per-node configuration doesn’t pass its tests (full node utilization, etc).

Here are the results for the hostname test:
Input: mpirun -np 128 hostname

Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start


Collin


From: users  On Behalf Of Ralph Castain via 
users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd 
Cc: Ralph Castain ; Open MPI Users 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Josh - if you read thru the thread, you will see that disabling Mellanox/IB 
drivers allows the program to run. It only fails when they are enabled.



On Jan 28, 2020, at 8:49 AM, Joshua Ladd 
mailto:jladd.m...@gmail.com>> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox 
Software". This is on a single node. What happens when you try to launch on 
more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard 
hpcg benchmark)

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start


Collin

From: Joshua Ladd mailto:jladd.m...@gmail.com>>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>; Ralph Castain 
mailto:r...@open-mpi.org>>; Artem Polyakov 
mailto:art...@mellanox.com>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you send the output of a failed run including your command line.

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users 
mailto:users@lists.open-mpi.org>> wrote:
Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:

I just tried that and it does indeed work with pbs and without Mellanox (until 
a reboot makes it complain about Mellanox/IB related defaults as no drivers 
were installed, etc).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx 
--with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.


Collin

From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.


On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users 
mailto:users@lists.open-mpi.org>> wrote:

Hello,

I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.

I have done two runs with Mellanox and pbs installed.  One run includes the 
actual run options I will be using while the other includes a truncated set 
which still compiles but fails to execute correctly.  As the option with the 
actual run options results in a smaller config log, I am including it here.

Version: 4.0.2
The config log is available at 
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the 
ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (which is not being explicitly operated across):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
Ulimit -l = unlimited
Ibv_devinfo:
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.42.5000
…
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id:   MT_1100120019
phys_port_cnt:  1
Device ports:
port:   1
   

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Josh - if you read thru the thread, you will see that disabling Mellanox/IB 
drivers allows the program to run. It only fails when they are enabled.


On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox 
Software". This is on a single node. What happens when you try to launch on 
more than one node? 

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard 
hpcg benchmark)

 
Run command: mpirun -np 128 bin/xhpcg

Output:

--

mpirun was unable to start the specified application as it encountered an

error:

 
Error code: 63

Error name: (null)

Node: Gen2Node4

 
when attempting to start process rank 0.

--

128 total processes failed to start

 
 
Collin

 
From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

 
Can you send the output of a failed run including your command line. 

 
Josh

 
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users 
<users@lists.open-mpi.org> wrote:

Okay, so this is a problem with the Mellanox software - copying Artem.




On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

 
I just tried that and it does indeed work with pbs and without Mellanox (until 
a reboot makes it complain about Mellanox/IB related defaults as no drivers 
were installed, etc).

 
After installing the Mellanox drivers, I used

./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx 
--with-platform=contrib/platform/mellanox/optimized

 
With the new compile it fails on the higher core counts.

 
 
Collin

 
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

 
Does it work with pbs but not Mellanox? Just trying to isolate the problem.

 
 
On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users 
<users@lists.open-mpi.org> wrote:

 
Hello,

 
I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.

 
I have done two runs with Mellanox and pbs installed.  One run includes the 
actual run options I will be using while the other includes a truncated set 
which still compiles but fails to execute correctly.  As the option with the 
actual run options results in a smaller config log, I am including it here.

 
Version: 4.0.2

The config log is available at 
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the 
ompi dump is available at https://pastebin.com/md3HwTUR.

 
The IB network information (which is not being explicitly operated across):

Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)

Ulimit -l = unlimited

Ibv_devinfo:

hca_id: mlx4_0

    transport:  InfiniBand (0)

    fw_ver: 2.42.5000

…

    vendor_id:  0x02c9

    vendor_part_id: 4099

    hw_ver: 0x1

    board_id:   MT_1100120019

    phys_port_cnt:  1

    Device ports:

    port:   1

    state:  PORT_ACTIVE (4)

    max_mtu:    4096 (5)

    active_mtu: 4096 (5)

    sm_lid: 1

    port_lid:   12

    port_lmc:   0x00

    link_layer:     InfiniBand

It looks like the rest of the IB information is in the config file.

 
I hope this helps,

Collin

 
 
 
From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 o

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Also, can you try running:

mpirun -np 128 hostname

Josh

On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd  wrote:

> I don't see how this can be diagnosed as a "problem with the Mellanox
> Software". This is on a single node. What happens when you try to launch on
> more than one node?
>
> Josh
>
> On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <
> cstrassbur...@bihrle.com> wrote:
>
>> Here’s the I/O for these high local core count runs. (“xhpcg” is the
>> standard hpcg benchmark)
>>
>>
>>
>> Run command: mpirun -np 128 bin/xhpcg
>>
>> Output:
>>
>> --
>>
>> mpirun was unable to start the specified application as it encountered an
>>
>> error:
>>
>>
>>
>> Error code: 63
>>
>> Error name: (null)
>>
>> Node: Gen2Node4
>>
>>
>>
>> when attempting to start process rank 0.
>>
>> --
>>
>> 128 total processes failed to start
>>
>>
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* Joshua Ladd 
>> *Sent:* Tuesday, January 28, 2020 11:39 AM
>> *To:* Open MPI Users 
>> *Cc:* Collin Strassburger ; Ralph Castain <
>> r...@open-mpi.org>; Artem Polyakov 
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Can you send the output of a failed run including your command line.
>>
>>
>>
>> Josh
>>
>>
>>
>> On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
>> users@lists.open-mpi.org> wrote:
>>
>> Okay, so this is a problem with the Mellanox software - copying Artem.
>>
>>
>>
>> On Jan 28, 2020, at 8:15 AM, Collin Strassburger <
>> cstrassbur...@bihrle.com> wrote:
>>
>>
>>
>> I just tried that and it does indeed work with pbs and without Mellanox
>> (until a reboot makes it complain about Mellanox/IB related defaults as no
>> drivers were installed, etc).
>>
>>
>>
>> After installing the Mellanox drivers, I used
>>
>> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
>> --with-platform=contrib/platform/mellanox/optimized
>>
>>
>>
>> With the new compile it fails on the higher core counts.
>>
>>
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* users  *On Behalf Of *Ralph
>> Castain via users
>> *Sent:* Tuesday, January 28, 2020 11:02 AM
>> *To:* Open MPI Users 
>> *Cc:* Ralph Castain 
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Does it work with pbs but not Mellanox? Just trying to isolate the
>> problem.
>>
>>
>>
>>
>>
>> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <
>> users@lists.open-mpi.org> wrote:
>>
>>
>>
>> Hello,
>>
>>
>>
>> I have done some additional testing and I can say that it works correctly
>> with gcc8 and no mellanox or pbs installed.
>>
>>
>>
>> I have done two runs with Mellanox and pbs installed.  One run
>> includes the actual run options I will be using while the other includes a
>> truncated set which still compiles but fails to execute correctly.  As the
>> option with the actual run options results in a smaller config log, I am
>> including it here.
>>
>>
>>
>> Version: 4.0.2
>>
>> The config log is available at
>> https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and
>> the ompi dump is available at https://pastebin.com/md3HwTUR.
>>
>>
>>
>> The IB network information (which is not being explicitly operated
>> across):
>>
>> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions
>> (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and
>> hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
>>
>> Ulimit -l = unlimited
>>
>> Ibv_devinfo:
>>
>> hca_id: mlx4_0
>>
>> transport:  InfiniBand (0)
>>
>> fw_ver: 2.42.5000
>>
>> …
>>
>> vendor_id:          0x02c9
>>
>>     vendor_part_id:     4099
>>
>>   

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
I don't see how this can be diagnosed as a "problem with the Mellanox
Software". This is on a single node. What happens when you try to launch on
more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <
cstrassbur...@bihrle.com> wrote:

> Here’s the I/O for these high local core count runs. (“xhpcg” is the
> standard hpcg benchmark)
>
>
>
> Run command: mpirun -np 128 bin/xhpcg
>
> Output:
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node4
>
>
>
> when attempting to start process rank 0.
>
> --
>
> 128 total processes failed to start
>
>
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd 
> *Sent:* Tuesday, January 28, 2020 11:39 AM
> *To:* Open MPI Users 
> *Cc:* Collin Strassburger ; Ralph Castain <
> r...@open-mpi.org>; Artem Polyakov 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Can you send the output of a failed run including your command line.
>
>
>
> Josh
>
>
>
> On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
> users@lists.open-mpi.org> wrote:
>
> Okay, so this is a problem with the Mellanox software - copying Artem.
>
>
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger 
> wrote:
>
>
>
> I just tried that and it does indeed work with pbs and without Mellanox
> (until a reboot makes it complain about Mellanox/IB related defaults as no
> drivers were installed, etc).
>
>
>
> After installing the Mellanox drivers, I used
>
> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
> --with-platform=contrib/platform/mellanox/optimized
>
>
>
> With the new compile it fails on the higher core counts.
>
>
>
>
>
> Collin
>
>
>
> *From:* users  *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 11:02 AM
> *To:* Open MPI Users 
> *Cc:* Ralph Castain 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
>
> Does it work with pbs but not Mellanox? Just trying to isolate the problem.
>
>
>
>
>
> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
>
>
> Hello,
>
>
>
> I have done some additional testing and I can say that it works correctly
> with gcc8 and no mellanox or pbs installed.
>
>
>
> I have done two runs with Mellanox and pbs installed.  One run includes
> the actual run options I will be using while the other includes a truncated
> set which still compiles but fails to execute correctly.  As the option
> with the actual run options results in a smaller config log, I am including
> it here.
>
>
>
> Version: 4.0.2
>
> The config log is available at
> https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and
> the ompi dump is available at https://pastebin.com/md3HwTUR.
>
>
>
> The IB network information (which is not being explicitly operated across):
>
> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions
> (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and
> hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
>
> Ulimit -l = unlimited
>
> Ibv_devinfo:
>
> hca_id: mlx4_0
>
> transport:  InfiniBand (0)
>
> fw_ver: 2.42.5000
>
> …
>
> vendor_id:  0x02c9
>
> vendor_part_id: 4099
>
> hw_ver: 0x1
>
> board_id:   MT_1100120019
>
> phys_port_cnt:  1
>
> Device ports:
>
> port:   1
>
> state:  PORT_ACTIVE (4)
>
> max_mtu:4096 (5)
>
>             active_mtu:         4096 (5)
>
>             sm_lid:         1
>
> port_lid:   12
>
> port_lmc:   0x00
>
> link_layer: InfiniBand
>
> It looks like the rest of the IB information is in the config file.
>
>
>
> I hope this helps,
>
> Collin
>
>
>
>
>
&

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard 
hpcg benchmark)

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start


Collin

From: Joshua Ladd 
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users 
Cc: Collin Strassburger ; Ralph Castain 
; Artem Polyakov 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you send the output of a failed run including your command line.

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users 
mailto:users@lists.open-mpi.org>> wrote:
Okay, so this is a problem with the Mellanox software - copying Artem.


On Jan 28, 2020, at 8:15 AM, Collin Strassburger 
mailto:cstrassbur...@bihrle.com>> wrote:

I just tried that and it does indeed work with pbs and without Mellanox (until 
a reboot makes it complain about Mellanox/IB related defaults as no drivers 
were installed, etc).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx 
--with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.


Collin

From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.


On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users 
mailto:users@lists.open-mpi.org>> wrote:

Hello,

I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.

I have done two runs with Mellanox and pbs installed.  One run includes the 
actual run options I will be using while the other includes a truncated set 
which still compiles but fails to execute correctly.  As the option with the 
actual run options results in a smaller config log, I am including it here.

Version: 4.0.2
The config log is available at 
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the 
ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (which is not being explicitly operated across):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
Ulimit -l = unlimited
Ibv_devinfo:
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.42.5000
…
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id:   MT_1100120019
phys_port_cnt:  1
Device ports:
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   12
port_lmc:   0x00
link_layer: InfiniBand
It looks like the rest of the IB information is in the config file.

I hope this helps,
Collin



From: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List 
mailto:users@lists.open-mpi.org>>
Cc: Collin Strassburger 
mailto:cstrassbur...@bihrle.com>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you please send all the information listed here:

https://www.open-mpi.org/community/help/

Thanks!



On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users 
mailto:users@lists.open-mpi.org>> wrote:

Hello,

I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.
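
A minimal sketch of double-checking that topology (assuming util-linux's lscpu; 
the expected values are the 2 sockets x 64 cores x 2 threads described above):

lscpu | grep -E 'Socket|Core|Thread|^CPU\(s\)'
# Socket(s): 2, Core(s) per socket: 64, Thread(s) per core: 2, CPU(s): 256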

Collin

From: users 
mailto:users-boun...@lists.open-mpi.org>> On 
Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org<mai

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Can you send the output of a failed run including your command line.

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Okay, so this is a problem with the Mellanox software - copying Artem.
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger 
> wrote:
>
> I just tried that and it does indeed work with pbs and without Mellanox
> (until a reboot makes it complain about Mellanox/IB related defaults as no
> drivers were installed, etc).
>
> After installing the Mellanox drivers, I used
> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
> --with-platform=contrib/platform/mellanox/optimized
>
> With the new compile it fails on the higher core counts.
>
>
> Collin
>
> *From:* users  *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 11:02 AM
> *To:* Open MPI Users 
> *Cc:* Ralph Castain 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
> Does it work with pbs but not Mellanox? Just trying to isolate the problem.
>
>
>
> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I have done some additional testing and I can say that it works correctly
> with gcc8 and no mellanox or pbs installed.
>
> I have done two runs with Mellanox and pbs installed.  One run includes
> the actual run options I will be using while the other includes a truncated
> set which still compiles but fails to execute correctly.  As the option
> with the actual run options results in a smaller config log, I am including
> it here.
>
> Version: 4.0.2
> The config log is available at
> https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and
> the ompi dump is available at https://pastebin.com/md3HwTUR.
>
> The IB network information (which is not being explicitly operated across):
> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions
> (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and
> hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
> Ulimit -l = unlimited
> Ibv_devinfo:
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.42.5000
> …
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x1
> board_id:   MT_1100120019
> phys_port_cnt:  1
> Device ports:
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid:   12
> port_lmc:   0x00
> link_layer: InfiniBand
> It looks like the rest of the IB information is in the config file.
>
> I hope this helps,
> Collin
>
>
>
> *From:* Jeff Squyres (jsquyres) 
> *Sent:* Monday, January 27, 2020 3:40 PM
> *To:* Open MPI User's List 
> *Cc:* Collin Strassburger 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
> Can you please send all the information listed here:
>
> https://www.open-mpi.org/community/help/
>
> Thanks!
>
>
>
>
> On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I had initially thought the same thing about the streams, but I have 2
> sockets with 64 cores each.  Additionally, I have not yet turned
> multithreading off, so lscpu reports a total of 256 logical cores and 128
> physical cores.  As such, I don’t see how it could be running out of
> streams unless something is being passed incorrectly.
>
> Collin
>
> *From:* users  *On Behalf Of *Ray
> Sheppard via users
> *Sent:* Monday, January 27, 2020 11:53 AM
> *To:* users@lists.open-mpi.org
> *Cc:* Ray Sheppard 
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
> Hi All,
>   Just my two cents, I think error code 63 is saying it is running out of
> streams to use.  I think you have only 64 cores, so at 100, you are
> overloading most of them.  It feels like you are running out of resources
> trying to swap in and out ranks on physical cores.
>Ray
> On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
>

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

I just tried that and it does indeed work with pbs and without Mellanox (until 
a reboot makes it complain about Mellanox/IB related defaults as no drivers 
were installed, etc).
After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx 
--with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.

On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users 
mailto:users@lists.open-mpi.org> > wrote:
 Hello,
 I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.
 I am have done two runs with Mellanox and pbs installed.  One run includes the 
actual run options I will be using while the other includes a truncated set 
which still compiles but fails to execute correctly.  As the option with the 
actual run options results in a smaller config log, I am including it here.
 Version: 4.0.2
The config log is available at 
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the 
ompi dump is available at https://pastebin.com/md3HwTUR.
 The IB network information (which is not being explicitly operated across):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
Ulimit -l = unlimited
Ibv_devinfo:
hca_id: mlx4_0
    transport:  InfiniBand (0)
    fw_ver: 2.42.5000
…
    vendor_id:  0x02c9
    vendor_part_id: 4099
    hw_ver: 0x1
    board_id:   MT_1100120019
    phys_port_cnt:  1
    Device ports:
    port:   1
    state:  PORT_ACTIVE (4)
    max_mtu:    4096 (5)
    active_mtu: 4096 (5)
    sm_lid: 1
    port_lid:   12
    port_lmc:   0x00
    link_layer:     InfiniBand
It looks like the rest of the IB information is in the config file.
 I hope this helps,
Collin
From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node
 Can you please send all the information listed here:
     https://www.open-mpi.org/community/help/
 Thanks!
 


On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
 Hello,
 I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.
 Collin
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node
 Hi All,
  Just my two cents, I think error code 63 is saying it is running out of 
streams to use.  I think you have only 64 cores, so at 100, you are overloading 
most of them.  It feels like you are running out of resources trying to swap in 
and out ranks on physical cores.  
   Ray

On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
 Hello Howard,
 To remove potential interactions, I have found that the issue persists without 
ucx and hcoll support.
 Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I just tried that and it does indeed work with pbs and without Mellanox (until 
a reboot makes it complain about Mellanox/IB related defaults as no drivers 
were installed, etc).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx 
--with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.
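
For reference, a quick way to double-check what actually made it into a given build 
(the component names below are the usual ones for UCX and the PBS "tm" launcher, so 
treat this as a sketch rather than a definitive check):

# confirm which install mpirun resolves to, and its version
which mpirun && mpirun --version
# the UCX pml and the tm plm/ras components should show up if they were compiled in
ompi_info | grep -i ucx
ompi_info | grep -i " tm "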


Collin

From: users  On Behalf Of Ralph Castain via 
users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.



On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.

I have done two runs with Mellanox and pbs installed.  One run uses the actual 
run options I will be using, while the other uses a truncated set which still 
compiles but fails to execute correctly.  As the run with the actual options 
produces a smaller config log, I am including it here.

Version: 4.0.2
The config log is available at 
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the 
ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (which is not being explicitly operated across):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
Ulimit -l = unlimited
Ibv_devinfo:
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.42.5000
…
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id:   MT_1100120019
phys_port_cnt:  1
Device ports:
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   12
port_lmc:   0x00
link_layer: InfiniBand
It looks like the rest of the IB information is in the config file.
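
For anyone collecting the same diagnostics on their own nodes, the usual commands 
would be something like the following (ofed_info ships with MLNX_OFED; a sketch, not 
an exhaustive list):

# locked-memory limit, HCA/port details, and installed OFED version
ulimit -l
ibv_devinfo
ofed_info -s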

I hope this helps,
Collin



From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you please send all the information listed here:

https://www.open-mpi.org/community/help/

Thanks!




On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Hi All,
  Just my two cents, I think error code 63 is saying it is running out of 
streams to use.  I think you have only 64 cores, so at 100, you are overloading 
most of them.  It feels like you are running out of resources trying to swap in 
and out ranks on physical cores.
   Ray
On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:

Hello Howard,

To remove potential interactions, I have found that the issue persists without 
ucx and hcoll support.

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start

It returns this error for any pr

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Does it work with pbs but not Mellanox? Just trying to isolate the problem.


On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,
 I have done some additional testing and I can say that it works correctly with 
gcc8 and no mellanox or pbs installed.
I have done two runs with Mellanox and pbs installed.  One run uses the actual 
run options I will be using, while the other uses a truncated set which still 
compiles but fails to execute correctly.  As the run with the actual options 
produces a smaller config log, I am including it here.
 Version: 4.0.2
The config log is available at 
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the 
ompi dump is available at https://pastebin.com/md3HwTUR.
 The IB network information (which is not being explicitly operated across):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
Ulimit -l = unlimited
Ibv_devinfo:
hca_id: mlx4_0
    transport:  InfiniBand (0)
    fw_ver: 2.42.5000
…
    vendor_id:  0x02c9
    vendor_part_id: 4099
    hw_ver: 0x1
    board_id:   MT_1100120019
    phys_port_cnt:  1
    Device ports:
    port:   1
    state:  PORT_ACTIVE (4)
    max_mtu:    4096 (5)
    active_mtu: 4096 (5)
    sm_lid: 1
    port_lid:   12
    port_lmc:   0x00
    link_layer:     InfiniBand
It looks like the rest of the IB information is in the config file.
 I hope this helps,
Collin
From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node
 Can you please send all the information listed here:
     https://www.open-mpi.org/community/help/
 Thanks!
 

On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
 Hello,
 I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.
 Collin
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node
 Hi All,
  Just my two cents, I think error code 63 is saying it is running out of 
streams to use.  I think you have only 64 cores, so at 100, you are overloading 
most of them.  It feels like you are running out of resources trying to swap in 
and out ranks on physical cores.  
   Ray

On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
 Hello Howard,
 To remove potential interactions, I have found that the issue persists without 
ucx and hcoll support.
 Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:
 Error code: 63
Error name: (null)
Node: Gen2Node4
 when attempting to start process rank 0.
--
128 total processes failed to start
It returns this error for any job I launch with >100 processes per node.  I get 
the same error message for multiple different codes, so the error is MPI related 
rather than program specific.
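
Since the failure tracks the rank count rather than the application, one way to pin 
down the exact threshold is a small sweep with a trivial stand-in command (hostname 
here purely as a placeholder):

# step the per-node rank count up and note where error 63 first appears
for n in 96 100 104 112 120 128; do
    echo "=== np = $n ==="
    mpirun -np $n hostname > /dev/null || echo "failed at np=$n"
done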
 Collin
From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ 
processors per node
Hello Collin,
Could you provide more information about the error?  Is there any output from
either Open MPI or, maybe, UCX, that could prov

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Jeff Squyres (jsquyres) via users
Can you please send all the information listed here:

https://www.open-mpi.org/community/help/

Thanks!


On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Hi All,
  Just my two cents, I think error code 63 is saying it is running out of 
streams to use.  I think you have only 64 cores, so at 100, you are overloading 
most of them.  It feels like you are running out of resources trying to swap in 
and out ranks on physical cores.
   Ray
On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:

Hello Howard,

To remove potential interactions, I have found that the issue persists without 
ucx and hcoll support.

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start

It returns this error for any job I launch with >100 processes per node.  I get 
the same error message for multiple different codes, so the error is MPI related 
rather than program specific.

Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ 
processors per node

Hello Collin,

Could you provide more information about the error?  Is there any output from
either Open MPI or, maybe, UCX, that could provide more information about the 
problem you are hitting?

Howard


On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
Hello,

I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5.  Both of these 
versions cause the same error (error code 63) when utilizing more than 100 
cores on a single node.  The processors I am utilizing are AMD Epyc “Rome” 
7742s.  The OS is CentOS 8.1.  I have tried compiling with both the default gcc 
8 and locally compiled gcc 9.  I have already tried modifying the maximum name 
field values with no success.

My compile options are:
./configure
 --prefix=${HPCX_HOME}/ompi
 --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,
Collin

Collin Strassburger
Bihrle Applied Research Inc.




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Collin Strassburger via users
Hello,

I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.
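
For anyone double-checking the core math, the counts can be read straight from lscpu 
(field names as printed by util-linux lscpu; a rough sketch):

# Socket(s) x Core(s) per socket = physical cores; x Thread(s) per core = logical CPUs
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|CPU\(s\)):'
# here: 2 sockets x 64 cores = 128 physical cores, x 2 threads = 256 logical CPUs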

Collin

From: users  On Behalf Of Ray Sheppard via 
users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard 
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Hi All,
  Just my two cents, I think error code 63 is saying it is running out of 
streams to use.  I think you have only 64 cores, so at 100, you are overloading 
most of them.  It feels like you are running out of resources trying to swap in 
and out ranks on physical cores.
   Ray
On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:

Hello Howard,

To remove potential interactions, I have found that the issue persists without 
ucx and hcoll support.

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start

It returns this error for any job I launch with >100 processes per node.  I get 
the same error message for multiple different codes, so the error is MPI related 
rather than program specific.

Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ 
processors per node

Hello Collin,

Could you provide more information about the error?  Is there any output from
either Open MPI or, maybe, UCX, that could provide more information about the 
problem you are hitting?

Howard


On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
Hello,

I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5.  Both of these 
versions cause the same error (error code 63) when utilizing more than 100 
cores on a single node.  The processors I am utilizing are AMD Epyc “Rome” 
7742s.  The OS is CentOS 8.1.  I have tried compiling with both the default gcc 
8 and locally compiled gcc 9.  I have already tried modifying the maximum name 
field values with no success.

My compile options are:
./configure
 --prefix=${HPCX_HOME}/ompi
 --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,
Collin

Collin Strassburger
Bihrle Applied Research Inc.




Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Ray Sheppard via users

Hi All,
  Just my two cents, I think error code 63 is saying it is running out 
of streams to use.  I think you have only 64 cores, so at 100, you are 
overloading most of them.  It feels like you are running out of 
resources trying to swap in and out ranks on physical cores.
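
One way to test the oversubscription idea directly (flags as in recent Open MPI 
4.0.x releases, so take this as a sketch) would be:

# show how ranks get mapped and bound; mpirun complains explicitly if slots run out
mpirun -np 128 --report-bindings hostname
# deliberately allow more ranks than detected slots, to rule oversubscription in or out
mpirun -np 200 --oversubscribe hostname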

   Ray

On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:


Hello Howard,

To remove potential interactions, I have found that the issue persists 
without ucx and hcoll support.


Run command: mpirun -np 128 bin/xhpcg

Output:

--

mpirun was unable to start the specified application as it encountered an

error:

Error code: 63

Error name: (null)

Node: Gen2Node4

when attempting to start process rank 0.

--

128 total processes failed to start

It returns this error for any job I launch with >100 processes per node.  I get 
the same error message for multiple different codes, so the error is MPI related 
rather than program specific.


Collin

*From:* Howard Pritchard 
*Sent:* Monday, January 27, 2020 11:20 AM
*To:* Open MPI Users 
*Cc:* Collin Strassburger 
*Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node


Hello Collin,

Could you provide more information about the error?  Is there any
output from either Open MPI or, maybe, UCX, that could provide more 
information about the problem you are hitting?


Howard

On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:


Hello,

I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. 
Both of these versions cause the same error (error code 63) when
utilizing more than 100 cores on a single node.  The processors I
am utilizing are AMD Epyc “Rome” 7742s.  The OS is CentOS 8.1.  I
have tried compiling with both the default gcc 8 and locally
compiled gcc 9.  I have already tried modifying the maximum name
field values with no success.

My compile options are:

./configure

--prefix=${HPCX_HOME}/ompi

--with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,

Collin

Collin Strassburger

Bihrle Applied Research Inc.