Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Wonderful! I am happy to confirm that this resolves the issue! Many thanks to everyone for their assistance, Collin
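For anyone who lands on this thread later: the descriptor limit that mpirun's children inherit can also be checked and raised programmatically. A minimal sketch (Unix only; the numbers printed will vary by system):

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors -
# the limit that error 63 traced back to in this thread.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-files soft limit: {soft}, hard limit: {hard}")

# An unprivileged process may raise its own soft limit up to the hard limit:
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```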
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Okay, that nailed it down - the problem is that the number of open file descriptors is exceeding your system limit. I suspect the connection to the Mellanox drivers is solely due to them also having some descriptors open, and you are just close enough to the boundary that it causes you to hit it.

See what you get with "ulimit -a" - you are looking for a line that indicates "open files", meaning the max number of open file descriptors you are allowed to have. You can also check the system limits with "cat /proc/sys/fs/file-max" (this might differ with the flavor of Linux you are using).

There are a number of solutions - here is an article that explains them: https://www.linuxtechi.com/set-ulimit-file-descriptors-limit-linux-servers/

Ralph
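The checks Ralph describes can be run directly in a shell. A quick sketch (the specific values you see are system-dependent; defaults vary by distribution):

```shell
# Per-process limits on open file descriptors (inherited by mpirun
# and every rank it forks):
ulimit -Sn          # soft limit - the one a process actually hits
ulimit -Hn          # hard limit - the ceiling a non-root user may raise to

# System-wide ceiling across all processes:
cat /proc/sys/fs/file-max

# Raise the soft limit for the current shell session (and anything
# launched from it) up to the hard limit:
ulimit -n "$(ulimit -Hn)"
```

A per-session `ulimit -n` change is lost at logout; making it permanent requires `/etc/security/limits.conf` (or a systemd `LimitNOFILE=` setting), as the linked article describes.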
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
I agree that it is odd that the issue does not appear until after the Mellanox drivers have been installed (and the configure flags set to use them). As requested, here are the results:

Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 hostname

Output:
[Gen2Node3:54366] mca: base: components_register: registering framework state components
[Gen2Node3:54366] mca: base: components_register: found loaded component orted
[Gen2Node3:54366] mca: base: components_register: component orted has no register or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component hnp
[Gen2Node3:54366] mca: base: components_register: component hnp has no register or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component tool
[Gen2Node3:54366] mca: base: components_register: component tool has no register or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component app
[Gen2Node3:54366] mca: base: components_register: component app has no register or open function
[Gen2Node3:54366] mca: base: components_register: found loaded component novm
[Gen2Node3:54366] mca: base: components_register: component novm has no register or open function
[Gen2Node3:54366] mca: base: components_open: opening state components
[Gen2Node3:54366] mca: base: components_open: found loaded component orted
[Gen2Node3:54366] mca: base: components_open: component orted open function successful
[Gen2Node3:54366] mca: base: components_open: found loaded component hnp
[Gen2Node3:54366] mca: base: components_open: component hnp open function successful
[Gen2Node3:54366] mca: base: components_open: found loaded component tool
[Gen2Node3:54366] mca: base: components_open: component tool open function successful
[Gen2Node3:54366] mca: base: components_open: found loaded component app
[Gen2Node3:54366] mca: base: components_open: component app open function successful
[Gen2Node3:54366] mca: base: components_open: found loaded component novm
[Gen2Node3:54366] mca: base: components_open: component novm open function successful
[Gen2Node3:54366] mca:base:select: Auto-selecting state components
[Gen2Node3:54366] mca:base:select:(state) Querying component [orted]
[Gen2Node3:54366] mca:base:select:(state) Querying component [hnp]
[Gen2Node3:54366] mca:base:select:(state) Query of component [hnp] set priority to 60
[Gen2Node3:54366] mca:base:select:(state) Querying component [tool]
[Gen2Node3:54366] mca:base:select:(state) Querying component [app]
[Gen2Node3:54366] mca:base:select:(state) Querying component [novm]
[Gen2Node3:54366] mca:base:select:(state) Selected component [hnp]
[Gen2Node3:54366] mca: base: close: component orted closed
[Gen2Node3:54366] mca: base: close: unloading component orted
[Gen2Node3:54366] mca: base: close: component tool closed
[Gen2Node3:54366] mca: base: close: unloading component tool
[Gen2Node3:54366] mca: base: close: component app closed
[Gen2Node3:54366] mca: base: close: unloading component app
[Gen2Node3:54366] mca: base: close: component novm closed
[Gen2Node3:54366] mca: base: close: unloading component novm
[Gen2Node3:54366] ORTE_JOB_STATE_MACHINE:
[Gen2Node3:54366] State: PENDING INIT cbfunc: DEFINED
[Gen2Node3:54366] State: INIT_COMPLETE cbfunc: DEFINED
[Gen2Node3:54366] State: PENDING ALLOCATION cbfunc: DEFINED
[Gen2Node3:54366] State: ALLOCATION COMPLETE cbfunc: DEFINED
[Gen2Node3:54366] State: DAEMONS LAUNCHED cbfunc: DEFINED
[Gen2Node3:54366] State: ALL DAEMONS REPORTED cbfunc: DEFINED
[Gen2Node3:54366] State: VM READY cbfunc: DEFINED
[Gen2Node3:54366] State: PENDING MAPPING cbfunc: DEFINED
[Gen2Node3:54366] State: MAP COMPLETE cbfunc: DEFINED
[Gen2Node3:54366] State: PENDING FINAL SYSTEM PREP cbfunc: DEFINED
[Gen2Node3:54366] State: PENDING APP LAUNCH cbfunc: DEFINED
[Gen2Node3:54366] State: SENDING LAUNCH MSG cbfunc: DEFINED
[Gen2Node3:54366] State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[Gen2Node3:54366] State: RUNNING cbfunc: DEFINED
[Gen2Node3:54366] State: SYNC REGISTERED cbfunc: DEFINED
[Gen2Node3:54366] State: NORMALLY TERMINATED cbfunc: DEFINED
[Gen2Node3:54366] State: NOTIFY COMPLETED cbfunc: DEFINED
[Gen2Node3:54366] State: NOTIFIED cbfunc: DEFINED
[Gen2Node3:54366] State: ALL JOBS COMPLETE cbfunc: DEFINED
[Gen2Node3:54366] State: DAEMONS TERMINATED cbfunc: DEFINED
[Gen2Node3:54366] State: FORCED EXIT cbfunc: DEFINED
[Gen2Node3:54366] State: REPORT PROGRESS cbfunc: DEFINED
[Gen2Node3:54366] ORTE_PROC_STATE_MACHINE:
[Gen2Node3:54366] State: RUNNING cbfunc: DEFINED
[Gen2Node3:54366] State: SYNC REGISTERED cbfunc: DEFINED
[Gen2Node3:54366] State: IOF COMPLETE cbfunc: DEFINED
[Gen2Node3:54366] State: WAITPID FIRED cbfunc: DEFINED
[Gen2Node3:54366] State: NORMALLY TERMINATED cbfunc: DEFINED
[Gen2Node3:54366] mca: base:
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Okay, debug-daemons isn't going to help as we aren't launching any daemons. This is all one node. So try adding "--mca odls_base_verbose 10 --mca state_base_verbose 10" to the cmd line and let's see what is going on.

I agree with Josh - neither mpirun nor hostname are invoking the Mellanox drivers, so it is hard to see why removing those drivers is allowing this to run.
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Same result. (It works through 102 but not greater than that)

Input: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname

Output:
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd
[Gen2Node3:54348] [[18008,0],0] orted_cmd: all routes and children gone - exiting
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
OK. Please try:

mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname

Josh
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname

Output:
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
[Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone - exiting
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin

From: Joshua Ladd
Sent: Tuesday, January 28, 2020 12:48 PM
To: Collin Strassburger
Cc: Open MPI Users; Ralph Castain
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Sorry, typo, try:

mpirun -np 128 --debug-daemons -mca plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd wrote:

And if you try:

mpirun -np 128 --debug-daemons -plm rsh hostname

Josh
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Input: mpirun -np 128 --debug-daemons hostname

Output:
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - exiting
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------

Collin
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Interesting. Can you try:

mpirun -np 128 --debug-daemons hostname

Josh
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
In relation to the multi-node attempt, I haven't set that up yet, as the per-node configuration doesn't pass its tests (full node utilization, etc.). Here are the results for the hostname test:

Input: mpirun -np 128 hostname
Output:
--
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--
128 total processes failed to start

Collin

From: users On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd
Cc: Ralph Castain; Open MPI Users
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Josh - if you read thru the thread, you will see that disabling Mellanox/IB drivers allows the program to run. It only fails when they are enabled.

On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox Software". This is on a single node. What happens when you try to launch on more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

Here's the I/O for these high local core count runs. ("xhpcg" is the standard hpcg benchmark)

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start

Collin

From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Can you send the output of a failed run including your command line.

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

I just tried that and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB related defaults as no drivers were installed, etc).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx --with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.

On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I have done some additional testing and I can say that it works correctly with gcc8 and no mellanox or pbs installed.

I have done two runs with Mellanox and pbs installed. One run includes the actual run options I will be using while the other includes a truncated set which still compiles but fails to execute correctly. As the option with the actual run options results in a smaller config log, I am including it here.

Version: 4.0.2
The config log is available at https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (which is not being explicitly operated across):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
ulimit -l = unlimited
ibv_devinfo:
hca_id: mlx4_0
        transport: InfiniBand (0)
        fw_ver: 2.42.5000
        …
        vendor_id: 0x02c9
        vendor_part_id: 4099
        hw_ver: 0x1
        board_id: MT_1100120019
        phys_port_cnt: 1
        Device ports:
                port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 4096 (5)
                sm_lid: 1
                port_lid: 12
                port_lmc: 0x00
                link_layer: InfiniBand

It looks like the rest of the IB information is in the config file.

I hope this helps,
Collin
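As Ralph's later reply in this thread shows, the error-63 failures ultimately traced to the per-process open-file-descriptor limit: mpirun opens several descriptors (pipes, sockets) per launched rank, so 128 local ranks can push past a low default. A quick pre-flight check along the lines Ralph suggests (paths and defaults assume a typical Linux box):

```shell
# Per-process soft limit on open file descriptors ("open files" in ulimit -a)
ulimit -n

# System-wide ceiling on open files (location may vary by Linux flavor)
cat /proc/sys/fs/file-max

# Raise the soft limit to the hard limit for this shell before running mpirun
ulimit -n "$(ulimit -Hn)"
```

Raising the soft limit this way only affects the current shell; persistent changes go through limits.conf or the service manager, as described in the article Ralph linked.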
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Josh - if you read thru the thread, you will see that disabling Mellanox/IB drivers allows the program to run. It only fails when they are enabled.
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Also, can you try running: mpirun -np 128 hostname

Josh
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
I don't see how this can be diagnosed as a "problem with the Mellanox Software". This is on a single node. What happens when you try to launch on more than one node?

Josh
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Here's the I/O for these high local core count runs. ("xhpcg" is the standard hpcg benchmark)

Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--
128 total processes failed to start

Collin
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Can you send the output of a failed run including your command line.

Josh
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Okay, so this is a problem with the Mellanox software - copying Artem.
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
I just tried that and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB related defaults as no drivers were installed, etc).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx --with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.

Collin
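Since the failure appeared only after rebuilding against the Mellanox stack, it can help to confirm what the installed Open MPI was actually configured with. `ompi_info` reports the configure line and the components that were built; a sketch (assumes the rebuilt 4.0.2's `ompi_info` is first on `PATH`):

```shell
# Show the configure command the installed Open MPI was built with,
# and check whether UCX support was actually compiled in.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i "configure command"
    ompi_info | grep -i ucx || echo "no ucx components found"
else
    echo "ompi_info not found - check PATH or the install prefix"
fi
```

Comparing this output between the working (gcc8-only) and failing (Mellanox-enabled) installs narrows down which component the regression rode in on.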
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Does it work with pbs but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users mailto:users@lists.open-mpi.org> > wrote: Hello, I have done some additional testing and I can say that it works correctly with gcc8 and no mellanox or pbs installed. I am have done two runs with Mellanox and pbs installed. One run includes the actual run options I will be using while the other includes a truncated set which still compiles but fails to execute correctly. As the option with the actual run options results in a smaller config log, I am including it here. Version: 4.0.2 The config log is available at https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the ompi dump is available athttps://pastebin.com/md3HwTUR. The IB network information (which is not being explicitly operated across): Packages: MLNX_OFED and Mellanox HPC-X, both are current versions (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64) Ulimit -l = unlimited Ibv_devinfo: hca_id: mlx4_0 transport: InfiniBand (0) fw_ver: 2.42.5000 … vendor_id: 0x02c9 vendor_part_id: 4099 hw_ver: 0x1 board_id: MT_1100120019 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 12 port_lmc: 0x00 link_layer: InfiniBand It looks like the rest of the IB information is in the config file. I hope this helps, Collin From: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com> > Sent: Monday, January 27, 2020 3:40 PM To: Open MPI User's List mailto:users@lists.open-mpi.org> > Cc: Collin Strassburger mailto:cstrassbur...@bihrle.com> > Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Can you please send all the information listed here: https://www.open-mpi.org/community/help/ Thanks! 
On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users mailto:users@lists.open-mpi.org> > wrote: Hello, I had initially thought the same thing about the streams, but I have 2 sockets with 64 cores each. Additionally, I have not yet turned multithreading off, so lscpu reports a total of 256 logical cores and 128 physical cores. As such, I don’t see how it could be running out of streams unless something is being passed incorrectly. Collin From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ray Sheppard via users Sent: Monday, January 27, 2020 11:53 AM To: users@lists.open-mpi.org <mailto:users@lists.open-mpi.org> Cc: Ray Sheppard mailto:rshep...@iu.edu> > Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Hi All, Just my two cents, I think error code 63 is saying it is running out of streams to use. I think you have only 64 cores, so at 100, you are overloading most of them. It feels like you are running out of resources trying to swap in and out ranks on physical cores. Ray On 1/27/2020 11:29 AM, Collin Strassburger via users wrote: This message was sent from a non-IU address. Please exercise caution when clicking links or opening attachments from external sources. Hello Howard, To remove potential interactions, I have found that the issue persists without ucx and hcoll support. Run command: mpirun -np 128 bin/xhpcg Output: -- mpirun was unable to start the specified application as it encountered an error: Error code: 63 Error name: (null) Node: Gen2Node4 when attempting to start process rank 0. -- 128 total processes failed to start It returns this error for any process I initialize with >100 processes per node. I get the same error message for multiple different codes, so the error code is mpi related rather than being program specific. 
Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Hello Collin,

Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?

Howard
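The limit and fabric details quoted above can be re-collected in one place before replying to the list. A minimal sketch; `ibv_devinfo` comes from the OFED/rdma-core tools and may not be installed on every node:

```shell
# Sketch: gather the limit and IB details reported in the message above.
ulimit -l    # locked-memory limit; "unlimited" is what the poster reports
ulimit -n    # per-process open-file-descriptor limit
# ibv_devinfo may be absent on nodes without the OFED tools installed
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | grep -E 'hca_id|state|active_mtu|link_layer'
fi
```

Running this on the failing node and pasting the output alongside the config log covers most of what the help page asks for.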
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Can you please send all the information listed here: https://www.open-mpi.org/community/help/ Thanks!

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Hello Collin,

Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?

Howard

On Mon., Jan. 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. Both of these versions cause the same error (error code 63) when utilizing more than 100 cores on a single node. The processors I am utilizing are AMD Epyc "Rome" 7742s. The OS is CentOS 8.1. I have tried compiling with both the default gcc 8 and locally compiled gcc 9. I have already tried modifying the maximum name field values with no success.

My compile options are: ./configure --prefix=${HPCX_HOME}/ompi --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,
Collin

Collin Strassburger
Bihrle Applied Research Inc.

--
Jeff Squyres
jsquy...@cisco.com
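The reported boundary ("fails above 100 processes per node") can be located mechanically rather than by trial and error. A minimal sketch, assuming `mpirun` is on the PATH; it uses `hostname` as the payload so any failure is in the launch path rather than the application, and the probe values are arbitrary:

```shell
# Sketch: probe the per-node process count at which launch starts failing.
# --oversubscribe lets np exceed the detected slot count on a single node.
for np in 96 100 104 112 128; do
    if mpirun -np "$np" --oversubscribe hostname > /dev/null 2>&1; then
        echo "np=$np ok"
    else
        echo "np=$np FAILED"
    fi
done
```

A sharp cutoff near a round number (here ~100) often points at a per-process resource limit rather than an application bug.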
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Hello,

I had initially thought the same thing about the streams, but I have 2 sockets with 64 cores each. Additionally, I have not yet turned multithreading off, so lscpu reports a total of 256 logical cores and 128 physical cores. As such, I don't see how it could be running out of streams unless something is being passed incorrectly.

Collin
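The core-count arithmetic above can be checked directly with lscpu. A minimal sketch (Linux only):

```shell
# Sketch: show physical vs logical core counts; SMT doubles the logical count.
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|CPU\(s\)):'
# Physical cores = Sockets x Cores-per-socket; a dual EPYC 7742 node has
# 2 x 64 = 128 physical cores, and 256 logical CPUs with SMT enabled.
```

With 128 physical cores present, `mpirun -np 128` is not oversubscribing the node, which is why the "too few cores" theory does not hold here.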
Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Hi All,

Just my two cents: I think error code 63 is saying it is running out of streams to use. I think you have only 64 cores, so at 100 you are overloading most of them. It feels like you are running out of resources trying to swap ranks in and out on physical cores.

Ray
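When a launch dies this early, Open MPI's launch frameworks can be made to narrate the failure themselves; the verbose MCA flags below are the ones used elsewhere in this thread to trace the problem to the open-file-descriptor limit:

```shell
# Sketch: enable launch-side (odls) and state-machine debugging in mpirun
# so it reports where process startup fails; standard Open MPI 4.x MCA options.
mpirun -np 128 \
    --mca odls_base_verbose 10 \
    --mca state_base_verbose 10 \
    hostname
```

The resulting component-by-component log usually shows which system call failed while spawning the local processes.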