Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Wonderful! I am happy to confirm that this resolves the issue! Many thanks to everyone for their assistance, Collin

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, that nailed it down - the problem is that the number of open file descriptors exceeds your system limit. I suspect the connection to the Mellanox drivers is solely due to them also having some descriptors open, and you are just close enough to the boundary that they push you over it. See
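
For readers following along: the fix referenced in this message is cut off in the archive, so the following is only a minimal sketch of how one might inspect the per-process limit on open file descriptors (the value the shell reports with `ulimit -n`). The helper names `fd_limits`, `open_fd_count`, and `would_exceed` are hypothetical and are not part of Open MPI. The `/proc/self/fd` lookup assumes Linux.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Approximate count of descriptors open in this process.

    Returns None when /proc/self/fd is unavailable (non-Linux systems).
    The count includes the descriptor briefly opened to list the directory.
    """
    fd_dir = "/proc/self/fd"  # Linux-specific assumption
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))


def would_exceed(soft_limit, in_use, extra_needed):
    """True if opening extra_needed more descriptors would pass the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no limit configured
    return in_use + extra_needed > soft_limit


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"descriptors open in this process: {open_fd_count()}")
```

Comparing the soft limit with the number of descriptors a launcher needs for 128 local processes would show whether the run is near the boundary described above. How many descriptors mpirun actually opens per process is not stated in this thread, so that number has to be measured rather than assumed.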

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I agree that it is odd that the issue does not appear until after the Mellanox drivers have been installed (and the configure flags set to use them). As requested, here are the results. Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 hostname Output:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, debug-daemons isn't going to help, as we aren't launching any daemons; this is all one node. So try adding "--mca odls_base_verbose 10 --mca state_base_verbose 10" to the command line and let's see what is going on. I agree with Josh - neither mpirun nor hostname is invoking the Mellanox

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
pen-mpi.org>; Ralph Castain <r...@open-mpi.org> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Interesting. Can you try: mpirun -np 128 --debug-daemons hostname Josh On Tue, Jan 28, 2020 at 12:14 PM Collin

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
--- 128 total processes failed to start Collin From: Joshua Ladd Sent: Tuesday, January 28, 2020 12:48 PM To: Collin Strassburger Cc: Open MPI Users; Ralph Castain <r...@o

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
. -- 128 total processes failed to start Collin From: Joshua Ladd Sent: Tuesday, January 28, 2020 12:48 PM To: Collin Strassburger Cc: Open MPI Users ; Ralph Castain Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Sorry

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
: Tuesday, January 28, 2020 12:31 PM To: Collin Strassburger Cc: Open MPI Users ; Ralph Castain Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Interesting. Can you try: mpirun -np 128 --debug-daemons hostname Josh On Tue, Jan 28

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
128 total processes failed to start Collin From: users On Behalf Of Ralph Castain via users Sent: Tuesday, January 28, 2020 12:06 PM To: Joshua Ladd Cc: Ralph Castain; Open MPI Users <users@li

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Collin From: users On Behalf Of Ralph Castain via users Sent: Tuesday, January 28, 2020 12:06 PM To: Joshua Ladd Cc: Ralph Castain ; Open MPI Users Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Josh - if you read thru

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
s.open-mpi.org> Cc: Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Ca

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
to start Collin From: Joshua Ladd Sent: Tuesday, January 28, 2020 11:39 AM To: Open MPI Users Cc: Collin Strassburger; Ralph Castain <r...@open-mpi.org>; Art

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
al processes failed to start Collin From: Joshua Ladd Sent: Tuesday, January 28, 2020 11:39 AM To: Open MPI Users Cc: Collin Strassburger; Ralph Castain <r...@open-mpi.org>; Artem Polyakov Subject:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
: Collin Strassburger ; Ralph Castain ; Artem Polyakov Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Can you send the output of a failed run including your command line. Josh On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
nuary 28, 2020 11:02 AM To: Open MPI Users Cc: Ralph Castain Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Does it work with PBS but not Mellanox? Just trying to isolate the problem.

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
...@cisco.com> Sent: Monday, January 27, 2020 3:40 PM To: Open MPI User's List <users@lists.open-mpi.org> Cc: Collin Strassburger <cstrassbur...@bihrle.com> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors p

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
:jsquy...@cisco.com> Sent: Monday, January 27, 2020 3:40 PM To: Open MPI User's List <users@lists.open-mpi.org> Cc: Collin Strassburger <cstrassbur...@bihrle.com> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ proc

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
To: Open MPI User's List <users@lists.open-mpi.org> Cc: Collin Strassburger <cstrassbur...@bihrle.com> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node Can you please send all the information listed her

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Jeff Squyres (jsquyres) via users
s-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users Sent: Monday, January 27, 2020 11:53 AM To: users@lists.open-mpi.org Cc: Ray Sheppard <rshep...@iu.edu> Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Collin Strassburger via users
unless something is being passed incorrectly. Collin From: users On Behalf Of Ray Sheppard via users Sent: Monday, January 27, 2020 11:53 AM To: users@lists.open-mpi.org Cc: Ray Sheppard Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Ray Sheppard via users
Hi All, Just my two cents: I think error code 63 is saying it is running out of streams to use. I think you have only 64 cores, so at 100 processes you are overloading most of them. It feels like you are running out of resources trying to swap ranks in and out on physical cores. Ray On