Wonderful! I am happy to confirm that this resolves the issue!
Many thanks to everyone for their assistance,
Collin
Okay, that nailed it down - the problem is that the number of open file
descriptors is exceeding your system limit. I suspect the connection to the
Mellanox drivers is solely due to them also having some descriptors open, and
you are just close enough to the boundary that it causes you to hit it.
See
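
For readers following along, here is a minimal sketch of how one might inspect the open file descriptor limit described above. It is not from the thread: the helper names (fd_limits, open_fd_count, headroom) and the per-process descriptor figure are hypothetical and purely illustrative, and reading /proc/self/fd assumes Linux. The soft limit it reports is the same value that `ulimit -n` prints in the shell.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process (Linux /proc only)."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        # /proc is not available on this platform
        return None


def headroom(procs, fds_per_proc):
    """Rough estimate of descriptors left if a launcher holds fds_per_proc per child.

    fds_per_proc is a caller-supplied guess, not a measured Open MPI value.
    """
    soft, _ = fd_limits()
    if soft == resource.RLIM_INFINITY:
        return None
    used = open_fd_count() or 0
    return soft - used - procs * fds_per_proc


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"currently open: {open_fd_count()}")
    # 8 descriptors per process is an illustrative assumption only
    print(f"estimated headroom for 128 procs: {headroom(128, 8)}")
```

A negative headroom estimate would be consistent with the failure mode described here, where a few extra descriptors are enough to cross the limit. The actual per-process usage would have to be measured on the affected node.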
I agree that it is odd that the issue does not appear until after the Mellanox
drivers have been installed (and the configure flags set to use them). As
requested, here are the results
Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 hostname
Output:
Okay, debug-daemons isn't going to help as we aren't launching any daemons.
This is all one node. So try adding "--mca odls_base_verbose 10 --mca
state_base_verbose 10" to the cmd line and let's see what is going on.
I agree with Josh - neither mpirun nor hostname is invoking the Mellanox
pen-mpi.org>>;
Ralph Castain mailto:r...@open-mpi.org>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Interesting. Can you try:
mpirun -np 128 --debug-daemons hostname
Josh
On Tue, Jan 28, 2020 at 12:14 PM Collin
---
>
> 128 total processes failed to start
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd
> *Sent:* Tuesday, January 28, 2020 12:48 PM
> *To:* Collin Strassburger
> *Cc:* Open MPI Users ; Ralph Castain <
> r...@o
.
--
128 total processes failed to start
Collin
From: Joshua Ladd
Sent: Tuesday, January 28, 2020 12:48 PM
To: Collin Strassburger
Cc: Open MPI Users ; Ralph Castain
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Sorry
: Tuesday, January 28, 2020 12:31 PM
To: Collin Strassburger
Cc: Open MPI Users ; Ralph Castain
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Interesting. Can you try:
mpirun -np 128 --debug-daemons hostname
Josh
On Tue, Jan 28
>
> 128 total processes failed to start
>
>
>
>
>
> Collin
>
>
>
>
>
> *From:* users *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 12:06 PM
> *To:* Joshua Ladd
> *Cc:* Ralph Castain ; Open MPI Users <
> users@li
Collin
From: users On Behalf Of Ralph Castain via
users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd
Cc: Ralph Castain ; Open MPI Users
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Josh - if you read thru
s.open-mpi.org> >
Cc: Collin Strassburger mailto:cstrassbur...@bihrle.com> >; Ralph Castain mailto:r...@open-mpi.org> >; Artem Polyakov mailto:art...@mellanox.com> >
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Ca
to start
>>
>>
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* Joshua Ladd
>> *Sent:* Tuesday, January 28, 2020 11:39 AM
>> *To:* Open MPI Users
>> *Cc:* Collin Strassburger ; Ralph Castain <
>> r...@open-mpi.org>; Art
al processes failed to start
>
>
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd
> *Sent:* Tuesday, January 28, 2020 11:39 AM
> *To:* Open MPI Users
> *Cc:* Collin Strassburger ; Ralph Castain <
> r...@open-mpi.org>; Artem Polyakov
> *Subject:
: Collin Strassburger ; Ralph Castain
; Artem Polyakov
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Can you send the output of a failed run, including your command line?
Josh
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via
nuary 28, 2020 11:02 AM
> *To:* Open MPI Users
> *Cc:* Ralph Castain
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
> Does it work with pbs but not Mellanox? Just trying to isolate the problem.
>
>
...@cisco.com> >
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List mailto:users@lists.open-mpi.org> >
Cc: Collin Strassburger mailto:cstrassbur...@bihrle.com> >
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors p
:jsquy...@cisco.com>>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List
mailto:users@lists.open-mpi.org>>
Cc: Collin Strassburger
mailto:cstrassbur...@bihrle.com>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ proc
To: Open MPI User's List mailto:users@lists.open-mpi.org> >
Cc: Collin Strassburger mailto:cstrassbur...@bihrle.com> >
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Can you please send all the information listed her
s-boun...@lists.open-mpi.org>> On
Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: Ray Sheppard mailto:rshep...@iu.edu>>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
unless something is being
passed incorrectly.
Collin
From: users On Behalf Of Ray Sheppard via
users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors
Hi All,
Just my two cents, I think error code 63 is saying it is running out
of streams to use. I think you have only 64 cores, so at 100, you are
overloading most of them. It feels like you are running out of
resources trying to swap in and out ranks on physical cores.
Ray
On