On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>> 
>>>>> Was this ever committed to the OMPI source so that it no longer has to
>>>>> be run outside of OpenMPI, but is instead done as part of the PSM setup
>>>>> that OpenMPI does?
>>>> 
>>>> Not that I know of - I don't think the PSM developers ever looked at it.

Thought about this some more, and I believe I have a solution to the problem.
I will try to commit something to the devel trunk by the end of the week.

Ralph


>>>> 
>>>>> 
>>>>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>>>>> setup of this key.  Namely, with Slurm you cannot export variables
>>>>> from the --prolog of an srun, only from a --task-prolog.
>>>>> Unfortunately, if you use a task-prolog, each rank gets a different
>>>>> key, which doesn't work.
>>>>> 
>>>>> I'm also guessing that each unique mpirun needs its own PSM key, not
>>>>> one for the whole system, so I can't just make it a permanent
>>>>> parameter somewhere else.
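
One possible way around the per-rank key problem described above is to derive the key from the job rather than generate it at random, so that every rank running the task prolog computes the same value. The following is only a minimal sketch under several assumptions that do not come from this thread: that the key is passed through an environment variable named OMPI_MCA_orte_precondition_transports, that it has the form of two 16-digit hex groups joined by a dash, and that SLURM_JOB_ID and SLURM_STEP_ID are set in the task prolog's environment. The function name job_key is hypothetical.

```python
import hashlib
import os


def job_key(job_id: str, step_id: str = "0") -> str:
    """Return a '<16 hex>-<16 hex>' key that is identical for every rank of a job step."""
    # Hashing the job and step IDs gives all ranks the same value without
    # any communication, and a different value for each job step.
    digest = hashlib.sha256(f"{job_id}.{step_id}".encode()).hexdigest()
    return f"{digest[:16]}-{digest[16:32]}"


if __name__ == "__main__":
    # Assumption: Slurm exposes these variables to the task prolog.
    job_id = os.environ.get("SLURM_JOB_ID", "0")
    step_id = os.environ.get("SLURM_STEP_ID", "0")
    # A task prolog sets task environment variables by printing
    # "export NAME=value" lines on standard output.
    # Assumption: the variable name below is the one the PSM setup reads.
    print(f"export OMPI_MCA_orte_precondition_transports={job_key(job_id, step_id)}")
```

A key derived this way is predictable from the job ID, so it only addresses the "same key on every rank, different key per mpirun" requirement, not any secrecy requirement.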
>>>>> 
>>>>> Also, I recall reading somewhere that the --resv-ports parameter that
>>>>> OMPI uses from Slurm to choose a list of ports for TCP communication
>>>>> tries to lock a port from the pool three times before giving up.
>>>> 
>>>> Had to look back at the code - I think you misread this. I can find no 
>>>> evidence in the code that we try to bind that port more than once.
>>> 
>>> Perhaps I misstated. I don't believe you're trying to bind to the same
>>> port twice during the same session.  I believe the code re-uses
>>> similar ports from session to session.  What I believe happens (but
>>> I could be totally wrong) is that the previous session releases the
>>> port, but Linux isn't quite done with it when the new session tries to
>>> bind to the port, in which case it tries three times and then fails
>>> the job.
>> 
>> Actually, I understood you correctly. I'm just saying that I find no 
>> evidence in the code that we try three times before giving up. What I see is 
>> a single attempt to bind the port - if it fails, then we abort. There is no 
>> parameter to control that behavior.
>> 
>> So if the OS hasn't released the port by the time a new job starts on that 
>> node, then it will indeed abort if the job was unfortunately given the same 
>> port reservation.
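
The difference between "one bind attempt, then abort" and "retry a few times" can be illustrated with plain sockets. This is a generic sketch, not Open MPI's code: the function bind_with_retry and its attempts and delay parameters are hypothetical, and as noted above there is no such parameter in the actual code. With attempts=1 it mirrors the single-attempt behavior described here.

```python
import errno
import socket
import time


def bind_with_retry(host, port, attempts=1, delay=0.1):
    """Bind a TCP socket to (host, port), retrying only on 'address in use'."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock  # the caller owns the bound socket
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # any other error is not worth retrying
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay)  # give the OS time to release the port
    raise last_error
```

Calling bind_with_retry(host, port) with the default attempts=1 fails immediately if the port is still held, which matches the abort described above; a larger attempts value only helps if the port is released within the retry window.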
> 
> Oh, okay, sorry...
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

