On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>>
>>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>>
>>>>> Was this ever committed to the OMPI src as something not having to be
>>>>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>>>>> does?
>>>>
>>>> Not that I know of - I don't think the PSM developers ever looked at it.

Thought about this some more and I believe I have a solution to the problem. Will try to commit something to the devel trunk by the end of the week.

Ralph

>>>>
>>>>>
>>>>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>>>>> setup of this key. Namely, with slurm you cannot export variables
>>>>> from the --prolog of an srun, only from an --task-prolog.
>>>>> Unfortunately, if you use a task-prolog each rank gets a different
>>>>> key, which doesn't work.
>>>>>
>>>>> I'm also guessing that each unique mpirun needs its own psm key, not
>>>>> one for the whole system, so i can't just make it a permanent
>>>>> parameter somewhere else.
>>>>>
>>>>> Also, i recall reading somewhere that the --resv-ports parameter that
>>>>> OMPI uses from slurm to choose a list of ports to use for TCP comms
>>>>> tries to lock a port from the pool three times before giving up.
>>>>
>>>> Had to look back at the code - I think you misread this. I can find no
>>>> evidence in the code that we try to bind that port more than once.
>>>
>>> Perhaps i misstated, i don't believe you're trying to bind to the same
>>> port twice during the same session. i believe the code re-uses
>>> similar ports from session to session. what i believe happens (but
>>> could be totally wrong) is that the previous session releases the port, but
>>> linux isn't quite done with it when the new session tries to bind to
>>> the port, in which case it tries three times and then fails the job
>>
>> Actually, I understood you correctly. I'm just saying that I find no
>> evidence in the code that we try three times before giving up. What I see is
>> a single attempt to bind the port - if it fails, then we abort. There is no
>> parameter to control that behavior.
>>
>> So if the OS hasn't released the port by the time a new job starts on that
>> node, then it will indeed abort if the job was unfortunately given the same
>> port reservation.
>
> Oh, okay, sorry...
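
For anyone reading this thread later, the "single attempt to bind the port, abort on failure" behavior described above can be illustrated with a small sketch. This is Python, not the actual Open MPI C code, and the helper name try_bind_once is hypothetical; it only shows what happens at the socket level when a port is still held by someone else and no retry is made.

```python
import errno
import socket


def try_bind_once(port, host="127.0.0.1"):
    """Make exactly one attempt to bind a TCP socket to (host, port).

    Returns the bound socket on success, or None if the address is
    still in use. No retry is made, mirroring the behavior described
    in the thread (hypothetical helper, not Open MPI code).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise
    return sock


# Stand-in for a previous job that still holds its reserved port.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
held_port = holder.getsockname()[1]

# The "new job" gets the same port reservation and tries once.
second = try_bind_once(held_port)
print("bind while port is held:", "ok" if second else "address in use")

holder.close()
```

In this sketch the caller decides what to do with a None result; per the discussion above, the real launcher simply aborts the job at that point rather than retrying.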