On Apr 29, 2011, at 8:05 AM, Michael Di Domenico wrote:

> On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico
> <mdidomeni...@gmail.com> wrote:
>> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Hi Michael
>>> 
>>> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd 
>>> the envar after adding it to the environ :-/
>> 
>> The patch works great. I can now see the precondition environment
>> variable if I do
>> 
>> mpirun -n 2 -host node1 <prog>
>> 
>> and my <prog> runs just fine. However, if I do
>> 
>> srun --resv-ports -n 2 -w node1 <prog>
>> 
>> I get
>> 
>> [node1:16780] PSM EP connect error (unknown connect error):
>> [node1:16780]  node1
>> [node1:16780] PSM EP connect error (Endpoint could not be reached):
>> [node1:16780]  node1
>> 
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> 
>> I did notice a difference in the precondition env variable between the
>> two runs:
>> 
>> mpirun -n 2 -host node1 <prog>
>> 
>> sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which
>> changes with each run, i.e. it looks random)

I didn't change anything about the way mpirun works, so this is expected.

> 
>> 
>> srun --resv-ports -n 2 -w node1 <prog>
> 
> This should have been "srun --resv-ports -n 1 -w node1 <prog>". I
> can't run a 2-rank job; I get the PML error above.
> 
>> 
>> sets precondition_transports=0000184500000000-0000000100000000 (which
>> doesn't seem to change run to run)

The value would indeed look quite different. Every proc has to compute the
same result, so I can't use a random value; I simply used the SLURM_JOBID and
SLURM_STEPID instead. I would therefore have expected the first field (based
on the jobid) to remain the same, and the second to change each time you did
an "srun" within the same job.
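
As a rough illustration of the scheme described above, here is a minimal Python sketch. It is not the actual patch (which is C code inside Open MPI and is not shown in this thread). It assumes, purely for illustration, that the job ID and step ID are each placed in the upper 32 bits of a 64-bit field and printed as 16 hex digits; the function names and that bit layout are hypothetical. The point is only that every proc in the same srun step can derive an identical key without communicating.

```python
import os


def make_transport_key(jobid: int, stepid: int) -> str:
    """Build a deterministic two-field key from a job ID and a step ID.

    Hypothetical layout (an assumption, not taken from the patch): each ID
    sits in the upper 32 bits of a 64-bit field, printed as 16 hex digits.
    """
    first = (jobid & 0xFFFFFFFF) << 32
    second = (stepid & 0xFFFFFFFF) << 32
    return f"{first:016x}-{second:016x}"


def key_from_slurm_env(env=os.environ) -> str:
    """Read SLURM_JOBID and SLURM_STEPID, as every proc in a step can."""
    return make_transport_key(int(env["SLURM_JOBID"]), int(env["SLURM_STEPID"]))


# Made-up IDs: two procs in the same step compute the same key ...
env = {"SLURM_JOBID": "6213", "SLURM_STEPID": "1"}
key_a = key_from_slurm_env(env)
key_b = key_from_slurm_env(env)
# ... and a later step in the same job changes only the second field.
key_next = key_from_slurm_env({"SLURM_JOBID": "6213", "SLURM_STEPID": "2"})
```

With these made-up IDs the key happens to have the same shape as the srun value reported above, but that does not establish the real job ID or the real layout used by the patch.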

I'm afraid I don't know the significance of the fields, so I can't say why PSM
can't make the connection. I'll have to ping someone more knowledgeable to see
why those values aren't acceptable.


>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

