Rémi,

This got me a bit farther, thanks.

> The stack trace stuck in BTL openib makes me think it's more related to Open 
> MPI <-> IB integration than to Slurm <-> Open MPI.

I agree that it seems like an MPI/IB thing; however, I can run using 
Torque/Moab via SSH so there's some kind of difference here that I'm not 
understanding, I think.


> Did you check the permissions of your IB devices in /dev?

Good point.  I believe these are correct.  We're not having a problem with any 
other IB-based applications, including other MPI/IB models.  But I checked, and 
they look right to me.


> It could work w/o problem using `mpirun -host` because MCA-related 
> environment variables may be set in your module and not propagated by mpirun 
> through SSH, whereas Slurm basically propagates everything.

I ran the following command, both in my normal shell and in a shell obtained 
from salloc, then diffed the results (no difference).  What else can I 
check?

ompi_info --all | grep -i btl
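For completeness, here is a sketch of that comparison, plus a check of which MCA-related environment variables are actually set in each shell (the file names are placeholders; this assumes a standard Open MPI install where MCA parameters can be set via `OMPI_MCA_*` variables):

```shell
# Capture BTL-related parameters in the login shell...
ompi_info --all | grep -i btl | sort > btl.login.txt

# ...and again inside an allocation, then compare.
salloc -N 1 bash -c "ompi_info --all | grep -i btl | sort > btl.salloc.txt"
diff btl.login.txt btl.salloc.txt

# ompi_info reports defaults plus local overrides; also worth checking
# which OMPI_MCA_* variables the environment itself carries, since those
# are what a launcher would (or would not) propagate to remote ranks.
env | grep '^OMPI_MCA_'
```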



> You can also check whether it is related to IB by disabling it explicitly in 
> the Open MPI BTL framework via mpirun parameters.

This was a good idea.  The job ran correctly with the following command in an 
salloc shell, which confirms that it's happening at the IB integration level:

mpirun --mca btl ^openib ./simple
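As a stopgap while the propagation question gets sorted out, the same exclusion can be made persistent through Open MPI's standard MCA mechanisms rather than on every command line (a sketch; adjust paths to your install):

```shell
# Per-shell or per-module: MCA parameters can be set as environment
# variables with the OMPI_MCA_ prefix.
export OMPI_MCA_btl=^openib

# Or per-user, via the MCA parameter file Open MPI reads at startup:
echo "btl = ^openib" >> ~/.openmpi/mca-params.conf
```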



So the question is:  Why aren't the MCA parameters propagating?  Or: what did 
I misconfigure so that they don't?  Torque uses ssh when it deploys, and we've 
had no problems with any of our MPI setups via Torque.  Is there some 
Slurm-ishness here that my Torquey assumptions are keeping me from 
understanding?
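One way I could narrow this down further is to compare the environment a process actually sees under each launcher, since Slurm (via srun) forwards the submitting shell's environment while ssh-based launch gives you whatever the remote login shell sets up. A sketch, with `node01` as a placeholder hostname:

```shell
# Environment seen by a task launched through Slurm...
srun -N 1 env | grep -E '^(OMPI_|SLURM_)' | sort > env.srun.txt

# ...versus the environment a plain ssh login gets on a compute node.
ssh node01 env | grep '^OMPI_' | sort > env.ssh.txt

diff env.srun.txt env.ssh.txt
```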

Thanks,
Paul.


P.S.  To Andy Reibs:  Thanks for your suggestion.  My current build does use 
PMI and explicitly points to the Slurm PMI.  I tried your /etc/sysconfig/slurm 
suggestion, but no dice.
