Hi,
just a dumb question, but did you actually build Slurm's PMI plugin? As it is
considered an add-on, you have to compile
and install it manually…
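
A quick way to check, as a rough sketch: look for the PMI client libraries in your Slurm library directory. The helper name `has_pmi_lib` and the example path are only illustrative; `libpmi.so` and `libpmi2.so` are the names Slurm normally installs, but your layout may differ.

```shell
# Hypothetical helper: succeed if a Slurm PMI client library is present
# in the given directory (libpmi.so for PMI-1, libpmi2.so for PMI-2).
has_pmi_lib() {
    dir="$1"
    for lib in libpmi.so libpmi2.so; do
        if [ -e "$dir/$lib" ]; then
            echo "found $dir/$lib"
            return 0
        fi
    done
    echo "no PMI library in $dir"
    return 1
}

# Example (the path is an assumption, adjust it to your install prefix):
# has_pmi_lib /usr/lib64
```

`srun --mpi=list` should also show which MPI plugin types your Slurm installation knows about. If PMI2 is missing, as far as I know it lives under `contribs/pmi2` in the Slurm source tree and needs its own `make install` there.
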
Regards,
Uwe
On 17.06.2015 at 18:52, Wiegand, Paul wrote:
> Rémi,
>
> This got me a bit farther, thanks.
>
>> The stack trace stuck in BTL openib makes me think it's more related to Open
>> MPI <-> IB integration than to Slurm <-> Open MPI.
>
> I agree that it seems like an MPI/IB thing; however, I can run using
> Torque/Moab via SSH so there's some kind of difference here that I'm not
> understanding, I think.
>
>
>> Did you check the permissions of your IB devices in /dev?
>
> Good point. I believe these are correct. We're not having a problem with
> any other IB-based applications, including other MPI/IB models. But I
> checked, and they look right to me.
>
>
>> It could work w/o problem using `mpirun -host` because MCA-related
>> environment variables may be set in your module and not propagated by mpirun
>> through SSH, whereas Slurm basically propagates everything.
>
> I ran the following command, both in my normal shell and after getting
> a shell from salloc, then diffed the results (no difference). What else can
> I check?
>
> ompi-info --all | grep -i btl
>
>
>
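
It may also be worth comparing the environment itself rather than only the `ompi-info` output. Open MPI picks up MCA parameters from environment variables named `OMPI_MCA_<param>`, so a minimal sketch like the following diffs just those between two `env` dumps. The function name and the file names are made up for illustration.

```shell
# Hypothetical helper: diff only the OMPI_MCA_* variables found in two
# files that each hold the output of `env`.
# Returns 0 if they match, 1 if they differ (diff's own exit status).
mca_env_diff() {
    left=$(mktemp) && right=$(mktemp) || return 2
    grep '^OMPI_MCA_' "$1" | sort > "$left"
    grep '^OMPI_MCA_' "$2" | sort > "$right"
    diff "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return $status
}

# Example: capture both environments, then compare.
# env > login.env                # in the normal shell
# srun -n 1 env > slurm.env      # inside the salloc allocation
# mca_env_diff login.env slurm.env
```

For reference, `mpirun --mca btl ^openib` has the same effect as exporting `OMPI_MCA_btl=^openib` before the launch, so a stray variable of that kind would show up in this diff.
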
>> You can also check whether it is related to IB by disabling it explicitly in
>> the Open MPI BTL framework via mpirun parameters.
>
> This was a good idea. I ran correctly with the following in an salloc shell,
> which confirms that it's happening at the IB integration level:
>
> mpirun --mca btl ^openib ./simple
>
>
>
> So the question is: Why aren't the MCA parameters propagating? Or: What
> did I misconfigure so that they would not? Torque uses ssh when it deploys, and
> we've no problems with any of our MPI setups via Torque. Is there some
> Slurm-ishness that my Torquey assumptions are getting in the way of me
> understanding?
>
> Thanks,
> Paul.
>
>
> P.S. To Andy Reibs: Thanks for your suggestion. My current build does use
> PMI and explicitly points to the Slurm PMI. I tried your /etc/sysconfig/slurm
> suggestion, but no dice.
>
>