Hi Paul,

On 17/06/2015 16:38, Wiegand, Paul wrote:
[...]
$ salloc -n 1
salloc: Granted job allocation 192
$ ulimit -l
unlimited
$ mpirun ./simple
[evc5:19184] *** Process received signal ***
[evc5:19184] Signal: Segmentation fault (11)
[evc5:19184] Signal code: Address not mapped (1)
[evc5:19184] Failing at address: 0x30
[evc5:19184] [ 0] /lib64/libpthread.so.0(+0xf130)[0x2b401b32c130]
[evc5:19184] [ 1] /apps/openmpi/openmpi-1.8.3-ic-2015-slurm-14.11/lib/openmpi/mca_btl_openib.so(+0x1fdd8)[0x2b4020735dd8]

The stack trace ends in the openib BTL, which makes me think the problem is more related to the Open MPI <-> IB integration than to Slurm <-> Open MPI.

Did you check the permissions of your IB devices in /dev?
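
Something like the following, run on the compute node inside the allocation, would show them. This is only a sketch: it assumes the verbs device nodes live under /dev/infiniband, which is the usual location but may differ on your install.

```shell
# Assumption: the IB device nodes are under /dev/infiniband (usual location).
ib_dev_dir=/dev/infiniband
if [ -d "$ib_dev_dir" ]; then
    # The uverbs* nodes normally have to be readable and writable by the job's user.
    ls -l "$ib_dev_dir"
else
    echo "no $ib_dev_dir directory on $(hostname)"
fi
```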

It could work without problems using `mpirun -host` because MCA-related environment variables may be set in your module and not propagated by mpirun through SSH, whereas Slurm basically propagates everything.
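
One way to see what your module sets is to list the MCA variables in the environment. A small sketch follows; the helper name `list_mca_env` is just something I made up for illustration, while the `OMPI_MCA_` prefix is how Open MPI reads MCA parameters from the environment.

```shell
# Hypothetical helper: print the Open MPI MCA settings found in the environment.
list_mca_env() {
    env | grep '^OMPI_MCA_' | sort
}

mca_vars=$(list_mca_env)
# Fall back to a short message when nothing is set.
echo "${mca_vars:-no OMPI_MCA_* variables set}"
```

Running it once in your login shell and once inside the Slurm allocation, then comparing the two outputs, should show whether the environments differ.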

You can also check whether it is related to IB by explicitly disabling the openib component of the Open MPI BTL framework through the mpirun parameters.
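
For example, roughly like this, using your ./simple binary from above (the `btl_selection` variable is only there for readability):

```shell
# "^" negates the list: use every available BTL except openib.
btl_selection='^openib'
if command -v mpirun >/dev/null 2>&1; then
    mpirun --mca btl "$btl_selection" ./simple
else
    echo "mpirun not found in PATH; load your Open MPI module first"
fi
```

If the segfault goes away with openib excluded, that points at the IB side. Exporting OMPI_MCA_btl with the same value before calling mpirun should have the same effect.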

rémi
