I am testing Intel MPI under Slurm and have the recommended method
working, i.e.,

I_MPI_PMI_LIBRARY=<slurmdir>/lib64/libpmi.so  srun myimpiprog
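
The same pattern works for multiple ranks with srun's usual options,
e.g. (the task count here is just an example):

I_MPI_PMI_LIBRARY=<slurmdir>/lib64/libpmi.so  srun -n 2 ./myimpiprog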

However, Intel MPI also recommends another startup method:

salloc -N 1

mpiexec.hydra -bootstrap jmi -n 2 ./myimpiprog
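
Equivalently, if I read the Intel MPI reference right, the bootstrap
can be selected through the environment rather than on the command line:

export I_MPI_HYDRA_BOOTSTRAP=jmi
mpiexec.hydra -n 2 ./myimpiprog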


Now I'm not sure what the pros and cons of JMI are, but launching fails:
JMI seems to invoke srun first with the short names (as in my
slurm.conf) and then switches to the FQDNs, which causes srun to fail
with "requested node configuration is not available".

I'd like to check whether the short name --> FQDN switch is Slurm or
JMI/PMI weirdness. I am testing with Slurm 14.03.3-2 and Intel MPI
4.1.3.0249/5.0.0.016.
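
One way to isolate this (just a sketch, reusing the node names from the
debug output below) would be to run the failing srun invocation by
hand, outside of JMI:

srun --nodelist builder -N 1 -n 1 hostname
srun --nodelist builder.hpc8888.com -N 1 -n 1 hostname

If the FQDN form fails on its own with "requested node configuration is
not available", the rejection is Slurm's, and JMI is only responsible
for generating the FQDN in the first place.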

Here's the debug output from Intel MPI with JMI; note the switch from
short names to FQDNs. hostname on the nodes returns the short names,
and SLURM_NODELIST also shows the short names.

[mpiexec@builder] Launch arguments: /opt/intel/impi/4.1.3/bin64/pmi_proxy
--control-port builder:38756 --debug --pmi-connect lazy-cache
--pmi-aggregate -s 0 --rmk slurm --launcher jmi --demux poll --pgid 0
--enable-stdin 1 --retries 10 --control-code 313688655 --proxy-id -1
[jmi-slurm@builder] Launch arguments: srun --nodelist builder,ruchba -N 2
-n 2 /opt/intel/impi/4.1.3/bin64/pmi_proxy --control-port builder:38756
--debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm
--launcher jmi --demux poll --pgid 0 --enable-stdin 1 --retries 10
--control-code 313688655 --proxy-id -1
[mpiexec@builder] STDIN will be redirected to 1 fd(s): 8
[proxy:0:0@builder] Start PMI_proxy 0
[jmi-slurm@builder] Launch arguments: srun --nodelist
builder.hpc8888.com -N 1 -n 1 ./hello
[jmi-slurm@builder] Launch arguments: srun --nodelist
builder.hpc8888.com -N 1 -n 1 ./hello
[proxy:0:0@builder] STDIN will be redirected to 1 fd(s): 8
[proxy:0:1@ruchba] Start PMI_proxy 1
[jmi-slurm@ruchba] Launch arguments: srun --nodelist ruchba.hpc8888.com -N
1 -n 1 ./hello
[jmi-slurm@ruchba] Launch arguments: srun --nodelist ruchba.hpc8888.com -N
1 -n 1 ./hello

- Anthony
