Dear QE users,
I have an srun problem on an Ubuntu 16.04 cluster with Intel MPI. Could you
please help me check what is going on? Thank you!
I am trying to install Slurm on a cluster running Ubuntu 16.04.
I am using Intel MPI, which is installed on the head node under
/opt/intel/impi_5.01.
According to the Slurm MPI guide, the I_MPI_PMI_LIBRARY environment
variable has to be exported so that it points to Slurm's libpmi.so:
https://slurm.schedmd.com/mpi_guide.html#intel_mpi
However, I installed slurm-llnl from the Ubuntu package,
|sudo apt-get install slurm-llnl |
and I am not sure where libpmi.so is located. I did a search and
found the file below. Is this the file I'm looking for?
|/usr/lib/x86_64-linux-gnu/libpmi.so |
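For reference, the search described above could be scripted roughly as follows. This is only a sketch: `find_libpmi` is a hypothetical helper name, and the default directories `/usr/lib` and `/usr/lib64` are an assumption about where a packaged Slurm might put the library.

```shell
#!/bin/sh
# find_libpmi is a hypothetical helper, not part of Slurm or Intel MPI.
# It prints every file named libpmi.so* under the given directories.
# With no arguments it searches /usr/lib and /usr/lib64 (an assumption).
find_libpmi() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/lib /usr/lib64
    fi
    for dir in "$@"; do
        # Skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        find "$dir" -name 'libpmi.so*' 2>/dev/null
    done
}
```

Calling `find_libpmi` with no arguments lists the candidates. On a Debian-based system, `dpkg -S libpmi.so` should additionally report which installed package owns the file.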
Anyway, I exported the variable and tried
|srun -p old -N3 -n24 hostname |
It returns,
|rolly@head:~$ srun -p old -N3 -n24 hostname
node02 node02 node02 node02 node02 node02 node02 node02
node01 node01 head head node01 head head head
node01 node01 head node01 head head node01 node01 |
It appears to be working.
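The export step could be guarded so that the variable is only set when the library file is really there. A minimal sketch, assuming a POSIX shell; `use_slurm_pmi` is a hypothetical helper name, not a Slurm or Intel MPI command.

```shell
#!/bin/sh
# use_slurm_pmi is a hypothetical helper: it exports I_MPI_PMI_LIBRARY
# only if the given file exists and is readable, and otherwise leaves
# the environment untouched and returns a non-zero status.
use_slurm_pmi() {
    lib=$1
    if [ -r "$lib" ]; then
        export I_MPI_PMI_LIBRARY="$lib"
    else
        echo "libpmi not found at: $lib" >&2
        return 1
    fi
}
```

It would be used as `use_slurm_pmi /usr/lib/x86_64-linux-gnu/libpmi.so && srun -p old -N3 -n24 hostname`, so that srun is not started at all if the path is wrong.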
But when I run my actual task,
|srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x |
It produced errors,
|mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly);
possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly);
possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option) |
(The same message was printed 16 times in total: 8 times by
mpiexec_node02 and 8 times by mpiexec_node01. Each is shown once here.)
I believe the errors appear because the job is being started with
mpiexec under Intel MPI, when it should be using mpirun instead.
I can also confirm that exporting the environment variable,
|export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/libpmi.so |
breaks mpirun. If it is set,
|mpirun -n 24 -ppn 8 -f ~/machines.LINUX ~/QE530-CPU/espresso-5.3.0/bin/pw.x |
fails. If it is unset, mpirun works again.
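Since the variable is wanted for srun but breaks mpirun, one way to keep the two launch modes from interfering is to set or clear it per command rather than exporting it for the whole session. The sketch below is not a fix for the mpd error itself; `with_pmi`, `without_pmi` and `PMI_LIB` are hypothetical names, and it assumes an `env` that supports `-u` (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical wrappers, not Slurm or Intel MPI commands.
# PMI_LIB defaults to the path found earlier; override it if needed.
PMI_LIB=${PMI_LIB:-/usr/lib/x86_64-linux-gnu/libpmi.so}

# Run a command with I_MPI_PMI_LIBRARY set for that command only.
with_pmi() {
    env I_MPI_PMI_LIBRARY="$PMI_LIB" "$@"
}

# Run a command with I_MPI_PMI_LIBRARY removed from its environment,
# even if the variable is exported in the calling shell.
without_pmi() {
    env -u I_MPI_PMI_LIBRARY "$@"
}
```

The two cases from above would then be `with_pmi srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x` and `without_pmi mpirun -n 24 -ppn 8 -f ~/machines.LINUX ~/QE530-CPU/espresso-5.3.0/bin/pw.x`, with the login shell's own environment left unchanged in both.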
How can I correct the problem?
--
PhD. Research Fellow,
Dept. of Physics & Materials Science,
City University of Hong Kong
Tel: +852 3442 4000
Fax: +852 3442 0538