Dear QE users,

I have an srun problem on an Ubuntu 16.04 cluster with Intel MPI. Could you please help me check what is going on? Thank you!

I am trying to install Slurm on a cluster running Ubuntu 16.04.

I am using Intel MPI, which is installed on the head node at /opt/intel/impi_5.01.

According to the Slurm MPI guide, I need to export a variable pointing to libpmi.so: https://slurm.schedmd.com/mpi_guide.html#intel_mpi

However, I installed slurm-llnl from the Ubuntu repositories,

|sudo apt-get install slurm-llnl |

and I am not sure where libpmi.so is located. I searched and found the file below. Is this the file I am looking for?

|/usr/lib/x86_64-linux-gnu/libpmi.so |
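For reference, the search can be sketched as a small shell function. This is only a minimal sketch: the name find_libpmi is hypothetical, and the directories in the usage comment are assumptions about where libraries usually live on Ubuntu.

```shell
# Hypothetical helper: list every libpmi.so* file under the given directories.
find_libpmi() {
    # Errors from unreadable or missing directories are discarded.
    find "$@" -name 'libpmi.so*' 2>/dev/null
}

# Example call (directories are assumptions, adjust as needed):
#   find_libpmi /usr/lib /usr/local/lib
```

On Ubuntu, `dpkg -S /usr/lib/x86_64-linux-gnu/libpmi.so` should additionally report which installed package owns the file, which helps confirm that it came from the Slurm packages.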

Anyway, I exported the variable and tried

|srun -p old -N3 -n24 hostname |

It returned:

|rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01 |

It appears to work.
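The task distribution can be checked by counting how many times each hostname appears. A minimal sketch, assuming the standard sort and uniq utilities; the function name count_per_host is hypothetical:

```shell
# Hypothetical helper: read one hostname per line on stdin and
# print how many times each hostname occurs.
count_per_host() {
    sort | uniq -c
}

# Example call, using the srun command from above:
#   srun -p old -N3 -n24 hostname | count_per_host
```

For the output shown above, this should report 8 tasks each on head, node01 and node02, 24 in total.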

But when I ran my actual job,

|srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x |

it produced these errors:

|mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option) |

I believe these errors occur because mpiexec is being run with Intel MPI, when mpirun should be used instead.

I can also confirm that exporting the environment variable, export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/libpmi.so, breaks mpirun. If the variable is set, mpirun -n 24 -ppn 8 -f ~/machines.LINUX ~/QE530-CPU/espresso-5.3.0/bin/pw.x fails. If it is unset, mpirun works again.
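The set/unset comparison can be reproduced with a small wrapper that runs the same command twice and prints both exit statuses. This is only a sketch: the function name compare_pmi_setting is hypothetical, and each run happens in a subshell so the calling shell's environment is left untouched.

```shell
# Hypothetical helper: run a command once with I_MPI_PMI_LIBRARY set to the
# given path and once with it unset, printing the exit status of each run.
compare_pmi_setting() {
    pmi_lib=$1
    shift
    # First run: the variable is exported only inside the subshell.
    ( export I_MPI_PMI_LIBRARY="$pmi_lib"; "$@" ) && rc=0 || rc=$?
    echo "I_MPI_PMI_LIBRARY set: exit status $rc"
    # Second run: the variable is removed inside the subshell.
    ( unset I_MPI_PMI_LIBRARY; "$@" ) && rc=0 || rc=$?
    echo "I_MPI_PMI_LIBRARY unset: exit status $rc"
}

# Example call, using the mpirun command from above:
#   compare_pmi_setting /usr/lib/x86_64-linux-gnu/libpmi.so \
#       mpirun -n 24 -ppn 8 -f ~/machines.LINUX ~/QE530-CPU/espresso-5.3.0/bin/pw.x
```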

How can I correct the problem?

--
PhD. Research Fellow,
Dept. of Physics & Materials Science,
City University of Hong Kong
Tel: +852 3442 4000
Fax: +852 3442 0538

_______________________________________________
Pw_forum mailing list
[email protected]
http://pwscf.org/mailman/listinfo/pw_forum
