Hi, I am unable to run batch jobs with my installation of OpenMPI and SLURM. I am not sure whether this is an OpenMPI issue or a SLURM issue, but here is what is happening on my little cluster (3 nodes: one login node and 2 backend nodes, each with 2 dual-core CPUs). If I run
    salloc -n 8 mpirun -np 8 myprog

both backend nodes get allocated (with their total of 8 cores) and myprog runs. If instead I run

    sbatch -n 8 zrun.sh

where zrun.sh contains

    #!/bin/bash
    mpirun -np 8 myprog

both backend nodes again get allocated, but the job does not run. In top I see one mpirun and two srun processes on the first backend node, but they just seem to be sitting there. On the other backend node I see no mpirun, srun, or anything else which might have been started as a result of the batch job. Is this the correct way to initiate SLURM batch jobs with OpenMPI?

I also see the following errors in the SLURM log of the second backend node:

May 26 16:15:21 localhost slurmd[2665]: launch task 82.0 request from 1001.1001@127.0.0.1 (port 21721)
May 26 16:15:21 localhost slurmstepd[2747]: jobacct NONE plugin loaded
May 26 16:15:21 localhost slurmstepd[2747]: error: connect io: Connection refused
May 26 16:15:21 localhost slurmd[node21][2747]: error: IO setup failed: Connection refused
May 26 16:15:21 localhost slurmd[node21][2747]: error: job_manager exiting abnormally, rc = 4020
May 26 16:15:21 localhost slurmd[node21][2747]: done with job

The job number assigned by SLURM at submission was 82.

What am I doing incorrectly? Is it possible that something in my environment is not set up correctly?

Thanks,
Nayden Kambouchev
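
P.S. In case it helps, here is a minimal sketch of the batch script I have been experimenting with. The echo lines are only diagnostics I added to dump the SLURM environment inside the job; the variable names (SLURM_JOB_ID, SLURM_NODELIST, SLURM_NTASKS) are just the ones I believe SLURM sets, and they may differ in my version.

    #!/bin/bash
    # zrun.sh - minimal batch script; echo lines are diagnostics only
    echo "job id:   ${SLURM_JOB_ID:-unset}"
    echo "nodelist: ${SLURM_NODELIST:-unset}"
    echo "ntasks:   ${SLURM_NTASKS:-unset}"
    # mpirun is expected to pick up the node allocation from the SLURM environment
    mpirun -np 8 myprog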