Hi, I am unable to run batch jobs with my installation of OpenMPI and SLURM. I am not sure whether this is an OpenMPI issue or a SLURM issue, but here is what is happening on my little cluster (3 nodes: one login node and 2 backend nodes, each with 2 dual-core CPUs). If I run
    salloc -n 8 mpirun -np 8 myprog

both backend nodes get allocated (with their total of 8 cores) and myprog runs. If instead I run

    sbatch -n 8 zrun.sh

where zrun.sh contains

    #!/bin/bash
    mpirun -np 8 myprog

both backend nodes again get allocated, but the job does not run. In top I see one mpirun and two srun processes on the first backend node, but they just seem to be sitting there. On the other backend node I see no mpirun, srun, or anything else which might have been started as a result of the batch job. Is this the correct way to initiate SLURM batch jobs with OpenMPI?

I also see the following errors in the SLURM log of the second backend node:

May 26 16:15:21 localhost slurmd[2665]: launch task 82.0 request from 1001.1001@127.0.0.1 (port 21721)
May 26 16:15:21 localhost slurmstepd[2747]: jobacct NONE plugin loaded
May 26 16:15:21 localhost slurmstepd[2747]: error: connect io: Connection refused
May 26 16:15:21 localhost slurmd[node21][2747]: error: IO setup failed: Connection refused
May 26 16:15:21 localhost slurmd[node21][2747]: error: job_manager exiting abnormally, rc = 4020
May 26 16:15:21 localhost slurmd[node21][2747]: done with job

The job number assigned by SLURM at submission was 82.

What am I doing incorrectly? Is it possible that something in my environment is not set up correctly?

Thanks,
Nayden Kambouchev
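
P.S. In case it helps, here is a minimal sketch of the batch script I have been experimenting with. The echo lines are only diagnostics I added to dump the SLURM environment inside the job; the variable names (SLURM_JOB_ID, SLURM_NODELIST, SLURM_NTASKS) are just the ones I believe SLURM sets, and they may differ in my version.

    #!/bin/bash
    # zrun.sh - minimal batch script; echo lines are diagnostics only
    echo "job id:   ${SLURM_JOB_ID:-unset}"
    echo "nodelist: ${SLURM_NODELIST:-unset}"
    echo "ntasks:   ${SLURM_NTASKS:-unset}"
    # mpirun is expected to pick up the node allocation from the SLURM environment
    mpirun -np 8 myprog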