Your confusion seems to stem from a misunderstanding of job arrays, not of
the --nodes and --ntasks-per-node options.

Job arrays are basically a shortcut for submitting a large number of similar
jobs at once. With -N4 set, the "sbatch array.sub" command submits 4 jobs,
each job being allocated 1 CPU core on each of 4 nodes. Since the jobs are
all single-threaded, only 1 core on the first node ever gets used. Presumably
your global configuration defaults to sharing of nodes, so all four jobs (one
for each array member) got assigned the same nodes, and when run they all
reported the same node name.
For this simple case, I would not bother with job arrays and would just run
a loop in array.sub, calling srun once per name. I am not sure whether you
need to background each srun or not.
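A rough, untested sketch of what I mean by the loop, assuming hello_to.sh is
executable and sits in the directory you submit from. The step options
(--nodes=1 --ntasks=1 --exclusive), the trailing "&" and the final "wait" are
my assumptions about how to get the four steps running side by side; whether
they really land on four different nodes depends on your Slurm version and
configuration. The "echo" fallback is only there so the loop can be tried on
a machine without Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=hello_to
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --output="hello_%j.out"
#SBATCH --error="hello_%j.err"

names=(
  "paul mccartney"
  "john lennon"
  "george harrison"
  "ringo starr"
)

# Assumption: one single-task job step per name, each confined to one node.
# Without srun in PATH (off-cluster), print the command instead of running it.
if command -v srun >/dev/null 2>&1; then
  launch=(srun --nodes=1 --ntasks=1 --exclusive)
else
  launch=(echo srun --nodes=1 --ntasks=1 --exclusive)
fi

for name in "${names[@]}"; do
  # $name is left unquoted on purpose so it splits into first and last name.
  "${launch[@]}" ./hello_to.sh $name &
done

# Keep the batch script alive until every backgrounded step has finished.
wait
```

This is a single job rather than an array, so all four greetings end up in
one hello_<jobid>.out file.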
Alternatively, if you insist on job arrays, reduce each array member to a
single task on a single node (--ntasks 1 should suffice). You might need to
add --exclusive to force all four jobs to be sent to different nodes.
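Something along these lines is what I have in mind for the job array
variant. Again an untested sketch: the --nodes=1, --ntasks=1 and --exclusive
directives are the ones I suggest above, while the default of 0 for the array
index and the "echo" fallback are additions of mine so the script can be
dry-run outside Slurm. Note that --exclusive reserves a whole node for each
array member, which is wasteful for a single-core task.

```shell
#!/bin/bash
#SBATCH --job-name=hello_to
#SBATCH --array=0-3
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --exclusive
#SBATCH --output="hello_%A_%a_%j.out"
#SBATCH --error="hello_%A_%a_%j.err"

names=(
  "paul mccartney"
  "john lennon"
  "george harrison"
  "ringo starr"
)

# Outside Slurm the array index is unset; fall back to 0 for a dry run.
task_id=${SLURM_ARRAY_TASK_ID:-0}

# Split the selected entry into first and last name.
read -r firstname lastname <<< "${names[$task_id]}"

if command -v srun >/dev/null 2>&1; then
  srun ./hello_to.sh "$firstname" "$lastname"
else
  # No Slurm available: just show what would have been launched.
  echo "would run: ./hello_to.sh $firstname $lastname"
fi
```

Each array member is then its own one-node, one-task job, so the scheduler
is free to place the four of them on different nodes.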
On Wed, 14 Jun 2017, Ariel Balter wrote:
NOTE: similar but not same
(https://stackoverflow.com/questions/39187072/what-does-the-option-nodes-in-slurm-do-with-sbatch).
I'm trying to understand what the options `--nodes` and `--ntasks-per-node`
do in SLURM. I would have thought that if I run 4 tasks and specify `-N4`
and `--ntasks-per-node=1`, then each task would run on a different node.
That is not what is happening.
I'm starting with a simple script that takes two arguments:
*hello_to.sh*
#!/bin/bash
firstname=$1
lastname=$2
echo "Hello to $firstname $lastname from $(hostname)"
echo "It is currently $(date)"
echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "SLURM_JOBID: $SLURM_JOBID"
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID"
echo "All Done!"
echo ""
And I call it in the following batch script:
*array.sub*
#!/bin/bash
#SBATCH --job-name=hello_to
#SBATCH --array=0-3
##SBATCH -N4
##SBATCH --ntasks-per-node=1
#SBATCH --output="hello_%A_%a_%j.out"
#SBATCH --error="hello_%A_%a_%j.err"
names=(
"paul mccartney"
"john lennon"
"george harrison"
"ringo starr"
)
srun hello_to.sh ${names[$SLURM_ARRAY_TASK_ID]}
When I run it, I get four output files that look like:
$ for i in *.out; do echo "*** $i ***"; cat $i; done
*** hello_277_0_278.out ***
Hello to paul mccartney from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 278
SLURM_ARRAY_TASK_ID: 0
SLURM_ARRAY_JOB_ID: 277
All Done!
*** hello_277_1_279.out ***
Hello to john lennon from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 279
SLURM_ARRAY_TASK_ID: 1
SLURM_ARRAY_JOB_ID: 277
All Done!
*** hello_277_2_280.out ***
Hello to george harrison from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 280
SLURM_ARRAY_TASK_ID: 2
SLURM_ARRAY_JOB_ID: 277
All Done!
*** hello_277_3_277.out ***
Hello to ringo starr from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 277
SLURM_ARRAY_TASK_ID: 3
SLURM_ARRAY_JOB_ID: 277
All Done!
I wanted to see how I could make sure they all run on separate nodes. If I
uncomment the line `#SBATCH -N4` then I get the following:
$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
289_0  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
289_1  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
289_2  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
289_3  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
$ for i in *.out; do echo "*** $i ***"; cat $i; done
*** hello_289_0_290.out ***
Hello to paul mccartney from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 290
SLURM_ARRAY_TASK_ID: 0
SLURM_ARRAY_JOB_ID: 289
All Done!
*** hello_289_1_291.out ***
Hello to john lennon from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 291
SLURM_ARRAY_TASK_ID: 1
SLURM_ARRAY_JOB_ID: 289
All Done!
*** hello_289_2_292.out ***
Hello to george harrison from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 292
SLURM_ARRAY_TASK_ID: 2
SLURM_ARRAY_JOB_ID: 289
All Done!
*** hello_289_3_289.out ***
Hello to ringo starr from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 289
SLURM_ARRAY_TASK_ID: 3
SLURM_ARRAY_JOB_ID: 289
All Done!
$ for i in *.err; do echo "*** $i ***"; cat $i; done
*** hello_289_0_290.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: Task launch for 290.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 290.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 290.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 290.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
*** hello_289_1_291.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: Task launch for 291.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 291.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 291.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 291.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
*** hello_289_2_292.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: Task launch for 292.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Task launch for 292.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 292.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 292.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
*** hello_289_3_289.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: Task launch for 289.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 289.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 289.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 289.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
If I also uncomment `#SBATCH --ntasks-per-node=1` I get:
$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
294_0  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
294_1  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
294_2  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
294_3  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
$ for i in *.out; do echo "*** $i ***"; cat $i; done
*** hello_294_0_295.out ***
*** hello_294_1_296.out ***
*** hello_294_2_297.out ***
*** hello_294_3_294.out ***
$ for i in *.err; do echo "*** $i ***"; cat $i; done
*** hello_294_0_295.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: Task launch for 295.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 295.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 295.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
*** hello_294_1_296.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: Task launch for 296.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 296.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 296.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
*** hello_294_2_297.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: Task launch for 297.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 297.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 297.0 failed on node exanode-6-0: Can't find
an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
*** hello_294_3_294.err ***
srun: error: fwd_tree_thread: can't find address for host exanode-6-0,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-1,
check slurm.conf
srun: error: fwd_tree_thread: can't find address for host exanode-6-2,
check slurm.conf
srun: error: Task launch for 294.0 failed on node exanode-6-2: Can't find
an address, check slurm.conf
srun: error: Task launch for 294.0 failed on node exanode-6-1: Can't find
an address, check slurm.conf
srun: error: Task launch for 294.0 failed on node exanode-6-0: Can't find
an a
Tom Payerle
DIT-ATI-Research Computing [email protected]
4254 Stadium Dr (301) 405-6135
University of Maryland
College Park, MD 20742-4111