NOTE: This is similar to, but not the same as, this question
(https://stackoverflow.com/questions/39187072/what-does-the-option-nodes-in-slurm-do-with-sbatch).
I'm trying to understand what the options `--nodes` and
`--ntasks-per-node` do in SLURM. I would have thought that if I run
4 tasks and specify `-N4` and `--ntasks-per-node=1`, each task would
run on a different node. That is not what is happening.
I'm starting with a simple script that takes two arguments:
*hello_to.sh*
#!/bin/bash
firstname=$1
lastname=$2
echo "Hello to $firstname $lastname from $(hostname)"
echo "It is currently $(date)"
echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "SLURM_JOBID: $SLURM_JOBID"
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID"
echo "All Done!"
echo ""
And I call it in the following batch script:
*array.sub*
#!/bin/bash
#SBATCH --job-name=hello_to
#SBATCH --array=0-3
##SBATCH -N4
##SBATCH --ntasks-per-node=1
#SBATCH --output="hello_%A_%a_%j.out"
#SBATCH --error="hello_%A_%a_%j.err"
names=(
"paul mccartney"
"john lennon"
"george harrison"
"ringo starr"
)
srun hello_to.sh ${names[$SLURM_ARRAY_TASK_ID]}
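As a sanity check on the indexing in *array.sub*, the same lookup can be tried without a cluster. This is only a sketch: `SLURM_ARRAY_TASK_ID` is set by hand here to imitate what the scheduler does, and `show_args` is a hypothetical stand-in for *hello_to.sh*. It shows that leaving the expansion unquoted is what splits each name into two arguments.

```shell
#!/bin/bash
# Same names array as in array.sub.
names=(
  "paul mccartney"
  "john lennon"
  "george harrison"
  "ringo starr"
)

# Hypothetical stand-in for hello_to.sh: reports the arguments it received.
show_args() {
  echo "argc=$# first=$1 last=$2"
}

# Under SLURM the scheduler sets SLURM_ARRAY_TASK_ID; here it is assigned
# by hand (an assumption made only for local testing).
for SLURM_ARRAY_TASK_ID in 0 1 2 3; do
  # Left unquoted on purpose so "paul mccartney" becomes two arguments.
  show_args ${names[$SLURM_ARRAY_TASK_ID]}
done
```

Quoting the expansion (`"${names[$SLURM_ARRAY_TASK_ID]}"`) would instead pass the full name as a single argument, leaving `$2` empty.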
When I run it, I get four output files that look like:
$ for i in *.out; do echo "*** $i ***"; cat $i; done
*** hello_277_0_278.out ***
Hello to paul mccartney from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 278
SLURM_ARRAY_TASK_ID: 0
SLURM_ARRAY_JOB_ID: 277
All Done!
*** hello_277_1_279.out ***
Hello to john lennon from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 279
SLURM_ARRAY_TASK_ID: 1
SLURM_ARRAY_JOB_ID: 277
All Done!
*** hello_277_2_280.out ***
Hello to george harrison from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 280
SLURM_ARRAY_TASK_ID: 2
SLURM_ARRAY_JOB_ID: 277
All Done!
*** hello_277_3_277.out ***
Hello to ringo starr from exanode-2-8
It is currently Wed Jun 14 09:33:44 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 277
SLURM_ARRAY_TASK_ID: 3
SLURM_ARRAY_JOB_ID: 277
All Done!
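The file names above follow from the `--output` pattern: `%A` is the array's master job ID, `%a` is the array index, and `%j` is the job ID of the individual element. A minimal sketch that mimics the substitution (the `expand_pattern` helper is hypothetical, not a SLURM tool):

```shell
#!/bin/bash
# Hypothetical helper: imitates how sbatch fills in %A, %a and %j.
expand_pattern() {
  local pattern=$1 array_job_id=$2 task_id=$3 job_id=$4
  # The backslash keeps % literal inside the substitution pattern.
  pattern=${pattern//\%A/$array_job_id}
  pattern=${pattern//\%a/$task_id}
  pattern=${pattern//\%j/$job_id}
  echo "$pattern"
}

# Values taken from the first and last output files above.
expand_pattern "hello_%A_%a_%j.out" 277 0 278
expand_pattern "hello_%A_%a_%j.out" 277 3 277
```

This also matches what the files show: in this run the last array element (index 3) has a job ID equal to the array job ID, while the other elements get their own IDs.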
I wanted to see how I could make sure they all run on separate nodes. If
I uncomment the line `#SBATCH -N4` then I get the following:
$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
289_0  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
289_1  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
289_2  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
289_3  exacloud   hello_to  balter  R   0:05  4      exanode-4-44,exanode-6-[0-2]
$ for i in *.out; do echo "*** $i ***"; cat $i; done
*** hello_289_0_290.out ***
Hello to paul mccartney from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 290
SLURM_ARRAY_TASK_ID: 0
SLURM_ARRAY_JOB_ID: 289
All Done!
*** hello_289_1_291.out ***
Hello to john lennon from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 291
SLURM_ARRAY_TASK_ID: 1
SLURM_ARRAY_JOB_ID: 289
All Done!
*** hello_289_2_292.out ***
Hello to george harrison from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 292
SLURM_ARRAY_TASK_ID: 2
SLURM_ARRAY_JOB_ID: 289
All Done!
*** hello_289_3_289.out ***
Hello to ringo starr from exanode-4-44
It is currently Wed Jun 14 09:40:43 PDT 2017
SLURM_JOB_NAME: hello_to
SLURM_JOBID: 289
SLURM_ARRAY_TASK_ID: 3
SLURM_ARRAY_JOB_ID: 289
All Done!
$ for i in *.err; do echo "*** $i ***"; cat $i; done
*** hello_289_0_290.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: Task launch for 290.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 290.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 290.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.
slurmstepd: error: *** STEP 290.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
*** hello_289_1_291.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: Task launch for 291.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 291.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 291.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.
slurmstepd: error: *** STEP 291.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
*** hello_289_2_292.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: Task launch for 292.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Task launch for 292.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 292.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.
slurmstepd: error: *** STEP 292.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
*** hello_289_3_289.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: Task launch for 289.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 289.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 289.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.
slurmstepd: error: *** STEP 289.0 ON exanode-4-44 CANCELLED AT
2017-06-14T09:40:43 ***
srun: error: Timed out waiting for job step to complete
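To make the NODELIST column easier to compare with the errors, here is a rough sketch that expands the compressed node list. On a real cluster `scontrol show hostnames` does this properly; the `expand_nodelist` helper below is hypothetical and only handles the simple `prefix[lo-hi]` form seen in this output (no comma-separated ranges inside the brackets).

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): expands only "prefix[lo-hi]".
expand_nodelist() {
  local list=$1 item i
  local re='^(.*)\[([0-9]+)-([0-9]+)\]$'
  local -a items
  # Split on commas with read so the brackets are never glob-expanded.
  IFS=, read -ra items <<< "$list"
  for item in "${items[@]}"; do
    if [[ $item =~ $re ]]; then
      # BASH_REMATCH[1] is the prefix, [2] and [3] are the range bounds.
      for ((i = BASH_REMATCH[2]; i <= BASH_REMATCH[3]; i++)); do
        echo "${BASH_REMATCH[1]}$i"
      done
    else
      echo "$item"
    fi
  done
}

expand_nodelist 'exanode-4-44,exanode-6-[0-2]'
```

The expansion gives four hosts for every array element: exanode-4-44, where each script actually printed its output, plus exanode-6-0, exanode-6-1 and exanode-6-2, which are exactly the three hosts named in the "can't find address" errors.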
If I also add in `#SBATCH --ntasks-per-node=1` I get:
$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
294_0  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
294_1  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
294_2  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
294_3  exacloud   hello_to  balter  R   0:03  4      exanode-4-44,exanode-6-[0-2]
$ for i in *.out; do echo "*** $i ***"; cat $i; done
*** hello_294_0_295.out ***
*** hello_294_1_296.out ***
*** hello_294_2_297.out ***
*** hello_294_3_294.out ***
$ for i in *.err; do echo "*** $i ***"; cat $i; done
*** hello_294_0_295.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: Task launch for 295.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 295.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 295.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
*** hello_294_1_296.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: Task launch for 296.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 296.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 296.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
*** hello_294_2_297.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: Task launch for 297.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 297.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 297.0 failed on node exanode-6-0:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address,
check slurm.conf
*** hello_294_3_294.err ***
srun: error: fwd_tree_thread: can't find address for host
exanode-6-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-1, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host
exanode-6-2, check slurm.conf
srun: error: Task launch for 294.0 failed on node exanode-6-2:
Can't find an address, check slurm.conf
srun: error: Task launch for 294.0 failed on node exanode-6-1:
Can't find an address, check slurm.conf
srun: error: Task launch for 294.0 failed on node exanode-6-0:
Can't find an a