NOTE: This question is similar to, but not the same as, another one (https://stackoverflow.com/questions/39187072/what-does-the-option-nodes-in-slurm-do-with-sbatch).

I'm trying to understand what the options `--nodes` and `--ntasks-per-node` do in SLURM. I expected that if I run 4 tasks and specify `-N4` and `--ntasks-per-node=1`, each task would run on a different node. That is not what is happening.

I'm starting with a simple script that takes two arguments:

*hello_to.sh*

    #!/bin/bash

    firstname=$1
    lastname=$2

    echo "Hello to $firstname $lastname from $(hostname)"
    echo "It is currently $(date)"
    echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
    echo "SLURM_JOBID: $SLURM_JOBID"
    echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
    echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID"
    echo "All Done!"
    echo ""

And I call it in the following batch script:

*array.sub*

    #!/bin/bash

    #SBATCH --job-name=hello_to
    #SBATCH --array=0-3
    ##SBATCH -N4
    ##SBATCH --ntasks-per-node=1
    #SBATCH --output="hello_%A_%a_%j.out"
    #SBATCH --error="hello_%A_%a_%j.err"

    names=(
    "paul mccartney"
    "john lennon"
    "george harrison"
    "ringo starr"
    )

    srun hello_to.sh ${names[$SLURM_ARRAY_TASK_ID]}
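
The last line of *array.sub* relies on bash word splitting: the array element is expanded unquoted, so one entry such as "john lennon" becomes the two arguments that *hello_to.sh* reads as `$1` and `$2`. Here is a minimal sketch of that behaviour that runs without SLURM. The `show_args` function and the fallback value for `SLURM_ARRAY_TASK_ID` are assumptions added for illustration and are not part of the original scripts.

```shell
#!/bin/bash
# Outside SLURM this variable is not set, so fall back to 1 (assumption for a dry run).
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

names=(
"paul mccartney"
"john lennon"
"george harrison"
"ringo starr"
)

# Hypothetical stand-in for hello_to.sh: reports how many arguments it received.
show_args() {
    echo "argc=$# firstname=$1 lastname=$2"
}

# Unquoted: word splitting turns one array element into two arguments.
show_args ${names[$SLURM_ARRAY_TASK_ID]}

# Quoted: the whole element arrives as a single argument.
show_args "${names[$SLURM_ARRAY_TASK_ID]}"
```

With the fallback index of 1, the unquoted call reports `argc=2` with `john` and `lennon` separated, while the quoted call reports `argc=1` with an empty last name. That is why the batch script leaves the expansion unquoted.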


When I run it, I get four output files that look like:

    $ for i in *.out; do echo "*** $i ***"; cat $i; done
    *** hello_277_0_278.out ***
    Hello to paul mccartney from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 278
    SLURM_ARRAY_TASK_ID: 0
    SLURM_ARRAY_JOB_ID: 277
    All Done!

    *** hello_277_1_279.out ***
    Hello to john lennon from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 279
    SLURM_ARRAY_TASK_ID: 1
    SLURM_ARRAY_JOB_ID: 277
    All Done!

    *** hello_277_2_280.out ***
    Hello to george harrison from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 280
    SLURM_ARRAY_TASK_ID: 2
    SLURM_ARRAY_JOB_ID: 277
    All Done!

    *** hello_277_3_277.out ***
    Hello to ringo starr from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 277
    SLURM_ARRAY_TASK_ID: 3
    SLURM_ARRAY_JOB_ID: 277
    All Done!
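
All four array elements above report the same host. To see what each job was actually allocated, and not only where the script happened to run, something like the following could be called from *hello_to.sh*. This is only a sketch: `report_allocation` is a hypothetical helper, and the `unset` fallbacks are an addition so it also runs outside SLURM. The variables themselves (`SLURM_JOB_NUM_NODES`, `SLURM_JOB_NODELIST`, `SLURM_NTASKS`, `SLURM_NODEID`, `SLURM_PROCID`) are standard SLURM environment variables.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original hello_to.sh): prints what the
# current job was allocated and which task/node this process is.
# The ":-unset" fallbacks are an addition so the sketch also runs outside SLURM.
report_allocation() {
    echo "SLURM_JOB_NUM_NODES: ${SLURM_JOB_NUM_NODES:-unset}"
    echo "SLURM_JOB_NODELIST: ${SLURM_JOB_NODELIST:-unset}"
    echo "SLURM_NTASKS: ${SLURM_NTASKS:-unset}"
    echo "SLURM_NODEID: ${SLURM_NODEID:-unset}"
    echo "SLURM_PROCID: ${SLURM_PROCID:-unset}"
}

report_allocation
```

Comparing `SLURM_JOB_NODELIST` with the `hostname` line would show whether a job holds more nodes than the one its task printed from.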

I wanted to see how I could make sure they all run on separate nodes. If I uncomment the line `#SBATCH -N4` then I get the following:

    $ squeue
    JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
    289_0  exacloud hello_to balter  R  0:05     4 exanode-4-44,exanode-6-[0-2]
    289_1  exacloud hello_to balter  R  0:05     4 exanode-4-44,exanode-6-[0-2]
    289_2  exacloud hello_to balter  R  0:05     4 exanode-4-44,exanode-6-[0-2]
    289_3  exacloud hello_to balter  R  0:05     4 exanode-4-44,exanode-6-[0-2]

    $ for i in *.out; do echo "*** $i ***"; cat $i; done
    *** hello_289_0_290.out ***
    Hello to paul mccartney from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 290
    SLURM_ARRAY_TASK_ID: 0
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    *** hello_289_1_291.out ***
    Hello to john lennon from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 291
    SLURM_ARRAY_TASK_ID: 1
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    *** hello_289_2_292.out ***
    Hello to george harrison from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 292
    SLURM_ARRAY_TASK_ID: 2
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    *** hello_289_3_289.out ***
    Hello to ringo starr from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 289
    SLURM_ARRAY_TASK_ID: 3
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    $ for i in *.err; do echo "*** $i ***"; cat $i; done
    *** hello_289_0_290.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: Task launch for 290.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 290.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 290.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 290.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete
    *** hello_289_1_291.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 291.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 291.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 291.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 291.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete
    *** hello_289_2_292.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: Task launch for 292.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Task launch for 292.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 292.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 292.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete
    *** hello_289_3_289.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: Task launch for 289.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 289.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 289.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 289.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete


If I also enable the `#SBATCH --ntasks-per-node=1` line, I get:

    $ squeue
    JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
    294_0  exacloud hello_to balter  R  0:03     4 exanode-4-44,exanode-6-[0-2]
    294_1  exacloud hello_to balter  R  0:03     4 exanode-4-44,exanode-6-[0-2]
    294_2  exacloud hello_to balter  R  0:03     4 exanode-4-44,exanode-6-[0-2]
    294_3  exacloud hello_to balter  R  0:03     4 exanode-4-44,exanode-6-[0-2]

    $ for i in *.out; do echo "*** $i ***"; cat $i; done
    *** hello_294_0_295.out ***
    *** hello_294_1_296.out ***
    *** hello_294_2_297.out ***
    *** hello_294_3_294.out ***
    $ for i in *.err; do echo "*** $i ***"; cat $i; done
    *** hello_294_0_295.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: Task launch for 295.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 295.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 295.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    *** hello_294_1_296.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 296.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 296.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 296.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    *** hello_294_2_297.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 297.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 297.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 297.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    *** hello_294_3_294.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 294.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 294.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 294.0 failed on node exanode-6-0: Can't find an a
