Your confusion seems to stem from a misunderstanding of job arrays, not of the
--nodes and --ntasks-per-node options.

Job arrays are basically a shortcut for submitting large numbers of similar jobs
at once.  The "sbatch array.sub" command basically submits 4 jobs, each job being
allocated 1 CPU core on each of 4 nodes.  Since the jobs are all single threaded,
only 1 core on the first node ever gets used.  Presumably your global
configuration defaults to allowing nodes to be shared, so all four jobs (one for
each array member) got assigned the same nodes, and when run they all reported
the same node name.

For this simple case, I would not bother with job arrays and would just run a
loop in array.sub.  Not sure if you need to background the srun or not.
Alternatively, if you insist on job arrays, reduce each array member to a single
task on a single node (--ntasks 1 should suffice).  You might need to add
--exclusive to force all four jobs to get sent to different nodes.
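
For illustration, a minimal sketch of the loop approach (untested; it assumes
the allocation requests one task per node, and that each srun step is
backgrounded so the steps run concurrently rather than serially):

```shell
#!/bin/bash
#SBATCH --job-name=hello_to
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --output="hello_%j.out"
#SBATCH --error="hello_%j.err"

names=(
"paul mccartney"
"john lennon"
"george harrison"
"ringo starr"
)

# Launch one single-task job step per name; with one task per node in the
# allocation, each step should land on a different node.  Background each
# srun so the steps run concurrently, then wait for all of them to finish.
for name in "${names[@]}"; do
    # $name intentionally unquoted so "paul mccartney" splits into the
    # two arguments hello_to.sh expects (firstname, lastname).
    srun --nodes=1 --ntasks=1 ./hello_to.sh $name &
done
wait
```

This gives one job (and one output file) instead of four, which may or may not
be what you want; the job-array version with --ntasks 1 keeps the separate
per-member output files.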

On Wed, 14 Jun 2017, Ariel Balter wrote:

NOTE: similar to, but not the same as,
https://stackoverflow.com/questions/39187072/what-does-the-option-nodes-in-slurm-do-with-sbatch.

I'm trying to understand what the options `--nodes` and `--ntasks-per-node` do
in SLURM.  I would have thought that if I run 4 tasks and specify `-N4` and
`--ntasks-per-node=1`, then each task would run on a different node.  That is
not what is happening.

I'm starting with a simple script that takes two arguments:

*hello_to.sh*

    #!/bin/bash

    firstname=$1
    lastname=$2

    echo "Hello to $firstname $lastname from $(hostname)"
    echo "It is currently $(date)"
    echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
    echo "SLURM_JOBID: $SLURM_JOBID"
    echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
    echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID"
    echo "All Done!"
    echo ""

And I call it in the following batch script:

*array.sub*

    #!/bin/bash

    #SBATCH --job-name=hello_to
    #SBATCH --array=0-3
    ##SBATCH -N4
    ##SBATCH --ntasks-per-node=1
    #SBATCH --output="hello_%A_%a_%j.out"
    #SBATCH --error="hello_%A_%a_%j.err"

    names=(
    "paul mccartney"
    "john lennon"
    "george harrison"
    "ringo starr"
    )

    srun hello_to.sh ${names[$SLURM_ARRAY_TASK_ID]}


When I run it, I get four output files that look like:

    $ for i in *.out; do echo "*** $i ***"; cat $i; done
    *** hello_277_0_278.out ***
    Hello to paul mccartney from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 278
    SLURM_ARRAY_TASK_ID: 0
    SLURM_ARRAY_JOB_ID: 277
    All Done!

    *** hello_277_1_279.out ***
    Hello to john lennon from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 279
    SLURM_ARRAY_TASK_ID: 1
    SLURM_ARRAY_JOB_ID: 277
    All Done!

    *** hello_277_2_280.out ***
    Hello to george harrison from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 280
    SLURM_ARRAY_TASK_ID: 2
    SLURM_ARRAY_JOB_ID: 277
    All Done!

    *** hello_277_3_277.out ***
    Hello to ringo starr from exanode-2-8
    It is currently Wed Jun 14 09:33:44 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 277
    SLURM_ARRAY_TASK_ID: 3
    SLURM_ARRAY_JOB_ID: 277
    All Done!

I wanted to see how I could make sure they all run on separate nodes. If I 
uncomment the line `#SBATCH -N4` then I get the following:

    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON)
                 289_0  exacloud hello_to   balter  R 0:05      4 exanode-4-44,exanode-6-[0-2]
                 289_1  exacloud hello_to   balter  R 0:05      4 exanode-4-44,exanode-6-[0-2]
                 289_2  exacloud hello_to   balter  R 0:05      4 exanode-4-44,exanode-6-[0-2]
                 289_3  exacloud hello_to   balter  R 0:05      4 exanode-4-44,exanode-6-[0-2]

    $ for i in *.out; do echo "*** $i ***"; cat $i; done
    *** hello_289_0_290.out ***
    Hello to paul mccartney from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 290
    SLURM_ARRAY_TASK_ID: 0
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    *** hello_289_1_291.out ***
    Hello to john lennon from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 291
    SLURM_ARRAY_TASK_ID: 1
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    *** hello_289_2_292.out ***
    Hello to george harrison from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 292
    SLURM_ARRAY_TASK_ID: 2
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    *** hello_289_3_289.out ***
    Hello to ringo starr from exanode-4-44
    It is currently Wed Jun 14 09:40:43 PDT 2017
    SLURM_JOB_NAME: hello_to
    SLURM_JOBID: 289
    SLURM_ARRAY_TASK_ID: 3
    SLURM_ARRAY_JOB_ID: 289
    All Done!

    $ for i in *.err; do echo "*** $i ***"; cat $i; done
    *** hello_289_0_290.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: Task launch for 290.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 290.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 290.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 290.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete
    *** hello_289_1_291.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 291.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 291.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 291.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 291.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete
    *** hello_289_2_292.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: Task launch for 292.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Task launch for 292.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 292.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 292.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete
    *** hello_289_3_289.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: Task launch for 289.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 289.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 289.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 289.0 ON exanode-4-44 CANCELLED AT 2017-06-14T09:40:43 ***
    srun: error: Timed out waiting for job step to complete


If I also uncomment the line `#SBATCH --ntasks-per-node=1` I get:

    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON)
                 294_0  exacloud hello_to   balter  R 0:03      4 exanode-4-44,exanode-6-[0-2]
                 294_1  exacloud hello_to   balter  R 0:03      4 exanode-4-44,exanode-6-[0-2]
                 294_2  exacloud hello_to   balter  R 0:03      4 exanode-4-44,exanode-6-[0-2]
                 294_3  exacloud hello_to   balter  R 0:03      4 exanode-4-44,exanode-6-[0-2]
    $ for i in *.out; do echo "*** $i ***"; cat $i; done
    *** hello_294_0_295.out ***
    *** hello_294_1_296.out ***
    *** hello_294_2_297.out ***
    *** hello_294_3_294.out ***
    $ for i in *.err; do echo "*** $i ***"; cat $i; done
    *** hello_294_0_295.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: Task launch for 295.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 295.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 295.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    *** hello_294_1_296.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 296.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 296.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 296.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    *** hello_294_2_297.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 297.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 297.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 297.0 failed on node exanode-6-0: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    *** hello_294_3_294.err ***
    srun: error: fwd_tree_thread: can't find address for host exanode-6-0, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-1, check slurm.conf
    srun: error: fwd_tree_thread: can't find address for host exanode-6-2, check slurm.conf
    srun: error: Task launch for 294.0 failed on node exanode-6-2: Can't find an address, check slurm.conf
    srun: error: Task launch for 294.0 failed on node exanode-6-1: Can't find an address, check slurm.conf
    srun: error: Task launch for 294.0 failed on node exanode-6-0: Can't find an a




Tom Payerle
DIT-ATI-Research Computing              [email protected]
4254 Stadium Dr                         (301) 405-6135
University of Maryland
College Park, MD 20742-4111
