Greetings,

I would like to write a script that launches several different tasks
simultaneously across multiple nodes, while restricting the number of tasks
that run on each node. The following is pseudo-code for what I currently have,
assuming W different tasks running on X nodes with Y cores per node, and a
maximum of Z tasks running on each node, where Z < Y:


    #SBATCH -t <walltime>
    #SBATCH -N X

    [BEGIN FOR LOOP]
        srun -N 1 -n 1 --ntasks-per-node Z --exclusive <ith command> &
    [END FOR LOOP]

    wait
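
To make the loop concrete, here is a minimal bash sketch of how I would write
it. It only prints the srun lines instead of running them, so it can be tried
outside an allocation. The values of W and Z and the command name "task_i" are
placeholders I made up for illustration, not my real workload.

```shell
#!/bin/bash
# Hypothetical values standing in for W (number of tasks) and Z (per-node limit).
W=5
Z=2

# Print, rather than execute, one srun line per task, followed by "wait".
# "task_<i>" is a placeholder for the ith command.
print_launch_lines() {
    local w=$1 z=$2 i
    for i in $(seq 1 "$w"); do
        echo "srun -N 1 -n 1 --ntasks-per-node $z --exclusive task_$i &"
    done
    echo "wait"
}

print_launch_lines "$W" "$Z"
```

In the real batch script the echo would be dropped so that each srun line is
actually executed in the background.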


I chose X based on the --ntasks-per-node limit passed to srun, that is,
assuming only Z tasks would run on each node. My intention was for srun to
populate each node with tasks up to the --ntasks-per-node maximum Z before
launching tasks on the next node. However, when I submitted this job with
sbatch and examined the allocated nodes while the job was running, I found that
srun launched a task on every core of a node before moving on to the next node.
In other words, the --ntasks-per-node option on the srun command does not work
as I expected: Y tasks were launched on each node until the desired number of
tasks had been launched. As a consequence, several nodes at the end of the
nodelist were allocated for the job but left empty.
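
To illustrate the mismatch with numbers, here is a small sketch of the
arithmetic. It assumes X was chosen as the ceiling of W / Z, and the values of
W, Y and Z are hypothetical examples rather than the ones from my actual job.

```shell
#!/bin/bash
# ceil_div A B: integer ceiling of A / B.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical example values, not taken from the real job.
W=12   # total number of tasks
Y=8    # cores per node
Z=3    # intended maximum number of tasks per node

X=$(ceil_div "$W" "$Z")      # nodes requested, assuming Z tasks per node
used=$(ceil_div "$W" "$Y")   # nodes occupied if Y tasks land on each node
echo "requested=$X used=$used idle=$(( X - used ))"
```

With these example values, 4 nodes are requested but only 2 are occupied,
leaving 2 idle, which matches the pattern I observed.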


My guess is that the job turned out this way because the "--ntasks-per-node Z"
option applies to each invocation of srun individually: as long as no single
invocation launches more than Z tasks, the limit is never reached, and
successive invocations keep landing on the same node until it is full. Is this
correct?


Would the following modified pseudo-code accomplish my goal by setting a 
tasks-per-node limit for the entire allocation?


    #SBATCH -t <walltime>
    #SBATCH -N X
    #SBATCH --ntasks-per-node Z

    [BEGIN FOR LOOP]
        srun -N 1 -n 1 --exclusive <ith command> &
    [END FOR LOOP]

    wait


My concern with this option is that, because I am not also specifying
"#SBATCH -n", the entire script will be launched Z times on each node of the
allocation. Is this true?


Is there a better way to go about this?


Thanks very much,


Tyler Jordan
