Dear Vendor,
What I want to do is run a large number of single-CPU tasks, and have them
distributed evenly over all allocated nodes, and to oversubscribe CPUs to tasks
(each task is very light on CPU resources).
Here is a small test script that allocates 2 nodes (16 CPUs per node on our
machines) and tries to distribute 32 tasks across those 32 CPUs:
#!/bin/csh
#SBATCH -n 32 -p short
set Vec = ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \
            17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 )
foreach frame ($Vec)
    cd $frame
    srun -n 1 -m cyclic a.out > output.txt &
    cd ..
end
wait
The hope was that 16 tasks would run on Node 1 and 16 tasks would run on
Node 2. Unfortunately, all 32 tasks get assigned to Node 1. I thought
-m cyclic was supposed to avoid this.
A note from the vendor suggested using the --exclusive flag, so I modified
my srun command to

srun --exclusive -N 1 -n 1 a.out > output.txt &
The problem is that it still assigns the tasks to Node 1, but waits for a
CPU to free up before starting the remaining 16. It still doesn't accomplish
the goal of distributing all 32 tasks over the 32 CPUs across 2 nodes.
Moreover, in the next step I want to oversubscribe tasks to nodes, and
--exclusive specifically waits for open CPUs before launching further tasks,
which wastes a great deal of time.
I have also experimented with the --overcommit option, but it made no
difference. Note that MAX_TASKS_PER_NODE in slurm.h is set adequately high.
It appears that the -m cyclic option only applies to multiple tasks launched
within a single job step. Is there a mechanism for submitting all 32 tasks
with a single srun command, in which case -m cyclic should hopefully
distribute them as intended?
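If such a mechanism exists, I imagine the job would look something like the
sketch below. (wrapper.sh is a hypothetical helper of mine, and I am assuming
srun exports SLURM_PROCID as each task's rank, 0 through ntasks-1.)

```shell
#!/bin/sh
# wrapper.sh -- hypothetical helper, launched once per task by
#   srun -n 32 -m cyclic ./wrapper.sh
# Each task maps its rank (SLURM_PROCID, which I assume srun sets)
# to the matching frame directory, numbered 1..32.
frame=$(( ${SLURM_PROCID:-0} + 1 ))
echo "task ${SLURM_PROCID:-0} -> directory $frame"
# The real helper would then do:
#   cd "$frame" && ./a.out > output.txt
```

With all 32 tasks launched in one step, my understanding is that -m cyclic
would then place them round-robin across the two nodes.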
Thank you for your time and any help or suggestions.
Best regards,
Lucas Koziol
Corporate Strategic Research
ExxonMobil Research and Engineering Co.
1545 US Route 22 East
Annandale, NJ, 08801
Tel: (908) 335-3411