Dear Vendor,

What I want to do is run a large number of single-CPU tasks, have them
distributed evenly over all allocated nodes, and oversubscribe CPUs with tasks
(each task is very light on CPU resources).

Here is a small test script that allocates 2 nodes (16 CPUs per node on our
machines) and tries to distribute 32 tasks over these 32 CPUs:

#!/bin/csh
#SBATCH -n 32 -p short

set Vec = ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \
            17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 )

foreach frame ($Vec)
    cd $frame
    srun -n 1 -m cyclic a.out > output.txt &
    cd ..
end
wait
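
For completeness, this is how I submit it (the script name run_test.csh is just what I call it locally):

```shell
# Submit the csh batch script above to the "short" partition;
# the #SBATCH line inside the script requests the 32 CPUs.
sbatch run_test.csh
```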

The hope was that 16 tasks would run on Node 1 and the other 16 on Node 2.
Unfortunately, what happens is that all 32 jobs get assigned to Node 1.
I thought -m cyclic was supposed to avoid this.

A note from the vendor suggested using the --exclusive flag. I therefore
modified my srun command to:

srun --exclusive -N 1 -n 1 a.out > output.txt &


The problem is that it still assigns the first 16 tasks to Node 1, then waits
until a CPU is free before launching the remaining 16. It still doesn't
accomplish the goal of distributing all 32 jobs over the 32 CPUs across the 2
nodes. Moreover, in the next step I want to oversubscribe tasks to nodes, and
--exclusive specifically waits for open CPUs before launching jobs, which
wastes a lot of time.

I have also experimented with the --overcommit option, but it has not made any
difference. Note that MAX_TASKS_PER_NODE in slurm.h is set adequately.

It appears that the -m cyclic option only applies to multiple tasks launched
within a single step. Is there a mechanism for submitting all 32 tasks with a
single srun command, at which point -m cyclic should hopefully fix everything?
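
In case it helps clarify what I'm after, here is a sketch of the single-srun launch I have in mind, based on my reading of the --multi-prog documentation (the multi.conf filename and its contents are only illustrative, and I may be misreading how %t substitution works):

```shell
# Launch all 32 tasks in ONE step so -m cyclic can spread them
# across both allocated nodes.
#
# multi.conf maps zero-based task IDs to commands, e.g.:
#   0-31   ./dir_%t/a.out
# where %t is replaced by the task number (0-31), so the per-task
# directories would need to be numbered to match (hypothetical layout).
srun -n 32 -m cyclic --multi-prog multi.conf
```

If this is the intended mechanism, a confirmation (or a pointer to the right approach) would be much appreciated.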

Thank you for your time and any help or suggestions.

Best regards,
Lucas Koziol


Corporate Strategic Research
ExxonMobil Research and Engineering Co.
1545 US Route 22 East
Annandale, NJ, 08801
Tel: (908) 335-3411
