My cluster currently has 41 CPUs available.  I can run srun -n41 hostname
without a problem, but srun -n42 hostname will not launch the job because
there are not enough resources.

If I use --overcommit, all of the tasks are launched on the same node, e.g.

srun -n42 -O hostname
0: host1
1: host1
...
41: host1

I tried --share, but that does not seem to make a difference:

srun -n42 -O -s -l hostname
0: host1
1: host1
...
41: host1
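
To see the distribution at a glance instead of scrolling through the
labeled output, something like the following could tally tasks per host.
This is only a sketch: count_tasks_per_host is a hypothetical helper name
(not part of Slurm), and it assumes the "N: hostname" label format that
srun -l prints above.

```shell
# Tally how many tasks ran on each host, reading labeled srun output
# ("<task id>: <hostname>") from stdin. Hypothetical helper, not a Slurm tool.
count_tasks_per_host() {
  # Split each line on ": " so $2 is the hostname, then count occurrences.
  awk -F': ' '{ count[$2]++ } END { for (h in count) print h, count[h] }' | sort
}

# Example usage on a cluster (assumes srun is available):
#   srun -n42 -O -l hostname | count_tasks_per_host
```

With the runs above, every task would be counted under host1.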

Is there a way to overcommit on all machines?

My relevant config:

SchedulerType           = sched/backfill
SelectType              = select/linear
