My cluster currently has 41 CPUs available. I can run srun -n41 hostname with no problem, but if I try srun -n42 hostname, it will not launch the job because there are not enough resources.
If I use --overcommit, it launches the whole job on a single node, e.g.:

    srun -n42 -O hostname
    0: host1
    1: host1
    ...
    41: host1

I tried --share, but that does not seem to make a difference:

    srun -n42 -O -s -l hostname
    0: host1
    1: host1
    ...
    41: host1

Is there a way to overcommit across all machines?

My relevant config:

    SchedulerType = sched/backfill
    SelectType = select/linear
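
To make the task placement easier to read than 42 lines of labelled output, the output can be tallied per host. This is only a minimal sketch: the function name count_tasks_per_host is hypothetical, and it assumes each input line has the form "<task-id>: <hostname>", as in the labelled output shown above.

```shell
# Tally how many tasks landed on each host.
# Assumption: input lines look like "<task-id>: <hostname>".
# count_tasks_per_host is a hypothetical helper name, not part of Slurm.
count_tasks_per_host() {
    awk -F': *' 'NF >= 2 { count[$2]++ }
                 END { for (h in count) print h, count[h] }' | sort
}

# Intended use on the cluster (not executed here):
#   srun -n42 -O -l hostname | count_tasks_per_host
```

If every task really does land on one node, this should print a single line for that host with the full task count; more than one line would mean the tasks were spread out.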