Hi

Is this behaviour expected?

I have a partition with 14 nodes, each with 16 CPUs:

PartitionName=d3
   AllocNodes=ALL AllowGroups=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 MaxCPUsPerNode=UNLIMITED
   Nodes=delta[43-56]
   Priority=100 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=224 TotalNodes=14 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

$ sinfo |grep d3
d3            up   infinite      1  down* delta44
d3            up   infinite     13  alloc delta[43,45-56]


delta44 died, but we can tolerate a dead node, so we submit jobs with
-N 13-14:

$ sbatch -N 13-14 -p d3 -c 4 sleep.com
sbatch: error: Batch job submission failed: Requested node configuration is not available

$ sbatch -N 13-14 -p d3 sleep.com
Submitted batch job 12287

$ sbatch -N 13-14 -p d3 -c 2 sleep.com
sbatch: error: Batch job submission failed: Requested node configuration is not available

$ sbatch -N 13-14 -p d3 -c 16 sleep.com
Submitted batch job 12290

$ sbatch -N 14 -p d3 -c 16 sleep.com
Submitted batch job 12291

$ sbatch -N 14 -p d3 -c 4 sleep.com
Submitted batch job 12292
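
For completeness, I assume explicitly excluding the dead node would
side-step this, though I haven't tried it here, and it defeats the point
of not having to know which node is down:

$ sbatch -N 13 -p d3 -c 4 --exclude=delta44 sleep.com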

Also, it would be nice to be able to say that you want to use all the
nodes in a partition but can tolerate a few dead ones, without having
to know how many nodes there actually are; perhaps something like
-N -2, i.e. minimum nodes is the maximum minus 2.
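
A rough, untested sketch of the kind of thing I mean, done from the
submitting shell for now (the partition name and the awk summing are
just illustrative):

# count the responding nodes in d3, then ask for count-2 .. count of them
$ avail=$(sinfo -h -r -p d3 -o %D | awk '{s+=$1} END {print s}')
$ sbatch -N "$((avail-2))-${avail}" -p d3 sleep.com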

Cheers,
