Hi,

After an all_nodes reservation for maintenance, a couple of jobs didn't start.
Instead, they complain about BadConstraints.
Since they are clones of jobs that ran perfectly before, this is puzzling.

Looking deeper into the job details, I believe I have found what causes this,
but the underlying reason is still unclear:

# scontrol show job 12345
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   NumNodes=2 NumCPUs=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1
   MinCPUsNode=32 MinMemoryNode=0 MinTmpDiskNode=0
 
The user had requested 2 (NumNodes) nodes, with a total of 32 (NumCPUs) cores.
This is fine, since the nodes have 16 cores each.
The TRES part looks strange, though: the total CPU count is still correct, but
the number of nodes has been reduced to 1.
As a consequence, a matching node would have to provide 32 (MinCPUsNode) cores,
which is impossible to fulfil.
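
For context, the submission presumably contained directives along these lines
(reconstructed from the job record above, not taken from the actual batch script):

   #SBATCH --nodes=2
   #SBATCH --ntasks=32
   #SBATCH --cpus-per-task=1

With 16-core nodes this should be easy to satisfy across two nodes, yet the job
record has collapsed it into a single-node, 32-CPU requirement.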

Attempts to change the values have failed so far:
# scontrol update job=12345 MinCPUsNode=16
returns without changing anything, and TRES cannot be modified.

Is there a way to adjust the values to make the jobs runnable again?
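For instance, would a hold/update/release cycle like the following be expected
to work on a pending job, or am I poking at the wrong fields? (Just a guess at
which knobs scontrol will actually let me touch - I haven't found documentation
on which of these values the scheduler re-evaluates.)

   # scontrol hold 12345
   # scontrol update job=12345 NumNodes=2 MinCPUsNode=16
   # scontrol release 12345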

What may have caused Slurm (which had not been stopped during the reservation)
to mangle these values?

Thanks
 S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~
