SLURM doesn’t seem to be tracking memory properly. Certain nodes go
into the “drain” state as soon as any job is submitted, even though no
memory is actually in use. Example:
# scontrol show node f1
NodeName=cv-hpcf1 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=f1 NodeHostName=f1 Version=14.03
   OS=Linux RealMemory=129023 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-09-26T08:56:17 SlurmdStartTime=2014-09-26T09:08:51
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory
SLURM reports the node's memory as low even though none is allocated
(AllocMem=0). I'm not sure how to proceed here.
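
One check that might narrow this down, as a rough sketch only: it assumes that "Low RealMemory" refers to the memory slurmd detects on the node being lower than the RealMemory value the controller has for it, rather than to allocated memory. The helper names real_memory_mb and compare_node_memory are made up for illustration, and the sketch assumes the node is reachable over ssh and that "slurmd -C" on this version prints a RealMemory field.

```shell
# Extract the first RealMemory value (in MB) from scontrol/slurmd-style
# "Key=Value" text read on stdin.
real_memory_mb() {
    grep -o 'RealMemory=[0-9]*' | head -n 1 | cut -d= -f2
}

# Hypothetical helper: print the value the controller reports for a node
# next to the value slurmd detects on that node.
# Assumes ssh access to the node and that slurmd is in its PATH.
compare_node_memory() {
    node="$1"
    configured=$(scontrol show node "$node" | real_memory_mb)
    detected=$(ssh "$node" slurmd -C | real_memory_mb)
    echo "node=${node} configured=${configured}MB detected=${detected}MB"
}
```

Running "compare_node_memory f1" would print both numbers side by side. If the detected value turns out to be smaller than the configured one, that mismatch, not job memory usage, would be the thing to look at.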
Thanks,
~Mike C.