For some reason SLURM does not seem to be tracking memory properly.
Certain nodes go into the “drain” state as soon as any job is submitted,
even though no memory is actually in use. Example:

 

# scontrol show node f1
NodeName=cv-hpcf1 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=f1 NodeHostName=f1 Version=14.03
   OS=Linux RealMemory=129023 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-09-26T08:56:17 SlurmdStartTime=2014-09-26T09:08:51
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory

            

SLURM seems to think the node's memory is low even though none is allocated.
I'm not sure how to proceed here…
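
In case it helps anyone check several nodes at once, here is a minimal sketch of how the relevant fields could be pulled out of the output above. It only parses text pasted from `scontrol show node`; the helper names `parse_node` and `drained_but_idle` are made up for illustration and are not part of SLURM.

```python
import re

# Sample text copied from the `scontrol show node` output in this post.
SAMPLE = """\
NodeName=cv-hpcf1 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=f1 NodeHostName=f1 Version=14.03
   OS=Linux RealMemory=129023 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=Low RealMemory
"""


def parse_node(text):
    """Hypothetical helper: return a dict of Key=Value fields for one node."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Reason="):
            # The reason can contain spaces, so keep the rest of the line.
            fields["Reason"] = line[len("Reason="):].strip()
            continue
        for key, value in re.findall(r"(\w+)=(\S+)", line):
            fields[key] = value
    return fields


def drained_but_idle(fields):
    """Hypothetical check: node is draining while no memory is allocated."""
    state = fields.get("State", "")
    return "DRAIN" in state and int(fields.get("AllocMem", "0")) == 0


node = parse_node(SAMPLE)
print(node["NodeName"], node["State"], node["RealMemory"], node["AllocMem"])
print("Reason:", node["Reason"])
print("Drained with no memory allocated:", drained_but_idle(node))
```

In practice the text would come from running the same `scontrol` command for each node rather than from a pasted string.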

 

Thanks,
~Mike C.

 
