EV,

> Did you check the slurmd.log on the nodes and make sure the
> RealMemory for them on start up is less than what's defined in
> slurmd.conf?

I didn't do this unfortunately!  Feel free to jeer!

What I had done was configure the nodes in question by taking the value
reported by 'free -m', subtracting a GB, and setting the result as the
'RealMemory' value in slurm.conf.

Thank you for pointing this out, and my apologies if this was a basic
question.  I've updated the configuration and all is well after
changing the nodes' state to "IDLE".

I'll make sure to review the slurmd.log first before posting any more
questions, should they arise!

John DeSantis

2014-07-02 14:09 GMT-04:00 E V <[email protected]>:
>
> Did you check the slurmd.log on the nodes and make sure the
> RealMemory for them on start up is less than what's defined in
> slurmd.conf?
>
> On Wed, Jul 2, 2014 at 12:45 PM, John Desantis <[email protected]> wrote:
>>
>> Hello list!
>>
>> I asked this question in #slurm yesterday but didn't receive a
>> response, and I also wasn't able to find any insight via Google or the
>> Slurm site.
>>
>> Anyways, to the point!
>>
>> How does Slurm (14.03) determine when a node should be placed in a
>> "drain" state with the reason "Low RealMemory"?  I'm asking this
>> question because I have three nodes each having between 12-14 GB RAM
>> total, with "free" reporting between 7-10 GB as free.
>>
>> I'll paste some scontrol output below and corresponding entries from
>> slurm.conf.
>>
>> NodeName=sanitized_hostname[1] Arch=x86_64 CoresPerSocket=4
>> CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.53
>> Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
>> Gres=(null)
>> NodeAddr=sanitized_hostname[1] NodeHostName=sanitized_hostname[1]
>> Version=(null)
>> OS=Linux RealMemory=12929 AllocMem=0 Sockets=2 Boards=1
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
>> BootTime=2014-03-08T20:15:30 SlurmdStartTime=2014-07-02T12:29:17
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> Reason=Low RealMemory [root@2014-07-01T14:48:44]
>>
>> NodeName=sanitized_hostname[2] Arch=x86_64 CoresPerSocket=4
>> CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.54
>> Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
>> Gres=(null)
>> NodeAddr=sanitized_hostname[2] NodeHostName=sanitized_hostname[2]
>> Version=(null)
>> OS=Linux RealMemory=10909 AllocMem=0 Sockets=2 Boards=1
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
>> BootTime=2014-03-08T20:15:02 SlurmdStartTime=2014-07-02T12:29:17
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> Reason=Low RealMemory [root@2014-07-01T14:48:44]
>>
>> NodeName=sanitized_hostname[3] Arch=x86_64 CoresPerSocket=4
>> CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.71
>> Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
>> Gres=(null)
>> NodeAddr=sanitized_hostname[3] NodeHostName=sanitized_hostname[3]
>> Version=(null)
>> OS=Linux RealMemory=10909 AllocMem=0 Sockets=2 Boards=1
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
>> BootTime=2014-03-08T20:14:55 SlurmdStartTime=2014-07-02T12:29:17
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> Reason=Low RealMemory [root@2014-07-01T14:48:44]
>>
>> NodeName=sanitized_hostname[1] CPUs=8 CoresPerSocket=4 Sockets=2
>> RealMemory=12929 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
>> NodeName=sanitized_hostname[2-3] CPUs=8 CoresPerSocket=4 Sockets=2
>> RealMemory=10909 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
>>
>> Thanks for any help and/or insight!
>>
>> John DeSantis
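
For anyone who lands on this thread with the same "Low RealMemory" drain, here is a rough Python sketch of the comparison discussed above. It rests on an assumption, not on anything confirmed in the thread: that a node is drained when the memory slurmd detects at start up is lower than the RealMemory configured in slurm.conf. The function names and the sample MemTotal figure are hypothetical and are not part of Slurm; only the 12929 value comes from the slurm.conf entry quoted in the original message.

```python
import re


def mem_total_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in whole MB.

    Hypothetical helper for illustration; not a Slurm function.
    """
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) // 1024


def would_trigger_low_realmemory(configured_mb, detected_mb):
    """Assumption: the node drains when detected memory is below
    the RealMemory value configured in slurm.conf."""
    return detected_mb < configured_mb


# Illustrative input only: the MemTotal figure is made up,
# 12929 is the RealMemory from the slurm.conf entry in the thread.
sample = "MemTotal:       12190000 kB\nMemFree:         8000000 kB\n"
detected = mem_total_mb(sample)
print(detected, would_trigger_low_realmemory(12929, detected))
```

On a real node, running `slurmd -C` prints the hardware configuration that slurmd itself detects, including RealMemory, so that value can be compared against (or copied into) slurm.conf instead of estimating it by hand.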
