I think the difference in memory may be your issue. I vaguely recall hitting something similar when I set up our cluster. If slurm.conf lists a RealMemory value higher than what the node itself reports, you get this problem.
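A quick sketch of what I mean, assuming the "slurmd -C" output you pasted (RealMemory=3069693) is what smp3 actually reports: lower the configured value to at or below that number, e.g.

  NodeName=smp3 Sockets=8 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=3069693

then push the change out and clear the drain, something like

  # scontrol reconfigure     (or restart slurmctld/slurmd if your version needs it for node definition changes)
  # scontrol update NodeName=smp3 State=Resume

On the MiB versus MB question: RealMemory is in MiB, and your free output shows 3143366112 KiB total. Dividing that by 1024 gives 3069693, which is exactly what slurmd -C reports, while dividing by 1000 gives the 3143366 you have in slurm.conf. So the configured value looks a couple of percent too high, which would explain the "Low RealMemory" drain.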
On Tue, Dec 23, 2014 at 2:44 PM, SLIM H.A. <h.a.s...@durham.ac.uk> wrote:
>
> This is the output from slurmd on the node
>
> ClusterName=(null) NodeName=smp3 CPUs=96 Boards=1 SocketsPerBoard=8
> CoresPerSocket=12 ThreadsPerCore=1 RealMemory=3069693 TmpDisk=65767
> UpTime=35-20:56:06
>
> This is different from what I sent previously because I refined the entry for
> the node in the conf file
>
> NodeName=smp3 Sockets=8 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=3143366
>
> PartitionName=seq6.q nodes=smp3 State=UP MaxTime=INFINITE
>
> The amount of memory is different though, MiB versus MB?
>
> However, the problem is still there:
>
> # sinfo -o %10R%C
> PARTITION  CPUS(A/I/O/T)
> par6.q     0/1920/0/1920
> seq6.q     95/0/1/96
>
> # sinfo -R
> REASON               USER      TIMESTAMP            NODELIST
> Low RealMemory       slurm     2014-12-23T12:35:33  smp3
>
> One task has finished but no new one is started.
>
> Many thanks
> ________________________________________
> From: je...@schedmd.com [je...@schedmd.com]
> Sent: 23 December 2014 16:17
> To: slurm-dev
> Subject: [slurm-dev] Re: node returns to "Low RealMemory" state after some
> jobs finish
>
> Run "slurmd -C" on the node to see what slurm sees for resources on the node:
> $ /usr/local/sbin/slurmd -C
> ClusterName=(null) NodeName=tux123 CPUs=4 Boards=1 SocketsPerBoard=1
> CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7872 TmpDisk=112361
> UpTime=10-23:52:08
>
> Quoting "SLIM H.A." <h.a.s...@durham.ac.uk>:
>
>> Hello,
>>
>> One of the nodes (smp3, 96 cores) of our cluster is used for an
>> array job with 400 serial tasks. The slurm.conf setting is
>>
>> SelectType = select/cons_res
>> SelectTypeParameters = CR_CORE_MEMORY
>>
>> When a task is completed a new task is not started but the node is
>> put in a drng state and eventually the node is empty although there
>> are still tasks queued. The reason for the draining state appears to
>> be "Low RealMemory". Sample details here:
>>
>> # squeue
>> JOBID          PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
>> 235_[255-400]  seq6.q     lpj_arra  dcl0has  PD  0:00  1      (Resources)
>> # sinfo -o %10R%C
>> PARTITION  CPUS(A/I/O/T)
>> par6.q     0/1920/0/1920
>> seq6.q     0/0/96/96
>> # sinfo -R
>> REASON               USER      TIMESTAMP            NODELIST
>> Low RealMemory       slurm     2014-12-22T22:11:44  smp3
>>
>> However there should be by far enough free memory available (~3 TB)
>> on the node
>>
>> # free
>>              total        used        free      shared   buffers    cached
>> Mem:    3143366112    24353500  3119012612           0    345764   3249576
>> -/+ buffers/cache:    20758160  3122607952
>> Swap:     33554424           0    33554424
>>
>> Every time the node is in this state the command
>>
>> # scontrol update NodeName=smp3 State=Resume
>> # sinfo -R
>> REASON               USER      TIMESTAMP            NODELIST
>>
>> will make it accepting the next tasks.
>>
>> Is there any explanation for this behaviour?
>>
>> This is a line of output from the show node command
>>
>> OS=Linux RealMemory=3143366 AllocMem=0 Sockets=96 Boards=1
>>
>> Many thanks
>>
>> Henk
>
>
> --
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support