I think the difference in memory may be your issue. I vaguely recall
running into something similar when I set up our cluster. If slurm.conf has a
RealMemory value higher than what the node actually sees, you get this problem.
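If that is the cause, the usual fix (as far as I remember) is to set RealMemory
in slurm.conf to no more than what "slurmd -C" reports on the node itself, so
for smp3 something like:

NodeName=smp3 Sockets=8 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=3069693

and then "scontrol reconfigure" (or a slurmctld restart) so the new value is
picked up. Treat the exact number as a guess based on the slurmd output you
posted below; anything at or below what slurmd reports should be safe.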

On Tue, Dec 23, 2014 at 2:44 PM, SLIM H.A. <h.a.s...@durham.ac.uk> wrote:
>
> This is the output from slurmd on the node
>
> ClusterName=(null) NodeName=smp3 CPUs=96 Boards=1 SocketsPerBoard=8 
> CoresPerSocket=12 ThreadsPerCore=1 RealMemory=3069693 TmpDisk=65767
> UpTime=35-20:56:06
>
> This is different from what I sent previously because I refined the entry for 
> the node in the conf file
>
> NodeName=smp3 Sockets=8 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=3143366
>
> PartitionName=seq6.q nodes=smp3    State=UP        MaxTime=INFINITE
>
> The amount of memory is different though, MiB versus MB?
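That MiB-versus-MB guess would explain the gap: dividing the total from your
free output by 1000 gives the value in your slurm.conf, while dividing by 1024
gives what slurmd -C reports:

3143366112 kB / 1000 ~= 3143366  (RealMemory in slurm.conf)
3143366112 kB / 1024 ~= 3069693  (RealMemory reported by slurmd -C)

so the configured value ends up a couple of percent above what the node
actually reports, which is enough to trip the "Low RealMemory" check.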
>
> However, the problem is still there:
>
> # sinfo -o %10R%C
> PARTITION CPUS(A/I/O/T)
> par6.q    0/1920/0/1920
> seq6.q    95/0/1/96
>
> # sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Low RealMemory       slurm     2014-12-23T12:35:33 smp3
>
> One task has finished but no new one has been started.
>
> Many thanks
> ________________________________________
> From: je...@schedmd.com [je...@schedmd.com]
> Sent: 23 December 2014 16:17
> To: slurm-dev
> Subject: [slurm-dev] Re: node returns to "Low RealMemory" state after some 
> jobs finish
>
> Run "slurmd -C" on the node to see what slurm sees for resources on the node:
> $ /usr/local/sbin/slurmd -C
> ClusterName=(null) NodeName=tux123 CPUs=4 Boards=1 SocketsPerBoard=1
> CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7872 TmpDisk=112361
> UpTime=10-23:52:08
>
> Quoting "SLIM H.A." <h.a.s...@durham.ac.uk>:
>
>> Hello,
>>
>> One of the nodes (smp3, 96 cores) of our cluster is used for an
>> array job with 400 serial tasks. The slurm.conf setting is
>>
>> SelectType              = select/cons_res
>> SelectTypeParameters    = CR_CORE_MEMORY
>>
>> When a task completes, no new task is started; instead the node is
>> put into a drng (draining) state and eventually sits empty even though
>> there are still tasks queued. The reason for the draining state appears
>> to be "Low RealMemory". Sample details here:
>>
>> # squeue
>> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>> 235_[255-400]    seq6.q lpj_arra  dcl0has PD       0:00      1 (Resources)
>> # sinfo -o %10R%C
>> PARTITION CPUS(A/I/O/T)
>> par6.q    0/1920/0/1920
>> seq6.q    0/0/96/96
>> # sinfo -R
>> REASON               USER      TIMESTAMP           NODELIST
>> Low RealMemory       slurm     2014-12-22T22:11:44 smp3
>>
>> However, there should be more than enough free memory available (~3 TB)
>> on the node:
>>
>> # free
>>              total       used       free     shared    buffers     cached
>> Mem:    3143366112   24353500 3119012612          0     345764    3249576
>> -/+ buffers/cache:   20758160 3122607952
>> Swap:     33554424          0   33554424
>>
>> Every time the node is in this state the command
>>
>> # scontrol update NodeName=smp3 State=Resume
>> # sinfo -R
>> REASON               USER      TIMESTAMP           NODELIST
>>
>> will make it accept the next tasks again.
>>
>> Is there any explanation for this behaviour?
>>
>> This is a line of output from "scontrol show node" for this node:
>>
>>  OS=Linux RealMemory=3143366 AllocMem=0 Sockets=96 Boards=1
>>
>> Many thanks
>>
>> Henk
>
>
> --
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support
