The slurmctld process on my primary control machine is using over 90% of the available memory (16GB). After restarting slurmctld, its memory utilization drops to only a few percent, but within 24 hours it is back to consuming over 90% of memory.

We are running Slurm 2.2.0 on RHEL 5.6 with backfill scheduling and the cons_res select plugin. Our jobs are all submitted with unlimited time limits and primarily rely on generic resources (GRES) and licenses for resource allocation. Each workflow consists of one long-running master process (consuming the master GRES on a node) that launches a number of parallel slave processes, each scheduled onto its own node.

We will generally have 40 running master processes, 50-100 pending master processes, 40 running slave processes, and 500+ pending slave processes. Slave processes are prioritized via nice values to ensure that those scheduled by the first-launched master processes jump to the front of the queue (so master jobs finish in launch order, in the shortest amount of time). A master process runs for 1+ hours (some finish 24+ hours after launch while waiting for resources to complete their slave jobs), while a single slave process generally completes in 5-20 minutes.
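For reference, the submission pattern looks roughly like this (script names and exact arguments are simplified; the GRES and license names are the ones from our slurm.conf):

```shell
# Illustrative sketch only -- the real scripts and counts differ.
# Launch a master job; it consumes the master GRES and a license:
sbatch --gres=master:1 --licenses=fcx:1 run_master.sh

# Each master then submits its slave jobs. Masters launched later use a
# larger --nice value, so slaves from earlier masters sort to the front
# of the pending queue:
sbatch --gres=slave:1 --nice=100 run_slave.sh
```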

I'm pretty sure that we are doing something wrong with our configuration or conops that is causing the excess memory consumption. However, I have not been able to track it down.

Thanks,
-Phil

Our slurm.conf (excuse any typos; this was transcribed from a printout):

ControlMachine=blade0204
ControlAddr=10.1.53.49
BackupController=blade0201
BackupAddr=10.1.53.146
AuthType=auth/munge
CacheGroups=1
CryptoType=crypto/munge
GresTypes=master,slave
Licenses=fcx*3,obc*6
MaxJobCount=3000
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=bin
StateSaveLocation=/gpfs/fs0/slurm
SwitchType=switch/none
TaskPlugin=task/none
HealthCheckInterval=60
HealthCheckProgram=/etc/slurm/healthcheck.sh
InactiveLimit=0
KillWait=30
MessageTimeout=90
MinJobAge=10
SlurmctldTimeout=90
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerParameters=max_job_bf=1000
SchedulerPort=7321
SelectType=select/cons_res
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3

NodeName=blade02[01-16] NodeAddr=10.1.153.[146-161] Procs=8 RealMemory=1600 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Gres=master:1,slave:1
NodeName=blade03[01-16] NodeAddr=10.1.153.[162-177] Procs=8 RealMemory=1600 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Gres=master:1,slave:1
NodeName=blade04[01-16] NodeAddr=10.1.153.[178-193] Procs=8 RealMemory=1600 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Gres=master:1,slave:1

PartitionName=clust Nodes=blade02[09-16],blade03[01-16],blade04[01-16] Default=YES MaxTime=INFINITE State=UP

PartitionName=clusttest Nodes=blade02[01-09] Default=NO MaxTime=INFINITE State=UP
