[slurm-dev] Re: Memory Issues

S. Aravindan Thu, 23 May 2013 10:58:15 -0700

I guess so. I am running Slurm 2.5.4. My slurm.conf is attached to
this mail.


--Semparithi

+++ On 10:34 23 May Paul Edmon wrote:
> 
> Hmm, maybe it's the ThreadsPerCore?  Perhaps it thinks there are half as 
> many cores as there really are due to the ThreadsPerCore. Thus if you use 
> --mem-per-cpu it will only give you half, as it only counts cores, 
> not threads*cores?
> 
> -Paul Edmon-
> 
> On 05/23/2013 01:31 PM, S. Aravindan wrote:
> > I was about to post a similar query. A Gaussian 09 job is killed when its
> > memory consumption exceeds half the memory available on a node
> > when --mem-per-cpu is used, but the job runs when --mem is used.  The
> > relevant lines from slurm.conf are below.
> >
> > NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000
> > NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Feature=96g
> >
> > Any suggestion is welcome.
> >
> > --Semparithi
> >
> >
> > +++ On 09:41 23 May Paul Edmon wrote:
> >> I have a user who is running a problem which uses 512 GB of memory. She
> >> requested this from SLURM on a node which has this much.  However, her code
> >> dies:
> >>
> >> slurmd[holy2b09101]: error: Job 6497 exceeded 268435456 KB memory limit, being killed
> >> slurmd[holy2b09101]: error: Exceeded job memory limit
> >> slurmd[holy2b09101]: error: *** JOB 6497 CANCELLED AT 2013-05-23T00:53:31 ***
> >>
> >> This is half of the 512 GB which was requested.  Is there something I am 
> >> missing?  The nodes in question have:
> >>
> >> NodeName=DEFAULT CPUs=64 RealMemory=529247 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> >>
> >> These are AMD Abu Dhabi processors with 8 GB per core, so 512 GB total.  
> >> She is requesting 8 GB per cpu and is asking for 64 cores.  Thoughts?
> >>
> >> -Paul Edmon-
> > -- Semparithi Aravindan

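The figures quoted in the thread are at least consistent with Paul's guess. Below is a minimal Python sketch of the arithmetic, assuming the enforced limit is simply "number of CPUs counted" times the --mem-per-cpu value. That model is an assumption for illustration, not confirmed Slurm behaviour, and the variable and function names are made up here. The input values are copied from the log and the NodeName line above.

```python
# Sanity check of the numbers quoted in the thread.
# ASSUMPTION: limit = (CPUs counted by the scheduler) * (memory per CPU).
KB_PER_GB = 1024 * 1024

limit_kb = 268435456          # from the slurmd error message
limit_gb = limit_kb // KB_PER_GB

sockets = 4                   # from the NodeName=DEFAULT line
cores_per_socket = 8
threads_per_core = 2
mem_per_cpu_gb = 8            # the user's --mem-per-cpu request

cores = sockets * cores_per_socket        # physical cores
threads = cores * threads_per_core        # hardware threads (CPUs=64)


def job_limit_gb(cpus_counted, per_cpu_gb):
    """Hypothetical model: memory limit if `cpus_counted` CPUs are charged."""
    return cpus_counted * per_cpu_gb


print("limit reported by slurmd (GB):", limit_gb)
print("limit if cores are counted   :", job_limit_gb(cores, mem_per_cpu_gb))
print("limit if threads are counted :", job_limit_gb(threads, mem_per_cpu_gb))
```

The reported limit matches the "cores counted" case and is half of the "threads counted" case, which is what one would expect if only cores, not cores*threads, enter the calculation.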
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=abacus
ControlAddr=172.31.1.100
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
Epilog=/etc/slurm/slurm.epilog.clean
JobCheckpointDir=/home/slurm/
MpiDefault=none
MpiParams=ports=12000-12999
PluginDir=/usr/lib64/slurm
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
StateSaveLocation=/home/slurm/
SlurmdSpoolDir=/tmp/slurm
SlurmUser=slurm
SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched
TmpFS=/scratch
InactiveLimit=0
KillWait=30
MinJobAge=300
OverTimeLimit=10
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
AccountingStorageEnforce=limits
AccountingStorageHost=abacus
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=abacus
JobCompHost=abacus
JobCompLoc=slurm_acct_db
JobCompPass=****
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Weight=50
NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Weight=100 Feature=96g
PartitionName=short Nodes=node[01-04] Default=YES DefaultTime=60 MaxTime=360 State=UP Shared=NO DefMemPerCPU=1024 MaxNodes=2
PartitionName=long Nodes=node[05-30] Default=NO DefaultTime=60 MaxTime=4320 State=UP Shared=NO DefMemPerCPU=1024 MaxNodes=2
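
To see how the node definitions above relate to the "half the memory" symptom, here is a small Python sketch that splits a NodeName line into its Key=Value fields and reports the memory available per hardware thread and per physical core. The helpers parse_node_line and summarize are hypothetical names introduced for illustration. The sketch only does arithmetic on the configured values (RealMemory is in megabytes) and does not reproduce how Slurm itself computes limits.

```python
def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' slurm.conf line into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())


def summarize(line):
    """Return core/thread counts and RealMemory (MB) per thread and per core."""
    node = parse_node_line(line)
    cores = int(node["Sockets"]) * int(node["CoresPerSocket"])
    threads = cores * int(node["ThreadsPerCore"])
    real_memory_mb = int(node["RealMemory"])
    return {
        "name": node["NodeName"],
        "cores": cores,
        "threads": threads,
        "mb_per_thread": real_memory_mb // threads,
        "mb_per_core": real_memory_mb // cores,
    }


# Node definitions copied from the slurm.conf above (unrelated fields omitted).
node_lines = [
    "NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24",
    "NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24",
]

for line in node_lines:
    print(summarize(line))
```

If a job asks for RealMemory/24 per CPU but only the 12 physical cores are charged, the resulting limit would be about half of the node's memory, which matches the behaviour described in the thread.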
