I guess so. The slurm version I use is 2.5.4. I have attached my slurm.conf with this mail.
--Semparithi

+++ On 10:34 23 May Paul Edmon wrote:
>
> Hmm, maybe its the ThreadsPerCore? Perhaps its thinks there are half as
> many core as there really are due to the ThreadsPerCore. Thus if you do
> the --mem-per-cpu it will only give you half, as it only counts cores
> not threads*cores?
>
> -Paul Edmon-
>
> On 05/23/2013 01:31 PM, S. Aravindan wrote:
> > I was about to post a similar query. Gaussian 09 job is killed when the
> > memory consumption exceeds half the amount of memory available on a node
> > when --mem-per-cpu is used but the job runs when --mem is used. The
> > relevant lines from slurm.conf is below.
> >
> > NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6
> > ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000
> > NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6
> > ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Feature=96g
> >
> > Any suggestion is welcome.
> >
> > --Semparithi
> >
> >
> > +++ On 09:41 23 May Paul Edmon wrote:
> >> I have a user that is running a problem which uses 512 GB of memory. She
> >> request this from SLURM on a node which has this much. However her code
> >> dies:
> >>
> >> slurmd[holy2b09101]: error: Job 6497 exceeded 268435456 KB memory limit,
> >> being killed
> >> slurmd[holy2b09101]: error: Exceeded job memory limit
> >> slurmd[holy2b09101]: error: *** JOB 6497 CANCELLED AT 2013-05-23T00:53:31
> >> ***
> >>
> >> This is half of the 512 GB which was requested. Is there something I am
> >> missing? The nodes in question have:
> >>
> >> NodeName=DEFAULT CPUs=64 RealMemory=529247 Sockets=4 CoresPerSocket=8
> >> ThreadsPerCore=2 State=UNKNOWN
> >>
> >> These are AMD Abu Dhabi processors with 8 GB per core, so 512 GB total.
> >> She is requesting 8 GB per cpu and is asking for 64 cores. Thoughts?
> >>
> >> -Paul Edmon-
> > -- Semparithi Aravindan
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=abacus
ControlAddr=172.31.1.100
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
Epilog=/etc/slurm/slurm.epilog.clean
JobCheckpointDir=/home/slurm/
MpiDefault=none
MpiParams=ports=12000-12999
PluginDir=/usr/lib64/slurm
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
StateSaveLocation=/home/slurm/
SlurmdSpoolDir=/tmp/slurm
SlurmUser=slurm
SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched
TmpFS=/scratch
InactiveLimit=0
KillWait=30
MinJobAge=300
OverTimeLimit=10
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
AccountingStorageEnforce=limits
AccountingStorageHost=abacus
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=abacus
JobCompHost=abacus
JobCompLoc=slurm_acct_db
JobCompPass=****
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Weight=50
NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Weight=100 Feature=96g
PartitionName=short Nodes=node[01-04] Default=YES DefaultTime=60 MaxTime=360 State=UP Shared=NO DefMemPerCPU=1024 MaxNodes=2
PartitionName=long Nodes=node[05-30] Default=NO DefaultTime=60 MaxTime=4320 State=UP Shared=NO DefMemPerCPU=1024 MaxNodes=2
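
As a quick sanity check of the guess in the quoted thread (that the per-CPU memory request is multiplied by the number of cores rather than threads*cores), the arithmetic can be sketched in Python. This is only a back-of-the-envelope calculation using the node definition and the log line Paul quoted. It is not a description of how SLURM actually computes the limit, and the function name `mem_limit_kb` is made up for this illustration.

```python
# Back-of-the-envelope check of the "cores vs. threads*cores" guess.
# All numbers come from the quoted thread; mem_limit_kb is a hypothetical
# helper, not a SLURM function.

KB_PER_GB = 1024 * 1024


def mem_limit_kb(cpus_counted, mem_per_cpu_gb):
    """Job memory limit in KB if the per-CPU request is multiplied by cpus_counted."""
    return cpus_counted * mem_per_cpu_gb * KB_PER_GB


# Node definition quoted by Paul:
# CPUs=64 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2
sockets, cores_per_socket, threads_per_core = 4, 8, 2
mem_per_cpu_gb = 8  # the user requested 8 GB per CPU

physical_cores = sockets * cores_per_socket          # 32
logical_cpus = physical_cores * threads_per_core     # 64

limit_if_threads_counted = mem_limit_kb(logical_cpus, mem_per_cpu_gb)
limit_if_cores_counted = mem_limit_kb(physical_cores, mem_per_cpu_gb)

observed_limit_kb = 268435456  # from the slurmd error message

print("limit if threads*cores are counted:", limit_if_threads_counted, "KB")
print("limit if only cores are counted:   ", limit_if_cores_counted, "KB")
print("matches slurmd log:", limit_if_cores_counted == observed_limit_kb)
```

The limit in the slurmd error equals 32 x 8 GB, which is consistent with only the physical cores being counted, though the arithmetic alone does not prove that this is the cause.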
