Not so fast: it is still overcommitting cores. n007 has 2x6 = 12 cores, and slurm.conf has:

NodeName=n0[01-07] CoresPerSocket=6 Sockets=2 ThreadsPerCore=2
n008 has 2x10 = 20 cores, and slurm.conf has:

NodeName=n0[08-10] CoresPerSocket=10 Sockets=2 ThreadsPerCore=2

To submit a job across these two nodes without overcommitting, I request 12 + 20 = 32 tasks:

#SBATCH --ntasks=32 --ntasks-per-core=1 --nodelist=n00[7-8]

but that gives me 16 tasks on n007, not 12. What does work is:

#SBATCH -n 32 --cpus-per-task=2 --nodelist=n00[7-8]

Why doesn't the former approach work? With the default SelectType=select/linear it worked fine. The slurm.conf file is below; a minimal test script is sketched after it.

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=fission
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
#ap this works:
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
#SlurmdSpoolDir=/var/spool/slurmd   # fails, no permission
SlurmUser=slurm
#SlurmdUser=root
#ap this works:
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
#StateSaveLocation=/var/spool   # fails, no permission
SwitchType=switch/none
#TaskPlugin=task/none
TaskPlugin=task/affinity   # enable task affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
#SchedulerPort=7321
#ap original, node consumable resource, use SelectType=select/linear only
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/filetxt
ClusterName=SLURM_CLUSTER
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#JobAcctGatherType=jobacct_gather/linux
#ap inserted below
#JobCompType=jobcomp/filetxt
#JobCompLoc=/var/log/slurm/job_completions
#AccountingStorageLoc=/var/log/slurm/accounting
#ap
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#
#
# COMPUTE NODES
#FastSchedule=1
NodeName=n0[01-07] CoresPerSocket=6 Sockets=2 ThreadsPerCore=2
NodeName=n0[08-10] CoresPerSocket=10 Sockets=2 ThreadsPerCore=2
PartitionName=debug Nodes=n[001-010] Default=YES MaxTime=00:01:00 State=UP Shared=YES AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=YES Priority=1000
PartitionName=GPU Nodes=n0[01-05,08,10] Default=NO MaxTime=INFINITE State=UP Shared=YES:4 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Priority=50
PartitionName=DAY Nodes=n0[01-10] MaxNodes=1 Default=NO MaxTime=24:00:00 State=UP Shared=YES:4 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Priority=100
PartitionName=WEEK Nodes=n0[01-10] Default=NO MaxTime=5-00:00 State=UP Shared=YES:2 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Priority=10
PartitionName=UNLIM Nodes=n0[01-10] Default=NO MaxTime=INFINITE State=UP Shared=YES:2 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Priority=1
#FastSchedule=1
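
For reference, the working request boils down to a batch script of roughly this shape; the srun line is only my quick check of how many tasks land on each node, not part of the real job:

#!/bin/bash
#SBATCH --ntasks=32            # 12 physical cores on n007 + 20 on n008
#SBATCH --cpus-per-task=2      # 2 hardware threads per core, so each task gets one full core
#SBATCH --nodelist=n00[7-8]

# Count tasks per node; expecting 12 lines for n007 and 20 for n008.
srun hostname | sort | uniq -c

Since FastSchedule=0 means the controller goes by what slurmd actually reports for the hardware, "scontrol show node n007" is also worth a look to confirm the Sockets, CoresPerSocket and ThreadsPerCore it sees.
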
On Wed, Sep 23, 2015 at 10:38 AM, Andrew Petersen <[email protected]> wrote:

> Work perfectly, thanks!
>
> On Tue, Sep 22, 2015 at 12:27 AM, Christopher Samuel <[email protected]> wrote:
>
>> On 22/09/15 14:12, Andrew Petersen wrote:
>>
>> > Below is the whole thing:
>>
>> You don't define your nodes, other than:
>>
>> NodeName=n0[01-10]
>>
>> so my guess is it's defaulting to considering each node to have 1 core.
>>
>> In the slurm.conf for our new Haswell system we're bringing up with
>> 32 core nodes, most with 128GB RAM and 2 with 512GB RAM, we say:
>>
>> NodeName=DEFAULT CoresPerSocket=16 Sockets=2 RealMemory=125000 Weight=2
>> NodeName=snowy[001-008,010,012-031] NodeAddr=snowy[001-008,010,012-031]
>> NodeName=snowy[009,011] NodeAddr=snowy[009,011] RealMemory=500000 Weight=1000
>>
>> So the DEFAULT line says everything that is general and then for the
>> two larger memory nodes we override that.
>>
>> Hope this helps!
>>
>> Chris
>> --
>> Christopher Samuel        Senior Systems Administrator
>> VLSCI - Victorian Life Sciences Computation Initiative
>> Email: [email protected]   Phone: +61 (0)3 903 55545
>> http://www.vlsci.org.au/   http://twitter.com/vlsci
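
Following that DEFAULT-line pattern, my node definitions above could be collapsed to something like this (an untested sketch; RealMemory and Weight are left out because I don't set them):

NodeName=DEFAULT Sockets=2 ThreadsPerCore=2
NodeName=n0[01-07] CoresPerSocket=6
NodeName=n0[08-10] CoresPerSocket=10
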
