Luis,

The SLURM CPU Management Guide has several examples of CPU binding. One of these may match what you're trying to do. See http://www.schedmd.com/slurmdocs/cpu_management.html

Regards,
Martin
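[Editor's note: the binding examples in that guide are driven by srun's `--cpu_bind` option, which only takes effect when `TaskPlugin=task/affinity` is configured. A minimal sketch, assuming a job script named `job.sh` (the script name is illustrative):

```
# Request one task with one CPU and bind it to a core;
# "verbose" makes slurmd log the CPU mask it actually applied,
# which is a quick way to confirm binding is in effect.
srun -n1 --cpus-per-task=1 --cpu_bind=verbose,cores ./job.sh
```
]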
Re: [slurm-dev] Limit by number of CPUs by user
From: Danny Auble <[email protected]>
To: [email protected]
Cc: Carles Fenoy <[email protected]>, [email protected]
Date: 01/18/2012 12:29 PM

Luis,

Carles is right about the job not being bound. As you can see from the log output and from the insert into the database, SLURM is allocating only 1 CPU to your job. The problem is that the job isn't bound to any CPU, so it uses every CPU it can find. If you use task affinity to bind the job, things should work just as you would expect.

Danny

On 01/18/12 11:16, Carles Fenoy wrote:
> Hello Felipe,
>
> Can you send us the output of squeue after submitting the job?
> It seems that you don't specify 1 CPU for the job, and the job is not
> bound to 1 CPU. This way it can use all the resources on the compute
> node.
> Try enabling task affinity to bind the job to only 1 CPU. That way,
> even if the job opens threads, it will be limited to just 1 CPU.
>
> Regards,
>
> Carles Fenoy
>
>
> On Wed, Jan 18, 2012 at 12:55 PM, luis <[email protected]> wrote:
>> Dear Danny:
>>
>> By the way, I'm new to using SLURM.
>>
>> I have run several tests, but I have not managed to get SLURM to
>> enforce MaxCPUs.
>>
>> The cluster is made up of 40 nodes with two CPUs each.
>>
>> I want to limit jobs to one processor in order to run sequential jobs.
>>
>> The slurm configuration file is:
>>
>> --------------------------------------------------------------------------
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> #
>> # Define which machine will be the SLURM master
>> # and, if there is one, which will be the backup
>> #
>> ControlMachine=alpha
>> ControlAddr=192.168.123.5
>> #
>> # Define the authentication method
>> #
>> AuthType=auth/munge
>> CacheGroups=1
>> CryptoType=crypto/munge
>> EnforcePartLimits=yes
>> JobCredentialPrivateKey=/etc/slurm/private.key
>> JobCredentialPublicCertificate=/etc/slurm/public.key
>> MpiDefault=none
>> ProctrackType=proctrack/linuxproc
>> ReturnToService=1
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/log/slurm/slurmd.spool
>> SlurmUser=slurm
>> StateSaveLocation=/var/log/slurm/log_slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> TopologyPlugin=topology/none
>> #
>> # TIMERS
>> #InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=36000
>> Waittime=0
>> #
>> # SCHEDULING
>> #
>> FastSchedule=1
>> SchedulerType=sched/backfill
>> SchedulerPort=7321
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> PreemptMode=SUSPEND,GANG
>> PreemptType=preempt/partition_prio
>> #
>> # JOB PRIORITY
>> #PriorityType=priority/multifactor
>> PriorityWeightAge=10000
>> PriorityWeightJobSize=1000
>> PriorityWeightQOS=10000
>> #
>> # LOGGING AND ACCOUNTING
>> #
>> AccountingStorageEnforce=limits,qos
>> AccountingStorageHost=alpha
>> AccountingStorageLoc=/var/log/slurm/accounting/tmp
>> AccountingStorageType=accounting_storage/slurmdbd
>> ClusterName=cccuam
>> JobCompLoc=/var/log/slurm/job_completions
>> JobCompType=jobcomp/slurmdbd
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>> #
>> # Logging
>> #
>> SlurmctldDebug=6
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=5
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> SlurmSchedLogFile=/var/log/slurm/sched.log
>> SlurmSchedLogLevel=1
>> #
>> # COMPUTE NODES
>> #
>> NodeName=calc[1-40] NodeAddr=192.168.123.[65-66] RealMemory=7932 Procs=2 State=UNKNOWN
>> PartitionName=sec4000 Nodes=calc[1-40] MaxNodes=1 Priority=5 MaxTime=7200 MaxMemPerCPU=3966 Shared=No State=UP PreemptMode=requeue
>> ----------------------------------------------------------------
>>
>> The slurmdbd configuration file is:
>>
>> -------------------------------------------------------------------
>> # Authentication info
>> AuthType=auth/munge
>> ##
>> # slurmDBD info
>> DbdAddr=localhost
>> DbdHost=localhost
>> SlurmUser=slurm
>> DebugLevel=7
>> LogFile=/var/log/slurm/slurmdbd.log
>> PidFile=/var/run/slurmdbd.pid
>> #
>> # Database info
>> #
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> StorageUser=slurm
>> StorageLoc=slurm_acct_db
>> TrackWCKey=yes
>> ArchiveDir="/tmp"
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveSteps=yes
>> ArchiveSuspend=yes
>> PurgeEventAfter=2months
>> PurgeJobAfter=12
>> PurgeStepAfter=2days
>> PurgeSuspendAfter=2hours
>> -------------------------------------------------------------------
>>
>> The output of the command "sacctmgr show qos" is:
>>
>> slurm-2.3.2]# sacctmgr show qos
>>       Name  Priority  Flags    MaxCPUs  MaxCPUMins  MaxJobs
>>     normal         0  cluster
>>    sec4000         5  cluster        1        7200        4
>> (empty columns trimmed for readability)
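[Editor's note: the binding gap Carles and Danny describe maps to one line in the slurm.conf above, `TaskPlugin=task/none`, which disables binding entirely. A sketch of the change (the daemons need to be restarted afterwards for a TaskPlugin change):

```
# slurm.conf: replace task/none so each task is pinned to its
# allocated CPUs instead of being free to use the whole node
TaskPlugin=task/affinity
```
]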
>>
>> The SLURM logs are:
>>
>> -------------------------------------------------------------------
>> sched.log
>> ...
>> [2012-01-18T10:12:41] sched: Running job scheduler
>> [2012-01-18T10:13:41] sched: Running job scheduler
>> [2012-01-18T10:14:41] sched: Running job scheduler
>> [2012-01-18T10:15:41] sched: Running job scheduler
>> [2012-01-18T10:16:41] sched: Running job scheduler
>> [2012-01-18T10:17:41] sched: Running job scheduler
>> [2012-01-18T10:19:49] sched: JobId=156 allocated resources: NodeList=(null)
>> [2012-01-18T10:19:49] sched: Running job scheduler
>> [2012-01-18T10:19:49] sched: JobId=156 initiated
>> [2012-01-18T10:19:49] sched: Allocate JobId=156 NodeList=calc2 #CPUs=1
>>
>> slurmctld.log
>> ...
>> [2012-01-18T10:19:41] debug: sched: Running job scheduler
>> [2012-01-18T10:19:49] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=720
>> [2012-01-18T10:19:49] debug2: initial priority for job 156 is 10500
>> [2012-01-18T10:19:49] debug2: found 2 usable nodes from config containing calc[1-40]
>> [2012-01-18T10:19:49] debug2: sched: JobId=156 allocated resources: NodeList=(null)
>> [2012-01-18T10:19:49] _slurm_rpc_submit_batch_job JobId=156 usec=1085
>> [2012-01-18T10:19:49] debug: sched: Running job scheduler
>> [2012-01-18T10:19:49] debug2: found 2 usable nodes from config containing calc[1-40]
>> [2012-01-18T10:19:49] sched: Allocate JobId=156 NodeList=calc2 #CPUs=1
>> [2012-01-18T10:19:49] debug2: Spawning RPC agent for msg_type 4005
>> [2012-01-18T10:19:49] debug2: got 1 threads to send out
>> [2012-01-18T10:19:49] debug2: Tree head got back 0 looking for 1
>> [2012-01-18T10:19:49] debug2: Tree head got back 1
>> [2012-01-18T10:19:49] debug2: Tree head got them all
>> [2012-01-18T10:19:49] debug2: node_did_resp calc2
>> [2012-01-18T10:20:02] debug2: Testing job time limits and checkpoints
>> [2012-01-18T10:20:15] debug: backfill: no jobs to backfill
>> [2012-01-18T10:20:32] debug2: Testing job time limits and checkpoints
>>
>> slurmdbd.log
>> ...
>> [2012-01-18T10:15:54] debug2: DBD_FINI: CLOSE:1 COMMIT:0
>> [2012-01-18T10:15:54] debug3: Write connection 10 closed
>> [2012-01-18T10:15:54] debug2: Closed connection 10 uid(0)
>> [2012-01-18T10:17:21] debug2: DBD_CLUSTER_CPUS: called for cccuam(2)
>> [2012-01-18T10:17:21] debug3: we have the same cpu count as before for cccuam, no need to update the database.
>> [2012-01-18T10:17:21] debug3: we have the same nodes in the cluster as before no need to update the database.
>> [2012-01-18T10:19:54] debug2: DBD_JOB_START: START CALL ID:156 NAME:lanza09-1-b INX:0
>> [2012-01-18T10:19:54] debug2: as_mysql_slurmdb_job_start() called
>> [2012-01-18T10:19:54] debug3: found correct user
>> [2012-01-18T10:19:54] debug3: found correct wckey 3
>> [2012-01-18T10:19:54] debug3: 7(as_mysql_job.c:481) query
>> insert into "cccuam_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user,
>>   id_group, nodelist, id_resv, timelimit, time_eligible, time_submit,
>>   time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc,
>>   nodes_alloc, account, partition, wckey, node_inx)
>> values (156, 10, 2, 3, 720, 407, 'calc2', 0, 7200, 1326878389, 1326878389,
>>   1326878389, 'lanza09-1-b', 0, 1, 10500, 1, 1, 1, 'cccuam', 'sec4000', '**', '0')
>> on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=3,
>>   id_user=720, id_group=407, nodelist='calc2', id_resv=0, timelimit=7200,
>>   time_submit=1326878389, time_start=1326878389, job_name='lanza09-1-b',
>>   track_steps=0, id_qos=2, state=greatest(state, 1), priority=10500,
>>   cpus_req=1, cpus_alloc=1, nodes_alloc=1, account='cccuam',
>>   partition='sec4000', wckey='**', node_inx='0'
>> -------------------------------------------------------------------
>>
>> The steps I followed to set up MaxCPUs were:
>>
>> sacctmgr add cluster CCCUAM
>> sacctmgr add account CCCUAM Cluster=CCCUAM Description="Usuarios CCC" Organization="UAM"
>> sacctmgr add qos name=sec4000 priority=5 PreemptMode=suspend,gang MaxJobs=4 MaxCPUs=1
>> sacctmgr add user lfelipe DefaultAccount=CCCUAM qos=sec4000 DefaultQOS=sec4000
>>
>> I launch the job:
>>
>> gaussian> sbatch -p sec4000 --qos=sec4000 lanza09-1-b tres_forma1c_2-bis
>> Submitted batch job 154
>>
>> As I said, I want to limit each job to one CPU.
>>
>> To see if it works, I launch a job that requests two CPUs.
>>
>> Running top, we can see that the job uses 2 CPUs, although I have limited it to only one:
>>
>> top - 11:02:54 up 82 days, 1:16, 2 users, load average: 2.01, 2.04, 1.91
>> Tasks: 124 total, 2 running, 122 sleeping, 0 stopped, 0 zombie
>> Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Cpu1 : 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem:   8123228k total,  5720980k used,  2402248k free,   388108k buffers
>> Swap: 16787872k total,   184308k used, 16603564k free,   681296k cached
>>
>>   PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
>>  4646 luisf  25  0 9350m 4.3g 4896 R 198.8 55.0 84:25.36 l502.exe
>>  4614 luisf  15  0 72832 1764 1124 S   0.0  0.0  0:00.06 slurm_script
>>  4644 luisf  16  0 90084  876  688 S   0.0  0.0  0:00.00 g09
>>  4645 luisf  16  0 61216  720  608 S   0.0  0.0  0:00.09 tee
>>
>> Am I misconfiguring SLURM? Why is the job not limited by MaxCPUs?
>>
>> Sincerely,
>>
>> Luis Felipe Ruiz Nieto
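[Editor's note: once task affinity is enabled in slurm.conf, the submission above can request and bind its single CPU explicitly. A minimal batch-script sketch; `lanza09-1-b` and its argument come from the thread, while the `#SBATCH` directives and the inner srun line are illustrative assumptions about how the wrapper could be structured:

```
#!/bin/bash
#SBATCH -p sec4000
#SBATCH --qos=sec4000
#SBATCH -n 1                # one task
#SBATCH --cpus-per-task=1   # one CPU for that task
# Launch through srun so the task is bound to its one allocated CPU;
# even if the application (here Gaussian's l502.exe) starts extra
# threads, the affinity mask confines them to that core.
srun --cpu_bind=cores ./lanza09-1-b tres_forma1c_2-bis
```
]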
