Hello Felipe, can you send us the output of squeue after submitting the job? It seems that you are not specifying 1 CPU for the job, and the job is not bound to 1 CPU, so it can use all the resources on the compute node. Try enabling task affinity to bind the job to only 1 CPU; that way, even if the job spawns threads, it will be limited to just 1 CPU.
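For example (just a sketch, untested on your cluster, reusing the partition and script names from your message below): you could change the task plugin in slurm.conf from task/none to task/affinity on all nodes (you will likely need to restart slurmctld and the slurmd daemons after changing the plugin), and then request exactly one task on one CPU at submit time:

    # slurm.conf: bind each job step to the CPUs it was allocated
    TaskPlugin=task/affinity

    # submit asking explicitly for a single task on a single CPU
    sbatch -p sec4000 --qos=sec4000 --ntasks=1 --cpus-per-task=1 lanza09-1-b tres_forma1c_2-bis

With task/none, which is what you have configured now, slurmd does no CPU binding at all, so a multithreaded program can spread over both CPUs of the node even though only one was allocated to the job.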
Regards,
Carles Fenoy

On Wed, Jan 18, 2012 at 12:55 PM, luis <luis.r...@uam.es> wrote:
> Dear Danny:
>
> By the way, I'm new to using slurm.
>
> I have performed various tests and I have not managed to get slurm to enforce MaxCPUs.
>
> The cluster is made up of 40 nodes with two CPUs each.
>
> I want to limit jobs to one processor in order to run sequential jobs.
>
> The slurm configuration file is:
>
> --------------------------------------------------------------------------
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> #
> # Define the machine that will be the slurm master
> # and, if there is one, the machine that will be the backup
> #
> ControlMachine=alpha
> ControlAddr=192.168.123.5
> #
> # Define the authentication method
> #
> AuthType=auth/munge
> CacheGroups=1
> CryptoType=crypto/munge
> EnforcePartLimits=yes
> JobCredentialPrivateKey=/etc/slurm/private.key
> JobCredentialPublicCertificate=/etc/slurm/public.key
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/log/slurm/slurmd.spool
> SlurmUser=slurm
> StateSaveLocation=/var/log/slurm/log_slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> TopologyPlugin=topology/none
> #
> # TIMERS
> #InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=36000
> Waittime=0
> #
> # SCHEDULING
> #
> FastSchedule=1
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> PreemptMode=SUSPEND,GANG
> PreemptType=preempt/partition_prio
> #
> # JOB PRIORITY
> #PriorityType=priority/multifactor
> PriorityWeightAge=10000
> PriorityWeightJobSize=1000
> PriorityWeightQOS=10000
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageEnforce=limits,qos
> AccountingStorageHost=alpha
> AccountingStorageLoc=/var/log/slurm/accounting/tmp
> AccountingStorageType=accounting_storage/slurmdbd
> ClusterName=cccuam
> JobCompLoc=/var/log/slurm/job_completions
> JobCompType=jobcomp/slurmdbd
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> #
> # Logging
> #
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurm/slurmd.log
> SlurmSchedLogFile=/var/log/slurm/sched.log
> SlurmSchedLogLevel=1
> #
> # COMPUTE NODES
> #
> NodeName=calc[1-40] NodeAddr=192.168.123.[65-66] RealMemory=7932 Procs=2 State=UNKNOWN
> PartitionName=sec4000 Nodes=calc[1-40] MaxNodes=1 Priority=5 MaxTime=7200 MaxMemPerCPU=3966 Shared=No State=UP PreemptMode=requeue
> ----------------------------------------------------------------
>
> The slurmdbd configuration file is:
>
> -------------------------------------------------------------------
> # Authentication info
> AuthType=auth/munge
> ##
> # slurmDBD info
> DbdAddr=localhost
> DbdHost=localhost
> SlurmUser=slurm
> DebugLevel=7
> LogFile=/var/log/slurm/slurmdbd.log
> PidFile=/var/run/slurmdbd.pid
> #
> # Database info
> #
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> StorageUser=slurm
> StorageLoc=slurm_acct_db
> TrackWCKey=yes
> ArchiveDir="/tmp"
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveSteps=yes
> ArchiveSuspend=yes
> PurgeEventAfter=2mont
> PurgeJobAfter=12
> PurgeStepAfter=2days
> PurgeSuspendAfter=2hours
> -------------------------------------------------------------------
>
> The output of the command "sacctmgr show qos" is:
>
> slurm-2.3.2]# sacctmgr show qos
>       Name   Priority  GraceTime    Preempt PreemptMode                    Flags UsageThres  GrpCPUs GrpCPUMins GrpJobs GrpNodes GrpSubmit GrpWall  MaxCPUs MaxCPUMins MaxJobs MaxNodes MaxSubmit MaxWall
> ---------- ---------- ---------- ---------- ----------- ------------------------ ---------- -------- ----------- ------- -------- --------- ------- -------- ----------- ------- -------- --------- -------
>     normal          0                           cluster
>    sec4000          5                           cluster                                                                                  1        7200       4
>
> The slurm logs are:
>
> -------------------------------------------------------------------
> sched.log
> ...
> [2012-01-18T10:12:41] sched: Running job scheduler
> [2012-01-18T10:13:41] sched: Running job scheduler
> [2012-01-18T10:14:41] sched: Running job scheduler
> [2012-01-18T10:15:41] sched: Running job scheduler
> [2012-01-18T10:16:41] sched: Running job scheduler
> [2012-01-18T10:17:41] sched: Running job scheduler
> [2012-01-18T10:19:49] sched: JobId=156 allocated resources: NodeList=(null)
> [2012-01-18T10:19:49] sched: Running job scheduler
> [2012-01-18T10:19:49] sched: JobId=156 initiated
> [2012-01-18T10:19:49] sched: Allocate JobId=156 NodeList=calc2 #CPUs=1
>
> slurmctld.log
> ...
> [2012-01-18T10:19:41] debug: sched: Running job scheduler
> [2012-01-18T10:19:49] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=720
> [2012-01-18T10:19:49] debug2: initial priority for job 156 is 10500
> [2012-01-18T10:19:49] debug2: found 2 usable nodes from config containing calc[1-40]
> [2012-01-18T10:19:49] debug2: sched: JobId=156 allocated resources: NodeList=(null)
> [2012-01-18T10:19:49] _slurm_rpc_submit_batch_job JobId=156 usec=1085
> [2012-01-18T10:19:49] debug: sched: Running job scheduler
> [2012-01-18T10:19:49] debug2: found 2 usable nodes from config containing calc[1-40]
> [2012-01-18T10:19:49] sched: Allocate JobId=156 NodeList=calc2 #CPUs=1
> [2012-01-18T10:19:49] debug2: Spawning RPC agent for msg_type 4005
> [2012-01-18T10:19:49] debug2: got 1 threads to send out
> [2012-01-18T10:19:49] debug2: Tree head got back 0 looking for 1
> [2012-01-18T10:19:49] debug2: Tree head got back 1
> [2012-01-18T10:19:49] debug2: Tree head got them all
> [2012-01-18T10:19:49] debug2: node_did_resp calc2
> [2012-01-18T10:20:02] debug2: Testing job time limits and checkpoints
> [2012-01-18T10:20:15] debug: backfill: no jobs to backfill
> [2012-01-18T10:20:32] debug2: Testing job time limits and checkpoints
>
> slurmdbd.log
> ...
> [2012-01-18T10:15:54] debug2: DBD_FINI: CLOSE:1 COMMIT:0
> [2012-01-18T10:15:54] debug3: Write connection 10 closed
> [2012-01-18T10:15:54] debug2: Closed connection 10 uid(0)
> [2012-01-18T10:17:21] debug2: DBD_CLUSTER_CPUS: called for cccuam(2)
> [2012-01-18T10:17:21] debug3: we have the same cpu count as before for cccuam, no need to update the database.
> [2012-01-18T10:17:21] debug3: we have the same nodes in the cluster as before no need to update the database.
> [2012-01-18T10:19:54] debug2: DBD_JOB_START: START CALL ID:156 NAME:lanza09-1-b INX:0
> [2012-01-18T10:19:54] debug2: as_mysql_slurmdb_job_start() called
> [2012-01-18T10:19:54] debug3: found correct user
> [2012-01-18T10:19:54] debug3: found correct wckey 3
> [2012-01-18T10:19:54] debug3: 7(as_mysql_job.c:481) query
> insert into "cccuam_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, account, partition, wckey, node_inx) values (156, 10, 2, 3, 720, 407, 'calc2', 0, 7200, 1326878389, 1326878389, 1326878389, 'lanza09-1-b', 0, 1, 10500, 1, 1, 1, 'cccuam', 'sec4000', '**', '0') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=3, id_user=720, id_group=407, nodelist='calc2', id_resv=0, timelimit=7200, time_submit=1326878389, time_start=1326878389, job_name='lanza09-1-b', track_steps=0, id_qos=2, state=greatest(state, 1), priority=10500, cpus_req=1, cpus_alloc=1, nodes_alloc=1, account='cccuam', partition='sec4000', wckey='**', node_inx='0'
> -------------------------------------------------------------------
>
> The steps I have followed to set up MaxCPUs are:
>
> sacctmgr add cluster CCCUAM
> sacctmgr add account CCCUAM Cluster=CCCUAM Description="Usuarios CCC" Organization="UAM"
> sacctmgr add qos name=sec4000 priority=5 PreemptMode=suspend,gang MaxJobs=4 MaxCPUs=1
> sacctmgr add user lfelipe DefaultAccount=CCCUAM qos=sec4000 DefaultQOS=sec4000
>
> I launch the job:
>
> gaussian> sbatch -p sec4000 --qos=sec4000 lanza09-1-b tres_forma1c_2-bis
> Submitted batch job 154
>
> As I said, I want to limit each job to one CPU.
>
> To see if it works, I launch a job that requests two CPUs.
>
> Running top, we can see that the job takes 2 CPUs, although I have limited it to only one:
>
> top - 11:02:54 up 82 days, 1:16, 2 users, load average: 2.01, 2.04, 1.91
> Tasks: 124 total, 2 running, 122 sleeping, 0 stopped, 0 zombie
> Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 8123228k total, 5720980k used, 2402248k free, 388108k buffers
> Swap: 16787872k total, 184308k used, 16603564k free, 681296k cached
>
>  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> 4646 luisf  25  0 9350m 4.3g 4896 R 198.8 55.0 84:25.36 l502.exe
> 4614 luisf  15  0 72832 1764 1124 S  0.0  0.0  0:00.06 slurm_script
> 4644 luisf  16  0 90084  876  688 S  0.0  0.0  0:00.00 g09
> 4645 luisf  16  0 61216  720  608 S  0.0  0.0  0:00.09 tee
>
> Am I doing something wrong in the slurm configuration? Why is the job not limited by MaxCPUs?
>
> Sincerely,
>
> Luis Felipe Ruiz Nieto

--
Carles Fenoy