Hello Felipe,

Can you send us the output of squeue after submitting the job?
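
For example, something along these lines shows the per-job CPU count
(the format string is just a suggestion; %C prints the number of CPUs):

    squeue -o "%.8i %.9P %.8j %.8u %.2t %.10M %.4C %R"
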
It seems that you don't request 1 CPU for the job, and the job is not
bound to 1 CPU, so it can use all the resources on the compute node.
Try enabling task affinity to bind the job to only 1 CPU. That way, if
the job spawns threads, it will still be limited to just 1 CPU.
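
For reference, a minimal sketch of the relevant slurm.conf change (the
TaskPluginParam value is optional; Sched pins tasks with
sched_setaffinity):

    # replace TaskPlugin=task/none with:
    TaskPlugin=task/affinity
    TaskPluginParam=Sched

After restarting slurmd on the compute nodes, a job submitted with -n 1
should stay on the single CPU it was allocated, even if it starts extra
threads.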

Regards,

Carles Fenoy


On Wed, Jan 18, 2012 at 12:55 PM, luis <luis.r...@uam.es> wrote:
> Dear Danny:
>
> By the way, I'm new to using slurm.
>
> I have performed various tests, but I have not managed to get slurm to
> enforce MaxCPUs.
>
> The cluster is made up of 40 nodes with two CPUs each.
>
> I want to limit jobs to one processor in order to run sequential jobs.
>
> The slurm configuration file is:
>
> --------------------------------------------------------------------------
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> # Define the machine that will be the slurm master
> # and, if there is one, the machine that will be the backup
> #
> ControlMachine=alpha
> ControlAddr=192.168.123.5
> #
> # Define the authentication method
> #
> AuthType=auth/munge
> CacheGroups=1
> CryptoType=crypto/munge
> EnforcePartLimits=yes
> JobCredentialPrivateKey=/etc/slurm/private.key
> JobCredentialPublicCertificate=/etc/slurm/public.key
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/log/slurm/slurmd.spool
> SlurmUser=slurm
> StateSaveLocation=/var/log/slurm/log_slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> TopologyPlugin=topology/none
> #
> # TIMERS
> #InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=36000
> Waittime=0
> #
> # SCHEDULING
> #
> FastSchedule=1
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> PreemptMode=SUSPEND,GANG
> PreemptType=preempt/partition_prio
> #
> # JOB PRIORITY
> #PriorityType=priority/multifactor
> PriorityWeightAge=10000
> PriorityWeightJobSize=1000
> PriorityWeightQOS=10000
> #
> # LOGGING AND ACCOUNTING
> #
> AccountingStorageEnforce=limits,qos
> AccountingStorageHost=alpha
> AccountingStorageLoc=/var/log/slurm/accounting/tmp
> AccountingStorageType=accounting_storage/slurmdbd
> ClusterName=cccuam
> JobCompLoc=/var/log/slurm/job_completions
> JobCompType=jobcomp/slurmdbd
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> #
> # Logging
> #
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurm/slurmd.log
> SlurmSchedLogFile=/var/log/slurm/sched.log
> SlurmSchedLogLevel=1
> #
> # COMPUTE NODES
> #
> NodeName=calc[1-40] NodeAddr=192.168.123.[65-66] RealMemory=7932 Procs=2
> State=UNKNOWN
> PartitionName=sec4000 Nodes=calc[1-40] MaxNodes=1 Priority=5 MaxTime=7200
> MaxMemPerCPU=3966 Shared=No State=UP PreemptMode=requeue
> ----------------------------------------------------------------
>
> The slurmdbd configuration file is:
>
> -------------------------------------------------------------------
> # Authentication info
> AuthType=auth/munge
> ##
> # slurmDBD info
> DbdAddr=localhost
> DbdHost=localhost
> SlurmUser=slurm
> DebugLevel=7
> LogFile=/var/log/slurm/slurmdbd.log
> PidFile=/var/run/slurmdbd.pid
> #
> # Database info
> #
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> StorageUser=slurm
> StorageLoc=slurm_acct_db
> TrackWCKey=yes
> ArchiveDir="/tmp"
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveSteps=yes
> ArchiveSuspend=yes
> PurgeEventAfter=2months
> PurgeJobAfter=12
> PurgeStepAfter=2days
> PurgeSuspendAfter=2hours
> -------------------------------------------------------------------
>
> The output of "sacctmgr show qos" (empty columns trimmed) is:
>
> slurm-2.3.2]# sacctmgr show qos
>       Name Priority PreemptMode MaxCPUs MaxCPUMins MaxJobs
> ---------- -------- ----------- ------- ---------- -------
>     normal        0     cluster
>    sec4000        5     cluster       1       7200       4
>
>
> The slurm logs are:
>
> -------------------------------------------------------------------
> sched.log
> ...
> [2012-01-18T10:12:41] sched: Running job scheduler
> [2012-01-18T10:13:41] sched: Running job scheduler
> [2012-01-18T10:14:41] sched: Running job scheduler
> [2012-01-18T10:15:41] sched: Running job scheduler
> [2012-01-18T10:16:41] sched: Running job scheduler
> [2012-01-18T10:17:41] sched: Running job scheduler
> [2012-01-18T10:19:49] sched: JobId=156 allocated resources: NodeList=(null)
> [2012-01-18T10:19:49] sched: Running job scheduler
> [2012-01-18T10:19:49] sched: JobId=156 initiated
> [2012-01-18T10:19:49] sched: Allocate JobId=156 NodeList=calc2 #CPUs=1
>
> slurmctld.log
> ...
> [2012-01-18T10:19:41] debug: sched: Running job scheduler
> [2012-01-18T10:19:49] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB
> from uid=720
> [2012-01-18T10:19:49] debug2: initial priority for job 156 is 10500
> [2012-01-18T10:19:49] debug2: found 2 usable nodes from config
> containing calc[1-40]
> [2012-01-18T10:19:49] debug2: sched: JobId=156 allocated resources:
> NodeList=(null)
> [2012-01-18T10:19:49] _slurm_rpc_submit_batch_job JobId=156 usec=1085
> [2012-01-18T10:19:49] debug: sched: Running job scheduler
> [2012-01-18T10:19:49] debug2: found 2 usable nodes from config
> containing calc[1-40]
> [2012-01-18T10:19:49] sched: Allocate JobId=156 NodeList=calc2 #CPUs=1
> [2012-01-18T10:19:49] debug2: Spawning RPC agent for msg_type 4005
> [2012-01-18T10:19:49] debug2: got 1 threads to send out
> [2012-01-18T10:19:49] debug2: Tree head got back 0 looking for 1
> [2012-01-18T10:19:49] debug2: Tree head got back 1
> [2012-01-18T10:19:49] debug2: Tree head got them all
> [2012-01-18T10:19:49] debug2: node_did_resp calc2
> [2012-01-18T10:20:02] debug2: Testing job time limits and checkpoints
> [2012-01-18T10:20:15] debug: backfill: no jobs to backfill
> [2012-01-18T10:20:32] debug2: Testing job time limits and checkpoints
>
> slurmdbd.log
> ...
> [2012-01-18T10:15:54] debug2: DBD_FINI: CLOSE:1 COMMIT:0
> [2012-01-18T10:15:54] debug3: Write connection 10 closed
> [2012-01-18T10:15:54] debug2: Closed connection 10 uid(0)
> [2012-01-18T10:17:21] debug2: DBD_CLUSTER_CPUS: called for cccuam(2)
> [2012-01-18T10:17:21] debug3: we have the same cpu count as before for
> cccuam, no need to update the database.
> [2012-01-18T10:17:21] debug3: we have the same nodes in the cluster as
> before no need to update the database.
> [2012-01-18T10:19:54] debug2: DBD_JOB_START: START CALL ID:156
> NAME:lanza09-1-b INX:0
> [2012-01-18T10:19:54] debug2: as_mysql_slurmdb_job_start() called
> [2012-01-18T10:19:54] debug3: found correct user
> [2012-01-18T10:19:54] debug3: found correct wckey 3
> [2012-01-18T10:19:54] debug3: 7(as_mysql_job.c:481) query
> insert into "cccuam_job_table" (id_job, id_assoc, id_qos, id_wckey,
> id_user, id_group, nodelist, id_resv, timelimit, time_eligible,
> time_submit, time_start, job_name, track_steps, state, priority,
> cpus_req, cpus_alloc, nodes_alloc, account, partition, wckey, node_inx)
> values (156, 10, 2, 3, 720, 407, 'calc2', 0, 7200, 1326878389,
> 1326878389, 1326878389, 'lanza09-1-b', 0, 1, 10500, 1, 1, 1, 'cccuam',
> 'sec4000', '**', '0') on duplicate key update
> job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=3, id_user=720,
> id_group=407, nodelist='calc2', id_resv=0, timelimit=7200,
> time_submit=1326878389, time_start=1326878389, job_name='lanza09-1-b',
> track_steps=0, id_qos=2, state=greatest(state, 1), priority=10500,
> cpus_req=1, cpus_alloc=1, nodes_alloc=1, account='cccuam',
> partition='sec4000', wckey='**', node_inx='0'
> -------------------------------------------------------------------
>
> The steps I have followed to set up MaxCPUs are:
>
> sacctmgr add cluster CCCUAM
> sacctmgr add account CCCUAM Cluster=CCCUAM Description="Usuarios CCC"
> Organization="UAM"
> sacctmgr add qos name=sec4000 priority=5 PreemptMode=suspend,gang MaxJobs=4
> MaxCPUs=1
> sacctmgr add user lfelipe DefaultAccount=CCCUAM qos=sec4000
> DefaultQOS=sec4000
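>
> (To double-check that these limits actually reached the QOS and the user's
> association, one can run, for example:
>
>     sacctmgr show qos name=sec4000 format=Name,Priority,MaxCPUs,MaxJobs
>     sacctmgr show assoc user=lfelipe format=User,Account,QOS,DefaultQOS
>
> The exact format fields may differ between sacctmgr versions.)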
>
> I launch a job:
>
> gaussian> sbatch -p sec4000 --qos=sec4000 lanza09-1-b tres_forma1c_2-bis
> Submitted batch job 154
>
> As I said, I want to limit each job to one CPU.
>
> To see whether this works, I launch a job that requests two CPUs.
>
> Running top, we can see that the job uses 2 CPUs, although I have
> limited it to only one:
>
> top - 11:02:54 up 82 days, 1:16, 2 users, load average: 2.01, 2.04, 1.91
> Tasks: 124 total, 2 running, 122 sleeping, 0 stopped, 0 zombie
> Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem:   8123228k total,  5720980k used,  2402248k free,   388108k buffers
> Swap: 16787872k total,   184308k used, 16603564k free,   681296k cached
>
>  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
> 4646 luisf 25  0 9350m 4.3g 4896 R 198.8 55.0 84:25.36 l502.exe
> 4614 luisf 15  0 72832 1764 1124 S   0.0  0.0  0:00.06 slurm_script
> 4644 luisf 16  0 90084  876  688 S   0.0  0.0  0:00.00 g09
> 4645 luisf 16  0 61216  720  608 S   0.0  0.0  0:00.09 tee
>
> Am I misconfiguring slurm? Why is the job not limited by MaxCPUs?
>
> Sincerely,
>
> Luis Felipe Ruiz Nieto



-- 
Carles Fenoy