Hi all,

A user reported a problem when submitting GPU jobs.
The machine in question has 3 NVIDIA GPUs:

0. GeForce GTX 580
1. GeForce 210
2. GeForce GTX 580

The two GTX 580s are for GPGPU work, while the 210 is only there to drive a display. I don't want jobs running on the 210, so my /etc/slurm/gres.conf contains this:

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia2

The first GPU job submitted runs on the first GTX 580 (id 0), but the second job, submitted while the first one is still busy/allocated, will not run on the second GTX 580. Instead the code runs on the GeForce 210, which 1) has low performance and 2) pisses off the user of the display.

What I think happens is that the first job gets submitted correctly and runs on the first GTX 580, but the second job is not assigned a GPU by SLURM. Since the GeForce 210 is not controlled (or "hidden") by SLURM, the code can still see it and therefore runs there.

What could be wrong? I'm attaching the config file I'm using and the logs.

Thanks a lot.

Regards,
Nicolas
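P.S. For what it's worth, a minimal test job along these lines (just a sketch, not the actual submission script from the logs; the job name and the nvidia-smi check are only for illustration) shows which device a job actually ends up on:

#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

# The gres/gpu plugin should export CUDA_VISIBLE_DEVICES for the allocated
# device(s); if it comes up empty, the code can see every GPU on the node,
# including the GeForce 210.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# List the GPUs visible from inside the allocation.
nvidia-smi -L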
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
DebugFlags=NO_CONF_HASH
ControlMachine=NODENAME
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=999999
GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/tmp/slurm/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/tmp/slurm
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=NODENAME
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=7
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=NODENAME RealMemory=16082 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:2
PartitionName=gpu Nodes=NODENAME Default=YES MaxTime=INFINITE State=UP
slurmd.log:

[2012-01-13T11:48:05] Launching batch job 1763 for UID 1000
[2012-01-13T11:48:05] [1763] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2012-01-13T11:48:05] [1763] done with job
slurmctld.log:

[2012-01-13T11:48:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
[2012-01-13T11:48:05] debug3: JobDesc: user_id=1000 job_id=-1 partition=gpu name=SimName
[2012-01-13T11:48:05] debug3: cpus=1-4294967294 pn_min_cpus=-1
[2012-01-13T11:48:05] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2012-01-13T11:48:05] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2012-01-13T11:48:05] debug3: immediate=0 features=(null) reservation=(null)
[2012-01-13T11:48:05] debug3: req_nodes=(null) exc_nodes=(null) gres=gpu:1
[2012-01-13T11:48:05] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2012-01-13T11:48:05] debug3: kill_on_node_fail=-1 script=#!/bin/bash #SBATCH --job-name=SimName ...
[2012-01-13T11:48:05] debug3: argv="/home/me/test/output/slurm_20120113_11h48.sh"
[2012-01-13T11:48:05] debug3: environment=MANPATH=/usr/lib64/mpi/mpi-openmpi-gcc/usr/share/man:/home/me/.gentoo/java-config-2/current-user-vm/man:/usr/local/share/man:/usr/share/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.21.1/man:/usr/share/gcc-data/x86_64-pc-linux-gnu/4.5.3/man:/etc/java-config/system-vm/man/:/opt/intel/composerxe-2011.4.191/man/en_US:/opt/cuda/man,VTK_DIR=/usr/lib64/vtk-5.8,KDE_MULTIHEAD=false,...
[2012-01-13T11:48:05] debug3: stdin=/dev/null stdout=/home/me/test/output/out_%j.log stderr=/home/me/test/output/err_%j.log
[2012-01-13T11:48:05] debug3: work_dir=/home/me/test alloc_node:sid=NODENAME:26341
[2012-01-13T11:48:05] debug3: resp_host=(null) alloc_resp_port=0 other_port=0
[2012-01-13T11:48:05] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2012-01-13T11:48:05] debug3: mail_type=0 mail_user=(null) nice=55534 num_tasks=4294967294 open_mode=0 overcommit=-1 acctg_freq=-1
[2012-01-13T11:48:05] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2012-01-13T11:48:05] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2012-01-13T11:48:05] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2012-01-13T11:48:05] debug3: cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
[2012-01-13T11:48:05] debug2: found 1 usable nodes from config containing NODENAME
[2012-01-13T11:48:05] debug3: _pick_best_nodes: job 1763 idle_nodes 0 share_nodes 1
[2012-01-13T11:48:05] debug2: sched: JobId=1763 allocated resources: NodeList=(null)
[2012-01-13T11:48:05] _slurm_rpc_submit_batch_job JobId=1763 usec=536
[2012-01-13T11:48:05] debug: sched: Running job scheduler
[2012-01-13T11:48:05] debug2: found 1 usable nodes from config containing NODENAME
[2012-01-13T11:48:05] debug3: _pick_best_nodes: job 1763 idle_nodes 0 share_nodes 1
[2012-01-13T11:48:05] debug3: dist_task: best_fit : using node[0]:socket[1] : 3 cores available
[2012-01-13T11:48:05] debug3: cons_res: _add_job_to_res: job 1763 act 0
[2012-01-13T11:48:05] debug3: cons_res: adding job 1763 to part gpu row 0
[2012-01-13T11:48:05] debug3: sched: JobId=1763 initiated
[2012-01-13T11:48:05] sched: Allocate JobId=1763 NodeList=NODENAME #CPUs=1
[2012-01-13T11:48:05] debug2: Spawning RPC agent for msg_type 4005
[2012-01-13T11:48:05] debug2: got 1 threads to send out
[2012-01-13T11:48:05] debug2: Tree head got back 0 looking for 1
[2012-01-13T11:48:05] debug3: Tree sending to NODENAME
[2012-01-13T11:48:05] debug3: Writing job id 1763 to header record of job_state file
[2012-01-13T11:48:05] debug2: Tree head got back 1
[2012-01-13T11:48:05] debug2: Tree head got them all
[2012-01-13T11:48:05] debug2: node_did_resp NODENAME
[2012-01-13T11:48:05] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=1763
[2012-01-13T11:48:05] completing job 1763
[2012-01-13T11:48:05] debug3: cons_res: _rm_job_from_res: job 1763 action 0
[2012-01-13T11:48:05] debug3: cons_res: removed job 1763 from part gpu row 0
[2012-01-13T11:48:05] debug2: Spawning RPC agent for msg_type 6011
[2012-01-13T11:48:05] sched: job_complete for JobId=1763 successful
[2012-01-13T11:48:05] debug2: _slurm_rpc_complete_batch_script JobId=1763 usec=213
[2012-01-13T11:48:05] debug2: got 1 threads to send out
[2012-01-13T11:48:05] debug2: Tree head got back 0 looking for 1
[2012-01-13T11:48:05] debug3: Tree sending to NODENAME
[2012-01-13T11:48:05] debug2: Tree head got back 1
[2012-01-13T11:48:05] debug2: Tree head got them all
[2012-01-13T11:48:05] debug2: node_did_resp NODENAME
[2012-01-13T11:48:05] debug: sched: Running job scheduler
[2012-01-13T11:48:07] debug3: Writing job id 1763 to header record of job_state file
[2012-01-13T11:48:07] debug3: Processing RPC: REQUEST_NODE_INFO from uid=1005
[2012-01-13T11:48:07] debug3: _slurm_rpc_dump_nodes, size=153 usec=112
[2012-01-13T11:48:07] debug3: Processing RPC: REQUEST_JOB_INFO from uid=1005
[2012-01-13T11:48:08] debug3: Processing RPC: REQUEST_NODE_INFO from uid=1000
[2012-01-13T11:48:08] debug3: _slurm_rpc_dump_nodes, size=153 usec=100
[2012-01-13T11:48:08] debug3: Processing RPC: REQUEST_JOB_INFO from uid=1000
slurmsched.log:

[2012-01-13T11:48:05] sched: JobId=1763 allocated resources: NodeList=(null)
[2012-01-13T11:48:05] sched: Running job scheduler
[2012-01-13T11:48:05] sched: JobId=1763 initiated
[2012-01-13T11:48:05] sched: Allocate JobId=1763 NodeList=NODENAME #CPUs=1
[2012-01-13T11:48:05] sched: job_complete for JobId=1763 successful
[2012-01-13T11:48:05] sched: Running job scheduler
[2012-01-13T11:48:52] sched: Running job scheduler