Hi Moe,

Thanks for your quick reply. I've modified the configuration parameters and it still behaves the same way. I'm sending the output of squeue, sinfo, scontrol show nodes and scontrol show jobs.
sinfo:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
projects*    up   infinite      1  alloc bscop134

squeue:

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  20214  projects   sbatch   cfenoy  PD       0:00      1 (Resources)
  20210  projects   sbatch   cfenoy   R       4:22      1 bscop134
  20211  projects   sbatch   cfenoy   R       4:21      1 bscop134
  20212  projects   sbatch   cfenoy   R       4:21      1 bscop134
  20213  projects   sbatch   cfenoy   R       4:20      1 bscop134

scontrol show nodes:

NodeName=bscop134 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=4 CPUErr=0 CPUTot=8 Features=(null) Gres=gpu:2
   NodeAddr=bscop134 NodeHostName=bscop134 OS=Linux RealMemory=12036
   Sockets=8 State=MIXED ThreadsPerCore=1 TmpDisk=20157 Weight=1
   BootTime=2011-06-17T11:15:47 SlurmdStartTime=2011-07-01T08:37:16
   Reason=(null)

scontrol show jobs (only 3 jobs):

JobId=20212 Name=sbatch
   UserId=cfenoy(1001) GroupId=users(100)
   Priority=4294901757 Account=(null) QOS=(null) WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:04:40 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2011-07-01T08:38:32 EligibleTime=2011-07-01T08:38:32
   StartTime=2011-07-01T08:38:32 EndTime=Unknown
   PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
   Partition=projects AllocNode:Sid=bscop134:13583
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=bscop134 BatchHost=bscop134
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null) WorkDir=/home/cfenoy

JobId=20213 Name=sbatch
   UserId=cfenoy(1001) GroupId=users(100)
   Priority=4294901756 Account=(null) QOS=(null) WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:04:39 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2011-07-01T08:38:33 EligibleTime=2011-07-01T08:38:33
   StartTime=2011-07-01T08:38:33 EndTime=Unknown
   PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
   Partition=projects AllocNode:Sid=bscop134:13583
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=bscop134 BatchHost=bscop134
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null) WorkDir=/home/cfenoy

JobId=20214 Name=sbatch
   UserId=cfenoy(1001) GroupId=users(100)
   Priority=4294901755 Account=(null) QOS=(null) WCKey=*
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2011-07-01T08:38:33 EligibleTime=2011-07-01T08:38:33
   StartTime=Unknown EndTime=Unknown
   PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
   Partition=projects AllocNode:Sid=bscop134:13583
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null) WorkDir=/home/cfenoy

On Thu, Jun 30, 2011 at 7:08 PM, <[email protected]> wrote:

> It looks like there is a configuration problem. You have a gres defined in
> some places as "gpu" and in other places as "gpus", which will result in
> two separate sets of data structures. In slurm v2.3 I see that
> configuration logs a bunch of errors.
>
> Quoting Carles Fenoy <[email protected]>:
>
>> Hi all,
>>
>> I've been testing slurm with gres for the last few days for our future
>> nvidia machine, and I'm facing some problems with gres over-allocating
>> resources. I've seen the following error every time the controller
>> starts a job:
>>
>> [2011-06-28T14:50:55] error: gres/gpu: job 20206 node bscop134
>> overallocated resources by 2
>>
>> The configuration consists of 1 node with 2 gpus. At the end of this
>> email you can find the relevant configuration parameters.
>>
>> Is this the expected behavior of the scheduling with gres?
>> Is this a bug, or is there no way to avoid over-allocating resources?
>>
>> Best regards,
>> Carles Fenoy
>>
>> slurm.conf:
>>
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SchedulerType=sched/backfill
>> GresTypes=gpu
>> NodeName=DEFAULT RealMemory=12000 Procs=8 TmpDisk=20000 Gres=gpus:2
>> NodeName=bscop134 NodeAddr=bscop134 Gres=gpus:2
>> PartitionName=projects AllowGroups=ALL Hidden=NO RootOnly=NO
>> MaxNodes=UNLIMITED MinNodes=1 MaxTime=UNLIMITED Shared=NO State=UP
>> Default=YES Nodes=bscop134
>>
>> gres.conf:
>>
>> Name=gpu File=/dev/nvidia0 CPUs=0-3
>> Name=gpu File=/dev/nvidia1 CPUs=4-7
>>
>> --
>> Carles Fenoy
>
> Moe Jette
> SchedMD LLC

--
Carles Fenoy
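
[Editor's note: for readers following this thread, a minimal sketch of the configuration with the gres name made consistent, per Moe's observation that "gpu" and "gpus" create two separate sets of data structures. This assumes the rest of the setup is unchanged; only the `Gres=` entries on the NodeName lines differ from the original.]

```
# slurm.conf (fragment) -- gres name now matches GresTypes:
# "gpu" everywhere, never "gpus"
GresTypes=gpu
NodeName=DEFAULT RealMemory=12000 Procs=8 TmpDisk=20000 Gres=gpu:2
NodeName=bscop134 NodeAddr=bscop134 Gres=gpu:2

# gres.conf -- unchanged; it already used "gpu",
# which is why it mismatched the "gpus" entries above
Name=gpu File=/dev/nvidia0 CPUs=0-3
Name=gpu File=/dev/nvidia1 CPUs=4-7
```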
