Carles,
The logic to support managing generic resource topology (associating
specific generic resources with specific CPUs on a node) was
incomplete. The attached patch should fix the problem you have
reported and will be included in SLURM version 2.3.0-pre7.
Moe Jette
SchedMD LLC
On 07/01/2011 06:08 PM, Carles Fenoy wrote:
---------- Forwarded message ----------
From: <[email protected]>
Date: 01/07/2011 17:24
Subject: Fwd: Re: [slurm-dev] GRES Overallocating resources
To: "Carles Fenoy" <[email protected]>
Hi Carles,
I have been able to reproduce this problem. If I include the "CPUs"
field in the gres.conf file this problem occurs. It does not occur
without the "CPUs" field. What are the chances of getting a SLURM
support contract to fix this for you?
Moe Jette
SchedMD LLC
----- Forwarded message from [email protected] -----
Date: Fri, 1 Jul 2011 08:37:21 +0200
From: Carles Fenoy <[email protected]>
Reply-To: Carles Fenoy <[email protected]>
Subject: Re: [slurm-dev] GRES Overallocating resources
To: [email protected]
Cc: [email protected]
Hi Moe,
Thanks for your quick reply.
I've modified the configuration parameters and it still behaves the
same way. Below is the output of squeue, sinfo, scontrol show nodes,
and scontrol show jobs.
sinfo:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
projects* up infinite 1 alloc bscop134
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20214 projects sbatch cfenoy PD 0:00 1 (Resources)
20210 projects sbatch cfenoy R 4:22 1 bscop134
20211 projects sbatch cfenoy R 4:21 1 bscop134
20212 projects sbatch cfenoy R 4:21 1 bscop134
20213 projects sbatch cfenoy R 4:20 1 bscop134
scontrol show nodes:
NodeName=bscop134 Arch=x86_64 CoresPerSocket=1
CPUAlloc=4 CPUErr=0 CPUTot=8 Features=(null)
Gres=gpu:2
NodeAddr=bscop134 NodeHostName=bscop134
OS=Linux RealMemory=12036 Sockets=8
State=MIXED ThreadsPerCore=1 TmpDisk=20157 Weight=1
BootTime=2011-06-17T11:15:47 SlurmdStartTime=2011-07-01T08:37:16
Reason=(null)
scontrol show jobs (only 3 jobs):
JobId=20212 Name=sbatch
UserId=cfenoy(1001) GroupId=users(100)
Priority=4294901757 Account=(null) QOS=(null) WCKey=*
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:04:40 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2011-07-01T08:38:32 EligibleTime=2011-07-01T08:38:32
StartTime=2011-07-01T08:38:32 EndTime=Unknown
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=projects AllocNode:Sid=bscop134:13583
ReqNodeList=(null) ExcNodeList=(null)
NodeList=bscop134
BatchHost=bscop134
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=gpu:2 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/cfenoy
JobId=20213 Name=sbatch
UserId=cfenoy(1001) GroupId=users(100)
Priority=4294901756 Account=(null) QOS=(null) WCKey=*
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:04:39 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2011-07-01T08:38:33 EligibleTime=2011-07-01T08:38:33
StartTime=2011-07-01T08:38:33 EndTime=Unknown
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=projects AllocNode:Sid=bscop134:13583
ReqNodeList=(null) ExcNodeList=(null)
NodeList=bscop134
BatchHost=bscop134
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=gpu:2 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/cfenoy
JobId=20214 Name=sbatch
UserId=cfenoy(1001) GroupId=users(100)
Priority=4294901755 Account=(null) QOS=(null) WCKey=*
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2011-07-01T08:38:33 EligibleTime=2011-07-01T08:38:33
StartTime=Unknown EndTime=Unknown
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=projects AllocNode:Sid=bscop134:13583
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=gpu:2 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/cfenoy
On Thu, Jun 30, 2011 at 7:08 PM, <[email protected]> wrote:
It looks like there is a configuration problem. You have a gres
defined in some places as "gpu" and in other places as "gpus", which
will result in two separate sets of data structures. In SLURM v2.3 I
see that configuration logs a bunch of errors.
Quoting Carles Fenoy <[email protected]>:
Hi all,
I've been testing SLURM with GRES over the last few days for our
future NVIDIA machine, and I'm facing some problems with GRES
over-allocating resources. I've seen the following error every time
the controller starts a job:
[2011-06-28T14:50:55] error: gres/gpu: job 20206 node bscop134
overallocated resources by 2
The configuration consists of 1 node with 2 GPUs. At the end of the
email you can find the relevant configuration parameters.
Is this the expected behavior of scheduling with GRES? Is this a bug,
or is there no way to avoid over-allocating resources?
Best regards,
Carles Fenoy
slurm.conf:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerType=sched/backfill
GresTypes=gpu
NodeName=DEFAULT RealMemory=12000 Procs=8 TmpDisk=20000
Gres=gpus:2
NodeName=bscop134 NodeAddr=bscop134 Gres=gpus:2
PartitionName=projects AllowGroups=ALL Hidden=NO RootOnly=NO
MaxNodes=UNLIMITED MinNodes=1 MaxTime=UNLIMITED Shared=NO State=UP
Default=YES Nodes=bscop134
gres.conf:
Name=gpu File=/dev/nvidia0 CPUs=0-3
Name=gpu File=/dev/nvidia1 CPUs=4-7
--
Carles Fenoy
Moe Jette
SchedMD LLC
--
Carles Fenoy
----- End forwarded message -----
Moe Jette
SchedMD LLC
diff --git a/src/common/gres.c b/src/common/gres.c
index 8c4a712..956472c 100644
--- a/src/common/gres.c
+++ b/src/common/gres.c
@@ -2528,9 +2528,10 @@ extern uint32_t _job_test(void *job_gres_data, void *node_gres_data,
break;
}
cpus_avail[top_inx] = 0;
- i = node_gres_ptr->topo_gres_cnt_avail[top_inx] -
- node_gres_ptr->topo_gres_cnt_alloc[top_inx];
- if (i <= 0) {
+ i = node_gres_ptr->topo_gres_cnt_avail[top_inx];
+ if (!use_total_gres)
+ i -= node_gres_ptr->topo_gres_cnt_alloc[top_inx];
+ if (i < 0) {
error("gres/%s: topology allocation error on "
"node %s", gres_name, node_name);
continue;
@@ -2687,7 +2688,6 @@ extern int _job_alloc(void *job_gres_data, void *node_gres_data,
/*
* Select the specific resources to use for this job.
- * We'll need to add topology information in the future
*/
if (job_gres_ptr->gres_bit_alloc[node_offset]) {
/* Resuming a suspended job, resources already allocated */
@@ -2728,6 +2728,18 @@ extern int _job_alloc(void *job_gres_data, void *node_gres_data,
} else {
node_gres_ptr->gres_cnt_alloc += job_gres_ptr->gres_cnt_alloc;
}
+ if (job_gres_ptr->gres_bit_alloc &&
+ job_gres_ptr->gres_bit_alloc[node_offset] &&
+ node_gres_ptr->topo_gres_bitmap &&
+ node_gres_ptr->topo_gres_cnt_alloc) {
+ for (i=0; i<node_gres_ptr->topo_cnt; i++) {
+ gres_cnt = bit_overlap(job_gres_ptr->
+ gres_bit_alloc[node_offset],
+ node_gres_ptr->
+ topo_gres_bitmap[i]);
+ node_gres_ptr->topo_gres_cnt_alloc[i] += gres_cnt;
+ }
+ }
return SLURM_SUCCESS;
}
@@ -2810,7 +2822,7 @@ static int _job_dealloc(void *job_gres_data, void *node_gres_data,
int node_offset, char *gres_name, uint32_t job_id,
char *node_name)
{
- int i, len;
+ int i, len, gres_cnt;
gres_job_state_t *job_gres_ptr = (gres_job_state_t *) job_gres_data;
gres_node_state_t *node_gres_ptr = (gres_node_state_t *) node_gres_data;
@@ -2865,6 +2877,19 @@ static int _job_dealloc(void *job_gres_data, void *node_gres_data,
gres_name, job_id, node_name);
}
+ if (job_gres_ptr->gres_bit_alloc &&
+ job_gres_ptr->gres_bit_alloc[node_offset] &&
+ node_gres_ptr->topo_gres_bitmap &&
+ node_gres_ptr->topo_gres_cnt_alloc) {
+ for (i=0; i<node_gres_ptr->topo_cnt; i++) {
+ gres_cnt = bit_overlap(job_gres_ptr->
+ gres_bit_alloc[node_offset],
+ node_gres_ptr->
+ topo_gres_bitmap[i]);
+ node_gres_ptr->topo_gres_cnt_alloc[i] -= gres_cnt;
+ }
+ }
+
return SLURM_SUCCESS;
}