Carles,
The logic to support managing generic resource topology (associating
specific generic resources with specific CPUs on a node) was
incomplete. The attached patch should fix the problem you have
reported and will be included in SLURM version 2.3.0-pre7.
Moe Jette
SchedMD LLC
On 07/01/2011 06:08 PM, Carles Fenoy wrote:
---------- Forwarded message ----------
From: <[email protected]>
Date: 01/07/2011 17:24
Subject: Fwd: Re: [slurm-dev] GRES Overallocating resources
To: "Carles Fenoy" <[email protected]>
Hi Carles,
I have been able to reproduce this problem. If I include the "CPUs"
field in the gres.conf file this problem occurs. It does not occur
without the "CPUs" field. What are the chances of getting a SLURM
support contract to fix this for you?
Moe Jette
SchedMD LLC
----- Forwarded message from [email protected] -----
Date: Fri, 1 Jul 2011 08:37:21 +0200
From: Carles Fenoy <[email protected]>
Reply-To: Carles Fenoy <[email protected]>
Subject: Re: [slurm-dev] GRES Overallocating resources
To: [email protected]
Cc: [email protected]
Hi Moe,
Thanks for your quick reply.
I've modified the configuration parameters and it still behaves the
same way. Below is the output of squeue, sinfo, scontrol show nodes,
and scontrol show jobs.
sinfo:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
projects* up infinite 1 alloc bscop134
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20214 projects sbatch cfenoy PD 0:00 1 (Resources)
20210 projects sbatch cfenoy R 4:22 1 bscop134
20211 projects sbatch cfenoy R 4:21 1 bscop134
20212 projects sbatch cfenoy R 4:21 1 bscop134
20213 projects sbatch cfenoy R 4:20 1 bscop134
scontrol show nodes:
NodeName=bscop134 Arch=x86_64 CoresPerSocket=1
CPUAlloc=4 CPUErr=0 CPUTot=8 Features=(null)
Gres=gpu:2
NodeAddr=bscop134 NodeHostName=bscop134
OS=Linux RealMemory=12036 Sockets=8
State=MIXED ThreadsPerCore=1 TmpDisk=20157 Weight=1
BootTime=2011-06-17T11:15:47 SlurmdStartTime=2011-07-01T08:37:16
Reason=(null)
scontrol show jobs (only 3 jobs):
JobId=20212 Name=sbatch
UserId=cfenoy(1001) GroupId=users(100)
Priority=4294901757 Account=(null) QOS=(null) WCKey=*
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:04:40 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2011-07-01T08:38:32 EligibleTime=2011-07-01T08:38:32
StartTime=2011-07-01T08:38:32 EndTime=Unknown
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=projects AllocNode:Sid=bscop134:13583
ReqNodeList=(null) ExcNodeList=(null)
NodeList=bscop134
BatchHost=bscop134
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=gpu:2 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/cfenoy
JobId=20213 Name=sbatch
UserId=cfenoy(1001) GroupId=users(100)
Priority=4294901756 Account=(null) QOS=(null) WCKey=*
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:04:39 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2011-07-01T08:38:33 EligibleTime=2011-07-01T08:38:33
StartTime=2011-07-01T08:38:33 EndTime=Unknown
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=projects AllocNode:Sid=bscop134:13583
ReqNodeList=(null) ExcNodeList=(null)
NodeList=bscop134
BatchHost=bscop134
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=gpu:2 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/cfenoy
JobId=20214 Name=sbatch
UserId=cfenoy(1001) GroupId=users(100)
Priority=4294901755 Account=(null) QOS=(null) WCKey=*
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2011-07-01T08:38:33 EligibleTime=2011-07-01T08:38:33
StartTime=Unknown EndTime=Unknown
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=projects AllocNode:Sid=bscop134:13583
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=gpu:2 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/cfenoy
On Thu, Jun 30, 2011 at 7:08 PM, <[email protected]> wrote:
It looks like there is a configuration problem. You have a gres
defined in some places as "gpu" and in other places as "gpus", which
will result in two separate sets of data structures. In SLURM v2.3 I
see that configuration logs a bunch of errors.
Quoting Carles Fenoy <[email protected]>:
Hi all,
I've been testing SLURM with GRES over the last few days for our
future NVIDIA machine, and I'm facing some problems with GRES
over-allocating resources. I've seen the following error every time
the controller starts a job:
[2011-06-28T14:50:55] error: gres/gpu: job 20206 node bscop134
overallocated resources by 2
The configuration consists of 1 node with 2 GPUs. At the end of the
email you can find the relevant configuration parameters.
Is this the expected behavior of scheduling with GRES? Is this a bug,
or is there no way to avoid over-allocating resources?
Best regards,
Carles Fenoy
slurm.conf:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerType=sched/backfill
GresTypes=gpu
NodeName=DEFAULT RealMemory=12000 Procs=8 TmpDisk=20000
Gres=gpus:2
NodeName=bscop134 NodeAddr=bscop134 Gres=gpus:2
PartitionName=projects AllowGroups=ALL Hidden=NO RootOnly=NO
MaxNodes=UNLIMITED MinNodes=1 MaxTime=UNLIMITED Shared=NO State=UP
Default=YES Nodes=bscop134
gres.conf:
Name=gpu File=/dev/nvidia0 CPUs=0-3
Name=gpu File=/dev/nvidia1 CPUs=4-7
--
Carles Fenoy
Moe Jette
SchedMD LLC
--
Carles Fenoy
----- End forwarded message -----
Moe Jette
SchedMD LLC
diff --git a/src/common/gres.c b/src/common/gres.c
index 8c4a712..956472c 100644
--- a/src/common/gres.c
+++ b/src/common/gres.c
@@ -2528,9 +2528,10 @@ extern uint32_t _job_test(void *job_gres_data, void *node_gres_data,
break;
}
cpus_avail[top_inx] = 0;
- i = node_gres_ptr->topo_gres_cnt_avail[top_inx] -
- node_gres_ptr->topo_gres_cnt_alloc[top_inx];
- if (i <= 0) {
+ i = node_gres_ptr->topo_gres_cnt_avail[top_inx];
+ if (!use_total_gres)
+ i -= node_gres_ptr->topo_gres_cnt_alloc[top_inx];
+ if (i < 0) {
error("gres/%s: topology allocation error on "
"node %s", gres_name, node_name);
continue;
@@ -2687,7 +2688,6 @@ extern int _job_alloc(void *job_gres_data, void *node_gres_data,
/*
* Select the specific resources to use for this job.
- * We'll need to add topology information in the future
*/
if (job_gres_ptr->gres_bit_alloc[node_offset]) {
/* Resuming a suspended job, resources already allocated */
@@ -2728,6 +2728,18 @@ extern int _job_alloc(void *job_gres_data, void *node_gres_data,
} else {
node_gres_ptr->gres_cnt_alloc += job_gres_ptr->gres_cnt_alloc;
}
+ if (job_gres_ptr->gres_bit_alloc &&
+ job_gres_ptr->gres_bit_alloc[node_offset] &&
+ node_gres_ptr->topo_gres_bitmap &&
+ node_gres_ptr->topo_gres_cnt_alloc) {
+ for (i=0; i<node_gres_ptr->topo_cnt; i++) {
+ gres_cnt = bit_overlap(job_gres_ptr->
+ gres_bit_alloc[node_offset],
+ node_gres_ptr->
+ topo_gres_bitmap[i]);
+ node_gres_ptr->topo_gres_cnt_alloc[i] += gres_cnt;
+ }
+ }
return SLURM_SUCCESS;
}
@@ -2810,7 +2822,7 @@ static int _job_dealloc(void *job_gres_data, void *node_gres_data,
int node_offset, char *gres_name, uint32_t job_id,
char *node_name)
{
- int i, len;
+ int i, len, gres_cnt;
gres_job_state_t *job_gres_ptr = (gres_job_state_t *) job_gres_data;
gres_node_state_t *node_gres_ptr = (gres_node_state_t *) node_gres_data;
@@ -2865,6 +2877,19 @@ static int _job_dealloc(void *job_gres_data, void *node_gres_data,
gres_name, job_id, node_name);
}
+ if (job_gres_ptr->gres_bit_alloc &&
+ job_gres_ptr->gres_bit_alloc[node_offset] &&
+ node_gres_ptr->topo_gres_bitmap &&
+ node_gres_ptr->topo_gres_cnt_alloc) {
+ for (i=0; i<node_gres_ptr->topo_cnt; i++) {
+ gres_cnt = bit_overlap(job_gres_ptr->
+ gres_bit_alloc[node_offset],
+ node_gres_ptr->
+ topo_gres_bitmap[i]);
+ node_gres_ptr->topo_gres_cnt_alloc[i] -= gres_cnt;
+ }
+ }
+
return SLURM_SUCCESS;
}