Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Hi Team,

I have separated the CPU nodes and GPU nodes into two different queues. I have 20 nodes with CPUs only (20 cores each) and no GPU, and another set of nodes with both GPU and CPU assigned to the GPU queue: some have 2 GPUs and 20 CPUs, others have 8 GPUs and 48 CPUs. Users are facing issues in the GPU queue. The scenario is as follows: users submit jobs requesting 4 CPUs + 1 GPU, and also jobs requesting 4 CPUs only. When all the GPUs are in use, the jobs requesting GPU resources wait in the queue, but even though a large number of CPUs are still available, the CPU-only jobs do not start, because the 4 CPU + 1 GPU jobs have higher priority. Is there any mechanism so that once all GPUs are in use, the CPU-only jobs are allowed to run?

Regards,
Navin

On Mon, Jun 22, 2020 at 6:09 PM Diego Zuccato wrote:

> Il 16/06/20 16:23, Loris Bennett ha scritto:
>
>> Thanks for pointing this out - I hadn't been aware of this. Is there
>> anywhere in the documentation where this is explicitly stated?
> I don't remember. Seems Michael's experience is different. Possibly some
> other setting influences that behaviour. Maybe different partition
> priorities?
> But on the small cluster I'm managing it's this way. I'm not an expert
> and I'd like to understand.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Il 16/06/20 16:23, Loris Bennett ha scritto:
> Thanks for pointing this out - I hadn't been aware of this. Is there
> anywhere in the documentation where this is explicitly stated?
I don't remember. Seems Michael's experience is different. Possibly some
other setting influences that behaviour. Maybe different partition
priorities?
But on the small cluster I'm managing it's this way. I'm not an expert
and I'd like to understand.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Diego Zuccato writes:

> Il 16/06/20 09:39, Loris Bennett ha scritto:
>
>>> Maybe it's already known and obvious, but... Remember that a node can be
>>> allocated to only one partition.
>> Maybe I am misunderstanding you, but I think that this is not the case.
>> A node can be in multiple partitions.
>
> *Assigned* to multiple partitions: OK.
> But once Slurm schedules a job in "partGPU" on that node, the whole node
> is unavailable for jobs in "partCPU", even if the GPU job is using only
> 1% of the resources.

Thanks for pointing this out - I hadn't been aware of this. Is there
anywhere in the documentation where this is explicitly stated?

>> We have nodes belonging to
>> individual research groups which are in both a separate partition just
>> for the group and in a 'scavenger' partition for everyone (but with
>> lower priority and maximum run-time).
>
> More or less our current config. Quite inefficient, at least for us: too
> many unusable resources due to small jobs.

Our scavenger partition tends to be used mostly by a small number of
users, each with a huge number of small, short jobs. Thus, they tend to
fill nodes and not block resources for that long, but I probably need to
look at this a bit more carefully.

>>> So, if you have the mixed nodes in both
>>> partitions and there's a GPU job running, a non-GPU job will find that
>>> node marked as busy because it's allocated to another partition.
>>> That's why we're drastically reducing the number of partitions we have
>>> and will avoid shared nodes.
>> Again, I don't think this is the explanation. If a job is running on a
>> GPU node, but not using all the CPUs, then a CPU-only job should be able
>> to start on that node, unless some form of exclusivity has been set up,
>> such as ExclusiveUser=YES for the partition.
> Nope. The whole node gets allocated to one partition at a time. So if
> the GPU job and the CPU one are in different partitions, it's expected
> that only one starts. The behaviour you're looking for is the one of
> QoS: define a single partition w/ multiple QoS and both jobs will run
> concurrently.
>
> If you think about it, that's the meaning of "partition" :)

Like I said, this is new to me, but personally I don't think that,
linguistically speaking, it is obvious. If the actual membership of a
node in a partition changes over time and just depends on which jobs
happen to be running on it at a given moment, then, to my mind, that's
not much like the physical concept of partitioning a room or a city.

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Not trying to argue unnecessarily, but what you describe is not a
universal rule, regardless of QOS.

Our GPU nodes are members of 3 GPU-related partitions, 2 more
resource-limited non-GPU partitions, and one of two larger-memory
partitions. It's set up this way to minimize idle resources (due to us
not buying enough GPUs in those nodes to keep all the CPUs busy, plus
our other nodes having limited numbers of DIMM slots for larger-memory
jobs).

First terminal: results in a job running in the 'any-interactive'
partition on gpunode002. We have a job submit plugin that automatically
routes jobs to 'interactive', 'gpu-interactive', or 'any-interactive'
depending on the resources requested:

=
[renfro@login rosetta-job]$ type hpcshell
hpcshell is a function
hpcshell ()
{
    srun --partition=interactive $@ --pty bash -i
}
[renfro@login rosetta-job]$ hpcshell
[renfro@gpunode002(job 751070) rosetta-job]$
=

Second terminal, simultaneous to the first: results in a job running in
the 'gpu-interactive' partition on gpunode002:

=
[renfro@login ~]$ hpcshell --gres=gpu
[renfro@gpunode002(job 751071) ~]$ squeue -t R -u $USER
 JOBID PARTI NAME   USER   ST TIME S:C: NODES MIN_MEMORY NODELIST(REASON) SUBMIT_TIME         START_TIME          END_TIME            TRES_PER_NODE
751071 gpu-i bash   renfro R  0:08 *:*: 1     2000M      gpunode002       2020-06-16T08:27:50 2020-06-16T08:27:50 2020-06-16T10:27:50 gpu
751070 any-i bash   renfro R  0:18 *:*: 1     2000M      gpunode002       2020-06-16T08:27:40 2020-06-16T08:27:40 2020-06-16T10:27:41 N/A
[renfro@gpunode002(job 751071) ~]$
=

Selected configuration details (excluding things like resource ranges
and defaults):

NodeName=gpunode[001-003] CoresPerSocket=14 RealMemory=382000 Sockets=2 ThreadsPerCore=1 Weight=10011 Gres=gpu:2
NodeName=gpunode004 CoresPerSocket=14 RealMemory=894000 Sockets=2 ThreadsPerCore=1 Weight=10021 Gres=gpu:2
PartitionName=gpu Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=any-interactive Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=any-debug Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=bigmem Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=gpunode[001-003]
PartitionName=hugemem Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=gpunode004

> On Jun 16, 2020, at 8:14 AM, Diego Zuccato wrote:
>
> Il 16/06/20 09:39, Loris Bennett ha scritto:
>
>>> Maybe it's already known and obvious, but... Remember that a node can be
>>> allocated to only one partition.
>> Maybe I am misunderstanding you, but I think that this is not the case.
>> A node can be in multiple partitions.
> *Assigned* to multiple partitions: OK.
> But once Slurm schedules a job in "partGPU" on that node, the whole node
> is unavailable for jobs in "partCPU", even if the GPU job is using only
> 1% of the resources.
>
>> We have nodes belonging to
>> individual research groups which are in both a separate partition just
>> for the group and in a 'scavenger' partition for everyone (but with
>> lower priority and maximum run-time).
> More or less our current config. Quite inefficient, at least for us: too
> many unusable resources due to small jobs.
>
>>> So, if you have the mixed nodes in both
>>> partitions and there's a GPU job running, a non-GPU job will find that
>>> node marked as busy because it's allocated to another partition.
>>> That's why we're drastically reducing the number of partitions we have
>>> and will avoid shared nodes.
>> Again, I don't think this is the explanation. If a job is running on a
>> GPU node, but not using all the CPUs, then a CPU-only job should be able
>> to start on that node, unless some form of exclusivity has been set up,
>> such as ExclusiveUser=YES for the partition.
> Nope. The whole node gets allocated to one partition at a time. So if
> the GPU job and the CPU one are in different partitions, it's expected
> that only one starts. The behaviour you're looking for is the one of
> QoS: define a single partition w/ multiple QoS and both jobs will run
> concurrently.
>
> If you think about it, that's the meaning of "partition" :)
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
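[Editor's note: the job submit plugin Renfro mentions is not shown in the thread. The sketch below only approximates its routing decision in shell, using the partition names from the config above; the function name and logic are assumptions, not the site's actual plugin (Slurm job submit plugins are normally written in Lua or C).]

```shell
#!/bin/sh
# Hypothetical sketch of the routing a job submit plugin like Renfro's
# might perform: pick an interactive partition based on whether the job
# requested a GPU. Partition names come from the config quoted above.
pick_partition() {
    for arg in "$@"; do
        case "$arg" in
            --gres=gpu*) echo "gpu-interactive"; return 0 ;;
        esac
    done
    echo "any-interactive"
}

pick_partition --gres=gpu:1       # prints gpu-interactive
pick_partition --cpus-per-task=4  # prints any-interactive
```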
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Il 16/06/20 09:39, Loris Bennett ha scritto:
>> Maybe it's already known and obvious, but... Remember that a node can be
>> allocated to only one partition.
> Maybe I am misunderstanding you, but I think that this is not the case.
> A node can be in multiple partitions.
*Assigned* to multiple partitions: OK.
But once Slurm schedules a job in "partGPU" on that node, the whole node
is unavailable for jobs in "partCPU", even if the GPU job is using only
1% of the resources.

> We have nodes belonging to
> individual research groups which are in both a separate partition just
> for the group and in a 'scavenger' partition for everyone (but with
> lower priority and maximum run-time).
More or less our current config. Quite inefficient, at least for us: too
many unusable resources due to small jobs.

>> So, if you have the mixed nodes in both
>> partitions and there's a GPU job running, a non-GPU job will find that
>> node marked as busy because it's allocated to another partition.
>> That's why we're drastically reducing the number of partitions we have
>> and will avoid shared nodes.
> Again, I don't think this is the explanation. If a job is running on a
> GPU node, but not using all the CPUs, then a CPU-only job should be able
> to start on that node, unless some form of exclusivity has been set up,
> such as ExclusiveUser=YES for the partition.
Nope. The whole node gets allocated to one partition at a time. So if
the GPU job and the CPU one are in different partitions, it's expected
that only one starts. The behaviour you're looking for is the one of
QoS: define a single partition w/ multiple QoS and both jobs will run
concurrently.

If you think about it, that's the meaning of "partition" :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
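[Editor's note: Diego's single-partition, multi-QOS suggestion might look roughly like the fragment below. All names and limits here are hypothetical, not taken from any poster's actual configuration.]

```
# slurm.conf: one partition holding both CPU-only and GPU nodes
PartitionName=all Nodes=node[001-020],gpunode[001-004] Default=YES State=UP

# QOS created and limited via sacctmgr (hypothetical limits):
#   sacctmgr add qos cpuonly
#   sacctmgr modify qos cpuonly set MaxTRESPerUser=cpu=80
#   sacctmgr add qos gpu
#   sacctmgr modify qos gpu set MaxTRESPerUser=gres/gpu=4
#
# Jobs then select a QOS instead of a partition:
#   sbatch --qos=gpu --gres=gpu:1 job.sh
#   sbatch --qos=cpuonly job.sh
```

With a single partition, a CPU-only job and a GPU job are never competing across partitions for the same node, which is the conflict Diego describes.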
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Diego Zuccato writes:

> Il 13/06/20 17:47, navin srivastava ha scritto:
>
>> Yes, we have separate partitions. Some are specific to GPU, having 2
>> nodes with 8 GPUs, and other partitions are a mix of both: nodes with
>> 2 GPUs and very few nodes without any GPU.
> Maybe it's already known and obvious, but... Remember that a node can be
> allocated to only one partition.

Maybe I am misunderstanding you, but I think that this is not the case.
A node can be in multiple partitions. We have nodes belonging to
individual research groups which are in both a separate partition just
for the group and in a 'scavenger' partition for everyone (but with
lower priority and maximum run-time).

> So, if you have the mixed nodes in both
> partitions and there's a GPU job running, a non-GPU job will find that
> node marked as busy because it's allocated to another partition.
> That's why we're drastically reducing the number of partitions we have
> and will avoid shared nodes.

Again, I don't think this is the explanation. If a job is running on a
GPU node, but not using all the CPUs, then a CPU-only job should be able
to start on that node, unless some form of exclusivity has been set up,
such as ExclusiveUser=YES for the partition.

Without seeing the full slurm.conf, it is difficult to guess what the
problem might be.

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Il 13/06/20 17:47, navin srivastava ha scritto:
> Yes, we have separate partitions. Some are specific to GPU, having 2
> nodes with 8 GPUs, and other partitions are a mix of both: nodes with
> 2 GPUs and very few nodes without any GPU.
Maybe it's already known and obvious, but... Remember that a node can be
allocated to only one partition. So, if you have the mixed nodes in both
partitions and there's a GPU job running, a non-GPU job will find that
node marked as busy because it's allocated to another partition.
That's why we're drastically reducing the number of partitions we have
and will avoid shared nodes.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Thanks, Renfro. I will apply similar settings and we will see how it
goes.

Regards

On Mon, Jun 15, 2020, 23:02 Renfro, Michael wrote:

> So if a GPU job is submitted to a partition containing only GPU nodes,
> and a non-GPU job is submitted to a partition containing at least some
> nodes without GPUs, both jobs should be able to run. Priorities should
> be evaluated on a per-partition basis. I can 100% guarantee that in our
> HPC, pending GPU jobs don't block non-GPU jobs, and vice versa.
>
> I could see a problem if the GPU job was submitted to a partition
> containing both types of nodes: if that job was assigned the highest
> priority for whatever reason (fair share, age, etc.), other jobs in the
> same partition would have to wait until that job started.
>
> A simple solution would be to make a GPU partition containing only GPU
> nodes, and a non-GPU partition containing only non-GPU nodes. Submit
> GPU jobs to the GPU partition, and non-GPU jobs to the non-GPU
> partition.
>
> Once that works, you could make a partition that includes both types of
> nodes to reduce idle resources, but jobs submitted to that partition
> would have to (a) not require a GPU, (b) require a limited number of
> CPUs per node, so that you'd have some CPUs available for GPU jobs on
> the nodes containing GPUs.
>
> *From:* slurm-users on behalf of navin srivastava
> *Sent:* Saturday, June 13, 2020 10:47 AM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] ignore gpu resources to scheduled the cpu
> based jobs
>
> Yes, we have separate partitions. Some are specific to GPU, having 2
> nodes with 8 GPUs, and other partitions are a mix of both: nodes with
> 2 GPUs and very few nodes without any GPU.
>
> Regards
> Navin
>
> On Sat, Jun 13, 2020, 21:11 navin srivastava wrote:
>
> Thanks, Renfro.
>
> Yes, we have both types of nodes, with GPU and without.
> Also, some users' jobs require a GPU and some applications use only
> CPU.
>
> So the issue happens when a user's priority is high and their job is
> waiting for GPU resources which are not available, while the
> lower-priority job, which needs only CPU resources, keeps waiting even
> though enough CPU is available.
>
> When I hold the GPU jobs, the CPU jobs go through.
>
> Regards
> Navin
>
> On Sat, Jun 13, 2020, 20:37 Renfro, Michael wrote:
>
> Will probably need more information to find a solution.
>
> To start, do you have separate partitions for GPU and non-GPU jobs? Do
> you have nodes without GPUs?
>
> On Jun 13, 2020, at 12:28 AM, navin srivastava wrote:
>
> Hi All,
>
> In our environment we have GPUs. What I found is that if a user has
> high priority and their job is in the queue waiting for GPU resources
> which are almost full and not available, then a job submitted by
> another user which does not require GPU resources also stays in the
> queue, even though lots of CPU resources are available.
>
> Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any
> way we can make some changes so that the CPU-based jobs go through and
> the GPU-based jobs wait until the GPU resources are free?
>
> Regards
> Navin.
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
So if a GPU job is submitted to a partition containing only GPU nodes,
and a non-GPU job is submitted to a partition containing at least some
nodes without GPUs, both jobs should be able to run. Priorities should
be evaluated on a per-partition basis. I can 100% guarantee that in our
HPC, pending GPU jobs don't block non-GPU jobs, and vice versa.

I could see a problem if the GPU job was submitted to a partition
containing both types of nodes: if that job was assigned the highest
priority for whatever reason (fair share, age, etc.), other jobs in the
same partition would have to wait until that job started.

A simple solution would be to make a GPU partition containing only GPU
nodes, and a non-GPU partition containing only non-GPU nodes. Submit GPU
jobs to the GPU partition, and non-GPU jobs to the non-GPU partition.

Once that works, you could make a partition that includes both types of
nodes to reduce idle resources, but jobs submitted to that partition
would have to (a) not require a GPU, (b) require a limited number of
CPUs per node, so that you'd have some CPUs available for GPU jobs on
the nodes containing GPUs.

From: slurm-users on behalf of navin srivastava
Sent: Saturday, June 13, 2020 10:47 AM
To: Slurm User Community List
Subject: Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

Yes, we have separate partitions. Some are specific to GPU, having 2
nodes with 8 GPUs, and other partitions are a mix of both: nodes with
2 GPUs and very few nodes without any GPU.

Regards
Navin

On Sat, Jun 13, 2020, 21:11 navin srivastava wrote:

Thanks, Renfro.

Yes, we have both types of nodes, with GPU and without.
Also, some users' jobs require a GPU and some applications use only CPU.

So the issue happens when a user's priority is high and their job is
waiting for GPU resources which are not available, while the
lower-priority job, which needs only CPU resources, keeps waiting even
though enough CPU is available.

When I hold the GPU jobs, the CPU jobs go through.

Regards
Navin

On Sat, Jun 13, 2020, 20:37 Renfro, Michael wrote:

Will probably need more information to find a solution.

To start, do you have separate partitions for GPU and non-GPU jobs? Do
you have nodes without GPUs?

On Jun 13, 2020, at 12:28 AM, navin srivastava wrote:

Hi All,

In our environment we have GPUs. What I found is that if a user has
high priority and their job is in the queue waiting for GPU resources
which are almost full and not available, then a job submitted by
another user which does not require GPU resources also stays in the
queue, even though lots of CPU resources are available.

Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any
way we can make some changes so that the CPU-based jobs go through and
the GPU-based jobs wait until the GPU resources are free?

Regards
Navin.
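[Editor's note: Renfro's suggestion maps onto a slurm.conf fragment along the lines below. Node and partition names are hypothetical, chosen to match the 20 CPU-only nodes described in this thread; MaxCPUsPerNode is the same real partition parameter his own config uses to keep CPUs free for GPU jobs.]

```
# GPU-only and CPU-only partitions: pending GPU jobs can no longer block
# CPU-only jobs, because the two job types never compete in one queue
PartitionName=cpu Nodes=node[001-020] Default=YES State=UP
PartitionName=gpu Nodes=gpunode[001-004] Default=NO State=UP

# Optional second step: an overlap partition that lets CPU-only jobs use
# idle CPUs on GPU nodes, capped so some CPUs stay free for GPU jobs
PartitionName=any Nodes=node[001-020],gpunode[001-004] Default=NO MaxCPUsPerNode=12 State=UP
```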
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Yes, we have separate partitions. Some are specific to GPU, having 2
nodes with 8 GPUs, and other partitions are a mix of both: nodes with
2 GPUs and very few nodes without any GPU.

Regards
Navin

On Sat, Jun 13, 2020, 21:11 navin srivastava wrote:

> Thanks, Renfro.
>
> Yes, we have both types of nodes, with GPU and without.
> Also, some users' jobs require a GPU and some applications use only
> CPU.
>
> So the issue happens when a user's priority is high and their job is
> waiting for GPU resources which are not available, while the
> lower-priority job, which needs only CPU resources, keeps waiting even
> though enough CPU is available.
>
> When I hold the GPU jobs, the CPU jobs go through.
>
> Regards
> Navin
>
> On Sat, Jun 13, 2020, 20:37 Renfro, Michael wrote:
>
>> Will probably need more information to find a solution.
>>
>> To start, do you have separate partitions for GPU and non-GPU jobs? Do
>> you have nodes without GPUs?
>>
>> On Jun 13, 2020, at 12:28 AM, navin srivastava wrote:
>>
>> Hi All,
>>
>> In our environment we have GPUs. What I found is that if a user has
>> high priority and their job is in the queue waiting for GPU resources
>> which are almost full and not available, then a job submitted by
>> another user which does not require GPU resources also stays in the
>> queue, even though lots of CPU resources are available.
>>
>> Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any
>> way we can make some changes so that the CPU-based jobs go through and
>> the GPU-based jobs wait until the GPU resources are free?
>>
>> Regards
>> Navin.
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Thanks, Renfro.

Yes, we have both types of nodes, with GPU and without.
Also, some users' jobs require a GPU and some applications use only CPU.

So the issue happens when a user's priority is high and their job is
waiting for GPU resources which are not available, while the
lower-priority job, which needs only CPU resources, keeps waiting even
though enough CPU is available.

When I hold the GPU jobs, the CPU jobs go through.

Regards
Navin

On Sat, Jun 13, 2020, 20:37 Renfro, Michael wrote:

> Will probably need more information to find a solution.
>
> To start, do you have separate partitions for GPU and non-GPU jobs? Do
> you have nodes without GPUs?
>
> On Jun 13, 2020, at 12:28 AM, navin srivastava wrote:
>
> Hi All,
>
> In our environment we have GPUs. What I found is that if a user has
> high priority and their job is in the queue waiting for GPU resources
> which are almost full and not available, then a job submitted by
> another user which does not require GPU resources also stays in the
> queue, even though lots of CPU resources are available.
>
> Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any
> way we can make some changes so that the CPU-based jobs go through and
> the GPU-based jobs wait until the GPU resources are free?
>
> Regards
> Navin.
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Will probably need more information to find a solution.

To start, do you have separate partitions for GPU and non-GPU jobs? Do
you have nodes without GPUs?

On Jun 13, 2020, at 12:28 AM, navin srivastava wrote:

Hi All,

In our environment we have GPUs. What I found is that if a user has high
priority and their job is in the queue waiting for GPU resources which
are almost full and not available, then a job submitted by another user
which does not require GPU resources also stays in the queue, even
though lots of CPU resources are available.

Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any
way we can make some changes so that the CPU-based jobs go through and
the GPU-based jobs wait until the GPU resources are free?

Regards
Navin.