Hello,

Thanks for the information and the detailed explanation.

Perhaps there could be a partial solution where the counter is not incremented if the assigned node (after selecting a node as usual) is already running a task from the same group, and where at exit the counter is not decremented if the node is still running a task from the same group. This would at least stop unnecessary increments of the counter.
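To make the idea concrete, here is a minimal sketch of the bookkeeping described above. This is not actual SLURM code; the type and function names are hypothetical, and it assumes a per-QOS record of how many of the group's tasks run on each node:

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 64

/* Hypothetical per-QOS usage record: the number of tasks this group
 * runs on each node, plus the node count checked against GrpNodes. */
typedef struct {
	int tasks_on_node[MAX_NODES];
	int grp_node_count;
} qos_usage_t;

/* After node selection: count the node only if the group is not
 * already running a task there. */
static void job_start_on_node(qos_usage_t *q, int node)
{
	if (q->tasks_on_node[node]++ == 0)
		q->grp_node_count++;
}

/* At job exit: release the node only when the group's last task
 * on that node finishes. */
static void job_end_on_node(qos_usage_t *q, int node)
{
	if (--q->tasks_on_node[node] == 0)
		q->grp_node_count--;
}
```

With this bookkeeping, the four-job example later in this thread (two jobs each on node 1 and node 2) would leave grp_node_count at 2 rather than 4.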

If I understand correctly, a complete solution would, in addition to the partial solution above, require node selection to first check all nodes where group processes are currently running and which still have enough resources for the new task. The number of such nodes found would then be subtracted from the requested number to determine whether the request is within the limit. Which would have high overhead...
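The limit check for that complete solution might look roughly like the following. Again, this is only an illustrative sketch with hypothetical names, not the actual SLURM code: given the per-node count of tasks the group already runs, only selected nodes with no group task count as "new" against GrpNodes.

```c
#include <assert.h>

/* Hypothetical check: tasks_on_node[i] is the number of tasks this
 * group already runs on node i, counted_nodes is the number of nodes
 * currently counted against the group, and selected[] holds the nodes
 * picked for the new job.  Only nodes where the group has no running
 * task add to the count. */
static int within_grp_nodes(const int *tasks_on_node,
			    int counted_nodes,
			    const int *selected, int nsel,
			    int grp_nodes_limit)
{
	int new_nodes = 0;

	for (int i = 0; i < nsel; i++)
		if (tasks_on_node[selected[i]] == 0)
			new_nodes++;

	return counted_nodes + new_nodes <= grp_nodes_limit;
}
```

As noted below, the catch is that this check runs before nodes are assigned, so the selection logic itself would have to prefer nodes the group already occupies.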

So, is it possible to implement the partial solution at least? If not, I feel this should be documented, perhaps with a notice that if tasks are assigned to the same node, the node is counted twice.

Thanks,
Evren

On Wed, 12 Sep 2012, Danny Auble wrote:


This would actually be a bit more involved.  When this check is done the
nodes haven't been assigned to the job yet.  So we would have to pull
this logic into the select plugin as well to pick the correct nodes a
job could use without going over the limit.  IMHO this is far more
complexity and overhead than the return on investment would justify.

Danny

On 09/12/12 09:53, Moe Jette wrote:
The code currently increments and decrements a counter when jobs start
and end. It would be possible to track the specific nodes allocated to
each group and avoid counting nodes twice, but that would require code
changes and higher overhead than simply incrementing and decrementing
a counter.

Quoting Evren Yurtesen IB <[email protected]>:



On Tue, 11 Sep 2012, Danny Auble wrote:

On 09/11/12 05:59, Evren Yurtesen IB wrote:
Hello,

We have a cluster with 12 cores on each node. I made a QOS entry
with GrpNodes 4 (I don't want this group of users to be able to use
more than 4 nodes).

If somebody queues jobs like this (2 jobs on the same node):

job 1 - node 1
job 2 - node 1
job 3 - node 2
job 4 - node 2

It looks like slurm thinks 4 nodes are used, because I see
the next job queued in the system shows Nodes 1 and is pending due
to QOSResourceLimit. In my opinion, only 2 nodes are used. :)

Could it be that it counts the same nodes again because they are in
different jobs? (v2.4.2 is used on this system)

Yes
Well, is there a way to make it count each node once? :)

Thanks,
Evren
