Hello,

Thanks for the information and the detailed explanation.

Perhaps there could be a partial solution where the counter is not incremented if the assigned node (after selecting a node as usual) is already running a task from the same group, and where at exit the counter is not decremented if the node is still running a task from the same group. This would at least stop unnecessary increments of the counter.
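To make the idea concrete, here is a minimal sketch of the bookkeeping described above. This is not actual SLURM code; the type and function names are hypothetical, and it assumes a per-QOS record of how many of the group's tasks run on each node:

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 64

/* Hypothetical per-QOS usage record: the number of tasks this group
 * runs on each node, plus the node count checked against GrpNodes. */
typedef struct {
	int tasks_on_node[MAX_NODES];
	int grp_node_count;
} qos_usage_t;

/* After node selection: count the node only if the group is not
 * already running a task there. */
static void job_start_on_node(qos_usage_t *q, int node)
{
	if (q->tasks_on_node[node]++ == 0)
		q->grp_node_count++;
}

/* At job exit: release the node only when the group's last task
 * on that node finishes. */
static void job_end_on_node(qos_usage_t *q, int node)
{
	if (--q->tasks_on_node[node] == 0)
		q->grp_node_count--;
}
```

With this bookkeeping, the four-job example later in this thread (two jobs each on node 1 and node 2) would leave grp_node_count at 2 rather than 4.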

If I understand correctly, a complete solution would, in addition to the partial solution above, require node selection to first check all nodes where group processes are currently running and which still have enough resources for the new task. The number of such nodes found would then be subtracted from the requested number to determine whether the request is within the limit. Which would have high overhead...
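The limit check for that complete solution might look roughly like the following. Again, this is only an illustrative sketch with hypothetical names, not the actual SLURM code: given the per-node count of tasks the group already runs, only selected nodes with no group task count as "new" against GrpNodes.

```c
#include <assert.h>

/* Hypothetical check: tasks_on_node[i] is the number of tasks this
 * group already runs on node i, counted_nodes is the number of nodes
 * currently counted against the group, and selected[] holds the nodes
 * picked for the new job.  Only nodes where the group has no running
 * task add to the count. */
static int within_grp_nodes(const int *tasks_on_node,
			    int counted_nodes,
			    const int *selected, int nsel,
			    int grp_nodes_limit)
{
	int new_nodes = 0;

	for (int i = 0; i < nsel; i++)
		if (tasks_on_node[selected[i]] == 0)
			new_nodes++;

	return counted_nodes + new_nodes <= grp_nodes_limit;
}
```

As noted below, the catch is that this check runs before nodes are assigned, so the selection logic itself would have to prefer nodes the group already occupies.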

So, is it possible to implement the partial solution at least? If not, I feel this should be documented, perhaps with a notice that if tasks are assigned to the same node, the node is counted twice.

Thanks,
Evren

On Wed, 12 Sep 2012, Danny Auble wrote:


This would actually be a bit more involved.  When this check is done the
nodes haven't been assigned to the job yet.  So we would have to pull
this logic into the select plugin as well to pick the correct nodes a
job could use without going over the limit.  IMHO this is far more
complexity and overhead than the return on investment would justify.

Danny

On 09/12/12 09:53, Moe Jette wrote:
The code currently increments and decrements a counter when jobs start
and end. It would be possible to track the specific nodes allocated to
each group and avoid counting nodes twice, but that would require code
changes and higher overhead than simply incrementing and decrementing
a counter.

Quoting Evren Yurtesen IB <[email protected]>:



On Tue, 11 Sep 2012, Danny Auble wrote:

On 09/11/12 05:59, Evren Yurtesen IB wrote:
Hello,

We have a cluster with 12 cores on each node. I made a QOS entry
with GrpNodes 4 (I don't want this group of users to be able to use
more than 4 nodes).

If somebody queues jobs like this (2 jobs on the same node):

job 1 - node 1
job 2 - node 1
job 3 - node 2
job 4 - node 2

It looks like slurm thinks 4 nodes are used, because I see
the next job queued in the system shows Nodes 1 and is pending due
to QOSResourceLimit. In my opinion, only 2 nodes are used. :)

Could it be that it counts the same nodes again because they are in
different jobs? (v2.4.2 is used on this system)

Yes
Well, is there a way to make it count each node once? :)

Thanks,
Evren
