[gridengine users] Core Binding and Node Load

Joseph Farran Tue, 29 Apr 2014 14:33:07 -0700

Howdy.

We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on for our 64-core nodes.


$ qconf -sconf | grep BINDING
                             ENABLE_BINDING=TRUE


When I submit an OpenMP job with:

#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:

The job stays locked to 16 of the 64 cores, which is great and what is expected.
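To confirm the binding from inside a running job, the job script can read the CPU list that the kernel allows for the process. This is a minimal sketch, assuming a Linux node with /proc mounted. The helper name count_cpus is hypothetical. Child processes such as awk inherit the job's binding, so their /proc/self/status reflects it.

```shell
#!/bin/bash
# Hypothetical helper: count how many CPUs a cpulist string such as
# "0-15" or "0-3,8,10-11" covers.
count_cpus() {
    local list=$1 total=0 range lo hi
    local IFS=','            # split the list on commas
    for range in $list; do
        if [[ $range == *-* ]]; then
            lo=${range%-*}   # start of an "a-b" range
            hi=${range#*-}   # end of an "a-b" range
            total=$(( total + hi - lo + 1 ))
        else
            total=$(( total + 1 ))   # a single CPU id
        fi
    done
    echo "$total"
}

# Inside the job script: read the allowed CPU list (Linux-specific /proc).
allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
echo "Bound to CPUs ${allowed} ($(count_cpus "$allowed") cores)"
```

For the 16-slot job above, the second number should come out as 16 if the binding took effect.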

Many of our jobs, such as MATLAB, try to use as many cores as are available on a node, and we cannot control MATLAB's core usage. So binding is great when we need to limit each job to, say, 16 cores.

The issue is that MATLAB runs 64 threads locked to 16 cores, so when four of these MATLAB jobs are running on a 64-core node, the load on the node goes through the roof because there are more workers than cores.


We have a suspend threshold of 110% set on all of our queues:

$ qconf -sq q64 | grep np
suspend_thresholds    np_load_avg=1.1

So, as expected, jobs begin to suspend once the load on a node goes over about 70 (64 × 1.1 = 70.4).
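The arithmetic behind that threshold can be worked through for this node. This is a sketch under stated assumptions: np_load_avg is the load average divided by the number of processors, and each MATLAB job keeps all 64 of its threads runnable. Each runnable thread therefore adds roughly 1 to the load average, whatever cores it is bound to.

```shell
#!/bin/bash
# Worked numbers for suspend_thresholds np_load_avg=1.1 on a 64-core node.
# Assumption: 4 MATLAB jobs, each keeping 64 threads runnable.
cores=64
threshold=1.1
jobs=4
threads_per_job=64

# Raw load at which the normalized load crosses the threshold.
suspend_load=$(awk -v c="$cores" -v t="$threshold" 'BEGIN { printf "%.1f", c * t }')

# Expected raw load and normalized load with all four jobs running.
raw_load=$(( jobs * threads_per_job ))
np_load=$(awk -v l="$raw_load" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')

echo "suspend when load > ${suspend_load}"              # 70.4
echo "raw load ~ ${raw_load}, np_load_avg ~ ${np_load}" # 256, 4.00
```

Under these assumptions a fully packed node sits near np_load_avg 4.0, far above 1.1. The threshold fires even though each job stays confined to its own 16 cores.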

My question is: does it make sense to turn OFF the "np_load_avg" threshold cluster-wide and turn ON core binding cluster-wide?

What we want to achieve is that jobs only use as many cores on a node as they request. In the above scenario we will see nodes with a HUGE load (past 64), but each job will only be using the cores it requested.


Thank you,
Joseph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
