Howdy.

We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on for our 64-core nodes.

$ qconf -sconf | grep BINDING
                             ENABLE_BINDING=TRUE


When I submit an OpenMP job with:

#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:

The job stays locked to 16 of the node's 64 cores, which is great and what is expected.
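For anyone who wants to check the binding themselves, here is a minimal sketch that could go at the end of the job script. It assumes a Linux node where /proc/self/status reports Cpus_allowed_list; count_cpus is a hypothetical helper of mine, not part of Grid Engine.

```shell
#!/bin/bash
# Hypothetical helper (not part of Grid Engine): count the CPUs named in a
# Linux CPU list string such as "0-15" or "0-7,32-39".
count_cpus() {
    local list=$1 total=0 part lo hi
    local IFS=','
    for part in $list; do
        lo=${part%-*}    # text before the dash (or the whole entry)
        hi=${part#*-}    # text after the dash (or the whole entry)
        total=$(( total + hi - lo + 1 ))
    done
    echo "$total"
}

# Read the CPU list the kernel allows for this process; awk inherits the
# affinity of the job's shell. Skipped where /proc is not available.
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    echo "allowed CPUs: $allowed ($(count_cpus "$allowed") cores)"
fi
```

With the binding in effect, I would expect the reported core count to match the 16 slots requested from the openmp PE.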

Many of our jobs, such as MATLAB, try to use as many cores as are available on a node, and we cannot control MATLAB's core usage. So binding is great when we need to limit a job to, say, 16 cores.

The issue is that each MATLAB job still starts 64 threads, which are now locked to 16 cores. So when 4 of these MATLAB jobs are running on a 64-core node, the load on the node is through the roof because there are more workers than cores.

We have a suspend threshold of 110% set on all of our queues:

$ qconf -sq q64 | grep np
suspend_thresholds    np_load_avg=1.1

So jobs begin to suspend because the load on the node is over 70 (1.1 x 64 cores), as expected.
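To spell out the arithmetic, here is a rough sketch. It assumes that np_load_avg is the load average divided by the number of cores, and that every MATLAB thread is CPU-bound and counts toward the load average. over_threshold is a hypothetical helper, not a Grid Engine command.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) when load / cores exceeds the
# threshold, i.e. when the normalized load is over the suspend threshold.
over_threshold() {
    local load=$1 cores=$2 threshold=$3
    awk -v l="$load" -v c="$cores" -v t="$threshold" \
        'BEGIN { exit !((l / c) > t) }'
}

cores=64
jobs=4
threads_per_job=64                        # MATLAB starts one thread per core it sees
runnable=$(( jobs * threads_per_job ))    # assumed load if all threads are busy

echo "assumed load with $jobs jobs: $runnable"
if over_threshold "$runnable" "$cores" 1.1; then
    echo "np_load_avg is above 1.1, so the queue starts suspending jobs"
fi
```

Under these assumptions the node is far past the threshold even though every job is confined to its own 16 cores.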

My question is: does it make sense to turn OFF the "np_load_avg" suspend threshold cluster-wide and turn ON core binding cluster-wide?

What we want to achieve is that each job uses only as many cores as it requested on a node. With the above scenario we will see nodes with a HUGE load (past 64), but each job will only be using the cores it requested.

Thank you,
Joseph

