Howdy.
We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on
for our 64-core nodes.
$ qconf -sconf | grep BINDING
ENABLE_BINDING=TRUE
When I submit an OpenMP job with:
#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:
The job stays locked to 16 of the node's 64 cores, which is great and
exactly what we expect.
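For reference, here is a minimal sketch of one way to confirm the binding from inside a job. It assumes a Linux node where /proc/self/status exposes a Cpus_allowed_list line; count_cpus is a hypothetical helper written for this illustration, not part of Grid Engine.

```shell
#!/bin/bash
# count_cpus: hypothetical helper that counts the CPUs named in a Linux
# CPU list such as "0-15" or "0-3,8,10-11".
# The function body runs in a subshell so the IFS change does not leak.
count_cpus() (
    IFS=,
    total=0
    for part in $1; do
        case $part in
            # A range "lo-hi" contributes hi - lo + 1 CPUs.
            *-*) total=$(( total + ${part#*-} - ${part%-*} + 1 )) ;;
            # A single CPU number contributes 1.
            *)   total=$(( total + 1 )) ;;
        esac
    done
    echo "$total"
)

# The affinity mask is inherited by child processes, so awk reading its
# own status reports the same list the job script is bound to.
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    echo "allowed CPUs: $allowed ($(count_cpus "$allowed") cores)"
fi
```

Run from the job script above, this should report 16 cores if the binding took effect, and 64 on an unbound job.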
Many of our applications, MATLAB for example, try to use as many cores
as are available on a node, and we cannot control MATLAB's core usage.
So binding is great when we need to limit a job to, say, 16 cores.
The issue is that each MATLAB job still runs 64 threads, now locked to
16 cores. With 4 of these MATLAB jobs running on a 64-core node, the
load on the node goes through the roof because there are more runnable
threads than cores.
We have a suspend threshold of 110% set on all of our queues:
$ qconf -sq q64 | grep np
suspend_thresholds np_load_avg=1.1
So jobs begin to suspend once the load on a node goes over 70, as expected.
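To make the arithmetic explicit, here is a rough sketch using the numbers from this scenario. It assumes np_load_avg is the load average divided by the core count and that, in the worst case, every MATLAB thread is runnable at once. The variable names are illustrative only.

```shell
#!/bin/bash
# Illustrative numbers taken from the scenario described above.
cores=64             # cores per node
threads_per_job=64   # threads each MATLAB job starts
jobs=4               # concurrent 16-slot jobs on one node
threshold_pct=110    # suspend_thresholds np_load_avg=1.1, as a percentage

# Load average at which the threshold trips: 64 * 1.1 = 70.4,
# truncated to 70 by integer arithmetic.
trip_load=$(( cores * threshold_pct / 100 ))

# Worst-case load if every thread of every job is runnable.
worst_load=$(( jobs * threads_per_job ))

echo "threshold trips at a load of about ${trip_load}"
echo "worst-case load with ${jobs} jobs: ${worst_load}"

# Compare in scaled integers to avoid floating point in the shell.
if [ $(( worst_load * 100 )) -gt $(( cores * threshold_pct )) ]; then
    echo "np_load_avg would exceed the suspend threshold"
fi
```

The worst case is an upper bound. The real load depends on how many of the MATLAB threads are actually busy at any moment.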
My question is: does it make sense to turn OFF the "np_load_avg" suspend
threshold cluster-wide and turn ON core binding cluster-wide?
What we want to achieve is that jobs only use as many cores as they
request on a node. With the above scenario we will see nodes with a
HUGE load (past 64), but each job will only be using the cores it is
bound to.
Thank you,
Joseph