This is a hassle for us too.

In general, what we do is:

1. Set binding by default in our launcher scripts to '-binding linear:1', to 
force users onto a single core (see the wrapper sketch after the MATLAB diffs 
below)
2. Allow them to override that by unaliasing qsub/qrsh and manually requesting 
the openmp PE
3. For MATLAB this doesn't work, because it doesn't honor any environment 
variables; it just greedily looks at how many cores are available and launches 
that many threads. HOWEVER, you can force it to use only one computational 
thread (even though it still launches many!) with '-singleCompThread' in 
$MATLABROOT/bin/worker and $MATLABROOT/bin/matlab:

# diff worker.dist worker
20c20
< exec "${bindir}/matlab" -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
---
> exec "${bindir}/matlab" -dmlworker -logfile /dev/null -singleCompThread 
> -nodisplay -r distcomp_evaluate_filetask $*

# diff matlab.dist matlab
164c164
<         arglist=
---
>         arglist=-singleCompThread
490c490
<     arglist=""
---
>     arglist="-singleCompThread"

Then users who want more than one thread in MATLAB MUST use an MPI parallel 
environment with matlabpool. That requires further OGS/SGE/SOG integration and 
licensing, which is described in toolbox/distcomp/examples/integration/sge; I 
can send you our setup if you're interested and have the Dist_Comp_Engine 
toolbox available (you don't need to install the engine, just have the 
license).
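
Once that integration is in place as the default parallel configuration, the 
user-facing side is roughly the following (a sketch only; 'my_parallel_script' 
is a stand-in, and the matlabpool syntax is the R2013-era one):

# Hypothetical launch of a 16-worker pool through the scheduler integration.
matlab -nodisplay -r "matlabpool open 16; my_parallel_script; matlabpool close; exit"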

Make sense? Yukk!

For other software, you need to find equivalent ways to force a single thread 
and then parallelize with MPI, OR, if the software respects an environment 
variable, go the OpenMP route with '-binding XXX:X' set to match (a job-script 
sketch follows below).
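
For the second route, a minimal job-script sketch in the same style as 
Joseph's script below; 'my_openmp_program' is a stand-in, and $NSLOTS is 
filled in by the scheduler from the PE request:

#!/bin/bash
#$ -N OMP_TEST
#$ -pe openmp 16
#$ -binding linear:16
# Keep the thread count in lockstep with the slots/cores we were granted.
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program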

For CPLEX, force a single thread like so. Set this environment variable across 
the cluster:
ILOG_CPLEX_PARAMETER_FILE=/usr/local/cplex/CPLEX_Studio/cplex.prm
And in that parameter file:
CPX_PARAM_THREADS                1
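
One way (of many) to push that variable out cluster-wide is a profile.d 
snippet on every node; a sketch, using the same path as above:

# /etc/profile.d/cplex.sh -- sourced by login shells on every node
export ILOG_CPLEX_PARAMETER_FILE=/usr/local/cplex/CPLEX_Studio/cplex.prm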

Bleh! And that's not (or at least wasn't six months ago) honored by Rcplex, 
though I think Hector was working on it.

I hope some of that is useful. It's been the approach that generates the 
fewest questions from users. It only works for us because we have a site 
license for Dist Comp Engine, so we can run a license server on each host to 
serve out the threads needed there. Bleh.

If others have novel ways to approach this problem, PLEASE let us all know. 
It's certainly one of the more difficult aspects of user education and cluster 
use for us.

Cheers,
-Hugh
________________________________________
From: [email protected] [[email protected]] on behalf of 
Joseph Farran [[email protected]]
Sent: Tuesday, April 29, 2014 5:31 PM
To: [email protected]
Subject: [gridengine users] Core Binding and Node Load

Howdy.

We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on
for our 64-core nodes.

$ qconf -sconf | grep BINDING
                              ENABLE_BINDING=TRUE


When I submit an OpenMP job with:

#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:16

The job stays locked to 16 cores out of the 64, which is great and what is
expected.

Many of our jobs, like MATLAB, try to use as many cores as are available on a
node, and we cannot control MATLAB's core usage. So binding is great when we
need to allow only, say, 16 cores per job.

The issue is that MATLAB has 64 threads locked to 16 cores, and thus when you
have 4 of these MATLAB jobs running on a 64-core node, the load on the node is
through the roof because there are more workers than cores.

We have suspend thresholds set on all of our queues at 110%:

$ qconf -sq q64 | grep np
suspend_thresholds    np_load_avg=1.1

So jobs begin to suspend when the load on a node goes over 70, as expected.

My question is: does it make sense to turn OFF "np_load_avg" cluster-wide and
turn ON core binding cluster-wide?

What we want to achieve is that jobs use only as many cores as they request on
a node. With the above scenario we will see nodes with a HUGE load (past 64),
but each job will only be using its bound cores.

Thank you,
Joseph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
