Re: [gridengine users] Core Binding and Node Load
In the message dated: Tue, 29 Apr 2014 22:31:42 -,
The pithy ruminations from MacMullan, Hugh on
Re: [gridengine users] Core Binding and Node Load were:

=> This is a hassle for us too.
=>
=> In general, what we do is:
=>
=> 1. Set binding by default in launcher scripts to '-binding linear:1', to force users to use single threads

Same here.

=> 2. allow them to override by unaliasing qsub, qrsh, and setting manually to use openmp pe

Same here.

=> 3. for MATLAB this doesn't work because it doesn't honor any env vars or whatever, it just greedily looks at the number of threads available and launches that many processes. HOWEVER, you can force it to only use one thread (even though it launches many!) with '-singleCompThread' in $MATLABROOT/bin/worker and $MATLABROOT/bin/matlab:

However, that has no effect on multithreaded MEX functions. We use 'mcc' to produce compiled binaries of MATLAB executables, and the option -singleCompThread is also not passed on when the executable is created.

=> # diff worker.dist worker
=> 20c20
=> < exec ${bindir}/matlab -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
=> ---
=> > exec ${bindir}/matlab -dmlworker -logfile /dev/null -singleCompThread -nodisplay -r distcomp_evaluate_filetask $*
=>
=> # diff matlab.dist matlab
=> 164c164
=> < arglist=
=> ---
=> > arglist=-singleCompThread
=> 490c490
=> < arglist=
=> ---
=> > arglist=-singleCompThread
=>
=> Then users who want more than one thread in MATLAB MUST use a
=> parallel MPI environment with matlabpool, which requires further
=> OGS/SGE/SOG integration and licensing, which is described in
=> toolbox/distcomp/examples/integration/sge, but I can get you our setup
=> if you're interested and have the Dist_Comp_Engine toolbox available
=> (don't need to install the engine, just have the license).

Hmmm... I wonder if there's any way to have a JSV deal with this for environments where Dist_Comp_Engine is not available.

Here's a snippet from our current (working) JSV that sets binding based on the user-requested number of threads (defaulting to linear:1), with some pseudo-code to abuse the PATH variable so that a matlab wrapper that includes -singleCompThread will be called for jobs w/o binding.

===
if (! exists $params{binding_strategy}) {
    # No binding strategy requested; select one depending on
    # whether the job is
    # MPI, single/multi-threaded
    #
    # No PE:
    if (!(exists $params{pe_name})) {
        # ---
        # in case no parallel environment was chosen
        # add a default request of one processor core
        # ---
        # set the binding strategy to linear (without given start point: linear)
        jsv_sub_add_param('binding_type','set');
        jsv_sub_add_param('binding_amount','1');
        jsv_sub_add_param('binding_strategy','linear_automatic');
        jsv_add_env('OMP_NUM_THREADS','1');
        jsv_add_env('ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS','1');
        jsv_add_env('MKL_NUM_THREADS','1');
        jsv_add_env('MKL_DYNAMIC','FALSE');
        # Pseudo-code, untested:
        $oldPATH=$ENV{'PATH'};
        jsv_add_env('PATH',"/path/to/matlab/wrapper/with/singleCompThread/arg:$oldPATH");
        ## jsv_log_info ('Single-threaded job, core binding strategy linear_automatic added');
        $do_correct++;
    } else {
        # A parallel environment (threaded or openmpi) was requested
        # jsv_log_info ("Parallel environment $params{pe_name} requested");
        if ($params{pe_name} eq 'threaded') {
            # Pseudo code
            jsv_add_env('PATH',"/path/to/real/matlab/without_singleCompThread:$oldPATH");
===

=> Make sense? Yukk!
=>
=> For other software, you need to try to find equivalent ways to set

These might help:

    OMP_NUM_THREADS
        environment variable, used by OpenMP

    ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS
        environment variable, used by ITK

    MKL_NUM_THREADS
        environment variable, used by the many Python modules that are
        linked against the Intel Math Kernel Library, particularly
        NumPy and SciPy.

=> them to use only single threads, and then parallelize with MPI, OR respect
=> an environment variable and use the openmp way with the '-binding XXX:X'
=> set correctly.
=>
=> For CPLEX, set single thread like so:
=> Envar
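
The wrapper that the JSV pseudo-code would put first in PATH is not shown in the post. A minimal sketch of one possible shape follows. The REAL_MATLAB path is a placeholder, and using NSLOTS (set by Grid Engine inside a job) to decide when to add -singleCompThread is an assumption made for illustration, not something the posted JSV does.

```shell
#!/bin/bash
# Hypothetical wrapper named "matlab", installed ahead of the real one in PATH.
# REAL_MATLAB is a placeholder path; adjust it for the local installation.
REAL_MATLAB="${REAL_MATLAB:-/path/to/real/matlab}"

# Print -singleCompThread when the job holds at most one slot.
# NSLOTS is unset outside a Grid Engine job, so default it to 1.
thread_flag() {
    if [ "${NSLOTS:-1}" -le 1 ]; then
        echo "-singleCompThread"
    fi
}

# Hand over to the real launcher only if it is actually there.
if [ -x "$REAL_MATLAB" ]; then
    # thread_flag is left unquoted on purpose: empty output adds no argument.
    exec "$REAL_MATLAB" $(thread_flag) "$@"
else
    echo "matlab wrapper: $REAL_MATLAB not found" >&2
fi
```

With a wrapper like this the JSV would not need two different PATH settings, at the cost of trusting NSLOTS. As noted above, it would still do nothing for multithreaded MEX functions or mcc-compiled binaries.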
[gridengine users] Core Binding and Node Load
Howdy.

We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on for our 64-core nodes.

    $ qconf -sconf | grep BINDING
    ENABLE_BINDING=TRUE

When I submit an OpenMP job with:

    #!/bin/bash
    #$ -N TESTING
    #$ -q q64
    #$ -pe openmp 16
    #$ -binding linear:

the job stays locked to 16 cores out of 64, which is great and what is expected.

Many of our jobs, like MATLAB, try to use as many cores as are available on a node, and we cannot control MATLAB core usage. So binding is great when we need to allow only, say, 16 cores per job.

The issue is that MATLAB has 64 threads locked to 16 cores, and thus when you have 4 of these MATLAB jobs running on a 64-core node, the load on the node is through the roof because there are more workers than cores.

We have a suspend threshold set on all of our queues at 110%:

    $ qconf -sq q64 | grep np
    suspend_thresholds    np_load_avg=1.1

So jobs begin to suspend because the load is over 70 on a node, as expected.

My question is, does it make sense to turn OFF np_load_avg cluster-wide and turn ON core binding cluster-wide?

What we want to achieve is that jobs only use as many cores as are requested on a node. With the above scenario we will see nodes with a HUGE load (past 64), but each job will only be using said cores.

Thank you,
Joseph
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
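
For a sense of scale, the numbers in the question can be worked through as below. This is only a sketch: it assumes every MATLAB thread is runnable at the same time, and it takes np_load_avg to be the load average divided by the core count. The helper names are made up for illustration.

```shell
#!/bin/bash
# Load average at which a per-core threshold is crossed.
# $1 = np_load_avg threshold, $2 = cores on the node
suspend_load() {
    awk -v t="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", t * cores }'
}

# Per-core (normalized) load.
# $1 = load average (runnable threads), $2 = cores on the node
np_load_avg() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

jobs=4
threads_per_job=64
cores=64
runnable=$(( jobs * threads_per_job ))   # 256 threads competing for 64 cores

suspend_load 1.1 "$cores"        # prints 70.4
np_load_avg "$runnable" "$cores" # prints 4.00
```

So a threshold of 1.1 trips at a load of about 70 on a 64-core node, while four such jobs could push the normalized load toward 4 even though each job is confined to its own 16 cores.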
Re: [gridengine users] Core Binding and Node Load
This is a hassle for us too.

In general, what we do is:

1. Set binding by default in launcher scripts to '-binding linear:1', to force users to use single threads.

2. Allow them to override by unaliasing qsub, qrsh, and setting manually to use the openmp pe.

3. For MATLAB this doesn't work because it doesn't honor any env vars or whatever, it just greedily looks at the number of threads available and launches that many processes. HOWEVER, you can force it to only use one thread (even though it launches many!) with '-singleCompThread' in $MATLABROOT/bin/worker and $MATLABROOT/bin/matlab:

    # diff worker.dist worker
    20c20
    < exec ${bindir}/matlab -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
    ---
    > exec ${bindir}/matlab -dmlworker -logfile /dev/null -singleCompThread -nodisplay -r distcomp_evaluate_filetask $*

    # diff matlab.dist matlab
    164c164
    < arglist=
    ---
    > arglist=-singleCompThread
    490c490
    < arglist=
    ---
    > arglist=-singleCompThread

Then users who want more than one thread in MATLAB MUST use a parallel MPI environment with matlabpool, which requires further OGS/SGE/SOG integration and licensing, which is described in toolbox/distcomp/examples/integration/sge, but I can get you our setup if you're interested and have the Dist_Comp_Engine toolbox available (don't need to install the engine, just have the license).

Make sense? Yukk!

For other software, you need to try to find equivalent ways to set them to use only single threads, and then parallelize with MPI, OR respect an environment variable and use the openmp way with the '-binding XXX:X' set correctly.

For CPLEX, set single thread like so:

Envar across cluster:

    ILOG_CPLEX_PARAMETER_FILE=/usr/local/cplex/CPLEX_Studio/cplex.prm

And in that file:

    CPX_PARAM_THREADS 1

Bleh! And that's not (or wasn't six months ago) honored by Rcplex, but Hector was working on it I think.

I hope some of that is useful. It's been the way that works with the least number of questions from users.
It only works for us because we have a site license for Dist Comp Engine, so can have a license server on each host to serve out the threads needed there. Bleh.

If others have novel ways to approach this problem, PLEASE let us all know. It's certainly one of the more difficult aspects of user education and cluster use for us.

Cheers,
-Hugh

From: users-boun...@gridengine.org [users-boun...@gridengine.org] on behalf of Joseph Farran [jfar...@uci.edu]
Sent: Tuesday, April 29, 2014 5:31 PM
To: users@gridengine.org
Subject: [gridengine users] Core Binding and Node Load
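
For software that does respect an environment variable, the "openmp way" described above could look roughly like the job script below. It is a sketch under assumptions: the binding amount of 16 is chosen to match the PE request, the set_thread_env helper is made up for illustration, and the application name is a placeholder. The variable names are the ones mentioned elsewhere in this thread, and NSLOTS is the slot count Grid Engine exports into the job environment.

```shell
#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:16

# Export the thread-count variables so threaded libraries stay within
# the cores the job is bound to.
# $1 = number of threads to allow (normally the granted slot count)
set_thread_env() {
    local threads="$1"
    export OMP_NUM_THREADS="$threads"
    export MKL_NUM_THREADS="$threads"
    export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS="$threads"
    export MKL_DYNAMIC=FALSE
}

# NSLOTS is set by Grid Engine inside a job; fall back to 1 elsewhere.
set_thread_env "${NSLOTS:-1}"
echo "thread limit: $OMP_NUM_THREADS"

# ./my_openmp_program    # placeholder for the real application
```

Keeping the thread count equal to the number of bound cores is what keeps the node load near the number of cores actually in use, which is the situation the suspend threshold was designed for.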