On Tue, May 24, 2016 at 07:20:48PM -0400, berg...@merctech.com wrote:
> We're running SoGE 8.1.6 under CentOS6 and had successfully been using
> core binding.
>
> qstat -j 2005747
> ==============================================================
> job_number:                 2005747
> exec_file:                  job_scripts/2005747
> submission_time:            Tue May 24 12:19:22 2016
> sge_o_log_name:             foobarmultimodal
> account:                    sge
> hard resource_list:         h_stack=256m,centos6=TRUE,h_vmem=10G
> notify:                     FALSE
> job_name:                   run_func.sh
> priority:                   -100
> jobshare:                   0
> shell_list:                 NONE:/bin/bash
> env_list:                   TERM=NONE,SGE_CELL=default,SGE_ARCH=lx-amd64,
>                             SGE_EXECD_PORT=16445,SGE_QMASTER_PORT=16444,
>                             SGE_ROOT=/cbica/home/sge/centos6/8.1.6,
>                             SGE_VER=8.1.6,OMP_NUM_THREADS=1,
>                             ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1,
>                             numMaxCompThreads=1,MKL_NUM_THREADS=1,
>                             MKL_DYNAMIC=FALSE
> job_args:                   -
> script_file:                STDIN
> binding:                    set linear:1
> ---------------
>
> The job is aggressively multi-threaded (it is based on Matlab). In the
> past, this kind of job would be bound to the requested number of CPUs
> (defaulting to 1). If too few CPUs were requested, the job would run
> very slowly as its threads waited on each other, but other processes on
> the same node would be fine.
>
> Now the job is using more than 1 CPU (I've seen it spike up to 9 cores)
> and overloading the compute node.
>
> [root@c1-17 log]# ps -fp 4588
> UID        PID  PPID  C STIME TTY          TIME CMD
> 32226     4588  4586  0 12:25 ?        00:00:00 /bin/bash /var/tmp/gridengine/8.1.6/default/spool/c1-17/job_scripts/2005747 -
>
> [root@c1-17 log]# pstree -p 4588
> 2005747(4588)---run_runGdCMFreg(4851)---runGdCMFreg_fun(4853)-+-{runGdCMFreg_fu}(4859)
>                                                               |-{runGdCMFreg_fu}(4860)
>                                                               :
>                                                               :
>                                                               `-{runGdCMFreg_fu}(5028)
>
> [root@c1-17 log]# pstree -p 4588 | wc -l
> 77
>
> [root@c1-17 log]# taskset -c -p 4588
> pid 4588's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4851
> pid 4851's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4853
> pid 4853's current affinity list: 0-39
> --------------------------------
>
> Any suggestions about troubleshooting this in order to re-enable the
> core binding?
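[The unbound state is visible in the output above: every process in the job's
tree reports affinity 0-39, i.e. the whole node. The same check can be scripted;
here is a minimal Python sketch of the diagnostic, using Linux's
os.sched_getaffinity (nothing SGE-specific; the helper name affinity_list is
made up for illustration) to print a mask in the same compact form that
`taskset -c -p` uses:]

```python
import os

def affinity_list(pid: int = 0) -> str:
    """Render a process's CPU affinity like `taskset -c -p` does
    (pid 0 means the calling process)."""
    cpus = sorted(os.sched_getaffinity(pid))   # Linux-only wrapper around sched_getaffinity(2)
    ranges, start = [], cpus[0]
    for prev, cur in zip(cpus, cpus[1:] + [None]):
        if cur != prev + 1:                    # a run of consecutive CPUs just ended
            ranges.append(str(start) if start == prev else f"{start}-{prev}")
            start = cur
    return ",".join(ranges)

# On an unbound 40-core node this would print "0-39", matching the
# taskset output quoted above.
print(affinity_list())
```

[Walking the job's process tree and calling this on each PID would reproduce
the taskset checks above in one pass.]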
Check that ENABLE_BINDING is set in the execd_params. The other
possibility is that you've been bitten by this bug:

    https://arc.liv.ac.uk/trac/SGE/ticket/1479

which can cause an MPI-style job to over-allocate cores. If that leaves
no cores available on the node, then any other job that ends up there
won't be bound. I'm working on a fix, but first I have to shave some
yaks.

William
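[For background on why binding only the job script is enough when it works:
`binding: set linear:1` has sge_execd set the CPU affinity of the job's shell,
and on Linux child processes and their threads inherit the parent's affinity
mask, so Matlab's worker threads stay confined. A minimal sketch of that
inheritance, in plain Linux Python rather than SGE code, assuming the process
is allowed to call sched_setaffinity on itself:]

```python
import os

# Mimic "binding: set linear:1": pin ourselves to a single CPU.
full_mask = os.sched_getaffinity(0)
one_cpu = {min(full_mask)}
os.sched_setaffinity(0, one_cpu)

# A forked child inherits the mask without doing anything itself;
# this is why binding the job script's shell also confines its children.
pid = os.fork()
if pid == 0:
    os._exit(0 if os.sched_getaffinity(0) == one_cpu else 1)
_, status = os.waitpid(pid, 0)
print("child inherited one-CPU binding:", os.WEXITSTATUS(status) == 0)

os.sched_setaffinity(0, full_mask)  # undo the pinning
```

[When ENABLE_BINDING is off, sge_execd never makes that sched_setaffinity
call, which is consistent with the 0-39 masks seen in the report above.]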
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users