We're running SoGE 8.1.6 under CentOS6 and had successfully been using a JSV to set core binding. This has been extremely helpful in controlling some very 'greedy' multi-threaded processes.
Recently, I've noticed that the core binding is no longer working. There have been no related changes to the JSV or SGE configuration. I suspect that a recent kernel or OS-distribution update has changed the core binding, but I'm not certain.

Here's an example of a currently-running job. Note that the binding and the environment variables related to threading were set by the JSV. The submit script (run_func.sh) doesn't do anything with CPU affinity or threading; it simply sets some environment variables, accepts a few arguments, and calls a compiled-Matlab executable.

qstat -j 2005747
==============================================================
job_number:                 2005747
exec_file:                  job_scripts/2005747
submission_time:            Tue May 24 12:19:22 2016
sge_o_log_name:             foobarmultimodal
account:                    sge
hard resource_list:         h_stack=256m,centos6=TRUE,h_vmem=10G
notify:                     FALSE
job_name:                   run_func.sh
priority:                   -100
jobshare:                   0
shell_list:                 NONE:/bin/bash
env_list:                   TERM=NONE,SGE_CELL=default,SGE_ARCH=lx-amd64,SGE_EXECD_PORT=16445,SGE_QMASTER_PORT=16444,SGE_ROOT=/cbica/home/sge/centos6/8.1.6,SGE_VER=8.1.6,OMP_NUM_THREADS=1,ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1,numMaxCompThreads=1,MKL_NUM_THREADS=1,MKL_DYNAMIC=FALSE
job_args:                   -
script_file:                STDIN
binding:                    set linear:1
---------------

The job is aggressively multi-threaded (it is based on Matlab). In the past, this kind of job would be bound to the requested number of CPUs (defaulting to 1). If too few CPUs were requested, the job would run very slowly as threads waited for each other, but other processes on the same node would be fine.

Now the job is using more than 1 CPU (I've seen it spike up to 9 cores) and is overloading the compute node.

[root@c1-17 log]# ps -fp 4588
UID        PID  PPID  C STIME TTY          TIME CMD
32226     4588  4586  0 12:25 ?        00:00:00 /bin/bash /var/tmp/gridengine/8.1.6/default/spool/c1-17/job_scripts/2005747 -

[root@c1-17 log]# pstree -p 4588
2005747(4588)───run_runGdCMFreg(4851)───runGdCMFreg_fun(4853)─┬─{runGdCMFreg_fu}(4859)
                                                              ├─{runGdCMFreg_fu}(4860)
                                                              :
                                                              :
                                                              └─{runGdCMFreg_fu}(5028)

[root@c1-17 log]# pstree -p 4588 | wc -l
77

[root@c1-17 log]# taskset -c -p 4588
pid 4588's current affinity list: 0-39

[root@c1-17 log]# taskset -c -p 4851
pid 4851's current affinity list: 0-39

[root@c1-17 log]# taskset -c -p 4853
pid 4853's current affinity list: 0-39
--------------------------------

Any suggestions about troubleshooting this in order to re-enable the core binding?

Thanks,
Mark
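For anyone who wants to repeat the taskset checks above across all of a job's threads at once, here is a minimal sketch rather than a tested tool. It assumes a Linux /proc filesystem that exposes a Cpus_allowed_list field in /proc/<pid>/task/<tid>/status. The function names and the script name check_affinity.py are hypothetical, chosen only for illustration; nothing here is part of SGE.

```python
import os
import sys


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def thread_affinities(pid):
    """Map each thread ID of `pid` to its allowed CPU set, read from /proc."""
    result = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "status")) as fh:
                for line in fh:
                    if line.startswith("Cpus_allowed_list:"):
                        result[int(tid)] = parse_cpu_list(line.split(":", 1)[1])
                        break
        except (IOError, OSError):
            continue  # the thread exited while we were scanning
    return result


def unbound_threads(pid, max_cpus):
    """Return the threads of `pid` that may run on more than `max_cpus` CPUs."""
    return dict(
        (tid, cpus)
        for tid, cpus in thread_affinities(pid).items()
        if len(cpus) > max_cpus
    )


# Only run the report when a PID is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])
    max_cpus = int(sys.argv[2]) if len(sys.argv) > 2 else 1
    loose = unbound_threads(pid, max_cpus)
    for tid in sorted(loose):
        print("tid %d may run on %d CPUs" % (tid, len(loose[tid])))
    print("%d thread(s) exceed the limit of %d CPU(s)" % (len(loose), max_cpus))
```

It would be run as root on the execution host against the process that owns the threads, for example "python check_affinity.py 4853 1" for the job shown above. An empty report would suggest the binding is in place; a report listing every thread would match the 0-39 affinity that taskset shows.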