We're running SoGE 8.1.6 under CentOS 6 and had been successfully using
a JSV to set core binding. This has been extremely helpful in
controlling some very 'greedy' multi-threaded processes.

Recently, I've noticed that the core binding is no longer working.

There have been no related changes to the JSV or SGE configuration. I
suspect that a recent kernel or OS-distribution update has changed the
core binding, but I'm not certain.

Here's an example of a currently running job. Note that the binding and
the threading-related environment variables were set by the JSV. The
submit script (run_func.sh) doesn't do anything with CPU affinity or
threading; it simply sets some environment variables, accepts a few
arguments, and calls a compiled MATLAB executable.


qstat -j 2005747
==============================================================
job_number:                 2005747
exec_file:                  job_scripts/2005747
submission_time:            Tue May 24 12:19:22 2016
sge_o_log_name:             foobarmultimodal
account:                    sge
hard resource_list:         h_stack=256m,centos6=TRUE,h_vmem=10G
notify:                     FALSE
job_name:                   run_func.sh
priority:                   -100
jobshare:                   0
shell_list:                 NONE:/bin/bash
env_list:                   
TERM=NONE,SGE_CELL=default,SGE_ARCH=lx-amd64,SGE_EXECD_PORT=16445,SGE_QMASTER_PORT=16444,SGE_ROOT=/cbica/home/sge/centos6/8.1.6,SGE_VER=8.1.6,OMP_NUM_THREADS=1,ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1,numMaxCompThreads=1,MKL_NUM_THREADS=1,MKL_DYNAMIC=FALSE
job_args:                   -
script_file:                STDIN
binding:                    set linear:1
---------------

The job is aggressively multi-threaded (it is MATLAB-based). In the
past, this kind of job would be bound to the requested number of CPUs
(defaulting to 1). If too few CPUs were requested, the job would
run very slowly as its threads waited for each other, but other
processes on the same node were unaffected.

Now the job is using more than one CPU (I've seen it spike to 9 cores)
and is overloading the compute node.
 
[root@c1-17 log]# ps -fp 4588
UID        PID  PPID  C STIME TTY          TIME CMD
32226     4588  4586  0 12:25 ?        00:00:00 /bin/bash 
/var/tmp/gridengine/8.1.6/default/spool/c1-17/job_scripts/2005747 -
[root@c1-17 log]# pstree -p 4588
2005747(4588)───run_runGdCMFreg(4851)───runGdCMFreg_fun(4853)─┬─{runGdCMFreg_fu}(4859)
                                                              ├─{runGdCMFreg_fu}(4860)
                                                              :
                                                              :
                                                              └─{runGdCMFreg_fu}(5028)
[root@c1-17 log]# pstree -p 4588 | wc -l
77
[root@c1-17 log]# taskset -c -p 4588
pid 4588's current affinity list: 0-39
[root@c1-17 log]# taskset -c -p 4851
pid 4851's current affinity list: 0-39
[root@c1-17 log]# taskset -c -p 4853
pid 4853's current affinity list: 0-39
--------------------------------
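As a cross-check independent of taskset, the affinity mask the kernel is actually enforcing can be read straight from /proc. Here the current shell ($$) stands in for the job PIDs above; substitute e.g. 4851:

```shell
# Read the effective CPU affinity for a process from the kernel's side.
# An unbound process on this 40-core node reports 0-39.
grep Cpus_allowed_list "/proc/$$/status"
```

On a correctly bound job, each PID in the tree should report a single core (e.g. "Cpus_allowed_list: 0") rather than 0-39.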

Any suggestions for troubleshooting this so we can re-enable core
binding?

Thanks,

Mark

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
