On Tue, May 24, 2016 at 07:20:48PM -0400, berg...@merctech.com wrote:
> We're running SoGE 8.1.6 under CentOS 6 and had successfully been using
> core binding:
> qstat -j 2005747
> ==============================================================
> job_number:                 2005747
> exec_file:                  job_scripts/2005747
> submission_time:            Tue May 24 12:19:22 2016
> sge_o_log_name:             foobarmultimodal
> account:                    sge
> hard resource_list:         h_stack=256m,centos6=TRUE,h_vmem=10G
> notify:                     FALSE
> job_name:                   run_func.sh
> priority:                   -100
> jobshare:                   0
> shell_list:                 NONE:/bin/bash
> env_list:                   
> TERM=NONE,SGE_CELL=default,SGE_ARCH=lx-amd64,SGE_EXECD_PORT=16445,SGE_QMASTER_PORT=16444,SGE_ROOT=/cbica/home/sge/centos6/8.1.6,SGE_VER=8.1.6,OMP_NUM_THREADS=1,ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1,numMaxCompThreads=1,MKL_NUM_THREADS=1,MKL_DYNAMIC=FALSE
> job_args:                   -
> script_file:                STDIN
> binding:                    set linear:1
> ---------------
> 
> The job is aggressively multi-threaded (it is based on Matlab). In the
> past, this kind of job would be bound to the requested number of CPUs
> (defaulting to 1). If there were too few CPUs requested, the job would
> run very slowly as threads waited for each other, but other processes
> on the same node would be fine.
> 
> Now the job is using more than 1 CPU (I've seen it spike up to 9 cores)
> and overloading the compute node.
>  
> [root@c1-17 log]# ps -fp 4588
> UID        PID  PPID  C STIME TTY          TIME CMD
> 32226     4588  4586  0 12:25 ?        00:00:00 /bin/bash 
> /var/tmp/gridengine/8.1.6/default/spool/c1-17/job_scripts/2005747 -
> [root@c1-17 log]# pstree -p 4588
> 2005747(4588)───run_runGdCMFreg(4851)───runGdCMFreg_fun(4853)─┬─{runGdCMFreg_fu}(4859)
>                                                               ├─{runGdCMFreg_fu}(4860)
>                                                               :
>                                                               :
>                                                               └─{runGdCMFreg_fu}(5028)
> [root@c1-17 log]# pstree -p 4588 | wc -l
> 77
> [root@c1-17 log]# taskset -c -p 4588
> pid 4588's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4851
> pid 4851's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4853
> pid 4853's current affinity list: 0-39
> --------------------------------
> 
> Any suggestions about troubleshooting this in order to re-enable the core 
> binding?
> 

Check that ENABLE_BINDING is set in the execd_params.
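As a sketch (assuming a standard SoGE install where binding is controlled via
the global cluster configuration), you can inspect and set it along these lines:

```
# Show the current execd_params in the global configuration:
qconf -sconf | grep execd_params

# To enable binding, edit the global configuration:
#   qconf -mconf
# and add (or extend) the execd_params line, e.g.:
execd_params    ENABLE_BINDING=true
```

The execd needs the parameter at job start for binding requests to take effect.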

The other possibility is that you've been bitten by this bug:
https://arc.liv.ac.uk/trac/SGE/ticket/1479, which can cause an MPI-style
job to over-allocate cores.  If that leaves no free cores on the node,
any other job that lands there won't be bound.
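Either way, it's worth confirming what affinity a running job actually has.
Besides `taskset -c -p <pid>` as shown above, a minimal Python sketch (Linux
only; pid 0 means the current process, and querying another pid needs
permission to do so) is:

```python
import os

# Query the CPU affinity of a process.  On a 40-core node an unbound
# job reports the full set {0, ..., 39}; a job bound to one core
# reports a single-element set.
pid = 0  # replace with the job's pid, e.g. 4588
affinity = os.sched_getaffinity(pid)
print(sorted(affinity))
```

A bound job should show only the cores the scheduler granted it, matching
the `binding: set linear:1` line in the qstat output.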

I'm working on a fix but first I have to shave some Yaks.

William

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
