Hi, are you requesting 8 times 12 GB of virtual_free? Is this available on the nodes?
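If virtual_free is a consumable complex (it appears in the nodes' complex_values below), Grid Engine charges the per-slot request once per granted slot, and allocation_rule $pe_slots places all slots of the job on a single host. A minimal sketch of the arithmetic, using the 12G and the 8 slots from the job shown below:

```shell
# Consumable requests are debited per slot; with allocation_rule
# $pe_slots every slot of the job lands on one host, so that single
# host must cover the whole product.
req_per_slot=12   # GB, from "hard resource_list: virtual_free=12G"
pe_slots=8        # from "parallel environment: smp range: 8"
echo "one host must offer $((req_per_slot * pe_slots))G of virtual_free"
# -> 96G: this fits the high-memory nodes (complex_values
#    virtual_free=102G) but not the low-memory ones (virtual_free=60G),
#    which would explain why the low-memory nodes are ruled out without
#    any message.
```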
-- Reuti

On 01.02.2013 at 18:04, Arnau wrote:

> Hi all,
>
> Our cluster has been running jobs fine, and I've been able to debug the
> pending reason on queued jobs. But now I'm facing some jobs that, from my
> understanding, should be running but are queued.
>
> Our cluster has some dedicated queues (which use a certain hostgroup) and a
> default queue that uses some other hosts. In this default queue, the
> hostgroup contains nodes that are quite similar, but some have 94 GB of
> memory and the others 48 GB. My first thought was that the user was
> requesting a large amount of memory, and that for that reason the
> low-memory nodes were discarded, but this is not true.
>
> The conf:
>
> default queue:
> # qconf -sq default
> qname                 default
> hostlist              @ibm
> seq_no                1
> load_thresholds       np_load_avg=1.3
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               smp ompi
> rerun                 FALSE
> slots                 9999
> [....]
>
> # qconf -shgrp @ibm
> group_name @ibm
> hostlist node-ib0301bi.linux.crg.es node-ib0302bi.linux.crg.es \
>          node-ib0303bi.linux.crg.es node-ib0304bi.linux.crg.es \
>          node-ib0305bi.linux.crg.es node-ib0306bi.linux.crg.es \
>          node-ib0307bi.linux.crg.es node-ib0308bi.linux.crg.es \
>          node-ib0309bi.linux.crg.es node-ib0310bi.linux.crg.es \
>          node-ib0311bi.linux.crg.es node-ib0312bi.linux.crg.es \
>          node-ib0313bi.linux.crg.es node-ib0314bi.linux.crg.es
>
> # qhost (output truncated, just a couple of nodes from each type):
> [...]
> node-ib0305bi  linux-x64  8  1.00  94.5G  7.9G  32.0G   7.7M
> node-ib0306bi  linux-x64  8  2.62  94.5G  8.1G  32.0G  14.3M
> node-ib0307bi  linux-x64  8  0.00  47.1G  2.2G  32.0G   9.8M
> node-ib0308bi  linux-x64  8  0.00  47.1G  2.3G  32.0G   9.2M
> [....]
>
> detailed host definition (high and low memory):
>
> # qconf -se node-ib0305bi
> hostname              node-ib0305bi.linux.crg.es
> load_scaling          NONE
> complex_values        slots=8,virtual_free=102G
> load_values           arch=linux-x64,num_proc=8,mem_total=96733.835938M, \
>                       swap_total=32767.992188M,virtual_total=129501.828125M, \
>                       load_avg=1.000000,load_short=1.000000, \
>                       load_medium=1.000000,load_long=1.320000, \
>                       mem_free=88596.046875M,swap_free=32760.312500M, \
>                       virtual_free=121356.359375M,mem_used=8137.789062M, \
>                       swap_used=7.679688M,virtual_used=8145.468750M, \
>                       cpu=12.500000,m_topology=SCCCCSCCCC, \
>                       m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
>                       np_load_avg=0.125000,np_load_short=0.125000, \
>                       np_load_medium=0.125000,np_load_long=0.165000, \
>                       free_scratch_space=453
> processors            8
> user_lists            NONE
> xuser_lists           NONE
> projects              NONE
> xprojects             NONE
> usage_scaling         NONE
> report_variables      NONE
>
> # qconf -se node-ib0307bi
> hostname              node-ib0307bi.linux.crg.es
> load_scaling          NONE
> complex_values        slots=8,virtual_free=60G
> load_values           load_avg=0.000000,load_short=0.000000, \
>                       load_medium=0.000000,load_long=0.000000, \
>                       arch=linux-x64,num_proc=8,mem_free=46013.773438M, \
>                       swap_free=32758.222656M,virtual_free=78771.996094M, \
>                       mem_total=48257.062500M,swap_total=32767.992188M, \
>                       virtual_total=81025.054688M,mem_used=2243.289062M, \
>                       swap_used=9.769531M,virtual_used=2253.058594M, \
>                       cpu=0.000000,m_topology=SCCCCSCCCC, \
>                       m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
>                       np_load_avg=0.000000,np_load_short=0.000000, \
>                       np_load_medium=0.000000,np_load_long=0.000000, \
>                       free_scratch_space=453
> processors            8
> user_lists            NONE
> xuser_lists           NONE
> projects              NONE
> xprojects             NONE
> usage_scaling         NONE
> report_variables      NONE
>
> # qconf -sp smp
> pe_name               smp
> slots                 1024
> user_lists            NONE
> xuser_lists           NONE
> start_proc_args       NONE
> stop_proc_args        NONE
> allocation_rule       $pe_slots
> control_slaves        FALSE
> job_is_first_task     TRUE
> urgency_slots         min
> accounting_summary    FALSE
>
> and the queued job:
>
> # qstat -explain c -j 17779
> ==============================================================
> job_number:              17779
> exec_file:               job_scripts/17779
> submission_time:         Thu Jan 31 14:56:07 2013
> owner:                   XX
> uid:                     XX
> group:                   XXEstivill
> gid:                     XX
> sge_o_home:              XX
> sge_o_log_name:          XX
> sge_o_path:              /usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lib64/openmpi/bin/:/usr/lib64/compat-openmpi/bin/:/users/xe/dtrujillano/bin
> sge_o_shell:             /bin/bash
> sge_o_workdir:           XX
> sge_o_host:              ant-login4
> account:                 sge
> stderr_path_list:        NONE:NONE:/usersXX
> hard resource_list:      virtual_free=12G,h_vmem=5G
> mail_list:               [email protected]
> notify:                  FALSE
> job_name:                r21.sh
> stdout_path_list:        NONE:NONE:/usersXX
> jobshare:                0
> hard_queue_list:         default
> env_list:
> script_file:             /users/XX
> parallel environment:    smp range: 8
> verify_suitable_queues:  2
> scheduling info:
>   cannot run in queue "cn-el6" because it is not contained in its hard queue list (-q)
>   cannot run in queue "rg-el6" because it is not contained in its hard queue list (-q)
>   cannot run in queue "xe-el6" because it is not contained in its hard queue list (-q)
>   (-l slots=1) cannot run in queue "node-ib0301bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0304bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0302bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0305bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0306bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0303bi.linux.crg.es" because it offers only hc:slots=0.000000
>   cannot run in PE "smp" because it only offers 0 slots
>
> As you can see, the scheduler says that the "dedicated queues" are not in
> the job's queue list, and it says that the job cannot run on the
> high-memory nodes (1 to 6), but it says nothing about the nodes with less
> memory (7 to 14), and it also complains about the available slots in the
> smp PE (strange, because its limit should be 1024). The memory limit is
> less than the memory available on the node... so, from my understanding,
> the job could run on nodes 7 to 14.
>
> 1.-) Why does the scheduler message say nothing about the nodes with
> available resources (7 to 14, the low-memory nodes)? Is it taking them
> into account?
> 2.-) What does that PE message mean? There are 10 jobs in the cluster,
> each running on 8 slots, so only 80 slots of the smp PE should be counted.
>
> Could anyone with more experience help me debug the pending status of
> this job?
>
> TIA,
> Arnau
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
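[Editor's note] Two standard Grid Engine commands can make the scheduler's view of a pending job visible on demand; a sketch, using job id 17779 and the queue and complex names from the thread above:

```shell
# Re-run the scheduler's suitability check for the pending job and print
# the per-queue rejection reasons (-w p validates against the current
# cluster state):
qalter -w p 17779

# Show the remaining ("hc:") value of the consumables on each queue
# instance; hc:slots and hc:virtual_free shrink as running jobs are
# debited, which is where the "offers only hc:slots=0.000000" lines
# come from:
qstat -F slots,virtual_free -q default
```

The "cannot run in PE smp because it only offers 0 slots" line likely reflects the hc:slots=0 on every reachable queue instance, not the PE's own 1024-slot limit.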
