Hi all,

Our cluster has been running jobs fine, and so far I've been able to debug the
pending reason on queued jobs. But now I'm facing some jobs that, from my
understanding, should be running, yet they stay queued.

Our cluster has some dedicated queues (each using a certain hostgroup) and
a default queue that uses some other hosts. In this default queue, the
hostgroup contains nodes that are quite similar, but some have 94GB of
memory and the others 48GB. My first thought was that the user was requesting
a large amount of memory, and for that reason the low-memory nodes were
being discarded, but this is not the case.

The conf:

default queue:
# qconf -sq default
qname                 default
hostlist              @ibm
seq_no                1
load_thresholds       np_load_avg=1.3
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               smp ompi
rerun                 FALSE
slots                 9999
[....]

# qconf -shgrp @ibm
group_name @ibm
hostlist node-ib0301bi.linux.crg.es node-ib0302bi.linux.crg.es \
         node-ib0303bi.linux.crg.es node-ib0304bi.linux.crg.es \
         node-ib0305bi.linux.crg.es node-ib0306bi.linux.crg.es \
         node-ib0307bi.linux.crg.es node-ib0308bi.linux.crg.es \
         node-ib0309bi.linux.crg.es node-ib0310bi.linux.crg.es \
         node-ib0311bi.linux.crg.es node-ib0312bi.linux.crg.es \
         node-ib0313bi.linux.crg.es node-ib0314bi.linux.crg.es
# qhost : (output truncated, just a couple of nodes from each type):
[...]
node-ib0305bi           linux-x64       8  1.00   94.5G    7.9G   32.0G    7.7M
node-ib0306bi           linux-x64       8  2.62   94.5G    8.1G   32.0G   14.3M
node-ib0307bi           linux-x64       8  0.00   47.1G    2.2G   32.0G    9.8M
node-ib0308bi           linux-x64       8  0.00   47.1G    2.3G   32.0G    9.2M
[....]

detailed host definition (high and low memory):
# qconf -se node-ib0305bi
hostname              node-ib0305bi.linux.crg.es
load_scaling          NONE
complex_values        slots=8,virtual_free=102G
load_values           arch=linux-x64,num_proc=8,mem_total=96733.835938M, \
                      swap_total=32767.992188M,virtual_total=129501.828125M, \
                      load_avg=1.000000,load_short=1.000000, \
                      load_medium=1.000000,load_long=1.320000, \
                      mem_free=88596.046875M,swap_free=32760.312500M, \
                      virtual_free=121356.359375M,mem_used=8137.789062M, \
                      swap_used=7.679688M,virtual_used=8145.468750M, \
                      cpu=12.500000,m_topology=SCCCCSCCCC, \
                      m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
                      np_load_avg=0.125000,np_load_short=0.125000, \
                      np_load_medium=0.125000,np_load_long=0.165000, \
                      free_scratch_space=453
processors            8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
# qconf -se node-ib0307bi
hostname              node-ib0307bi.linux.crg.es
load_scaling          NONE
complex_values        slots=8,virtual_free=60G
load_values           load_avg=0.000000,load_short=0.000000, \
                      load_medium=0.000000,load_long=0.000000,arch=linux-x64, \
                      num_proc=8,mem_free=46013.773438M, \
                      swap_free=32758.222656M,virtual_free=78771.996094M, \
                      mem_total=48257.062500M,swap_total=32767.992188M, \
                      virtual_total=81025.054688M,mem_used=2243.289062M, \
                      swap_used=9.769531M,virtual_used=2253.058594M, \
                      cpu=0.000000,m_topology=SCCCCSCCCC, \
                      m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
                      np_load_avg=0.000000,np_load_short=0.000000, \
                      np_load_medium=0.000000,np_load_long=0.000000, \
                      free_scratch_space=453
processors            8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

# qconf -sp smp
pe_name            smp
slots              1024
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

and the queued job:
qstat -explain c -j 17779
==============================================================
job_number:                 17779
exec_file:                  job_scripts/17779
submission_time:            Thu Jan 31 14:56:07 2013
owner:                      XX
uid:                        XX
group:                      XXEstivill
gid:                        XX
sge_o_home:                 XX
sge_o_log_name:             XX
sge_o_path:
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lib64/openmpi/bin/:/usr/lib64/compat-openmpi/bin/:/users/xe/dtrujillano/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              XX
sge_o_host:                 ant-login4
account:                    sge
stderr_path_list:           NONE:NONE:/usersXX
hard resource_list:         virtual_free=12G,h_vmem=5G
mail_list:                  [email protected]
notify:                     FALSE
job_name:                   r21.sh
stdout_path_list:           NONE:NONE:/usersXX
jobshare:                   0
hard_queue_list:            default
env_list:
script_file:                /users/XX
parallel environment:  smp range: 8
verify_suitable_queues:     2
scheduling info:            cannot run in queue "cn-el6" because it is not contained in its hard queue list (-q)
                            cannot run in queue "rg-el6" because it is not contained in its hard queue list (-q)
                            cannot run in queue "xe-el6" because it is not contained in its hard queue list (-q)
                            (-l slots=1) cannot run in queue "node-ib0301bi.linux.crg.es" because it offers only hc:slots=0.000000
                            (-l slots=1) cannot run in queue "node-ib0304bi.linux.crg.es" because it offers only hc:slots=0.000000
                            (-l slots=1) cannot run in queue "node-ib0302bi.linux.crg.es" because it offers only hc:slots=0.000000
                            (-l slots=1) cannot run in queue "node-ib0305bi.linux.crg.es" because it offers only hc:slots=0.000000
                            (-l slots=1) cannot run in queue "node-ib0306bi.linux.crg.es" because it offers only hc:slots=0.000000
                            (-l slots=1) cannot run in queue "node-ib0303bi.linux.crg.es" because it offers only hc:slots=0.000000
                            cannot run in PE "smp" because it only offers 0 slots

As you can see, the scheduler says that the "dedicated queues" are not in
the job's hard queue list, and that the job cannot run on the high-memory
nodes (1 to 6), but it says nothing about the nodes with less memory (7 to
14). It also complains about available slots in the smp PE, which is strange,
because its limit should be 1024. The requested memory is less than the
memory available on those nodes, so, from my understanding, the job should
be able to run on nodes 7 to 14.

1.-) Why does the scheduler message say nothing about the nodes with
available resources (7 to 14, the low-memory nodes)? Is it taking them into
account?
2.-) What does that PE message mean? In the cluster there are 10 jobs, each
running on 8 slots, so only 80 slots from the smp PE should be counted.
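For what it's worth, this is how I counted the in-use smp slots: summing the
slots column of `qstat -s r -pe smp -u '*'` with awk. The sample lines below
are made up to illustrate the format, not real output from our cluster:

```shell
# Illustrative only: sum the last (slots) column over running-job lines.
# Real input would come from: qstat -s r -pe smp -u '*'
used=$(cat <<'EOF' | awk 'NF { total += $NF } END { print total }'
17701 0.55500 r21.sh user1 r 01/31/2013 10:00:00 default@node-ib0301bi 8
17702 0.55500 r22.sh user1 r 01/31/2013 10:05:00 default@node-ib0302bi 8
17703 0.55500 r23.sh user2 r 01/31/2013 10:10:00 default@node-ib0303bi 8
EOF
)
echo "$used slots in use"
```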


Could anyone with more experience help me debug the pending status of this
job?

TIA,
Arnau
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
