Hi all,
Our cluster has been running jobs fine, and I've been able to debug the
pending reason on queued jobs. But now I'm facing some jobs that, from my
understanding, should be running but are queued.
Our cluster has some dedicated queues (each using a certain hostgroup) and
a default queue that uses some other hosts. In this default queue, the
hostgroup contains nodes that are quite similar, but some have 94 GB of
memory and the others 48 GB. My first thought was that the user was
requesting a high amount of memory, and for that reason the low-memory
nodes were discarded, but this is not true.
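(For reference, this is roughly how I checked that; the job ID is the one
shown further below, and as far as I know qhost -F prints the per-host
availability of the named complex:

# qstat -j 17779 | grep 'hard resource_list'
hard resource_list:         virtual_free=12G,h_vmem=5G
# qhost -F virtual_free
)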
The conf:
default queue:
# qconf -sq default
qname default
hostlist @ibm
seq_no 1
load_thresholds np_load_avg=1.3
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list smp ompi
rerun FALSE
slots 9999
[....]
# qconf -shgrp @ibm
group_name @ibm
hostlist node-ib0301bi.linux.crg.es node-ib0302bi.linux.crg.es \
node-ib0303bi.linux.crg.es node-ib0304bi.linux.crg.es \
node-ib0305bi.linux.crg.es node-ib0306bi.linux.crg.es \
node-ib0307bi.linux.crg.es node-ib0308bi.linux.crg.es \
node-ib0309bi.linux.crg.es node-ib0310bi.linux.crg.es \
node-ib0311bi.linux.crg.es node-ib0312bi.linux.crg.es \
node-ib0313bi.linux.crg.es node-ib0314bi.linux.crg.es
# qhost (output truncated, just a couple of nodes from each type):
[...]
node-ib0305bi linux-x64 8 1.00 94.5G 7.9G 32.0G 7.7M
node-ib0306bi linux-x64 8 2.62 94.5G 8.1G 32.0G 14.3M
node-ib0307bi linux-x64 8 0.00 47.1G 2.2G 32.0G 9.8M
node-ib0308bi linux-x64 8 0.00 47.1G 2.3G 32.0G 9.2M
[....]
Detailed host definitions (one high-memory, one low-memory):
# qconf -se node-ib0305bi
hostname node-ib0305bi.linux.crg.es
load_scaling NONE
complex_values slots=8,virtual_free=102G
load_values arch=linux-x64,num_proc=8,mem_total=96733.835938M, \
swap_total=32767.992188M,virtual_total=129501.828125M, \
load_avg=1.000000,load_short=1.000000, \
load_medium=1.000000,load_long=1.320000, \
mem_free=88596.046875M,swap_free=32760.312500M, \
virtual_free=121356.359375M,mem_used=8137.789062M, \
swap_used=7.679688M,virtual_used=8145.468750M, \
cpu=12.500000,m_topology=SCCCCSCCCC, \
m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
np_load_avg=0.125000,np_load_short=0.125000, \
np_load_medium=0.125000,np_load_long=0.165000, \
free_scratch_space=453
processors 8
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
# qconf -se node-ib0307bi
hostname node-ib0307bi.linux.crg.es
load_scaling NONE
complex_values slots=8,virtual_free=60G
load_values load_avg=0.000000,load_short=0.000000, \
load_medium=0.000000,load_long=0.000000,arch=linux-x64, \
num_proc=8,mem_free=46013.773438M, \
swap_free=32758.222656M,virtual_free=78771.996094M, \
mem_total=48257.062500M,swap_total=32767.992188M, \
virtual_total=81025.054688M,mem_used=2243.289062M, \
swap_used=9.769531M,virtual_used=2253.058594M, \
cpu=0.000000,m_topology=SCCCCSCCCC, \
m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
np_load_avg=0.000000,np_load_short=0.000000, \
np_load_medium=0.000000,np_load_long=0.000000, \
free_scratch_space=453
processors 8
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
# qconf -sp smp
pe_name smp
slots 1024
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
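(Side note: to see which queues and jobs are tied to the PE I've been
using something like this; as far as I know qstat -pe restricts the
listing to queues attached to the named parallel environment:

# qstat -pe smp -u '*'
)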
and the queued job:
qstat -explain c -j 17779
==============================================================
job_number: 17779
exec_file: job_scripts/17779
submission_time: Thu Jan 31 14:56:07 2013
owner: XX
uid: XX
group: XXEstivill
gid: XX
sge_o_home: XX
sge_o_log_name: XX
sge_o_path:
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lib64/openmpi/bin/:/usr/lib64/compat-openmpi/bin/:/users/xe/dtrujillano/bin
sge_o_shell: /bin/bash
sge_o_workdir: XX
sge_o_host: ant-login4
account: sge
stderr_path_list: NONE:NONE:/usersXX
hard resource_list: virtual_free=12G,h_vmem=5G
mail_list: [email protected]
notify: FALSE
job_name: r21.sh
stdout_path_list: NONE:NONE:/usersXX
jobshare: 0
hard_queue_list: default
env_list:
script_file: /users/XX
parallel environment: smp range: 8
verify_suitable_queues: 2
scheduling info: cannot run in queue "cn-el6" because it is not contained in its hard queue list (-q)
                 cannot run in queue "rg-el6" because it is not contained in its hard queue list (-q)
                 cannot run in queue "xe-el6" because it is not contained in its hard queue list (-q)
                 (-l slots=1) cannot run in queue "node-ib0301bi.linux.crg.es" because it offers only hc:slots=0.000000
                 (-l slots=1) cannot run in queue "node-ib0304bi.linux.crg.es" because it offers only hc:slots=0.000000
                 (-l slots=1) cannot run in queue "node-ib0302bi.linux.crg.es" because it offers only hc:slots=0.000000
                 (-l slots=1) cannot run in queue "node-ib0305bi.linux.crg.es" because it offers only hc:slots=0.000000
                 (-l slots=1) cannot run in queue "node-ib0306bi.linux.crg.es" because it offers only hc:slots=0.000000
                 (-l slots=1) cannot run in queue "node-ib0303bi.linux.crg.es" because it offers only hc:slots=0.000000
                 cannot run in PE "smp" because it only offers 0 slots
As you can see, the scheduler says that the dedicated queues are not in
the job's hard queue list, and it says that the job cannot run on the
high-memory nodes (1 to 6), but it says nothing about the nodes with less
memory (7 to 14). It also complains about the available slots in the smp
PE, which is strange, because its limit should be 1024. The requested
memory is less than the memory available on those nodes... so, from my
understanding, the job should be able to run on nodes 7 to 14.
1.-) Why does the scheduler message say nothing about the nodes with
available resources (7 to 14, the low-memory nodes)? Is it taking them
into account?
2.-) What does that PE message mean? There are 10 jobs in the cluster,
each running on 8 slots, so only 80 slots of the smp PE should be counted
(I counted them as shown below).
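(Just a sketch of that count; it assumes there are no array tasks, so the
slots count is the last column of qstat's output:

# qstat -s r -u '*' -pe smp | awk 'NR > 2 { total += $NF } END { print total }'
)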
Could anyone with more experience help me debug the pending status of
this job?
TIA,
Arnau