Hi, are you requesting 8 times 12 GB of virtual_free? Is this available on the nodes?
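If virtual_free is a consumable complex (it appears in the nodes' complex_values below), Grid Engine charges the per-slot request once per granted slot, and allocation_rule $pe_slots places all slots of the job on a single host. A minimal sketch of the arithmetic, using the 12G and the 8 slots from the job shown below:

```shell
# Consumable requests are debited per slot; with allocation_rule
# $pe_slots every slot of the job lands on one host, so that single
# host must cover the whole product.
req_per_slot=12   # GB, from "hard resource_list: virtual_free=12G"
pe_slots=8        # from "parallel environment: smp range: 8"
echo "one host must offer $((req_per_slot * pe_slots))G of virtual_free"
# -> 96G: this fits the high-memory nodes (complex_values
#    virtual_free=102G) but not the low-memory ones (virtual_free=60G),
#    which would explain why the low-memory nodes are ruled out without
#    any message.
```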
-- Reuti

On 01.02.2013 at 18:04, Arnau wrote:

> Hi all,
>
> Our cluster has been running jobs fine, and I've been able to debug the
> pending reason on queued jobs. But now I'm facing some jobs that, from my
> understanding, should be running but are queued.
>
> Our cluster has some dedicated queues (which use a certain hostgroup) and a
> default queue that uses some other hosts. In this default queue, the
> hostgroup contains nodes that are quite similar, but some have 94 GB of
> memory and the others 48 GB. My first thought was that the user was
> requesting a large amount of memory, and that for that reason the
> low-memory nodes were discarded, but this is not true.
>
> The conf:
>
> default queue:
> # qconf -sq default
> qname                 default
> hostlist              @ibm
> seq_no                1
> load_thresholds       np_load_avg=1.3
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               smp ompi
> rerun                 FALSE
> slots                 9999
> [....]
>
> # qconf -shgrp @ibm
> group_name @ibm
> hostlist node-ib0301bi.linux.crg.es node-ib0302bi.linux.crg.es \
>          node-ib0303bi.linux.crg.es node-ib0304bi.linux.crg.es \
>          node-ib0305bi.linux.crg.es node-ib0306bi.linux.crg.es \
>          node-ib0307bi.linux.crg.es node-ib0308bi.linux.crg.es \
>          node-ib0309bi.linux.crg.es node-ib0310bi.linux.crg.es \
>          node-ib0311bi.linux.crg.es node-ib0312bi.linux.crg.es \
>          node-ib0313bi.linux.crg.es node-ib0314bi.linux.crg.es
>
> # qhost (output truncated, just a couple of nodes from each type):
> [...]
> node-ib0305bi  linux-x64  8  1.00  94.5G  7.9G  32.0G   7.7M
> node-ib0306bi  linux-x64  8  2.62  94.5G  8.1G  32.0G  14.3M
> node-ib0307bi  linux-x64  8  0.00  47.1G  2.2G  32.0G   9.8M
> node-ib0308bi  linux-x64  8  0.00  47.1G  2.3G  32.0G   9.2M
> [....]
>
> detailed host definition (high and low memory):
>
> # qconf -se node-ib0305bi
> hostname              node-ib0305bi.linux.crg.es
> load_scaling          NONE
> complex_values        slots=8,virtual_free=102G
> load_values           arch=linux-x64,num_proc=8,mem_total=96733.835938M, \
>                       swap_total=32767.992188M,virtual_total=129501.828125M, \
>                       load_avg=1.000000,load_short=1.000000, \
>                       load_medium=1.000000,load_long=1.320000, \
>                       mem_free=88596.046875M,swap_free=32760.312500M, \
>                       virtual_free=121356.359375M,mem_used=8137.789062M, \
>                       swap_used=7.679688M,virtual_used=8145.468750M, \
>                       cpu=12.500000,m_topology=SCCCCSCCCC, \
>                       m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
>                       np_load_avg=0.125000,np_load_short=0.125000, \
>                       np_load_medium=0.125000,np_load_long=0.165000, \
>                       free_scratch_space=453
> processors            8
> user_lists            NONE
> xuser_lists           NONE
> projects              NONE
> xprojects             NONE
> usage_scaling         NONE
> report_variables      NONE
>
> # qconf -se node-ib0307bi
> hostname              node-ib0307bi.linux.crg.es
> load_scaling          NONE
> complex_values        slots=8,virtual_free=60G
> load_values           load_avg=0.000000,load_short=0.000000, \
>                       load_medium=0.000000,load_long=0.000000, \
>                       arch=linux-x64,num_proc=8,mem_free=46013.773438M, \
>                       swap_free=32758.222656M,virtual_free=78771.996094M, \
>                       mem_total=48257.062500M,swap_total=32767.992188M, \
>                       virtual_total=81025.054688M,mem_used=2243.289062M, \
>                       swap_used=9.769531M,virtual_used=2253.058594M, \
>                       cpu=0.000000,m_topology=SCCCCSCCCC, \
>                       m_topology_inuse=SCCCCSCCCC,m_socket=2,m_core=8, \
>                       np_load_avg=0.000000,np_load_short=0.000000, \
>                       np_load_medium=0.000000,np_load_long=0.000000, \
>                       free_scratch_space=453
> processors            8
> user_lists            NONE
> xuser_lists           NONE
> projects              NONE
> xprojects             NONE
> usage_scaling         NONE
> report_variables      NONE
>
> # qconf -sp smp
> pe_name               smp
> slots                 1024
> user_lists            NONE
> xuser_lists           NONE
> start_proc_args       NONE
> stop_proc_args        NONE
> allocation_rule       $pe_slots
> control_slaves        FALSE
> job_is_first_task     TRUE
> urgency_slots         min
> accounting_summary    FALSE
>
> and the queued job:
>
> # qstat -explain c -j 17779
> ==============================================================
> job_number:              17779
> exec_file:               job_scripts/17779
> submission_time:         Thu Jan 31 14:56:07 2013
> owner:                   XX
> uid:                     XX
> group:                   XXEstivill
> gid:                     XX
> sge_o_home:              XX
> sge_o_log_name:          XX
> sge_o_path:              /usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lib64/openmpi/bin/:/usr/lib64/compat-openmpi/bin/:/users/xe/dtrujillano/bin
> sge_o_shell:             /bin/bash
> sge_o_workdir:           XX
> sge_o_host:              ant-login4
> account:                 sge
> stderr_path_list:        NONE:NONE:/usersXX
> hard resource_list:      virtual_free=12G,h_vmem=5G
> mail_list:               [email protected]
> notify:                  FALSE
> job_name:                r21.sh
> stdout_path_list:        NONE:NONE:/usersXX
> jobshare:                0
> hard_queue_list:         default
> env_list:
> script_file:             /users/XX
> parallel environment:    smp range: 8
> verify_suitable_queues:  2
> scheduling info:
>   cannot run in queue "cn-el6" because it is not contained in its hard queue list (-q)
>   cannot run in queue "rg-el6" because it is not contained in its hard queue list (-q)
>   cannot run in queue "xe-el6" because it is not contained in its hard queue list (-q)
>   (-l slots=1) cannot run in queue "node-ib0301bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0304bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0302bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0305bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0306bi.linux.crg.es" because it offers only hc:slots=0.000000
>   (-l slots=1) cannot run in queue "node-ib0303bi.linux.crg.es" because it offers only hc:slots=0.000000
>   cannot run in PE "smp" because it only offers 0 slots
>
> As you can see, the scheduler says that the "dedicated queues" are not in
> the job's queue list, and it says that the job cannot run on the
> high-memory nodes (1 to 6), but it says nothing about the nodes with less
> memory (7 to 14), and it also complains about the available slots in the
> smp PE (strange, because its limit should be 1024). The memory limit is
> less than the memory available on the node... so, from my understanding,
> the job could run on nodes 7 to 14.
>
> 1.-) Why does the scheduler message say nothing about the nodes with
> available resources (7 to 14, the low-memory nodes)? Is it taking them
> into account?
> 2.-) What does that PE message mean? There are 10 jobs in the cluster,
> each running on 8 slots, so only 80 slots of the smp PE should be counted.
>
> Could anyone with more experience help me debug the pending status of
> this job?
>
> TIA,
> Arnau
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
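[Editor's note] Two standard Grid Engine commands can make the scheduler's view of a pending job visible on demand; a sketch, using job id 17779 and the queue and complex names from the thread above:

```shell
# Re-run the scheduler's suitability check for the pending job and print
# the per-queue rejection reasons (-w p validates against the current
# cluster state):
qalter -w p 17779

# Show the remaining ("hc:") value of the consumables on each queue
# instance; hc:slots and hc:virtual_free shrink as running jobs are
# debited, which is where the "offers only hc:slots=0.000000" lines
# come from:
qstat -F slots,virtual_free -q default
```

The "cannot run in PE smp because it only offers 0 slots" line likely reflects the hc:slots=0 on every reachable queue instance, not the PE's own 1024-slot limit.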
