Hi everyone,
it seems that the nodes are "temporarily not available" because they are in
an error state (the capital 'E' in the states column of the qstat -f
output).
One way to find out why this happened is to run qstat -f -explain E, or to
read the log file of the sge_qmaster on the master and of the sge_execd on
a node in that state. To recover all nodes from that state you can
run qmod -cq all.q.
If you run env | grep SGE on a node, what is the output?
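For example (a sketch, assuming the default cell name "default" and the
usual spool locations under $SGE_ROOT; adjust the paths to your install):

# on the master: show the reason for the E state and check the qmaster log
qstat -f -explain E
tail -50 $SGE_ROOT/default/spool/qmaster/messages

# on an affected node: check the execd log
tail -50 $SGE_ROOT/default/spool/$(hostname)/messages

# once the cause is fixed, clear the error state on the whole queue
qmod -cq all.q

# and check the environment the daemons see
env | grep SGE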
Let's see if William has some other advice.
Greetings,
Paolo
On 15/06/2015 19:27, Daniele Gregori wrote:
> Hi William,
> the problem is more general: serial jobs also stay pending. First I
> tried your test commands:
>
>
> [root@hactar ~]# qalter -w v 962
> Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
> Job 962 cannot run in PE "mpi" because it only offers 0 slots
> verification: no suitable queues
>
> [root@hactar ~]# qalter -w p 962
> Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
> Job 962 cannot run in PE "mpi" because it only offers 0 slots
> verification: no suitable queues
> [root@hactar ~]#
>
>
> I changed the max_reservation parameter from 0 to 336 (all available
> slots) without benefit. I also tried submitting serial jobs, and they
> stay pending too.
>
> I restarted the qmaster and the execd daemons on each client, with no change.
> Now the status is:
>
>
> [root@hactar ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E
>
> [root@hactar ~]# qhost
> HOSTNAME       ARCH       NCPU LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global         -          -    -     -       -       -       -
> compute-1-1    linux-x64  24   0.15  125.9G  1.6G    15.6G   0.0
> compute-1-10   linux-x64  24   0.13  125.9G  1.6G    15.6G   0.0
> compute-1-11   linux-x64  24   0.03  125.9G  1.6G    15.6G   0.0
> compute-1-12   linux-x64  24   0.12  125.9G  1.5G    15.6G   0.0
> compute-1-13   linux-x64  24   0.03  125.9G  1.6G    15.6G   0.0
> compute-1-14   linux-x64  24   0.10  125.9G  1.6G    15.6G   0.0
> compute-1-2    linux-x64  24   0.12  125.9G  1.6G    15.6G   0.0
> compute-1-3    linux-x64  24   0.09  125.9G  1.5G    15.6G   0.0
> compute-1-4    linux-x64  24   0.16  125.9G  1.6G    15.6G   0.0
> compute-1-5    linux-x64  24   0.12  125.9G  1.6G    15.6G   0.0
> compute-1-6    linux-x64  24   0.07  125.9G  9.5G    15.6G   0.0
> compute-1-7    linux-x64  24   0.05  125.9G  1.6G    15.6G   0.0
> compute-1-8    linux-x64  24   0.04  125.9G  1.2G    15.6G   0.0
> compute-1-9    linux-x64  24   0.09  125.9G  1.6G    15.6G   0.0
>
>
> The ports are:
>
>
> [root@hactar ~]# netstat -nltp |grep 644
> tcp        0      0 0.0.0.0:6444      0.0.0.0:*      LISTEN      3316/sge_qmaster
> [root@hactar ~]# ssh compute-1-1
> Warning: Permanently added 'compute-1-1,192.168.0.1' (RSA) to the list of known hosts.
> Last login: Mon Jun 15 10:30:48 2015 from hactar
> [root@compute-1-1 ~]# netstat -nltp |grep 644
> tcp        0      0 0.0.0.0:6445      0.0.0.0:*      LISTEN      60124/sge_execd
> [root@compute-1-1 ~]# ps -ef|grep sge
> root      60124      1  0 10:33 ?        00:00:00 /opt/shared/ge2011.11/bin/linux-x64/sge_execd
> root      62004  61373  0 10:38 pts/0    00:00:00 grep sge
> [root@compute-1-1 ~]#
>
> Where hactar is the master and compute-1-1 is one client.
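To check that a node can actually reach the qmaster on port 6444, the
qping tool that ships with SGE can help (the binary path below is taken
from the ps output above):

/opt/shared/ge2011.11/bin/linux-x64/qping -info hactar 6444 qmaster 1

Run from a compute node, this should report the qmaster as reachable; if
it does not, the E state probably comes from a communication problem.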
> The scheduler configuration is:
>
>
> [root@hactar ~]# qconf -sss
> hactar
> [root@hactar ~]# qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   false
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         0
> weight_tickets_share              0
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.010000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   336
> default_duration                  INFINITY
> [root@hactar ~]#
>
> At this point I submitted serial and MPI jobs and they are all pending.
>
> How can I discover the cause of this behaviour?
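One thing that would help here: the dump above shows schedd_job_info
false. If you set it to true, the scheduler starts recording why it skips
each pending job, and qstat will show it:

qconf -msconf    # change "schedd_job_info false" to "schedd_job_info true"
qstat -j 962     # the "scheduling info:" section now explains the qw state

You may want to set it back to false afterwards, since collecting this
information adds load on the qmaster.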
>
> Thanks
>
> D.
>
>
> On 15 Jun 2015, at 09:27, William Hay <[email protected]> wrote:
>
> On Sat, 13 Jun 2015 16:41:18 +0000, Daniele Gregori <[email protected]> wrote:
>
>
> [e4user@hactar greg]$ qsub -pe mpi 2 sge.sh
>
> The problem is that the job doesn't start; it is always in qw state:
>
> [e4user@hactar greg]$ qstat
> job-ID  prior    name        user    state  submit/start at      queue  slots  ja-task-ID
> -----------------------------------------------------------------------------------------
>    961  0.60500  mpi_date.s  e4user  qw     06/12/2015 17:45:09         8
>    962  0.50500  sge.sh      e4user  qw     06/13/2015 18:24:30         2
> [e4user@hactar greg]$
>
> mpi_date is a similar test job submitted before.
> Any hint on how to get the job started?
>
> The first thing to do is to check why grid engine thinks it can't run
> (qalter -w v 962 and qalter -w p 962).
>
> One possibility is that, if there are lots of serial jobs in the queue,
> the parallel job never gets started because there are never 2 slots free
> simultaneously. To prevent this you need to configure reservations
> (max_reservation in the scheduler configuration) and request one
> (qsub -R y) when submitting the job.
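For example, with max_reservation already raised in the scheduler
configuration, the test job shown above would be resubmitted as:

[e4user@hactar greg]$ qsub -pe mpi 2 -R y sge.sh

-R y asks the scheduler to reserve the two slots for the job as they
become free, instead of letting newly arriving serial jobs take them.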
>
> William
>
>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users