Hi William,

the problem is more general: serial jobs also stay pending. First I tried your test commands:

[root@hactar ~]# qalter -w v 962
Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
Job 962 cannot run in PE "mpi" because it only offers 0 slots
verification: no suitable queues
[root@hactar ~]# qalter -w p 962
Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
Job 962 cannot run in PE "mpi" because it only offers 0 slots
verification: no suitable queues
[root@hactar ~]#

I changed the max_reservation parameter from 0 to 336 (all available slots) without benefit. I also tried submitting serial jobs, and they stay pending as well. I restarted the qmaster and the execd daemons on every client, with no change. Now the status is:

[root@hactar ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E
[root@hactar ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-1-1             linux-x64      24  0.15  125.9G    1.6G   15.6G     0.0
compute-1-10            linux-x64      24  0.13  125.9G    1.6G   15.6G     0.0
compute-1-11            linux-x64      24  0.03  125.9G    1.6G   15.6G     0.0
compute-1-12            linux-x64      24  0.12  125.9G    1.5G   15.6G     0.0
compute-1-13            linux-x64      24  0.03  125.9G    1.6G   15.6G     0.0
compute-1-14            linux-x64      24  0.10  125.9G    1.6G   15.6G     0.0
compute-1-2             linux-x64      24  0.12  125.9G    1.6G   15.6G     0.0
compute-1-3             linux-x64      24  0.09  125.9G    1.5G   15.6G     0.0
compute-1-4             linux-x64      24  0.16  125.9G    1.6G   15.6G     0.0
compute-1-5             linux-x64      24  0.12  125.9G    1.6G   15.6G     0.0
compute-1-6             linux-x64      24  0.07  125.9G    9.5G   15.6G     0.0
compute-1-7             linux-x64      24  0.05  125.9G    1.6G   15.6G     0.0
compute-1-8             linux-x64      24  0.04  125.9G    1.2G   15.6G     0.0
compute-1-9             linux-x64      24  0.09  125.9G    1.6G   15.6G     0.0

The ports are:

[root@hactar ~]# netstat -nltp |grep 644
tcp        0      0 0.0.0.0:6444        0.0.0.0:*        LISTEN      3316/sge_qmaster
[root@hactar ~]# ssh compute-1-1
Warning: Permanently added 'compute-1-1,192.168.0.1' (RSA) to the list of known hosts.
Last login: Mon Jun 15 10:30:48 2015 from hactar
[root@compute-1-1 ~]# netstat -nltp |grep 644
tcp        0      0 0.0.0.0:6445        0.0.0.0:*        LISTEN      60124/sge_execd
[root@compute-1-1 ~]# ps -ef|grep sge
root     60124     1  0 10:33 ?        00:00:00 /opt/shared/ge2011.11/bin/linux-x64/sge_execd
root     62004 61373  0 10:38 pts/0    00:00:00 grep sge
[root@compute-1-1 ~]#

Here hactar is the master and compute-1-1 is one of the clients.
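
Every queue instance in the qstat -f listing above has an entry in the states column, so it can help to pull those rows out mechanically. The following is only a minimal shell sketch and is not part of the original report: flagged_queues is a hypothetical helper name, and it assumes the row layout shown above (six whitespace-separated fields when a state is present, five when the column is empty).

```shell
# Print "queue-instance state" for every queue instance whose
# states column is non-empty. Reads `qstat -f` output on stdin.
# Assumption: queue instance rows contain "@" in the first field and
# have exactly six fields when a state is set (header, separator and
# job rows do not match both conditions).
flagged_queues() {
    awk '$1 ~ /@/ && NF == 6 { print $1, $6 }'
}

# Usage on the master (needs a working grid engine environment):
#   qstat -f | flagged_queues
```

According to qstat(1), E denotes the error state, and qstat -f -explain E should print the reason recorded for each queue instance in that state.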

The scheduler configuration is:

[root@hactar ~]# qconf -sss
hactar
[root@hactar ~]# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   336
default_duration                  INFINITY
[root@hactar ~]#

At this point I submitted serial and MPI jobs, and they are all pending. How can I find the cause of this behaviour?

Thanks,
D.

On 15 Jun 2015, at 09:27, William Hay <w....@ucl.ac.uk> wrote:

On Sat, 13 Jun 2015 16:41:18 +0000 Daniele Gregori <daniele.greg...@e4company.com> wrote:

[e4user@hactar greg]$ qsub -pe mpi 2 sge.sh

The problem is that the job doesn't start; it is always in qw state:

[e4user@hactar greg]$ qstat
job-ID  prior    name        user    state  submit/start at      queue  slots  ja-task-ID
-----------------------------------------------------------------------------------------
   961  0.60500  mpi_date.s  e4user  qw     06/12/2015 17:45:09             8
   962  0.50500  sge.sh      e4user  qw     06/13/2015 18:24:30             2
[e4user@hactar greg]$

mpi_date is a similar test job submitted before. Any hint to start the job?

First thing to do is to check why grid engine thinks it can't run (qalter -w v 962 and qalter -w p 962).

One possibility is that, if there are lots of serial jobs in the queue, the job can't get started because there are never 2 slots free simultaneously. To prevent this you need to configure reservations (max_reservation in the scheduler configuration) and request one (qsub -R y) when submitting the job.

William
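
The reservation advice above can be sketched in shell as follows. This is an illustration under stated assumptions rather than a verified recipe for this cluster: max_reservation_value is a hypothetical helper name, and it assumes the "name value" line layout that qconf -ssconf prints. The qsub line reuses the job script and PE from the original submission.

```shell
# Extract the max_reservation setting from scheduler configuration
# text on stdin (one "name value" pair per line, as printed by
# `qconf -ssconf`).
max_reservation_value() {
    awk '$1 == "max_reservation" { print $2 }'
}

# Usage on the master (needs a working grid engine environment):
#   qconf -ssconf | max_reservation_value   # per sched_conf(5), 0 means no reservations are made
#   qsub -R y -pe mpi 2 sge.sh              # request a reservation for the parallel job
```

Both halves are needed: a non-zero max_reservation only allows the scheduler to make reservations, and a job gets one only if it is submitted with -R y.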