Hi William,
the problem is more general: serial jobs stay pending too. First I tried
your test commands:


[root@hactar ~]# qalter -w v 962
Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
Job 962 cannot run in PE "mpi" because it only offers 0 slots
verification: no suitable queues

[root@hactar ~]# qalter -w p 962
Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
Job 962 cannot run in PE "mpi" because it only offers 0 slots
verification: no suitable queues
[root@hactar ~]#


I changed the max_reservation parameter from 0 to 336 (all available slots),
with no effect. I also tried submitting serial jobs, and they stay pending
as well.

I restarted the qmaster daemon and the execd daemon on each client, with no
change. Now the status is:


[root@hactar ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
---------------------------------------------------------------------------------
all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E

[root@hactar ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-1-1             linux-x64      24  0.15  125.9G    1.6G   15.6G     0.0
compute-1-10            linux-x64      24  0.13  125.9G    1.6G   15.6G     0.0
compute-1-11            linux-x64      24  0.03  125.9G    1.6G   15.6G     0.0
compute-1-12            linux-x64      24  0.12  125.9G    1.5G   15.6G     0.0
compute-1-13            linux-x64      24  0.03  125.9G    1.6G   15.6G     0.0
compute-1-14            linux-x64      24  0.10  125.9G    1.6G   15.6G     0.0
compute-1-2             linux-x64      24  0.12  125.9G    1.6G   15.6G     0.0
compute-1-3             linux-x64      24  0.09  125.9G    1.5G   15.6G     0.0
compute-1-4             linux-x64      24  0.16  125.9G    1.6G   15.6G     0.0
compute-1-5             linux-x64      24  0.12  125.9G    1.6G   15.6G     0.0
compute-1-6             linux-x64      24  0.07  125.9G    9.5G   15.6G     0.0
compute-1-7             linux-x64      24  0.05  125.9G    1.6G   15.6G     0.0
compute-1-8             linux-x64      24  0.04  125.9G    1.2G   15.6G     0.0
compute-1-9             linux-x64      24  0.09  125.9G    1.6G   15.6G     0.0
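
Every queue instance in the qstat -f listing above carries E in the states column, which Grid Engine uses for the error state. As a minimal sketch (the function name count_queue_states is hypothetical, and it assumes each queue line keeps its state as the sixth whitespace-separated field), the states can be tallied from saved or piped output like this:

```shell
# Hypothetical helper: tally queue-instance states from `qstat -f` output on stdin.
# Assumption: queue lines look like "all.q@host BIP 0/0/24 0.18 linux-x64 E",
# with the state (if any) in the sixth field.
count_queue_states() {
  awk '$1 ~ /@/ && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
         # A queue instance with no state flag has only five fields.
         state = (NF >= 6) ? $6 : "ok"
         count[state]++
       }
       END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}
```

On a live cluster this would be fed with `qstat -f | count_queue_states`. The reason a queue instance entered the error state can usually be displayed with `qstat -f -explain E`, and once the underlying problem is fixed the state can be cleared with `qmod -cq all.q`.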


The ports are:


[root@hactar ~]# netstat -nltp |grep 644
tcp        0      0 0.0.0.0:6444                0.0.0.0:*                   LISTEN      3316/sge_qmaster
[root@hactar ~]# ssh compute-1-1
Warning: Permanently added 'compute-1-1,192.168.0.1' (RSA) to the list of known hosts.
Last login: Mon Jun 15 10:30:48 2015 from hactar
[root@compute-1-1 ~]# netstat -nltp |grep 644
tcp        0      0 0.0.0.0:6445                0.0.0.0:*                   LISTEN      60124/sge_execd
[root@compute-1-1 ~]# ps -ef|grep sge
root      60124      1  0 10:33 ?        00:00:00 /opt/shared/ge2011.11/bin/linux-x64/sge_execd
root      62004  61373  0 10:38 pts/0    00:00:00 grep sge
[root@compute-1-1 ~]#

Here hactar is the master and compute-1-1 is one of the clients.
The scheduler configuration is:


[root@hactar ~]# qconf -sss
hactar
[root@hactar ~]# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   336
default_duration                  INFINITY
[root@hactar ~]#

At this point I submitted serial and MPI jobs, and they are all pending.

How can I discover the cause of this behaviour?

Thanks

D.


On 15 Jun 2015, at 09:27, William Hay <w....@ucl.ac.uk> wrote:

On Sat, 13 Jun 2015 16:41:18 +0000
Daniele Gregori <daniele.greg...@e4company.com> wrote:


[e4user@hactar greg]$ qsub -pe mpi 2 sge.sh

The problem is that the job doesn’t start; it is always in the qw state:

[e4user@hactar greg]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   961 0.60500 mpi_date.s e4user       qw    06/12/2015 17:45:09                                    8
   962 0.50500 sge.sh     e4user       qw    06/13/2015 18:24:30                                    2
[e4user@hactar greg]$

mpi_date is a similar test job submitted earlier.
Any hints on how to get the job started?

The first thing to do is to check why Grid Engine thinks it can't run the job
(qalter -w v 962 and qalter -w p 962).

One possibility is that, with lots of serial jobs in the queue, the job never
starts because 2 slots are never free simultaneously. To prevent this you
need to configure reservations (max_reservation in the scheduler
configuration) and request one (qsub -R y) when submitting the job.
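
The reservation advice above can be sketched as a small check. This is only an illustration: the function name reservation_enabled is hypothetical, and it assumes the scheduler configuration is supplied on stdin in the two-column form printed by qconf -ssconf.

```shell
# Hypothetical helper: read `qconf -ssconf` output on stdin and report
# whether resource reservation is switched on (max_reservation > 0).
reservation_enabled() {
  awk '$1 == "max_reservation" { found = 1; value = $2 }
       END {
         if (found && value + 0 > 0)
           print "enabled (max_reservation=" value ")"
         else
           print "disabled"
       }'
}

# On a live cluster (not run here):
#   qconf -ssconf | reservation_enabled
#   qsub -R y -pe mpi 2 sge.sh    # request a reservation for the parallel job
```

A reservation only helps when free slots exist but are never available together; it does not help when no queue instance is available at all.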

William



