Hi everyone,
it seems that the nodes are "temporarily not available" because they are in
an error state (the capital 'E' in the states column of the qstat -f
output).
One way to find out why this happened is to run qstat -f -explain E, or to
read the log file of the sge_qmaster on the master and of the sge_execd on
a node in that state. To recover all nodes from that state you can
run qmod -cq all.q.
If you run env | grep SGE on a node, what is the output?
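For example (a sketch, assuming the default cell name "default" and the
usual spool locations under $SGE_ROOT; adjust the paths to your install):

# on the master: show the reason for the E state and check the qmaster log
qstat -f -explain E
tail -50 $SGE_ROOT/default/spool/qmaster/messages

# on an affected node: check the execd log
tail -50 $SGE_ROOT/default/spool/$(hostname)/messages

# once the cause is fixed, clear the error state on the whole queue
qmod -cq all.q

# and check the environment the daemons see
env | grep SGE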
Let's see if William has some other advice.
Greetings,
Paolo
On 15/06/2015 19:27, Daniele Gregori wrote:
> Hi William,
> the problem is more general: serial jobs also stay pending. First I
> tried your test commands:
>
>
> [root@hactar ~]# qalter -w v 962
> Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
> Job 962 cannot run in PE "mpi" because it only offers 0 slots
> verification: no suitable queues
>
> [root@hactar ~]# qalter -w p 962
> Job 962 queue instance "all.q@compute-1-5" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-9" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-8" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-6" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-1" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-2" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-11" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-14" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-3" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-4" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-7" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-10" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-12" dropped because it is temporarily not available
> Job 962 queue instance "all.q@compute-1-13" dropped because it is temporarily not available
> Job 962 cannot run in PE "mpi" because it only offers 0 slots
> verification: no suitable queues
> [root@hactar ~]#
>
>
> I changed the max_reservation parameter from 0 to 336 (all available
> slots) without benefit. I also tried submitting serial jobs, and they
> stay pending too.
>
> I restarted the qmaster and the execd daemons on each client, with no change.
> Now the status is:
>
>
> [root@hactar ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E
>
> [root@hactar ~]# qhost
> HOSTNAME       ARCH       NCPU LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global         -          -    -     -       -       -       -
> compute-1-1    linux-x64  24   0.15  125.9G  1.6G    15.6G   0.0
> compute-1-10   linux-x64  24   0.13  125.9G  1.6G    15.6G   0.0
> compute-1-11   linux-x64  24   0.03  125.9G  1.6G    15.6G   0.0
> compute-1-12   linux-x64  24   0.12  125.9G  1.5G    15.6G   0.0
> compute-1-13   linux-x64  24   0.03  125.9G  1.6G    15.6G   0.0
> compute-1-14   linux-x64  24   0.10  125.9G  1.6G    15.6G   0.0
> compute-1-2    linux-x64  24   0.12  125.9G  1.6G    15.6G   0.0
> compute-1-3    linux-x64  24   0.09  125.9G  1.5G    15.6G   0.0
> compute-1-4    linux-x64  24   0.16  125.9G  1.6G    15.6G   0.0
> compute-1-5    linux-x64  24   0.12  125.9G  1.6G    15.6G   0.0
> compute-1-6    linux-x64  24   0.07  125.9G  9.5G    15.6G   0.0
> compute-1-7    linux-x64  24   0.05  125.9G  1.6G    15.6G   0.0
> compute-1-8    linux-x64  24   0.04  125.9G  1.2G    15.6G   0.0
> compute-1-9    linux-x64  24   0.09  125.9G  1.6G    15.6G   0.0
>
>
> The ports are:
>
>
> [root@hactar ~]# netstat -nltp |grep 644
> tcp        0      0 0.0.0.0:6444      0.0.0.0:*      LISTEN      3316/sge_qmaster
> [root@hactar ~]# ssh compute-1-1
> Warning: Permanently added 'compute-1-1,192.168.0.1' (RSA) to the list of known hosts.
> Last login: Mon Jun 15 10:30:48 2015 from hactar
> [root@compute-1-1 ~]# netstat -nltp |grep 644
> tcp        0      0 0.0.0.0:6445      0.0.0.0:*      LISTEN      60124/sge_execd
> [root@compute-1-1 ~]# ps -ef|grep sge
> root      60124      1  0 10:33 ?        00:00:00 /opt/shared/ge2011.11/bin/linux-x64/sge_execd
> root      62004  61373  0 10:38 pts/0    00:00:00 grep sge
> [root@compute-1-1 ~]#
>
> Where hactar is the master and compute-1-1 is one client.
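To check that a node can actually reach the qmaster on port 6444, the
qping tool that ships with SGE can help (the binary path below is taken
from the ps output above):

/opt/shared/ge2011.11/bin/linux-x64/qping -info hactar 6444 qmaster 1

Run from a compute node, this should report the qmaster as reachable; if
it does not, the E state probably comes from a communication problem.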
> The scheduler configuration is:
>
>
> [root@hactar ~]# qconf -sss
> hactar
> [root@hactar ~]# qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   false
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         0
> weight_tickets_share              0
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.010000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   336
> default_duration                  INFINITY
> [root@hactar ~]#
>
> At this point I submitted serial and MPI jobs and they are all pending.
>
> How can I discover the cause of this behaviour?
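One thing that would help here: the dump above shows schedd_job_info
false. If you set it to true, the scheduler starts recording why it skips
each pending job, and qstat will show it:

qconf -msconf    # change "schedd_job_info false" to "schedd_job_info true"
qstat -j 962     # the "scheduling info:" section now explains the qw state

You may want to set it back to false afterwards, since collecting this
information adds load on the qmaster.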
>
> Thanks
>
> D.
>
>
> On 15 Jun 2015, at 09:27, William Hay <[email protected]> wrote:
>
> On Sat, 13 Jun 2015 16:41:18 +0000, Daniele Gregori <[email protected]> wrote:
>
>
> [e4user@hactar greg]$ qsub -pe mpi 2 sge.sh
>
> The problem is that the job doesn't start; it is always in qw state:
>
> [e4user@hactar greg]$ qstat
> job-ID  prior    name        user    state  submit/start at      queue  slots  ja-task-ID
> -----------------------------------------------------------------------------------------
>    961  0.60500  mpi_date.s  e4user  qw     06/12/2015 17:45:09         8
>    962  0.50500  sge.sh      e4user  qw     06/13/2015 18:24:30         2
> [e4user@hactar greg]$
>
> mpi_date is a similar test job submitted before.
> Any hint on how to get the job started?
>
> The first thing to do is to check why grid engine thinks it can't run
> (qalter -w v 962 and qalter -w p 962).
>
> One possibility is that, if there are lots of serial jobs in the queue,
> the parallel job never gets started because there are never 2 slots free
> simultaneously. To prevent this you need to configure reservations
> (max_reservation in the scheduler configuration) and request one
> (qsub -R y) when submitting the job.
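For example, with max_reservation already raised in the scheduler
configuration, the test job shown above would be resubmitted as:

[e4user@hactar greg]$ qsub -pe mpi 2 -R y sge.sh

-R y asks the scheduler to reserve the two slots for the job as they
become free, instead of letting newly arriving serial jobs take them.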
>
> William
>
>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users