We're having a problem similar to that described in this thread:
http://www.mentby.com/Group/grid-engine/62u4-resource-reservation-not-working-for-some-jobs.html

We're running Grid Engine 6.2u5 for a cluster of 4 Linux nodes (32 cores
each) running Ubuntu 12.04 (Precise).

Jobs that request a reservation (-R y) and sit at the top of the pending
list are not starting. Instead, lower-priority jobs that request fewer
cores slip ahead of the higher-priority job. An example is at the bottom
of this posting.

Here are the results of "qconf -ssconf":
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            MONITOR=1
reprioritize_interval             0:0:0
halftime                          720
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              100000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  SOF
weight_ticket                     1.000000
weight_waiting_time               0.278000
weight_deadline                   3600000.000000
weight_urgency                    0.000000
weight_priority                   0.000000
max_reservation                   10
default_duration                  7200:00:00

Here's the example:

Job #34378 was submitted as:
qsub -pe smp 16 -R y -b y "R CMD BATCH --no-save tmp.R tmp.out"


Soon after submitting it, we see that job #34378 is next in line:
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  33004 0.11762 tophat.sh  seqc         r     04/24/2013 07:14:20 [email protected]                 32
  33718 0.12405 fooSU_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33719 0.12405 fooSV_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33720 0.12405 fooWV_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33721 0.12405 fooWU_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33745 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:29:28 [email protected]                  1
  33758 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:30:28 [email protected]                  1
  33763 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:33:58 [email protected]                  1
  33787 0.06583 toy.sh     yjhuoh       r     05/08/2013 00:15:58 [email protected]                  1
  33794 0.06583 toy.sh     yjhuoh       r     05/08/2013 01:45:58 [email protected]                  1
  34183 0.00570 SubSampleF isoform      r     05/09/2013 03:29:32 [email protected]                  8
  34185 0.00570 SubSampleF isoform      r     05/09/2013 04:27:47 [email protected]                  8
  34186 0.00570 SubSampleF isoform      r     05/09/2013 04:36:47 [email protected]                  8
  34187 0.00570 SubSampleF isoform      r     05/09/2013 05:05:02 [email protected]                  8
  34188 0.00570 SubSampleF isoform      r     05/09/2013 05:42:17 [email protected]                  8
  34189 0.00570 SubSampleF isoform      r     05/09/2013 06:12:47 [email protected]                  8
  34190 0.00570 SubSampleF isoform      r     05/09/2013 06:14:17 [email protected]                  8
  34191 0.00570 SubSampleF isoform      r     05/09/2013 07:07:32 [email protected]                  8
  34192 0.00570 SubSampleF isoform      r     05/09/2013 07:24:02 [email protected]                  8
  34194 0.00570 SubSampleF isoform      r     05/09/2013 07:37:17 [email protected]                  8
  34378 1.00000 R CMD BATC paciorek     qw    05/09/2013 08:14:31                                   16
  34195 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34196 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34197 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34198 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34199 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34200 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34201 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34202 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34203 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34204 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34205 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34206 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34207 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34208 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34209 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34210 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8

A little while later, we see that jobs 34195-34198 have slipped ahead of
34378:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  33004 0.11790 tophat.sh  seqc         r     04/24/2013 07:14:20 [email protected]                 32
  33718 0.12398 fooSU_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33719 0.12398 fooSV_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33720 0.12398 fooWV_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33721 0.12398 fooWU_long lwtai        r     05/06/2013 17:01:58 [email protected]                  1
  33745 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:29:28 [email protected]                  1
  33758 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:30:28 [email protected]                  1
  33763 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:33:58 [email protected]                  1
  33787 0.08234 toy.sh     yjhuoh       r     05/08/2013 00:15:58 [email protected]                  1
  34188 0.00568 SubSampleF isoform      r     05/09/2013 05:42:17 [email protected]                  8
  34189 0.00568 SubSampleF isoform      r     05/09/2013 06:12:47 [email protected]                  8
  34190 0.00568 SubSampleF isoform      r     05/09/2013 06:14:17 [email protected]                  8
  34191 0.00568 SubSampleF isoform      r     05/09/2013 07:07:32 [email protected]                  8
  34192 0.00568 SubSampleF isoform      r     05/09/2013 07:24:02 [email protected]                  8
  34194 0.00568 SubSampleF isoform      r     05/09/2013 07:37:17 [email protected]                  8
  34195 0.00568 SubSampleF isoform      r     05/09/2013 08:16:47 [email protected]                  8
  34196 0.00568 SubSampleF isoform      r     05/09/2013 08:47:32 [email protected]                  8
  34197 0.00568 SubSampleF isoform      r     05/09/2013 09:11:02 [email protected]                  8
  34198 0.00568 SubSampleF isoform      r     05/09/2013 09:16:32 [email protected]                  8
  34378 1.00000 R CMD BATC paciorek     qw    05/09/2013 08:14:31                                   16
  34199 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34200 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34201 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34202 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34203 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34204 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34205 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34206 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34207 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34208 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34209 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34210 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                                    8
  34211 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34212 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34213 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34214 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34215 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34216 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34217 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8
  34218 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                                    8

The schedule file shows that there are RESERVING statements for #34378:
34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000
34378:1:RESERVING:1369228520:25920060:Q:[email protected]:slots:16.000000
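
In case it helps anyone reading these entries, here is a minimal Python sketch for picking them apart. It assumes the field layout that, as far as I can tell from sched_conf(5), the schedule file uses when MONITOR=1 is set (job ID, task ID, state, start time in epoch seconds, duration in seconds, level character, object name, resource name, amount). The names ScheduleEntry and parse_schedule_line are my own and not part of Grid Engine.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class ScheduleEntry(NamedTuple):
    """One record of the schedule file (field meanings are assumptions)."""
    job_id: int
    task_id: int
    state: str        # e.g. RESERVING, RUNNING, STARTING
    start: datetime   # start time, read as epoch seconds and shown in UTC
    duration_s: int   # duration, assumed to be in seconds
    level: str        # assumed: P = parallel env, Q = queue, H = host, G = global
    obj: str          # name of the PE, queue instance, or host
    resource: str     # e.g. slots
    amount: float     # amount of the resource used or reserved


def parse_schedule_line(line: str) -> Optional[ScheduleEntry]:
    """Parse one record; return None for separators or malformed lines."""
    fields = line.strip().split(":")
    if len(fields) != 9 or not fields[0].isdigit():
        return None  # e.g. the "::::::::" line between scheduling runs
    job, task, state, start, dur, level, obj, res, amount = fields
    return ScheduleEntry(
        int(job), int(task), state,
        datetime.fromtimestamp(int(start), tz=timezone.utc),
        int(dur), level, obj, res, float(amount),
    )


if __name__ == "__main__":
    sample = "34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000"
    entry = parse_schedule_line(sample)
    # Print the state, the start time in UTC, and the duration in hours.
    print(entry.state, entry.start.isoformat(), entry.duration_s / 3600.0)
```

If that reading of the fields is right, the reservation for #34378 begins on 2013-05-22 (UTC), well after the submission date, and its duration of 25920060 seconds is 7200 hours plus 60 seconds, which is close to our default_duration of 7200:00:00.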

Perhaps the issue is that the reservation seems specific to the cluster
node "scf-sm02.Berkeley.EDU", and that specific node is occupied by a
long-running job (#33004). If so, is there any way to have the reservation
not tied to a node?

-Chris

----------------------------------------------------------------------------------------------
Chris Paciorek