On 09.05.2013 at 18:51, Chris Paciorek wrote:

> We're having a problem similar to that described in this thread:
> http://www.mentby.com/Group/grid-engine/62u4-resource-reservation-not-working-for-some-jobs.html
>
> We're running Grid Engine 6.2u5 for a cluster of 4 Linux nodes (32 cores
> each) running Ubuntu 12.04 (Precise).
>
> We're seeing that jobs that request a reservation and are at the top of the
> queue are not starting, with lower-priority jobs that are requesting fewer
> cores slipping ahead of the higher priority job. An example of this is at the
> bottom of this posting.

Besides the defined "default_duration 7200:00:00": what h_rt/s_rt request was supplied to the short jobs?

-- Reuti


> Here's the results of "qconf -ssconf":
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            MONITOR=1
> reprioritize_interval             0:0:0
> halftime                          720
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         0
> weight_tickets_share              100000
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  SOF
> weight_ticket                     1.000000
> weight_waiting_time               0.278000
> weight_deadline                   3600000.000000
> weight_urgency                    0.000000
> weight_priority                   0.000000
> max_reservation                   10
> default_duration                  7200:00:00
>
> Here's the example:
>
> Job #34378 was submitted as:
> qsub -pe smp 16 -R y -b y "R CMD BATCH --no-save tmp.R tmp.out"
>
>
> Soon after submitting #34378, we see that the job #34378 is next in line:
> job-ID prior   name       user     state submit/start at     queue            slots ja-task-ID
> ----------------------------------------------------------------------------------------------
>  33004 0.11762 tophat.sh  seqc     r     04/24/2013 07:14:20 [email protected]     32
>  33718 0.12405 fooSU_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33719 0.12405 fooSV_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33720 0.12405 fooWV_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33721 0.12405 fooWU_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33745 0.06583 toy.sh     yjhuoh   r     05/07/2013 22:29:28 [email protected]      1
>  33758 0.06583 toy.sh     yjhuoh   r     05/07/2013 22:30:28 [email protected]      1
>  33763 0.06583 toy.sh     yjhuoh   r     05/07/2013 22:33:58 [email protected]      1
>  33787 0.06583 toy.sh     yjhuoh   r     05/08/2013 00:15:58 [email protected]      1
>  33794 0.06583 toy.sh     yjhuoh   r     05/08/2013 01:45:58 [email protected]      1
>  34183 0.00570 SubSampleF isoform  r     05/09/2013 03:29:32 [email protected]      8
>  34185 0.00570 SubSampleF isoform  r     05/09/2013 04:27:47 [email protected]      8
>  34186 0.00570 SubSampleF isoform  r     05/09/2013 04:36:47 [email protected]      8
>  34187 0.00570 SubSampleF isoform  r     05/09/2013 05:05:02 [email protected]      8
>  34188 0.00570 SubSampleF isoform  r     05/09/2013 05:42:17 [email protected]      8
>  34189 0.00570 SubSampleF isoform  r     05/09/2013 06:12:47 [email protected]      8
>  34190 0.00570 SubSampleF isoform  r     05/09/2013 06:14:17 [email protected]      8
>  34191 0.00570 SubSampleF isoform  r     05/09/2013 07:07:32 [email protected]      8
>  34192 0.00570 SubSampleF isoform  r     05/09/2013 07:24:02 [email protected]      8
>  34194 0.00570 SubSampleF isoform  r     05/09/2013 07:37:17 [email protected]      8
>  34378 1.00000 R CMD BATC paciorek qw    05/09/2013 08:14:31                     16
>  34195 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34196 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34197 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34198 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34199 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34200 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34201 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34202 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34203 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34204 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34205 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34206 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34207 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34208 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34209 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34210 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>
> A little while later, we see that jobs 34195-34198 have slipped ahead of
> 34378:
>
> job-ID prior   name       user     state submit/start at     queue            slots ja-task-ID
> ----------------------------------------------------------------------------------------------
>  33004 0.11790 tophat.sh  seqc     r     04/24/2013 07:14:20 [email protected]     32
>  33718 0.12398 fooSU_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33719 0.12398 fooSV_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33720 0.12398 fooWV_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33721 0.12398 fooWU_long lwtai    r     05/06/2013 17:01:58 [email protected]      1
>  33745 0.08234 toy.sh     yjhuoh   r     05/07/2013 22:29:28 [email protected]      1
>  33758 0.08234 toy.sh     yjhuoh   r     05/07/2013 22:30:28 [email protected]      1
>  33763 0.08234 toy.sh     yjhuoh   r     05/07/2013 22:33:58 [email protected]      1
>  33787 0.08234 toy.sh     yjhuoh   r     05/08/2013 00:15:58 [email protected]      1
>  34188 0.00568 SubSampleF isoform  r     05/09/2013 05:42:17 [email protected]      8
>  34189 0.00568 SubSampleF isoform  r     05/09/2013 06:12:47 [email protected]      8
>  34190 0.00568 SubSampleF isoform  r     05/09/2013 06:14:17 [email protected]      8
>  34191 0.00568 SubSampleF isoform  r     05/09/2013 07:07:32 [email protected]      8
>  34192 0.00568 SubSampleF isoform  r     05/09/2013 07:24:02 [email protected]      8
>  34194 0.00568 SubSampleF isoform  r     05/09/2013 07:37:17 [email protected]      8
>  34195 0.00568 SubSampleF isoform  r     05/09/2013 08:16:47 [email protected]      8
>  34196 0.00568 SubSampleF isoform  r     05/09/2013 08:47:32 [email protected]      8
>  34197 0.00568 SubSampleF isoform  r     05/09/2013 09:11:02 [email protected]      8
>  34198 0.00568 SubSampleF isoform  r     05/09/2013 09:16:32 [email protected]      8
>  34378 1.00000 R CMD BATC paciorek qw    05/09/2013 08:14:31                     16
>  34199 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34200 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34201 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34202 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34203 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34204 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34205 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34206 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34207 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34208 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34209 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34210 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:51                      8
>  34211 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34212 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34213 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34214 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34215 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34216 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34217 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>  34218 0.00000 SubSampleF isoform  qw    05/08/2013 19:30:52                      8
>
> The schedule file shows that there are RESERVING statements for #34378:
> 34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000
> 34378:1:RESERVING:1369228520:25920060:Q:[email protected]:slots:16.000000
>
> Perhaps the issue is that the reservation seems specific to the cluster node
> "scf-sm02.Berkeley.EDU", and that specific node is occupied by a long-running
> job (#33004). If so, is there any way to have the reservation not tied to a
> node?
>
> -Chris
>
> ----------------------------------------------------------------------------------------------
> Chris Paciorek
>
> Statistical Computing Consultant, Associate Research Statistician, Lecturer
>
> Office: 495 Evans Hall                 Email: [email protected]
> Mailing Address:                       Voice: 510-842-6670
> Department of Statistics               Fax: 510-642-7892
> 367 Evans Hall                         Skype: cjpaciorek
> University of California, Berkeley     WWW: www.stat.berkeley.edu/~paciorek
> Berkeley, CA 94720 USA                 Permanent forward: [email protected]
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
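
As a side note on the RESERVING lines quoted above: they can be decoded with a short script, which makes it easier to compare the reservation's duration with default_duration. This is only a minimal sketch. It assumes the colon-separated field order described for the schedule file in sched_conf(5) (job id, task id, state, start time, duration, level character, object name, resource name, utilization), and the names ScheduleRecord, parse_schedule_line and hms_to_seconds are made up for this illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple

# Assumed meaning of the level character, following sched_conf(5).
LEVELS = {"G": "global", "H": "host", "P": "parallel environment", "Q": "queue"}


class ScheduleRecord(NamedTuple):
    job_id: int
    task_id: int
    state: str
    start: datetime       # start of the planned resource usage, in UTC
    duration_s: int       # assumed duration in seconds
    level: str
    object_name: str
    resource: str
    utilization: float


def hms_to_seconds(text: str) -> int:
    """Convert an h:m:s string such as '7200:00:00' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_schedule_line(line: str) -> ScheduleRecord:
    """Split one schedule-file line into named fields (assumed layout)."""
    parts = line.strip().split(":")
    if len(parts) < 9:
        raise ValueError(f"expected at least 9 colon-separated fields: {line!r}")
    job_id, task_id, state, start, duration, level = parts[:6]
    resource, utilization = parts[-2:]
    # Everything between the level character and the resource name is
    # treated as the object name, in case it contains a colon itself.
    object_name = ":".join(parts[6:-2])
    return ScheduleRecord(
        job_id=int(job_id),
        task_id=int(task_id),
        state=state,
        start=datetime.fromtimestamp(int(start), tz=timezone.utc),
        duration_s=int(duration),
        level=LEVELS.get(level, level),
        object_name=object_name,
        resource=resource,
        utilization=float(utilization),
    )


if __name__ == "__main__":
    record = parse_schedule_line(
        "34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000"
    )
    print(record)
    extra = record.duration_s - hms_to_seconds("7200:00:00")
    print("seconds beyond default_duration:", extra)
```

For the PE-level line, the duration field (25920060 s) is 60 seconds longer than the configured default_duration of 7200:00:00 (25920000 s). That would be consistent with job #34378 having no h_rt request and falling back to default_duration, but this reading is an inference from the numbers and not something confirmed in the thread.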
