Hi, Joseph,

That seems like a really long runtime (over a year)... well, it would be in our environment anyway :)

I can attest from recent experience that limiting the maximum possible runtime and forcing users into a checkpoint-restart configuration has improved our scheduling immensely. It stinks to be confronted by a user whose 6-hour job won't start because 1-month-long jobs are dominating the resources and there's nothing you can do about it.

We have a special ACL-restricted queue called xlong for super long jobs (with a modest slot limit). To use this queue, a user must make a formal request, and the job must not support any form of checkpointing. Needless to say, we get very few of these requests (ones that can't be handled by checkpointing), and our max runtime limit for the default queue is 1 week.
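
For illustration, here is roughly what the relevant queue attributes could look like in a setup like the one described above. Treat it as a sketch under assumptions rather than a copy of our configuration: the default queue name all.q, the access list name xlong_users, and the slot count are placeholders, and leaving h_rt at INFINITY on xlong is only one possible choice.

```
# qconf -sq all.q (excerpt): default queue, hard runtime limit of 1 week
h_rt                  168:00:00

# qconf -sq xlong (excerpt): restricted queue for very long jobs
user_lists            xlong_users    # hypothetical access list; only its members may use the queue
slots                 16             # placeholder value; slots applies per queue instance
h_rt                  INFINITY       # assumption: no hard runtime limit here
```

Once a formal request is approved, the user would be added to the access list with something like `qconf -au <user> xlong_users`.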
-Brian

On Mon, Aug 13, 2012 at 4:28 PM, Joseph Farran <[email protected]> wrote:

> I checked the PE job and the job arrays and they had none.
>
> I originally set up all my queues with a large h_rt value thinking that
> jobs would inherit that:
>
> # qconf -sq bio | fgrep h_rt
> h_rt                  9999:00:00
>
> But I think I misread how that worked.
>
> What is the proper / recommended way to set up a default h_rt value so that
> jobs will inherit it if none is specified and so that job arrays will not
> keep PE jobs from running?
>
> Joseph
>
> On 08/13/2012 12:33 PM, Reuti wrote:
>
>> What walltime (i.e. h_rt) for the jobs did you request?
>>
>> -- Reuti
>>
>> On 13.08.2012 at 21:17, Joseph Farran <[email protected]> wrote:
>>
>>> Hi.
>>>
>>> We are having the classical job starvation with PE jobs.
>>>
>>> I followed the instructions listed at
>>> http://www.gridengine.info/2006/05/31/resource-reservation-prevents-parallel-job-starvation
>>>
>>> # qconf -ssconf | egrep reservation
>>> max_reservation                   64
>>>
>>> # qconf -sconf | grep reservations
>>> max_advance_reservations     64
>>>
>>> Here is a sample queue listing:
>>>
>>> job-ID  prior    name        user   state  submit/start at      slots  ja-task-ID
>>> ---------------------------------------------------------------------------------
>>> 2427    0.56811  blat_redi   userb  qw     08/13/2012 11:24:54  10
>>> 2415    0.50500  test_run6c  usera  qw     08/13/2012 10:26:16  1      138-250:1
>>> 2416    0.50500  test_run6e  usera  qw     08/13/2012 10:26:16  1      1-250:1
>>> 2417    0.50500  test_run6f  usera  qw     08/13/2012 10:26:17  1      1-250:1
>>> 2418    0.50500  test_run6g  usera  qw     08/13/2012 10:26:17  1      1-250:1
>>> 2419    0.50500  test_run6h  usera  qw     08/13/2012 10:26:17  1      1-250:1
>>> 2420    0.50500  test_run6i  usera  qw     08/13/2012 10:26:17  1      1-250:1
>>> 2421    0.50500  test_run6j  usera  qw     08/13/2012 10:26:17  1      1-250:1
>>> 2428    0.50500  test_run6.  usera  qw     08/13/2012 11:38:43  1      1-250:1
>>>
>>> Job #2427 has the reservation flag turned on:
>>>
>>> # qstat -F -j 2427 | grep reserve
>>> reserve:                    y
>>>
>>> However, the job arrays under job #2427 keep PE job #2427 from running.
>>>
>>> What other settings do I need to check to fix this?
>>>
>>> I am using binary GE2011.11
>>>
>>> Thanks,
>>> Joseph

--
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
