Hi, Joseph,

That seems like a really long runtime (over a year)... well, it would be in
our environment anyway :)  I can attest from recent experience that
limiting the maximum runtime and forcing users into a checkpoint-restart
configuration has improved our scheduling immensely.  It stinks to be
confronted by a user whose 6-hour job won't start because 1-month-long jobs
are dominating the resources and there's nothing you can do about it.
We have a special ACL-restricted queue called xlong for super long jobs
(with a modest slot limit).  To use this queue, a user must make a formal
request and the job must not support any form of checkpointing.  Needless
to say, we get very few of these requests (most are handled by
checkpointing instead), and our max runtime limit for the default queue is
1 week.
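
For a sense of scale, the arithmetic behind those two numbers can be sketched
in shell.  This is only an illustration: hrt_to_seconds and hrt_to_days are
hypothetical helpers, not Grid Engine commands, and they assume h_rt is
written as hours:minutes:seconds, as in the qconf output quoted below.

```shell
#!/bin/sh
# hrt_to_seconds is a hypothetical helper (not part of Grid Engine).
# It converts an h_rt value written as hours:minutes:seconds to seconds.
hrt_to_seconds() {
    # awk reads each field as a decimal number, so "00" or "08" are safe.
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Whole days covered by an h_rt value (integer division by 86400 seconds).
hrt_to_days() {
    echo $(( $(hrt_to_seconds "$1") / 86400 ))
}

echo "9999:00:00 = $(hrt_to_days 9999:00:00) days"   # queue value quoted in this thread
echo "168:00:00  = $(hrt_to_days 168:00:00) days"    # a 1-week limit
```

By this count 9999 hours comes to 416 whole days, against 7 for a one-week
cap.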

-Brian

On Mon, Aug 13, 2012 at 4:28 PM, Joseph Farran <[email protected]> wrote:

> I checked the PE-job and the job-arrays and they had none.
>
> I originally set up all my queues with a large h_rt value, thinking that
> jobs would inherit that:
>
> # qconf -sq bio | fgrep h_rt
> h_rt                  9999:00:00
>
> But I think I misread how that works.
>
>
> What is the proper/recommended way to set up a default h_rt value so that
> jobs will inherit it if none is specified, and so that job arrays will not
> keep PE jobs from running?
>
> Joseph
>
>
>
> On 08/13/2012 12:33 PM, Reuti wrote:
>
>> What walltime (i.e. h_rt) for the jobs did you request?
>>
>> -- Reuti
>>
>> On 13.08.2012 at 21:17, Joseph Farran <[email protected]> wrote:
>>
>>  Hi.
>>>
>>> We are seeing the classic job starvation problem with PE jobs.
>>>
>>> I followed the instructions listed at
>>> http://www.gridengine.info/2006/05/31/resource-reservation-prevents-parallel-job-starvation
>>>
>>> # qconf -ssconf | egrep reservation
>>> max_reservation                   64
>>>
>>> # qconf -sconf | grep reservations
>>> max_advance_reservations     64
>>>
>>> Here is a sample Queue listing:
>>>
>>> job-ID  prior   name       user      state submit/start at     slots ja-task-ID
>>> -------------------------------------------------------------------------------
>>>    2427 0.56811 blat_redi  userb     qw    08/13/2012 11:24:54    10
>>>    2415 0.50500 test_run6c usera     qw    08/13/2012 10:26:16     1 138-250:1
>>>    2416 0.50500 test_run6e usera     qw    08/13/2012 10:26:16     1 1-250:1
>>>    2417 0.50500 test_run6f usera     qw    08/13/2012 10:26:17     1 1-250:1
>>>    2418 0.50500 test_run6g usera     qw    08/13/2012 10:26:17     1 1-250:1
>>>    2419 0.50500 test_run6h usera     qw    08/13/2012 10:26:17     1 1-250:1
>>>    2420 0.50500 test_run6i usera     qw    08/13/2012 10:26:17     1 1-250:1
>>>    2421 0.50500 test_run6j usera     qw    08/13/2012 10:26:17     1 1-250:1
>>>    2428 0.50500 test_run6. usera     qw    08/13/2012 11:38:43     1 1-250:1
>>>
>>>
>>> Job #2427 has the Reservation flag turned on:
>>>
>>> # qstat -F -j 2427 | grep reserve
>>> reserve:                    y
>>>
>>> However, the job arrays under job #2427 keep PE job #2427 from running.
>>>
>>> What other settings do I need to check for to fix this?
>>>
>>> I am using binary GE2011.11
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>



-- 
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu
