On 5 April 2011 11:33, Reuti wrote:
> Hi,
>
> please don't crosspost. I think we have gridengine.org as the place to
> discuss common setups.
>
> On 05.04.2011 at 11:56, William Hay wrote:
>
>> We're planning an outage on our cluster for the 12th of this month.
>> I've added reservations for each of the subclusters to ensure that
>> nothing is running at that time. The command I use is something like
>> qrsub -l mem=4G,job=true -a 04120800 -d 24:0:0 -pe '*-j' 256 where mem
>> is a consumable resource used to control memory usage and job is an
>> exclusive resource associated with each host and the pe varies
>> depending on which subcluster I'm reserving.
>
> Can't the mem/job be disregarded here? I mean: just request a reservation for
> all slots and you are done.
It's more of a habit from reserving nodes for actual users, where I
want to reserve the resources they will use.
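
For anyone reading along, here is a rough sketch of how the -a and -d
values in the qrsub command quoted above map onto a time window. It
assumes -a is given in the short MMDDhhmm form and -d as h:m:s, which
is what the command appears to use. The function name
reservation_window and the explicit year argument are hypothetical and
are not part of SGE.

```python
from datetime import datetime, timedelta


def reservation_window(start_arg, duration_arg, year):
    """Decode a short-form start (MMDDhhmm) and an h:m:s duration.

    Assumption: only the 8-digit start form is handled, and the year is
    supplied by the caller because the short form does not carry one.
    """
    start = datetime.strptime(f"{year}{start_arg}", "%Y%m%d%H%M")
    hours, minutes, seconds = (int(part) for part in duration_arg.split(":"))
    end = start + timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return start, end


# Values taken from the command above; the year 2011 is an assumption
# based on the date of this thread.
start, end = reservation_window("04120800", "24:0:0", 2011)
print(start.isoformat(), end.isoformat())
```

With those values the window runs from 08:00 on 12 April to 08:00 on
13 April.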
>
>
>> The reservations appear to be fine themselves but checking the
>> schedule file it appears that queued jobs now make reservations after
>> the outage even though they have plenty of time to run before it (I'm
>> making the reservations this early because we have a few people
>> submitting 7 day jobs).
>
> Are they also requesting 7 days, or is this the default duration
> setting in the scheduler configuration?
The users request the job length, with a JSV enforcing different
lengths depending on the user and the number of nodes used. The vast
majority of jobs (including the ones at the top of the queue) are
under 2 days. Three of our ten subclusters have jobs longer than 3
days running on them, which should leave 7 where reservations can be
made before the outage. In the schedule file, 1 job is making a
reservation on a subcluster with a 7 day job while 9 are running on a
subcluster with only shorter jobs.
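
To make the arithmetic I am expecting explicit, here is a minimal
sketch. It only shows the date comparison, not how the SGE scheduler
actually decides anything. The name fits_before_outage, the "now"
timestamp and the year are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Outage start from "-a 04120800"; the year is assumed from this thread.
OUTAGE_START = datetime(2011, 4, 12, 8, 0)


def fits_before_outage(earliest_start, runtime, outage_start=OUTAGE_START):
    """Return True if a job started at earliest_start ends by outage_start."""
    return earliest_start + runtime <= outage_start


# Illustrative "now": roughly when this message was sent.
now = datetime(2011, 4, 5, 11, 33)

print(fits_before_outage(now, timedelta(days=2)))  # a typical job
print(fits_before_outage(now, timedelta(days=7)))  # a 7 day job
```

A 2 day job started now would finish on 7 April, well before the
outage, while a 7 day job would run past 08:00 on the 12th. So I would
expect the short jobs to be able to reserve slots before the outage.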
>
>> If I restart the scheduler then the jobs start reserving slots prior
>> to the outage but the queues acquire a qtype of N according to qstat
>> -f and jobs don't actually start in them. I can change the qtype in
>> qstat -f to B by using qconf to change the qtype attribute of each
>> queue to batch (which it already is according to qconf -sq).
>
> Can you tell us more about your setup? You have different queues, i.e. some
> only being batch and some only for parallel jobs?
>
> --Reuti
>
>
>> I can change the qtype to BP in qstat -f by modifying pe_list on each
>> queue but it won't let me do this with a reservation in place (even
>> though I'm just repeating what is already there). If I delete the
>> reservation, modify the pe_list and recreate the reservation then I'm
>> back to my original problem.
>>
>> The upshot of this is that the cluster is now dominated by low
>> priority small jobs while the high priority parallel jobs are making
>> reservations after the outage.
>>
>> Also after a scheduler restart it takes a while for existing jobs to
>> start making reservations. For a few hours thereafter only jobs
>> submitted after the restart make reservations.
>>
>> Running SGE 6.2u3 at the moment. Is an upgrade likely to fix this?
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>
>