Hi,

On 07.03.2011 at 15:15, Bharanidharan Narayanaswamy wrote:

> There is a single queue available to the users. Now a user has submitted a 
> job which is going to take a long time to compute. Another user has a much 
> simpler job in the queue, which will complete in a few minutes.
> 
> What would be the best / most effective method to run the second job in place 
> of the first job?
> 
> The trouble here is that there is no application level checkpointing.
> 
> I'm using drmaa to submit batch jobs.

There are different approaches possible. What they all have in common is that 
in SGE a started job keeps its requested resources until it ends: it won't 
release them unless it gets rescheduled or deleted.

- The long job could be started in a queue with a nice value of 19 (set via 
"priority" in the queue definition). The shorter job, running in a different 
queue with nice 0, will then get more CPU time for the short while it runs. As 
nice values are relative, multiple jobs with nice 19 in the long queue compete 
with each other the same way multiple jobs with nice 0 would.

- The long-running jobs could be suspended by setting "subordinate_list" in the 
short queue's definition. This way the long-running job will be stopped during 
the execution of the short job and continue afterwards. This can be extended to 
slotwise subordination, which stops only one of the long-running jobs on a node 
instead of all jobs in that queue. Note, though, that in 6.2u5 slotwise 
subordination won't restart the suspended jobs under certain conditions.
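
As a rough sketch of both variants, the relevant lines of the two queue 
definitions (as shown by `qconf -sq <queue>` and edited with `qconf -mq <queue>`) 
could look like the excerpt below. The queue names long.q and short.q are only 
placeholders of mine, and I assume both queues are defined on the same hosts. 
All other queue attributes are left out.

```
# long.q: jobs started here run with nice 19
qname             long.q
priority          19

# short.q: jobs started here run with nice 0
qname             short.q
priority          0
# Only for the suspend variant: suspend long.q on a host as soon as
# 1 slot of short.q is in use there (queue-wise subordination).
subordinate_list  long.q=1
```

For the nice-only variant, leave "subordinate_list" at NONE; for the suspend 
variant, the "priority" entries are optional. The slotwise form uses its own 
syntax for "subordinate_list", which is described in queue_conf(5).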

-- Reuti