Hi,

On 07.03.2011 at 15:15, Bharanidharan Narayanaswamy wrote:
> There is a single queue available to the users. Now a user has submitted a
> job which is going to take a long time to compute. Another users who has a
> job in queue is much simpler and will complete in few minutes.
>
> what would be the best / effective method to send the second job in place of
> the first job.
>
> The trouble here is that there is no application level checkpointing.
>
> I'm using drmaa to submit batch jobs.

There are different approaches possible. What they all have in common is that in SGE a started job keeps the resources it requested until it ends; it won't release them in any case unless it gets rescheduled or deleted.

- The long job could be started in a queue with a nice value of 19 (set via "priority" in the queue definition). The shorter job, running in a different queue with nice 0, will then get more of the CPU for the short time it runs. As nice values are relative, multiple jobs with nice 19 in the long queue share the CPU among themselves the same way multiple jobs with nice 0 would.

- The long-running jobs could be suspended by setting "subordinate_list" in the short queue's definition. This way the long-running job is stopped while the short job executes and continues afterwards. This can be extended to slotwise subordination, which stops only one of the long-running jobs on a node instead of all jobs in that queue. Note, though, that with slotwise subordination in 6.2u5 the suspended jobs won't be restarted under certain conditions.

-- Reuti
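
As a rough sketch of the two approaches above, the relevant entries of the two queue definitions might look like the following. The queue names long.q and short.q, the thresholds, and the two-slot host are assumptions made up for illustration, and the "#" lines are annotations for the reader rather than something to paste into the editor. The definitions would be changed with "qconf -mq <queue>"; please check queue_conf(5) of your installed version for the exact syntax before relying on this.

```
# long.q: approach 1, run the long jobs at the lowest scheduling priority
qname             long.q
priority          19

# short.q: short jobs keep the default nice value
qname             short.q
priority          0

# short.q: approach 2, queue-wise subordination (alternative to approach 1).
# Suspend long.q on a host as soon as 1 slot of short.q is in use there.
subordinate_list  long.q=1

# short.q: approach 2, slot-wise variant (use instead of the line above).
# Assumes 2 slots per host; "sr" suspends the shortest-running job first.
# subordinate_list  slots=2(long.q:0:sr)
```

With DRMAA, the target queue can usually be selected through the job template's native specification, e.g. "-q short.q" for the short jobs and "-q long.q" for the long ones.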
