Hi,

On 17.01.2012 at 20:41, Jeff Dusenberry wrote:
> I have a subordinate queue set up with notification time of 5 minutes,
> and preempted jobs are terminated (using SIGTERM) after that period.
> For jobs running in that queue, I've been able to confirm that there
> is a 5 minute delay between when the notification is sent and when the
> job is terminated. The idea is to give the job a chance to save state
> and shut itself down cleanly before being terminated.
>
> The issue that I've been running into is that the job that triggers
> the preemption begins running when the notification signal is sent.
> We then end up with both jobs running simultaneously during the
> notification period. Is there any way to delay that second job so it
> will not start until the preempted job has either exited on its own or
> been killed? Any suggestions for how I might configure this
> differently would be appreciated.

Well, SGE can't look ahead. So I assume you already allow an oversubscription in memory and/or slots. And you defined a suspend_method to checkpoint and kill the suspended job?

It depends on your setup, but when you have an oversubscription in slots for a short time, you could define a "starter_method" which checks `qhost -F slots -h foobar` twice a minute or so and waits if it's still above the defined cores on this machine.

BTW: you could also submit the jobs that are to be preempted with a checkpointing interface "application_level" and do the checkpointing and the killing of the process group in the script defined as "migr_command". Then the preempted job stays at the top of the waiting list instead of being removed completely and submitted again.

-- Reuti
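
A minimal sketch of the "starter_method" idea, not a tested configuration. The names `CORES`, `INTERVAL`, `parse_slots` and `must_wait` are made up for illustration. The script assumes that `qhost -F slots` prints a line containing `slots=<number>` for the host and that `uname -n` returns the host name as SGE knows it. The comparison follows the description above (wait while the value is above the core count); depending on whether your installation reports used or remaining slots, it may need to be inverted.

```shell
#!/bin/sh
# Hypothetical starter_method sketch: hold the new job back until the
# slot check passes, then hand over to the real job script.

CORES=${CORES:-8}          # assumption: number of cores defined for this machine
INTERVAL=${INTERVAL:-30}   # seconds between checks, i.e. "twice a minute"

# Read qhost -F output on stdin and print the integer part of the
# first "slots=<number>" entry found.
parse_slots() {
    sed -n 's/.*slots=\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Exit status 0 while the reported value is still above CORES.
# If qhost prints nothing usable, the check fails and the job starts.
must_wait() {
    value=$(qhost -F slots -h "$(uname -n)" | parse_slots)
    [ -n "$value" ] && [ "$value" -gt "$CORES" ]
}

# SGE calls the starter_method with the job script and its arguments;
# without arguments (e.g. when sourced for testing) nothing is run.
if [ "$#" -gt 0 ]; then
    while must_wait; do
        sleep "$INTERVAL"
    done
    exec "$@"
fi
```

The script would be entered as the "starter_method" of the queue that runs the preempting jobs. There is no timeout in the loop, so a job waits indefinitely if the condition never clears.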
