Not sure if there is a better way, but the following seems to work.
In the checkpoint submit script, I catch the SIGUSR1 signal and then
suspend the job with qmod:
    function SIGUSR1_HANDLER()
    {
        qmod -sj "$JOB_ID"
    }
    trap SIGUSR1_HANDLER SIGUSR1
So when "s_rt" is reached and the job receives the SIGUSR1 signal, the
handler suspends the job via qmod.
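
One caveat with the trap approach: a shell runs a trap handler only after the current foreground command returns, so a long-running program started in the foreground would delay the handler until it exits. A minimal sketch of one way around that, assuming the workload can be started in the background and waited on, is below. The names QMOD, suspend_requested, run_with_trap, child_pid and child_status are hypothetical and only for illustration; qmod -sj and $JOB_ID are the only parts taken from the script above.

```shell
#!/bin/sh
# Sketch only: QMOD, suspend_requested, run_with_trap, child_pid and
# child_status are illustrative names, not part of Grid Engine.

# Command used to suspend the job. Overridable so the script can be
# dry-run outside Grid Engine.
QMOD="${QMOD:-qmod}"

suspend_requested=0

# Runs when the job is warned with SIGUSR1 (for example when s_rt is reached).
sigusr1_handler() {
    suspend_requested=1
    "$QMOD" -sj "$JOB_ID"
}
trap sigusr1_handler USR1

# Start the workload in the background and wait for it, so that the trap
# can run while the workload is still executing.
run_with_trap() {
    "$@" &
    child_pid=$!
    while :; do
        wait "$child_pid" && child_status=0 || child_status=$?
        # A status above 128 with the child still alive means wait was
        # interrupted by a trapped signal; go back to waiting.
        [ "$child_status" -gt 128 ] && kill -0 "$child_pid" 2>/dev/null || break
    done
    return "$child_status"
}
```

In a submit script this would be used as "run_with_trap ./my_program", where ./my_program stands in for the real workload. I have not tested this under every shell, so treat it as a starting point rather than a drop-in replacement.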
Joseph
On 10/31/2013 11:48 AM, Joseph Farran wrote:
Greetings.
We have a queue defined with a soft & hard wall-clock limit of:
qconf -sq free64 | egrep "_rt|notify"
notify 00:05:00
s_rt 48:00:00
h_rt 48:05:00
And jobs get killed correctly after 2 days of wall-clock run time. We now
have Grid Engine checkpoint setup and would like to make it so that jobs do
not get killed, but rather be sent the suspend signal so that checkpoint
takes over instead of being killed.
After reading and doing some tests with the queue "suspend_method", I am not
sure I am on the right track.
So what is the proper / correct way to do this? To *not* have jobs killed but
to have the checkpoint process take over when s_rt is reached?
Joseph
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users