Not sure if there is a better way, but the following seems to be working.

In the submit script used with checkpointing, I catch the SIGUSR1 signal
and then use qmod to suspend the job:

function SIGUSR1_HANDLER()
{
    # Suspend this job so that checkpointing can take over.
    qmod -sj "$JOB_ID"
}
# Run the handler when SIGUSR1 arrives.
trap SIGUSR1_HANDLER SIGUSR1

So when "s_rt" is reached and the job receives the SIGUSR1 signal, the
handler suspends the job via qmod.
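
For anyone trying this, here is a slightly fuller sketch of the same idea. It is only an illustration under some assumptions: the "0" fallback for JOB_ID and the run_workload helper are made up for this example and are not part of Grid Engine. The reason for the helper is that the shell runs a trap only after the current foreground command returns, so the workload is started in the background and waited on.

```shell
#!/bin/bash
# Sketch only. JOB_ID is set by Grid Engine inside a running job; the
# fallback "0" is a placeholder so the script can be tried outside one.
JOB_ID="${JOB_ID:-0}"

# Suspend this job when the soft run-time limit (s_rt) delivers SIGUSR1.
sigusr1_handler() {
    qmod -sj "$JOB_ID"
}
trap sigusr1_handler USR1

# Hypothetical helper: run the workload in the background and wait on it.
# A trapped signal interrupts the wait builtin, so the handler can run
# while the workload is still going.
run_workload() {
    "$@" &
    workload_pid=$!
    # wait returns early when a trapped signal arrives, so keep waiting
    # until the workload has really exited.
    while kill -0 "$workload_pid" 2>/dev/null; do
        wait "$workload_pid"
    done
}
```

The job script would then end with something like "run_workload ./my_app", where ./my_app stands in for the real program. Whether the workload itself also receives SIGUSR1 depends on how the signal is delivered on your system, so treat this as a starting point rather than a tested recipe.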

Joseph


On 10/31/2013 11:48 AM, Joseph Farran wrote:
Greetings.

We have a queue defined with a soft & hard wall-clock limit of:

qconf -sq free64 | egrep "_rt|notify"
notify                00:05:00
s_rt                  48:00:00
h_rt                  48:05:00

And jobs get killed correctly after 2 days of wall-clock run time. We now have
Grid Engine checkpointing set up and would like to make it so that jobs do not
get killed, but are instead sent the suspend signal so that the checkpoint
takes over.

After reading and doing some tests with the queue "suspend_method", I am not
sure I am on the right track.

So what is the proper way to do this: to *not* have jobs killed, but to have
the checkpoint process take over when s_rt is reached?

Joseph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

