Re: [gridengine users] Caught SIGTERM on termination, but handling fails after some time

Reuti Mon, 02 Dec 2013 07:12:23 -0800

On 02.12.2013 at 00:03, David Dotson wrote:

>>> <snip>
>>> # END SCRIPT
>>> 
>>> What is happening here? Is a second SIGTERM sent by gridengine after some 
>>> time? If so, what is the best way to ensure this copy-back completes on 
>>> qdel?
>> Yes, this might happen - how long does the copy process take to complete? It 
>> should be recorded in the messages file for the node, though (do you see a 
>> 90 sec interval?).
> The time it takes for the job to die during cleanup appears to vary. I have 
> had instances where it takes minutes, and others in which it takes seconds. 
> This has made it very frustrating to figure out, and it's why I'm reaching 
> out for some help.


Is there any additional signal mentioned in the messages file of the node's 
spool directory?
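
A small helper for that check could look like the sketch below. The function name, the path and the job ID are placeholders, and the location of the execd spool directory is site-specific (`qconf -sconf` lists it as execd_spool_dir):

```shell
# Hypothetical helper: list the lines of a messages file that mention a job ID.
# Usage: job_lines MESSAGES_FILE JOB_ID
job_lines() {
    grep -w -- "$2" "$1"
}

# Example call (path and job ID are placeholders, adjust to your site):
#   job_lines /path/to/execd_spool_dir/node01/messages 12345
```

Any signal delivered to the job besides the first one should then show up next to its timestamp.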

There are two cases:

a) the main application completed and the copy back procedure is done during 
the normal job execution - no signal involved here

b) a `qdel` is issued (or "h_rt"/"h_cpu" is used up), which should kill the 
main application and copy the files back during some extra time granted to 
perform this step

Does a) always run successfully?

In case of b), the main application also gets the signal (it's sent to the 
entire process group of the job). It may be necessary to restore the default 
behavior, so that the signal terminates the main application, by running it 
in a subshell (a side effect is that the main script stays in the cwd):

***
trap cleanup_exit TERM
setup_workdir

(cd "$WORK" && exec /foo/bar/baz)

cleanup_exit
***
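
To try this variant outside of SGE, here is a self-contained sketch. The bodies of setup_workdir and cleanup_exit, the temporary directories and the CLEANED guard are my assumptions (your real functions were not part of this mail), and a `sh -c` one-liner stands in for /foo/bar/baz:

```shell
#!/bin/bash
# Sketch of the TERM-trap variant. All paths and function bodies are
# hypothetical stand-ins; only the overall shape follows the snippet above.

SUBMIT_DIR=$(mktemp -d)   # stands in for the directory results go back to
CLEANED=0                 # guard so the copy-back runs only once

setup_workdir() {
    WORK=$(mktemp -d)     # stands in for node-local scratch space
}

cleanup_exit() {
    [ "$CLEANED" -eq 1 ] && return 0
    CLEANED=1
    cp -R "$WORK"/. "$SUBMIT_DIR"/ && rm -rf "$WORK"
}

# Inside the subshell below the handler is reset to the default action, so
# SIGTERM terminates the application while this script runs the handler.
trap cleanup_exit TERM
setup_workdir

# "sh -c ..." stands in for /foo/bar/baz.
(cd "$WORK" && exec sh -c 'echo partial > result.dat')

cleanup_exit
echo "copied back: $(ls "$SUBMIT_DIR")"
```

The guard is there because this cleanup_exit does not call `exit`: after a SIGTERM the handler runs first and the last line of the script calls it a second time.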

There is also the special trap condition "EXIT": SIGTERM should kill the 
application, but is ignored in the shell. The subshell then returns, and the 
copy-back is performed when the main script exits, independent of the cause:

***
trap "" TERM
trap cleanup_exit EXIT
setup_workdir

# restore the default for TERM: an ignored signal is inherited across exec
(cd "$WORK" && trap - TERM && exec /foo/bar/baz)
***
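
A self-contained sketch of the EXIT variant follows. Again, the function bodies and temporary directories are assumptions, and a `sh -c` one-liner that signals itself stands in for /foo/bar/baz being hit by the SIGTERM of a `qdel`. The sketch resets SIGTERM with `trap - TERM` inside the subshell, since a signal ignored via `trap ""` stays ignored in subshells and across exec:

```shell
#!/bin/bash
# Sketch of the EXIT-trap variant. Paths and function bodies are
# hypothetical stand-ins for the ones in the real job script.

SUBMIT_DIR=$(mktemp -d)          # stands in for the results directory

cleanup_exit() {
    cp -R "$WORK"/. "$SUBMIT_DIR"/ && rm -rf "$WORK"
}

# The body is a subshell "( ... )" only so that EXIT fires inside this demo;
# in a real job script these lines would sit at top level.
job() (
    trap "" TERM                 # the job shell itself ignores SIGTERM
    trap cleanup_exit EXIT       # copy back however the script ends
    WORK=$(mktemp -d)            # stands in for setup_workdir

    # "trap - TERM" restores the default action before exec, because an
    # ignored signal would otherwise be inherited by the application.
    # The sh one-liner stands in for /foo/bar/baz and signals itself to
    # imitate the SIGTERM that arrives on qdel.
    (cd "$WORK" && trap - TERM &&
        exec sh -c 'echo partial > result.dat; kill -TERM $$; echo late > result.dat')
    echo "application ended with status $?"
)

job
ls "$SUBMIT_DIR"
```

If the application dies from the signal as intended, result.dat still holds the first write when it arrives in the results directory.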

Maybe one of these gives you better results.

-- Reuti



>>> trap cleanup_exit TERM
>>> setup_workdir
>>> 
>>> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>>>    # MAIN COMPUTATION RUNS HERE
>>> cleanup_exit




>>> As a note, I have tried sending SIGTERM as a notification instead, and 
>>> setting the `notify` queue configuration key to 24:00:00
>> And changed the signal in SGE's configuration ("NOTIFY_KILL=sigterm") and 
>> submitted with "-notify"? This would be better than changing the 
>> "terminate_method" to something which must be handled inside the script to 
>> kill itself.
> Correct. I added this key to "execd_params" in the SGE configuration, and 
> submitted the job with the "-notify" flag. As I said, it seems to be working 
> in some quick tests I did today, but I do recall this failing in the past 
> when actually copying back large files (meaning the copy was killed mid-copy).
>> 
>> 
>>> (basically, REALLY LONG). This seems to work in some of my tests, but it 
>>> has failed in actual use when copying back large data files.
>> What do you mean by failed - was it killed anyway?
>> 
>> -- Reuti
>> 
>> 
>>> David
>>> -- 
>>> David L. Dotson
>>> Center for Biological Physics
>>> Arizona State University
>>> 
>>> Email:
>>> [email protected]
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
> 
> -- 
> David L. Dotson
> Center for Biological Physics
> Arizona State University
> 
> Email: [email protected]
> 


