Re: [gridengine users] Caught SIGTERM on termination, but handling fails after some time

Reuti Mon, 02 Dec 2013 07:30:59 -0800

On 02.12.2013 at 01:45, David Dotson wrote:

> 
> On 12/01/2013 04:03 PM, David Dotson wrote:
>> 
>> On 12/01/2013 03:13 PM, Reuti wrote:
>>> Hi,
>>> 
>>> On 01.12.2013 at 22:20, David Dotson wrote:
>>> 
>>>> Greetings,
>>>> 
>>>> We have the terminate_method for our queue set to SIGTERM, so that when 
>>>> the following submission script runs, it should copy back all the files 
>>>> generated to the original directory. The signal is indeed caught, and the 
>>>> copy-back takes place, but it often dies without completing after a short 
>>>> amount of time.
>>>> 
>>>> # BEGIN SCRIPT
>>>> #============
>>>> 
>>>> # standard gridengine script with automatic copying back of data
>>>> #$ -S /bin/bash
>>>> #$ -N grid_job
>>>> #$ -pe singlenode 16
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -R y -r n
>>>> 
>>>> 
>>>> # set up scratch directory
>>>> WORK=/scratch/${USER}/WORK/${JOB_ID}
>>>> ORIG=$PWD
>>> The scratch directory supplied by SGE, i.e. $TMPDIR is not sufficient? It's 
>>> the one set up in the queue definition "tmpdir".
>> I was not aware of this option. It shouldn't make a difference, but could 
>> it? We may very well change our standard submission scripts to reference 
>> $TMPDIR instead if that's the case.
> I realized why we chose not to use $TMPDIR: this directory is automatically 
> deleted on job exit. We prefer being able to salvage data in the case of a 
> copy failure, power loss, etc.


Yep, this is correct. BTW: is all the data in there valuable, or does it also 
include unnecessary scratch files? If so, you could split them: the throwaway 
files go to $TMPDIR and the important ones to your persistent directory.
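
A minimal sketch of such a split, for illustration only. The names SCRATCH 
and KEEP are hypothetical and not part of the original script, and both 
directories are created with mktemp here so the sketch also runs outside of 
SGE; in a real job KEEP would point to the persistent location such as 
/scratch/${USER}/WORK/${JOB_ID}.

```shell
#!/bin/bash
# Sketch only: SCRATCH and KEEP are hypothetical names.
# Under SGE, $TMPDIR is created per job and removed on job exit, so anything
# placed below it disappears automatically.
SCRATCH=$(mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX")

# Stand-in for the persistent directory that survives a failed copy-back.
KEEP=$(mktemp -d "${TMPDIR:-/tmp}/keep.XXXXXX")

# Throwaway intermediates go to the auto-deleted directory ...
echo "intermediate" > "$SCRATCH/step1.tmp"

# ... and results worth salvaging go to the persistent one.
echo "result" > "$KEEP/output.dat"
```

Only the contents of KEEP would then need to be copied back at the end of the 
job, which may also shorten the copy-back.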

-- Reuti


>>> 
>>> 
>>>> function setup_workdir () {
>>>> 
>>>>     echo "-- [$(date)] setting up $WORK"
>>>> 
>>>>     mkdir -p $WORK
>>>>     test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>>>> 
>>>>     cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
>>>>     copy_success="True"
>>>> }
>>>> 
>>>> 
>>>> function cleanup_exit () {
>>>> 
>>>>     # ensure that we don't overwrite complete files with partial ones if
>>>>     # job killed mid-copy
>>>> 
>>>>     echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>>>> 
>>>>     cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
>>>> 
>>>>     cd $ORIG
>>>> 
>>>>     rm -r $WORK
>>>> 
>>>>     exit 0
>>>> }
>>>> 
>>>> 
>>>> # make sure that killing the job copies back everything; won't copy back
>>>> # if job killed while copying to workstation (a good thing!)
>>>> # (GE must be configured to use SIGTERM for killing jobs!)
>>>> trap cleanup_exit TERM
>>>> setup_workdir
>>>> 
>>>> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>>>> # MAIN COMPUTATION RUNS HERE
>>>> cleanup_exit
>>>> 
>>>> 
>>>> #============
>>>> # END SCRIPT
>>>> 
>>>> What is happening here? Is a second SIGTERM sent by gridengine after some 
>>>> time? If so, what is the best way to ensure this copy-back completes on 
>>>> qdel?
>>> Yes, this might happen. How long does the copy process take to complete? 
>>> It should be recorded in the message for the node though (do you see a 90 
>>> sec interval?).
>> The time it takes for the job to die during cleanup appears to vary. I have 
>> had instances where it takes minutes, and others in which it takes seconds. 
>> This has made it very frustrating to figure out, and it's why I'm reaching 
>> out for some help.
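
If a repeated SIGTERM is indeed what interrupts the copy-back (an assumption, 
not something confirmed in this thread), one possible guard is to ignore TERM 
as the first step of the handler. A signal ignored with trap '' is also 
ignored by commands the shell starts afterwards, so the cp itself would be 
covered. The sketch below uses temporary directories as stand-ins for $WORK 
and $ORIG and sends the signals to itself; unlike the original handler it does 
not call exit, so that the effect can be inspected afterwards.

```shell
#!/bin/bash
# Sketch only: temporary stand-ins for the job's $WORK and $ORIG.
WORK=$(mktemp -d)
ORIG=$(mktemp -d)
echo "data" > "$WORK/result.dat"

cleanup_exit () {
    # Ignore any further TERM from here on; cp and rm started below
    # inherit the ignored disposition.
    trap '' TERM
    cp "$WORK"/* "$ORIG" || { echo "EE ERROR: Did not copy $WORK"; exit 1; }
    rm -r "$WORK"
    echo "copy-back finished"
}

trap cleanup_exit TERM

# Simulate two SIGTERMs in a row: the first runs the handler,
# the second is ignored.
kill -TERM $$
kill -TERM $$
```

This cannot help against SIGKILL, which can be neither trapped nor ignored.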
>>> 
>>> 
>>>> As a note, I have tried sending SIGTERM as a notification instead, and 
>>>> setting the `notify` queue configuration key to 24:00:00
>>> And changed the signal in SGE's configuration ("NOTIFY_KILL=sigterm") and 
>>> submitted with "-notify"? This would be better than changing the 
>>> "terminate_method" to something which must be handled inside the script to 
>>> kill itself.
>> Correct. I added this key to "execd_params" in the SGE configuration, and 
>> submitted the job with the "-notify" flag. As I said, it seems to be working 
>> in some quick tests I did today, but I do recall this failing in the past 
>> when actually copying back large files (meaning the copy was killed 
>> mid-copy).
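
For the -notify route, a point that may matter: bash does not run a trap 
while it waits for a foreground command, only after that command completes. A 
common pattern is therefore to start the computation in the background and 
wait for it, because wait returns as soon as a trapped signal arrives. The 
sketch below is illustrative only: "sleep 30" stands in for the main 
computation, the notification is simulated by the script itself, and 
on_notify and got_signal are hypothetical names. USR2 is the default warning 
that -notify sends ahead of SIGKILL; TERM covers the NOTIFY_KILL change 
described above.

```shell
#!/bin/bash
# Sketch only: "sleep 30" stands in for the main computation.
got_signal=""
on_notify () { got_signal=yes; }
trap on_notify TERM USR2

sleep 30 &                      # main computation runs in the background
child=$!

# Simulate the scheduler's notification arriving one second from now.
( sleep 1; kill -TERM $$ ) &

wait "$child"                   # returns early when a trapped signal arrives
if [ -n "$got_signal" ]; then
    kill "$child" 2>/dev/null   # stop the computation
    wait "$child" 2>/dev/null   # reap it before touching its files
    echo "notified: starting copy-back"
fi
```

The copy-back would start where the echo is, and it has to finish within the 
queue's notify interval before the real kill signal follows.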
>>> 
>>> 
>>>> (basically, REALLY LONG). This seems to work in some of my tests, but it 
>>>> has failed in actual use when copying back large data files.
>>> What do you mean by failed - was it killed anyway?
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> David
>>>> -- 
>>>> David L. Dotson
>>>> Center for Biological Physics
>>>> Arizona State University
>>>> 
>>>> Email:
>>>> [email protected]
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>> 
> 
> -- 
> David L. Dotson
> Center for Biological Physics
> Arizona State University
> 
> Email: [email protected]
> 


