Re: [gridengine users] Caught SIGTERM on termination, but handling fails after some time

Reuti Sun, 01 Dec 2013 14:16:11 -0800

Hi,

On 01.12.2013 at 22:20, David Dotson wrote:


> Greetings,
> 
> We have the terminate_method for our queue set to SIGTERM, so that when the 
> following submission script runs, it should copy back all the files generated 
> to the original directory. The signal is indeed caught, and the copy-back 
> takes place, but it often dies without completing after a short amount of 
> time.
> 
> # BEGIN SCRIPT
> #============
> 
>  # standard gridengine script with automatic copying back of data
> #$ -S /bin/bash
> #$ -N grid_job
> #$ -pe singlenode 16
> #$ -cwd
> #$ -j y
> #$ -R y -r n
> 
>  
>  
> 
> # set up scratch directory
> WORK=/scratch/${USER}/WORK/${JOB_ID}
> ORIG=$PWD

Is the scratch directory supplied by SGE, i.e. $TMPDIR, not sufficient? It's 
the one set up by the "tmpdir" entry in the queue definition.
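
A minimal sketch of what I mean, assuming the job runs under SGE so that 
$TMPDIR is exported to it. The fallback path is only an assumption for running 
the script by hand, it is not something SGE provides. As SGE removes the 
per-job $TMPDIR when the job ends, the copy-back still has to happen inside 
the script.

```shell
# Sketch: prefer the per-job scratch directory that SGE exports as TMPDIR.
# The fallback below is a made-up path, used only when TMPDIR is not set.
WORK="${TMPDIR:-/tmp/${USER:-nobody}/WORK/${JOB_ID:-manual}}"
ORIG="$PWD"

# mkdir -p is harmless when SGE has already created the directory
mkdir -p "$WORK" || { echo "EE ERROR: failed to create $WORK"; exit 1; }
echo "-- [$(date)] using scratch directory $WORK"
```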


> function setup_workdir () {
>     echo "-- [$(date)] setting up $WORK"
>     mkdir -p $WORK
>     test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>     cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
>     copy_success="True"
> }
> 
>  
> 
> function cleanup_exit () {
>     # ensure that we don't overwrite complete files with partial ones if job killed mid-copy
>     echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>     cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
>     cd $ORIG
>     rm -r $WORK
>     exit 0
> }
> 
>  
> 
> # make sure that killing the job copies back everything; won't copy back if job
> # killed while copying to workstation (a good thing!)
> # (GE must be configured to use SIGTERM for killing jobs!)
> trap cleanup_exit TERM
>  
> setup_workdir
> 
> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>  
>  
> # MAIN COMPUTATION RUNS HERE
>  
> cleanup_exit
> 
> 
> #============
> # END SCRIPT
> 
> What is happening here? Is a second SIGTERM sent by gridengine after some 
> time? If so, what is the best way to ensure this copy-back completes on qdel?

Yes, this might happen. How long does the copy process take to complete? It 
should be recorded in the messages file of the node though (do you see a 90 sec 
interval?).
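
If it is indeed a second SIGTERM, the handler could ignore further ones while 
it copies. A rough sketch, where copy_back is a hypothetical helper name and 
not part of your script. It only helps against SIGTERM: a SIGKILL can neither 
be trapped nor ignored.

```shell
# Hypothetical helper: copy results from $1 (scratch) back to $2 (origin).
copy_back () {
    local src=$1 dst=$2
    # From here on further SIGTERMs are ignored, so a repeated signal
    # cannot interrupt the copy. Traps are global to the shell in bash.
    trap '' TERM
    echo "-- [$(date)] cleaning up: $src --> $dst"
    cp "$src"/* "$dst" || { echo "EE ERROR: Did not copy $src --- check manually!"; return 1; }
    rm -r "$src"
}

cleanup_exit () {
    copy_back "$WORK" "$ORIG"
    exit $?
}

trap cleanup_exit TERM
```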


> As a note, I have tried sending SIGTERM as a notification instead, and 
> setting the `notify` queue configuration key to 24:00:00 

And did you change the signal in SGE's configuration ("NOTIFY_KILL=sigterm") 
and submit with "-notify"? This would be better than changing the 
"terminate_method" to something which must be handled inside the script to kill 
itself.
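
A sketch of the job script side, under the assumption that NOTIFY_KILL=sigterm 
is set. Without it the warning before the SIGKILL arrives as SIGUSR2, hence 
both are trapped here. run_main and on_notify are hypothetical names. The main 
computation is started in the background because bash runs a trap only after 
the current foreground command has finished, while a "wait" is interrupted at 
once.

```shell
#$ -S /bin/bash
#$ -notify
# Sketch only, see the assumptions stated above.

main_pid=""
notified=""

on_notify () {
    notified="yes"
    echo "-- [$(date)] notification received"
    # Forward a TERM to the main computation if it was started.
    [ -n "$main_pid" ] && kill -TERM "$main_pid" 2>/dev/null
    return 0
}
trap on_notify TERM USR2

# Hypothetical wrapper: run the given command in the background and wait.
run_main () {
    "$@" &
    main_pid=$!
    wait "$main_pid"
    status=$?
    # "wait" returns early when a trapped signal arrives, so wait again
    # until the child is really gone and pick up its exit status.
    while kill -0 "$main_pid" 2>/dev/null; do
        wait "$main_pid"
        status=$?
    done
    return "$status"
}
```

In the job script this would be used as "run_main <your command>", followed 
by the copy-back, which then has the time of the "notify" interval to finish.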


> (basically, REALLY LONG). This seems to work in some of my tests, but it has 
> failed in actual use when copying back large data files.

What do you mean by failed - was it killed anyway?

-- Reuti


> David
> -- 
> David L. Dotson
> Center for Biological Physics
> Arizona State University
> 
> Email: 
> [email protected]
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


