Re: [gridengine users] Caught SIGTERM on termination, but handling fails after some time

Adam Brenner Sun, 01 Dec 2013 13:46:11 -0800

David,

As far as I am aware from my experience on our HPC cluster, you do not
have this fine level of control via a qsub script. When a kill signal
is issued from GE, you cannot capture the signal and do some other
task from within the qsub script. It will just kill the process. This
is to prevent people from writing qsub scripts that capture any kill
signal and, rather than stopping, continue running.
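
To illustrate the distinction at the shell level (outside of GE): a
bash trap can intercept SIGTERM but never SIGKILL, which, as far as I
know, is what GE sends by default when no terminate_method is
configured. A minimal sketch, not tied to any GE setup:

```shell
#!/bin/bash
# Sketch: compare how a bash process reacts to SIGTERM and SIGKILL.
# Each child sends the signal to itself, so no timing is involved.

# SIGTERM: the trap handler runs and the child exits cleanly.
term_status=0
term_out=$(bash -c 'trap "echo caught TERM; exit 0" TERM
                    kill -TERM $$
                    echo "not reached"') || term_status=$?

# SIGKILL: cannot be trapped; the child dies on the spot.
kill_status=0
kill_out=$(bash -c 'kill -KILL $$
                    echo "not reached"' 2>/dev/null) || kill_status=$?

echo "TERM: output='${term_out}' status=${term_status}"
echo "KILL: output='${kill_out}' status=${kill_status}"
```

The second child reports status 137 (128 + 9), i.e. it was killed by
signal 9 without running any further commands.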


You are better off (others can correct me on this) putting this code
in GE's epilog script. You could then let your users decide whether
they want this feature (on/off) by setting an environment variable.
Your epilog script could check for this environment variable and
perform the necessary work to copy data from the local node to the
home directory, etc.
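
A rough sketch of what such an epilog could look like. COPY_BACK is a
made-up variable name (users would opt in with `qsub -v COPY_BACK=1`),
the scratch layout is the one from David's script below, and I am
assuming the epilog sees the job's environment (JOB_ID,
SGE_O_WORKDIR), so check this against your own configuration:

```shell
#!/bin/bash
# Hypothetical epilog sketch: copy scratch data back only on request.
# COPY_BACK is an invented opt-in variable, not a GE setting.

copy_back_if_requested () {
    local src=$1 dest=$2

    # Opt-in switch: do nothing unless the user set COPY_BACK=1.
    [ "${COPY_BACK:-0}" = "1" ] || return 0

    # Nothing to do if the scratch directory was never created.
    [ -d "$src" ] || return 0

    [ -d "$dest" ] || {
        echo "EE ERROR: $dest does not exist" >&2
        return 1
    }

    # Copy the directory contents; keep scratch if the copy fails.
    cp -r "$src"/. "$dest"/ || {
        echo "EE ERROR: copy from $src failed --- check manually!" >&2
        return 1
    }
    rm -r "$src"
}

# Only act when running for a real job (JOB_ID set); the path layout
# is an assumption taken from the quoted submission script.
if [ -n "${JOB_ID:-}" ]; then
    copy_back_if_requested "/scratch/${USER:-}/WORK/${JOB_ID}" \
                           "${SGE_O_WORKDIR:-}"
fi
```

The scratch directory is removed only after a successful copy, so a
failed copy-back leaves the data on the node for manual recovery.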

--
Adam Brenner
Computer Science, Undergraduate Student
Donald Bren School of Information and Computer Sciences

Research Computing Support
Office of Information Technology
http://www.oit.uci.edu/rcs/

University of California, Irvine
www.ics.uci.edu/~aebrenne/
[email protected]



On Sun, Dec 1, 2013 at 1:20 PM, David Dotson <[email protected]> wrote:
> Greetings,
>
> We have the terminate_method for our queue set to SIGTERM, so that when the
> following submission script runs, it should copy back all the files
> generated to the original directory. The signal is indeed caught, and the
> copy-back takes place, but it often dies without completing after a short
> amount of time.
>
> # BEGIN SCRIPT
> #============
>
> # standard gridengine script with automatic copying back of data
> #$ -S /bin/bash
> #$ -N grid_job
> #$ -pe singlenode 16
> #$ -cwd
> #$ -j y
> #$ -R y -r n
>
>
> # set up scratch directory
> WORK=/scratch/${USER}/WORK/${JOB_ID}
> ORIG=$PWD
>
>
> function setup_workdir () {
>     echo "-- [$(date)] setting up $WORK"
>     mkdir -p $WORK
>     test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>     cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log \
>         $WORK
>
>     copy_success="True"
> }
>
> function cleanup_exit () {
>     # ensure that we don't overwrite complete files with partial ones if
>     # job killed mid-copy
>     echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>     cp $WORK/* $ORIG || {
>         echo "EE ERROR: Did not copy $WORK --- check manually!"
>         exit 1
>     }
>
>     cd $ORIG
>     rm -r $WORK
>     exit 0
> }
>
> # make sure that killing the job copies back everything; won't copy back if
> # job killed while copying to workstation (a good thing!)
> # (GE must be configured to use SIGTERM for killing jobs!)
> trap cleanup_exit TERM
>
> setup_workdir
> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>
> # MAIN COMPUTATION RUNS HERE
>
> cleanup_exit
>
>
> #============
> # END SCRIPT
>
> What is happening here? Is a second SIGTERM sent by gridengine after some
> time? If so, what is the best way to ensure this copy-back completes on
> qdel?
>
> As a note, I have tried sending SIGTERM as a notification instead, and
> setting the `notify` queue configuration key to 24:00:00 (basically, REALLY
> LONG). This seems to work in some of my tests, but it has failed in actual
> use when copying back large data files.
>
> David
>
> --
> David L. Dotson
> Center for Biological Physics
> Arizona State University
>
> Email: [email protected]
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>