Re: [gridengine users] Caught SIGTERM on termination, but handling fails after some time

David Dotson Sun, 01 Dec 2013 15:06:10 -0800

I was not aware of the epilog feature. This might be a route worth exploring to address this issue. Thanks!


David


On 12/01/2013 02:44 PM, Adam Brenner wrote:

David,

As far as I am aware from my experience on our HPC cluster, you do not
have this fine level of control via a qsub script. When a kill signal
is issued from GE, you cannot capture the signal and do some other
task from within qsub. It will just kill the process. This is to
prevent people from writing qsub scripts that capture every kill
signal and keep running instead of stopping.

You are better off (others can correct me on this) putting this
code in GE's epilog script. You could then let your users decide whether
they want this feature (on/off) by setting an environment variable.
Your epilog script could check for this environment variable and
perform the necessary work to copy data from the local node to homedir,
etc.
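
A minimal sketch of what such an epilog could look like. The variable names COPY_BACK, WORK and ORIG are hypothetical; the sketch assumes the user hands them to the job (for example with qsub -v) and that the epilog sees the job's environment, which should be checked on the local installation.

```shell
#!/bin/bash
# Hypothetical epilog sketch: copy scratch data back only if the user opted in.
# COPY_BACK, WORK and ORIG are assumed names, not Grid Engine built-ins.

copy_back () {
    # Feature is off unless the user set COPY_BACK=yes.
    [ "${COPY_BACK:-no}" = "yes" ] || return 0

    [ -d "$WORK" ] || { echo "EE ERROR: scratch dir $WORK not found" >&2; return 1; }

    echo "-- [$(date)] epilog copy-back: $WORK --> $ORIG"
    # "$WORK"/. copies the directory contents, including dotfiles.
    cp -r "$WORK"/. "$ORIG"/ || { echo "EE ERROR: Did not copy $WORK --- check manually!" >&2; return 1; }

    # Remove scratch only after a successful copy.
    rm -r "$WORK"
}

copy_back
```

With COPY_BACK unset the function returns immediately, so jobs that do not ask for the feature are unaffected.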

--
Adam Brenner
Computer Science, Undergraduate Student
Donald Bren School of Information and Computer Sciences

Research Computing Support
Office of Information Technology
http://www.oit.uci.edu/rcs/

University of California, Irvine
www.ics.uci.edu/~aebrenne/
[email protected]



On Sun, Dec 1, 2013 at 1:20 PM, David Dotson <[email protected]> wrote:

Greetings,

We have the terminate_method for our queue set to SIGTERM, so that when the
following submission script runs, it should copy back all the files
generated to the original directory. The signal is indeed caught, and the
copy-back takes place, but it often dies without completing after a short
amount of time.

# BEGIN SCRIPT
#============

# standard gridengine script with automatic copying back of data
#$ -S /bin/bash
#$ -N grid_job
#$ -pe singlenode 16
#$ -cwd
#$ -j y
#$ -R y -r n


# set up scratch directory
WORK=/scratch/${USER}/WORK/${JOB_ID}
ORIG=$PWD


function setup_workdir () {
     echo "-- [$(date)] setting up $WORK"
     mkdir -p $WORK
     test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
     cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK

     copy_success="True"
}

function cleanup_exit () {
     # ensure that we don't overwrite complete files with partial ones if job killed mid-copy
     echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
     cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }

     cd $ORIG
     rm -r $WORK
     exit 0
}

# make sure that killing the job copies back everything; won't copy back if job
# killed while copying to workstation (a good thing!)
# (GE must be configured to use SIGTERM for killing jobs!)
trap cleanup_exit TERM

setup_workdir
cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }

# MAIN COMPUTATION RUNS HERE

cleanup_exit


#============
# END SCRIPT

What is happening here? Is a second SIGTERM sent by gridengine after some
time? If so, what is the best way to ensure this copy-back completes on
qdel?
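
One hypothesis that can be tested directly (it is not confirmed anywhere in this thread) is that a later SIGTERM reaches the script or its cp child while the copy is still running. A sketch of cleanup_exit that ignores further SIGTERMs once cleanup has started is shown below; WORK and ORIG are the variables from the script above. An ignored signal stays ignored in child processes, so cp is protected as well. This cannot help if the job is eventually sent SIGKILL, which cannot be caught or ignored.

```shell
#!/bin/bash
# Sketch: the original cleanup_exit, plus "trap '' TERM" as its first step.
# WORK and ORIG are expected to be set as in the job script.

cleanup_exit () {
    # From here on SIGTERM is ignored by this shell and by children such as cp.
    trap '' TERM

    echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
    cp "$WORK"/* "$ORIG" || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }

    cd "$ORIG" && rm -r "$WORK"
    exit 0
}

# The first SIGTERM still triggers the copy-back.
trap cleanup_exit TERM
```

If the copy still dies with this change, the cause is more likely a SIGKILL or a limit being hit than a repeated SIGTERM.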

As a note, I have tried sending SIGTERM as a notification instead, and
setting the `notify` queue configuration key to 24:00:00 (basically, REALLY
LONG). This seems to work in some of my tests, but it has failed in actual
use when copying back large data files.
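
For the notification route, a job submitted with the -notify option receives SIGUSR2 ahead of SIGKILL, with the delay taken from the queue's notify setting. Bash only runs a trap handler after the current foreground command has finished, so a long computation in the foreground can use up the whole notify window before the handler starts. The sketch below assumes the default -notify signals and runs the payload in the background so that the handler fires promptly; on_notify, run_payload, payload_pid and notified are hypothetical names.

```shell
#!/bin/bash
#$ -notify
# Sketch only: on_notify, run_payload, payload_pid and notified are
# hypothetical names, not part of the original script.

on_notify () {
    echo "-- [$(date)] caught SIGUSR2, stopping payload"
    kill -TERM "$payload_pid" 2>/dev/null || true   # stop the computation first
    wait "$payload_pid" 2>/dev/null || true         # reap it before any copy-back
    notified=yes
}

run_payload () {
    notified=no
    payload_status=0
    trap on_notify USR2
    "$@" &                                    # background: the shell stays free to run traps
    payload_pid=$!
    wait "$payload_pid" || payload_status=$?  # wait returns early on a trapped signal
    trap - USR2
}
```

In the job script this would be used as run_payload followed by the main computation command, and then cleanup_exit, so the copy-back starts as soon as the warning signal arrives.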

David

--
David L. Dotson
Center for Biological Physics
Arizona State University

Email: [email protected]


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users


