David,

As far as I am aware from my experience on our HPC cluster, you do not have this fine a level of control via a qsub script. When a kill signal is issued from GE, you cannot capture the signal and do some other task from within the qsub script. It will just kill the process. This is to prevent people from writing qsub scripts that capture any kill signal and, rather than stopping, continue running.
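To illustrate the signal behavior in question, here is a minimal sketch, not taken from the thread. It assumes the job is ended with SIGKILL, which as far as I know is what Grid Engine sends when no terminate_method is configured. A shell can install a handler for SIGTERM, but SIGKILL can never be trapped, so no cleanup code runs. The function name signal_self is hypothetical.

```bash
#!/bin/bash
# Sketch: a child shell installs a TERM handler, then signals itself.
# signal_self is a hypothetical helper; $1 is the signal name (TERM or KILL).
signal_self() {
    bash -c '
        # Handler runs only for SIGTERM; SIGKILL cannot be caught at all.
        trap "echo caught; exit 0" TERM
        kill -s "$1" $$
        sleep 1
        echo "not reached"
    ' _ "$1"
}

# Run the demonstration only when a signal name is given on the command line.
if [ "$#" -gt 0 ]; then
    signal_self "$1"
    echo "child exit status: $?"
fi
```

Called with TERM, the child prints "caught" and exits cleanly. Called with KILL, the child dies immediately and the handler never runs.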
You are better off (others can correct me on this) putting this code in GE's epilog script. You could then let your users decide whether they want this feature (on/off) by setting an environment variable. Your epilog script could check for this environment variable and perform the necessary work to copy data from the local node to the home directory, etc.

--
Adam Brenner
Computer Science, Undergraduate Student
Donald Bren School of Information and Computer Sciences
Research Computing Support
Office of Information Technology
http://www.oit.uci.edu/rcs/
University of California, Irvine
www.ics.uci.edu/~aebrenne/
[email protected]

On Sun, Dec 1, 2013 at 1:20 PM, David Dotson <[email protected]> wrote:
> Greetings,
>
> We have the terminate_method for our queue set to SIGTERM, so that when the
> following submission script runs, it should copy back all the files
> generated to the original directory. The signal is indeed caught, and the
> copy-back takes place, but it often dies without completing after a short
> amount of time.
>
> # BEGIN SCRIPT
> #============
>
> # standard gridengine script with automatic copying back of data
> #$ -S /bin/bash
> #$ -N grid_job
> #$ -pe singlenode 16
> #$ -cwd
> #$ -j y
> #$ -R y -r n
>
>
> # set up scratch directory
> WORK=/scratch/${USER}/WORK/${JOB_ID}
> ORIG=$PWD
>
>
> function setup_workdir () {
>     echo "-- [$(date)] setting up $WORK"
>     mkdir -p $WORK
>     test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>     cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
>
>     copy_success="True"
> }
>
> function cleanup_exit () {
>     # ensure that we don't overwrite complete files with partial ones if job killed mid-copy
>     echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>     cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
>
>     cd $ORIG
>     rm -r $WORK
>     exit 0
> }
>
> # make sure that killing the job copies back everything; won't copy back if job
> # killed while copying to workstation (a good thing!)
> # (GE must be configured to use SIGTERM for killing jobs!)
> trap cleanup_exit TERM
>
> setup_workdir
> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>
> # MAIN COMPUTATION RUNS HERE
>
> cleanup_exit
>
>
> #============
> # END SCRIPT
>
> What is happening here? Is a second SIGTERM sent by gridengine after some
> time? If so, what is the best way to ensure this copy-back completes on
> qdel?
>
> As a note, I have tried sending SIGTERM as a notification instead, and
> setting the `notify` queue configuration key to 24:00:00 (basically, REALLY
> LONG). This seems to work in some of my tests, but it has failed in actual
> use when copying back large data files.
>
> David
>
> --
> David L. Dotson
> Center for Biological Physics
> Arizona State University
>
> Email: [email protected]
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
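
The epilog approach suggested in the reply above could look roughly like the following. This is a minimal sketch under several assumptions, not a tested site configuration: the opt-in variable name COPY_BACK and the function name copy_back are hypothetical, the scratch layout is borrowed from David's script, and it assumes the epilog runs as the job owner with the job's environment (so that USER, JOB_ID and SGE_O_WORKDIR are visible), which should be checked against your own installation.

```bash
#!/bin/bash
# Hypothetical epilog sketch: copy scratch data back only if the user opted in.

copy_back() {
    local work="$1" orig="$2"

    # Opt-in switch (hypothetical name): do nothing unless COPY_BACK=1.
    [ "${COPY_BACK:-0}" = "1" ] || return 0

    # Nothing to copy if the job never created its scratch directory.
    [ -d "$work" ] || return 0

    # Copy everything back; keep the scratch directory if the copy fails
    # so the data can still be recovered by hand.
    cp -r "$work"/. "$orig"/ || {
        echo "EE ERROR: Did not copy $work --- check manually!" >&2
        return 1
    }
    rm -r "$work"
}

# Act only when the scheduler has supplied the job context.
if [ -n "${JOB_ID:-}" ] && [ -n "${SGE_O_WORKDIR:-}" ]; then
    # Scratch path layout assumed to match the submission script in this thread.
    copy_back "/scratch/${USER}/WORK/${JOB_ID}" "$SGE_O_WORKDIR"
fi
```

With something like this installed as the queue's epilog, a user would switch the feature on per job by exporting the variable at submission time, for example `qsub -v COPY_BACK=1 job.sh` (job.sh being a placeholder), and jobs submitted without it would be left alone.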
