Hi,
On 01.12.2013 at 22:20, David Dotson wrote:
> Greetings,
>
> We have the terminate_method for our queue set to SIGTERM, so that when the
> following submission script runs, it should copy back all the files generated
> to the original directory. The signal is indeed caught, and the copy-back
> takes place, but it often dies without completing after a short amount of
> time.
>
> # BEGIN SCRIPT
> #============
>
> # standard gridengine script with automatic copying back of data
> #$ -S /bin/bash
> #$ -N grid_job
> #$ -pe singlenode 16
> #$ -cwd
> #$ -j y
> #$ -R y -r n
>
>
>
>
> # set up scratch directory
> WORK=/scratch/${USER}/WORK/${JOB_ID}
> ORIG=$PWD
Is the scratch directory supplied by SGE, i.e. $TMPDIR, not sufficient? It's the
one set up by the "tmpdir" entry in the queue definition.
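A minimal sketch of what that could look like, assuming the job only needs a
per-job scratch directory. SGE creates $TMPDIR before the job starts and removes
it when the job ends, so the mkdir and "rm -r" steps would drop out. The mktemp
fallback is only an assumption of this sketch so that it also runs outside of
SGE:

```shell
# Use the per-job scratch directory that SGE exports as $TMPDIR.
# Outside of SGE (TMPDIR unset) fall back to a fresh mktemp directory;
# this fallback is an assumption for testing, not something SGE does.
WORK=${TMPDIR:-$(mktemp -d)}
ORIG=$PWD

# Same sanity check as in the original script, with quoted variables.
test -d "$WORK" || { echo "EE ERROR: scratch directory $WORK missing"; exit 1; }
echo "-- [$(date)] using scratch directory $WORK"
```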
> function setup_workdir () {
>
>
> echo "-- [$(date)] setting up $WORK"
>
>
> mkdir -p $WORK
>
>
> test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>
>
> cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
> copy_success="True"
> }
>
>
>
> function cleanup_exit () {
>
>
> # ensure that we don't overwrite complete files with partial ones if job
> # killed mid-copy
>
>
> echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>
>
> cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
>
>
>
> cd $ORIG
>
>
> rm -r $WORK
>
>
> exit 0
> }
>
>
>
> # make sure that killing the job copies back everything; won't copy back if
> # job killed while copying to workstation (a good thing!)
> # (GE must be configured to use SIGTERM for killing jobs!)
> trap cleanup_exit TERM
>
> setup_workdir
>
> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>
>
> # MAIN COMPUTATION RUNS HERE
>
> cleanup_exit
>
>
> #============
> # END SCRIPT
>
> What is happening here? Is a second SIGTERM sent by gridengine after some
> time? If so, what is the best way to ensure this copy-back completes on qdel?
Yes, this might happen - how long does the copy process take to complete? The
kill should be recorded in the messages for the node though (do you see a 90 sec
interval?).
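To measure how long the copy takes and to rule out a repeated SIGTERM as the
cause, the copy-back could be wrapped roughly like this. It is only a sketch:
copy_back is a hypothetical helper name, and it cannot protect against a
SIGKILL, which can't be trapped or ignored:

```shell
# Hypothetical helper: copy the contents of a scratch directory ($1) back to
# the original directory ($2) while ignoring any further SIGTERM.
copy_back () {
    # An ignored signal stays ignored in child processes, so the cp below
    # is not interrupted by a second SIGTERM either.
    trap '' TERM
    echo "-- [$(date)] copy-back started: $1 --> $2"
    cp -r "$1"/. "$2" || { echo "EE ERROR: Did not copy $1 --- check manually!"; return 1; }
    echo "-- [$(date)] copy-back finished"
}
```

The two timestamps in the job's output show how long the copy took. If the
"finished" line is missing, the job was killed in between.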
> As a note, I have tried sending SIGTERM as a notification instead, and
> setting the `notify` queue configuration key to 24:00:00
And did you change the signal in SGE's configuration ("NOTIFY_KILL=sigterm") and
submit with "-notify"? This would be better than changing the
"terminate_method" to a signal which the script must handle itself in order to
exit.
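On the job side this could look roughly as follows. It is a sketch, not a
tested setup: on_notify and the notified flag are hypothetical names. With
"-notify" the warning signal is SIGUSR2 unless NOTIFY_KILL changes it, so the
sketch traps both signals:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -notify
# With -notify, SGE sends a warning signal ahead of the final SIGKILL; the
# delay between the two is the "notify" time of the queue.

notified=0

# Hypothetical handler: record the notification and start the copy-back.
on_notify () {
    notified=1
    echo "-- [$(date)] notification received, starting copy-back"
    # the copy-back of the scratch directory would go here
}

# SIGUSR2 is the default notification; SIGTERM applies with NOTIFY_KILL=sigterm.
trap on_notify USR2 TERM
```

Note that bash runs a trap only after the current foreground command has
returned, so the handler is delayed while the main computation is still
running.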
> (basically, REALLY LONG). This seems to work in some of my tests, but it has
> failed in actual use when copying back large data files.
What do you mean by failed - was it killed anyway?
-- Reuti
> David
> --
> David L. Dotson
> Center for Biological Physics
> Arizona State University
>
> Email:
> [email protected]
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users