Re: [gridengine users] Caught SIGTERM on termination, but handling fails after some time

David Dotson Sun, 01 Dec 2013 16:48:10 -0800


On 12/01/2013 04:03 PM, David Dotson wrote:

On 12/01/2013 03:13 PM, Reuti wrote:
Hi,

On 01.12.2013 at 22:20, David Dotson wrote:
Greetings,
We have the terminate_method for our queue set to SIGTERM, so that when the following submission script runs, it should copy back all the files generated to the original directory. The signal is indeed caught, and the copy-back takes place, but it often dies without completing after a short amount of time.
# BEGIN SCRIPT
#============

# standard gridengine script with automatic copying back of data
#$ -S /bin/bash
#$ -N grid_job
#$ -pe singlenode 16
#$ -cwd
#$ -j y
#$ -R y -r n


# set up scratch directory
WORK=/scratch/${USER}/WORK/${JOB_ID}
ORIG=$PWD
The scratch directory supplied by SGE, i.e. $TMPDIR, is not sufficient? It's the one set up in the queue definition "tmpdir".
I was not aware of this option. It shouldn't make a difference, but could it? We may very well change our standard submission scripts to reference $TMPDIR instead if that's the case.

I realized why we chose not to use $TMPDIR: this directory is automatically deleted on job exit. We prefer being able to salvage data in the case of a copy failure, power loss, etc.

function setup_workdir () {
    echo "-- [$(date)] setting up $WORK"
    mkdir -p $WORK
    test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
    cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
    copy_success="True"
}


function cleanup_exit () {
    # ensure that we don't overwrite complete files with partial ones if job killed mid-copy
    echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
    cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
    cd $ORIG
    rm -r $WORK
    exit 0
}
# make sure that killing the job copies back everything; won't copy back if job
# killed while copying to workstation (a good thing!)
# (GE must be configured to use SIGTERM for killing jobs!)
trap cleanup_exit TERM

setup_workdir

cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }

# MAIN COMPUTATION RUNS HERE

cleanup_exit


#============
# END SCRIPT
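
The comment in cleanup_exit about not overwriting complete files with partial ones could be addressed with a copy-then-rename pattern. This is only a sketch of that swapped-in technique, not what the script above does: copy_back, the ".part" suffix, and the demo directories are hypothetical names, and it assumes the rename happens within a single filesystem, where it is atomic.

```shell
#!/bin/bash
# Sketch only: copy_back is a hypothetical helper, not part of the script above.
# Each file is copied under a temporary ".part" name and renamed only after the
# copy succeeds, so an interrupted copy never replaces a complete file.
copy_back () {
    local src_dir=$1 dest_dir=$2 f name
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue            # skip subdirectories and an empty glob
        name=$(basename "$f")
        cp "$f" "$dest_dir/$name.part" || return 1
        mv "$dest_dir/$name.part" "$dest_dir/$name" || return 1
    done
}

# Demonstration with throwaway directories standing in for $WORK and $ORIG.
demo_src=$(mktemp -d)
demo_dest=$(mktemp -d)
echo "complete" > "$demo_dest/md.log"      # an older, complete file
echo "newer data" > "$demo_src/md.log"     # the file produced by the job
copy_back "$demo_src" "$demo_dest" && echo "copy-back finished"
```

If the job is killed mid-copy, the destination keeps its previous complete file plus a leftover ".part" file that can be inspected or deleted by hand.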
What is happening here? Is a second SIGTERM sent by gridengine after some time? If so, what is the best way to ensure this copy-back completes on qdel?
Yes, this might happen - how long does the copy process take to complete? It should be recorded in the message for the node though (do you see a 90 sec interval?).
The time it takes for the job to die during cleanup appears to vary. I have had instances where it takes minutes, and others in which it takes seconds. This has made it very frustrating to figure out, and it's why I'm reaching out for some help.
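
If a repeated SIGTERM is what interrupts the copy-back, one possible guard is to ignore SIGTERM inside the handler. The sketch below assumes that is the cause, which the thread has not confirmed, and it cannot help against SIGKILL, which cannot be trapped. It also runs the computation in the background and waits on it, because bash only runs a trap handler once the current foreground command has returned. DEMO_WORK, DEMO_ORIG, main_pid, and the "sleep 5" stand-in are hypothetical names used so the example can run outside gridengine.

```shell
#!/bin/bash
# Sketch only. DEMO_WORK and DEMO_ORIG are throwaway stand-ins for $WORK and
# $ORIG, and "sleep 5" stands in for the main computation.
DEMO_WORK=$(mktemp -d)
DEMO_ORIG=$(mktemp -d)

cleanup_exit () {
    # Ignore further SIGTERMs so a repeated signal cannot interrupt the copy.
    # SIGKILL cannot be trapped, so this does not help against a hard kill.
    trap '' TERM
    # Stop the computation first (harmless if it has already finished).
    kill "$main_pid" 2>/dev/null
    echo "-- [$(date)] cleaning up: $DEMO_WORK --> $DEMO_ORIG"
    cp "$DEMO_WORK"/* "$DEMO_ORIG" || { echo "EE ERROR: Did not copy $DEMO_WORK --- check manually!"; exit 1; }
    rm -r "$DEMO_WORK"
    exit 0
}

(
    trap cleanup_exit TERM
    echo "result" > "$DEMO_WORK/out.dat"   # stand-in for job output
    # Run the computation in the background and wait on it: wait returns as
    # soon as a trapped signal arrives, and the handler then runs at once.
    sleep 5 &
    main_pid=$!
    wait "$main_pid"
    cleanup_exit
) &
job_pid=$!

sleep 1                      # give the job time to start
kill -TERM "$job_pid"        # simulate the signal sent on qdel
wait "$job_pid"
echo "job exit status: $?"
```

In a real submission script the subshell and the simulated kill would be dropped; only the handler and the background-plus-wait pattern carry over.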
As a note, I have tried sending SIGTERM as a notification instead, and setting the `notify` queue configuration key to 24:00:00
And changed the signal in SGE's configuration ("NOTIFY_KILL=sigterm") and submitted with "-notify"? This would be better than changing the "terminate_method" to something which must be handled inside the script to kill itself.
Correct. I added this key to "execd_params" in the SGE configuration, and submitted the job with the "-notify" flag. As I said, it seems to be working in some quick tests I did today, but I do recall this failing in the past when actually copying back large files (meaning the copy was killed mid-copy).
(basically, REALLY LONG). This seems to work in some of my tests, but it has failed in actual use when copying back large data files.
What do you mean by failed - it was killed anyway?
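
For reference, the "-notify" approach discussed above can be sketched as follows. With "-notify", gridengine sends SIGUSR1 shortly before SIGSTOP and SIGUSR2 shortly before SIGKILL, and the NOTIFY_KILL setting mentioned in this thread replaces the SIGUSR2 warning with another signal. The handler name on_kill_warning, the marker file, and the "sleep 5" stand-in are hypothetical and exist only to make the example runnable outside gridengine.

```shell
#!/bin/bash
#$ -notify
# Sketch only. Both SIGUSR2 (the default kill warning) and SIGTERM (the
# warning when NOTIFY_KILL=sigterm is configured) are trapped here.

notify_marker=$(mktemp)      # hypothetical file, used only to show the handler ran

on_kill_warning () {
    trap '' USR2 TERM                # ignore repeats while the copy-back runs
    kill "$main_pid" 2>/dev/null     # stop the computation before copying
    echo "warning received" > "$notify_marker"
    # the copy-back (cleanup_exit in the script above) would go here
    exit 0
}

(
    trap '' USR1                     # a suspend warning should not end the job
    trap on_kill_warning USR2 TERM
    sleep 5 &                        # stand-in for the main computation
    main_pid=$!
    wait "$main_pid"
) &
job_pid=$!

sleep 1                              # give the job time to start
kill -USR2 "$job_pid"                # simulate the warning gridengine would send
wait "$job_pid"
```

The time available to the handler is bounded by the queue's `notify` interval, so the copy-back has to finish within it.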

-- Reuti
David
--
David L. Dotson
Center for Biological Physics
Arizona State University

Email:
[email protected]
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

