On 02.12.2013 at 01:45, David Dotson wrote:
>
> On 12/01/2013 04:03 PM, David Dotson wrote:
>>
>> On 12/01/2013 03:13 PM, Reuti wrote:
>>> Hi,
>>>
>>> On 01.12.2013 at 22:20, David Dotson wrote:
>>>
>>>> Greetings,
>>>>
>>>> We have the terminate_method for our queue set to SIGTERM, so that when
>>>> the following submission script runs, it should copy back all the files
>>>> generated to the original directory. The signal is indeed caught, and the
>>>> copy-back takes place, but it often dies without completing after a short
>>>> amount of time.
>>>>
>>>> # BEGIN SCRIPT
>>>> #============
>>>>
>>>> # standard gridengine script with automatic copying back of data
>>>> #$ -S /bin/bash
>>>> #$ -N grid_job
>>>> #$ -pe singlenode 16
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -R y -r n
>>>>
>>>>
>>>> # set up scratch directory
>>>> WORK=/scratch/${USER}/WORK/${JOB_ID}
>>>> ORIG=$PWD
>>> Is the scratch directory supplied by SGE, i.e. $TMPDIR, not sufficient?
>>> It's the one set up under "tmpdir" in the queue definition.
>> I was not aware of this option. It shouldn't make a difference, but could
>> it? We may very well change our standard submission scripts to reference
>> $TMPDIR instead if that's the case.
> I realized why we chose not to use $TMPDIR: this directory is automatically
> deleted on job exit. We prefer being able to salvage data in the case of a
> copy failure, power loss, etc.
Yep, that is correct. BTW: is all the data in there valuable, or does it
include unnecessary scratch files too? If so, the scratch files could go to
$TMPDIR and only the important ones to your persistent directory.
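
A minimal sketch of such a split, as an illustration only. PERSIST and the
two file names are hypothetical; in your script PERSIST would be $WORK. The
fallback to mktemp is there only so the snippet also runs outside of a job,
where SGE has not set $TMPDIR.

```shell
#!/bin/bash
# Sketch only: split job files between SGE's auto-deleted $TMPDIR and a
# persistent directory. PERSIST and the file names are hypothetical.

# $TMPDIR is set by SGE inside a job; fall back to a fresh temp dir elsewhere.
SCRATCH=${TMPDIR:-$(mktemp -d)}
# Persistent directory (assumption: this would be $WORK in the job script).
PERSIST=${PERSIST:-$(mktemp -d)}

mkdir -p "$SCRATCH" "$PERSIST"

# Disposable intermediates go to $SCRATCH; SGE removes it at job exit.
echo "intermediate" > "$SCRATCH/step1.tmp"
# Results worth salvaging after a failed copy-back go to $PERSIST.
echo "result" > "$PERSIST/result.dat"
```

Only $PERSIST then needs to be copied back, which should also shorten the
copy-back itself.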
-- Reuti
>>>
>>>
>>>> function setup_workdir () {
>>>>
>>>> echo "-- [$(date)] setting up $WORK"
>>>>
>>>> mkdir -p $WORK
>>>> test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>>>>
>>>> cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
>>>> copy_success="True"
>>>> }
>>>>
>>>>
>>>> function cleanup_exit () {
>>>>
>>>> # ensure that we don't overwrite complete files with partial ones if
>>>> # job killed mid-copy
>>>>
>>>> echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>>>>
>>>> cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
>>>>
>>>> cd $ORIG
>>>>
>>>> rm -r $WORK
>>>>
>>>> exit 0
>>>> }
>>>>
>>>>
>>>> # make sure that killing the job copies back everything; won't copy back
>>>> # if job killed while copying to workstation (a good thing!)
>>>> # (GE must be configured to use SIGTERM for killing jobs!)
>>>> trap cleanup_exit TERM
>>>> setup_workdir
>>>>
>>>> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>>>> # MAIN COMPUTATION RUNS HERE
>>>> cleanup_exit
>>>>
>>>>
>>>> #============
>>>> # END SCRIPT
>>>>
>>>> What is happening here? Is a second SIGTERM sent by gridengine after some
>>>> time? If so, what is the best way to ensure this copy-back completes on
>>>> qdel?
>>> Yes, this might happen. How long does the copy process take to complete?
>>> It should be recorded in the message for the node, though (do you see a 90
>>> sec interval?).
>> The time it takes for the job to die during cleanup appears to vary. I have
>> had instances where it takes minutes, and others in which it takes seconds.
>> This has made it very frustrating to figure out, and it's why I'm reaching
>> out for some help.
>>>
>>>
>>>> As a note, I have tried sending SIGTERM as a notification instead, and
>>>> setting the `notify` queue configuration key to 24:00:00
>>> And changed the signal in SGE's configuration ("NOTIFY_KILL=sigterm") and
>>> submitted with "-notify"? This would be better than changing the
>>> "terminate_method" to something which must be handled inside the script to
>>> kill itself.
>> Correct. I added this key to "execd_params" in the SGE configuration, and
>> submitted the job with the "-notify" flag. As I said, it seems to be working
>> in some quick tests I did today, but I do recall this failing in the past
>> when actually copying back large files (meaning the copy was killed
>> mid-copy).
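
A minimal sketch of a handler that is at least not interrupted by a further
SIGTERM during the copy-back, for illustration only. It assumes the job is
submitted with "qsub -notify" and NOTIFY_KILL=sigterm as described above.
run_job, the child variable and output.dat are hypothetical names, and the
sleep stands in for the main computation. WORK and ORIG default to temporary
directories only so the snippet can be tried outside of SGE.

```shell
#!/bin/bash
# Sketch only: copy-back that is not interrupted by a repeated SIGTERM.
WORK=${WORK:-$(mktemp -d)}
ORIG=${ORIG:-$(mktemp -d)}

cleanup_exit () {
    # Ignore any further SIGTERM. An ignored signal stays ignored in
    # processes started from here on, so cp is protected as well.
    trap '' TERM
    # Stop the computation (if still running) before copying its files.
    if [ -n "${child:-}" ]; then
        kill -TERM "$child" 2>/dev/null
        wait "$child" 2>/dev/null
    fi
    echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
    cp "$WORK"/* "$ORIG" || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
    cd "$ORIG" || exit 1
    rm -r "$WORK"
    exit 0
}

run_job () {
    trap cleanup_exit TERM
    cd "$WORK" || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
    echo "partial result" > output.dat   # stand-in for files written so far
    # Run the computation in the background and wait for it: bash delays a
    # trap until the current foreground command has returned.
    sleep 30 &
    child=$!
    wait "$child"
    cleanup_exit
}

# In a real job script the last line would simply be: run_job
```

This only helps against a second SIGTERM. The final SIGKILL cannot be caught
or ignored, so the notify time of the queue still has to be longer than the
copy-back takes.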
>>>
>>>
>>>> (basically, REALLY LONG). This seems to work in some of my tests, but it
>>>> has failed in actual use when copying back large data files.
>>> What do you mean by failed - it was killed anyway?
>>>
>>> -- Reuti
>>>
>>>
>>>> David
>>>> --
>>>> David L. Dotson
>>>> Center for Biological Physics
>>>> Arizona State University
>>>>
>>>> Email:
>>>> [email protected]
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>
>
> --
> David L. Dotson
> Center for Biological Physics
> Arizona State University
>
> Email: [email protected]
>