Hi,

> On 08.06.2018 at 23:46, Ilya M <4ilya.m+g...@gmail.com> wrote:
>
> Hello,
>
> I found an unexpected behavior when setting hard and soft time limits and
> doing automatic rescheduling by SIGUSR1.

I wouldn't be surprised if the execd remembers that the job was already warned, hence it must be the hard limit now. Would your workflow allow:

    . /usr/sge/default/common/settings.sh
    trap "qresub $JOB_ID; exit 4;" SIGUSR1

Well, you get several job numbers this way. For the accounting with `qacct` you could use the job name instead of the job number to get all the runs listed, though.

-- Reuti

> This is my test script:
>
> #!/bin/bash
>
> #$ -S /bin/bash
> #$ -l s_rt=0:0:5,h_rt=0:0:10
> #$ -j y
>
> set -x
> set -e
> set -o pipefail
> set -u
>
> trap "exit 99" SIGUSR1
>
> trap "exit 2" SIGTERM
>
> echo "hello world"
>
> sleep 15
>
> It should reschedule itself indefinitely when s_rt lapses. Yet rescheduling
> happens only once. On the second run the job receives only SIGTERM and
> exits. Here is the script's output:
>
> node140
> + set -e
> + set -o pipefail
> + set -u
> + trap 'exit 99' SIGUSR1
> + trap 'exit 2' SIGTERM
> + echo 'hello world'
> hello world
> + sleep 15
> User defined signal 1
> ++ exit 99
> node069
> + set -e
> + set -o pipefail
> + set -u
> + trap 'exit 99' SIGUSR1
> + trap 'exit 2' SIGTERM
> + echo 'hello world'
> hello world
> + sleep 15
> Terminated
> ++ exit 2
>
> The execd logs confirm that the second time the job was killed for
> exceeding h_rt:
>
> 06/08/2018 21:20:15| main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
> 06/08/2018 21:20:59| main|node140|E|shepherd of job 8030395.1 exited with exit status = 25
>
> 06/08/2018 21:21:45| main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method
>
> And here is the accounting information:
>
> ==============================================================
> qname        short.q
> hostname     node140
> group        everyone
> owner        ilya
> project      project.p
> department   defaultdepartment
> jobname      reshed_test.sh
> jobnumber    8030395
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Fri Jun 8 21:19:40 2018
> start_time   Fri Jun 8 21:20:09 2018
> end_time     Fri Jun 8 21:20:15 2018
> granted_pe   NONE
> slots        1
> failed       25 : rescheduling
> exit_status  99
> ru_wallclock 6
> ...
> ==============================================================
> qname        short.q
> hostname     node069
> group        everyone
> owner        ilya
> project      project.p
> department   defaultdepartment
> jobname      reshed_test.sh
> jobnumber    8030395
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Fri Jun 8 21:19:40 2018
> start_time   Fri Jun 8 21:21:39 2018
> end_time     Fri Jun 8 21:21:50 2018
> granted_pe   NONE
> slots        1
> failed       0
> exit_status  2
> ru_wallclock 11
> ...
>
>
> Is there anything in the configuration I could be missing? Running 6.2u5.
>
> Thank you,
> Ilya.
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
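
For reference, here is a minimal sketch of how the `qresub` trap from the reply could be dropped into the test script. It is untested against 6.2u5 and only illustrates the idea. The `settings.sh` path, the `qresub $JOB_ID` call and the exit code 4 are taken from the reply; `RESUB_CMD` and `resubmit_self` are hypothetical names introduced here so the handler can be tried outside the cluster.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# Path taken from the reply above; adjust for your installation.
if [ -r /usr/sge/default/common/settings.sh ]; then
    . /usr/sge/default/common/settings.sh
fi

# Hypothetical override (not an SGE variable) so the handler can be
# dry-run outside the cluster, e.g. with RESUB_CMD=echo.
RESUB_CMD=${RESUB_CMD:-qresub}

# Hypothetical helper: submit a copy of this job, which gets a new job number.
resubmit_self() {
    "$RESUB_CMD" "${JOB_ID:?JOB_ID is not set}"
}

# SIGUSR1 (sent at s_rt in the run above): resubmit instead of exiting 99,
# then leave with the exit code used in the reply.
trap 'resubmit_self; exit 4' SIGUSR1
# SIGTERM (sent at h_rt in the run above): exit without resubmitting.
trap 'exit 2' SIGTERM

echo "hello world"
# The real payload goes here (the test script used `sleep 15`).
```

Because every run is then a separate job, the soft limit should apply afresh to each copy, at the cost of the job number changing on every resubmission, as noted in the reply.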