Re-reading the man page one more time made me think that this is the desired and logical behavior: if the job id remains the same, then the h_rt and s_rt counters cannot be reset. The job starts only once, and execution *continues* after rescheduling:

"RESOURCE LIMITS
The first two resource limit parameters, s_rt and h_rt, are implemented by Grid Engine. They define the "real time" (also called "elapsed" or "wall clock" time) *having passed since the start of the job*..."
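So to get a fresh wall-clock budget, the job has to come back under a *new* job id rather than be rescheduled under the old one. A minimal sketch of a submit script along those lines, untested, combining Reuti's qresub trap (quoted below) with the directives from my test script:

#!/bin/bash
#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# Make qresub available inside the job environment.
. /usr/sge/default/common/settings.sh

# On the soft limit (SIGUSR1), submit a copy of this job.
# The copy runs under a new job id, so its s_rt/h_rt clocks
# start from zero instead of continuing the old job's clock.
trap "qresub $JOB_ID; exit 4" SIGUSR1

echo "hello world"
sleep 15

The price, as noted below, is that every restart shows up under its own job number.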
"RESOURCE LIMITS The first two resource limit parameters, s_rt and h_rt, are implemented by Grid Engine. They define the "real time" or also called "elapsed" or "wall clock" time *having passed since the start of the job*...' Ilya. On Mon, Jun 11, 2018 at 9:57 AM, Reuti <re...@staff.uni-marburg.de> wrote: > > > Am 11.06.2018 um 18:43 schrieb Ilya M <4ilya.m+g...@gmail.com>: > > > > Hello, > > > > Thank you for the suggestion, Reuti. Not sure if my users' pipelines can > deal with multiple job ids, perhaps they will be willing to modify their > code. > > Also other commands in SGE like `qdel` allow to use the job name to deal > with such a configuration. > > > > On Mon, Jun 11, 2018 at 9:23 AM, Reuti <re...@staff.uni-marburg.de> > wrote: > > Hi, > > > > > > I wouldn't be surprised if the execd remembers that the job was already > warned, hence it must be the hard limit now. Would your workflow allow: > > > > This is happening on different nodes, so each execd cannot know any > history by itself, the master must be providing this information. > > Aha, you correct. > > -- Reuti > > > > Can't help wondering if this is a configurable option. > > > > Ilya. > > > > > > > > . /usr/sge/default/common/settings.sh > > trap "qresub $JOB_ID; exit 4;" SIGUSR1 > > > > Well, you get several job numbers this way. For the accounting with > `qacct` you could use the job name instead of the job number to get all the > runs listed though. > > > > -- Reuti > > > > > > > This is my test script: > > > > > > #!/bin/bash > > > > > > #$ -S /bin/bash > > > #$ -l s_rt=0:0:5,h_rt=0:0:10 > > > #$ -j y > > > > > > set -x > > > set -e > > > set -o pipefail > > > set -u > > > > > > trap "exit 99" SIGUSR1 > > > > > > trap "exit 2" SIGTERM > > > > > > echo "hello world" > > > > > > sleep 15 > > > > > > It should reschedule itself indefinitely when s_rt lapses. Yet, what > is happening is that rescheduling happens only once. On the second run the > job receives only SIGTERM and exits. Here is the script's output: > > > > > > node140 > > > + set -e > > > + set -o pipefail > > > + set -u > > > + trap 'exit 99' SIGUSR1 > > > + trap 'exit 2' SIGTERM > > > + echo 'hello world' > > > hello world > > > + sleep 15 > > > User defined signal 1 > > > ++ exit 99 > > > node069 > > > + set -e > > > + set -o pipefail > > > + set -u > > > + trap 'exit 99' SIGUSR1 > > > + trap 'exit 2' SIGTERM > > > + echo 'hello world' > > > hello world > > > + sleep 15 > > > Terminated > > > ++ exit 2 > > > > > > Execd logs confirms that for the second time the jobs was killed for > exceeding h_rt: > > > > > > 06/08/2018 21:20:15| main|node140|W|job 8030395.1 exceeded soft > wallclock time - initiate soft notify method > > > 06/08/2018 21:20:59| main|node140|E|shepherd of job 8030395.1 exited > with exit status = 25 > > > > > > 06/08/2018 21:21:45| main|node069|W|job 8030395.1 exceeded hard > wallclock time - initiate terminate method > > > > > > And here is the accounting information: > > > > > > ============================================================== > > > qname short.q > > > hostname node140 > > > group everyone > > > owner ilya > > > project project.p > > > department defaultdepartment > > > jobname reshed_test.sh > > > jobnumber 8030395 > > > taskid undefined > > > account sge > > > priority 0 > > > qsub_time Fri Jun 8 21:19:40 2018 > > > start_time Fri Jun 8 21:20:09 2018 > > > end_time Fri Jun 8 21:20:15 2018 > > > granted_pe NONE > > > slots 1 > > > failed 25 : rescheduling > > > exit_status 99 > > > ru_wallclock 6 > > > ... 
> > > > ==============================================================
> > > > qname        short.q
> > > > hostname     node069
> > > > group        everyone
> > > > owner        ilya
> > > > project      project.p
> > > > department   defaultdepartment
> > > > jobname      reshed_test.sh
> > > > jobnumber    8030395
> > > > taskid       undefined
> > > > account      sge
> > > > priority     0
> > > > qsub_time    Fri Jun 8 21:19:40 2018
> > > > start_time   Fri Jun 8 21:21:39 2018
> > > > end_time     Fri Jun 8 21:21:50 2018
> > > > granted_pe   NONE
> > > > slots        1
> > > > failed       0
> > > > exit_status  2
> > > > ru_wallclock 11
> > > > ...
> > > >
> > > > Is there anything in the configuration I could be missing? Running 6.2u5.
> > > >
> > > > Thank you,
> > > > Ilya.
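P.S. Since each qresub restart runs under its own job number, the combined accounting can still be pulled in one query by job name, as Reuti pointed out above. For the test job that would be something like:

qacct -j reshed_test.sh

which should list one accounting record per run of that name, instead of having to query each job number (e.g. `qacct -j 8030395`) separately.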