> On 11.06.2018 at 18:43, Ilya M <4ilya.m+g...@gmail.com> wrote:
>
> Hello,
>
> Thank you for the suggestion, Reuti. Not sure if my users' pipelines can deal
> with multiple job IDs, perhaps they will be willing to modify their code.

Also, other commands in SGE like `qdel` allow you to use the job name to deal with such a configuration.

> On Mon, Jun 11, 2018 at 9:23 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> > Hi,
> >
> > I wouldn't be surprised if the execd remembers that the job was already
> > warned, hence it must be the hard limit now. Would your workflow allow:
>
> This is happening on different nodes, so each execd cannot know any history
> by itself, the master must be providing this information.

Aha, you are correct.

-- Reuti

> Can't help wondering if this is a configurable option.
>
> Ilya.
>
> > . /usr/sge/default/common/settings.sh
> > trap "qresub $JOB_ID; exit 4;" SIGUSR1
> >
> > Well, you get several job numbers this way. For the accounting with `qacct`
> > you could use the job name instead of the job number to get all the runs
> > listed though.
> >
> > -- Reuti
> >
> > > This is my test script:
> > >
> > > #!/bin/bash
> > >
> > > #$ -S /bin/bash
> > > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > > #$ -j y
> > >
> > > set -x
> > > set -e
> > > set -o pipefail
> > > set -u
> > >
> > > trap "exit 99" SIGUSR1
> > >
> > > trap "exit 2" SIGTERM
> > >
> > > echo "hello world"
> > >
> > > sleep 15
> > >
> > > It should reschedule itself indefinitely when s_rt lapses. Yet, what is
> > > happening is that rescheduling happens only once. On the second run the job
> > > receives only SIGTERM and exits.
> > > Here is the script's output:
> > >
> > > node140
> > > + set -e
> > > + set -o pipefail
> > > + set -u
> > > + trap 'exit 99' SIGUSR1
> > > + trap 'exit 2' SIGTERM
> > > + echo 'hello world'
> > > hello world
> > > + sleep 15
> > > User defined signal 1
> > > ++ exit 99
> > > node069
> > > + set -e
> > > + set -o pipefail
> > > + set -u
> > > + trap 'exit 99' SIGUSR1
> > > + trap 'exit 2' SIGTERM
> > > + echo 'hello world'
> > > hello world
> > > + sleep 15
> > > Terminated
> > > ++ exit 2
> > >
> > > The execd logs confirm that the second time the job was killed for
> > > exceeding h_rt:
> > >
> > > 06/08/2018 21:20:15| main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
> > > 06/08/2018 21:20:59| main|node140|E|shepherd of job 8030395.1 exited with exit status = 25
> > >
> > > 06/08/2018 21:21:45| main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method
> > >
> > > And here is the accounting information:
> > >
> > > ==============================================================
> > > qname        short.q
> > > hostname     node140
> > > group        everyone
> > > owner        ilya
> > > project      project.p
> > > department   defaultdepartment
> > > jobname      reshed_test.sh
> > > jobnumber    8030395
> > > taskid       undefined
> > > account      sge
> > > priority     0
> > > qsub_time    Fri Jun 8 21:19:40 2018
> > > start_time   Fri Jun 8 21:20:09 2018
> > > end_time     Fri Jun 8 21:20:15 2018
> > > granted_pe   NONE
> > > slots        1
> > > failed       25 : rescheduling
> > > exit_status  99
> > > ru_wallclock 6
> > > ...
> > > ==============================================================
> > > qname        short.q
> > > hostname     node069
> > > group        everyone
> > > owner        ilya
> > > project      project.p
> > > department   defaultdepartment
> > > jobname      reshed_test.sh
> > > jobnumber    8030395
> > > taskid       undefined
> > > account      sge
> > > priority     0
> > > qsub_time    Fri Jun 8 21:19:40 2018
> > > start_time   Fri Jun 8 21:21:39 2018
> > > end_time     Fri Jun 8 21:21:50 2018
> > > granted_pe   NONE
> > > slots        1
> > > failed       0
> > > exit_status  2
> > > ru_wallclock 11
> > > ...
> > >
> > > Is there anything in the configuration I could be missing? Running 6.2u5.
> > >
> > > Thank you,
> > > Ilya.
> > >
> > > _______________________________________________
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
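
For trying the two traps by hand, outside the scheduler, here is a minimal sketch that is not taken from the thread. It assumes the same exit codes as the test script above (99 on SIGUSR1, 2 on SIGTERM); `run_job` and `payload_pid` are hypothetical names. The payload runs in the background and the shell blocks in `wait`, because bash defers a trap until the current foreground command returns, whereas `wait` is interrupted as soon as a trapped signal arrives. It only exercises the handlers and does not explain why the second run receives SIGTERM directly.

```shell
#!/bin/bash
# Sketch only: mimics the traps from the test script so they can be
# exercised without SGE. run_job and payload_pid are made-up names.

run_job() {
    # $1: how long the stand-in payload should run, in seconds
    trap 'kill "$payload_pid" 2>/dev/null; exit 99' USR1   # soft limit case
    trap 'kill "$payload_pid" 2>/dev/null; exit 2' TERM    # hard limit case

    echo "hello world"

    # Run the payload in the background and block in wait, so a trapped
    # signal is handled right away rather than after the payload ends.
    sleep "$1" &
    payload_pid=$!
    wait "$payload_pid"
}
```

Since the handlers call `exit`, the function is best started in the background, for example `run_job 15 & pid=$!; sleep 1; kill -USR1 "$pid"; wait "$pid"; echo $?`, and the same again with `kill -TERM` to see the other exit code.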