Hi,

> On 08.06.2018 at 23:46, Ilya M <4ilya.m+g...@gmail.com> wrote:
> 
> Hello,
> 
> I found unexpected behavior when setting both hard and soft time limits and 
> doing automatic rescheduling on SIGUSR1.

I wouldn't be surprised if the execd remembers that the job was already warned, 
hence it must be the hard limit now. Would your workflow allow:

. /usr/sge/default/common/settings.sh
trap "qresub $JOB_ID; exit 4;" SIGUSR1

Well, you get several job numbers this way. For the accounting with `qacct` you 
could use the job name instead of the job number to get all the runs listed 
though.
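
A fuller sketch of that idea, untested on a real cluster and only meant as a starting point. The settings path is the one from above; RESUBMIT_CMD is a hypothetical variable added here so the handler can be dry-run without a cluster, and the JOB_ID guard is likewise just an assumption to make the file safe to source outside of SGE:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# Assumption: same settings file as above; adjust to the local installation.
SGE_SETTINGS=/usr/sge/default/common/settings.sh
if [ -r "$SGE_SETTINGS" ]; then
    . "$SGE_SETTINGS"
fi

# Hypothetical override for dry runs; defaults to the real qresub.
RESUBMIT_CMD=${RESUBMIT_CMD:-qresub}

on_soft_limit() {
    # Submit a copy of this job (it gets a new job number), then leave
    # with an ordinary exit code instead of 99.
    $RESUBMIT_CMD "${JOB_ID:?JOB_ID is not set}"
    exit 4
}

# JOB_ID is set by SGE inside a job; skip the workload when it is absent.
if [ -n "${JOB_ID:-}" ]; then
    trap on_soft_limit SIGUSR1
    # Placeholder workload. Running it in the background and waiting lets
    # bash run the trap as soon as the signal reaches the shell.
    sleep 15 &
    wait $!
fi
```

Each run then starts with a fresh job number, so the earlier soft-limit warning should no longer matter. To see all runs together afterwards, something like `qacct -j reshed_test.sh` (the job name from the accounting output below) should list them.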

-- Reuti


> This is my test script:
> 
> #!/bin/bash
> 
> #$ -S /bin/bash
> #$ -l s_rt=0:0:5,h_rt=0:0:10
> #$ -j y
> 
> set -x
> set -e
> set -o pipefail
> set -u
> 
> trap "exit 99" SIGUSR1
> 
> trap "exit 2" SIGTERM
> 
> echo "hello world"
> 
> sleep 15
> 
> It should reschedule itself indefinitely when s_rt lapses. Yet, what is 
> happening is that rescheduling happens only once. On the second run the job 
> receives only SIGTERM and exits. Here is the script's output:
> 
> node140
> + set -e
> + set -o pipefail
> + set -u
> + trap 'exit 99' SIGUSR1
> + trap 'exit 2' SIGTERM
> + echo 'hello world'
> hello world
> + sleep 15
> User defined signal 1
> ++ exit 99
> node069
> + set -e
> + set -o pipefail
> + set -u
> + trap 'exit 99' SIGUSR1
> + trap 'exit 2' SIGTERM
> + echo 'hello world'
> hello world
> + sleep 15
> Terminated
> ++ exit 2
> 
> The execd logs confirm that the second time the job was killed for 
> exceeding h_rt:
> 
> 06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock 
> time - initiate soft notify method
> 06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with 
> exit status = 25
> 
> 06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock 
> time - initiate terminate method
> 
> And here is the accounting information:
> 
> ==============================================================
> qname        short.q             
> hostname     node140
> group        everyone            
> owner        ilya            
> project      project.p              
> department   defaultdepartment   
> jobname      reshed_test.sh      
> jobnumber    8030395             
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Fri Jun  8 21:19:40 2018
> start_time   Fri Jun  8 21:20:09 2018
> end_time     Fri Jun  8 21:20:15 2018
> granted_pe   NONE                
> slots        1                   
> failed       25  : rescheduling
> exit_status  99                  
> ru_wallclock 6            
> ...                
> ==============================================================
> qname        short.q             
> hostname     node069
> group        everyone            
> owner        ilya            
> project      project.p              
> department   defaultdepartment   
> jobname      reshed_test.sh      
> jobnumber    8030395             
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Fri Jun  8 21:19:40 2018
> start_time   Fri Jun  8 21:21:39 2018
> end_time     Fri Jun  8 21:21:50 2018
> granted_pe   NONE                
> slots        1                   
> failed       0    
> exit_status  2                   
> ru_wallclock 11           
> ...
> 
> 
> Is there anything in the configuration I could be missing? I am running 6.2u5.
> 
> Thank you,
> Ilya.
> 
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

