Re: [gridengine users] Automatic job rescheduling. Only one rescheduling is happening

Reuti Mon, 11 Jun 2018 09:59:16 -0700


> On 11.06.2018 at 18:43, Ilya M <4ilya.m+g...@gmail.com> wrote:
> 
> Hello,
> 
> Thank you for the suggestion, Reuti. I am not sure my users' pipelines can 
> deal with multiple job IDs, but perhaps they will be willing to modify their 
> code.


Other SGE commands, such as `qdel`, also accept the job name instead of the 
job number, so they can deal with such a setup too.
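
To illustrate handling the runs by name: a minimal sketch, assuming the accounting records have the `key value` layout shown in the `qacct` output quoted further down in this thread. The function name `summarize_runs` is made up for this example; it only reads text from stdin and does not call SGE itself.

```shell
#!/bin/bash
# Hypothetical helper: print one line per accounting record plus a total.
# Input: qacct-style "key value" lines on stdin.
summarize_runs() {
    awk '
        $1 == "hostname"     { host = $2 }
        $1 == "jobnumber"    { job = $2 }
        $1 == "exit_status"  { status = $2 }
        $1 == "ru_wallclock" {
            # ru_wallclock comes after the other three fields in each record
            wall = $2 + 0
            printf "%s %s exit=%s wallclock=%ds\n", job, host, status, wall
            total += wall
        }
        END { printf "total wallclock=%ds\n", total + 0 }
    '
}

# On a cluster (job name taken from the accounting output in this thread):
#   qacct -j reshed_test.sh | summarize_runs
#   qdel reshed_test.sh     # delete your jobs with that name
```

Since every resubmitted copy keeps the job name, a pipeline could key on the name and would not have to track each new job number.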


> On Mon, Jun 11, 2018 at 9:23 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
> 
> 
> I wouldn't be surprised if the execd remembers that the job was already 
> warned, hence it must be the hard limit now. Would your workflow allow:
> 
> This is happening on different nodes, so each execd cannot know any history 
> by itself, the master must be providing this information.

Aha, you are correct.

-- Reuti


> Can't help wondering if this is a configurable option.
> 
> Ilya.
> 
> 
>  
> . /usr/sge/default/common/settings.sh
> trap "qresub $JOB_ID; exit 4;" SIGUSR1
> 
> Well, you get several job numbers this way. For the accounting with `qacct` 
> you could use the job name instead of the job number to get all the runs 
> listed though.
> 
> -- Reuti
> 
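
The `qresub` trap above could be expanded into a complete job script roughly as follows. This is only a sketch under some assumptions: the execution hosts must be allowed to submit jobs, `RESUB_CMD` and the function names are invented here so the handler can be tried outside a cluster, and the `JOB_ID` guard simply keeps the payload from starting when the file is run outside SGE.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# Hypothetical handler: on the soft-limit warning (SIGUSR1), submit a copy of
# this job and exit non-zero. RESUB_CMD is an assumption made for testing;
# when it is unset, the real qresub is called.
on_soft_limit() {
    ${RESUB_CMD:-qresub} "$JOB_ID"
    exit 4
}

# Run the payload in the background and wait for it. bash defers a trap until
# the current foreground command returns, whereas `wait` returns as soon as a
# trapped signal arrives, so the handler is not held up by the payload.
run_payload() {
    trap on_soft_limit USR1
    sleep 15 &
    wait $!
}

# JOB_ID is set by SGE inside a running job; outside a job nothing is started.
if [ -n "${JOB_ID:-}" ]; then
    . /usr/sge/default/common/settings.sh   # path as used in this thread
    run_payload
fi
```

Each resubmission is a new job with a fresh number, so the copy starts again with the soft limit rather than going straight to the hard one.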
> 
> > This is my test script:
> > 
> > #!/bin/bash
> > 
> > #$ -S /bin/bash
> > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > #$ -j y
> > 
> > set -x
> > set -e
> > set -o pipefail
> > set -u
> > 
> > trap "exit 99" SIGUSR1
> > 
> > trap "exit 2" SIGTERM
> > 
> > echo "hello world"
> > 
> > sleep 15
> > 
> > It should reschedule itself indefinitely when s_rt lapses. Yet, what is 
> > happening is that rescheduling happens only once. On the second run the job 
> > receives only SIGTERM and exits. Here is the script's output:
> > 
> > node140
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > User defined signal 1
> > ++ exit 99
> > node069
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > Terminated
> > ++ exit 2
> > 
> > The execd logs confirm that the second time the job was killed for 
> > exceeding h_rt:
> > 
> > 06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
> > 06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with exit status = 25
> > 
> > 06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method
> > 
> > And here is the accounting information:
> > 
> > ==============================================================
> > qname        short.q             
> > hostname     node140
> > group        everyone            
> > owner        ilya            
> > project      project.p              
> > department   defaultdepartment   
> > jobname      reshed_test.sh      
> > jobnumber    8030395             
> > taskid       undefined
> > account      sge                 
> > priority     0                   
> > qsub_time    Fri Jun  8 21:19:40 2018
> > start_time   Fri Jun  8 21:20:09 2018
> > end_time     Fri Jun  8 21:20:15 2018
> > granted_pe   NONE                
> > slots        1                   
> > failed       25  : rescheduling
> > exit_status  99                  
> > ru_wallclock 6            
> > ...                
> > ==============================================================
> > qname        short.q             
> > hostname     node069
> > group        everyone            
> > owner        ilya            
> > project      project.p              
> > department   defaultdepartment   
> > jobname      reshed_test.sh      
> > jobnumber    8030395             
> > taskid       undefined
> > account      sge                 
> > priority     0                   
> > qsub_time    Fri Jun  8 21:19:40 2018
> > start_time   Fri Jun  8 21:21:39 2018
> > end_time     Fri Jun  8 21:21:50 2018
> > granted_pe   NONE                
> > slots        1                   
> > failed       0    
> > exit_status  2                   
> > ru_wallclock 11           
> > ...
> > 
> > 
> > Is there anything in the configuration I could be missing? I am running 6.2u5.
> > 
> > Thank you,
> > Ilya.
> > 
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> 
