Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.
On Wed, Sep 14, 2016 at 08:52:12PM +, Lee, Wayne wrote:
> Hi William,
>
> I've performed some tests by submitting a basic shell script which dumps the
> environment (i.e. env) and performs either an "exit 0", "exit 99", "exit
> 100", "exit 137" or other exit status codes. If I set my script to "exit 0",
> the job exits normally. If I set my script to "exit 99", then the job gets
> requeued for execution and if I set my script to "exit 100", the job goes
> into error state. All of these scenarios are what I expect based on the man
> pages for "queue_conf". However, I am unable to use any other "exit ##",
> trap it and force the job to error state by the method I describe.
>

What I was after was what happens when you try. You've described your setup
in detail but your results are missing. When the job exits, for example, with
107 and the epilog exits 100, what happens then? Does the queue go into an
error state?

> I'm not sure if what I'm trying to do makes sense or should I consider a
> different way to do what I am attempting. I can look at the
> "starter_method" to see if this is a viable way.

As per my prior message, I think using the starter_method as a wrapper will
work more reliably than tweaking things in the epilog.

William

signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.
On 14.09.2016 at 22:52, Lee, Wayne wrote:
> Hi William,
>
> Thanks for the prompt reply. Apologies for not including more detail with
> regards to my query concerning getting Grid Engine to force all jobs with an
> exit status other than 0, 99 or 100 to error state (i.e. exit code of 100).
>
> As I stated in my earlier post our jobs execute an epilog script which is
> named "gp_epilog" at the conclusion of the job running on a given execution
> host. The "gp_epilog" essentially does the following:
>
> 1. Obtains the "exit_status" value from the execution host's job spool
> directory from a file named "usage". As an example, take a look at the
> directory listing below from a test job on an execution host with name
> "g00801" where the execution host's spool directory is /tmp/ge/. You then
> will see the "usage" file. The contents of the "usage" file are shown below
> the directory contents. The "exit_status" in the example below is 137.
>
> Directory listing of /tmp/ge/g00801/active_jobs/1012.1
>
> /tmp/ugev841/g00801/active_jobs/1012.1:
> total 48
> drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
> drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
> -rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
> -rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
> prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
> -rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
> -rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
> -rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage
>
>
> Contents of Usage file output !!!
> =
> wait_status=2193
> exit_status=137
> signal=0
> start_time=1473790362804
> end_time=1473790367828
> ru_wallclock=5.024000
> ru_utime=0.004999
> ru_stime=0.001999
> ru_maxrss=1828
> ru_ixrss=0
> ru_idrss=0
> ru_isrss=0
> ru_minflt=3460
> ru_majflt=0
> ru_nswap=0
> ru_inblock=8
> ru_oublock=96
> ru_msgsnd=0
> ru_msgrcv=0
> ru_nsignals=0
> ru_nvcsw=73
> ru_nivcsw=11
>
>
> 2. Once the value of the "exit_status" is parsed from the "usage" file, the
> "gp_epilog" script just does a check to see if the value of "exit_status"
> doesn't equal 0, 99 or 100. If it doesn't equal 0, 99 or 100, then the
> "gp_epilog" script executes an "exit 100". I'm assuming the "exit_status"
> value from the "usage" file is from the application that is from the job/job
> tasks that executed on the execution host g00801 from the example I've listed
> above. I was thinking that if I issue an "exit 100" from within the
> "gp_epilog" script I've got, the job/job task would show up in "error state".
> I would see this show up in a "qstat" output with the job/job task showing
> a state of "Eqw" or something similar.
>
> I've performed some tests by submitting a basic shell script which dumps the
> environment (i.e. env) and performs either an "exit 0", "exit 99", "exit
> 100", "exit 137" or other exit status codes. If I set my script to "exit 0",
> the job exits normally. If I set my script to "exit 99", then the job gets
> requeued for execution and if I set my script to "exit 100", the job goes
> into error state. All of these scenarios are what I expect based on the man
> pages for "queue_conf". However, I am unable to use any other "exit ##",
> trap it and force the job to error state by the method I describe.

This should work. In the `qacct` output you can even see a mixture of the
real exit code and the job being rescheduled:

$ qacct -j 1083478
==
qname        parallel
hostname     node17
...
slots        4
failed       30 : rescheduling on application error
exit_status  56
...

While the job exiting with 100 would show of course:

failed       30 : rescheduling on application error
exit_status  100

-- Reuti

> I'm not sure if what I'm trying to do makes sense or should I consider a
> different way to do what I am attempting. I can look at the
>
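
The qacct check Reuti describes can be scripted for repeated tests. The following is only a sketch: the function name qacct_outcome is made up for illustration, and it assumes the "failed" and "exit_status" fields appear at the start of a line, as in the output quoted above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine.
# Reads `qacct -j <jobid>` output on stdin and prints only the lines whose
# first field is "failed" or "exit_status".
qacct_outcome() {
    awk '$1 == "failed" || $1 == "exit_status"'
}
```

A possible use would be `qacct -j 1083478 | qacct_outcome`, run once the job has finished and its accounting record has been written.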
Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.
Hi William,

Thanks for the prompt reply. Apologies for not including more detail with
regards to my query concerning getting Grid Engine to force all jobs with an
exit status other than 0, 99 or 100 to error state (i.e. exit code of 100).

As I stated in my earlier post our jobs execute an epilog script which is
named "gp_epilog" at the conclusion of the job running on a given execution
host. The "gp_epilog" essentially does the following:

1. Obtains the "exit_status" value from the execution host's job spool
directory from a file named "usage". As an example, take a look at the
directory listing below from a test job on an execution host with name
"g00801" where the execution host's spool directory is /tmp/ge/. You then
will see the "usage" file. The contents of the "usage" file are shown below
the directory contents. The "exit_status" in the example below is 137.

Directory listing of /tmp/ge/g00801/active_jobs/1012.1

/tmp/ugev841/g00801/active_jobs/1012.1:
total 48
drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
-rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
-rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
-rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
-rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
-rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
-rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
-rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage

Contents of Usage file output !!!
=
wait_status=2193
exit_status=137
signal=0
start_time=1473790362804
end_time=1473790367828
ru_wallclock=5.024000
ru_utime=0.004999
ru_stime=0.001999
ru_maxrss=1828
ru_ixrss=0
ru_idrss=0
ru_isrss=0
ru_minflt=3460
ru_majflt=0
ru_nswap=0
ru_inblock=8
ru_oublock=96
ru_msgsnd=0
ru_msgrcv=0
ru_nsignals=0
ru_nvcsw=73
ru_nivcsw=11

2. Once the value of the "exit_status" is parsed from the "usage" file, the
"gp_epilog" script just does a check to see if the value of "exit_status"
doesn't equal 0, 99 or 100. If it doesn't equal 0, 99 or 100, then the
"gp_epilog" script executes an "exit 100". I'm assuming the "exit_status"
value from the "usage" file is from the application that is from the job/job
tasks that executed on the execution host g00801 from the example I've listed
above. I was thinking that if I issue an "exit 100" from within the
"gp_epilog" script I've got, the job/job task would show up in "error state".
I would see this show up in a "qstat" output with the job/job task showing
a state of "Eqw" or something similar.

I've performed some tests by submitting a basic shell script which dumps the
environment (i.e. env) and performs either an "exit 0", "exit 99", "exit
100", "exit 137" or other exit status codes. If I set my script to "exit 0",
the job exits normally. If I set my script to "exit 99", then the job gets
requeued for execution and if I set my script to "exit 100", the job goes
into error state. All of these scenarios are what I expect based on the man
pages for "queue_conf". However, I am unable to use any other "exit ##",
trap it and force the job to error state by the method I describe.

I'm not sure if what I'm trying to do makes sense or should I consider a
different way to do what I am attempting. I can look at the
"starter_method" to see if this is a viable way.

Thanks in advance.

- Wayne Lee

-----Original Message-----
From: William Hay [mailto:w@ucl.ac.uk]
Sent: Wednesday, September 14, 2016 2:38 AM
To: Lee, Wayne
Cc: users@gridengine.org Group
Subject: Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

On Tue, Sep 13, 2016 at 06:52:53PM +, Lee, Wayne wrote:
>In the epilog script that I've setup for our jobs, I've attempted to
>capture the value of the "exit_status" of a job or job task and if it
>isn't 0, 99 or 100, exit the epilog script with an "exit 100". However
>this doesn't appear to work.

In general when describing an issue or problem it is more helpful to
describe what does happen than what doesn't. The number of things that
didn't happen when you made the epilog script exit 100 is almost infinite.
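
The epilog check Wayne describes in steps 1 and 2 can be sketched roughly as follows. This is not the actual "gp_epilog" script, which is not shown in the thread. The function names are made up for illustration, and the sketch assumes that the SGE_JOB_SPOOL_DIR environment variable is set in the epilog's environment and points at the active_jobs/<job>.<task> directory holding the "usage" file. It also assumes the epilog should exit 0 for job statuses 0, 99 and 100, which the thread does not state.

```shell
#!/bin/sh
# Hypothetical epilog sketch (illustrative only).

# read_exit_status FILE
# Prints the value of the "exit_status=" line from a shepherd "usage" file.
read_exit_status() {
    sed -n 's/^exit_status=//p' "$1"
}

# epilog_code STATUS
# Prints the code the epilog would exit with: 0 when the job status is
# 0, 99 or 100 (assumption), otherwise 100. A missing or empty status is
# treated like any other unexpected value and also yields 100.
epilog_code() {
    case "$1" in
        0|99|100) echo 0 ;;
        *)        echo 100 ;;
    esac
}

# Assumption: SGE_JOB_SPOOL_DIR is exported to the epilog. When it is not
# set (for example while testing the functions by hand) nothing happens.
if [ -n "${SGE_JOB_SPOOL_DIR:-}" ] && [ -r "$SGE_JOB_SPOOL_DIR/usage" ]; then
    job_status=$(read_exit_status "$SGE_JOB_SPOOL_DIR/usage")
    exit "$(epilog_code "$job_status")"
fi
```

With the "usage" file quoted above, read_exit_status would yield 137 and the script would exit 100. Whether that exit code then puts the job into Eqw is exactly the open question in this thread, so the outcome should be checked with qstat and qacct.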
Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.
On Tue, Sep 13, 2016 at 06:52:53PM +, Lee, Wayne wrote:
>In the epilog script that I've setup for our jobs, I've attempted to
>capture the value of the "exit_status" of a job or job task and if it
>isn't 0, 99 or 100, exit the epilog script with an "exit 100". However
>this doesn't appear to work.

In general when describing an issue or problem it is more helpful to
describe what does happen than what doesn't. The number of things that
didn't happen when you made the epilog script exit 100 is almost infinite.

>
>
>
>Another way of stating what I'm trying to convey is if the exit status of a
>job or job task is anything other than 0, 99 or 100 put the job in error
>state. If this can be done, then we would know that a job didn't
>complete correctly and if it is in Eqw state we have the option of
>clearing error state (i.e. qmod -cj) and re-executing the job again.

One possibility would be to write a starter_method that wraps the real job
and does an exit 100 when the job terminates with an exit status other than
0 or 99.

William
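
A minimal sketch of the kind of starter_method wrapper William suggests might look like the following. It assumes that Grid Engine invokes the configured starter_method with the job script and its arguments as the wrapper's own arguments. The function name run_wrapped is made up for illustration, and the sketch ignores the shell-selection variables (such as SGE_STARTER_SHELL_PATH) that a production wrapper may need to honour.

```shell
#!/bin/sh
# Hypothetical starter_method wrapper (illustrative sketch only).

# run_wrapped CMD [ARGS...]
# Runs the command and maps its exit status: 0 and 99 pass through
# unchanged, every other status (including 128+N after a signal) becomes 100.
run_wrapped() {
    status=0
    "$@" || status=$?
    case "$status" in
        0|99) return "$status" ;;
        *)    return 100 ;;
    esac
}

# Assumption: the job script and its arguments arrive as "$@". With no
# arguments (for example when the functions are loaded for testing) the
# wrapper does nothing.
if [ "$#" -gt 0 ]; then
    run_wrapped "$@"
    exit $?
fi
```

The script's path would go into the starter_method attribute of the queue configuration (for example via qconf -mq). A job whose real exit status was, say, 137 would then finish with 100, and once it shows up in Eqw it can be cleared with qmod -cj as described above.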