Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

2016-09-15 Thread William Hay
On Wed, Sep 14, 2016 at 08:52:12PM +, Lee, Wayne wrote:
> Hi William,
> 
> I've performed some tests by submitting a basic shell script which dumps the 
> environment (i.e. env) and then performs an "exit 0", "exit 99", "exit 100", 
> "exit 137" or some other exit status code.  If I set my script to "exit 0", 
> the job exits normally.  If I set my script to "exit 99", the job gets 
> requeued for execution, and if I set my script to "exit 100", the job goes 
> into error state.  All of these scenarios are what I expect based on the man 
> page for "queue_conf".  However, I am unable to take any other "exit ##", 
> trap it and force the job into error state by the method I describe.
> 
What I was after was what happens when you try.  You've described your setup in 
detail but your results are missing.  When the job exits with, for example, 107 
and the epilog exits 100, what happens?  Does the queue go into an error 
state?

> I'm not sure if what I'm trying to do makes sense or should I consider a 
> different way to do what I am attempting.   I can look at the 
> "starter_method" to see if this is a viable way.

As per my prior message I think using the starter_method as a wrapper will work 
more reliably than tweaking things in the epilog.

William



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

2016-09-14 Thread Reuti

On 14.09.2016 at 22:52, Lee, Wayne wrote:

> Hi William,
> 
> Thanks for the prompt reply.  Apologies for not including more detail 
> regarding my query about getting Grid Engine to force all jobs with an exit 
> status other than 0, 99 or 100 into error state (i.e. exit code of 100).
> 
> As I stated in my earlier post, our jobs execute an epilog script named 
> "gp_epilog" when the job finishes running on a given execution host.  The 
> "gp_epilog" script essentially does the following:
> 
> 1.  Obtains the "exit_status" value from a file named "usage" in the 
> execution host's job spool directory.  As an example, take a look at the 
> directory listing below from a test job on an execution host named "g00801", 
> where the execution host's spool directory is /tmp/ge/.  The listing includes 
> the "usage" file, whose contents are shown below the directory listing.  The 
> "exit_status" in the example below is 137.
> 
> Directory listing of /tmp/ge/g00801/active_jobs/1012.1
> 
> /tmp/ugev841/g00801/active_jobs/1012.1:
> total 48
> drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
> drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
> -rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
> -rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
> prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
> -rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
> -rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
> -rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage
> 
> 
> Contents of Usage file output !!!
> =
> wait_status=2193
> exit_status=137
> signal=0
> start_time=1473790362804
> end_time=1473790367828
> ru_wallclock=5.024000
> ru_utime=0.004999
> ru_stime=0.001999
> ru_maxrss=1828
> ru_ixrss=0
> ru_idrss=0
> ru_isrss=0
> ru_minflt=3460
> ru_majflt=0
> ru_nswap=0
> ru_inblock=8
> ru_oublock=96
> ru_msgsnd=0
> ru_msgrcv=0
> ru_nsignals=0
> ru_nvcsw=73
> ru_nivcsw=11
> 
> 
> 2.  Once the value of "exit_status" has been parsed from the "usage" file, 
> the "gp_epilog" script checks whether it equals 0, 99 or 100.  If it equals 
> none of these, the "gp_epilog" script executes an "exit 100".  I'm assuming 
> the "exit_status" value in the "usage" file comes from the application run by 
> the job/job task on the execution host g00801 in the example above.  I was 
> thinking that if I issue an "exit 100" from within the "gp_epilog" script, 
> the job/job task would go into error state, and that "qstat" output would 
> show the job/job task in state "Eqw" or something similar.
> 
> I've performed some tests by submitting a basic shell script which dumps the 
> environment (i.e. env) and then performs an "exit 0", "exit 99", "exit 100", 
> "exit 137" or some other exit status code.  If I set my script to "exit 0", 
> the job exits normally.  If I set my script to "exit 99", the job gets 
> requeued for execution, and if I set my script to "exit 100", the job goes 
> into error state.  All of these scenarios are what I expect based on the man 
> page for "queue_conf".  However, I am unable to take any other "exit ##", 
> trap it and force the job into error state by the method I describe.

This should work. In the `qacct` output you can even see a mixture of the real 
exit code and the job being rescheduled:

$ qacct -j 1083478
==============================================================
qname        parallel
hostname     node17
...
slots        4
failed       30  : rescheduling on application error
exit_status  56
...

A job that itself exits with 100 would of course show:

failed       30  : rescheduling on application error
exit_status  100


-- Reuti



Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

2016-09-14 Thread Lee, Wayne
Hi William,

Thanks for the prompt reply.  Apologies for not including more detail regarding 
my query about getting Grid Engine to force all jobs with an exit status other 
than 0, 99 or 100 into error state (i.e. exit code of 100).

As I stated in my earlier post, our jobs execute an epilog script named 
"gp_epilog" when the job finishes running on a given execution host.  The 
"gp_epilog" script essentially does the following:

1.  Obtains the "exit_status" value from a file named "usage" in the execution 
host's job spool directory.  As an example, take a look at the directory 
listing below from a test job on an execution host named "g00801", where the 
execution host's spool directory is /tmp/ge/.  The listing includes the "usage" 
file, whose contents are shown below the directory listing.  The "exit_status" 
in the example below is 137.

Directory listing of /tmp/ge/g00801/active_jobs/1012.1

/tmp/ugev841/g00801/active_jobs/1012.1:
total 48
drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
-rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
-rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
-rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
-rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
-rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
-rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
-rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage


Contents of Usage file output !!!
=
wait_status=2193
exit_status=137
signal=0
start_time=1473790362804
end_time=1473790367828
ru_wallclock=5.024000
ru_utime=0.004999
ru_stime=0.001999
ru_maxrss=1828
ru_ixrss=0
ru_idrss=0
ru_isrss=0
ru_minflt=3460
ru_majflt=0
ru_nswap=0
ru_inblock=8
ru_oublock=96
ru_msgsnd=0
ru_msgrcv=0
ru_nsignals=0
ru_nvcsw=73
ru_nivcsw=11


2.  Once the value of "exit_status" has been parsed from the "usage" file, the 
"gp_epilog" script checks whether it equals 0, 99 or 100.  If it equals none of 
these, the "gp_epilog" script executes an "exit 100".  I'm assuming the 
"exit_status" value in the "usage" file comes from the application run by the 
job/job task on the execution host g00801 in the example above.  I was 
thinking that if I issue an "exit 100" from within the "gp_epilog" script, the 
job/job task would go into error state, and that "qstat" output would show the 
job/job task in state "Eqw" or something similar.
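
The check described in steps 1 and 2 could be sketched roughly as follows.  This 
is not the actual "gp_epilog": the function names are made up for illustration, 
and it assumes the epilog's environment provides SGE_JOB_SPOOL_DIR pointing at 
the active_jobs directory listed above.  Whether Grid Engine acts on the 
epilog's "exit 100" in the way hoped for is the open question of this thread.

```shell
#!/bin/sh
# Sketch only: not the real gp_epilog. Function names are hypothetical.

# Print the exit_status value recorded in a shepherd "usage" file.
usage_exit_status() {
    sed -n 's/^exit_status=//p' "$1"
}

# Map the job's exit status to the status the epilog itself returns:
# for 0, 99 and 100 the epilog exits 0; for anything else, including an
# empty (unparsed) value, it exits 100.
epilog_status() {
    case "$1" in
        0|99|100) echo 0 ;;
        *)        echo 100 ;;
    esac
}

# Assumption: SGE_JOB_SPOOL_DIR is set in the epilog's environment and
# contains the "usage" file. Outside Grid Engine this block is skipped.
if [ -n "${SGE_JOB_SPOOL_DIR:-}" ]; then
    job_status=$(usage_exit_status "$SGE_JOB_SPOOL_DIR/usage")
    exit "$(epilog_status "$job_status")"
fi
```

With the "usage" file shown above, usage_exit_status prints 137 and the epilog 
would therefore exit with 100.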

I've performed some tests by submitting a basic shell script which dumps the 
environment (i.e. env) and then performs an "exit 0", "exit 99", "exit 100", 
"exit 137" or some other exit status code.  If I set my script to "exit 0", the 
job exits normally.  If I set my script to "exit 99", the job gets requeued for 
execution, and if I set my script to "exit 100", the job goes into error 
state.  All of these scenarios are what I expect based on the man page for 
"queue_conf".  However, I am unable to take any other "exit ##", trap it and 
force the job into error state by the method I describe.
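
A test job of the kind described here might look like the sketch below.  It is 
an illustration rather than the script actually submitted; the function name 
and the idea of passing the exit status as the first argument are assumptions.

```shell
#!/bin/sh
# Sketch of a test job: dump the environment, then exit with the status
# given as the first argument (0 if none is given).
# The function body is a subshell, so "exit" only leaves the function.
test_job() (
    env
    exit "${1:-0}"
)

# The script's own exit status is that of its last command.
test_job "$@"
```

Submitted as, say, "qsub test_exit.sh 137" (the file name is hypothetical), the 
job would end with exit status 137.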

I'm not sure if what I'm trying to do makes sense or should I consider a 
different way to do what I am attempting.   I can look at the "starter_method" 
to see if this is a viable way.

Thanks in advance.

-
Wayne Lee


-Original Message-
From: William Hay [mailto:w@ucl.ac.uk] 
Sent: Wednesday, September 14, 2016 2:38 AM
To: Lee, Wayne 
Cc: users@gridengine.org Group 
Subject: Re: [gridengine users] Forcing Grid Engine jobs to error state with 
exit status other than 0, 99 or 100.

On Tue, Sep 13, 2016 at 06:52:53PM +, Lee, Wayne wrote:
>In the epilog script that I've set up for our jobs, I've attempted to
>capture the value of the "exit_status" of a job or job task and, if it
>isn't 0, 99 or 100, exit the epilog script with an "exit 100".   However
>this doesn't appear to work.  

In general when describing an issue or problem it is more helpful to describe 
what does happen than what doesn't.  The number of things that didn't happen 
when you made the epilog script exit 100 is almost infinite.


Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

2016-09-14 Thread William Hay
On Tue, Sep 13, 2016 at 06:52:53PM +, Lee, Wayne wrote:
>In the epilog script that I've set up for our jobs, I've attempted to
>capture the value of the "exit_status" of a job or job task and, if it
>isn't 0, 99 or 100, exit the epilog script with an "exit 100".   However
>this doesn't appear to work.  

In general when describing an issue or problem it is more helpful to describe 
what does happen than what doesn't.  The number of things that didn't happen 
when you made the epilog script exit 100 is almost infinite.

> 
> 
> 
>Another way of stating what I'm trying to convey is: if the exit status of a
>job or job task is anything other than 0, 99 or 100, put the job in error
>state.  If this can be done, then we would know that a job didn't
>complete correctly, and if it is in Eqw state we have the option of
>clearing the error state (i.e. qmod -cj) and re-executing the job.

One possibility would be to write a starter_method that wraps the real job and
does an exit 100 when the job terminates with an exit status other than 0 or 
99. 
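
A minimal sketch of such a wrapper, assuming Grid Engine passes the job script 
and its arguments to the starter_method as the wrapper's own arguments (as the 
queue_conf man page describes); map_status is a name made up here.  Depending 
on the site, the job may need to be launched through the shell named in 
SGE_STARTER_SHELL_PATH rather than executed directly.

```shell
#!/bin/sh
# Sketch of a starter_method wrapper; map_status is a hypothetical helper.

# 0 and 99 pass through unchanged; every other status becomes 100.
map_status() {
    case "$1" in
        0|99) echo "$1" ;;
        *)    echo 100 ;;
    esac
}

# Assumption: Grid Engine invokes the starter_method with the job script
# and its arguments as "$@". With no arguments nothing is run.
if [ "$#" -gt 0 ]; then
    status=0
    "$@" || status=$?
    exit "$(map_status "$status")"
fi
```

A job that ends with, say, 137 would then be reported to Grid Engine as having 
exited with 100.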

William 

