Re: [gridengine users] 152 kills tasks.

Reuti Thu, 14 Feb 2013 03:51:35 -0800

On 13.02.2013 at 16:05, Lars van der bijl wrote:

> On 13 February 2013 15:35, Reuti <[email protected]> wrote:
> On 13.02.2013 at 15:16, Lars van der bijl wrote:
> 
> > hey everyone,
> >
> > we always set a v_smem value and catch this so that tasks don't use too much 
> > memory. but we want to make sure they fall into an error state because of 
> > dependencies.
> >
> > with SGE 8.1.2 we are seeing a lot of our machines not doing this properly.
> >
> > $ qacct -j 10970
> > ...
> > failed       100 : assumedly after job
> > exit_status  152
> 
> 152 = 128 + 24, and signal 24 is SIGXCPU.
> 
> So this works.
> 
> > so we catch the 152 and raise a 100 ourselves but still they get removed 
> > from the grid and their dependencies start. anyone have any ideas what 
> > could cause this?
> 
> How do you catch the signal and raise the error? Were the jobs submitted with 
> DRMAA? A simple job like:
> 
> we are not using DRMAA. just qsub
> we have a prolog script that checks the exit status of the task and raises its 
> own.


You mean epilog - right?


> exit_status=`grep "exit_status" $SGE_JOB_SPOOL_DIR/usage | cut -d'=' -f 2`

It looks like you can't put a job into error state once it was terminated by a 
signal (a plain `exit 152` doesn't prevent putting it into error state, though).
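
For reference, a status like the 152 above can be decoded by reversing the usual shell convention of 128 plus the signal number. The following is only a minimal sketch: `decode_status` is a hypothetical helper name, not part of SGE, and the mapping of 24 to SIGXCPU assumes a Linux host.

```shell
#!/bin/sh
# decode_status is a hypothetical helper, not part of SGE.
# It applies the shell convention: status > 128 means "128 + signal number".
decode_status() {
    if [ "$1" -gt 128 ]; then
        signum=$(( $1 - 128 ))
        # kill -l <number> prints the name of that signal (without "SIG")
        echo "signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exit code $1"
    fi
}

decoded=$(decode_status 152)
echo "$decoded"
```

Note that the status alone does not tell you whether the process was really killed by the signal or ended with a plain `exit 152`; both look the same to a calling shell.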

Can you add a line:

trap 'exit 152' xcpu

to your scripts?
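
A rough way to try this outside the cluster, as a sketch only: the job body below is a hypothetical stand-in (the `kill` merely simulates the CPU limit being hit), and it is run as a child process so that its exit status can be inspected afterwards.

```shell
#!/bin/sh
# Hypothetical stand-in for a job script; only the trap line comes from
# the suggestion above, the rest is for illustration.
job='
trap "exit 152" XCPU      # turn the signal into an ordinary exit
kill -s XCPU $$           # simulate the CPU limit being hit
sleep 5                   # placeholder workload, not reached
'

# Run the job as a separate shell so that $$ inside it is the job itself.
status=0
sh -c "$job" || status=$?
echo "job exit status: $status"
```

With the trap in place the script leaves through a regular `exit` instead of being terminated by the signal, which is the case that apparently still allows the error state to be set.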

-- Reuti


> we then have a python script that checks the number of re-tries and exits with 
> 99 or 100 based on that. 
> 
> 
> #!/bin/sh
> trap 'echo got it; exit 100' xcpu
> kill -xcpu $$ 
> 
> is working as expected?
> 
> this worked as expected.
> 
>  
> 
> -- Reuti
> 
> 
> > Lars
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
> 
> 

