On 14.02.2013 at 14:30, Lars van der bijl wrote:

> 
> On 14 February 2013 12:50, Reuti <[email protected]> wrote:
> On 13.02.2013 at 16:05, Lars van der bijl wrote:
> 
> > On 13 February 2013 15:35, Reuti <[email protected]> wrote:
> > On 13.02.2013 at 15:16, Lars van der bijl wrote:
> >
> > > Hey everyone,
> > >
> > > We always set an s_vmem value and catch the resulting signal so that tasks
> > > don't use too much memory. But we want to make sure they fall into an error
> > > state because of dependencies.
> > >
> > > With SGE 8.1.2 we are seeing that a lot of our machines don't do this properly.
> > >
> > > $ qacct -j 10970
> > > ...
> > > failed       100 : assumedly after job
> > > exit_status  152
> >
> > 152 = 128 + 24, and signal 24 is SIGXCPU.
> >
> > So this works.
> >
> > > So we catch the 152 and raise a 100 ourselves, but the jobs still get removed
> > > from the grid and their dependencies start. Does anyone have any idea what
> > > could cause this?
> >
> > How do you catch the signal and raise the error? Were the jobs submitted 
> > with DRMAA? A simple job like:
> >
> > We are not using DRMAA, just qsub.
> > We have a prolog script that checks the exit status of the task and raises
> > its own.
> 
> You mean epilog - right?
> 
> You're right: epilog.
>  
> 
> 
> > exit_status=`grep "exit_status" $SGE_JOB_SPOOL_DIR/usage | cut -d'=' -f 2`
> 
> It looks like you can't put a job into error state once it exited by a signal 
> (an `exit 152` doesn't block putting it into error state though).
> 
> Can you add a line:
> 
> trap 'exit 152' xcpu
> 
> to your scripts?
> 
> I could, but would that make the epilog run on the task correctly? Isn't that
> what is happening now, given that qacct shows an exit status of 152 and my
> epilog raises a 100?
> I could understand adding

AFAICS it makes a difference whether you exit on your own with 152 or you are
killed by a signal and the 152 results from adding 128 and 24. Something else
seems to be checked.
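
To make that distinction concrete, here is a minimal sketch in plain sh. It is
not taken from Grid Engine itself; the variable names are only illustrative,
and it assumes a Linux host, where SIGXCPU is signal 24. It runs three child
shells: one that exits with 152 on its own, one that is killed by SIGXCPU, and
one that traps SIGXCPU and exits with 100.

```shell
#!/bin/sh
# Case 1: the child exits on its own with status 152.
sh -c 'exit 152'
status_exit=$?

# Case 2: the child is killed by SIGXCPU. The calling shell reports this as
# 128 + the signal number (on Linux typically 128 + 24).
sh -c 'kill -s XCPU $$'
status_signal=$?

# Case 3: the child traps SIGXCPU and turns it into a regular exit with 100.
sh -c 'trap "exit 100" XCPU; kill -s XCPU $$'
status_trap=$?

echo "plain exit 152:   $status_exit"
echo "killed by signal: $status_signal"
echo "trapped, exit 100: $status_trap"
```

To the calling shell, cases 1 and 2 can look identical, because $? carries only
one number. The wait status that the parent process gets from the kernel still
records whether the child exited normally or was terminated by a signal, so
that may be the "something else" that is checked. With the trap in place
(case 3) the job ends with a regular exit.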

-- Reuti


> trap 'exit 100' xcpu 
> 
> working because it's run in the main thread.
> 
>  
> 
> -- Reuti
> 
> 
> > We then have a Python script that checks the number of retries and exits
> > with 99 or 100 based on that.
> >
> >
> > #!/bin/sh
> > trap 'echo got it; exit 100' xcpu
> > kill -xcpu $$
> >
> > is working as expected?
> >
> > this worked as expected.
> >
> >
> >
> > -- Reuti
> >
> >
> > > Lars
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
> 
> 

