I added this yesterday and it seems to be working perfectly. Thanks, Reuti!

On 14 February 2013 16:29, Reuti <[email protected]> wrote:

> Am 14.02.2013 um 14:30 schrieb Lars van der bijl:
>
> >
> > On 14 February 2013 12:50, Reuti <[email protected]> wrote:
> > Am 13.02.2013 um 16:05 schrieb Lars van der bijl:
> >
> > > On 13 February 2013 15:35, Reuti <[email protected]> wrote:
> > > Am 13.02.2013 um 15:16 schrieb Lars van der bijl:
> > >
> > > > hey everyone,
> > > >
> > > > We always set a v_smem value and catch this so that tasks don't use
> > > > too much memory, but we want to make sure they fall into an error state
> > > > because of dependencies.
> > > >
> > > > With SGE 8.1.2 we are seeing a lot of our machines not doing this
> > > > properly.
> > > >
> > > > $ qacct -j 10970
> > > > ...
> > > > failed       100 : assumedly after job
> > > > exit_status  152
> > >
> > > 152 = 128 + 24, and signal 24 is SIGXCPU.
> > >
> > > So this works.
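
To spell out the arithmetic above: a shell reports a process killed by a signal as 128 plus the signal number. A minimal sketch follows; `decode_status` is a hypothetical helper name, not part of SGE, and signal 24 is SIGXCPU on Linux but signal numbers can differ on other platforms.

```shell
#!/bin/sh
# decode_status is a hypothetical helper, not an SGE command.
# Shell convention: a status above 128 means "killed by signal (status - 128)".
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

decode_status 152   # the exit_status reported by qacct above
decode_status 100   # a plain exit code, not a signal
```

In a POSIX shell, `kill -l 152` should also print the name of the corresponding signal.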
> > >
> > > > So we catch the 152 and raise a 100 ourselves, but the jobs still get
> > > > removed from the grid and their dependencies start. Does anyone have
> > > > any idea what could cause this?
> > >
> > > How do you catch the signal and raise the error? Were the jobs
> > > submitted with DRMAA? A simple job like:
> > >
> > > We are not using DRMAA, just qsub.
> > > We have a prolog script that checks the exit status of the task and
> > > raises its own.
> >
> > You mean epilog - right?
> >
> > You're right, epilog.
> >
> >
> >
> > > exit_status=`grep "exit_status" $SGE_JOB_SPOOL_DIR/usage | cut -d'=' -f 2`
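
A rough sketch of how an epilog could use that value, assuming the usage file holds `key=value` lines as the grep/cut pipeline above implies. `read_exit_status` and `epilog_exit_code` are hypothetical names, and mapping every non-zero status to 100 is an assumption for illustration, not necessarily what the epilog in this thread does.

```shell
#!/bin/sh
# Hypothetical helpers for an epilog; the names are illustrative only.

# Print the value of the exit_status line from a usage file of key=value lines.
read_exit_status() {
    grep '^exit_status=' "$1" | cut -d'=' -f 2
}

# Decide what the epilog should exit with: 0 if the task succeeded,
# otherwise 100 to request the error state for the job.
# Assumption: a missing exit_status line is treated as success.
epilog_exit_code() {
    status=$(read_exit_status "$1")
    if [ "${status:-0}" -eq 0 ]; then
        echo 0
    else
        echo 100
    fi
}

# In a real epilog this would be used roughly as:
#   exit "$(epilog_exit_code "$SGE_JOB_SPOOL_DIR/usage")"
```

Anchoring the pattern with `^exit_status=` avoids matching other lines that merely contain the same text.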
> >
> > It looks like you can't put a job into the error state once it has exited
> > because of a signal (an `exit 152` doesn't block putting it into the error
> > state, though).
> >
> > Can you add a line:
> >
> > trap 'exit 152' xcpu
> >
> > to your scripts?
> >
> > I could, but would that make the epilog run on the task correctly? Isn't
> > that what's happening now, given that qacct shows an exit status of 152
> > and my epilog raises a 100?
> > I could understand adding
>
> AFAICS it makes a difference whether you exit on your own with 152 or you
> receive a signal and end up with 152 as 128 + 24. Something else seems to be
> checked.
>
> -- Reuti
>
>
> > trap 'exit 100' xcpu
> >
> > working because it's run in the main thread.
> >
> >
> >
> > -- Reuti
> >
> >
> > > We then have a Python script that checks the number of retries and
> > > exits with 99 or 100 based on that.
> > >
> > >
> > > #!/bin/sh
> > > trap 'echo got it; exit 100' xcpu
> > > kill -xcpu $$
> > >
> > > is working as expected?
> > >
> > > this worked as expected.
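
The difference under discussion can be reproduced outside the grid. A small sketch, assuming a POSIX `sh`: the first child exits on its own with 100 from a trap handler, while the second is killed by the signal, which the parent shell reports as 128 plus the signal number (152 on Linux).

```shell
#!/bin/sh
# Child 1: traps SIGXCPU and exits by itself with status 100.
sh -c 'trap "exit 100" XCPU; kill -s XCPU $$; exit 0'
trapped=$?

# Child 2: no trap, so the signal terminates the process.
sh -c 'kill -s XCPU $$; exit 0'
untrapped=$?

echo "trapped:   $trapped"
echo "untrapped: $untrapped"
```

Both children receive the same signal; only the first one gets to choose its own exit status, which appears to be the case that can still be turned into an error state.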
> > >
> > >
> > >
> > > -- Reuti
> > >
> > >
> > > > Lars
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> >
> >
>
>
