On 13.02.2013 at 16:05, Lars van der bijl wrote:

> On 13 February 2013 15:35, Reuti <[email protected]> wrote:
> On 13.02.2013 at 15:16, Lars van der bijl wrote:
>
> > hey everyone,
> >
> > we always set a v_smem value and catch this so that tasks don't use too much
> > memory. but we want to make sure they fall into an error state because of
> > dependencies.
> >
> > with SGE 8.1.2 we are seeing a lot of our machines not doing this properly.
> >
> > $ qacct -j 10970
> > ...
> > failed       100 : assumedly after job
> > exit_status  152
>
> 152 = 128 + 24, i.e. signal 24 (SIGXCPU)
>
> So this works.
>
> > so we catch the 152 and raise a 100 ourselves but still they get removed
> > from the grid and their dependencies start. anyone have any ideas what
> > could cause this?
>
> How do you catch the signal and raise the error? Were the jobs submitted with
> DRMAA? A simple job like:
>
> we are not using DRMAA. just qsub
> we have a prolog script that checks the exit status of the task and raises its
> own.

You mean epilog - right?

> exit_status=`grep "exit_status" $SGE_JOB_SPOOL_DIR/usage | cut -d'=' -f 2`

It looks like you can't put a job into error state once it has exited by a signal (an `exit 152` doesn't block putting it into error state though). Can you add a line:

    trap 'exit 152' xcpu

to your scripts?

-- Reuti

> we then have a python script that checks the number of retries and exits with
> 99 or 100 based on that.
>
> #!/bin/sh
> trap 'echo got it; exit 100' xcpu
> kill -xcpu $$
>
> is working as expected?
>
> this worked as expected.
>
> -- Reuti
>
> Lars
>
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
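
For anyone reading this later, here is a minimal sketch of the two pieces discussed above: the arithmetic behind 152 = 128 + 24, and the suggested trap that turns the signal into a regular `exit 152`. The helper names `signal_from_status` and `epilog_exit_code` are hypothetical and not part of Grid Engine. The child shell sends XCPU to itself only to simulate the limit being hit, and signal numbers can differ between platforms. A real epilog would presumably take the status from `$SGE_JOB_SPOOL_DIR/usage` as in the quoted line, which this sketch does not do.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# Map an exit status to the signal number it encodes (0 if none).
# Shells report "killed by signal N" as status 128 + N.
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

# Epilog-side decision: 100 is the exit code used in the thread to put
# the job into error state; the 152 check mirrors the status seen there.
epilog_exit_code() {
    if [ "$1" -eq 152 ]; then
        echo 100
    else
        echo 0
    fi
}

# Job-script side: trap XCPU and leave through a regular `exit 152`
# instead of being terminated by the signal. The child shell signals
# itself here only to simulate the limit being reached.
sh -c 'trap "exit 152" XCPU; kill -s XCPU $$'
job_status=$?

echo "job status: $job_status"
echo "signal number: $(signal_from_status "$job_status")"
echo "epilog would exit with: $(epilog_exit_code "$job_status")"
```

Whether the epilog's 100 is honoured after such a regular exit is what the suggested trap line is meant to test, so treat this as an experiment rather than a confirmed fix.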
