I tried to implement the -notify + trap USR2 solution, but could not get it
to work. I can trap the USR2 signal in the qmaster PE script, but as soon
as it is sent, the slave tasks get killed, leaving my application no time
to cleanly shut them down. The qmaster log displays:

*tightly integrated parallel task 61969.1 task 1.computeXX failed - killing job*


The queue is configured with "notify 00:00:60", so that should leave at
least one minute. I also tried to trap USR2 in the PE script and not
forward it at all to child processes, but the slave tasks still get killed.
Is there something else specific to do to avoid this?
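
For reference, here is a minimal sketch of the kind of USR2 trap described
above. It is not the actual job script: cleanup_slaves is a hypothetical
placeholder for the application's own shutdown logic, and the script signals
itself only so the handler can be exercised outside SGE.

```shell
#!/bin/bash
# Sketch only: cleanup_slaves is a hypothetical placeholder for whatever
# the application needs to do to stop its slave tasks cleanly.
got_usr2=0

cleanup_slaves() {
    # Placeholder: ask the application to shut its slave tasks down here.
    got_usr2=1
}

# With "qsub -notify", SGE sends SIGUSR2 ahead of SIGKILL; the queue's
# "notify" setting is the delay between the two signals.
trap cleanup_slaves USR2

# Stand-in for the real workload: the script signals itself instead of
# waiting for SGE to deliver USR2.
kill -USR2 $$

echo "USR2 handled: $got_usr2"
```

This only shows that the handler fires in the job script itself. It does not
address the slave tasks being killed as soon as the signal is sent.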

2012/9/19 Julien Nicoulaud <[email protected]>

> Yes, that's what I meant. For me, if control_slaves is FALSE, qsub returns
> with a non-zero exit code after h_rt has elapsed.
>
>
> 2012/9/19 Reuti <[email protected]>
>
>> Hi,
>>
>> On 19.09.2012 at 14:36, Julien Nicoulaud wrote:
>>
>> > On SGE 6.2u5, I submit jobs with -sync y and h_rt. When a job gets
>> killed after the time has elapsed, qsub prints an "Unable to run job" message
>> but exits with code 0. I tried to trap the KILL signal
>> > inside the job script, but it does not seem to affect qsub's return code.
>> Is it possible to make it return 1?
>> >
>> > Note: it only behaves this way for jobs running in a tightly integrated
>> parallel environment. In a loosely integrated PE, qsub returns 1 in this
>> case...
>>
>> You mean the setting of "control_slaves"? For me it's always 0 if I
>> request a PE.
>>
>> -- Reuti
>
>
>
