On Wed, 19 Oct 2011 at 11:36 -0000, Reuti wrote:
> Date: Wed, 19 Oct 2011 11:36:28
> From: Reuti <[email protected]>
> To: "Peskin, Eric" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [gridengine users] jobs getting killed (failed assumedly after
> job because: job 311263.1 died through signal KILL (9))
>
> On 17.10.2011 at 19:26, Peskin, Eric wrote:
>
> > At that time on compute-2-13 itself, the file
> > $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
> >
> > 10/13/2011 10:09:37| main|compute-2-13|W|reaping job "332836" ptf
> > complains: Job does not exist
> > 10/13/2011 10:09:37| main|compute-2-13|E|can't open file
> > active_jobs/332836.1/error: No such file or directory
> > 10/13/2011 10:09:37| main|compute-2-13|W|reaping job "332842" ptf
> > complains: Job does not exist
> > 10/13/2011 10:09:37| main|compute-2-13|E|can't open file
> > active_jobs/332842.1/error: No such file or directory
>
> I vaguely remember this issue coming up on the list, but IIRC we never
> found a solution; the problem simply went away again at some point.
>
> Jobs were killed at random, without any apparent reason. I can't find
> the thread right now, though.

It is possible you are thinking of a problem I've seen, but the error
message was different, so it doesn't sound like Eric's(?) problem.

In order to kill jobs on dead nodes, we had tried the following settings:

    % qconf -sconf
    reschedule_unknown           00:15:00
    qmaster_params               ENABLE_RESCHEDULE_KILL=true \
                                 ENABLE_RESCHEDULE_SLAVE=true

This was incorrectly killing other jobs on other nodes, with messages
like:

    07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports
    running job (16648.32/master) in queue "[email protected]" that was not
    supposed to be there - killing
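
In case it helps anyone check whether they are hitting the same thing,
here is a minimal Python sketch that pulls these "not supposed to be
there - killing" entries out of a qmaster messages file. It assumes the
field layout seen in the example above (timestamp|thread|host|level|text),
and the names find_unexpected_kills and Kill are made up for
illustration; nothing here is part of Grid Engine itself.

```python
import re
from typing import Iterator, NamedTuple

# Assumed layout of a messages line: date time|thread|host|level|text
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"\s*(?P<thread>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<text>.*)$"
)
# Job id as it appears in the example above, e.g. "(16648.32/master)"
JOB_RE = re.compile(r"running job \((?P<job>\d+\.\d+)/(?P<role>\w+)\)")
MARKER = "not supposed to be there - killing"


class Kill(NamedTuple):
    """Hypothetical record type: when, which job.task, and its role."""
    stamp: str
    job: str
    role: str


def find_unexpected_kills(log_text: str) -> Iterator[Kill]:
    """Yield one Kill per message that reports an unexpected running job."""
    # Lines pasted from mail are often wrapped, so glue continuation
    # lines (those that do not start with a timestamp) back together.
    entries = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if LINE_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line

    for entry in entries:
        m = LINE_RE.match(entry)
        if not m or MARKER not in m.group("text"):
            continue
        job = JOB_RE.search(m.group("text"))
        if job:
            yield Kill(m.group("stamp"), job.group("job"), job.group("role"))
```

To use it, read the qmaster messages file into a string (on a default
install I would expect it under $SGE_ROOT/$SGE_CELL/spool/qmaster/messages,
but check your own spool location) and loop over
find_unexpected_kills(text) to see which jobs were affected and when.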

I've since removed these settings from our configuration and now deal
with the occasional dead node manually.

Stuart
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone