On Wed, 19 Oct 2011 at 11:36 -0000, Reuti wrote:

> Date: Wed, 19 Oct 2011 11:36:28
> From: Reuti <[email protected]>
> To: "Peskin, Eric" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [gridengine users] jobs getting killed (failed assumedly after
>     job because: job 311263.1 died through signal KILL (9))
>
> On 17.10.2011 at 19:26, Peskin, Eric wrote:
>
> > At that time on compute-2-13 itself, the file
> > $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
> >
> > 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332836" ptf complains: Job does not exist
> > 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file active_jobs/332836.1/error: No such file or directory
> > 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332842" ptf complains: Job does not exist
> > 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file active_jobs/332842.1/error: No such file or directory
>
> I vaguely remember this issue from the list. IIRC we never found a
> solution; the problem simply vanished again at some point.
>
> Jobs were being killed at random, for no apparent reason. I can't
> find the thread right now, though.

It is possible you are thinking of a problem I've seen, but the
message was different.  It doesn't sound like Eric's(?) problem.

To kill jobs on dead nodes, we had tried these settings:

    % qconf -sconf
    reschedule_unknown  00:15:00
    qmaster_params      ENABLE_RESCHEDULE_KILL=true \
                        ENABLE_RESCHEDULE_SLAVE=true

This was incorrectly killing other jobs on other nodes with messages
like:

  07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports running job (16648.32/master) in queue "[email protected]" that was not supposed to be there - killing

I've since removed these settings from our configuration and now deal
with the occasional dead node manually.
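
To check whether a qmaster is producing the same kill messages, something
like the following Python sketch could scan a messages file. It assumes the
pipe-separated layout seen in the log lines quoted in this thread
(timestamp|thread|host|level|message). The names parse_line and
find_unexpected_kills are made up for illustration and are not part of
Grid Engine, and it has not been run against a real spool directory.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages-file line into its five fields.

    Assumed layout: timestamp|thread|host|level|message.
    The message may itself contain '|', so split at most four times.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not a line in the assumed format
    return LogEntry(*(p.strip() for p in parts))


# Matches the "not supposed to be there - killing" message quoted above.
KILL_RE = re.compile(
    r"reports running job \((?P<job>\d+)\.(?P<task>\d+)/\w+\) "
    r'in queue "(?P<queue>[^"]+)" '
    r"that was not supposed to be there - killing"
)


def find_unexpected_kills(
    lines: Iterable[str],
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (timestamp, job id, task id, queue) for each kill message."""
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "E":
            continue
        match = KILL_RE.search(entry.message)
        if match:
            yield (
                entry.timestamp,
                match.group("job"),
                match.group("task"),
                match.group("queue"),
            )
```

To use it, open the qmaster messages file (its location depends on your
spool setup) and pass the file object to find_unexpected_kills(). Each
result names a job that the qmaster decided to kill, which can then be
compared against the accounting records for that job.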

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone
