On Fri, Jun 15, 2012 at 12:31 PM, Rayson Ho <[email protected]> wrote:

> On Fri, Jun 15, 2012 at 1:46 PM, Michael Coffman
> <[email protected]> wrote:
> > Also might be of interest:
>
> Thanks... Also, any messages in the execd "messages" file??
>
>
06/14/2012 08:56:49|  main|cs431|E|shepherd of job 9990340.1 exited with exit status = 11
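
For anyone who wants to scan an execd "messages" file for lines like the one above, here is a minimal Python sketch. It assumes the pipe-delimited layout seen in this thread (timestamp, thread, host, level, text). The field names, `parse_message_line`, and `shepherd_exit_status` are my own labels for illustration, not anything Grid Engine defines.

```python
import re
from typing import NamedTuple, Optional


class MessageLine(NamedTuple):
    # Field names are assumptions based on the lines quoted in this thread.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> MessageLine:
    """Split one 'messages' line on the first four '|' separators."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected messages line: {line!r}")
    return MessageLine(*(p.strip() for p in parts))


def shepherd_exit_status(msg: MessageLine) -> Optional[int]:
    """Return the shepherd exit status if the text reports one, else None."""
    match = re.search(r"exit status = (\d+)", msg.text)
    return int(match.group(1)) if match else None


line = (
    "06/14/2012 08:56:49|  main|cs431|E|"
    "shepherd of job 9990340.1 exited with exit status = 11"
)
msg = parse_message_line(line)
exit_status = shepherd_exit_status(msg)
print(msg.host, msg.level, exit_status)
```

Splitting with a maximum of four separators keeps any `|` that might appear inside the message text intact.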


> Rayson
>
>
>
>
> >
> > ==============================================================
> > qname        all.q
> > hostname     cs431.ftc.avagotech.net
> > group        fidlib
> > owner        bgp
> > project      NONE
> > department   priority
> > jobname      qsubcmd.21231
> > jobnumber    17593
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Wed Dec 31 17:00:00 1969
> > start_time   -/-
> > end_time     -/-
> > granted_pe   NONE
> > slots        0
> > failed       11  : before job
> > exit_status  0
> > ru_wallclock 0
> > ru_utime     0.000
> > ru_stime     0.000
> > ru_maxrss    0
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    0
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   0
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     0
> > ru_nivcsw    0
> > cpu          0.000
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      0.000
> > arid         undefined
> >
> >
> >
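Two fields in the accounting record above look worth checking: `failed` is 11 ("before job"), and `qsub_time` reads like Unix timestamp 0 printed in a UTC-7 time zone, which may mean the submission time was never recorded. The sketch below is a rough way to pull both out of a `qacct -j` style record. The UTC-7 offset is an assumption about the host's time zone, and `parse_qacct_record` is a hypothetical helper, not part of Grid Engine.

```python
from datetime import datetime, timedelta, timezone


def parse_qacct_record(text: str) -> dict:
    """Turn the 'key   value' lines of one accounting record into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue  # skip blank lines and the ===== separator
        key, _, value = line.partition(" ")
        record[key] = value.strip()
    return record


# A few lines copied from the record quoted in this thread.
sample = """\
==============================================================
qname        all.q
hostname     cs431.ftc.avagotech.net
jobnumber    17593
qsub_time    Wed Dec 31 17:00:00 1969
failed       11  : before job
exit_status  0
"""

rec = parse_qacct_record(sample)

# 'failed' holds a numeric code, then ':' and a short reason.
code, _, reason = rec["failed"].partition(":")
failed_code = int(code)
failed_reason = reason.strip()

# Timestamp 0 rendered in UTC-7 (assumed zone) for comparison with qsub_time.
assumed_zone = timezone(timedelta(hours=-7))
epoch_text = datetime.fromtimestamp(0, tz=assumed_zone).strftime(
    "%a %b %d %H:%M:%S %Y"
)

print(failed_code, failed_reason)
print(rec["qsub_time"] == epoch_text)
```

If the comparison holds on your side too, the odd 1969 date is just an unset timestamp rather than a clock problem on the node.
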
> > On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman
> > <[email protected]> wrote:
> >>
> >> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <[email protected]> wrote:
> >>>
> >>> Can you set "execd_params" to KEEP_ACTIVE for this host?? (See the
> >>> manpage at this URL:
> >>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
> >>>
> >>> Request the job to run in this queue/host again, and see why the
> >>> shepherd can't open the job_pid.
> >>>
> >>> (And remember to unset the execd_params or else you will fill up your
> >>> local spool dir eventually with job information.)
> >>>
> >>
> >> I can't do this on my production grid, and I don't currently know how to
> >> replicate the problem. I will set things up on a test grid and try to
> >> reproduce the issue with KEEP_ACTIVE turned on.
> >>
> >> Is it possible to set KEEP_ACTIVE per host? I only see it in the output
> >> of qconf -sconf.
> >>
> >>>
> >>> Rayson
> >>>
> >>>
> >>>
> >>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
> >>> <[email protected]> wrote:
> >>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]>
> >>> > wrote:
> >>> >>
> >>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
> >>> >> <[email protected]> wrote:
> >>> >> > From the qmaster messages file:
> >>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37 [20339:8436]: can't open file job_pid: Permission denied
> >>> >> >
> >>> >> > I checked the job_pid file of a currently running job on the system
> >>> >> > that had the above errors. Permissions down the entire tree seem
> >>> >> > fine, and here is the job_pid file:
> >>> >> >
> >>> >> > -rw-r--r-- 1 grid  grid       6 Jun 14 17:40 job_pid
> >>> >>
> >>> >> Is your execd spool dir on NFS or local??
> >>> >>
> >>> > Local.
> >>> >
> >>> >>
> >>> >> Also, does it happen on all nodes, or just one node or queue?
> >>> >>
> >>> >
> >>> > Happened on 2 different nodes.   Not all jobs caused this.
> >>> >
> >>> >>
> >>> >> Rayson
> >>> >>
> >>> >>
> >>> >>
> >>> >> >
> >>> >> > Any clues? Is the path perhaps hard-coded into sge_shepherd for this
> >>> >> > file?
> >>> >> >
> >>> >> > Thanks.
> >>> >> > --
> >>> >> > -MichaelC
> >>> >> >
> >>> >> > _______________________________________________
> >>> >> > users mailing list
> >>> >> > [email protected]
> >>> >> > https://gridengine.org/mailman/listinfo/users
> >>> >> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > -MichaelC
> >>
> >>
> >>
> >>
> >> --
> >> -MichaelC
> >
> >
> >
> >
> > --
> > -MichaelC
>



-- 
-MichaelC
