Also might be of interest:

==============================================================
qname        all.q
hostname     cs431.ftc.avagotech.net
group        fidlib
owner        bgp
project      NONE
department   priority
jobname      qsubcmd.21231
jobnumber    17593
taskid       undefined
account      sge
priority     0
qsub_time    Wed Dec 31 17:00:00 1969
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       11  : before job
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined

On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman <[email protected]> wrote:

> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <[email protected]> wrote:
>
>> Can you set "execd_params" to KEEP_ACTIVE for this host?? (See the
>> manpage at this URL:
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
>>
>> Request the job to run in this queue/host again, and see why the
>> shepherd can't open the job_pid.
>>
>> (And remember to unset the execd_params or else you will fill up your
>> local spool dir eventually with job information.)
>>
> I can't do this on my production grid. And I don't know how to replicate
> the problem currently. I will set things up on a test setup and try and
> reproduce the issue with KEEP_ACTIVE turned on.
>
> Is it possible to set the KEEP_ACTIVE per host? I only see this in the
> qconf -sconf
>
>> Rayson
>>
>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
>> <[email protected]> wrote:
>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]> wrote:
>> >>
>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>> >> <[email protected]> wrote:
>> >> > From the qmaster messages file:
>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
>> >> > cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37
>> >> > [20339:8436]: can't open file job_pid: Permission denied
>> >> >
>> >> > I checked a job_pid file on a currently running job on the system that
>> >> > had the above errors, permission down the entire tree seems fine and
>> >> > here is the job_id file:
>> >> >
>> >> > -rw-r--r-- 1 grid grid 6 Jun 14 17:40 job_pid
>> >>
>> >> Is your execd spool dir on NFS or local??
>> >>
>> > Local.
>> >
>> >> Also, does it happen to all nodes or just a node or queue?
>> >>
>> > Happened on 2 different nodes. Not all jobs caused this.
>> >
>> >> Rayson
>> >>
>> >> > Any clues? Is the path perhaps hard coded into sge_shepherd for this
>> >> > file?
>> >> >
>> >> > Thanks.
>> >> > --
>> >> > -MichaelC
>> >> >
>> >> > _______________________________________________
>> >> > users mailing list
>> >> > [email protected]
>> >> > https://gridengine.org/mailman/listinfo/users
>> >
>> > --
>> > -MichaelC
>
> --
> -MichaelC

--
-MichaelC
