Can you set KEEP_ACTIVE=true in "execd_params" for this host? (See the manpage at this URL: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
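
With KEEP_ACTIVE in effect (the host-local configuration can be edited with `qconf -mconf <hostname>`), the job's directory under the execd spool dir is kept after the failure, so the permissions can be checked after the fact. Below is a minimal sketch in Python of that check: it prints mode and owner for every directory level down to the file. The function name `describe_path` and the default path are just placeholders for illustration; pass the real path to the retained job_pid file on your node. It is probably worth running it as the job owner too, in case the problem only shows up for that user (that is a guess about the cause, nothing in this thread confirms it).

```python
import os
import pwd
import stat
import sys


def describe_path(path):
    """Return (path, mode string, owner) for each level from / down to path."""
    path = os.path.abspath(path)

    # Collect the path and all of its ancestors, up to the filesystem root.
    levels = []
    current = path
    while True:
        levels.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent

    rows = []
    for level in reversed(levels):
        try:
            st = os.stat(level)
        except OSError as exc:
            # A failing stat() here already points at the problem directory.
            rows.append((level, "?", "stat failed: %s" % exc.strerror))
            continue
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name
        except KeyError:
            owner = str(st.st_uid)  # uid with no passwd entry
        rows.append((level, stat.filemode(st.st_mode), owner))
    return rows


if __name__ == "__main__":
    # Placeholder default; give the real path to job_pid as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "job_pid"
    for level, mode, owner in describe_path(target):
        print("%-11s %-10s %s" % (mode, owner, level))
```

A directory in that listing without the execute (search) bit for the user in question would produce the same "Permission denied" even when job_pid itself is readable.
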
Request the job to run in this queue/host again, and see why the shepherd can't open the job_pid. (And remember to unset the execd_params or else you will fill up your local spool dir eventually with job information.)

Rayson

On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman <[email protected]> wrote:
> On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]> wrote:
>>
>> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>> <[email protected]> wrote:
>> > From the qmaster messages file:
>> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
>> > cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37
>> > [20339:8436]: can't open file job_pid: Permission denied
>> >
>> > I checked a job_pid file on a currently running job on the system that
>> > had the above errors, permission down the entire tree seems fine and
>> > here is the job_id file:
>> >
>> > -rw-r--r-- 1 grid grid 6 Jun 14 17:40 job_pid
>>
>> Is your execd spool dir on NFS or local??
>>
> Local.
>
>>
>> Also, does it happen to all nodes or just a node or queue?
>>
>
> Happened on 2 different nodes. Not all jobs caused this.
>
>>
>> Rayson
>>
>> >
>> > Any clues? Is the path perhaps hard coded into sge_shepherd for this
>> > file?
>> >
>> > Thanks.
>> > --
>> > -MichaelC
>> >
>> > _______________________________________________
>> > users mailing list
>> > [email protected]
>> > https://gridengine.org/mailman/listinfo/users
>> >
>
> --
> -MichaelC
