I am trying to update my sge_execd and sge_shepherd binaries. Based on recent emails, I figured I could drop the GE2011.11 bits into place and they would work fine. I am however having issues:
My grid environment is:

  Current version: SGE 6.2u5
  SGE_CELL=ftcrnd
  SGE_ROOT=/opt/grid-6.2u5
  SGE_CLUSTER_NAME=ftcrnd

The binary path is /opt/grid/bin/lx24-amd64. I had to make a link in /opt/grid/bin for linux-x64 to get things to work.

I used the following commands, and it did indeed update live. The running processes seemed happy and everything seemed to be working fine:

  gbits=/opt/sa/tmp/gbits
  service sgeexecd softstop
  cd /opt/grid/bin
  ln -s lx24-amd64 linux-x64
  cd lx24-amd64
  mv sge_shepherd sge_shepherd.old
  mv sge_execd sge_execd.old
  cp $gbits/sge_shepherd .
  cp $gbits/sge_execd .
  service sgeexecd start

Since yesterday, though, I have had a couple of jobs fail and put their queue into an error state.

Mail from the failing job:

  Shepherd error:
  06/14/2012 21:29:37 [20339:8436]: can't open file job_pid: Permission denied

From the qmaster messages file:

  06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37 [20339:8436]: can't open file job_pid: Permission denied

I checked the job_pid file of a currently running job on the system that had the above errors. Permissions down the entire tree seem fine, and here is the job_pid file:

  -rw-r--r-- 1 grid grid 6 Jun 14 17:40 job_pid

Any clues? Is the path to this file perhaps hard-coded into sge_shepherd?

Thanks.

-- 
-MichaelC
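A minimal sketch of one way to check the permissions "down the entire tree" mentioned above. The helper name check_tree and the placeholder path in the usage comment are hypothetical and do not come from the original post. The real location of job_pid depends on the execd spool directory configured for the host.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): print owner, group and
# mode for a file and for each of its parent directories, so that a
# component the shepherd cannot traverse or write to stands out.
check_tree() {
    p=$1
    # Walk upwards until the root ("/") or the current directory (".").
    while [ -n "$p" ] && [ "$p" != "/" ] && [ "$p" != "." ]; do
        ls -ld "$p"
        p=$(dirname "$p")
    done
}

# Usage (placeholder path, substitute the real job directory):
#   check_tree /path/to/execd_spool/active_jobs/<job>.<task>/job_pid
```

Running it as the user the job runs under, as well as the admin user, may show whether the two see different results for the same path.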
