Hi,

On 13.12.2014 at 22:16, Sean Smith wrote:
> our old qmaster died and after replacing it we are now faced with these
> issues.

By "replacing", do you mean the hardware? Did you install the same version of SGE that was in use before?

> These don't happen 100% of the time but most of the time.
>
> from the user perspective job ran all the way to completion but SGE is
> unhappy and starts marking the Q in the E State...
>
> spool/host/messages contains:
>
> 12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with
> exit status = 7
> 12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for
> job 267412.1: no "exit_status" file
> 12/13/2014 13:03:29| main|dvgrid14|E|can't open file
> active_jobs/267412.1/error: No such file or directory
> 12/13/2014 13:03:29| main|dvgrid14|E|can't open pid file
> "active_jobs/267412.1/pid" for job 267412.1
>
> I then did a qconf -mconf and added:
>
> execd_params KEEP_ACTIVE=true
>
> per several google searches:
>
> when the job fails I just see these three files: It seems like something is
> missing??

This sounds like an NFS issue. Do you have a shared spool directory for the exechosts, or is it local on each of them?

-- Reuti

> ✔ /sge/ge-2011.11p1/colo/spool/dvgrid14/active_jobs/267412.1
> 13:09 $ ll
> total 28
> -rw-r--r-- 1 sgeadmin it-group  2204 Dec 13 13:03 config
> -rw-r--r-- 1 sgeadmin it-group 16806 Dec 13 13:03 environment
> -rw-r--r-- 1 sgeadmin it-group    59 Dec 13 13:03 pe_hostfile
>
> Why is there no error or PID File?
>
> Any suggestions I'm stuck....
>
> Sean
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
