Hi,

On 15.12.2014 at 21:04, Sean Smith wrote:

> From: Reuti [[email protected]]
> Sent: Sunday, December 14, 2014 4:36 AM
> To: Sean Smith
> Cc: [email protected]
> Subject: Re: [gridengine users] shepherd of job 267412.1 exited with exit status = 7
>
> Hi,
>
> On 13.12.2014 at 22:16, Sean Smith wrote:
>
> > our old qmaster died and after replacing it we are now faced with these
> > issues.
>
> by "replacing" you mean the hardware? You installed the same version of SGE
> which was in use before?
>
> Yes, the HW died so we rehosted it onto another host. Same version of SW and
> OS.
>
> > These don't happen 100% of the time but most of the time.
> >
> > from the user perspective the job ran all the way to completion but SGE is
> > unhappy and starts marking the Q in the E state...
> >
> > spool/host/messages contains:
> >
> > 12/13/2014 13:03:29| main|dvgrid14|E|shepherd of job 267412.1 exited with exit status = 7
> > 12/13/2014 13:03:29| main|dvgrid14|E|abnormal termination of shepherd for job 267412.1: no "exit_status" file
> > 12/13/2014 13:03:29| main|dvgrid14|E|can't open file active_jobs/267412.1/error: No such file or directory
> > 12/13/2014 13:03:29| main|dvgrid14|E|can't open pid file "active_jobs/267412.1/pid" for job 267412.1
> >
> > I then did a qconf -mconf and added:
> >
> > execd_params KEEP_ACTIVE=true
> >
> > per several google searches:
> >
> > when the job fails I just see these three files: It seems like something is
> > missing??
>
> This sounds like an NFS issue. Do you have a shared spool directory for the
> exechosts or is it local on each of them?
>
> -- Reuti
>
>
> I have a shared spool directory that is exported from the qmaster. I set the
> permissions to 777, and it appears that sgeadmin and root can both create
> files in this hierarchy.

It's not about permissions (in fact, allowing sgeadmin to write and anyone else to read would be sufficient), but about performance.

Changing to a local spool directory is quite easy.
It's just necessary to create something like /var/spool/sge on the exechosts. After adjusting the "execd_spool_dir" setting in `qconf -mconf`, the node-specific subdirectory will be created automatically by sgeexecd when it starts.

https://arc.liv.ac.uk/SGE/howto/nfsreduce.html

-- Reuti

PS: As the NFS share is on your qmaster, in your current setup the job information is first transmitted to the exechost by SGE's protocol and then transferred back to the qmaster machine over NFS.

> Any suggestions?
>
>

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
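
To check on an exechost whether the spool directory really lives on NFS, something along these lines might help. It is only a rough sketch: it assumes a Linux host with /proc/mounts, and the helper names, the sample mount table, the "qmaster:/opt/sge" export and the spool paths are made up for illustration and are not taken from your setup.

```python
import posixpath

# File system types treated as "network" here; an assumption, extend as needed.
NETWORK_FS = {"nfs", "nfs4"}


def fstype_for_path(path, mounts_text):
    """Return (mount_point, fstype) of the deepest mount containing path.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    Mount points with escaped characters (e.g. \\040 for a space) are
    not decoded in this sketch.
    """
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = posixpath.normpath(fields[1]), fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point win,
            # because it shadows the earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype)
    return best


def is_network_spool(path, mounts_text):
    """True if the deepest mount containing path is a network file system."""
    best = fstype_for_path(path, mounts_text)
    return best is not None and best[1] in NETWORK_FS


# Hypothetical mount table, for illustration only.
SAMPLE_MOUNTS = """
/dev/sda1 / ext4 rw,relatime 0 0
qmaster:/opt/sge /opt/sge nfs rw,vers=3 0 0
"""

if __name__ == "__main__":
    # Hypothetical candidates: a shared spool below /opt/sge and a local one.
    for spool in ("/opt/sge/default/spool/dvgrid14", "/var/spool/sge"):
        print(spool, fstype_for_path(spool, SAMPLE_MOUNTS),
              "network" if is_network_spool(spool, SAMPLE_MOUNTS) else "local")
```

On a real exechost you would pass the contents of /proc/mounts instead of SAMPLE_MOUNTS, together with the directory currently set as execd_spool_dir.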
