Not yet, no. My plan is to work with Bright to get *at least* the executables off of NFS.
To address something you mentioned earlier on the thread:

> Looking at the error message, I suspect the problem is that SGE_ROOT is
> installed on the head node instead of on the Isilon filesystem. If that's the
> case, when the head node goes down, all the SGE files go with it. If that's
> the case, moving SGE_ROOT to the Isilon filesystem should fix the problem.

No, the $SGE_ROOT is on the Isilon. That includes the global database.

As for when we lost all the jobs, the theory from Bright is that the SGE database had become corrupt, and that one master wasn't able to properly update it. When the second master took over, it didn't have the right information and killed all the jobs it considered invalid. Interestingly, one really old job that had been running for almost a year (!) survived. But if new stuff couldn't get written to the database, I don't see how SGE was functioning at all.

But the more recent problem we have is the intermittent job failure when sge_execd disappears.

Eric

On Nov 18, 2014, at 8:43 AM, Prentice Bisbal wrote:

> Eric,
>
> Did you ever get to the root of this problem?
>
> Prentice
>
> On 11/12/2014 10:26 AM, Peskin, Eric wrote:
>> All,
>>
>> Does SGE have to use NFS or can it work locally on each node?
>> If parts of it have to be on NFS, what is the minimal subset?
>> How much of this changes if you want redundant masters?
>>
>> We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE
>> 2011.11. Specifically, SGE is provided by a Bright package:
>> sge-2011.11-360_cm6.0.x86_64
>>
>> Twice, we have lost all the running SGE jobs when the cluster failed over
>> from one head node to the other. =( Not supposed to happen.
>> Since then, we have also had many individual jobs get lost. The latter
>> situation correlates with messages in the system logs saying
>>
>>> abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd'
>>> seems to be deleted
>> That file lives on an NFS mount on our Isilon storage.
>> Surely, the executables don't have to be on NFS?
>> Interestingly, we are using local spooling; the spool directory on each node
>> is /cm/local/apps/sge/var/spool, which is, indeed, local.
>> But the $SGE_ROOT, /cm/shared/apps/sge/2011.11, lives on NFS.
>> Does any of it need to?
>> Maybe just the var part would need to: /cm/shared/apps/sge/var ?
>>
>> Thanks,
>> Eric
>>
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>
> --
> Prentice