On 11/12/2014 11:26 AM, Peskin, Eric wrote:
All,

Does SGE have to use NFS or can it work locally on each node?
If parts of it have to be on NFS, what is the minimal subset?
How much of this changes if you want redundant masters?

We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE 
2011.11.  Specifically, SGE is provided by a Bright package: 
sge-2011.11-360_cm6.0.x86_64

Twice, we have lost all the running SGE jobs when the cluster failed over from 
one head node to the other.  =( That's not supposed to happen.
Since then, we have also had many individual jobs get lost.  The latter 
situation correlates with messages in the system logs saying

  abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd' seems to be deleted

That file lives on an NFS mount on our Isilon storage.
Surely, the executables don't have to be on NFS?
Interestingly, we are using local spooling: the spool directory on each node is 
/cm/local/apps/sge/var/spool, which is indeed local.
But $SGE_ROOT, /cm/shared/apps/sge/2011.11, lives on NFS.
Does any of it need to?
Maybe just the var part, /cm/shared/apps/sge/var, would need to?

Thanks,
Eric



I don't have any experience using SGE with redundant masters. Don't you need a shared filesystem to share state between the active master and shadow master?
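
I haven't set that up myself, but my understanding from the docs is that you list the candidate master hosts in a shadow_masters file and run sge_shadowd on the backups; the shadow daemon watches a heartbeat file in the qmaster spool directory, so at least that directory and SGE_ROOT/default/common have to be visible to every candidate master. A rough sketch, with made-up hostnames:

  # on the shared SGE_ROOT: hosts allowed to take over as qmaster
  printf 'head1\nhead2\n' > $SGE_ROOT/default/common/shadow_masters

  # on each backup host: start the shadow daemon
  $SGE_ROOT/bin/linux-x64/sge_shadowd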

The executables don't have to live on NFS. In a typical install with everything shared, or with a shared SGE_ROOT and local spooling, the executables will end up on NFS, but they can be installed locally on each node instead. See http://gridscheduler.sourceforge.net/howto/nfsreduce.html
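
Following that howto, the local-binaries variant boils down to copying the architecture-specific directories to each node and leaving the rest shared. A sketch, with /opt/sge as a made-up local target:

  # copy the binaries from the shared install to local disk on each node
  mkdir -p /opt/sge
  rsync -a /cm/shared/apps/sge/2011.11/bin \
           /cm/shared/apps/sge/2011.11/lib \
           /cm/shared/apps/sge/2011.11/utilbin /opt/sge/
  # then point binary_path at the local copy (the howto covers the details)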

Looking at the error message, I suspect the problem is that SGE_ROOT is installed on the head node instead of on the Isilon filesystem. If that's the case, then when the head node goes down, all the SGE files go with it, and moving SGE_ROOT to the Isilon filesystem should fix the problem.
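
Either way, it's worth checking which filesystem SGE_ROOT actually resolves to on a node, e.g.:

  # source the cluster's settings, then see what backs SGE_ROOT
  . /cm/shared/apps/sge/2011.11/default/common/settings.sh
  df -h $SGE_ROOT   # should show the Isilon NFS export, not a head-node disk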

In the past, I've always had SGE_ROOT on a shared filesystem with local spooling, and never had any issues, even when shutting down or rebooting the head node. Right now, one of my clusters is running OGS 6.2u5 installed from RPM, so there's no filesystem sharing going on at all. It works, but SGE was designed around SGE_ROOT being shared, so if it's not shared, installation and configuration become more complicated.
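
You can confirm which spooling setup a cluster is actually using from the bootstrap file and the global configuration:

  # spooling method and qmaster spool dir are recorded at install time
  grep -E 'spooling_method|qmaster_spool_dir' $SGE_ROOT/default/common/bootstrap
  # the execd spool dir is part of the global configuration
  qconf -sconf | grep execd_spool_dir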

For example, when you configure your sge_qmaster, it writes its settings to a handful of files in SGE_ROOT: settings.(c)sh, act_qmaster, bootstrap, cluster_name. When you run inst_sge to configure your execution hosts, they need access to these files in SGE_ROOT/default/common. I thought defining these in inst_template.conf would be sufficient, but it wasn't.
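
Those files are all small and plain text, so they're easy to inspect when debugging this kind of setup:

  ls $SGE_ROOT/default/common
  # act_qmaster  bootstrap  cluster_name  settings.csh  settings.sh  ...
  cat $SGE_ROOT/default/common/act_qmaster   # one line: the active qmaster's hostname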

I use puppet, so I added all of these configuration files to puppet; now I can run puppet on a new execution host and then run inst_sge to configure the host. I also use a mail wrapper script and sge_aliases, so puppet installs those, too. If you don't use a configuration management system, you can create a tarball containing these files and untar it on each execution host before running inst_sge, but you'll still have to propagate certain config changes to every host manually every time you make a change.
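
The tarball variant might look something like this (hostnames and paths are made up):

  # on the qmaster: capture the common config files
  cd $SGE_ROOT && tar czf /tmp/sge-common.tar.gz default/common

  # on each new exec host, before running inst_sge
  scp qmaster:/tmp/sge-common.tar.gz /tmp/
  cd $SGE_ROOT && tar xzf /tmp/sge-common.tar.gz
  ./inst_sge -x -auto util/install_modules/inst_template.conf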

Given the choice, I prefer shared SGE_ROOT with local spooling.

--
Prentice
