On 11/12/2014 11:26 AM, Peskin, Eric wrote:
All,

Does SGE have to use NFS or can it work locally on each node?
If parts of it have to be on NFS, what is the minimal subset?
How much of this changes if you want redundant masters?

We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE 
2011.11.  Specifically, SGE is provided by a Bright package: 
sge-2011.11-360_cm6.0.x86_64

Twice, we have lost all the running SGE jobs when the cluster failed over from 
one head node to the other.  =( That's not supposed to happen.
Since then, we have also had many individual jobs get lost.  The latter 
situation correlates with messages in the system logs saying

  abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd' seems to be deleted

That file lives on an NFS mount on our Isilon storage.
Surely, the executables don't have to be on NFS?
Interestingly, we are using local spooling: the spool directory on each node is 
/cm/local/apps/sge/var/spool, which is indeed local.
But $SGE_ROOT, /cm/shared/apps/sge/2011.11, lives on NFS.
Does any of it need to?
Maybe just the var part, /cm/shared/apps/sge/var, would need to?

Thanks,
Eric



I don't have any experience using SGE with redundant masters. Don't you need a shared filesystem to share state between the active master and shadow master?
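
I haven't set that up myself, but my understanding from the docs is that you list the candidate master hosts in a shadow_masters file and run sge_shadowd on the backups; the shadow daemon watches a heartbeat file in the qmaster spool directory, so at least that directory and SGE_ROOT/default/common have to be visible to every candidate master. A rough sketch, with made-up hostnames:

  # on the shared SGE_ROOT: hosts allowed to take over as qmaster
  printf 'head1\nhead2\n' > $SGE_ROOT/default/common/shadow_masters

  # on each backup host: start the shadow daemon
  $SGE_ROOT/bin/linux-x64/sge_shadowd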

The executables don't have to live on NFS. In a typical install with everything shared, or with a shared SGE_ROOT and local spooling, the executables will end up on NFS, but they can be installed locally on each node instead. See http://gridscheduler.sourceforge.net/howto/nfsreduce.html
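
Following that howto, the local-binaries variant boils down to copying the architecture-specific directories to each node and leaving the rest shared. A sketch, with /opt/sge as a made-up local target:

  # copy the binaries from the shared install to local disk on each node
  mkdir -p /opt/sge
  rsync -a /cm/shared/apps/sge/2011.11/bin \
           /cm/shared/apps/sge/2011.11/lib \
           /cm/shared/apps/sge/2011.11/utilbin /opt/sge/
  # then point binary_path at the local copy (the howto covers the details)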

Looking at the error message, I suspect the problem is that SGE_ROOT is installed on the head node instead of on the Isilon filesystem. If that's the case, then when the head node goes down, all the SGE files go with it, and moving SGE_ROOT to the Isilon filesystem should fix the problem.
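
Either way, it's worth checking which filesystem SGE_ROOT actually resolves to on a node, e.g.:

  # source the cluster's settings, then see what backs SGE_ROOT
  . /cm/shared/apps/sge/2011.11/default/common/settings.sh
  df -h $SGE_ROOT   # should show the Isilon NFS export, not a head-node disk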

In the past, I've always had SGE_ROOT on a shared filesystem with local spooling, and never had any issues, even when shutting down or rebooting the head node. Right now, one of my clusters is running OGS 6.2u5 installed from RPM, so there's no filesystem sharing going on at all. It works, but SGE was designed around SGE_ROOT being shared, so if it's not shared, installation and configuration become more complicated.
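
You can confirm which spooling setup a cluster is actually using from the bootstrap file and the global configuration:

  # spooling method and qmaster spool dir are recorded at install time
  grep -E 'spooling_method|qmaster_spool_dir' $SGE_ROOT/default/common/bootstrap
  # the execd spool dir is part of the global configuration
  qconf -sconf | grep execd_spool_dir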

For example, when you configure your sge_qmaster, it writes its settings to a handful of files in SGE_ROOT: settings.(c)sh, act_qmaster, bootstrap, cluster_name. When you run inst_sge to configure your execution hosts, they need access to these files in SGE_ROOT/default/common. I thought defining these in inst_template.conf would be sufficient, but it wasn't.
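
Those files are all small and plain text, so they're easy to inspect when debugging this kind of setup:

  ls $SGE_ROOT/default/common
  # act_qmaster  bootstrap  cluster_name  settings.csh  settings.sh  ...
  cat $SGE_ROOT/default/common/act_qmaster   # one line: the active qmaster's hostname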

I use puppet, so I added all of these configuration files to puppet; now I can run puppet on a new execution host and then run inst_sge to configure the host. I also use a mail wrapper script and sge_aliases, so puppet installs those, too. If you don't use a configuration management system, you can create a tarball containing these files and untar it on each execution host before running inst_sge, but you'll still have to propagate certain config changes to every host manually every time you make a change.
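
The tarball variant might look something like this (hostnames and paths are made up):

  # on the qmaster: capture the common config files
  cd $SGE_ROOT && tar czf /tmp/sge-common.tar.gz default/common

  # on each new exec host, before running inst_sge
  scp qmaster:/tmp/sge-common.tar.gz /tmp/
  cd $SGE_ROOT && tar xzf /tmp/sge-common.tar.gz
  ./inst_sge -x -auto util/install_modules/inst_template.conf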

Given the choice, I prefer shared SGE_ROOT with local spooling.

--
Prentice
