Thanks for the full report! By the way, $SGE_ROOT/$SGE_CELL needs to be
owned by the sge user (which is different in SoGE). That may be the origin
of the permission issue you hit while trying to restore with "inst_sge -rst".
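
A quick way to check this is sketched below. The helper name
check_cell_owner and the account name "sgeadmin" in the example are
assumptions; substitute whatever admin user your SoGE install uses.

```shell
# Print "ok" if DIR is owned by the numeric uid EXPECTED_UID,
# otherwise print a hint and return 1 (2 if DIR cannot be stat'ed).
check_cell_owner() {
    dir=$1
    expected_uid=$2
    actual_uid=$(stat -c '%u' "$dir") || return 2
    if [ "$actual_uid" = "$expected_uid" ]; then
        echo "ok"
    else
        echo "mismatch: $dir is owned by uid $actual_uid, expected $expected_uid"
        return 1
    fi
}

# Example (account name is an assumption):
#   check_cell_owner "$SGE_ROOT/$SGE_CELL" "$(id -u sgeadmin)"
```

If it reports a mismatch, a recursive chown of the cell directory to the
admin user is the likely fix, but look at what owns the files first.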

Best,

Remy
On 11 August 2015 at 21:28, "Michael Stauffer" <mgsta...@gmail.com> wrote:

> Hi,
>
> I posted recently looking for help with migrating from OGS 2011.11p1 to Son
> of Gridengine 8.1.8 (SoGE). This was part of a Rocks cluster upgrade from
> 6.1 to 6.2 (CentOS 6.3 to 6.6). Thanks to all those who helped me out! I
> got it all worked out finally, and below are my notes in the hope that the
> next person who has to do this will have a much easier time.
>
> Because SoGE doesn't exist as a Rocks roll, you have to do the install
> manually.
>
> ==== Preparation ====
>
> ** Back up your OGS/SGE config
>
> Run $SGE_ROOT/util/upgrade_modules/save_sge_config.sh
>
> Make a manual copy of $SGE_ROOT for sanity's sake and reference to things
> that don't get saved in the above dump.
>
> (Another option is to use '$SGE_ROOT/inst_sge -bup'. This appears to use a
> different backup mechanism. I ran this one as well in case it restored
> better than from the above option. However I had weird permissions trouble
> with the restore using '$SGE_ROOT/inst_sge -rst', so I didn't end up using
> it.)
>
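
A minimal sketch of the manual copy step described above. The helper
name backup_tree and the backup paths in the example are assumptions. I
believe save_sge_config.sh takes the target directory as its argument,
but check its usage output before relying on that.

```shell
# Copy the whole tree SRC to DEST/<basename of SRC>-<YYYYMMDD>, preserving
# ownership, permissions and timestamps. Prints the path of the copy.
backup_tree() {
    src=$1
    dest=$2
    target="$dest/$(basename "$src")-$(date +%Y%m%d)"
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    mkdir -p "$dest" && cp -a "$src" "$target" && echo "$target"
}

# Example (paths are assumptions):
#   "$SGE_ROOT/util/upgrade_modules/save_sge_config.sh" /root/sge-backup/config
#   backup_tree "$SGE_ROOT" /root/sge-backup
```

Keeping the config dump and the raw copy side by side makes it easy to
look up anything the dump did not capture.
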
> ==== Uninstall ====
>
> If you're not doing a fresh OS install, you probably want to uninstall OGS,
> or at least stop sgemaster and sge_execd and move $SGE_ROOT to a backup
> location, on the FE and nodes. There might be other uninstall steps; I
> don't know, since I did a fresh OS install.
>
> ==== Install SoGE ====
>
> Get SoGE RPMs from here: https://arc.liv.ac.uk/trac/SGE
>
> ** Do a full install and get things running before restoring your previous
> config.
> Below are notes on issues that I encountered.
> Detailed instructions are here, in multiple pages:
> http://www.softpanorama.org/HPC/Grid_engine/
>
>    - Master host:
>       - Installation of Son of Grid Engine 8.1.8 RPMs for Master Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_master_host.shtml>
>       - Installation of Grid Engine Master Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_master_host.shtml>
>    - Execution host:
>       - Installation of the Son of Grid Engine 8.1.8 RPMs for Execution Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_execution_host.shtml>
>       - Installation of the Grid Engine Execution Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml>
>       - Using the command line installer
>         <http://www.softpanorama.org/HPC/Grid_engine/Installation/using_the_command_line_installer.shtml>
>
> ** Share $SGE_ROOT with compute nodes (execution hosts)?
>
> You'll see discussion about this in the above docs. I decided to share the
> complete $SGE_ROOT from the front end to the nodes via NFS. This is simpler
> and shouldn't cause a problem on my smallish cluster (21 nodes, 332 cores)
> which has 10Gb local switching.
>
> ** install_initd error in ./install_qmaster
>
> Running $SGE_ROOT/install_qmaster gave me an error at the step where the
> sgemaster.<cluster name> init script is installed.
> While debugging, note that install_initd doesn't print an error; it just
> returns 1 instead of 0.
> I traced this to the dependency lines
>
>
>  # Required-Start: $network $remote_fs
>  # Required-Stop: $network $remote_fs
>
> in /etc/init.d/sgemaster.<cluster-name>. For whatever reason it doesn't
> like the $remote_fs dependency.
>
> I changed $SGE_ROOT/util/rctemplates/sgemaster_template to this instead:
>
>  # Required-Start: $network $local_fs
>  # Required-Stop: $network $local_fs
>
> And then reran install_qmaster. I believe this should be fine since SoGE
> doesn't rely on remote filesystems on my system (just the compute nodes do,
> to mount /opt/sge, but in their init config it doesn't complain about the
> $remote_fs dependency. Go figure.) So far things are working fine.
>
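
The template change described above can be scripted roughly as follows.
This is a sketch assuming GNU sed; the helper name patch_lsb_deps and the
".orig" backup suffix are my own choices, not part of SoGE.

```shell
# Replace $remote_fs with $local_fs, but only on the LSB
# "# Required-Start:" and "# Required-Stop:" header lines of FILE.
# A copy of the untouched file is kept as FILE.orig.
patch_lsb_deps() {
    cp -p "$1" "$1.orig" || return 1
    sed -i -E \
        '/^[[:space:]]*#[[:space:]]*Required-(Start|Stop):/ s/\$remote_fs/$local_fs/' \
        "$1"
}

# Example:
#   patch_lsb_deps "$SGE_ROOT/util/rctemplates/sgemaster_template"
```

Restricting the substitution to the two header lines leaves any other
mention of $remote_fs in the template alone.
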
> ==== Restore SGE configurations ====
>
> ** run $SGE_ROOT/util/upgrade_modules/load_sge_config.sh
>
> I had an issue with the hostname on my front end. This script was picking
> up the FQDN and it was conflicting with the local hostname that the script
> wanted to see for qmaster host. I temporarily set the hostname to the local
> one and the script was happy.
>
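
One way to do the temporary hostname switch described above is sketched
here. The helper name short_name is an assumption, and the commented
example assumes load_sge_config.sh is given the directory written by
save_sge_config.sh; check the script's usage before running it.

```shell
# Strip the domain part from a fully qualified host name.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Example, as root on the front end (restores the original name afterwards):
#   orig=$(hostname)
#   hostname "$(short_name "$orig")"
#   "$SGE_ROOT/util/upgrade_modules/load_sge_config.sh" <backup dir>
#   hostname "$orig"
```
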
> ==== Add nodes / execution hosts ====
>
> Here are the steps to add an exec host after it's booted up (I have the
> RPMs added to the Rocks distro per the above install instructions, but I'm
> not sure if that's needed with the fully-shared /opt/sge dir). I still need
> to add this to an init script of some sort so it can run automatically as
> part of the Rocks distro for the exec hosts.
>
> # Set the sgeadmin uid and gid
> usermod -u 399 sgeadmin
> groupmod -g 399 sgeadmin
> # Mount the shared /opt/sge from the front end over NFS
> echo "#manually added" >> /etc/fstab
> echo "<front-end>:/opt/sge                    /opt/sge    nfs    defaults,noatime      0 0" >> /etc/fstab
> mount /opt/sge
> # SGE environment setup (or whatever you call this file)
> . /etc/profile.d/cfn-sge-env.sh
> # Install and start the execd init script
> cp $SGE_ROOT/default/common/sgeexecd /etc/init.d/sgeexecd.<cluster-name>
> /usr/lib/lsb/install_initd /etc/init.d/sgeexecd.<cluster-name>
> service sgeexecd.<cluster-name> start
> # Check that sge_execd is running
> ps -ef | grep sge
>
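
If these steps end up in a script that runs on every boot, the fstab
append should probably be idempotent. A minimal sketch, where the helper
name add_fstab_entry is an assumption and the mount options are the ones
from the notes above:

```shell
# Append an NFS entry for MNT to FSTAB unless a non-comment line
# already uses MNT as its mount point.
add_fstab_entry() {
    fstab=$1
    src=$2
    mnt=$3
    if awk -v m="$mnt" '$1 !~ /^#/ && $2 == m { found = 1 } END { exit !found }' "$fstab"; then
        return 0    # already present, nothing to do
    fi
    printf '%s\t%s\tnfs\tdefaults,noatime\t0 0\n' "$src" "$mnt" >> "$fstab"
}

# Example:
#   add_fstab_entry /etc/fstab "<front-end>:/opt/sge" /opt/sge
```

Running it twice leaves a single entry, so it is safe to call from a
post-install or init script.
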
>
> HTH!
>
> -M
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users