Thanks for your full report! By the way, $SGE_ROOT/$SGE_CELL needs to be owned by the sge user (which is different in SoGE). That may be the origin of your permission issue when trying to restore with "inst_sge -rst".
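A minimal sketch for checking that ownership before retrying the restore, assuming a POSIX shell and a standard `find`. The function name `list_wrong_owner` is made up for illustration and is not part of SGE or SoGE, and `<sge-user>` is a placeholder for whatever the admin user is called on your system:

```shell
# Hypothetical helper (not part of SGE/SoGE): print every path under
# DIR that is NOT owned by USER. No output means ownership is uniform.
# Usage: list_wrong_owner DIR USER
list_wrong_owner() {
    dir=$1
    user=$2
    # '! -user' matches entries whose owner differs from USER.
    find "$dir" ! -user "$user" -print
}

# Example (assumes SGE_ROOT and SGE_CELL are set in the environment):
#   list_wrong_owner "$SGE_ROOT/$SGE_CELL" <sge-user>
```

If this prints anything, those paths are candidates for a `chown` to the admin user before running the restore again.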
Best,
Remy

On 11 August 2015 at 21:28, "Michael Stauffer" <mgsta...@gmail.com> wrote:
> Hi,
>
> I posted recently looking for help with migrating from OGS 2011.11p1 to Son
> of Gridengine 8.1.8 (SoGE). This was part of a Rocks cluster upgrade from
> 6.1 to 6.2 (CentOS 6.3 to 6.6). Thanks to all those who helped me out! I
> got it all worked out finally, and below are my notes in the hope that the
> next person who has to do this will have a much easier time.
>
> Because SoGE doesn't exist as a Rocks roll, you have to do the install
> manually.
>
> ==== Preparation ====
>
> ** Back up your OGS/SGE config
>
> Run $SGE_ROOT/util/upgrade_modules/save_sge_config.sh
>
> Make a manual copy of $SGE_ROOT for sanity's sake and for reference to
> things that don't get saved in the above dump.
>
> (Another option is to use '$SGE_ROOT/inst_sge -bup'. This appears to use a
> different backup mechanism. I ran this one as well in case it restored
> better than from the above option. However I had weird permissions trouble
> with the restore using '$SGE_ROOT/inst_sge -rst', so I didn't end up using
> it.)
>
> ==== Uninstall ====
>
> If you're not doing a fresh OS install, you probably want to uninstall OGS,
> or at least stop sgemaster and sge_execd and move $SGE_ROOT to a backup
> location, on the FE and nodes. There might be other uninstall steps, I
> don't know since I did a fresh OS install.
>
> ==== Install SoGE ====
>
> Get SoGE RPMs from here: https://arc.liv.ac.uk/trac/SGE
>
> ** Do a full install and get things running before restoring your previous
> config.
> Below are notes on issues that I encountered.
> Detailed instructions are here, in multiple pages:
> http://www.softpanorama.org/HPC/Grid_engine/
>
>    - Master host:
>       - Installation of Son of Grid Engine 8.1.8 RPMs for Master Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_master_host.shtml>
>       - Installation of Grid Engine Master Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_master_host.shtml>
>    - Execution host
>       - Installation of the Son of Grid Engine 8.1.8 RPMs for Execution Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_execution_host.shtml>
>       - Installation of the Grid Engine Execution Host
>         <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml>
>       - Using the command line installer
>         <http://www.softpanorama.org/HPC/Grid_engine/Installation/using_the_command_line_installer.shtml>
>
> ** Share $SGE_ROOT with compute nodes (execution hosts)?
>
> You'll see discussion about this in the above docs. I decided to share the
> complete $SGE_ROOT from the front end to the nodes via NFS. This is simpler
> and shouldn't cause a problem on my smallish cluster (21 nodes, 332 cores)
> which has 10Gb local switching.
>
> ** install_initd error in ./install_qmaster
>
> Running $SGE_ROOT/install_qmaster gave me an error at the step where the
> sgemaster.<cluster-name> init script is installed.
> While debugging, note that install_initd doesn't output an error, it just
> returns 1 instead of 0.
> I traced this to the dependency lines
>
> # Required-Start: $network $remote_fs
> # Required-Stop: $network $remote_fs
>
> in /etc/init.d/sgemaster.<cluster-name>. For whatever reason it doesn't
> like the $remote_fs dependency.
>
> I changed $SGE_ROOT/util/rctemplates/sgemaster_template to this instead:
>
> # Required-Start: $network $local_fs
> # Required-Stop: $network $local_fs
>
> And then reran install_qmaster. I believe this should be fine since SoGE
> doesn't rely on remote filesystems on my system (just the compute nodes do,
> to mount /opt/sge, but in their init config it doesn't complain about the
> $remote_fs dependency. Go figure.) So far things are working fine.
>
> ==== Restore SGE configurations ====
>
> ** run $SGE_ROOT/util/upgrade_modules/load_sge_config.sh
>
> I had an issue with the hostname on my front end. This script was picking
> up the FQDN and it was conflicting with the local hostname that the script
> wanted to see for qmaster host. I temporarily set the hostname to the local
> one and the script was happy.
>
> ==== Add nodes / execution hosts ====
>
> Here are the steps to add an exec host after it's booted up (I have the
> rpm's added to the rocks distro per the above install instructions, but I'm
> not sure if it's needed with the fully-shared /opt/sge dir). I still need
> to add this to an init script of some sort so it can run automatically as
> part of the rocks distro for the exec hosts.
>
> usermod -u 399 sgeadmin
> groupmod -g 399 sgeadmin
> echo "#manually added" >> /etc/fstab
> echo "<front-end>:/opt/sge /opt/sge nfs defaults,noatime 0 0" >> /etc/fstab
> mount /opt/sge
> . /etc/profile.d/cfn-sge-env.sh #or whatever you call this file for sge env setup
> cp $SGE_ROOT/default/common/sgeexecd /etc/init.d/sgeexecd.<cluster-name>
> /usr/lib/lsb/install_initd /etc/init.d/sgeexecd.<cluster-name>
> service sgeexecd.<cluster-name> start
> ps -ef | grep sge
>
>
> HTH!
>
> -M
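
The exec-host steps quoted above append to /etc/fstab unconditionally, so rerunning them (for example from an init script) would add the same entry again. A minimal sketch of an idempotent variant, assuming a POSIX shell; the function name `add_fstab_line` is hypothetical, and the target file is a parameter so it can be tried on a scratch copy first:

```shell
# Hypothetical helper: append LINE to FILE only if an identical line
# is not already present.
# Usage: add_fstab_line FILE LINE
add_fstab_line() {
    file=$1
    line=$2
    # -x: match the whole line, -F: treat LINE as a fixed string,
    # -q: print nothing and only set the exit status.
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example, using the entry from the notes above:
#   add_fstab_line /etc/fstab "<front-end>:/opt/sge /opt/sge nfs defaults,noatime 0 0"
```

Because the comparison is on the exact line, changing the mount options later would add a second entry rather than update the first, so this only guards against plain reruns.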
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users