On 10/02/2013 07:00 PM, Dave Love wrote:
Lionel SPINELLI <[email protected]> writes:
Hello all,
I have a question that is not directly about SGE but is related to
it. Which tools do administrators use to install, manage, configure
and keep a lot of grid nodes consistent with each other? I mean,
if I have 10 nodes in my grid and need to be sure that all of them
have the right software/configuration, I don't want to manually
configure each machine.
It seems to be a religious topic... My requirements for managing node
images are:
1. free software
2. stateless image (NFS root + local /tmp; modify the shared root
more-or-less directly)
3. support for heterogeneous systems with different images for multiple
OSes and customizing for different node groups with a single image
4. decoupled from the OS (not living somewhat in its own world, like
Rocks) so you do normal package management
When I had to pick one swiftly, the only one it was clear would do
3. properly was oneSIS <http://www.onesis.org>, though probably others
can. I've run a 250-node horrible mess of hardware as a shared
everything cluster with oneSIS off a single NFS server. I recently
replaced a vendor's useless imaging scheme with it for the second time.
Do you know a simple tool that could do the job? My research led me
to "Puppet Master", but I would like to get advice from experts...
I'm not convinced that's appropriate for an HPC cluster, but people with
more HPC experience disagree.
You need tools apart from image management, of course.
Dave's right, this is a religious topic. I see Dave's point of using
something not coupled to the OS. I, however, have always used RHEL
derivatives, so I've just used the combination of Kickstart with DHCP
and PXE booting, and it has served me well. With Kickstart, it's not too
hard with some basic pre- and post-scripting to come up with some
different configuration options if you need different images installed
on different machines.
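
To make the per-machine choice concrete, here is a minimal sketch in
Python of a lookup that maps a node's hostname to a Kickstart file. It
is an illustration only, not the setup described above: the hostname
patterns, the file names and the function select_profile() are all
hypothetical, and in practice the choice might live in the DHCP/PXE
configuration or in a %pre script instead.

```python
from fnmatch import fnmatch

# Hypothetical node groups and Kickstart file names, for illustration only.
# Order matters: the first matching pattern wins, so the catch-all goes last.
PROFILES = [
    ("gpu*", "gpu.ks"),
    ("login*", "login.ks"),
    ("*", "compute.ks"),
]


def select_profile(hostname, profiles=PROFILES):
    """Return the Kickstart file name for the first pattern matching hostname."""
    for pattern, kickstart_file in profiles:
        if fnmatch(hostname, pattern):
            return kickstart_file
    raise LookupError("no Kickstart profile matches " + hostname)
```

With this table, select_profile("gpu03") picks the GPU profile, and any
hostname that matches no specific group falls through to the compute
profile.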
For configuration management of a cluster, something like puppet can be
overkill. In the past, I kept all my config files on a webserver only
accessible to the cluster nodes, and then used a post-install script to
wget all the config files needed. This was only about 10 - 20 files, so
a simple for-loop kept it manageable. If I ever needed to update config
files, I used a parallel front end to ssh (there are several good ones
out there) to execute wget across all nodes and restart any services as
necessary. I've seen others accomplish the same thing with rdist.
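
The fetch loop described above can be sketched roughly as follows. This
is a Python stand-in for the wget for-loop, not the original script: the
server URL, the file list and the function names plan_downloads() and
fetch_all() are assumptions made up for the example.

```python
import os
import urllib.request

# Hypothetical server and file list, for illustration only.
CONFIG_SERVER = "http://config.example.internal/cluster"
CONFIG_FILES = ["/etc/ntp.conf", "/etc/resolv.conf", "/etc/ssh/sshd_config"]


def plan_downloads(server, paths):
    """Pair each target path with the URL it would be fetched from."""
    base = server.rstrip("/")
    return [(base + path, path) for path in paths]


def fetch_all(plan, root="/", opener=urllib.request.urlopen):
    """Download every planned file under root, like a wget for-loop.

    `opener` is injectable so the loop can be exercised without a network.
    """
    written = []
    for url, path in plan:
        dest = os.path.join(root, path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with opener(url) as response:
            data = response.read()
        with open(dest, "wb") as handle:
            handle.write(data)
        written.append(dest)
    return written
```

A post-install script would call
fetch_all(plan_downloads(CONFIG_SERVER, CONFIG_FILES)) once; pushing an
update later means running the same call on every node through whatever
parallel ssh front end you prefer, then restarting the affected
services.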
Some of you might be rolling your eyes thinking this is a lot of work, but
it really isn't, and my clusters have been pretty static once they're up
and running, so it's not often I need to make any configuration changes.
I've used puppet in many other situations, and I'm going to start using
it on my clusters, too. That makes the kickstart post-install script
one line - a single call to puppet.
If you decide to use puppet this way on your cluster, you don't want to
have the daemon running all the time. If you do, the daemon will check
in every 30 minutes by default, and slow down your jobs. It's best to
run puppet as a cron job with the --onetime flag (together with
--no-daemonize) only once a day or so, or use a parallel front end to
ssh to run puppet with --onetime only as needed.
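
As a rough sketch of the cron approach, the Python below builds an
/etc/cron.d style entry for a single daily run and wraps the same
command for on-demand use. The command line "puppet agent --onetime
--no-daemonize" is the usual one-shot invocation (older releases call
the binary puppetd); the schedule of 03:15, the root user and the
function names cron_line() and run_once() are assumptions for the
example.

```python
import subprocess

# One-shot agent run: apply the catalog once and exit, without daemonizing.
PUPPET_ONCE = ["puppet", "agent", "--onetime", "--no-daemonize"]


def cron_line(hour=3, minute=15, user="root"):
    """Build an /etc/cron.d entry that runs puppet once a day.

    The default time and user are placeholders, not a recommendation.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"{minute} {hour} * * * {user} {' '.join(PUPPET_ONCE)}"


def run_once(runner=subprocess.run):
    """Run puppet a single time; `runner` is injectable for testing."""
    return runner(PUPPET_ONCE, check=False)
```

Writing the result of cron_line() to a file under /etc/cron.d gives the
once-a-day behaviour; run_once() is what you would trigger over ssh when
a change has to go out immediately.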
That's how I do it.
Prentice
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users