First of all, thank you for the answers.
I have a few more questions, added below.

What is the behavior in case a node dies or becomes unreachable?
Your run will be aborted. However, there is checkpoint/restart support for Linux:
http://www.open-mpi.org/faq/?category=ft

As this is a Win32 program, I'll have to take into account that there is only 
the < abort > behavior.

What makes any given machine become a node available for tasks?
You define it in a host file, or a batch system tells Open MPI which nodes to use.

So there is no dynamic discovery of the nodes available on the network. Unless, 
of course, I were to write a tool that does it before the actual run is 
started.
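
To make the idea concrete, here is a minimal sketch of such a pre-run tool. It is only an illustration under my own assumptions: the function names are hypothetical, and probing TCP port 22 is just a placeholder for whatever service the launcher actually relies on (on Win32 it will most likely be a different port). The only Open MPI-specific part is the output format, one "hostname slots=N" line per machine, as used in host files.

```python
import socket


def is_reachable(host, port=22, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    The default port is an assumption; use the port of the service that
    the MPI launcher needs on each machine.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def build_hostfile(candidates, probe=is_reachable):
    """Return host file text listing only the candidates that answer the probe.

    candidates maps a host name to the number of slots to offer on it.
    probe is any callable taking a host name and returning True or False.
    """
    lines = []
    for host, slots in candidates.items():
        if probe(host):
            lines.append(f"{host} slots={slots}")
    # End with a newline only when there is at least one host.
    return "\n".join(lines) + ("\n" if lines else "")
```

The text returned by build_hostfile could be written to a file (say hosts.txt) and passed to mpirun with its --hostfile option. This only checks that a machine answers at the moment of the probe; a node can still die during the run, in which case the run is aborted as described above.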


Is there a monitoring tool that would give me indications of the status and 
health of the nodes?
This has nothing to do with MPI. Nagios or Ganglia can do that.

I was thinking more of a tool that would tell me a node is already performing a 
task, so that I can avoid oversubscribing it.


I'm quite sure all these are trivial questions for those with more experience, 
but I'm having a hard time finding resources that would answer those.
Read an introduction on programming with MPI and another one on Beowulf 
clusters (batch systems, monitoring, shared file systems). This should give you 
enough information on the topic. If you don't mind spending more money on 
software you can also take a look at Microsoft's HPC Server.

I've started looking at Beowulf clusters, and that led me to PBS. Am I right 
in assuming that PBS (PBSPro or TORQUE) could be used to do the monitoring and 
the load balancing I thought of?

Thanks
Olivier
