-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Nico Mittenzwey
Sent: Thursday, January 20, 2011 18:58
To: Open MPI Users
Subject: Re: [OMPI users] Help with some fundamentals
On 01/20/2011 05:50 PM, Olivier SANNIER wrote:
> What is the behavior in case a node dies or becomes unreachable?
> Your run will be aborted. However there is checkpoint/restart support
> for Linux http://www.open-mpi.org/faq/?category=ft
>
> As this is a Win32 program, I'll have to take into account that there is only
> the "abort" behavior.
AFAIK yes.

> So there is no dynamic discovery of nodes available on the network. Unless,
> of course, I were to write a tool that would do it before the actual run is
> started.
This is done by a batch system like PBS (TORQUE) or SGE.

> Is there a monitoring tool that would give me indications of the status and
> health of the nodes?
> This has nothing to do with MPI. Nagios or Ganglia can do that.
>
> I was thinking more of a tool that would tell me a node is already performing
> a task, so that I can avoid having it oversubscribed.
This is also done by a batch system.

> I've started looking at Beowulf clusters, and that led me to PBS. Am I right
> in assuming that PBS (PBSPro or TORQUE) could be used to do the monitoring
> and the load balancing I thought of?
Yes, however the terms "monitoring" and "load balancing" are usually used in other contexts.

Thank you for your help. I now have a better understanding of the technical details involved in all this.
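The point made in the thread, that a batch system rather than MPI itself keeps track of which nodes are busy so they are not oversubscribed, can be illustrated with a toy model. This is a conceptual sketch only, assuming each node advertises a fixed number of slots (cores) and jobs wait in a FIFO queue until enough slots are free. The names `Node` and `ToyScheduler` are hypothetical; this is not how PBS, TORQUE, or SGE are implemented and does not use any of their interfaces.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    slots: int     # cores the node advertises (hypothetical values)
    used: int = 0  # slots currently held by running jobs

    @property
    def free(self) -> int:
        return self.slots - self.used


@dataclass
class ToyScheduler:
    """Conceptual model only: hands out slots without oversubscribing nodes."""
    nodes: list
    queue: deque = field(default_factory=deque)
    running: dict = field(default_factory=dict)

    def submit(self, job: str, nprocs: int) -> None:
        # Jobs are queued first, then started only if enough slots are free.
        self.queue.append((job, nprocs))
        self._dispatch()

    def finish(self, job: str) -> None:
        # Release the slots the finished job held, then retry queued jobs.
        for node, count in self.running.pop(job):
            node.used -= count
        self._dispatch()

    def _dispatch(self) -> None:
        # Strict FIFO: the job at the head of the queue blocks later jobs.
        while self.queue:
            job, nprocs = self.queue[0]
            if sum(n.free for n in self.nodes) < nprocs:
                return  # not enough free slots: wait instead of oversubscribing
            placement, remaining = [], nprocs
            for node in self.nodes:
                take = min(node.free, remaining)
                if take:
                    node.used += take
                    placement.append((node, take))
                    remaining -= take
                if remaining == 0:
                    break
            self.running[job] = placement
            self.queue.popleft()


# Usage: two hypothetical 4-core nodes.
sched = ToyScheduler([Node("node01", 4), Node("node02", 4)])
sched.submit("jobA", 6)
sched.submit("jobB", 4)        # only 2 slots left, so jobB waits in the queue
print(sorted(sched.running))   # prints ['jobA']
sched.finish("jobA")
print(sorted(sched.running))   # prints ['jobB']
```

A real batch system adds much more on top of this (node health checks, priorities, backfilling, per-user limits), but the core idea is the same: a job is only launched on slots that are known to be free.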