I am pleased to announce that Open MPI now supports checkpoint/restart process fault tolerance. This feature is available on the current development trunk as of r14519 and is currently scheduled for release in the version 1.3 series of Open MPI.

The current implementation includes support for fully coordinated checkpoint/restart operation (somewhat similar to the LAM/MPI implementation). We support checkpoint/restart with the Berkeley Lab Checkpoint/Restart (BLCR) system and with a specialized SELF component used to support application-level checkpoint/restart operations.

By default, checkpoint/restart process fault tolerance is compiled out and disabled at runtime. For information on how to enable and properly use this new feature, please refer to the Checkpoint/Restart Users Guide draft attached to the Wiki page:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

In addition to the Checkpoint/Restart Users Guide, the Wiki entry describes the current status of this new feature and carries updates on its development.

If you have any questions about, or problems using, checkpoint/restart process fault tolerance in Open MPI, please send them to the users and developers lists.

Cheers,
Josh

----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/
