Actually, I honestly don't remember even having that discussion. In looking at 
it, this would be relatively easy to implement if someone really wanted it.

The only issue: the user would bear full responsibility for cleaning up failed 
jobs, since OMPI wouldn't terminate the job upon seeing a proc fail. Definitely 
not something you'd want to do in production!
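
For illustration, here is a minimal sketch of the bookkeeping an application 
would have to do itself when the job is not terminated after a proc fails. It 
is plain C with no MPI calls. The helper names next_live_peer and mark_dead are 
made up for this example and exist in neither OMPI nor MPICH2. How a rank 
decides that a peer is dead (a timeout, an error code returned from a send) is 
left out and is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not part of MPI or Open MPI): return the next rank
 * after `self` that is still marked alive, wrapping around the ring.
 * Returns -1 when no other rank is alive. */
int next_live_peer(const bool *alive, int size, int self)
{
    assert(size > 0 && self >= 0 && self < size);
    for (int step = 1; step < size; step++) {
        int candidate = (self + step) % size;
        if (alive[candidate])
            return candidate;
    }
    return -1;
}

/* Hypothetical helper: record that a ping to `rank` went unanswered,
 * so later calls to next_live_peer() skip it. */
void mark_dead(bool *alive, int size, int rank)
{
    assert(rank >= 0 && rank < size);
    alive[rank] = false;
}
```

A ping-pong loop would call mark_dead() for a rank that stops answering and 
then use next_live_peer() to pick the next target, so the exchange carries on 
among the remaining live ranks.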


On Sep 16, 2011, at 6:55 AM, Josh Hursey wrote:

> Though I do not share George's pessimism about acceptance by the Open
> MPI community, it has been slightly difficult to add such a
> non-standard feature to the code base for various reasons.
> 
> At ORNL, I have been developing a prototype for the MPI Forum Fault
> Tolerance Working Group [1] of the Run-Through Stabilization proposal
> [2,3]. This would allow the application to continue running and using
> MPI functions even though processes fail during execution. We have
> been doing some limited alpha releases for some friendly application
> developers desiring to play with the prototype for a while now. We are
> hoping to do a more public beta release in the coming months. I'll
> likely post a message to the ompi-devel list once it is ready.
> 
> -- Josh
> 
> [1] http://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage
> [2] See PDF on 
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> [3] See PDF on 
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization_2
> 
> On Thu, Sep 15, 2011 at 4:14 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>> Rob,
>> 
>> The Open MPI community did consider such an option, but deemed it 
>> uninteresting. However, we (UTK team) have a patched version supporting 
>> several fault tolerant modes, including the one you described in your email. 
>> If you are interested please contact me directly.
>> 
>>  Thanks,
>>    george.
>> 
>> 
>> On Sep 12, 2011, at 20:43 , Ralph Castain wrote:
>> 
>>> We don't have anything similar in OMPI. There are fault tolerance modes, 
>>> but not like the one you describe.
>>> 
>>> On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I have implemented a simple fault tolerant ping pong C program with MPI, 
>>>> here: http://pastebin.com/7mtmQH2q
>>>> 
>>>> MPICH2 offers a parameter with mpiexec:
>>>> $ mpiexec -disable-auto-cleanup
>>>> 
>>>> .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
>>>> 
>>>> It is fault tolerant in the respect that, when I ssh to one of the nodes 
>>>> in the hosts file, and kill the relevant process, the MPI job is not 
>>>> terminated. Simply, the ping will not prompt a pong from the dead node, 
>>>> but the ping-pong runs forever on the remaining live nodes.
>>>> 
>>>> Is such a feature available for Open MPI, either via mpiexec or some other 
>>>> means?
>>>> 
>>>> 
>>>> --
>>>> Rob Stewart
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 

