Re: [OMPI users] fault tolerance in open mpi

2009-12-24 Thread vipin kumar
Dear all, May I help in this context? I can't promise to do big things or be highly available in this regard, because I may get busier with my work. I am also not sure whether my company will allow this. Anyway, I may do this in my spare time. Thanks & Regards, On 12/23/09, Ralph

Re: [OMPI users] fault tolerance in open mpi

2009-12-23 Thread Ralph Castain
That's just OMPI's default behavior - as Josh said, we are working towards allowing other behaviors, but for now, this is what we have. On Dec 23, 2009, at 5:40 AM, vipin kumar wrote: > Thank you Ralph, > > I did as you said. Programs are running fine, But still killing one process > leads

Re: [OMPI users] fault tolerance in open mpi

2009-12-23 Thread vipin kumar
Thank you Ralph, I did as you said. The programs are running fine, but killing one process still terminates all the other processes. Am I missing something? Does anything else need to be called along with MPI::Comm::Disconnect()? Thanks & Regards, On Mon, Dec 21, 2009 at 8:00 PM, Ralph Castain

Re: [OMPI users] fault tolerance in open mpi

2009-12-21 Thread Ralph Castain
Disconnect is a -collective- operation. Both parent and child have to call it. Your child process is "hanging" while it waits for the parent. On Dec 21, 2009, at 1:37 AM, vipin kumar wrote: > Hello folks, > > As I explained my problem earlier, I am looking for Fault Tolerance in MPI >

Re: [OMPI users] fault tolerance in open mpi

2009-12-21 Thread vipin kumar
Hello folks, As I explained my problem earlier, I am looking for fault tolerance in MPI programs. I read in the MPI 2.1 standard document that two DISCONNECTED processes do not affect each other, i.e., they can die or be killed without affecting other processes. So, I was trying

Re: [OMPI users] fault tolerance in open mpi

2009-09-23 Thread Josh Hursey
Unfortunately I cannot provide a precise time frame for availability at this point, but we are targeting the v1.5 release series. There is a handful of core developers working on this issue at the moment. Pieces of this work have already made it into the Open MPI development trunk. If you

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread Josh Hursey
Task-farm or manager/worker recovery models typically depend on intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI implementation. William Gropp and Ewing Lusk have a paper entitled "Fault Tolerance in MPI Programs" that outlines how an application might take advantage of

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread Durga Choudhury
Is that kind of approach possible within an MPI framework? Perhaps a grid approach would be better. More experienced people, speak up, please? (The reason I say that is that I too am interested in the solution of that kind of problem, where an individual blade of a blade server fails and

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread jody
Hi, I guess "task-farming" could give you a certain amount of the kind of fault tolerance you want (i.e., a master process distributes tasks to idle slave processes; however, this will only work if the slave processes don't need to communicate with each other). Jody On Mon, Aug 3, 2009 at 1:24
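
The task-farming idea described in this message can be illustrated with a small sketch. This is not MPI code and not anything posted in the thread: it is a plain, sequential Python simulation, and the names `run_task_farm`, `WorkerDied`, and `flaky_square` are hypothetical, chosen only for illustration. It assumes, as the message does, that workers never need to talk to each other, so a task from a failed worker can simply be handed to another one.

```python
from collections import deque


class WorkerDied(Exception):
    """Stands in for a worker process being killed (hypothetical name)."""


def run_task_farm(tasks, workers, work_fn):
    """Master loop: give each pending task to the next live worker.

    If a worker fails, it is dropped from the pool and its task is
    re-queued for another worker. Tasks must be independent.
    """
    pending = deque(tasks)
    live = deque(workers)
    results = {}
    while pending:
        if not live:
            raise RuntimeError("all workers have failed")
        task = pending.popleft()
        worker = live.popleft()
        try:
            results[task] = work_fn(worker, task)
        except WorkerDied:
            pending.append(task)  # re-queue the task; the worker stays retired
            continue
        live.append(worker)  # worker is idle again
    return results


def flaky_square(worker, task):
    """Simulated work: worker 'w1' always dies, the others square the task."""
    if worker == "w1":
        raise WorkerDied(worker)
    return task * task


results = run_task_farm([1, 2, 3, 4], ["w0", "w1", "w2"], flaky_square)
```

In a real MPI program the master would have to detect the failure through the MPI implementation itself, which is exactly the part that depends on how resilient that implementation is.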

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread vipin kumar
Hi all, Thanks Durga for your reply. Jeff, you once wrote code for the Mandelbrot set to demonstrate fault tolerance in LAM-MPI, i.e., killing any slave process doesn't affect the others. That is exactly the behaviour I am looking for in Open MPI. I tried, but had no luck. Can you please tell me how to write such

Re: [OMPI users] fault tolerance in open mpi

2009-07-09 Thread Durga Choudhury
Although I have perhaps the least experience on the topic on this list, I will take a shot; more experienced people, please correct me: The MPI standards specify communication mechanisms, not fault tolerance at any level. You may achieve network tolerance at the IP level by implementing 'equal cost