Dear all,

May I help in this context? I can't promise big things or high availability here, because I may get busier at work, and I am also not sure whether my company will allow it. Anyway, I can work on this in my spare time.
Thanks & Regards,

On 12/23/09, Ralph Castain <r...@open-mpi.org> wrote:
> That's just OMPI's default behavior - as Josh said, we are working towards
> allowing other behaviors, but for now, this is what we have.
>
> On Dec 23, 2009, at 5:40 AM, vipin kumar wrote:
>
>> Thank you Ralph,
>>
>> I did as you said. The programs run fine, but killing one process still
>> terminates all processes. Am I missing something? Is anything else
>> required besides MPI::Comm::Disconnect()?
>>
>> Thanks & Regards,
>>
>> On Mon, Dec 21, 2009 at 8:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Disconnect is a -collective- operation. Both parent and child have to call
>> it. Your child process is "hanging" while it waits for the parent.
>>
>> On Dec 21, 2009, at 1:37 AM, vipin kumar wrote:
>>
>>> Hello folks,
>>>
>>> As I explained earlier, I am looking for fault tolerance in MPI
>>> programs. I read in the MPI 2.1 standard document that two DISCONNECTED
>>> processes do not affect each other, i.e. one can die or be killed
>>> without affecting the other.
>>>
>>> So I was trying to achieve fault tolerance by using
>>> MPI::Comm::Disconnect() to disconnect the CHILD process from the PARENT
>>> process that spawned it via MPI::Comm::Spawn(). I call
>>> MPI::Comm::Disconnect() from the CHILD process immediately after
>>> MPI::Init(). The CHILD process never seems to return from this call.
>>>
>>> I tried MPI::Comm::Free() too, but the process does not progress past
>>> that call either. If I comment out these statements, everything works
>>> fine. Note that I have tried this on Solaris as well as on Linux
>>> (Fedora Core).
>>>
>>> My question is whether Open MPI supports disconnecting two processes
>>> (e.g. a child from its parent), and if so, how?
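Ralph's point above - that disconnect is collective over the intercommunicator, so the parent must call it as well as the child - can be sketched roughly as follows. This is a minimal illustration using the C bindings rather than the poster's actual code; the binary name `./child` is an assumed placeholder, and whether a child failure then leaves the parent alive still depends on the runtime's resiliency, as discussed later in the thread.

```c
/* parent.c -- spawn a child, then disconnect collectively */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    MPI_Init(&argc, &argv);

    /* Spawn one child process; "./child" is an assumed binary name. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    /* Disconnect is collective over the intercommunicator: the parent
       must make this call, or the child blocks inside its own
       MPI_Comm_disconnect() waiting for it. */
    MPI_Comm_disconnect(&child);

    MPI_Finalize();
    return 0;
}

/* child.c -- the matching side */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    MPI_Init(&argc, &argv);

    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);  /* matches the parent's call */

    /* ... child continues on its own from here ... */
    MPI_Finalize();
    return 0;
}
```

Calling disconnect from only the child, immediately after init, reproduces exactly the hang described above.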
>>>
>>> Thanks & Regards,
>>>
>>> On Wed, Sep 23, 2009 at 6:41 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>> Unfortunately I cannot provide a precise time frame for availability at
>>> this point, but we are targeting the v1.5 release series. There are a
>>> handful of core developers working on this issue at the moment. Pieces of
>>> this work have already made it into the Open MPI development trunk. If
>>> you want to play around with what is available, try turning on the
>>> resilient mapper:
>>>
>>>   -mca rmaps resilient
>>>
>>> We will be sure to email the list once this work becomes more stable and
>>> available.
>>>
>>> -- Josh
>>>
>>> On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:
>>>
>>> Hi Josh,
>>>
>>> It is good to hear that work is in progress towards resiliency in
>>> Open MPI. I have been waiting for this capability. I have almost
>>> finished my development work and am waiting for this so that I can test
>>> my programs. It would be good if you could tell me how long it will take
>>> to make Open MPI a resilient implementation. By resiliency I mean that
>>> abnormal termination, or intentionally killing a process, should not
>>> cause any other (parent or sibling) process to be terminated, given that
>>> the processes are connected.
>>>
>>> Thanks.
>>>
>>> Regards,
>>>
>>> On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>> Task-farm or manager/worker recovery models typically depend on
>>> intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
>>> implementation. William Gropp and Ewing Lusk have a paper entitled "Fault
>>> Tolerance in MPI Programs" that outlines how an application might take
>>> advantage of these features in order to recover from process failure.
>>>
>>> However, these techniques strongly depend upon resilient MPI
>>> implementations, and behaviors that, some may argue, are non-standard.
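For reference, the resilient mapper Josh mentions is selected like any other MCA parameter on the mpirun command line. A minimal invocation might look like the following; the application name `./my_app` and the process count are placeholders, and availability of this mapper depends on the trunk/v1.5-era build being discussed here.

```shell
# Select the resilient process mapper via an MCA parameter
# ("./my_app" and "-np 4" are assumed placeholders).
mpirun -np 4 -mca rmaps resilient ./my_app

# MCA parameters can equivalently be set through the environment:
export OMPI_MCA_rmaps=resilient
mpirun -np 4 ./my_app
```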
>>> Unfortunately there are not many MPI implementations that are
>>> sufficiently resilient in the face of process failure to support failure
>>> in task-farm scenarios. Though Open MPI supports the current MPI 2.1
>>> standard, it is not as resilient to process failure as it could be.
>>>
>>> There are a number of people working on improving the resiliency of Open
>>> MPI in the face of network and process failure (including myself). We
>>> have started to move some of the resiliency work into the Open MPI trunk.
>>> Resiliency in Open MPI has been improving over the past few months, but I
>>> would not assess it as ready quite yet. Most of the work has focused on
>>> the runtime level (ORTE), and there are still some MPI level (OMPI)
>>> issues that need to be worked out.
>>>
>>> With all of that being said, I would try some of the techniques presented
>>> in the Gropp/Lusk paper in your application, then test it with Open MPI
>>> and let us know how it goes.
>>>
>>> Best,
>>> Josh
>>>
>>> On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
>>>
>>> Is that kind of approach possible within an MPI framework? Perhaps a
>>> grid approach would be better. More experienced people, speak up,
>>> please? (The reason I say that is that I too am interested in the
>>> solution of that kind of problem, where an individual blade of a blade
>>> server fails, and correcting for that failure on the fly is better than
>>> taking checkpoints and restarting the whole process excluding the
>>> failed blade.)
>>>
>>> Durga
>>>
>>> On Mon, Aug 3, 2009 at 9:21 AM, jody <jody....@gmail.com> wrote:
>>> Hi
>>>
>>> I guess "task-farming" could give you a certain amount of the kind of
>>> fault tolerance you want (i.e.
>>> a master process distributes tasks to idle slave processes - however,
>>> this will only work if the slave processes don't need to communicate
>>> with each other).
>>>
>>> Jody
>>>
>>> On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar <vipinkuma...@gmail.com> wrote:
>>> Hi all,
>>>
>>> Thanks Durga for your reply.
>>>
>>> Jeff, you once wrote code for the Mandelbrot set to demonstrate fault
>>> tolerance in LAM/MPI, i.e. killing any slave process doesn't affect the
>>> others. That is exactly the behaviour I am looking for in Open MPI. I
>>> attempted it, but with no luck. Can you please tell me how to write
>>> such programs in Open MPI?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>>
>>> On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpcho...@gmail.com> wrote:
>>>
>>> Although I have perhaps the least experience on this topic on this
>>> list, I will take a shot; more experienced people, please correct me:
>>>
>>> The MPI standards specify a communication mechanism, not fault
>>> tolerance at any level. You may achieve network fault tolerance at the
>>> IP level by setting up 'equal cost multipath' routes (which means two
>>> equally capable NICs connecting to the same destination and modifying
>>> the kernel routing table to use both cards; the kernel will dynamically
>>> load-balance). At the MAC level, you can achieve the same effect by
>>> trunking multiple network cards.
>>>
>>> You can achieve process-level fault tolerance with a checkpointing
>>> scheme such as BLCR, which has been tested to work with Open MPI (and
>>> with other processes as well).
>>>
>>> Durga
>>>
>>> On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar <vipinkuma...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I want to know whether Open MPI supports network and process fault
>>> tolerance or not. If there is an example demonstrating these features,
>>> that would be best.
>>>
>>> Regards,
>>> --
>>> Vipin K.
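As a concrete illustration of the BLCR route Durga describes, Open MPI's checkpoint/restart support of that era was driven from the command line roughly as follows. This is a sketch under stated assumptions: it requires an Open MPI build configured with `--with-ft=cr` plus an installed BLCR, `./my_app` is a placeholder, and `<PID>` stands in for mpirun's actual process ID.

```shell
# Launch the job with checkpoint/restart fault tolerance enabled
# (assumes an Open MPI built with --with-ft=cr and BLCR installed;
#  "./my_app" is an assumed placeholder).
mpirun -np 4 -am ft-enable-cr ./my_app &

# Checkpoint the running job, referencing mpirun's PID; this writes
# a global snapshot (by default under $HOME).
ompi-checkpoint <PID>

# Later, restart the whole job from the saved snapshot directory.
ompi-restart ompi_global_snapshot_<PID>.ckpt
```

Note that this restarts the entire job from the snapshot; it does not by itself give the "kill one worker, the rest continue" behaviour asked about above.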
>>> Research Engineer,
>>> C-DOTB, India
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Vipin K.
Research Engineer,
C-DOTB, India