Mainly responding to Ralph's comments: in HLA, a federate (an MPI process) can join and leave a federation (the MPI collective) independently of the other federates, and can rejoin later.
---John

On Mon, Apr 22, 2013 at 11:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> On Apr 19, 2013, at 17:00 , John Chludzinski <john.chludzin...@gmail.com> wrote:
>
> So the apparent conclusion to this thread is that an (Open)MPI-based RTI
> is very doable - if we allow for the future development of dynamic
> joining and leaving of the MPI collective?
>
> John,
>
> What do you mean by dynamically joining and leaving the MPI collective?
>
> There are quite a few functions in MPI to dynamically join and disconnect
> processes (MPI_Spawn, MPI_Connect, MPI_Comm_join). So if your processes
> __always__ leave cleanly (using the defined MPI pattern of
> comm_disconnect + comm_free), you might be lucky enough to have this
> working today. If you want to support processes leaving for reasons
> outside of your control (such as a crash), you do not have an option
> today in MPI; you need to use an extension such as ULFM.
>
> George.
>
> ---John
>
> On Wed, Apr 17, 2013 at 12:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Thanks for the clarification - very interesting indeed! I'll look at it
>> more closely.
>>
>> On Apr 17, 2013, at 9:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> On Apr 16, 2013, at 15:51 , Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Just curious: I thought ULFM dealt with recovering an MPI job where one
>> or more processes fail. Is this correct?
>>
>> It depends on which definition of "recovering" you take. ULFM is about
>> leaving the processes that remain (after a fault or a disconnect) in a
>> state that allows them to continue to make progress. It is not about
>> recovering processes or user data, but it does provide the minimal set
>> of functionalities to allow an application to do this, if needed
>> (revoke, agreement, and shrink).
>>
>> HLA/RTI consists of processes that start at random times, run to
>> completion, and then exit normally.
>> While a failure could occur, most
>> process terminations are normal and there is no need/intent to revive
>> them.
>>
>> As I said above, there is no revival of processes in ULFM, and it was
>> never our intent to have such a feature. The dynamic world is to be
>> constructed using the MPI-2 constructs (MPI_Spawn or MPI_Connect/Accept,
>> or even MPI_Join).
>>
>> So it's mostly a case of massively exercising MPI's dynamic
>> connect/accept/disconnect functions.
>>
>> Do ULFM's structures have some utility for that purpose?
>>
>> Absolutely. If the process that leaves calls exit() instead of calling
>> MPI_Finalize, this will be interpreted by the version of the runtime in
>> ULFM as an event triggering a report. All the ensuing mechanisms are then
>> activated, and the application can react to this event with the most
>> meaningful approach it can envision.
>>
>> George.
>>
>> On Apr 16, 2013, at 3:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> There is an ongoing effort to address the potential volatility of
>> processes in MPI, called ULFM. There is a working version available at
>> http://fault-tolerance.org. It supports TCP, sm, and IB (mostly). You
>> will find some examples there, along with the document explaining the
>> additional constructs needed in MPI to achieve this.
>>
>> George.
>>
>> On Apr 15, 2013, at 17:29 , John Chludzinski <john.chludzin...@gmail.com> wrote:
>>
>> That would seem to preclude its use for an RTI. Unless you have a card
>> up your sleeve?
>>
>> ---John
>>
>> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> It isn't the fact that there are multiple programs being used - we
>>> support that just fine. The problem with HLA/RTI is that it allows
>>> programs to come/go at will - i.e., not every program has to start at
>>> the same time, nor complete at the same time. MPI requires that all
>>> programs be executing at the beginning, and that all call finalize
>>> prior to anyone exiting.
>>>
>>> On Apr 15, 2013, at 8:14 AM, John Chludzinski <john.chludzin...@gmail.com> wrote:
>>>
>>> I just received an e-mail notifying me that MPI-2 supports MPMD. This
>>> would seem to be just what the doctor ordered?
>>>
>>> ---John
>>>
>>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> FWIW: some of us are working on a variant of MPI that would indeed
>>>> support what you describe - it would support send/recv (i.e., MPI-1),
>>>> but not collectives, and so would allow communication between
>>>> arbitrary programs.
>>>>
>>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that
>>>> conformed to that standard could be created.
>>>>
>>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski <john.chludzin...@gmail.com> wrote:
>>>>
>>>> > This would be a departure from the SPMD paradigm that seems central
>>>> > to MPI's design. Each process would be a completely different
>>>> > program (piece of code), and I'm not sure how well that would work
>>>> > using MPI.
>>>> >
>>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>>> > communication between LPs (federates in HLA). But these LPs are
>>>> > usually the same program.
>>>> >
>>>> > ---John
>>>> >
>>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>>> > <john.chludzin...@gmail.com> wrote:
>>>> >> Is anyone aware of an MPI-based HLA/RTI (DoD High Level
>>>> >> Architecture (HLA) / Runtime Infrastructure)?
>>>> >>
>>>> >> ---John
>>>> >
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > us...@open-mpi.org
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users