Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
If Josh is going to be at the forum, perhaps you folks could chat there? Might as well take advantage of being colocated, if possible. Otherwise, I'm available pretty much any time. I can't contribute much about the MPI recovery issues, but can contribute to the RTE issues if that helps. On

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread George Bosilca
Josh, Next week is a little bit too early, as we will need some time to figure out how to integrate with this new framework, and to what extent our code and requirements fit into it. Then the week after is the MPI Forum. How about Thursday, 11 March? Thanks, george. On Feb 25, 2010, at 12:46

Re: [OMPI devel] question about pids

2010-02-25 Thread Ralph Castain
Easy to do. I'll dump all the pids at the same time when the launch completes - effectively, it will be at the same point used by other debuggers to attach. Have it for you in the trunk this weekend. Can you suggest an xml format you would like? Otherwise, I'll just use the current proc output

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
Just to add to Josh's comment: I am working now on recovering from HNP failure as well. Should have that in a month or so. On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey wrote: > > On Feb 25, 2010, at 8:32 AM, George Bosilca wrote: > > > > > On Feb 25, 2010, at 11:16 ,

[hwloc-devel] Create success (hwloc r1.0a1r1753)

2010-02-25 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success. Snapshot: hwloc 1.0a1r1753 Start time: Thu Feb 25 21:01:04 EST 2010 End time: Thu Feb 25 21:03:13 EST 2010 Your friendly daemon, Cyrador

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
I believe you are thinking parallel to what Josh and I have been doing, and slightly different to the UTK approach. The "orcm" method follows what you describe: we maintain operation on the current remaining nodes, see if we can use another new node to replace the failed one, and redistribute the

Re: [OMPI devel] RFC: increase default AC/AM/LT requirements

2010-02-25 Thread Barrett, Brian W
I think our last set of minimums was based on being able to use RHEL4 out of the box. Updating to whatever ships with RHEL5 probably makes sense, but I think that still leaves you at a LT 1.5.x release. Being higher than that requires new Autotools, which seems like asking for trouble. Brian

[OMPI devel] RFC: increase default AC/AM/LT requirements

2010-02-25 Thread Jeff Squyres
WHAT: Bump minimum required versions of GNU autotools up to modern versions. I suggest the following, but could be talked down a version or two: Autoconf: 2.65 Automake: 1.11.1 Libtool: 2.2.6b WHY: Stop carrying patches and workarounds for old versions. WHERE: autogen.sh,

Re: [OMPI devel] question about pids

2010-02-25 Thread Ashley Pittman
Have you looked at orte-ps? It contains all the information you'll need to attach a debugger to an already running application. Ashley, On 25 Feb 2010, at 17:43, Greg Watson wrote: > Ralph, > > We'd like this to be able to support attaching a debugger to the application. > Would it be

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Josh Hursey
On Feb 25, 2010, at 8:32 AM, George Bosilca wrote: > > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote: > >> Hum... I'm really afraid about this. I understand your choice since it is >> really a good solution for fail/stop/restart behaviour, but looking from the >> fail/recovery side, can

Re: [OMPI devel] question about pids

2010-02-25 Thread Greg Watson
Ralph, We'd like this to be able to support attaching a debugger to the application. Would it be difficult to provide? We don't need the information all at once, each PID could be sent as the process launches (as long as the XML is correctly formatted) if that makes it any easier. Greg On

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi George, >> Hum... I'm really afraid about this. I understand your choice since it is >> really a good solution for fail/stop/restart behaviour, but looking from the >> fail/recovery side, can you envision some alternative for the orted's >> reconfiguration on the fly? > > I don't see why

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread George Bosilca
On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote: > Hum... I'm really afraid about this. I understand your choice since it is > really a good solution for fail/stop/restart behaviour, but looking from the > fail/recovery side, can you envision some alternative for the orted's >

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph and Josh, >>> Regarding to the schema represented by the picture, I didn't understand the >>> RecoS' behaviour in a node failure situation. >>> >>> In this case, will mpirun consider the daemon failure as a normal proc >>> failure? If it is correct, should mpirun update the global

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Josh Hursey
On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote: > > On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote: > >> Hi Ralph, >> >> Very interesting the "composite framework" idea. > > Josh is the force behind that idea :-) It solves a pretty interesting little problem. Its utility will really

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Josh Hursey
On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote: > > On Feb 23, 2010, at 3:32 PM, George Bosilca wrote: > >> Ralph, Josh, >> >> We have some comments about the API of the new framework, mostly >> clarifications needed to better understand how this new framework is >> supposed to be used.

Re: [OMPI devel] what's the relationship between proc, endpoint and btl?

2010-02-25 Thread Jeff Squyres
On Feb 25, 2010, at 7:14 AM, hu yaohui wrote: > Thanks a lot! I got it. Could you suggest some more material to help me > better understand the following functions: > (1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs This is just the OB1 function to add new peer processes. It's

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote: > Hi Ralph, > > Very interesting the "composite framework" idea. Josh is the force behind that idea :-) > Regarding to the schema represented by the picture, I didn't understand the > RecoS' behaviour in a node failure situation. > > In

Re: [OMPI devel] what's the relationship between proc, endpoint and btl?

2010-02-25 Thread hu yaohui
Thanks a lot! I got it. Could you suggest some more material to help me better understand the following functions: (1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs (2):/ompi/mca/bml/r2/bml_r2.c/mca_bml_r2_add_procs (3):/ompi/mca/btl/tcp/btl_tcp.c/mca_btl_tcp_add_procs especially the

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph, The "composite framework" idea is very interesting. Regarding the schema shown in the picture, I didn't understand RecoS's behaviour in a node failure situation. In this case, will mpirun consider the daemon failure as a normal proc failure? If that is correct, should mpirun

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
Hi George et al., I have begun documenting the RecoS operation on the OMPI wiki: https://svn.open-mpi.org/trac/ompi/wiki/RecoS I'll continue to work on this over the next few days by adding a section explaining what was changed outside of the new framework to make it all work. In addition, I am