If Josh is going to be at the forum, perhaps you folks could chat there?
Might as well take advantage of being colocated, if possible.
Otherwise, I'm available pretty much any time. I can't contribute much about
the MPI recovery issues, but can contribute to the RTE issues if that helps.
On
Josh,
Next week is a little bit too early, as we will need some time to figure out how
to integrate with this new framework, and to what extent our code and
requirements fit into it. Then the week after is the MPI Forum. How about Thursday, 11 March?
Thanks,
george.
On Feb 25, 2010, at 12:46
Easy to do. I'll dump all the pids at the same time when the launch
completes - effectively, it will be at the same point used by other
debuggers to attach.
Have it for you in the trunk this weekend. Can you suggest an xml format you
would like? Otherwise, I'll just use the current proc output
Just to add to Josh's comment: I am working now on recovering from HNP
failure as well. Should have that in a month or so.
On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey wrote:
>
> On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> >
> > On Feb 25, 2010, at 11:16 ,
Creating nightly hwloc snapshot SVN tarball was a success.
Snapshot: hwloc 1.0a1r1753
Start time: Thu Feb 25 21:01:04 EST 2010
End time: Thu Feb 25 21:03:13 EST 2010
Your friendly daemon,
Cyrador
I believe you are thinking parallel to what Josh and I have been doing, and
slightly different to the UTK approach. The "orcm" method follows what you
describe: we maintain operation on the current remaining nodes, see if we
can use another new node to replace the failed one, and redistribute the
I think our last set of minimums was based on being able to use RHEL4 out of
the box. Updating to whatever ships with RHEL5 probably makes sense, but I
think that still leaves you at a LT 1.5.x release. Being higher than that
requires new Autotools, which seems like asking for trouble.
Brian
WHAT: Bump minimum required versions of GNU autotools up to modern versions. I
suggest the following, but could be talked down a version or two:
Autoconf: 2.65
Automake: 1.11.1
Libtool: 2.2.6b
WHY: Stop carrying patches and workarounds for old versions.
WHERE: autogen.sh,
Have you looked at orte-ps? It contains all the information you'll need to
attach a debugger to an already running application.
Ashley,
On 25 Feb 2010, at 17:43, Greg Watson wrote:
> Ralph,
>
> We'd like this to be able to support attaching a debugger to the application.
> Would it be
On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
>
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can
Ralph,
We'd like this to be able to support attaching a debugger to the application.
Would it be difficult to provide? We don't need the information all at once,
each PID could be sent as the process launches (as long as the XML is correctly
formatted) if that makes it any easier.
Greg
On
Hi George,
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can you envision some alternative for the orted's
>> reconfiguration on the fly?
>
> I don't see why
On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> Hum... I'm really afraid about this. I understand your choice since it is
> really a good solution for fail/stop/restart behaviour, but looking from the
> fail/recovery side, can you envision some alternative for the orted's
>
Hi Ralph and Josh,
>>> Regarding to the schema represented by the picture, I didn't understand the
>>> RecoS' behaviour in a node failure situation.
>>>
>>> In this case, will mpirun consider the daemon failure as a normal proc
>>> failure? If it is correct, should mpirun update the global
On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote:
>
> On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
>
>> Hi Ralph,
>>
>> Very interesting the "composite framework" idea.
>
> Josh is the force behind that idea :-)
It solves a pretty interesting little problem. Its utility will really
On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote:
>
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>
>> Ralph, Josh,
>>
>> We have some comments about the API of the new framework, mostly
>> clarifications needed to better understand how this new framework is
>> supposed to be used.
On Feb 25, 2010, at 7:14 AM, hu yaohui wrote:
> Thanks a lot! I got it. Could you point me to some more material to get a
> better understanding of the following functions:
> (1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs
This is just the OB1 function to add new peer processes. It's
On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
> Hi Ralph,
>
> Very interesting the "composite framework" idea.
Josh is the force behind that idea :-)
> Regarding to the schema represented by the picture, I didn't understand the
> RecoS' behaviour in a node failure situation.
>
> In
Thanks a lot! I got it. Could you point me to some more material to get a
better understanding of the following functions:
(1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs
(2):/ompi/mca/bml/r2/bml_r2.c/mca_bml_r2_add_procs
(3):/ompi/mca/btl/tcp/btl_tcp.c/mca_btl_tcp_add_procs
especially the
Hi Ralph,
The "composite framework" idea is very interesting. Regarding the schema
represented by the picture, I didn't understand the RecoS' behaviour in a node
failure situation.
In this case, will mpirun consider the daemon failure as a normal proc failure?
If it is correct, should mpirun
Hi George et al
I have begun documenting the RecoS operation on the OMPI wiki:
https://svn.open-mpi.org/trac/ompi/wiki/RecoS
I'll continue to work on this over the next few days by adding a section
explaining what was changed outside of the new framework to make it all work.
In addition, I am