Re: [OMPI users] Running on crashing nodes

2010-09-27 Thread Randolph Pullen
I have successfully used a Perl program to start mpirun and record its PID. 
The monitor can then watch the output from MPI and, if the job is having 
trouble, terminate the mpirun command with a series of kills or something 
similar.
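
A minimal sketch of the launch-and-kill part, written in Python rather than Perl. The function names, the SIGINT/SIGTERM/SIGKILL order, and the grace period are assumptions made for illustration, not details of the original monitor. It relies on POSIX signals.

```python
import signal
import subprocess


def launch(cmd):
    """Start the command without waiting for it; proc.pid is the PID to record."""
    return subprocess.Popen(cmd)


def stop_with_escalation(proc, grace=2.0):
    """Send progressively harsher signals until the process exits.

    The SIGINT -> SIGTERM -> SIGKILL order and the grace period are
    assumptions for this sketch.
    """
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        if proc.poll() is not None:  # already exited
            break
        proc.send_signal(sig)
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            pass  # still alive, try the next signal
    return proc.wait()  # reap the process and return its exit status


# Hypothetical usage (./my_app is a placeholder for the MPI program):
#   proc = launch(["mpirun", "-np", "4", "./my_app"])
#   print("mpirun PID:", proc.pid)
#   ...
#   stop_with_escalation(proc)
```

Trying the gentler signals first gives mpirun a chance to clean up its remote processes before it is killed outright.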

One way to do this is to prefix all legal output from your MPI program with a 
known short string. If the monitor sees a line without this prefix, it can 
terminate MPI, check which nodes are available, and recast the job 
accordingly.
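
A rough Python sketch of the prefix check and the recast step, under stated assumptions: the marker "OK: ", the function names, the `is_up` callback (how node availability is actually tested is left open), and the choice of one process per surviving node are all hypothetical.

```python
import subprocess

PREFIX = "OK: "  # hypothetical marker; any short string the MPI program agrees on


def watch(cmd, prefix=PREFIX):
    """Run cmd and check every output line for the prefix.

    Returns (ok, offending_line). The process is terminated as soon as a
    line without the prefix is seen.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    offending = None
    for line in proc.stdout:
        if not line.startswith(prefix):
            offending = line.rstrip("\n")
            proc.terminate()  # a real monitor might escalate to SIGKILL
            break
    proc.stdout.close()
    returncode = proc.wait()
    ok = offending is None and returncode == 0
    return ok, offending


def recast_command(nodes, is_up, app):
    """Build a new mpirun command line that uses only the nodes reported up.

    is_up is a caller-supplied check (an assumption of this sketch); one
    process per surviving node is also an assumption.
    """
    alive = [n for n in nodes if is_up(n)]
    if not alive:
        raise RuntimeError("no nodes available")
    return ["mpirun", "--host", ",".join(alive), "-np", str(len(alive))] + list(app)
```

Note that output buffering in the MPI program can delay when the monitor sees a line, so the program should flush its output regularly for this check to react quickly.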
Hope this helps,
Randolph
--- On Fri, 24/9/10, Joshua Hursey <jjhur...@open-mpi.org> wrote:

From: Joshua Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] Running on crashing nodes
To: "Open MPI Users" <us...@open-mpi.org>
Received: Friday, 24 September, 2010, 10:18 PM

As one of the Open MPI developers actively working on the MPI layer 
stabilization/recovery feature set, I don't think we can give you a specific 
timeframe for availability, especially availability in a stable release. Once 
the initial functionality is finished, we will open it up for user testing by 
making a public branch available. After addressing the concerns highlighted by 
public testing, we will attempt to work this feature into the mainline trunk 
and eventual release.

Unfortunately it is difficult to assess the time needed to go through these 
development stages. What I can tell you is that the work to this point on the 
MPI layer is looking promising, and that as soon as we feel that the code is 
ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the 
> stable version? A rough estimate will be fine.
> 
> 
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain <r...@open-mpi.org> wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job 
> if it had processes on that node. There is no current ability to "ride-thru" 
> such an event.
> 
> That said, there is work being done to support "ride-thru". Most of that is 
> in the current developer's code trunk, and more is coming, but I wouldn't 
> consider it production-quality just yet.
> 
> Specifically, the code that does what you specify below is done and works. It 
> is recovery of the MPI job itself (collectives, lost messages, etc.) that 
> remains to be completed.
> 
> 
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <andrei.fo...@neutron.kth.se> 
> wrote:
> Dear users,
> 
> Our cluster has a number of nodes that are likely to crash, so calculations 
> often stop because one node goes down. Do you know whether it is possible to 
> exclude crashed nodes at run time when running with Open MPI? I am asking 
> whether such behavior can be programmed in principle. Does Open MPI allow 
> such dynamic checking? The scheme I have in mind is the following:
> 
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some point one node goes down
> 3. The code realizes that the node is down (the results are lost) and 
> excludes it from the list of nodes to run its tasks on
> 4. Later, the user restarts the crashed node
> 5. The code notices that the node is up again and puts it back on the list 
> of active nodes
> 
> 
> Regards,
> Andrei
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey



Re: [OMPI users] Running on crashing nodes

2010-09-24 Thread Joshua Hursey
As one of the Open MPI developers actively working on the MPI layer 
stabilization/recovery feature set, I don't think we can give you a specific 
timeframe for availability, especially availability in a stable release. Once 
the initial functionality is finished, we will open it up for user testing by 
making a public branch available. After addressing the concerns highlighted by 
public testing, we will attempt to work this feature into the mainline trunk 
and eventual release.

Unfortunately it is difficult to assess the time needed to go through these 
development stages. What I can tell you is that the work to this point on the 
MPI layer is looking promising, and that as soon as we feel that the code is 
ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the 
> stable version? A rough estimate will be fine.
> 
> 
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job 
> if it had processes on that node. There is no current ability to "ride-thru" 
> such an event.
> 
> That said, there is work being done to support "ride-thru". Most of that is 
> in the current developer's code trunk, and more is coming, but I wouldn't 
> consider it production-quality just yet.
> 
> Specifically, the code that does what you specify below is done and works. It 
> is recovery of the MPI job itself (collectives, lost messages, etc.) that 
> remains to be completed.
> 
> 
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau wrote:
> Dear users,
> 
> Our cluster has a number of nodes that are likely to crash, so calculations 
> often stop because one node goes down. Do you know whether it is possible to 
> exclude crashed nodes at run time when running with Open MPI? I am asking 
> whether such behavior can be programmed in principle. Does Open MPI allow 
> such dynamic checking? The scheme I have in mind is the following:
> 
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some point one node goes down
> 3. The code realizes that the node is down (the results are lost) and 
> excludes it from the list of nodes to run its tasks on
> 4. Later, the user restarts the crashed node
> 5. The code notices that the node is up again and puts it back on the list 
> of active nodes
> 
> 
> Regards,
> Andrei
> 
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey




Re: [OMPI users] Running on crashing nodes

2010-09-24 Thread Andrei Fokau
Ralph, could you tell us when this functionality will be available in the
stable version? A rough estimate will be fine.


On Fri, Sep 24, 2010 at 01:24, Ralph Castain wrote:

> In a word, no. If a node crashes, OMPI will abort the currently-running job
> if it had processes on that node. There is no current ability to "ride-thru"
> such an event.
>
> That said, there is work being done to support "ride-thru". Most of that is
> in the current developer's code trunk, and more is coming, but I wouldn't
> consider it production-quality just yet.
>
> Specifically, the code that does what you specify below is done and works.
> It is recovery of the MPI job itself (collectives, lost messages, etc.) that
> remains to be completed.
>
>
>  On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <
> andrei.fo...@neutron.kth.se> wrote:
>
>>  Dear users,
>>
>> Our cluster has a number of nodes that are likely to crash, so calculations
>> often stop because one node goes down. Do you know whether it is possible to
>> exclude crashed nodes at run time when running with Open MPI? I am asking
>> whether such behavior can be programmed in principle. Does Open MPI allow
>> such dynamic checking? The scheme I have in mind is the following:
>>
>> 1. A code starts its tasks via mpirun on several nodes
>> 2. At some point one node goes down
>> 3. The code realizes that the node is down (the results are lost) and
>> excludes it from the list of nodes to run its tasks on
>> 4. Later, the user restarts the crashed node
>> 5. The code notices that the node is up again and puts it back on the
>> list of active nodes
>>
>>
>> Regards,
>> Andrei
>>
>>


Re: [OMPI users] Running on crashing nodes

2010-09-23 Thread Ralph Castain
In a word, no. If a node crashes, OMPI will abort the currently-running job
if it had processes on that node. There is no current ability to "ride-thru"
such an event.

That said, there is work being done to support "ride-thru". Most of that is
in the current developer's code trunk, and more is coming, but I wouldn't
consider it production-quality just yet.

Specifically, the code that does what you specify below is done and works.
It is recovery of the MPI job itself (collectives, lost messages, etc.) that
remains to be completed.


On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau wrote:

> Dear users,
>
> Our cluster has a number of nodes that are likely to crash, so calculations
> often stop because one node goes down. Do you know whether it is possible to
> exclude crashed nodes at run time when running with Open MPI? I am asking
> whether such behavior can be programmed in principle. Does Open MPI allow
> such dynamic checking? The scheme I have in mind is the following:
>
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some point one node goes down
> 3. The code realizes that the node is down (the results are lost) and
> excludes it from the list of nodes to run its tasks on
> 4. Later, the user restarts the crashed node
> 5. The code notices that the node is up again and puts it back on the list
> of active nodes
>
>
> Regards,
> Andrei
>
>