Re: [OMPI users] Exit Program Without Calling MPI_Finalize For Special Case

2009-06-04 Thread Tee Wen Kai
Hi,
 
Thanks for your reply. Yup, I am engaging in such research. 
 
In that case, I think I will just stick with 1.2.8. Besides sending SIGTERM to 
kill the process, I may also kill the orted service when the spawned processes 
die or exit.
 
I would also like to know more about the consequences of exiting MPI processes 
without calling MPI_Finalize: will it cause a memory leak or any other fatal 
problem?
 
Thank you.
 
Regards,
Wenkai



  

Re: [OMPI users] Exit Program Without Calling MPI_Finalize For Special Case

2009-06-03 Thread Ralph Castain
I'm afraid there is no way to do this in 1.3.2 (or any OMPI  
distributed release) with MPI applications.


The OMPI trunk does provide continuous re-spawn of failed processes,  
mapping them to other nodes and considering fault relationships  
between nodes, but this only works if they are -not- MPI processes. I  
can detail that for you, if you would like.
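
For non-MPI processes, the re-spawn idea can be illustrated without Open MPI at all. The following is only a minimal sketch under stated assumptions, not the trunk implementation: `supervise` and its parameters are hypothetical names, everything runs on one machine, and no node mapping or fault-relationship logic is attempted.

```python
import subprocess
import sys


def supervise(command, max_restarts):
    """Run command and re-spawn it whenever it exits non-zero.

    Hypothetical helper: at most max_restarts re-spawns are attempted.
    Returns the list of exit codes observed, one per launch.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # the worker finished cleanly, so no re-spawn is needed
    return exit_codes


def always_failing_worker(code):
    """Build a command for a worker that exits immediately with the given code."""
    return [sys.executable, "-c", "import sys; sys.exit(%d)" % code]
```

A call such as `supervise(always_failing_worker(3), max_restarts=2)` launches the worker three times and then gives up. This works here only because the workers share no communication state, which is the distinction drawn above between non-MPI and MPI processes.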


The problem with MPI processes is that restart is a much larger  
problem than just re-spawning a process. The entire MPI system becomes  
out-of-sync when one process fails - messages in-flight can be lost,  
collectives hang, etc.


Even if you rewire the connections after re-spawning the process, you  
still have the problem of re-synchronizing the MPI communications -  
recovering lost messages, determining if a collective is already in  
operation and waiting for this process to respond, etc. Hence, our  
default response is to simply terminate the job, letting the user  
restart it from some prior checkpoint.
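
To make "restart it from some prior checkpoint" concrete, here is a minimal application-level sketch. It is an assumption-laden illustration, not an Open MPI facility: the function names, the JSON file format, and the `fail_at` parameter (which simulates a process failure) are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in one step, so a reader never sees a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, fail_at=None):
    """Resume from the checkpoint and advance to total_steps, saving after each step."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated process failure")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state
```

If `run(path, 10, fail_at=4)` is interrupted, a later `run(path, 10)` picks up at step 4 instead of step 0. In a real MPI job every rank would have to checkpoint a mutually consistent state, which is part of why the resynchronization described above is hard.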


Of course, the issue of how to recover from a single process failure  
remains the subject of considerable research. I assume you are  
engaging in such research?






[OMPI users] Exit Program Without Calling MPI_Finalize For Special Case

2009-06-03 Thread Tee Wen Kai
Hi,
 
I am writing a program for a central controller that spawns processes 
depending on the user's selection. When a fault occurs in a spawned process, 
for example when the computer it runs on suddenly goes down, the controller 
should react by respawning the processes on the available machines. However, 
when a computer goes down, all communications hang. To resolve this, the 
controller sends SIGTERM to kill the spawned processes. In the spawned 
program, I have written a signal handler for this case. However, when I call 
MPI_Finalize in the handler, error messages appear as the processes exit 
because some communication is not complete. I therefore modified my program 
so that processes exiting through the handler do not call MPI_Finalize. I am 
using Open MPI 1.2.8 and this works. However, version 1.2.8 has other bugs: 
for example, when processes spawned with MPI_Comm_spawn exit, the orted 
services are not terminated, so a lot of orted services accumulate as 
processes are spawned over and over again. I therefore started evaluating 
version 1.3.2. It fixes that bug, but the whole program exits as soon as one 
process exits without calling MPI_Finalize. I would appreciate your help or 
suggestions on how to overcome this, or on the proper way to quit processes 
that are stuck because one process is down.
 
Thank you.
 
Regards,
Wenkai
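
The signal-handler approach described in this message can be sketched without any MPI calls. This is a hedged illustration only: `TerminationFlag` and `work_loop` are hypothetical names, and the sketch shows the common pattern of setting a flag in the handler and deciding how to exit in the main loop, rather than doing cleanup inside the handler itself. The MPI standard does not require MPI calls to be signal safe, which is one reason to keep the handler this small.

```python
import signal


class TerminationFlag:
    """Records that a termination signal arrived; does no other work in the handler."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only set a flag here. Cleanup and exit decisions happen in the main loop.
        self.requested = True

    def install(self, signum=signal.SIGTERM):
        # signal.signal must be called from the main thread.
        signal.signal(signum, self.handler)


def work_loop(flag, do_step, max_steps):
    """Run do_step(i) until max_steps or until flag.requested; return steps completed."""
    for step in range(max_steps):
        if flag.requested:
            return step
        do_step(step)
    return max_steps
```

A worker would call `flag.install()` once at startup and, when `work_loop` returns early, exit without further cleanup (for example with `os._exit(1)`). Note that this only helps while the loop keeps running: a process blocked inside a hung communication call never reaches the flag check, and, as reported above, 1.2.8 tolerates the missing MPI_Finalize while 1.3.2 terminates the whole job.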