Hi,
 
I am writing a program for a central controller that will spawn processes 
depend on the user selection. And when there is some fault in the spawn 
processes like for example, the computer that is spawned with the process 
suddenly go down, the controller should react to this and respawn the processes 
to available machines. However, when a computer go down, all communications 
will hang. To resolve this, the controller will sent SIGTERM signal to kill 
those spawned processes. In the spawned program, I have written signal handler 
to handle such cases. However, when I include MPI_Finalize in the handler, 
there will be some error messages when the processes exit because some 
communication is not complete. Thus, I modify my program such that when the 
processes need to exit through handler, there will be no MPI_Finalize 
statement. I am using openmpi 1.2.8 and this works. However, version 1.2.8 has 
other bugs like spawned processes using MPI_Comm_spawn when exited
 does not terminate the orted services leading to alot of orted services when 
processes are spawn over and over again. Thus, I started evaluating version 
1.3.2. 1.3.2 solve the bug but the whole program exited once a process exit 
without calling MPI_Finalize. Therefore, I seek your help or suggestion on how 
should I overcome this or what should be the proper way to quit processes when 
they are stuck due to one process down.
 
Thank you.
 
Regards,
Wenkai


      

Reply via email to