Hi all,
I am using openmpi-1.3b3r20000 and blcr-0.7.3 to run my application on 2 nodes. I
have configured Open MPI to run on 2 nodes by default.
I want to use the checkpoint/restart functionality, so I configured Open MPI with
this command:
# ./configure --with-devel-headers --with-ft=cr --with-blcr=<path_to_blcr>
First: The application runs well with the command "mpirun -np 4 <my_app>", but
after a checkpoint I can't restart it. The error returned is a bus error
(signal 7). Following your advice, I added "-mca btl ^sm" to the mpirun command,
and it then runs well. But I want to know why this is needed.
Second: I can't checkpoint the application with the --term option. The checkpoint
command does not return. The snapshot is created, but it is not returned to
localhost. The daemon on the remote node dies before the local snapshot is
returned, but the processes on localhost do not die.
Third: After I restart an application, I can't checkpoint it again. The checkpoint
command does not return, and the restarted process dies with signal 13 (Broken Pipe).
This is my first experience with these features, and I can't understand why this
happens. Please help me.
Thank you
Catrina


      
