(Sorry for the delay, I missed the C/R question in the mail)

On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:

On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:

| > 2) I have installed blcr V0.8.2 but when I try to build OMPI and I point to the
| > full installation it complains it cannot find it. Note that I build BLCR with
| > GCC but I am building OMPI with Intel compilers (V11.1)
|
| Can you be more specific here?

I pointed to the installation path for BLCR but configure complained that it couldn't find it. If BLCR is only needed for checkpoint/restart then we can
live without it. Is BLCR needed for suspend/resume of MPI jobs?

You mean suspend with Ctrl-Z? If so, correct -- BLCR is *only* used for checkpoint/restart. Ctrl-Z just uses the SIGTSTP functionality.

So BLCR is used for the checkpoint/restart functionality in Open MPI. We have a webpage with some more details and examples at the link below:
  http://osl.iu.edu/research/ft/ompi-cr/

You should be able to suspend/resume an Open MPI job using SIGSTOP/SIGCONT without the C/R functionality. We have an FAQ item that talks about how to enable this functionality:
  http://www.open-mpi.org/faq/?category=running#suspend-resume
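
As a rough illustration of the signal side of this, here is a minimal Python sketch. It assumes mpirun was started with job-control forwarding enabled as described in the FAQ item above, and that you know the PID of the mpirun process. The function names are made up for this example and are not part of Open MPI. Because SIGSTOP cannot be caught by any process, the sketch sends SIGTSTP (the same signal Ctrl-Z generates) by default, so that mpirun has a chance to see the request.

```python
import os
import signal


def suspend_job(mpirun_pid: int, stop_signal: int = signal.SIGTSTP) -> None:
    """Send a stop request to the mpirun process (hypothetical helper).

    Assumption: mpirun was launched with the forwarding option from the
    FAQ, so it passes the request on to the MPI processes it manages.
    """
    os.kill(mpirun_pid, stop_signal)


def resume_job(mpirun_pid: int) -> None:
    """Send SIGCONT to the mpirun process to continue the job."""
    os.kill(mpirun_pid, signal.SIGCONT)
```

For example, suspend_job(12345) followed later by resume_job(12345), where 12345 stands in for the real mpirun PID. This is equivalent to using kill from the shell with the same signals.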

You can combine the C/R and the SIGSTOP/SIGCONT functionality so that when you 'suspend' a job, a checkpoint is taken and the process is stopped. You can continue the job by sending SIGCONT as normal. Additionally, this way, if the job needs to be terminated for some reason (e.g., memory footprint, maintenance), it can be safely terminated and restarted from the checkpoint. I have an example of how this works at the link below:
  http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-stop

As far as C/R integration with schedulers/resource managers goes, I know that the BLCR folks have been working with Torque to better integrate Open MPI+BLCR+Torque. If this is of interest, you might want to check with them on the progress of that project.

-- Josh
