I took a look at the checkpoint staging and preload functionality. It seems that the combination of the two is broken on the v1.3 and v1.4 branches. I filed a bug about it so that it would not get lost:
  https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual fix is much more involved. I don't know when I'll get around to finishing this bug fix for that branch. :(

However, the current development trunk and v1.5 are known to have a working version of this feature. Can you try the trunk or v1.5 and see if that fixes the problem?

-- Josh

P.S. If you are interested, we have a slightly better version of the documentation, hosted at the link below:
  http://osl.iu.edu/research/ft/ompi-cr/

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:

Josh Hursey wrote:
(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:

Thanks for the reply!

Concerning the mca options for checkpointing:
- Are verbosity options (e.g.: crs_base_verbose) limited to the values 0 and 1?
- For priority options (e.g.: crs_blcr_priority), do lower numbers indicate higher priority?
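
(As a side note, the registered defaults and help strings for these parameters can presumably be listed with ompi_info, assuming the ompi_info from the same Open MPI installation is on the PATH:)

  ompi_info --param crs all       # crs_base_verbose, crs_blcr_priority, ...
  ompi_info --param snapc all     # snapc_base_* parameters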

By searching in the archives of the mailing list I found two interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php (for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php (for restarting)

Following indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-----------------------------
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--------------
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid
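
For completeness, the same settings could presumably also be passed directly on the mpirun command line with -mca options instead of mca-params.conf; a sketch, reusing the a.out and machines file from above:

  mpirun -n 2 -machinefile machines -am ft-enable-cr \
         -mca snapc_base_store_in_place 0 \
         -mca snapc_base_global_snapshot_dir $HOME \
         -mca crs_base_snapshot_dir /tmp \
         a.out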



OUTPUT of ompi-checkpoint -v 16753
--------------------------------------
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044]     PID 17036
[ic85:17044]     Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt



OUTPUT of MPIRUN
----------------
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054

This is a warning about creating the global snapshot directory (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It seems to indicate that the directory existed when the file gather started.

A couple of things to check:
- Did you clean out /tmp on all of the nodes, removing any files starting with "opal" or "ompi"? (A cleanup sketch follows this list.)
- Does the error go away when you set snapc_base_global_snapshot_dir=$HOME?
- Could you try running against a v1.3 release? (I wonder if this feature has been broken on the trunk.)
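
A minimal cleanup sketch for the first item, assuming passwordless ssh and the same "machines" file used with mpirun (the glob patterns are only a guess at the stale file names; adjust them to whatever you actually find in /tmp):

  # remove stale checkpoint scratch files on every node
  for host in $(cat machines); do
      ssh $host 'rm -rf /tmp/opal* /tmp/ompi*'
  done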

Let me know what you find. In the next couple days, I'll try to test the trunk again with this feature to make sure that it is still working on my test machines.

-- Josh
Hello Josh,

I have switched to v1.3 and re-run with snapc_base_global_snapshot_dir set to /tmp or $HOME,
starting from a clean /tmp.

In both cases I get the same error as before :-(

I don't know if the following can be of any help, but after ompi-checkpoint
returns there is only a copy of the checkpoint of the rank 0 process in
the global snapshot directory:

$(snapc_base_global_snapshot_dir)/ompi_global_snapshot_XXXX.ckpt/0

So I guess the error occurs during the remote copy phase.
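
(For reference, a hypothetical way to confirm this, reusing the host names from the logs above and with XXXX standing in for the mpirun PID:)

  # on the node running mpirun (ic85): only rank 0's snapshot shows up
  ls <global_snapshot_dir>/ompi_global_snapshot_XXXX.ckpt/0
  # on the remote node (ic86): check what local snapshot files were left behind
  ssh ic86 'ls -d /tmp/opal* /tmp/ompi*'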

--
Constantinos





Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage below:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the past, so searching around will likely turn up some examples.
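
For the common case of writing the local (per-process) checkpoints to a local disk, a minimal sketch (assuming the standard per-user MCA parameter file at $HOME/.openmpi/mca-params.conf) would be:

  # $HOME/.openmpi/mca-params.conf
  crs_base_snapshot_dir=/tmp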

-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:

Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By default, it seems that checkpoints are saved in $HOME. However, I would prefer them
to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves checkpoints?


Best regards,

--
Constantinos