Re: [OMPI users] Openmpi Checkpoint/Restart failed

2010-12-23 Thread 孟宪军
Dear all,

I have figured it out. It was a simple issue: I had not added the BLCR lib
directory to the $PATH environment variable. Without it, the checkpoint
operation still worked, but the restart operation failed. It was so weird.
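
A minimal shell sketch of the fix described above, assuming BLCR lives under a hypothetical prefix ($HOME/opt/blcr; the message does not give the real path). The message names $PATH; shared libraries are normally found through LD_LIBRARY_PATH, so the sketch sets both. `prepend_once` is an illustrative helper, not part of Open MPI or BLCR.

```shell
# Hypothetical BLCR install prefix; adjust to your site.
BLCR_HOME="${BLCR_HOME:-$HOME/opt/blcr}"

# Prepend a directory to a colon-separated variable unless it is already listed.
# $1 = variable name, $2 = directory
prepend_once() {
    eval "current=\${$1:-}"
    case ":$current:" in
        *":$2:"*) ;;   # already present, nothing to do
        *) eval "export $1=\"\$2\${current:+:\$current}\"" ;;
    esac
}

prepend_once PATH "$BLCR_HOME/bin"
prepend_once LD_LIBRARY_PATH "$BLCR_HOME/lib"
```

The same settings probably need to be visible on every node taking part in the restart, for example through a shell startup file; that is an assumption, not something this thread confirms.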


Best regards
Xianjun Meng

On December 23, 2010 at 5:35 PM, 孟宪军 wrote:

> My main question is:
>
> After I finished the checkpoint operation on a simple task that ran on two
> machines, I could only restart it on one machine. If I run the following
> command to force ompi-restart to run the program on two machines:
>
> *ompi-restart  -hostfile  ./machine_names  ompi_global_snapshot_XXX.ckpt*
> (the machine_names include two host names)
>
> the output is:
> *
> --
> Error: Unable to obtain the proper restart command to restart from the
>checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --
> [jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
> [jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
> [0x302af68b85]
> [jx-mpi-fcr048:04116] [ 2]
> /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41)
> [0x2a9557de31]
> [jx-mpi-fcr048:04116] [ 3]
> /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27)
> [0x2a95573ac7]
> [jx-mpi-fcr048:04116] [ 4]
> /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f)
> [0x2a95568a0f]
> [jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
> [jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x302af1c4bb]
> [jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
> [jx-mpi-fcr048:04116] *** End of error message ***
> --
> mpirun noticed that process rank 1 with PID 4116 on node
> jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
> --
> *
>
> My global_snapshot_meta.data is:
>
> *# Seq: 0
> # Timestamp: Thu Dec 23 16:39:46 2010
> # Process: 1680080897.0
> # OPAL CRS Component: blcr
> # Snapshot Reference: opal_snapshot_0.ckpt
> # Snapshot Location:
> /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
> # Process: 1680080897.1
> # OPAL CRS Component: blcr
> # Snapshot Reference: opal_snapshot_1.ckpt
> # Snapshot Location:
> /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
> # Timestamp: Thu Dec 23 16:39:47 2010
> # Finished Seq: 0*
>
> Does anybody know why?
>
> Thanks
> Xianjun Meng
>
>
> 2010/12/23 孟宪军 
>
> Dear all,
>>
>> I have been trying the checkpoint/restart function of Open MPI recently, and
>> after several failures and checking a lot of the documentation, I am still
>> very confused about how to configure it. Can anybody give me a
>> $HOME/.openmpi/mca-params.conf file and tell me which parameters I should
>> specify when I install Open MPI?
>>
>> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
>>
>>
>> Thanks
>> Xianjun Meng
>>
>
>


Re: [OMPI users] Openmpi Checkpoint/Restart failed

2010-12-23 Thread 孟宪军
My main question is:

After I finished the checkpoint operation on a simple task that ran on two
machines, I could only restart it on one machine. If I run the following
command to force ompi-restart to run the program on two machines:

*ompi-restart  -hostfile  ./machine_names  ompi_global_snapshot_XXX.ckpt*
(the machine_names include two host names)
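
For reference, a sketch of what such a two-host restart could look like. The host names node01 and node02 are placeholders, and `slots=1` per host is an assumption about the original job layout; XXX is left exactly as written in the message.

```shell
# Hypothetical host names; replace with the two machines that ran the job.
HOSTFILE="${HOSTFILE:-./machine_names}"
cat > "$HOSTFILE" <<'EOF'
node01 slots=1
node02 slots=1
EOF

# Snapshot name as written in the message (XXX is not filled in here).
SNAPSHOT="ompi_global_snapshot_XXX.ckpt"

# Only attempt the restart if the tool is installed on this machine.
if command -v ompi-restart >/dev/null 2>&1; then
    ompi-restart -hostfile "$HOSTFILE" "$SNAPSHOT"
else
    echo "ompi-restart not found in PATH" >&2
fi
```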

the output is:
*--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
[jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
[jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)
[0x302af68b85]
[jx-mpi-fcr048:04116] [ 2]
/home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41)
[0x2a9557de31]
[jx-mpi-fcr048:04116] [ 3]
/home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27)
[0x2a95573ac7]
[jx-mpi-fcr048:04116] [ 4]
/home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f)
[0x2a95568a0f]
[jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
[jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x302af1c4bb]
[jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
[jx-mpi-fcr048:04116] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 4116 on node
jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
--*

My global_snapshot_meta.data is:

*# Seq: 0
# Timestamp: Thu Dec 23 16:39:46 2010
# Process: 1680080897.0
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_0.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Process: 1680080897.1
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_1.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Timestamp: Thu Dec 23 16:39:47 2010
# Finished Seq: 0*
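
To see at a glance which snapshot belongs to which process, the metadata above can be summarized with a few lines of shell. This is a sketch that assumes the layout shown in this message (a `# Process:` line followed later by a `# Snapshot Reference:` line); `list_snapshots` is a hypothetical helper name.

```shell
# Print "process<TAB>snapshot reference" pairs from a global_snapshot_meta.data file.
# $1 = path to the metadata file
list_snapshots() {
    awk '
        /^# Process:/            { proc = $3 }            # remember the current process id
        /^# Snapshot Reference:/ { print proc "\t" $4 }   # pair it with its snapshot
    ' "$1"
}

# Example:
#   list_snapshots /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/global_snapshot_meta.data
```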

Does anybody know why?

Thanks
Xianjun Meng


2010/12/23 孟宪军 

> Dear all,
>
> I have been trying the checkpoint/restart function of Open MPI recently, and
> after several failures and checking a lot of the documentation, I am still
> very confused about how to configure it. Can anybody give me a
> $HOME/.openmpi/mca-params.conf file and tell me which parameters I should
> specify when I install Open MPI?
>
> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
>
>
> Thanks
> Xianjun Meng
>


[OMPI users] Openmpi Checkpoint/Restart failed

2010-12-23 Thread 孟宪军
Dear all,

I have been trying the checkpoint/restart function of Open MPI recently, and
after several failures and checking a lot of the documentation, I am still
very confused about how to configure it. Can anybody give me a
$HOME/.openmpi/mca-params.conf file and tell me which parameters I should
specify when I install Open MPI?

BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.


Thanks
Xianjun Meng


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-24 Thread Nguyen Toan
Hi all,

I had the same problem as Jitsumoto, i.e. OpenMPI 1.4.2 failed to restart
and the patch that Fernando gave didn't work.
I also tried the 1.5 nightly snapshots, but they did not seem to work well either.
I would prefer not to use --enable-ft-thread in configure, but
the same error occurs even when --enable-ft-thread is used.
Here is my configure command for OMPI 1.5a1r23135:

>./configure \
>--with-ft=cr \
>--enable-mpi-threads \
>--with-blcr=/home/nguyen/opt/blcr \
>--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>--prefix=/home/nguyen/opt/openmpi_1.5 \
>--enable-mpirun-prefix-by-default
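
The same configure line can be kept in a small script so that the variant with --enable-ft-thread is easy to compare. This is only a sketch: the flags are the ones named in this message, the default paths are placeholders, and `ompi_configure_args` is a hypothetical helper.

```shell
# Hypothetical install locations; the originals in this message are under /home/nguyen/opt.
BLCR_DIR="${BLCR_DIR:-$HOME/opt/blcr}"
PREFIX="${PREFIX:-$HOME/opt/openmpi}"

# Print one configure argument per line.
# Pass "thread" as $1 to include --enable-ft-thread.
ompi_configure_args() {
    printf '%s\n' \
        "--with-ft=cr" \
        "--enable-mpi-threads" \
        "--with-blcr=$BLCR_DIR" \
        "--with-blcr-libdir=$BLCR_DIR/lib" \
        "--prefix=$PREFIX" \
        "--enable-mpirun-prefix-by-default"
    if [ "${1:-}" = "thread" ]; then
        printf '%s\n' "--enable-ft-thread"
    fi
}

# In an Open MPI source tree (paths without spaces assumed):
#   ./configure $(ompi_configure_args thread) && make && make install
```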

and errors:

>$ mpirun -am ft-enable-cr -machinefile ./host ./a.out
>0
>0
>1
>1
>2
>2
>3
>3
>--
>mpirun has exited due to process rank 1 with PID 6582 on
>node rc014 exiting improperly. There are two reasons this could occur:

>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.

>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"

>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>---

And here is the checkpoint command:

>$ ompi-checkpoint -s -v --term 10982
>[rc013.local:11001] [  0.00 /   0.14] Requested - ...
>[rc013.local:11001] [  0.00 /   0.14]   Pending - ...
>[rc013.local:11001] [  0.01 /   0.15]   Running - ...
>[rc013.local:11001] [  7.79 /   7.94]  Finished -
>ompi_global_snapshot_10982.ckpt
>Snapshot Ref.:   0 ompi_global_snapshot_10982.ckpt

I also took a look inside the checkpoint files and found that the snapshot
was taken:
~/tmp/ckpt/ompi_global_snapshot_10982.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6582
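
A quick way to repeat that check for every rank is to list the BLCR context files under the global snapshot directory. The sketch below assumes the directory layout visible in the path above (global snapshot, sequence number, per-rank opal_snapshot_N.ckpt directory); `list_contexts` is a hypothetical helper.

```shell
# List the BLCR context files stored for each rank of a global snapshot.
# $1 = path to ompi_global_snapshot_<PID>.ckpt, $2 = sequence number (default 0)
list_contexts() {
    find "$1/${2:-0}" -type f -name 'ompi_blcr_context.*' 2>/dev/null | sort
}

# Example with the path from this message:
#   list_contexts ~/tmp/ckpt/ompi_global_snapshot_10982.ckpt
```

If a rank has no context file in the listing, the checkpoint itself is incomplete and the restart failure is not the first problem to chase.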

But restarting failed as follows:
>$ ompi-restart ompi_global_snapshot_10982.ckpt
>--
>mpirun noticed that process rank 1 with PID 11346 on node rc013.local
>exited on signal 11 (Segmentation fault).
>--

Does anyone have an idea what causes this? Thank you!

Regards,
Nguyen Toan


On Mon, May 24, 2010 at 4:08 PM, Hideyuki Jitsumoto <
jitum...@gsic.titech.ac.jp> wrote:

> -- Forwarded message --
> From: Fernando Lemos <fernando...@gmail.com>
> Date: Thu, Apr 15, 2010 at 2:18 AM
> Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
> To: Open MPI Users <us...@open-mpi.org>
>
>
> On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto
> <hjitsum...@gmail.com> wrote:
> > Fernando,
> >
> > Thank you for your reply.
> > I tried to patch the file you mentioned, but the output did not change.
>
> I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
> works great.
>
> >> Are you using a shared file system? You need to use a shared file
> >> system for checkpointing with 1.4.1:
> > What is the shared file system ? do you mean NFS, Lustre and so on ?
> > (I'm sorry about my ignorance...)
>
> Something like NFS, yea.
>
> > If I use only one node for application, do I need such a
> > shared-file-system ?
>
> No, for a single node, checkpointing with 1.4.1 should work (it works
> for me, at least). If you're using a single node, then your problem is
> probably not related to the bug report I posted.
>
>
> Regards,
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)
>


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-19 Thread Hideyuki Jitsumoto
Hi Josh,

Thank you for your reply.
I applied the patch from Ticket #2139 to openmpi-1.4.1
and reinstalled all of the components from scratch.
After that, it worked correctly.
There was probably a mistake in how I prepared my environment.

# I cannot reproduce the environment in which the failure occurred.
# I am very sorry that I cannot identify the real cause of this malfunction
# and cannot send any further information.
# I now use openmpi-1.4.2, and it works well without any patch (except
# for ompi_info).

>> In addition, when I confirmed open_info output as your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, and should be available soon. Until then the ticket is linked below:
>  https://svn.open-mpi.org/trac/ompi/ticket/2097
Thank you, I'll try it.


On Wed, May 19, 2010 at 3:46 AM, Josh Hursey  wrote:
> (Sorry for the delay in replying, more below)
>
> On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:
>
>> Hi Members,
>>
>> I tried to use checkpoint/restart by openmpi.
>> But I can not get collect checkpoint data.
>> I prepared execution environment as follows, the strings in () mean
>> name of output file which attached on next e-mail ( for mail size
>> limitation ):
>>
>> 1. installed BLCR and checked BLCR is working correctly by "make check"
>> 2. executed ./configure with some parameters on openMPI source dir
>> (config.output / config.log)
>> 3. executed make and make install (make.output.2 / install.output.2)
>> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
>> /${INSTALL_DIR}/lib/openmpi
>> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
>> 6. compiled NPB and executed with -am ft-enable-cr
>> 7. invoked ompi-checkpoint 
>>
>> As result, I got the message "Checkpoint failed: no processes
>> checkpointed."
>> (cr_test_cg)
>
> It is unclear from the output what caused the checkpoint to fail. Can you
> turn on some verbose arguments and send me the output?
>
> Put the following options in your ~/.openmpi/mca-params.conf:
> #---
> orte_debug_daemons=1
> snapc_full_verbose=20
> crs_base_verbose=10
> opal_cr_verbose=10
> #---
>
>
>>
>> In addition, when I confirmed open_info output as your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, and should be available soon. Until then the ticket is linked below:
>  https://svn.open-mpi.org/trac/ompi/ticket/2097
>
>>
>> How should I do for checkpointing ?
>> Any guidance in this regard would be highly appreciated.
>
> Let's see what the verbose output tells us, and go from there. What version
> of BLCR are you using?
>
> -- Josh
>
>>
>> Thank you,
>> Hideyuki
>>
>> --
>> Sincerely Yours,
>> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
>> Tokyo Institute of Technology
>> Global Scientific Information and Computing center (Matsuoka Lab.)
>
>



-- 
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)



Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-18 Thread Josh Hursey

(Sorry for the delay in replying, more below)

On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:


> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint 
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)


It is unclear from the output what caused the checkpoint to fail. Can  
you turn on some verbose arguments and send me the output?


Put the following options in your ~/.openmpi/mca-params.conf:
#---
orte_debug_daemons=1
snapc_full_verbose=20
crs_base_verbose=10
opal_cr_verbose=10
#---
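
For anyone repeating this, the four debugging parameters can be appended to the parameter file with a small helper. The parameter names and values are the ones given in this message; `add_cr_debug_params` is a hypothetical function name, and appending (rather than overwriting) is a choice made here to keep any existing settings.

```shell
# Append the debugging parameters from this message to an MCA parameter file.
# $1 = target file (default: ~/.openmpi/mca-params.conf)
add_cr_debug_params() {
    conf="${1:-$HOME/.openmpi/mca-params.conf}"
    mkdir -p "$(dirname "$conf")"
    cat >> "$conf" <<'EOF'
orte_debug_daemons=1
snapc_full_verbose=20
crs_base_verbose=10
opal_cr_verbose=10
EOF
}

# Usage:
#   add_cr_debug_params                    # default file
#   add_cr_debug_params ./mca-debug.conf   # separate file for a one-off run
```

Remember to remove the lines again after debugging, since the verbose output is large.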




> In addition, when I confirmed open_info output as your demo movie, I got
> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)


This is actually a known bug with ompi_info. I have a fix in the works  
for it, and should be available soon. Until then the ticket is linked  
below:

  https://svn.open-mpi.org/trac/ompi/ticket/2097



> How should I do for checkpointing ?
> Any guidance in this regard would be highly appreciated.


Let's see what the verbose output tells us, and go from there. What  
version of BLCR are you using?


-- Josh



> Thank you,
> Hideyuki
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)




Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-14 Thread Fernando Lemos
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto
 wrote:
> Fernando,
>
> Thank you for your reply.
> I tried to patch the file you mentioned, but the output did not change.

I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
works great.

>> Are you using a shared file system? You need to use a shared file
>> system for checkpointing with 1.4.1:
> What is the shared file system ? do you mean NFS, Lustre and so on ?
> (I'm sorry about my ignorance...)

Something like NFS, yea.

> If I use only one node for application, do I need such a shared-file-system ?

No, for a single node, checkpointing with 1.4.1 should work (it works
for me, at least). If you're using a single node, then your problem is
probably not related to the bug report I posted.


Regards,


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-14 Thread Hideyuki Jitsumoto
Fernando,

Thank you for your reply.
I tried to patch the file you mentioned, but the output did not change.

> Are you using a shared file system? You need to use a shared file
> system for checkpointing with 1.4.1:
What is the shared file system ? do you mean NFS, Lustre and so on ?
(I'm sorry about my ignorance...)

If I use only one node for application, do I need such a shared-file-system ?


On Mon, Apr 12, 2010 at 9:41 PM, Fernando Lemos  wrote:
> On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto
>  wrote:
>> Hi Members,
>>
>> I tried to use checkpoint/restart by openmpi.
>> But I can not get collect checkpoint data.
>> I prepared execution environment as follows, the strings in () mean
>> name of output file which attached on next e-mail ( for mail size
>> limitation ):
>>
>> 1. installed BLCR and checked BLCR is working correctly by "make check"
>> 2. executed ./configure with some parameters on openMPI source dir
>> (config.output / config.log)
>> 3. executed make and make install (make.output.2 / install.output.2)
>> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
>> /${INSTALL_DIR}/lib/openmpi
>> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
>> 6. compiled NPB and executed with -am ft-enable-cr
>> 7. invoked ompi-checkpoint 
>>
>> As result, I got the message "Checkpoint failed: no processes checkpointed."
>> (cr_test_cg)
>
> Are you using a shared file system? You need to use a shared file
> system for checkpointing with 1.4.1:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2139
>
> Regards,
>



-- 
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Fernando Lemos
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto
 wrote:
> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint 
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)

Are you using a shared file system? You need to use a shared file
system for checkpointing with 1.4.1:

https://svn.open-mpi.org/trac/ompi/ticket/2139

Regards,


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Hideyuki Jitsumoto
I attach file (2/2) to this email, as mentioned in the previous one.

Thank you,
Hideyuki


*
** **
** WARNING:  This email contains an attachment of a very suspicious type.  **
** You are urged NOT to open this attachment unless you are absolutely **
** sure it is legitimate.  Opening this attachment may cause irreparable   **
** damage to your computer and your files.  If you have any questions  **
** about the validity of this message, PLEASE SEEK HELP BEFORE OPENING IT. **
** **
** This warning was added by the IU Computer Science Dept. mail scanner.   **
*


<>


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Hideyuki Jitsumoto
I attach file (1/2) to this email, as mentioned in the previous one.
I'm very sorry for sending such a large log file.

Thank you,
Hideyuki




[OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Hideyuki Jitsumoto
Hi Members,

I tried to use checkpoint/restart with Open MPI,
but I cannot get correct checkpoint data.
I prepared the execution environment as follows; the strings in () are the
names of the output files attached to the next e-mails (because of the mail
size limit):

1. installed BLCR and checked that BLCR works correctly with "make check"
2. ran ./configure with some parameters in the Open MPI source directory
(config.output / config.log)
3. ran make and make install (make.output.2 / install.output.2)
4. confirmed that mca_crs_blcr.[la|so] and mca_crs_self.[la|so] exist in
/${INSTALL_DIR}/lib/openmpi
5. created ~/.openmpi/mca-params.conf (mca-params.conf)
6. compiled NPB and ran it with -am ft-enable-cr
7. invoked ompi-checkpoint 

As a result, I got the message "Checkpoint failed: no processes checkpointed."
(cr_test_cg)

In addition, when I checked the ompi_info output as in your demo movie, I got
"MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)

What should I do to get checkpointing to work?
Any guidance in this regard would be highly appreciated.

Thank you,
Hideyuki

--
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)


Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-08 Thread Joshua Hursey
You can use the 'checkpoint to local disk' example to checkpoint and restart 
without access to a globally shared storage device. There is an example on the 
website that does not use a globally mounted file system:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local

What version of Open MPI are you using? This functionality is known to be 
broken on the v1.3/1.4 branches, per the ticket below:
  https://svn.open-mpi.org/trac/ompi/ticket/2139

Try the nightly snapshot of the 1.5 branch or the development trunk, and see if 
this issue still occurs.
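
As a rough sketch of the local-disk setup, the three MCA parameters from the quoted message below can be generated for any pair of directories. The parameter names come from that message; `local_ckpt_params` and the example directories are hypothetical, and, per the ticket above, this is not expected to work on the v1.3/1.4 branches.

```shell
# Print MCA parameters for staging checkpoints on node-local disk.
# $1 = node-local snapshot directory, $2 = global snapshot directory
local_ckpt_params() {
    printf 'snapc_base_store_in_place=0\n'
    printf 'crs_base_snapshot_dir=%s\n' "$1"
    printf 'snapc_base_global_snapshot_dir=%s\n' "$2"
}

# Example (hypothetical directories):
#   local_ckpt_params /tmp/ckpt/local "$HOME/checkpoints/global" >> ~/.openmpi/mca-params.conf
```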

-- Josh

On Feb 8, 2010, at 8:35 AM, Andreea Costea wrote:

> I asked this question because checkpointing to NFS is successful, but 
> checkpointing without a mounted file system or shared storage throws this 
> warning:
> 
> WARNING: Could not preload specified file: File already exists. 
> Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0 
> Host: X 
> 
> Will continue attempting to launch the process. 
> 
> 
> filem:rsh: wait_all(): Wait failed (-1) 
> [[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054 
> 
> even if I set the mca-parameters like this:
> snapc_base_store_in_place=0
> crs_base_snapshot_dir=/home/andreea/checkpoints/local
> snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global
> and the nodes can connect through ssh without a password. 
> 
> Thanks,
> Andreea
> 
> On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea  
> wrote:
> Hi,
> 
> Let's say I have an MPI application running on several hosts. Is there any 
> way to checkpoint this application without having a shared storage between 
> the nodes?
> I already took a look at the examples here 
> http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in 
> both cases there is a globally mounted file system. 
> 
> Thanks,
> Andreea
> 
> 




Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-08 Thread Andreea Costea
I asked this question because checkpointing to NFS is successful, but
checkpointing without a mounted file system or shared storage throws this
warning:

WARNING: Could not preload specified file: File already exists.
Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0
Host: X

Will continue attempting to launch the process.


filem:rsh: wait_all(): Wait failed (-1)
[[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054


even if I set the mca-parameters like this:

snapc_base_store_in_place=0
crs_base_snapshot_dir=/home/andreea/checkpoints/local
snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global

and the nodes can connect through ssh without a password.

Thanks,
Andreea

On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea wrote:

> Hi,
>
> Let's say I have an MPI application running on several hosts. Is there any
> way to checkpoint this application without having a shared storage between
> the nodes?
> I already took a look at the examples here
> http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that
> in both cases there is a globally mounted file system.
>
> Thanks,
> Andreea
>
>


[OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-07 Thread Andreea Costea
Hi,

Let's say I have an MPI application running on several hosts. Is there any
way to checkpoint this application without having a shared storage between
the nodes?
I already took a look at the examples here
http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in
both cases there is a globally mounted file system.

Thanks,
Andreea


Re: [OMPI users] OpenMPI checkpoint/restart

2010-01-14 Thread Joshua Hursey

On Jan 14, 2010, at 2:50 AM, Andreea Costea wrote:

> Hei there
> 
> I have some questions regarding checkpoint/restart:
> 
> 1. Until recently I thought that ompi-checkpoint and ompi-restart are used to 
> checkpoint a process inside an MPI application. Now I reread this and I 
> realized that actually what it does is to checkpoint the mpirun process. Does 
> this mean that if I run my application with multiple processes and on 
> multiple nodes in my network the checkpoint file will contain the states of 
> all the processes of my MPI application?

I think you slightly misread the entry. ompi-checkpoint checkpoints the entire 
MPI application, across node boundaries. It requires that the user pass the PID 
of mpirun to serve as a reference point for the command. This way a user can 
run multiple mpiruns from the same machine and only checkpoint a subset of 
those.
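
In shell terms, the point is that the argument is the PID of mpirun, not of an individual rank. A small sketch, assuming a single mpirun owned by the current user; `checkpoint_cmd` is a hypothetical helper that only builds the command line, and the -v and --term options are the ones that appear elsewhere in this archive.

```shell
# Build the checkpoint command for a given mpirun PID.
# Pass "term" as $2 to also terminate the job after the checkpoint.
checkpoint_cmd() {
    if [ "${2:-}" = "term" ]; then
        echo "ompi-checkpoint -v --term $1"
    else
        echo "ompi-checkpoint -v $1"
    fi
}

# Usage (assumes exactly one mpirun owned by the current user):
#   pid=$(pgrep -o -u "$(id -u)" -x mpirun)
#   $(checkpoint_cmd "$pid")
```

With several mpiruns running on the same machine, pick the PID by hand instead of relying on pgrep.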

> 2. Can I restart the application on a different node? 

Yes. If you have trouble doing this, then I would suggest following the 
directions in the BLCR FAQ entry below (it usually addresses 99% of the 
problems people have doing this):
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
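
Before following that FAQ entry it can help to check whether prelinking is enabled on a node at all. The sketch below assumes a Red Hat style configuration file at /etc/sysconfig/prelink with a PRELINKING=yes/no setting; that location is an assumption about the distribution, and `prelink_enabled` is a hypothetical helper. The FAQ remains the reference for the actual fix.

```shell
# Report whether prelinking looks enabled in a Red Hat style config file.
# $1 = config file (default: /etc/sysconfig/prelink, an assumed location)
prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    [ -r "$conf" ] && grep -Eq '^[[:space:]]*PRELINKING=yes' "$conf"
}

if prelink_enabled; then
    echo "prelinking appears enabled; see the BLCR FAQ entry above"
else
    echo "prelinking not detected on this node"
fi
```

Run it on every node involved, since the setting is per machine.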

-- Josh

> 
> Thanks a lot,
> Andreea




[OMPI users] OpenMPI checkpoint/restart

2010-01-14 Thread Andreea Costea
Hei there

I have some questions regarding checkpoint/restart:

1. Until recently I thought that ompi-checkpoint and ompi-restart are used to
checkpoint a process inside an MPI application. Now I reread this and I
realized that what it actually does is checkpoint the mpirun
process. Does this mean that if I run my application with multiple processes
and on multiple nodes in my network the checkpoint file will contain the
states of all the processes of my MPI application?

2. Can I restart the application on a different node?

Thanks a lot,
Andreea