Re: [OMPI users] checkpointing multi node and multi process applications

2010-03-04 Thread Joshua Hursey

On Mar 4, 2010, at 8:17 AM, Fernando Lemos wrote:

> On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos  wrote:
> 
>> Is there anything I can do to provide more information about this bug?
>> E.g. try to compile the code in the SVN trunk? I also have kept the
>> snapshots intact, I can tar them up and upload them somewhere in case
>> you guys need it. I can also provide the source code to the ring
>> program, but it's really the canonical ring MPI example.
>> 
> 
> I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags).
> This time taking the checkpoint didn't generate any error message:
> 
> root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1
> -np 2 --host debian1,debian2 ring
> 
> >>> Process 1 sending 2761 to 0
> >>> Process 1 received 2760
> >>> Process 1 sending 2760 to 0
> root@debian1:~#
> 
> But restoring it did:
> 
> root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
> [debian1:23129] Error: Unable to access the path
> [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
> --
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either
> you have not provided a filename
>   or provided an invalid filename.
>   Please see --help for usage.
> 
> --
> --
> mpirun has exited due to process rank 1 with PID 23129 on
> node debian1 exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --
> root@debian1:~#
> 
> Indeed, opal_snapshot_1.ckpt does not exist:
> 
> root@debian1:~# find ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
> ompi_global_snapshot_23071.ckpt/restart-appfile
> ompi_global_snapshot_23071.ckpt/0
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
> root@debian1:~#
> 
> It can be found in debian2:
> 
> root@debian2:~# find ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/0
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
> root@debian2:~#

By default, Open MPI requires a shared file system to save checkpoint files. So, 
by default, the local snapshot is not moved, since the system assumes that it is 
writing to the same directory on a shared file system. If you want to use the 
local disk staging functionality (which is known to be broken in the 1.4 
series), check out the example on the webpage below:
  http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local
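
For reference, the staged (non-shared file system) variant from that page looks 
roughly like the following on the trunk/v1.5. The MCA parameter names here 
(snapc_base_store_in_place, snapc_base_global_snapshot_dir) are quoted from 
memory of that example, so please double-check them on your build, e.g. with 
"ompi_info --param snapc all":

  shell$ mpirun -am ft-enable-cr \
         -mca snapc_base_store_in_place 0 \
         -mca snapc_base_global_snapshot_dir /home/me/checkpoints \
         -np 2 --host debian1,debian2 ring

With store-in-place disabled, each process checkpoints to its node-local disk 
and the files are then gathered back to the global snapshot directory that 
mpirun can see, which is what a restart from a single node needs.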

> 
> Then I tried supplying a hostfile to ompi-restart and it worked just
> fine! I thought the checkpoint included the host information?

We intentionally do not save the hostfile as part of the checkpoint. Typically 
folks will want to restart on different nodes than those they checkpointed on 
(such as in a batch scheduling environment). If we saved the hostfile, restart 
could behave unexpectedly when the machines the user wants to restart on have 
changed.

If you need to pass a hostfile, you can pass one to ompi-restart just as you 
would to mpirun.
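
For example, something like the following should do it (ompi-restart should 
accept the same hostfile option as mpirun; check ompi-restart --help if your 
build differs):

  shell$ ompi-restart --hostfile hosts ompi_global_snapshot_23071.ckpt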

> 
> So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?

The file staging functionality is known to be broken in the 1.4 series at this 
time, per the ticket below:
  https://svn.open-mpi.org/trac/ompi/ticket/2139

Unfortunately the fix is likely to be both custom for the branch (since we 
redesigned the functionality for the trunk and v1.5) and fairly involved. I 
don't have the time at the moment to work on a fix, but hopefully in the coming 
months I will be able to look into this issue. In the meantime, patches are 
always welcome :)

Hope that helps,
Josh


> 
> 
> Thanks a bunch,
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] checkpointing multi node and multi process applications

2010-03-04 Thread Fernando Lemos
On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos  wrote:

> Is there anything I can do to provide more information about this bug?
> E.g. try to compile the code in the SVN trunk? I also have kept the
> snapshots intact, I can tar them up and upload them somewhere in case
> you guys need it. I can also provide the source code to the ring
> program, but it's really the canonical ring MPI example.
>

I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags).
This time taking the checkpoint didn't generate any error message:

root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1
-np 2 --host debian1,debian2 ring

>>> Process 1 sending 2761 to 0
>>> Process 1 received 2760
>>> Process 1 sending 2760 to 0
root@debian1:~#

But restoring it did:

root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
[debian1:23129] Error: Unable to access the path
[/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
   or provided an invalid filename.
   Please see --help for usage.

--
--
mpirun has exited due to process rank 1 with PID 23129 on
node debian1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
root@debian1:~#

Indeed, opal_snapshot_1.ckpt does not exist:

root@debian1:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
ompi_global_snapshot_23071.ckpt/restart-appfile
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
root@debian1:~#

It can be found in debian2:

root@debian2:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
root@debian2:~#

Then I tried supplying a hostfile to ompi-restart and it worked just
fine! I thought the checkpoint included the host information?

So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?


Thanks a bunch,


[OMPI users] checkpointing multi node and multi process applications

2010-03-03 Thread Fernando Lemos
Hi,


First, I'm hoping setting the subject of this e-mail will get it
attached to the thread that starts with this e-mail:

http://www.open-mpi.org/community/lists/users/2009/12/11608.php

The reason I'm not replying to that thread is that I wasn't subscribed
to the list at the time.


My environment is detailed in another thread, not related at all to this issue:

http://www.open-mpi.org/community/lists/users/2010/03/12199.php


I'm running into the same problem Jean described (though I'm running
1.4.1). Note that taking and restarting from checkpoints works fine
now when I'm using only a single node.

This is what I get by running the job on two nodes, also showing the
output after the checkpoint is taken:

root@debian1# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np
2 --host debian1,debian2 ring

>>> Process 1 sending 2460 to 0
>>> Process 1 received 2459
>>> Process 1 sending 2459 to 0
[debian1:01817] Error: expected_component: PID information unavailable!
[debian1:01817] Error: expected_component: Component Name information
unavailable!
--
mpirun noticed that process rank 0 with PID 1819 on node debian1
exited on signal 0 (Unknown signal 0).
--

Now taking the checkpoint:

root@debian1# ompi-checkpoint --term `ps ax | grep mpirun | grep -v
grep | awk '{print $1}'`
Snapshot Ref.:   0 ompi_global_snapshot_1817.ckpt
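
(That backtick expression just grabs mpirun's PID. Assuming a single mpirun on 
the node, something like this should be equivalent:

root@debian1# ompi-checkpoint --term $(pgrep -x mpirun)
)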

Restarting from the checkpoint:

root@debian1:~# ompi-restart ompi_global_snapshot_1817.ckpt
[debian1:01832] Error: Unable to access the path
[/root/ompi_global_snapshot_1817.ckpt/0/opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
   or provided an invalid filename.
   Please see --help for usage.

--

After spitting that error message, ompi-restart just hangs forever.


Here's something that may or may not matter. debian1 and debian2 are
two virtual machines. They have two network interfaces each:

- eth0: Connected through NAT so that the machine can access the
internet. It gets an address by DHCP (VirtualBox magic), which is
always 10.0.2.15/24 (for both machines). They have no connection to
each other through this interface, they can only access the outside.

- eth1: Connected to an internal VirtualBox interface. Only debian1
and debian2 are members of that internal network (more VirtualBox
magic). The IPs are statically configured, 192.168.200.1/24 for
debian1, 192.168.200.2/24 for debian2.

Since both machines have an IP in the same subnet on eth0 (actually
the same IP), OpenMPI thinks they're in the same network connected
through eth0 too. That's why I need to specify btl_tcp_if_include
eth1, otherwise running jobs across the two nodes will not work
properly (sends and recvs time out).
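
In case the same layout bites anyone else: it may also help to pin the 
out-of-band channel to eth1. The parameter names below are the standard TCP 
ones, but treat this as a sketch and verify them with ompi_info on your build:

root@debian1# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 \
    -mca oob_tcp_if_include eth1 -np 2 --host debian1,debian2 ring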


Is there anything I can do to provide more information about this bug?
E.g. try to compile the code in the SVN trunk? I also have kept the
snapshots intact, I can tar them up and upload them somewhere in case
you guys need it. I can also provide the source code to the ring
program, but it's really the canonical ring MPI example.
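
For reference, here is a minimal sketch of the pattern I mean (not my exact 
source, just the usual ring: rank 0 injects a counter, every rank prints and 
forwards it to the next rank, and rank 0 decrements it once per lap until it 
reaches zero):

/* ring.c -- hypothetical reconstruction of the canonical MPI ring */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value, tag = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    if (rank == 0) {
        value = 100000;                      /* number of laps around the ring */
        MPI_Send(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (1) {
        MPI_Recv(&value, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process %d received %d\n", rank, value);

        if (rank == 0)
            value--;                         /* one full lap completed */

        printf("Process %d sending %d to %d\n", rank, value, next);
        MPI_Send(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD);

        if (value == 0)                      /* token reached zero: stop */
            break;
    }

    if (rank == 0)                           /* drain the last forwarded zero */
        MPI_Recv(&value, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}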

As usual, any info you might need, just ask and I'll provide.


Thanks in advance,


Re: [OMPI users] checkpointing multi node and multi process applications

2010-01-25 Thread Josh Hursey
Actually, let me roll that back a bit. I was preparing a custom patch  
for the v1.4 series, and it seems that the code does not have the bug  
I mentioned. It is only the v1.5 and trunk that were affected by this.  
The v1.4 series should be fine.


I will still ask that the error message fix be brought over to the  
v1.4 branch, but it is unlikely to fix your problem. However it would  
be useful to know if upgrading to the trunk or v1.5 series fixes this  
problem. The v1.4 series has an old version of the file and metadata  
handling mechanisms, so I am encouraging people to move to the v1.5  
series if possible.


-- Josh

On Jan 25, 2010, at 3:33 PM, Josh Hursey wrote:

So while working on the error message, I noticed that the global  
coordinator was using the wrong path to investigate the checkpoint  
metadata. This particular section of code is not often used (which  
is probably why I could not reproduce). I just committed a fix to  
the Open MPI development trunk:

 https://svn.open-mpi.org/trac/ompi/changeset/22479

Additionally, I am asking for this to be brought over to the v1.4  
and v1.5 release branches:

 https://svn.open-mpi.org/trac/ompi/ticket/2195
 https://svn.open-mpi.org/trac/ompi/ticket/2196

It seems to solve the problem as I was able to reproduce it. Can you try  
the trunk (either SVN checkout or nightly tarball from tonight) and  
check if this solves your problem?


Cheers,
Josh

On Jan 25, 2010, at 12:14 PM, Josh Hursey wrote:

I am not able to reproduce this problem with the 1.4 branch using a  
hostfile, and node configuration like you mentioned.


I suspect that the error is caused by a failed local checkpoint.  
The error message is triggered when the global coordinator (located  
in 'mpirun') tries to read the metadata written by the application  
in the local snapshot. If the global coordinator cannot properly  
read the metadata, then it will print a variety of error messages  
depending on what is going wrong.


If these are the only two errors produced, then this typically  
means that the local metadata file has been found, but is empty/corrupted.  
Can you send me the contents of the local checkpoint metadata file:
shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data


It should look something like:
-
#
# PID: 23915
# Component: blcr
# CONTEXT: ompi_blcr_context.23915
-

It may also help to see the following metadata file as well:
shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/global_snapshot_meta.data



If there are other errors printed by the process, that would  
potentially indicate a different problem. So if there are, let me  
know.


This error message should be a bit more specific about which process  
checkpoint is causing the problem, and what this usually indicates. I filed a  
bug to clean up the error:

https://svn.open-mpi.org/trac/ompi/ticket/2190

-- Josh

On Jan 21, 2010, at 8:27 AM, Jean Potsam wrote:


Hi Josh/all,

I have upgraded the openmpi to v 1.4  but still get the same error  
when I try executing the application on multiple nodes:


***
Error: expected_component: PID information unavailable!
Error: expected_component: Component Name information unavailable!
***

I am running my application from the node 'portal11' as follows:

mpirun -am ft-enable-cr -np 2 --hostfile hosts  myapp.

The file 'hosts' contains two host names: portal10, portal11.

I am triggering the checkpoint using ompi-checkpoint -v 'PID' from  
portal11.



I configured open mpi as follows:

#

./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-debug 
--enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace 
--enable-binaries --enable-trace --enable-static=yes --enable-debug 
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr 
--enable-ft-thread --with-blcr=/usr/local/blcr/ 
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

#

Question:

what do you think can be wrong? Please instruct me on how to  
resolve this problem.


Thank you

Jean




--- On Mon, 11/1/10, Josh Hursey  wrote:

From: Josh Hursey 
Subject: Re: [OMPI users] checkpointing multi node and multi  
process applications

To: "Open MPI Users" 
Date: Monday, 11 January, 2010, 21:42


On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:

> Hi Everyone,
>I am trying to checkpoint an mpi  
application running on multiple nodes. However, I get some error  
messages when I trigger the checkpointing process.

>
> Error: expected_component: PID information unavailable!
> Error: expected_component: Component Name information unavailable!
>
> I am using  open mpi 1.3 and blcr 0.8.1

Can you try the v1.4 release and see if the problem persists?

>
> I ex

Re: [OMPI users] checkpointing multi node and multi process applications

2010-01-25 Thread Josh Hursey
So while working on the error message, I noticed that the global  
coordinator was using the wrong path to investigate the checkpoint  
metadata. This particular section of code is not often used (which is  
probably why I could not reproduce). I just committed a fix to the  
Open MPI development trunk:

  https://svn.open-mpi.org/trac/ompi/changeset/22479

Additionally, I am asking for this to be brought over to the v1.4 and  
v1.5 release branches:

  https://svn.open-mpi.org/trac/ompi/ticket/2195
  https://svn.open-mpi.org/trac/ompi/ticket/2196

It seems to solve the problem as I was able to reproduce it. Can you try the  
trunk (either SVN checkout or nightly tarball from tonight) and check  
if this solves your problem?


Cheers,
Josh

On Jan 25, 2010, at 12:14 PM, Josh Hursey wrote:

I am not able to reproduce this problem with the 1.4 branch using a  
hostfile, and node configuration like you mentioned.


I suspect that the error is caused by a failed local checkpoint. The  
error message is triggered when the global coordinator (located in  
'mpirun') tries to read the metadata written by the application in  
the local snapshot. If the global coordinator cannot properly read  
the metadata, then it will print a variety of error messages  
depending on what is going wrong.


If these are the only two errors produced, then this typically means  
that the local metadata file has been found, but is empty/corrupted.  
Can you send me the contents of the local checkpoint metadata file:
 shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data


It should look something like:
-
#
# PID: 23915
# Component: blcr
# CONTEXT: ompi_blcr_context.23915
-

It may also help to see the following metadata file as well:
shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/global_snapshot_meta.data



If there are other errors printed by the process, that would  
potentially indicate a different problem. So if there are, let me  
know.


This error message should be a bit more specific about which process  
checkpoint is causing the problem, and what this usually indicates. I filed a  
bug to clean up the error:

 https://svn.open-mpi.org/trac/ompi/ticket/2190

-- Josh

On Jan 21, 2010, at 8:27 AM, Jean Potsam wrote:


Hi Josh/all,

I have upgraded the openmpi to v 1.4  but still get the same error  
when I try executing the application on multiple nodes:


***
Error: expected_component: PID information unavailable!
Error: expected_component: Component Name information unavailable!
***

I am running my application from the node 'portal11' as follows:

mpirun -am ft-enable-cr -np 2 --hostfile hosts  myapp.

The file 'hosts' contains two host names: portal10, portal11.

I am triggering the checkpoint using ompi-checkpoint -v 'PID' from  
portal11.



I configured open mpi as follows:

#

./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-debug 
--enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace 
--enable-binaries --enable-trace --enable-static=yes --enable-debug 
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr 
--enable-ft-thread --with-blcr=/usr/local/blcr/ 
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

#

Question:

what do you think can be wrong? Please instruct me on how to  
resolve this problem.


Thank you

Jean




--- On Mon, 11/1/10, Josh Hursey  wrote:

From: Josh Hursey 
Subject: Re: [OMPI users] checkpointing multi node and multi  
process applications

To: "Open MPI Users" 
Date: Monday, 11 January, 2010, 21:42


On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:

> Hi Everyone,
>I am trying to checkpoint an mpi  
application running on multiple nodes. However, I get some error  
messages when I trigger the checkpointing process.

>
> Error: expected_component: PID information unavailable!
> Error: expected_component: Component Name information unavailable!
>
> I am using  open mpi 1.3 and blcr 0.8.1

Can you try the v1.4 release and see if the problem persists?

>
> I execute my application as follows:
>
> mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.
>
> My question:
>
> Does openmpi with blcr support checkpointing of multi node  
execution of mpi application? If so, can you provide me with some  
information on how to achieve this.


Open MPI is able to checkpoint a multi-node application (that's  
what it was designed to do). There are some examples at the link  
below:

 http://www.osl.iu.edu/research/ft/ompi-cr/examples.php

-- Josh

>
> Cheers,
>
> Jean.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/li

Re: [OMPI users] checkpointing multi node and multi process applications

2010-01-25 Thread Josh Hursey
I am not able to reproduce this problem with the 1.4 branch using a  
hostfile, and node configuration like you mentioned.


I suspect that the error is caused by a failed local checkpoint. The  
error message is triggered when the global coordinator (located in  
'mpirun') tries to read the metadata written by the application in the  
local snapshot. If the global coordinator cannot properly read the  
metadata, then it will print a variety of error messages depending on  
what is going wrong.


If these are the only two errors produced, then this typically means  
that the local metadata file has been found, but is empty/corrupted.  
Can you send me the contents of the local checkpoint metadata file:
  shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data


It should look something like:
-
#
# PID: 23915
# Component: blcr
# CONTEXT: ompi_blcr_context.23915
-

It may also help to see the following metadata file as well:
 shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/global_snapshot_meta.data



If there are other errors printed by the process, that would  
potentially indicate a different problem. So if there are, let me know.


This error message should be a bit more specific about which process  
checkpoint is causing the problem, and what this usually indicates. I filed a  
bug to clean up the error:

  https://svn.open-mpi.org/trac/ompi/ticket/2190

-- Josh

On Jan 21, 2010, at 8:27 AM, Jean Potsam wrote:


Hi Josh/all,

I have upgraded the openmpi to v 1.4  but still get the same error  
when I try executing the application on multiple nodes:


***
 Error: expected_component: PID information unavailable!
 Error: expected_component: Component Name information unavailable!
***

I am running my application from the node 'portal11' as follows:

mpirun -am ft-enable-cr -np 2 --hostfile hosts  myapp.

The file 'hosts' contains two host names: portal10, portal11.

I am triggering the checkpoint using ompi-checkpoint -v 'PID' from  
portal11.



I configured open mpi as follows:

#

./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-debug 
--enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace 
--enable-binaries --enable-trace --enable-static=yes --enable-debug 
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr 
--enable-ft-thread --with-blcr=/usr/local/blcr/ 
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

#

Question:

what do you think can be wrong? Please instruct me on how to resolve  
this problem.


Thank you

Jean




--- On Mon, 11/1/10, Josh Hursey  wrote:

From: Josh Hursey 
Subject: Re: [OMPI users] checkpointing multi node and multi process  
applications

To: "Open MPI Users" 
Date: Monday, 11 January, 2010, 21:42


On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:

> Hi Everyone,
>I am trying to checkpoint an mpi  
application running on multiple nodes. However, I get some error  
messages when I trigger the checkpointing process.

>
> Error: expected_component: PID information unavailable!
> Error: expected_component: Component Name information unavailable!
>
> I am using  open mpi 1.3 and blcr 0.8.1

Can you try the v1.4 release and see if the problem persists?

>
> I execute my application as follows:
>
> mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.
>
> My question:
>
> Does openmpi with blcr support checkpointing of multi node  
execution of mpi application? If so, can you provide me with some  
information on how to achieve this.


Open MPI is able to checkpoint a multi-node application (that's what  
it was designed to do). There are some examples at the link below:

  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php

-- Josh

>
> Cheers,
>
> Jean.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] checkpointing multi node and multi process applications

2010-01-21 Thread Jean Potsam
Hi Josh/all,

I have upgraded the openmpi to v 1.4  but still get the same error when I try 
executing the application on multiple nodes:

***
 Error: expected_component: PID information unavailable!
 Error: expected_component: Component Name information unavailable!
***

I am running my application from the node 'portal11' as follows:

mpirun -am ft-enable-cr -np 2 --hostfile hosts  myapp.

The file 'hosts' contains two host names: portal10, portal11.

I am triggering the checkpoint using ompi-checkpoint -v 'PID' from portal11.


I configured open mpi as follows:

#

./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-debug 
--enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace 
--enable-binaries --enable-trace --enable-static=yes --enable-debug 
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr 
--enable-ft-thread --with-blcr=/usr/local/blcr/ 
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
#

Question:



what do you think can be wrong? Please instruct me on how to resolve this 
problem.


Thank you

Jean


 

--- On Mon, 11/1/10, Josh Hursey  wrote:

From: Josh Hursey 
Subject: Re: [OMPI users] checkpointing multi node and multi process 
applications
To: "Open MPI Users" 
Date: Monday, 11 January, 2010, 21:42


On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:

> Hi Everyone,
>                        I am trying to checkpoint an mpi application running 
> on multiple nodes. However, I get some error messages when I trigger the 
> checkpointing process.
> 
> Error: expected_component: PID information unavailable!
> Error: expected_component: Component Name information unavailable!
> 
> I am using  open mpi 1.3 and blcr 0.8.1

Can you try the v1.4 release and see if the problem persists?

> 
> I execute my application as follows:
> 
> mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.
> 
> My question:
> 
> Does openmpi with blcr support checkpointing of multi node execution of mpi 
> application? If so, can you provide me with some information on how to 
> achieve this.

Open MPI is able to checkpoint a multi-node application (that's what it was 
designed to do). There are some examples at the link below:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php

-- Josh

> 
> Cheers,
> 
> Jean.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] checkpointing multi node and multi process applications

2010-01-11 Thread Josh Hursey


On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:


Hi Everyone,
   I am trying to checkpoint an mpi application  
running on multiple nodes. However, I get some error messages when I  
trigger the checkpointing process.


Error: expected_component: PID information unavailable!
Error: expected_component: Component Name information unavailable!

I am using  open mpi 1.3 and blcr 0.8.1


Can you try the v1.4 release and see if the problem persists?



I execute my application as follows:

mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.

My question:

Does openmpi with blcr support checkpointing of multi node execution  
of mpi application? If so, can you provide me with some information  
on how to achieve this.


Open MPI is able to checkpoint a multi-node application (that's what  
it was designed to do). There are some examples at the link below:

  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php

-- Josh



Cheers,

Jean.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] checkpointing multi node and multi process applications

2009-12-19 Thread Jean Potsam
Hi Everyone,
   I am trying to checkpoint an mpi application running on 
multiple nodes. However, I get some error messages when I trigger the 
checkpointing process.

Error: expected_component: PID information unavailable!
Error: expected_component: Component Name information unavailable!

I am using  open mpi 1.3 and blcr 0.8.1

I execute my application as follows:

mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.

My question:

Does openmpi with blcr support checkpointing of multi node execution of mpi 
application? If so, can you provide me with some information on how to achieve 
this.

Cheers,

Jean.