Re: [OMPI users] About the necessity of cancelation of pending communication and the use of buffer

2010-05-25 Thread Fernando Lemos
On Tue, May 25, 2010 at 1:03 AM, Yves Caniou  wrote:
> 2 ** When I use a Isend() operation, the manpage says that I can't use the
> buffer until the operation completes.
> What happens if I use an Isend() operation in a function, with a buffer
> declared inside the function?
> Do I have to Wait() for the communication to finish before returning, or to
> declare the buffer as a global variable?

If you declare it inside the function (an auto variable), you're
declaring it on the stack. If the function returns before the
communication completes, that stack space may be reused, which is
going to have nasty effects. You don't need to declare the buffer as a
global; just allocate it on the heap (with new or malloc or whatever),
and make sure you don't lose track of it, because you'll need to free
that memory eventually, once the request has completed (e.g., after a
Wait()).
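
For illustration, a minimal sketch of that pattern (not from the
original mail; sizes and names are made up): the buffer is
heap-allocated so it survives the function, and the caller frees it
only after the request completes.

#include <mpi.h>
#include <cstdlib>

/* Start a non-blocking send from inside a function. The buffer lives on the
 * heap, not on this function's stack frame, so returning before the send
 * completes is safe as long as the caller keeps the pointer and the request. */
double *start_send(int dest, MPI_Request *req)
{
    double *buf = static_cast<double *>(std::malloc(100 * sizeof(double)));
    for (int i = 0; i < 100; ++i)
        buf[i] = i;
    MPI_Isend(buf, 100, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, req);
    return buf;  /* caller owns the buffer until the request completes */
}

/* Caller, somewhere later:
 *     MPI_Request req;
 *     double *buf = start_send(1, &req);
 *     ... other work ...
 *     MPI_Wait(&req, MPI_STATUS_IGNORE);  // only now may the buffer be freed
 *     std::free(buf);
 */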


Re: [OMPI users] getc in openmpi

2010-05-12 Thread Fernando Lemos
On Wed, May 12, 2010 at 2:51 PM, Jeff Squyres  wrote:
> On May 12, 2010, at 1:48 PM, Hanjun Kim wrote:
>
>> I am working on parallelizing my sequential program using OpenMPI.
>> Although I got performance speedup using many threads, there was
>> slowdown on a small number of threads like 4 threads.
>> I found that it is because getc worked much slower than sequential
>> version. Does OpenMPI override or wrap getc function?
>
> No.

Please correct me if I'm wrong, but I believe OpenMPI forwards
mpirun's stdin to rank 0, and the other way around with stdout/stderr:
output from the other ranks is sent back to mpirun so it can be
displayed. Otherwise it wouldn't even be possible to see the output
from the other ranks. I guess that forwarding could make things slower.

MPICH-2 had a command line option that told mpiexec which processes
would receive stdin (all of them or only some) so that you could do
things like mpiexec 

Re: [OMPI users] communicate C++ STL strucutures ??

2010-05-07 Thread Fernando Lemos
On Fri, May 7, 2010 at 5:33 PM, Cristobal Navarro  wrote:
> Hello,
>
> my question is the following.
>
> is it possible to send and receive C++ objects or STL structures (for
> example, send map myMap) through openMPI SEND and RECEIVE functions?
> at first glance i thought it was possible, but after reading some doc, im
> not sure.
> i dont have my source code at that stage for testing yet

Not normally; you have to serialize the data before sending it and
deserialize it after receiving it. You could also use Boost.MPI
together with Boost.Serialization, which would probably be the best
way to go.
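
To illustrate the Boost route (a rough sketch, untested against your
code): Boost.Serialization already knows how to handle std::map, so
with Boost.MPI the map can be sent and received like a plain value.

#include <boost/mpi.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/string.hpp>
#include <map>
#include <string>

namespace mpi = boost::mpi;

int main(int argc, char *argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;

    std::map<std::string, int> myMap;
    if (world.rank() == 0) {
        myMap["answer"] = 42;
        world.send(1, 0, myMap);  /* serialized behind the scenes */
    } else if (world.rank() == 1) {
        world.recv(0, 0, myMap);  /* deserialized on receipt */
    }
    return 0;
}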


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-14 Thread Fernando Lemos
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto
 wrote:
> Fernando,
>
> Thank you for your reply.
> I tried to patch the file you mentioned, but the output did not change.

I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
works great.

>>Are you using a shared file system? You need to use a shared file
> system for checkpointing with 1.4.1:
> What is the shared file system ? do you mean NFS, Lustre and so on ?
> (I'm sorry about my ignorance...)

Something like NFS, yea.

> If I use only one node for application, do I need such a shared-file-system ?

No, for a single node, checkpointing with 1.4.1 should work (it works
for me, at least). If you're using a single node, then your problem is
probably not related to the bug report I posted.


Regards,


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Fernando Lemos
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto
 wrote:
> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint 
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)

Are you using a shared file system? You need to use a shared file
system for checkpointing with 1.4.1:

https://svn.open-mpi.org/trac/ompi/ticket/2139

Regards,


Re: [OMPI users] Adding new process to running job

2010-04-10 Thread Fernando Lemos
On Sat, Apr 10, 2010 at 6:07 AM, Juergen Kaiser  wrote:
> Hi,
>
> is it possible to add a new MPI process to a set of running MPI processes
> such that they can commnicate as usual? If so, how?

OpenMPI supports MPI-2, so, as far as I can tell, yes, you can do this
using the dynamic process management functions defined by MPI-2
(MPI_Comm_spawn and friends). Note that this has to be done from the
application code.

Take my words with a grain of salt, though, as I'm not an MPI guru (by far).
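
For what it's worth, here's a rough sketch of what that looks like in
the application code (untested; the message and process counts are
made up): the running job calls MPI_Comm_spawn and then talks to the
new processes over the resulting intercommunicator.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Original job: collectively spawn 2 more copies of this executable. */
        MPI_Comm children;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 0, 0, children);  /* to spawned rank 0 */
        }
    } else {
        /* We were spawned: receive from rank 0 of the parent job. */
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        std::printf("spawned rank %d received %d\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}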

Regards,


[OMPI users] Using a rankfile for ompi-restart

2010-04-08 Thread Fernando Lemos
Hello,


I've noticed that ompi-restart doesn't support the --rankfile option.
It only supports --hostfile/--machinefile. Is there any reason
--rankfile isn't supported?

Suppose you have a cluster without a shared file system. When a node
fails, you transfer its checkpoint to a spare node and invoke
ompi-restart. In 1.5, ompi-restart automagically handles this
situation (if you supply a hostfile) and is able to restart the
process, but I'm afraid it might not always be able to find the
checkpoints this way. If you could tell ompi-restart where the ranks
are (and thus where the checkpoints are), then maybe restart would
always work (as long as you've specified the location of the
checkpoints correctly), or maybe ompi-restart would be faster.



Regards,


Re: [OMPI users] orted: error while loading shared libraries

2010-04-08 Thread Fernando Lemos
On Thu, Apr 8, 2010 at 10:31 AM, Jeff Squyres  wrote:
> Yes.  There is usually a difference between interactive logins and 
> non-interactive logins on which paths, etc. get set.  Look in your shell 
> startup and see if there is somewhere that it exits early (or otherwise 
> doesn't process) for non-interactive logins.
>
> In short: you need to ensure that your paths (etc.) are setup properly for 
> both interactive and non-interactive logins.

Here's a tip: take a look at your shell's man page. If I recall
correctly, bash only reads .bashrc for interactive shells and
.bash_profile for login shells, or something like that. So you might
want to export LD_LIBRARY_PATH in .bash_profile too.



Re: [OMPI users] ompi-checkpoint --term

2010-03-31 Thread Fernando Lemos
On Wed, Mar 31, 2010 at 7:39 PM, Addepalli, Srirangam V
 wrote:
> Hello All.
> I am trying to checkpoint a mpi application that has been started using the 
> follwong mpirun command
>
> mpirun -am ft-enable-cr -np 8 pw.x  < Ge46.pw.in > Ge46.ph.out
>
> ompi-checkpoint 31396 ( Works) How ever when i try to terminate the process
>
> ompi-checkpoint  --term 31396  it never finishes.  How do i bebug this issue.

ompi-checkpoint --term is exactly ompi-checkpoint + sending SIGTERM to
your app. If plain ompi-checkpoint finishes but the --term variant
doesn't, then your app is probably not dealing with SIGTERM correctly.

Make sure you're not ignoring SIGTERM; you need to either handle it or
let it kill your app. If it's a multithreaded app, make sure you
"distribute" the SIGTERM to ALL the threads, i.e., when you receive
SIGTERM, notify all the other threads that they should quit so they
can be joined.
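
A rough sketch of one way to do that (illustrative only; C++11 threads
here, but the same idea applies to pthreads): the handler just sets a
flag, and the worker threads poll it and return so they can be joined.

#include <atomic>
#include <csignal>
#include <thread>
#include <vector>

std::atomic<bool> stop_requested(false);

extern "C" void handle_sigterm(int)
{
    stop_requested = true;  /* only set a flag; keep the handler signal-safe */
}

void worker()
{
    while (!stop_requested) {
        /* ... do a bounded chunk of work, then re-check the flag ... */
    }
}

int main()
{
    std::signal(SIGTERM, handle_sigterm);

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker);
    for (std::thread &t : threads)
        t.join();  /* every thread has seen the flag and returned */
    return 0;
}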

Regards,



Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 1:25 PM, fengguang tian  wrote:
> now, I set $HOME as shared directory, but when doing ompi-checkpoint, it
> shows:(nimbus1 is the remote machine in
> my cluster)
>
> [nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the
> sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of
> (/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt), mkdir
> failed [1]
> [nimbus1:12630] Error: No metadata filename specified!
>
> why is that?

The error is described in the error message...

[nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the
sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of
(/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt),
mkdir failed [1]

If the number in brackets is an errno value, that's EPERM, "Operation
not permitted". Most likely the user running mpirun doesn't have the
necessary privileges to write to the shared file system (e.g., the
file system is mounted read-only, or you don't have write access to
the directory, or something of that sort).

Also, please make sure you don't post the same issue twice to the mailing list.


Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian  wrote:
>
> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
> --hostfile .mpihostfile 
> to store the global checkpoint snapshot into the shared
> directory:/mirror,but the problems are still there,
> when ompi-checkpoint, the mpirun is still not killed,it is hanging
> there.when doing ompi-restart, it shows:
>
> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
> --
> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
> either you have not provided a filename
>    or provided an invalid filename.
>    Please see --help for usage.
>
> --

Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
1.4 (then again, I didn't try 1.4 with a shared file system).



Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian  wrote:
> Hi
>
> I am using open-mpi and blcr in a cluster of 3 machines, and the checkpoint
> and restart work fine in single machine,but when doing checkpoint in
> clusters environment, the ompi-checkpoint hangs

Besides what has been said in another thread (regarding 1.4 and
checkpointing to shared directories), you might want to make sure your
app actually terminates when you send it a SIGTERM. Some apps ignore
SIGTERM or handle it in a way that doesn't cause them to quit.

ompi-checkpoint --term is simply ompi-checkpoint + sending SIGTERM to
the application (not sure whether SIGTERM is sent to each process
individually or not).


Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian  wrote:
> I have created the shared file system. but I created a /mirror at root
> directory,not at the $HOME directory,is that the
> problem? thank you

Others might be able to give you a more accurate explanation. The way
I understood it, in OpenMPI 1.4 you need to write all checkpoints to
a single, shared location. That's why you generally want a shared file
system.

Now, I'm pretty sure you can change the directory to which the
checkpoints are written. If your $HOME isn't a shared directory, you
can point OpenMPI at the shared directory instead.
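
If I'm reading your earlier command right, that's the
snapc_base_global_snapshot_dir MCA parameter you were already passing;
something along these lines (untested, paths taken from your mail)
should write the checkpoints to /mirror:

mpirun -np 50 -am ft-enable-cr \
       --mca snapc_base_global_snapshot_dir /mirror \
       --hostfile .mpihostfile hellompi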

In OpenMPI 1.5 (unstable), some magic allows you to create the
checkpoints and restore them without a shared directory.

Regards,


Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian  wrote:
> I set up a cluster of 18 nodes using Open MPI and BLCR library, and the MPI
> program runs well on the clusters,
> but how to checkpoint the MPI program on this clusters?
> for example:
> here is what I do for a test:
> mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr
> hellompi
>  the program will run on the clusters
> then ,I enter:
> mpiu@nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
>
> but the MPI program are not terminated as what happaned on single
> machine,although it created a checkpoint file“ompi_global_snapshot_
> 14030.ckpt“ in the home directory on master node.

Are you using OpenMPI 1.4 without a shared file system mounted at
$HOME? If yes, then take a look here:

http://www.open-mpi.org/community/lists/users/2010/03/12246.php

Regards,



Re: [OMPI users] Problem in remote nodes

2010-03-17 Thread Fernando Lemos
On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres  wrote:
> On Mar 17, 2010, at 4:39 AM,  wrote:
>
>> Hi everyone I'm a new Open MPI user and I have just installed Open MPI in
>> a 6 nodes cluster with Scientific Linux. When I execute it in local it
>> works perfectly, but when I try to execute it on the remote nodes with the
>> --host  option it hangs and gives no message. I think that the problem
>> could be with the shared libraries but i'm not sure. In my opinion the
>> problem is not ssh because i can access to the nodes with no password
>
> You might want to check that Open MPI processes are actually running on the 
> remote nodes -- check with ps if you see any "orted" or other MPI-related 
> processes (e.g., your processes).
>
> Do you have any TCP firewall software running between the nodes?  If so, 
> you'll need to disable it (at least for Open MPI jobs).

I also recommend running mpirun with the option --mca btl_base_verbose
30 to troubleshoot tcp issues.

In some environments, you need to explicitly tell mpirun what network
interfaces it can use to reach the hosts. Read the following FAQ
section for more information:

http://www.open-mpi.org/faq/?category=tcp

Item 7 of the FAQ might be of special interest.

Regards,



Re: [OMPI users] Problem in using openmpi

2010-03-12 Thread Fernando Lemos
On Fri, Mar 12, 2010 at 6:02 PM, Samuel K. Gutierrez  wrote:
> One more thing.  The line should have been:
>
> export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64
>
> The space in the previous email will make bash unhappy 8-|.
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Mar 12, 2010, at 1:56 PM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> It sounds like you may need to set your LD_LIBRARY_PATH environment
>> variable correctly.  There are several ways that you can tell the dynamic
>> linker where the required libraries are located, but the following may be
>> sufficient for your needs.
>>
>> Let's say, for example, that your Open MPI installation is rooted at
>> /home/jess/local/ompi and the libraries are located in
>> /home/jess/local/ompi/lib64, try (bash-like shell):
>>
>> export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64
>>
>> Hope this helps,
>>
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:
>>
>>> Hi,
>>>
>>> I have installed openmpi on an Kubuntu , with Dual core Linux AMD Athlon
>>> When trying to compile a simple program, I am getting an error.
>>>
>>> mpicc: error while loading shared libraries: libopen-pal.so.0: cannot
>>> open shared object file: No such file or dir
>>>
>>> I read somewhere that this error is because of some intel compiler
>>> being not installed on the proper node, which I don't understand as I
>>> am using AMD.
>>>
>>> Kindly give your suggestions
>>>
>>> Thank You

It's probably a packaging error if he used the distribution's
packages. In that case, he should report the bug to the distribution
(downstream).

If he installed from source, then it's most likely installed somewhere
outside the library search path, and the LD_LIBRARY_PATH trick might
work (if it doesn't, make sure there are no leftovers from a previous
install, recompile, reinstall, and it should work fine).


Regards,



Re: [OMPI users] change hosts to restart the checkpoint

2010-03-07 Thread Fernando Lemos
On Fri, Mar 5, 2010 at 12:03 PM, Josh Hursey  wrote:
> This type of failure is usually due to prelink'ing being left enabled on one
> or more of the systems. This has come up multiple times on the Open MPI
> list, but is actually a problem between BLCR and the Linux kernel. BLCR has
> a FAQ entry on this that you will want to check out:
>  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>
> If that does not work, then we can look into other causes.

I also suggest checkpointing and restarting the app with BLCR
directly. I.e., take any simple app, run it with cr_run, checkpoint it
with cr_checkpoint then restart it with cr_restart. Make sure the blcr
module is loaded too. That way you can tell whether it's related to
OpenMPI or not.

Regards,



Re: [OMPI users] checkpointing multi node and multi process applications

2010-03-04 Thread Fernando Lemos
On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos <fernando...@gmail.com> wrote:

> Is there anything I can do to provide more information about this bug?
> E.g. try to compile the code in the SVN trunk? I also have kept the
> snapshots intact, I can tar them up and upload them somewhere in case
> you guys need it. I can also provide the source code to the ring
> program, but it's really the canonical ring MPI example.
>

I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags).
This time taking the checkpoint didn't generate any error message:

root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1
-np 2 --host debian1,debian2 ring

>>> Process 1 sending 2761 to 0
>>> Process 1 received 2760
>>> Process 1 sending 2760 to 0
root@debian1:~#

But restoring it did:

root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
[debian1:23129] Error: Unable to access the path
[/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
   or provided an invalid filename.
   Please see --help for usage.

--
--
mpirun has exited due to process rank 1 with PID 23129 on
node debian1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
root@debian1:~#

Indeed, opal_snapshot_1.ckpt does not exist:

root@debian1:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
ompi_global_snapshot_23071.ckpt/restart-appfile
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
root@debian1:~#

It can be found in debian2:

root@debian2:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
root@debian2:~#

Then I tried supplying a hostfile to ompi-restart and it worked just
fine! I thought the checkpoint included the host information?

So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?


Thanks a bunch,


[OMPI users] checkpointing multi node and multi process applications

2010-03-03 Thread Fernando Lemos
Hi,


First, I'm hoping that setting the subject of this e-mail will get it
attached to the thread that starts with the following e-mail:

http://www.open-mpi.org/community/lists/users/2009/12/11608.php

The reason I'm not replying to that thread is that I wasn't subscribed
to the list at the time.


My environment is detailed in another thread, not related at all to this issue:

http://www.open-mpi.org/community/lists/users/2010/03/12199.php


I'm running into the same problem Jean described (though I'm running
1.4.1). Note that taking and restarting from checkpoints works fine
now when I'm using only a single node.

This is what I get by running the job on two nodes, also showing the
output after the checkpoint is taken:

root@debian1# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np
2 --host debian1,debian2 ring

>>> Process 1 sending 2460 to 0
>>> Process 1 received 2459
>>> Process 1 sending 2459 to 0
[debian1:01817] Error: expected_component: PID information unavailable!
[debian1:01817] Error: expected_component: Component Name information
unavailable!
--
mpirun noticed that process rank 0 with PID 1819 on node debian1
exited on signal 0 (Unknown signal 0).
--

Now taking the checkpoint:

root@debian1# ompi-checkpoint --term `ps ax | grep mpirun | grep -v
grep | awk '{print $1}'`
Snapshot Ref.:   0 ompi_global_snapshot_1817.ckpt

Restarting from the checkpoint:

root@debian1:~# ompi-restart ompi_global_snapshot_1817.ckpt
[debian1:01832] Error: Unable to access the path
[/root/ompi_global_snapshot_1817.ckpt/0/opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
   or provided an invalid filename.
   Please see --help for usage.

--

After spitting that error message, ompi-restart just hangs forever.


Here's something that may or may not matter. debian1 and debian2 are
two virtual machines. They have two network interfaces each:

- eth0: Connected through NAT so that the machine can access the
internet. It gets an address by DHCP (VirtualBox magic), which is
always 10.0.2.15/24 (for both machines). They have no connection to
each other through this interface, they can only access the outside.

- eth1: Connected to an internal VirtualBox interface. Only debian1
and debian2 are members of that internal network (more VirtualBox
magic). The IPs are statically configured, 192.168.200.1/24 for
debian1, 192.168.200.2/24 for debian2.

Since both machines have an IP in the same subnet on eth0 (actually
the same IP), OpenMPI thinks they're in the same network connected
through eth0 too. That's why I need to specify btl_tcp_if_include
eth1, otherwise running jobs across the two nodes will not work
properly (sends and recvs time out).


Is there anything I can do to provide more information about this bug?
E.g. try to compile the code in the SVN trunk? I also have kept the
snapshots intact, I can tar them up and upload them somewhere in case
you guys need it. I can also provide the source code to the ring
program, but it's really the canonical ring MPI example.
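
For reference, a sketch of what the canonical ring test does (this is
an approximation, not my exact program; the lap count and tag are made
up): rank 0 injects a counter, every rank forwards it to the next
rank, and rank 0 decrements it once per lap until it reaches zero.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    int token;

    if (rank == 0) {
        token = 10000;  /* number of laps around the ring */
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    while (1) {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank == 0)
            --token;  /* rank 0 decrements the token once per lap */
        std::printf(">>> Process %d sending %d to %d\n", rank, token, next);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        if (token == 0)
            break;
    }
    if (rank == 0)  /* absorb the last message coming around the ring */
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}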

As usual, any info you might need, just ask and I'll provide.


Thanks in advance,


Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-03-03 Thread Fernando Lemos
On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey  wrote:

>
> Yes, ompi-restart should be printing a helpful message and exiting normally. 
> Thanks for the bug report. I believe that I have seen and fixed this on a 
> development branch making its way to the trunk. I'll make sure to move the 
> fix to the 1.4 series once it has been applied to the trunk.
>
> I filed a ticket on this if you wanted to track the issue.
>  https://svn.open-mpi.org/trac/ompi/ticket/2329

Ah, that's great. Just wondering, do you have any idea why blcr-util
is required? That package only contains the cr_* binaries (cr_restart,
cr_checkpoint, cr_run) and some docs (manpages, changelog, etc.). I've
filed a Debian bug (#572229) about making openmpi-checkpoint depend
on blcr-util, but the package maintainer told me he found it unusual
that ompi-restart would depend on the cr_* binaries since libcr
supposedly provides all the functionality ompi-restart needs.

I'm about to compile OpenMPI in debug mode and take a look at the
backtrace to see if I can understand what's going on.

Btw, this is the list of files in the blcr-util package:
http://packages.debian.org/sid/amd64/blcr-util/filelist . As you can
see, only cr_* binaries and docs.

>
> Thanks again,
> Josh

Thank you!



Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-03-02 Thread Fernando Lemos
On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos <fernando...@gmail.com> wrote:
> Hello,
>
>
> I'm trying to come up with a fault tolerant OpenMPI setup for research
> purposes. I'm doing some tests now, but I'm stuck with a segfault when
> I try to restart my test program from a checkpoint.
>
> My test program is the "ring" program, where messages are sent to the
> next node in the ring N times. It's pretty simple, I can supply the
> source code if needed. I'm running it like this:
>
> # mpirun -np 4 -am ft-enable-cr ring
> ...
>>>> Process 1 sending 703 to 2
>>>> Process 3 received 704
>>>> Process 3 sending 704 to 0
>>>> Process 3 received 703
>>>> Process 3 sending 703 to 0
> --
> mpirun noticed that process rank 0 with PID 18358 on node debian1
> exited on signal 0 (Unknown signal 0).
> --
> 4 total processes killed (some possibly by mpirun during cleanup)
>
> That's the output when I ompi-checkpoint the mpirun PID from another terminal.
>
> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
> checkpoint directory has been created in $HOME.
>
> This is what I get when I try to run ompi-restart
>
> root@debian1:~# ps ax | grep mpirun
> 18357 pts/0    R+     0:01 mpirun -np 4 -am ft-enable-cr ring
> 18378 pts/5    S+     0:00 grep mpirun
> root@debian1:~# ompi-checkpoint 18357
> Snapshot Ref.:   0 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-checkpoint --term 18357
> Snapshot Ref.:   1 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
> --
> Error: Unable to obtain the proper restart command to restart from the
>       checkpoint file (opal_snapshot_2.ckpt). Returned -1.
>
> --
> [debian1:18384] *** Process received signal ***
> [debian1:18384] Signal: Segmentation fault (11)
> [debian1:18384] Signal code: Address not mapped (1)
> [debian1:18384] Failing at address: 0x725f725f
> [debian1:18384] [ 0] [0xb775f40c]
> [debian1:18384] [ 1]
> /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
> [debian1:18384] [ 2]
> /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
> [debian1:18384] [ 3]
> /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
> [debian1:18384] [ 4] opal-restart [0x804908e]
> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)
> [0xb7568b55]
> [debian1:18384] [ 6] opal-restart [0x8048fc1]
> [debian1:18384] *** End of error message ***
> --
> mpirun noticed that process rank 2 with PID 18384 on node debian1
> exited on signal 11 (Segmentat
> --
>
> I used a clean install of Debian Squeeze (testing) to make sure my
> environment was ok. Those are the steps I took:
>
> - Installed Debian Squeeze, only base packages
> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
> tools, BLCR dev and run-time environment)
> - Compiled openmpi-1.4.1
>
> Note that I did compile openmpi-1.4.1 because the Debian package
> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
> are no leftovers from any previous install of Debian packages
> supplying OpenMPI because this is a fresh install, no openmpi package
> had been installed before.
>
> I used the following configure options:
>
> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>
> I also tried to add the option --with-memory-manager=none because I
> saw an e-mail on the mailing list that described this as a possible
> solution to an (apparently) not related problem, but the problem
> remains the same.
>
> I don't have config.log (I rm'ed the build dir), but if you think it's
> necessary I can recompile OpenMPI and provide it.
>
> Some information about the system (VirtualBox virtual machine, single
> processor, btw):
>
> Kernel version 2.6.32-trunk-686
>
> root@debian1:~# lsmod | grep blcr
> blcr                   79084  0
> blcr_imports            2077  1 blcr
>
> libcr (BLCR) is version 0.8.2-9.
>
> gcc is version 4.4.3.
>
>
> Please let me know of any other information you might need.
>
>
> Thanks in advance,
>

Hello,

I figured it out. The problem is that the Debian package blcr-util,
which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.)
wasn't in

[OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-02-28 Thread Fernando Lemos
Hello,


I'm trying to come up with a fault tolerant OpenMPI setup for research
purposes. I'm doing some tests now, but I'm stuck with a segfault when
I try to restart my test program from a checkpoint.

My test program is the "ring" program, where messages are sent to the
next node in the ring N times. It's pretty simple, I can supply the
source code if needed. I'm running it like this:

# mpirun -np 4 -am ft-enable-cr ring
...
>>> Process 1 sending 703 to 2
>>> Process 3 received 704
>>> Process 3 sending 704 to 0
>>> Process 3 received 703
>>> Process 3 sending 703 to 0
--
mpirun noticed that process rank 0 with PID 18358 on node debian1
exited on signal 0 (Unknown signal 0).
--
4 total processes killed (some possibly by mpirun during cleanup)

That's the output when I ompi-checkpoint the mpirun PID from another terminal.

The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
checkpoint directory has been created in $HOME.

This is what I get when I try to run ompi-restart

root@debian1:~# ps ax | grep mpirun
18357 pts/0    R+     0:01 mpirun -np 4 -am ft-enable-cr ring
18378 pts/5    S+     0:00 grep mpirun
root@debian1:~# ompi-checkpoint 18357
Snapshot Ref.:   0 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-checkpoint --term 18357
Snapshot Ref.:   1 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_2.ckpt). Returned -1.

--
[debian1:18384] *** Process received signal ***
[debian1:18384] Signal: Segmentation fault (11)
[debian1:18384] Signal code: Address not mapped (1)
[debian1:18384] Failing at address: 0x725f725f
[debian1:18384] [ 0] [0xb775f40c]
[debian1:18384] [ 1]
/usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
[debian1:18384] [ 2]
/usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
[debian1:18384] [ 3]
/usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
[debian1:18384] [ 4] opal-restart [0x804908e]
[debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)
[0xb7568b55]
[debian1:18384] [ 6] opal-restart [0x8048fc1]
[debian1:18384] *** End of error message ***
--
mpirun noticed that process rank 2 with PID 18384 on node debian1
exited on signal 11 (Segmentat
--

I used a clean install of Debian Squeeze (testing) to make sure my
environment was ok. Those are the steps I took:

- Installed Debian Squeeze, only base packages
- Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
tools, BLCR dev and run-time environment)
- Compiled openmpi-1.4.1

Note that I did compile openmpi-1.4.1 because the Debian package
(openmpi-checkpoint) doesn't seem to be usable at the moment. There
are no leftovers from any previous install of Debian packages
supplying OpenMPI because this is a fresh install, no openmpi package
had been installed before.

I used the following configure options:

# ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads

I also tried to add the option --with-memory-manager=none because I
saw an e-mail on the mailing list that described this as a possible
solution to an (apparently) not related problem, but the problem
remains the same.

I don't have config.log (I rm'ed the build dir), but if you think it's
necessary I can recompile OpenMPI and provide it.

Some information about the system (VirtualBox virtual machine, single
processor, btw):

Kernel version 2.6.32-trunk-686

root@debian1:~# lsmod | grep blcr
blcr                   79084  0
blcr_imports            2077  1 blcr

libcr (BLCR) is version 0.8.2-9.

gcc is version 4.4.3.


Please let me know of any other information you might need.


Thanks in advance,