Re: [OMPI users] About the necessity of cancelation of pending communication and the use of buffer
On Tue, May 25, 2010 at 1:03 AM, Yves Caniou wrote:
> 2 ** When I use an Isend() operation, the manpage says that I can't use the
> buffer until the operation completes.
> What happens if I use an Isend() operation in a function, with a buffer
> declared inside the function?
> Do I have to Wait() for the communication to finish before returning, or
> declare the buffer as a global variable?

If you declare it inside the function (an automatic variable), you're declaring it on the stack. When the function returns, that stack space may be reused, which can have nasty, hard-to-debug effects. You don't need to declare the buffer as a global; just allocate it on the heap (with new or malloc or whatever), and make sure you don't lose track of the pointer, because you'll need to free that memory once the request completes.
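To make that concrete, here is a minimal sketch of the heap-allocation pattern (untested, error checking omitted, function names are mine):

```cpp
#include <mpi.h>

// Post a nonblocking send from a heap buffer. Unlike a stack buffer,
// this one stays valid after the function returns.
double* post_send(int count, int dest, int tag, MPI_Request* req) {
    double* buf = new double[count];
    for (int i = 0; i < count; ++i)
        buf[i] = i;  // fill with application data
    MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, req);
    return buf;  // caller keeps the pointer so the memory can be freed
}

// Caller side, once the data must have been handed off:
//   MPI_Wait(&req, MPI_STATUS_IGNORE);
//   delete[] buf;  // only now is it safe to release the buffer
```

Freeing the buffer before the matching MPI_Wait()/MPI_Test() completes the request is just as wrong as letting a stack buffer go out of scope.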
Re: [OMPI users] getc in openmpi
On Wed, May 12, 2010 at 2:51 PM, Jeff Squyres wrote:
> On May 12, 2010, at 1:48 PM, Hanjun Kim wrote:
>
>> I am working on parallelizing my sequential program using OpenMPI.
>> Although I got performance speedup using many threads, there was
>> slowdown on a small number of threads like 4 threads.
>> I found that it is because getc worked much slower than sequential
>> version. Does OpenMPI override or wrap getc function?
>
> No.

Please correct me if I'm wrong, but I believe Open MPI forwards the stdin of mpirun to rank 0, and forwards stdout/stderr from the other ranks back to mpirun; otherwise it wouldn't be possible to even see the output from the other ranks. I can imagine that indirection making stdio slower. MPICH2 had a command-line option that told mpiexec who would receive stdin (all processes or only some of them), so that you could do things like mpiexec
Re: [OMPI users] communicate C++ STL strucutures ??
On Fri, May 7, 2010 at 5:33 PM, Cristobal Navarro wrote:
> Hello,
>
> my question is the following.
> is it possible to send and receive C++ objects or STL structures (for
> example, send map myMap) through openMPI SEND and RECEIVE functions?
> at first glance i thought it was possible, but after reading some doc, im
> not sure.
> i dont have my source code at that stage for testing yet

Not directly: you have to serialize the object before sending and deserialize it on the receiving side. You could also use Boost.MPI together with Boost.Serialization; that would probably be the best way to go.
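Boost does all of this for you, but to make the idea concrete, here is a hand-rolled sketch that flattens a std::map<int,double> into a byte buffer you could hand to MPI_Send with the MPI_BYTE datatype (the layout and function names are mine, not from any library):

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

// Flatten a std::map<int, double> into one contiguous byte buffer:
// [count][key0][val0][key1][val1]... -- something you could hand to
// MPI_Send(buf.data(), buf.size(), MPI_BYTE, ...).
std::vector<char> serialize(const std::map<int, double>& m) {
    std::vector<char> buf(sizeof(std::size_t) +
                          m.size() * (sizeof(int) + sizeof(double)));
    char* p = buf.data();
    std::size_t n = m.size();
    std::memcpy(p, &n, sizeof n);
    p += sizeof n;
    for (const auto& kv : m) {
        std::memcpy(p, &kv.first, sizeof kv.first);
        p += sizeof kv.first;
        std::memcpy(p, &kv.second, sizeof kv.second);
        p += sizeof kv.second;
    }
    return buf;
}

// Rebuild the map on the receiving side from the same byte layout.
std::map<int, double> deserialize(const std::vector<char>& buf) {
    std::map<int, double> m;
    const char* p = buf.data();
    std::size_t n;
    std::memcpy(&n, p, sizeof n);
    p += sizeof n;
    for (std::size_t i = 0; i < n; ++i) {
        int k;
        double v;
        std::memcpy(&k, p, sizeof k);
        p += sizeof k;
        std::memcpy(&v, p, sizeof v);
        p += sizeof v;
        m[k] = v;
    }
    return m;
}
```

Note this naive layout assumes both ends have the same endianness and type sizes, which is one more reason to prefer Boost.Serialization for anything nontrivial.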
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto wrote:
> Fernando,
>
> Thank you for your reply.
> I tried to patch the file you mentioned, but the output did not change.

I didn't test the patch, to be honest. I'm using the 1.5 nightly snapshots, and checkpointing works great there.

>> Are you using a shared file system? You need to use a shared file
>> system for checkpointing with 1.4.1:
> What is the shared file system ? do you mean NFS, Lustre and so on ?
> (I'm sorry about my ignorance...)

Something like NFS, yes.

> If I use only one node for application, do I need such a shared-file-system ?

No, for a single node, checkpointing with 1.4.1 should work (it works for me, at least). If you're using a single node, then your problem is probably not related to the bug report I posted.

Regards,
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto wrote:
> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get correct checkpoint data.
> I prepared the execution environment as follows; the strings in () mean the
> name of the output file attached in the next e-mail (for mail size
> limitation):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint
>
> As a result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)

Are you using a shared file system? You need to use a shared file system for checkpointing with 1.4.1: https://svn.open-mpi.org/trac/ompi/ticket/2139

Regards,
Re: [OMPI users] Adding new process to running job
On Sat, Apr 10, 2010 at 6:07 AM, Juergen Kaiser wrote:
> Hi,
>
> is it possible to add a new MPI process to a set of running MPI processes
> such that they can communicate as usual? If so, how?

Open MPI supports MPI-2, so, as far as I can tell, yes: you can do so using the dynamic process management functions defined by MPI-2. Note that this has to be done from the application code. Take my words with a grain of salt, though, as I'm not an MPI guru (by far).

Regards,
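The relevant calls are MPI_Comm_spawn and friends; roughly like this (untested sketch, executable name and process count are just examples):

```cpp
#include <mpi.h>

// Parent side: start two extra copies of the "worker" executable and
// obtain an intercommunicator connecting the old and new processes.
void spawn_workers() {
    MPI_Comm children;
    int errcodes[2];
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0 /* root rank */, MPI_COMM_WORLD, &children, errcodes);

    // Optionally merge parents and children into one intracommunicator,
    // so that everybody can communicate "as usual":
    MPI_Comm everyone;
    MPI_Intercomm_merge(children, 0 /* parents get the low ranks */, &everyone);
}
```

The spawned children see the parents through MPI_Comm_get_parent().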
[OMPI users] Using a rankfile for ompi-restart
Hello,

I've noticed that ompi-restart doesn't support the --rankfile option; it only supports --hostfile/--machinefile. Is there any reason --rankfile isn't supported?

Suppose you have a cluster without a shared file system. When one node fails, you transfer its checkpoint to a spare node and invoke ompi-restart. In 1.5, ompi-restart automagically handles this situation (if you supply a hostfile) and is able to restart the process, but I'm afraid it might not always be able to find the checkpoints this way. If you could tell ompi-restart where the ranks are (and thus where the checkpoints are), then maybe restart would always work (as long as you've specified the location of the checkpoints correctly), or maybe ompi-restart would be faster.

Regards,
Re: [OMPI users] orted: error while loading shared libraries
On Thu, Apr 8, 2010 at 10:31 AM, Jeff Squyres wrote:
> Yes. There is usually a difference between interactive logins and
> non-interactive logins on which paths, etc. get set. Look in your shell
> startup and see if there is somewhere that it exits early (or otherwise
> doesn't process) for non-interactive logins.
>
> In short: you need to ensure that your paths (etc.) are setup properly for
> both interactive and non-interactive logins.

Here's a tip: take a look at your shell's man page. If I recall correctly, bash reads .bashrc for interactive non-login shells and .bash_profile for login shells, or something along those lines. So you might want to export LD_LIBRARY_PATH in both files.
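Something like this (the install prefix is just an example; adjust it to your own):

```shell
# Put these near the top of ~/.bashrc, *before* any "exit if not
# interactive" guard, so that non-interactive ssh sessions (which is how
# orted gets launched on remote nodes) pick them up as well.
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

A quick way to verify the remote environment is `ssh somenode env | grep LD_LIBRARY_PATH`, since that runs a non-interactive shell just like mpirun does.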
Re: [OMPI users] ompi-checkpoint --term
On Wed, Mar 31, 2010 at 7:39 PM, Addepalli, Srirangam V wrote:
> Hello All.
> I am trying to checkpoint an MPI application that has been started using the
> following mpirun command:
>
> mpirun -am ft-enable-cr -np 8 pw.x < Ge46.pw.in > Ge46.ph.out
>
> ompi-checkpoint 31396 (works). However, when I try to terminate the process with
> ompi-checkpoint --term 31396, it never finishes. How do I debug this issue?

ompi-checkpoint --term is exactly ompi-checkpoint plus sending SIGTERM to your app. If plain ompi-checkpoint finishes but --term doesn't, then your app is probably not dealing with SIGTERM correctly. Make sure you're not ignoring SIGTERM; you need to either handle it or let it kill your app. If it's a multithreaded app, make sure you "distribute" the SIGTERM to ALL the threads, i.e., when you receive SIGTERM, notify the other threads that they should join or quit.

Regards,
Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters
On Tue, Mar 23, 2010 at 1:25 PM, fengguang tian wrote:
> now, I set $HOME as shared directory, but when doing ompi-checkpoint, it
> shows (nimbus1 is the remote machine in my cluster):
>
> [nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the
> sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of
> (/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt), mkdir
> failed [1]
> [nimbus1:12630] Error: No metadata filename specified!
>
> why is that?

The error is described in the error message:

[nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of (/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt), mkdir failed [1]

If the number between brackets is errno, that's EPERM, "Operation not permitted". Most likely the user running mpirun doesn't have the necessary privileges to write to the shared file system (i.e., the file system is mounted read-only, or you don't have write access to the directory, or something of that sort). Also, please make sure you don't post the same issue twice to the mailing list.
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
> --hostfile .mpihostfile
> to store the global checkpoint snapshot into the shared
> directory /mirror, but the problems are still there:
> when ompi-checkpoint, the mpirun is still not killed, it is hanging
> there. when doing ompi-restart, it shows:
>
> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
> --
> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
> --

Have you tried Open MPI 1.5? I got this to work with 1.5, but not with 1.4 (though I didn't try 1.4 with a shared file system).
Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters
On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian wrote:
> Hi
>
> I am using open-mpi and blcr in a cluster of 3 machines, and the checkpoint
> and restart work fine on a single machine, but when doing checkpoint in a
> cluster environment, the ompi-checkpoint hangs

Besides what has been said in another thread (regarding 1.4 and checkpointing to shared directories), you might want to make sure your app actually terminates when you send it a SIGTERM. Some apps ignore SIGTERM or handle it in a way that doesn't cause them to quit. ompi-checkpoint --term is simply ompi-checkpoint plus sending SIGTERM to the application (I'm not sure whether SIGTERM is sent to each process individually or not).
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian wrote:
> I have created the shared file system. but I created a /mirror at the root
> directory, not at the $HOME directory. is that the problem? thank you

Others might be able to give you a more accurate explanation. The way I understand it, in Open MPI 1.4 you need to write all checkpoints to a single, shared location; that's why you generally want a shared file system. Now, I'm pretty sure you can change the directory to which the checkpoints are written. If your $HOME isn't a shared directory, you can point Open MPI at the shared directory instead. In Open MPI 1.5 (unstable), some magic allows you to create the checkpoints and restore them without a shared directory.

Regards,
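For reference, this is roughly what the settings look like in ~/.openmpi/mca-params.conf; snapc_base_global_snapshot_dir is the parameter mentioned earlier in this thread, crs_base_snapshot_dir is the local-snapshot counterpart as I recall it from the C/R docs (double-check with ompi_info), and /mirror stands for your shared mount:

```
# Where the aggregated (global) snapshot is written -- must be visible
# to the node running mpirun:
snapc_base_global_snapshot_dir=/mirror

# Where each process writes its local snapshot before aggregation:
crs_base_snapshot_dir=/mirror/local
```

The same parameters can be passed on the command line as `--mca name value`.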
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian wrote:
> I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the
> MPI program runs well on the cluster, but how do I checkpoint the MPI
> program on this cluster? For example, here is what I do for a test:
>
> mpiu@nimbus:/mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
>
> the program will run on the cluster. then, I enter:
>
> mpiu@nimbus:/mirror$ ompi-checkpoint -term $(pidof mpirun)
>
> but the MPI programs are not terminated as happened on a single machine,
> although it created a checkpoint file "ompi_global_snapshot_14030.ckpt" in
> the home directory on the master node.

Are you using Open MPI 1.4 without a shared file system mounted at $HOME? If so, take a look here: http://www.open-mpi.org/community/lists/users/2010/03/12246.php

Regards,
Re: [OMPI users] Problem in remote nodes
On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote:
> On Mar 17, 2010, at 4:39 AM, wrote:
>
>> Hi everyone. I'm a new Open MPI user and I have just installed Open MPI on
>> a 6-node cluster with Scientific Linux. When I execute it locally it
>> works perfectly, but when I try to execute it on the remote nodes with the
>> --host option it hangs and gives no message. I think that the problem
>> could be with the shared libraries, but I'm not sure. In my opinion the
>> problem is not ssh, because I can access the nodes with no password.
>
> You might want to check that Open MPI processes are actually running on the
> remote nodes -- check with ps if you see any "orted" or other MPI-related
> processes (e.g., your processes).
>
> Do you have any TCP firewall software running between the nodes? If so,
> you'll need to disable it (at least for Open MPI jobs).

I also recommend running mpirun with the option --mca btl_base_verbose 30 to troubleshoot TCP issues. In some environments, you need to explicitly tell mpirun which network interfaces it can use to reach the hosts. Read the following FAQ section for more information: http://www.open-mpi.org/faq/?category=tcp -- item 7 might be of special interest.

Regards,
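For example (host names and application are illustrative; the MCA flags are the ones mentioned above):

```shell
# Show what the TCP BTL is doing, and which interfaces it tries:
mpirun --mca btl_base_verbose 30 --host node1,node2 ./my_mpi_app

# If the wrong interface is being picked (firewalls, virtual bridges,
# NAT interfaces with overlapping subnets), restrict Open MPI explicitly:
mpirun --mca btl_tcp_if_include eth1 --host node1,node2 ./my_mpi_app
```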
Re: [OMPI users] Problem in using openmpi
On Fri, Mar 12, 2010 at 6:02 PM, Samuel K. Gutierrez wrote:
> One more thing. The line should have been:
>
> export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64
>
> The space in the previous email will make bash unhappy 8-|.
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Mar 12, 2010, at 1:56 PM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> It sounds like you may need to set your LD_LIBRARY_PATH environment
>> variable correctly. There are several ways that you can tell the dynamic
>> linker where the required libraries are located, but the following may be
>> sufficient for your needs.
>>
>> Let's say, for example, that your Open MPI installation is rooted at
>> /home/jess/local/ompi and the libraries are located in
>> /home/jess/local/ompi/lib64, try (bash-like shell):
>>
>> export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64
>>
>> Hope this helps,
>>
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:
>>
>>> Hi,
>>>
>>> I have installed openmpi on Kubuntu, with a dual-core AMD Athlon.
>>> When trying to compile a simple program, I am getting an error:
>>>
>>> mpicc: error while loading shared libraries: libopen-pal.so.0: cannot
>>> open shared object file: No such file or dir
>>>
>>> I read somewhere that this error is because of some intel compiler
>>> being not installed on the proper node, which I don't understand as I
>>> am using AMD.
>>>
>>> Kindly give your suggestions
>>>
>>> Thank You

It's probably a packaging error, if he used the distribution's packages; in that case, he should report the bug downstream. If he installed from source, then the libraries are most likely installed somewhere outside the dynamic linker's search path, and the LD_LIBRARY_PATH trick should work (if it doesn't, make sure there are no leftovers from previous installs, recompile, reinstall, and it should work fine).

Regards,
Re: [OMPI users] change hosts to restart the checkpoint
On Fri, Mar 5, 2010 at 12:03 PM, Josh Hursey wrote:
> This type of failure is usually due to prelink'ing being left enabled on one
> or more of the systems. This has come up multiple times on the Open MPI
> list, but is actually a problem between BLCR and the Linux kernel. BLCR has
> a FAQ entry on this that you will want to check out:
> https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>
> If that does not work, then we can look into other causes.

I also suggest checkpointing and restarting the app with BLCR directly, i.e., take any simple app, run it with cr_run, checkpoint it with cr_checkpoint, then restart it with cr_restart. Make sure the blcr kernel module is loaded, too. That way you can tell whether the problem is related to Open MPI or not.

Regards,
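Something along these lines (invocations from memory, so double-check against the BLCR docs):

```shell
lsmod | grep blcr            # both blcr and blcr_imports should be loaded
cr_run ./some_simple_app &   # run the app under BLCR's checkpoint library
APP_PID=$!
cr_checkpoint $APP_PID       # writes context.<pid> in the current directory
kill $APP_PID
cr_restart context.$APP_PID  # the app should resume where it left off
```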
Re: [OMPI users] checkpointing multi node and multi process applications
On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos <fernando...@gmail.com> wrote: > Is there anything I can do to provide more information about this bug? > E.g. try to compile the code in the SVN trunk? I also have kept the > snapshots intact, I can tar them up and upload them somewhere in case > you guys need it. I can also provide the source code to the ring > program, but it's really the canonical ring MPI example. > I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags). This time taking the checkpoint didn't generate any error message: root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring >>> Process 1 sending 2761 to 0 >>> Process 1 received 2760 >>> Process 1 sending 2760 to 0 root@debian1:~# But restoring it did: root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt [debian1:23129] Error: Unable to access the path [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]! -- Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename or provided an invalid filename. Please see --help for usage. -- -- mpirun has exited due to process rank 1 with PID 23129 on node debian1 exiting improperly. There are two reasons this could occur: 1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination. 2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination" This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). 
--
root@debian1:~#

Indeed, opal_snapshot_1.ckpt does not exist:

root@debian1:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
ompi_global_snapshot_23071.ckpt/restart-appfile
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
root@debian1:~#

It can be found on debian2:

root@debian2:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
root@debian2:~#

Then I tried supplying a hostfile to ompi-restart and it worked just fine! I thought the checkpoint included the host information? So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?

Thanks a bunch,
[OMPI users] checkpointing multi node and multi process applications
Hi, First, I'm hoping setting the subject of this e-mail will get it attached to the thread that starts with this e-mail: http://www.open-mpi.org/community/lists/users/2009/12/11608.php The reason I'm not replying to that thread is that I wasn't subscribed to the list at the time. My environment is detailed in another thread, not related at all to this issue: http://www.open-mpi.org/community/lists/users/2010/03/12199.php I'm running into the same problem Jean described (though I'm running 1.4.1). Note that taking and restarting from checkpoints works fine now when I'm using only a single node. This is what I get by running the job on two nodes, also showing the output after the checkpoint is taken: root@debian1# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring >>> Process 1 sending 2460 to 0 >>> Process 1 received 2459 >>> Process 1 sending 2459 to 0 [debian1:01817] Error: expected_component: PID information unavailable! [debian1:01817] Error: expected_component: Component Name information unavailable! -- mpirun noticed that process rank 0 with PID 1819 on node debian1 exited on signal 0 (Unknown signal 0). -- Now taking the checkpoint: root@debian1# ompi-checkpoint --term `ps ax | grep mpirun | grep -v grep | awk '{print $1}'` Snapshot Ref.: 0 ompi_global_snapshot_1817.ckpt Restarting from the checkpoint: root@debian1:~# ompi-restart ompi_global_snapshot_1817.ckpt [debian1:01832] Error: Unable to access the path [/root/ompi_global_snapshot_1817.ckpt/0/opal_snapshot_1.ckpt]! -- Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename or provided an invalid filename. Please see --help for usage. -- After spitting that error message, ompi-restart just hangs forever. Here's something that may or may not matter. debian1 and debian2 are two virtual machines. They have two network interfaces each: - eth0: Connected through NAT so that the machine can access the internet. 
It gets an address by DHCP (VirtualBox magic), which is always 10.0.2.15/24 (for both machines). They have no connection to each other through this interface, they can only access the outside. - eth1: Connected to an internal VirtualBox interface. Only debian1 and debian2 are members of that internal network (more VirtualBox magic). The IPs are statically configured, 192.168.200.1/24 for debian1, 192.168.200.2/24 for debian2. Since both machines have an IP in the same subnet on eth0 (actually the same IP), OpenMPI thinks they're in the same network connected through eth0 too. That's why I need to specify btl_tcp_if_include eth1, otherwise running jobs across the two nodes will not work properly (sends and recvs time out). Is there anything I can do to provide more information about this bug? E.g. try to compile the code in the SVN trunk? I also have kept the snapshots intact, I can tar them up and upload them somewhere in case you guys need it. I can also provide the source code to the ring program, but it's really the canonical ring MPI example. As usual, any info you might need, just ask and I'll provide. Thanks in advance,
Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)
On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey wrote:
> Yes, ompi-restart should be printing a helpful message and exiting normally.
> Thanks for the bug report. I believe that I have seen and fixed this on a
> development branch making its way to the trunk. I'll make sure to move the
> fix to the 1.4 series once it has been applied to the trunk.
>
> I filed a ticket on this if you wanted to track the issue.
> https://svn.open-mpi.org/trac/ompi/ticket/2329

Ah, that's great. Just wondering, do you have any idea why blcr-util is required? That package only contains the cr_* binaries (cr_restart, cr_checkpoint, cr_run) and some docs (manpages, changelog, etc.). I've filed a Debian bug (#572229) about making openmpi-checkpoint depend on blcr-util, but the package maintainer told me he found it unusual that ompi-restart would depend on the cr_* binaries, since libcr supposedly provides all the functionality ompi-restart needs. I'm about to compile Open MPI in debug mode and take a look at the backtrace to see if I can understand what's going on. By the way, this is the list of files in the blcr-util package: http://packages.debian.org/sid/amd64/blcr-util/filelist . As you can see, only cr_* binaries and docs.

> Thanks again,
> Josh

Thank you!
Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)
On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos <fernando...@gmail.com> wrote:
> Hello,
>
> I'm trying to come up with a fault tolerant OpenMPI setup for research
> purposes. I'm doing some tests now, but I'm stuck with a segfault when
> I try to restart my test program from a checkpoint.
>
> My test program is the "ring" program, where messages are sent to the
> next node in the ring N times. It's pretty simple, I can supply the
> source code if needed. I'm running it like this:
>
> # mpirun -np 4 -am ft-enable-cr ring
> ...
>>>> Process 1 sending 703 to 2
>>>> Process 3 received 704
>>>> Process 3 sending 704 to 0
>>>> Process 3 received 703
>>>> Process 3 sending 703 to 0
> --
> mpirun noticed that process rank 0 with PID 18358 on node debian1
> exited on signal 0 (Unknown signal 0).
> --
> 4 total processes killed (some possibly by mpirun during cleanup)
>
> That's the output when I ompi-checkpoint the mpirun PID from another terminal.
>
> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
> checkpoint directory has been created in $HOME.
>
> This is what I get when I try to run ompi-restart:
>
> root@debian1:~# ps ax | grep mpirun
> 18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
> 18378 pts/5 S+ 0:00 grep mpirun
> root@debian1:~# ompi-checkpoint 18357
> Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-checkpoint --term 18357
> Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
> --
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_2.ckpt). Returned -1.
> --
> [debian1:18384] *** Process received signal ***
> [debian1:18384] Signal: Segmentation fault (11)
> [debian1:18384] Signal code: Address not mapped (1)
> [debian1:18384] Failing at address: 0x725f725f
> [debian1:18384] [ 0] [0xb775f40c]
> [debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
> [debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
> [debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
> [debian1:18384] [ 4] opal-restart [0x804908e]
> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
> [debian1:18384] [ 6] opal-restart [0x8048fc1]
> [debian1:18384] *** End of error message ***
> --
> mpirun noticed that process rank 2 with PID 18384 on node debian1
> exited on signal 11 (Segmentation fault)
> --
>
> I used a clean install of Debian Squeeze (testing) to make sure my
> environment was ok. Those are the steps I took:
>
> - Installed Debian Squeeze, only base packages
> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
> tools, BLCR dev and run-time environment)
> - Compiled openmpi-1.4.1
>
> Note that I did compile openmpi-1.4.1 because the Debian package
> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
> are no leftovers from any previous install of Debian packages
> supplying OpenMPI because this is a fresh install, no openmpi package
> had been installed before.
>
> I used the following configure options:
>
> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>
> I also tried to add the option --with-memory-manager=none because I
> saw an e-mail on the mailing list that described this as a possible
> solution to an (apparently) not related problem, but the problem
> remains the same.
>
> I don't have config.log (I rm'ed the build dir), but if you think it's
> necessary I can recompile OpenMPI and provide it.
>
> Some information about the system (VirtualBox virtual machine, single
> processor, btw):
>
> Kernel version 2.6.32-trunk-686
>
> root@debian1:~# lsmod | grep blcr
> blcr 79084 0
> blcr_imports 2077 1 blcr
>
> libcr (BLCR) is version 0.8.2-9.
>
> gcc is version 4.4.3.
>
> Please let me know of any other information you might need.
>
> Thanks in advance,

Hello,

I figured it out. The problem is that the Debian package blcr-util, which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.) wasn't in
[OMPI users] Segfault in ompi-restart (ft-enable-cr)
Hello,

I'm trying to come up with a fault tolerant OpenMPI setup for research purposes. I'm doing some tests now, but I'm stuck with a segfault when I try to restart my test program from a checkpoint.

My test program is the "ring" program, where messages are sent to the next node in the ring N times. It's pretty simple, I can supply the source code if needed. I'm running it like this:

# mpirun -np 4 -am ft-enable-cr ring
...
>>> Process 1 sending 703 to 2
>>> Process 3 received 704
>>> Process 3 sending 704 to 0
>>> Process 3 received 703
>>> Process 3 sending 703 to 0
--
mpirun noticed that process rank 0 with PID 18358 on node debian1
exited on signal 0 (Unknown signal 0).
--
4 total processes killed (some possibly by mpirun during cleanup)

That's the output when I ompi-checkpoint the mpirun PID from another terminal.

The checkpoint is taken just fine in maybe 1.5 seconds. I can see the checkpoint directory has been created in $HOME.

This is what I get when I try to run ompi-restart:

root@debian1:~# ps ax | grep mpirun
18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
18378 pts/5 S+ 0:00 grep mpirun
root@debian1:~# ompi-checkpoint 18357
Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-checkpoint --term 18357
Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
--
Error: Unable to obtain the proper restart command to restart from the
checkpoint file (opal_snapshot_2.ckpt). Returned -1.
--
[debian1:18384] *** Process received signal ***
[debian1:18384] Signal: Segmentation fault (11)
[debian1:18384] Signal code: Address not mapped (1)
[debian1:18384] Failing at address: 0x725f725f
[debian1:18384] [ 0] [0xb775f40c]
[debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
[debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
[debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
[debian1:18384] [ 4] opal-restart [0x804908e]
[debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
[debian1:18384] [ 6] opal-restart [0x8048fc1]
[debian1:18384] *** End of error message ***
--
mpirun noticed that process rank 2 with PID 18384 on node debian1
exited on signal 11 (Segmentation fault)
--

I used a clean install of Debian Squeeze (testing) to make sure my environment was ok. Those are the steps I took:

- Installed Debian Squeeze, only base packages
- Installed build-essential, libcr0, libcr-dev, blcr-dkms (build tools, BLCR dev and run-time environment)
- Compiled openmpi-1.4.1

Note that I did compile openmpi-1.4.1 because the Debian package (openmpi-checkpoint) doesn't seem to be usable at the moment. There are no leftovers from any previous install of Debian packages supplying OpenMPI because this is a fresh install, no openmpi package had been installed before.

I used the following configure options:

# ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads

I also tried to add the option --with-memory-manager=none because I saw an e-mail on the mailing list that described this as a possible solution to an (apparently) not related problem, but the problem remains the same.

I don't have config.log (I rm'ed the build dir), but if you think it's necessary I can recompile OpenMPI and provide it.
Some information about the system (VirtualBox virtual machine, single processor, btw):

Kernel version 2.6.32-trunk-686

root@debian1:~# lsmod | grep blcr
blcr 79084 0
blcr_imports 2077 1 blcr

libcr (BLCR) is version 0.8.2-9.

gcc is version 4.4.3.

Please let me know of any other information you might need.

Thanks in advance,