Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
Almost assuredly, the MTL is not thread safe, and such support is unlikely to happen in the short term. You might be better off concentrating on the BTL, as George has done significant work on that front.

Brian

On Jun 11, 2009, at 12:20 PM, François Trahay wrote:

The stack trace is from the MX MTL (I attach the backtraces I get with both the MX MTL and the MX BTL). Here is the program that I use. It is quite simple: it runs ping-pongs concurrently (with one thread per node, then with two threads per node, etc.). The error occurs when two threads run concurrently.

Francois

Scott Atchley wrote:

Brian and George, I do not know if the stack trace is complete, but I do not see any mx_* functions called, which would indicate a crash inside MX due to multiple threads trying to complete the same request. It does show a failed assert. Francois, is the stack trace from the MX MTL or the BTL? Can you send a small program that reproduces this abort?

Scott

On Jun 11, 2009, at 12:25 PM, Brian Barrett wrote:

Neither the CM PML nor the MX MTL has been looked at for thread safety. There's not much code to cause problems in the CM PML. The MX MTL would likely need some work to ensure the restrictions Scott mentioned are met (currently, there is no such guarantee in the MX MTL).

Brian

On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:

The comment in the FAQ (and in the other thread) is only true for some BTLs (TCP, SM, and MX). I don't have the resources to test the other BTLs; it is their developers' responsibility to make the modifications required for thread safety. In addition, I have to confess that I never tested the MTL for thread safety. It is a completely different implementation of the message passing, supposed to map directly onto the underlying network capabilities. However, there are clearly a few places where thread safety should be enforced in the MTL layer, and I don't know whether this is the case.

george.
On Jun 11, 2009, at 09:35, Scott Atchley wrote:

Francois,

For threads, the FAQ has: http://www.open-mpi.org/faq/?category=supported-systems#thread-support

It mentions that thread support is designed in, but lightly tested. It is also possible that the FAQ is out of date and MPI_THREAD_MULTIPLE is fully supported. The stack trace below shows:

opal_free()
opal_progress()
MPI_Recv()

I do not know this code, but it may be in the higher-level code that calls the BTLs and/or MTLs, and it would be a place to see whether that code handles the TCP BTL differently than the MX BTL/MTL. MX is thread safe with the caveat that two threads may not try to complete the same request at the same time. This includes calling mx_test(), mx_wait(), mx_test_any(), and/or mx_wait_any(), where the latter two have match bits and a match mask that could complete a request being tested/waited on by another thread.

Scott

On Jun 11, 2009, at 6:00 AM, François Trahay wrote:

Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php), threads are supported in Open MPI. The program I try to run works with the TCP stack, and the MX driver is thread-safe, so I guess the problem comes from the MX BTL or MTL.

Francois

Scott Atchley wrote:

Hi Francois, I am not familiar with the internals of the OMPI code. Are you sure, however, that threads are fully supported yet? I was under the impression that thread support was still partial. Can anyone else comment?

Scott

On Jun 8, 2009, at 8:43 AM, François Trahay wrote:

Hi, I'm encountering some issues when running a multithreaded program with Open MPI (trunk rev. 21380, configured with --enable-mpi-threads). My program (included in the tar.bz2) uses several pthreads that perform ping-pongs concurrently (thread #1 uses tag #1, thread #2 uses tag #2, etc.).
This program crashes over MX (either BTL or MTL) with the following backtrace:

concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion: Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
[joe0:01709] *** Process received signal ***
[joe0:01709] *** Process received signal ***
[joe0:01709] Signal: Segmentation fault (11)
[joe0:01709] Signal code: Address not mapped (1)
[joe0:01709] Failing at address: 0x1238949c4
[joe0:01709] Signal: Aborted (6)
[joe0:01709] Signal code: (-6)
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
[joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
[joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
[joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
[joe0:01709] [ 3] /home/ftrahay/sources/openmpi/tr
Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
Neither the CM PML nor the MX MTL has been looked at for thread safety. There's not much code to cause problems in the CM PML. The MX MTL would likely need some work to ensure the restrictions Scott mentioned are met (currently, there is no such guarantee in the MX MTL).

Brian

On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:

The comment in the FAQ (and in the other thread) is only true for some BTLs (TCP, SM, and MX). I don't have the resources to test the other BTLs; it is their developers' responsibility to make the modifications required for thread safety. In addition, I have to confess that I never tested the MTL for thread safety. It is a completely different implementation of the message passing, supposed to map directly onto the underlying network capabilities. However, there are clearly a few places where thread safety should be enforced in the MTL layer, and I don't know whether this is the case.

george.

On Jun 11, 2009, at 09:35, Scott Atchley wrote:

Francois,

For threads, the FAQ has: http://www.open-mpi.org/faq/?category=supported-systems#thread-support

It mentions that thread support is designed in, but lightly tested. It is also possible that the FAQ is out of date and MPI_THREAD_MULTIPLE is fully supported. The stack trace below shows:

opal_free()
opal_progress()
MPI_Recv()

I do not know this code, but it may be in the higher-level code that calls the BTLs and/or MTLs, and it would be a place to see whether that code handles the TCP BTL differently than the MX BTL/MTL. MX is thread safe with the caveat that two threads may not try to complete the same request at the same time. This includes calling mx_test(), mx_wait(), mx_test_any(), and/or mx_wait_any(), where the latter two have match bits and a match mask that could complete a request being tested/waited on by another thread.
Scott

On Jun 11, 2009, at 6:00 AM, François Trahay wrote:

Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php), threads are supported in Open MPI. The program I try to run works with the TCP stack, and the MX driver is thread-safe, so I guess the problem comes from the MX BTL or MTL.

Francois

Scott Atchley wrote:

Hi Francois, I am not familiar with the internals of the OMPI code. Are you sure, however, that threads are fully supported yet? I was under the impression that thread support was still partial. Can anyone else comment?

Scott

On Jun 8, 2009, at 8:43 AM, François Trahay wrote:

Hi, I'm encountering some issues when running a multithreaded program with Open MPI (trunk rev. 21380, configured with --enable-mpi-threads). My program (included in the tar.bz2) uses several pthreads that perform ping-pongs concurrently (thread #1 uses tag #1, thread #2 uses tag #2, etc.). This program crashes over MX (either BTL or MTL) with the following backtrace:

concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion: Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
[joe0:01709] *** Process received signal ***
[joe0:01709] *** Process received signal ***
[joe0:01709] Signal: Segmentation fault (11)
[joe0:01709] Signal code: Address not mapped (1)
[joe0:01709] Failing at address: 0x1238949c4
[joe0:01709] Signal: Aborted (6)
[joe0:01709] Signal code: (-6)
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
[joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
[joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
[joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
[joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
[joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
[joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
[joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
[joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
[joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
[joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
[joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
[joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
[joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
[joe0:01709] *** End of error message ***
[joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/
Re: [OMPI users] HPL with OpenMPI: Do I have a memory leak?
S (Linux CentOS 5.2), HPL is running alone. The cluster has InfiniBand. However, I am running on a single node.

The surprising thing is that if I run on shared memory only (-mca btl sm,self) there is no memory problem: memory use is stable at about 13.9GB, and the run completes. So there is a way around it to run on a single node. (Shared memory is presumably the way to go on a single node anyway.) However, if I introduce IB (-mca btl openib,sm,self) among the MCA btl parameters, then memory use blows up. This is bad news for me, because I want to extend the experiment to run HPL across the whole cluster using IB, which is the ultimate goal of HPL, of course! It also suggests that the problem is somehow related to InfiniBand, perhaps hidden under Open MPI.

Here is the mpiexec command I use (with and without openib):

/path/to/openmpi/bin/mpiexec \
  -prefix /the/run/directory \
  -np 8 \
  -mca btl [openib,]sm,self \
  xhpl

Any help, insights, suggestions, or reports of previous experiences are much appreciated.

Thank you,
Gus Correa

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Relocating an Open MPI installation using OPAL_PREFIX
Sorry I haven't jumped into this thread earlier -- I've been a bit behind.

The multi-lib support worked at one time, and I can't think of why it would have changed. The one condition is that libdir, includedir, etc. *MUST* be specified relative to $prefix for it to work. It looks like you were defining them as absolute paths, so you'd have to set libdir directly, which will never work in multi-lib because mpirun and the app likely have different word sizes and therefore different libdirs. More information is on the multi-lib page in the wiki: https://svn.open-mpi.org/trac/ompi/wiki/MultiLib

There is actually one condition we do not handle properly: the prefix flag to mpirun. The LD_LIBRARY_PATH will only be set for the word size of mpirun, and not the executable. Really, both would have to be added (so that both orted, which is likely always 32-bit in a multi-lib situation, and the app find their libraries).

Brian

On Jan 5, 2009, at 6:02 PM, Jeff Squyres wrote:

I honestly haven't thought through the ramifications of doing a multi-lib build with OPAL_PREFIX et al. :-\ If you setenv OPAL_LIBDIR, it'll use whatever you set it to, so it doesn't matter what you configured --libdir with. Additionally, mca/installdirs/config/install_dirs.h has this by default:

#define OPAL_LIBDIR "${exec_prefix}/lib"

Hence, if you use the default --libdir and setenv OPAL_PREFIX, then the libdir should pick up the right thing (because it's based on the prefix). But if you use a --libdir that is *not* based on ${exec_prefix}, then you might run into problems. Perhaps you can use --libdir='${exec_prefix}/lib64' so that you can have your custom libdir, but still have it dependent upon the prefix that gets expanded at run time...? (Again, I'm not thinking all of this through -- just offering a few suggestions off the top of my head that you'll need to test / trace the code to be sure...)
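The relocation recipe Jeff sketches can be written out as follows. This is a sketch under the assumptions in his message (libdir expressed relative to ${exec_prefix} so it is re-expanded at run time); the paths are illustrative, not from the original thread:

```shell
# Configure with libdir relative to the prefix. The single quotes matter:
# the literal string '${exec_prefix}/lib64' must reach configure unexpanded.
./configure --prefix=/opt/openmpi --libdir='${exec_prefix}/lib64'
make all install

# Later, after moving the installation tree to /opt/openmpi-relocated:
export OPAL_PREFIX=/opt/openmpi-relocated
# libdir now resolves to /opt/openmpi-relocated/lib64 automatically,
# so no separate OPAL_LIBDIR override is needed.
```

An absolute --libdir would defeat this, because OPAL_PREFIX changes only the prefix that relative install directories are expanded against.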
On Jan 5, 2009, at 1:35 PM, Ethan Mallove wrote:

On Thu, Dec/25/2008 08:12:49AM, Jeff Squyres wrote:

It's quite possible that we don't handle this situation properly. Won't you need two libdirs (one for the 32-bit OMPI executables, and one for the 64-bit MPI apps)?

I don't need an OPAL environment variable for the executables, just a single OPAL_LIBDIR var for the libraries. (One set of 32-bit executables runs with both 32-bit and 64-bit libraries.) I'm guessing OPAL_LIBDIR will not work for you if you configure with a non-standard --libdir option.

-Ethan

On Dec 23, 2008, at 3:58 PM, Ethan Mallove wrote:

I think the problem is that I am doing a multi-lib build. I have 32-bit libraries in lib/, and 64-bit libraries in lib/64. I assume I do not see the issue for 32-bit tests because all the dependencies are where Open MPI expects them to be. For the 64-bit case, I tried setting OPAL_LIBDIR to /opt/openmpi-relocated/lib/lib64, but no luck. Given the below configure arguments, what do my OPAL_* env vars need to be? (Also, could using --enable-orterun-prefix-by-default interfere with OPAL_PREFIX?)
$ ./configure CC=cc CXX=CC F77=f77 FC=f90 --with-openib --without-udapl \
    --disable-openib-ibcm --enable-heterogeneous --enable-cxx-exceptions \
    --enable-shared --enable-orterun-prefix-by-default --with-sge \
    --enable-mpi-f90 --with-mpi-f90-size=small --disable-mpi-threads \
    --disable-progress-threads --disable-debug \
    CFLAGS="-m32 -xO5" CXXFLAGS="-m32 -xO5" FFLAGS="-m32 -xO5" FCFLAGS="-m32 -xO5" \
    --prefix=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install \
    --mandir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/man \
    --libdir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/lib \
    --includedir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/include \
    --without-mx --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32 \
    --with-contrib-vt-flags="--prefix=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install --mandir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/man --libdir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/lib --includedir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/include LDFLAGS=-R/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/lib"

$ ./configure CC=cc CXX=CC F77=f77 FC=f90 --with-openib --without-udapl \
    --disable-openib-ibcm --enable-heterogeneous --enable-cxx-exceptions \
    --enable-shared --enable-orterun-prefix-by-default --with-sge \
    --enable-mpi-f90 --with-mpi-f90-size=small --disable-mpi-threads \
    --disable-progress-threads --disable-debug \
    CFLAGS="-m64 -xO5" CXXFLAGS="-m64 -xO5" FFLAGS="-m64 -xO5"
Re: [OMPI users] Crash in _int_malloc via MPI_Init
On Jun 15, 2008, at 2:20 PM, Dirk Eddelbuettel wrote:

Yup: I still suspect compiler/linker changes in Ubuntu between Gutsy (released Oct 2007) and Hardy (April 2008). Why? Because exactly the same source package for Open MPI (as maintained by Manuel and myself for Debian) works for me on Ubuntu Hardy __if I compile it on Ubuntu Gutsy__. Now, I reported this to Ubuntu ... to no answer.

Lucas and Christoph at Debian today released a feature allowing us Debian maintainers to see which of our packages have bug reports in Ubuntu. It was only through this mechanism that I learned that the segfault I saw with Rmpi (using Open MPI) had been experienced by someone else, and that a similar bug occurs with Python use on top of Open MPI. But still no tangible answer from Canonical/Ubuntu other than some reshuffling of bug report titles and numbers. Very disappointing. I am CCing Steffen and Andreas, who've seen similar bugs and are awaiting answers too. I am also CCing Cesare at Ubuntu, who did the bug rearrangement; maybe he will find a moment to share their plans with us.

I suppose I'm glad that it doesn't look like an Open MPI problem. Due to continual problems with the ptmalloc2 code in Open MPI, we've decided that for v1.3 we'll extract that code into its own library. Users who need the malloc hooks for InfiniBand support (only a small number of applications really benefit from it) will have to explicitly link in the extra library. Hopefully this will resolve some of these headaches.

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Memory manager
Terry -

Would you be willing to do an experiment with the memory allocator? There are two values we change to try to make IB run faster (at the cost of the corner cases you're hitting). I'm not sure one is strictly necessary, and I'm concerned that it's the one causing problems. If you don't mind recompiling again, would you change line 64 in opal/mca/memory/ptmalloc2/malloc.c from:

#define DEFAULT_MMAP_THRESHOLD (2*1024*1024)

to:

#define DEFAULT_MMAP_THRESHOLD (128*1024)

and then recompile with the memory manager, obviously. That will make the mmap/sbrk cross-over point the same as the default allocator in Linux. There's still one other tweak we do, but I'm almost 100% positive it's the threshold causing problems.

Brian

On May 19, 2008, at 8:17 PM, Terry Frankcombe wrote:

To tell you all what no one wanted to tell me: yes, it does seem to be the memory manager. Compiling everything with --with-memory-manager=none returns the vmem use to the more reasonable ~100MB per process (down from >8GB). I take it this may affect my peak bandwidth over InfiniBand. What's the general feeling about how bad this is?

On Tue, 2008-05-13 at 13:12 +1000, Terry Frankcombe wrote:

Hi folks,

I'm trying to run an MPI app on an InfiniBand cluster with Open MPI 1.2.6. When run on a single node, this app is grabbing large chunks of memory (total per process ~8.5GB, including strace showing a single 4GB grab) but not using it. The resident memory use is ~40MB per process. When this app is compiled in serial mode (with conditionals to remove the MPI calls) the memory use is more like what you'd expect: 40MB resident and ~100MB vmem. Now, I didn't write it, so I'm not sure what extra stuff the MPI version does, and we haven't tracked down the large memory grabs. Could it be that this vmem is being grabbed by the Open MPI memory manager rather than directly by the app?
Ciao
Terry

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Memory question and possible bug in 64bit addressing under Leopard!
On Apr 25, 2008, at 2:06 PM, Gregory John Orris wrote:

... produces a core dump on a machine with 12GB of RAM, and the error message:

mpiexec noticed that job rank 0 with PID 75545 on node mymachine.com exited on signal 4 (Illegal instruction).

However, substituting

float *X = new float[n];

for

float X[n];

succeeds!

You're running off the end of the stack because of the large amount of data you're trying to put there. OS X by default has a tiny stack size, so codes that run on Linux (which defaults to a much larger stack size) sometimes show this problem. Your best bets are either to increase the max stack size or (more portably) just allocate everything on the heap with malloc/new.

Hope this helps,
Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Problems using Intel MKL with OpenMPI and Pathscale
On Apr 14, 2008, at 12:30 AM, Åke Sandgren wrote:

On Sun, 2008-04-13 at 08:00 -0400, Jeff Squyres wrote:

Do you get the same error if you disable the memory handling in Open MPI? You can configure OMPI with: --disable-memory-manager

Doesn't help: it still compiles ptmalloc2, and trying to turn ptmalloc2 off at runtime doesn't help either.

Jeff had the option slightly wrong. It's actually:

--without-memory-manager

Because this is a link-time decision, turning the memory manager code on/off at runtime won't change anything in terms of interfering with a compiler's own memory management code.

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
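The distinction matters because configure of that era typically accepted unrecognized --enable/--disable options without complaint, so the wrong spelling simply left ptmalloc2 in the build, exactly what Åke observed. A sketch of the correct invocation (prefix path illustrative):

```shell
# Correct flag: --without-memory-manager (a --with-style option).
# '--disable-memory-manager' is not a recognized option and is typically
# ignored silently, so ptmalloc2 still gets compiled and linked in.
./configure --without-memory-manager --prefix=/opt/openmpi
make all install
```

Because linking ptmalloc2 is decided here, at build time, no runtime switch can later undo its interposition on malloc.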
Re: [OMPI users] OSX undefined symbols when compiling hello world in cpp but not in c
QqJJlF.o
    typeinfo for MPI::Info in ccQqJJlF.o
"MPI::Comm::Set_errhandler(MPI::Errhandler const&)", referenced from:
    vtable for MPI::Comm in ccQqJJlF.o
    vtable for MPI::Intracomm in ccQqJJlF.o
    vtable for MPI::Cartcomm in ccQqJJlF.o
    vtable for MPI::Graphcomm in ccQqJJlF.o
    vtable for MPI::Intercomm in ccQqJJlF.o
"MPI::Win::Free()", referenced from:
    vtable for MPI::Win in ccQqJJlF.o
"operator delete(void*)", referenced from:
    MPI::Datatype::~Datatype() in ccQqJJlF.o
    MPI::Datatype::~Datatype() in ccQqJJlF.o
    MPI::Status::~Status() in ccQqJJlF.o
    MPI::Status::~Status() in ccQqJJlF.o
    MPI::Request::~Request() in ccQqJJlF.o
    MPI::Request::~Request() in ccQqJJlF.o
    MPI::Request::~Request() in ccQqJJlF.o
    MPI::Prequest::~Prequest() in ccQqJJlF.o
    MPI::Prequest::~Prequest() in ccQqJJlF.o
    MPI::Grequest::~Grequest() in ccQqJJlF.o
    MPI::Grequest::~Grequest() in ccQqJJlF.o
    MPI::Group::~Group() in ccQqJJlF.o
    MPI::Group::~Group() in ccQqJJlF.o
    MPI::Comm_Null::~Comm_Null() in ccQqJJlF.o
    MPI::Comm_Null::~Comm_Null() in ccQqJJlF.o
    MPI::Comm_Null::~Comm_Null() in ccQqJJlF.o
    MPI::Win::~Win() in ccQqJJlF.o
    MPI::Win::~Win() in ccQqJJlF.o
    MPI::Errhandler::~Errhandler() in ccQqJJlF.o
    MPI::Errhandler::~Errhandler() in ccQqJJlF.o
    MPI::Comm::~Comm() in ccQqJJlF.o
    MPI::Comm::~Comm() in ccQqJJlF.o
    MPI::Comm::~Comm() in ccQqJJlF.o
    MPI::Intracomm::~Intracomm() in ccQqJJlF.o
    MPI::Intracomm::~Intracomm() in ccQqJJlF.o
    MPI::Intracomm::~Intracomm() in ccQqJJlF.o
    MPI::Info::~Info() in ccQqJJlF.o
    MPI::Info::~Info() in ccQqJJlF.o
    MPI::Intercomm::~Intercomm() in ccQqJJlF.o
    MPI::Intercomm::~Intercomm() in ccQqJJlF.o
    MPI::Intracomm::Clone() const in ccQqJJlF.o
    MPI::Cartcomm::~Cartcomm() in ccQqJJlF.o
    MPI::Cartcomm::~Cartcomm() in ccQqJJlF.o
    MPI::Graphcomm::~Graphcomm() in ccQqJJlF.o
    MPI::Graphcomm::~Graphcomm() in ccQqJJlF.o
    MPI::Cartcomm::Clone() const in ccQqJJlF.o
    MPI::Graphcomm::Clone() const in ccQqJJlF.o
    MPI::Op::~Op() in ccQqJJlF.o
    MPI::Op::~Op() in ccQqJJlF.o
"MPI::FinalizeIntercepts()", referenced from:
    MPI::Finalize() in ccQqJJlF.o
"MPI::COMM_WORLD", referenced from:
    __ZN3MPI10COMM_WORLDE$non_lazy_ptr in ccQqJJlF.o
ld: symbol(s) not found
collect2: ld returned 1 exit status

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] What architecture? X86_64, that's what architecture!
On Mar 10, 2008, at 9:15 PM, Jim Hill wrote:

I'm trying to build a 64-bit 1.2.5 on an 8-core Xeon Mac Pro running OS X 10.4.11, with the Portland Group's PGI Workstation 7.1-5 tools. The configure script works its magic with a couple of modifications to account for PGI's tendency to freak out about F90 modules. Upon make, though, I end up dying with a "What architecture?" error in opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c:128 because (I presume) a 64-bit Xeon build isn't a PPC, a PPC64, or an X86. Is this something that's been seen by others? I'm not the world's greatest software stud, and this is just a step along the path to my real objective, which is making my own software run on this beast of a machine. Suggestions, tips, and clever insults are welcome. Thanks,

The configure script should have prevented that from happening (and indeed does with the GNU compilers). I don't have a copy of the PGI compilers for OS X to test with, so I can't debug this without some more information. What changes did you make to configure, what options did you specify to configure, and what was the full output of configure?

Thanks,
Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] OpenMPI 1.2.5 configure bug for POWERPC64 target
On Feb 27, 2008, at 8:34 AM, Jeff Squyres wrote:

On Feb 23, 2008, at 10:05 AM, Mathias PUETZ wrote:

2. Could someone explain why configure might determine a different ompi_cv_asm_format than stated in the asm-data.txt database? Maybe the meaning of the cryptic assembler format string is explained somewhere. If so, could someone point me to the explanation?

I have to defer to Brian on this one...

Sorry about the slow reply -- unfortunately, I don't have as much time to look at Open MPI issues as I once did. I have no idea -- likely the test doesn't cover some corner case. The first question that needs to be asked is: for the AIX / PowerPC machine you're running on, what is the right answer? (As an IBM employee, you're certainly more qualified to answer that than I am...)

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Can't compile C++ program with extern "C" { #include mpi.h }
On Jan 1, 2008, at 12:47 AM, Adam C Powell IV wrote:

On Mon, 2007-12-31 at 20:01 -0700, Brian Barrett wrote:

Yeah, this is a complicated example, mostly because HDF5 should really be covering this problem for you. I think your only option at that point would be to use the #define to not include the C++ code. The problem is that the MPI standard *requires* mpi.h to include both the C and C++ interface declarations if you're using C++. There's no way for the preprocessor to determine whether there's a currently active extern "C" block, so there's really not much we can do. Best hope would be to get the HDF5 guys to properly protect their code from C++...

Okay. So in HDF5, since they call MPI from C, they're just using the C interface, right? So should they define OMPI_SKIP_MPICXX just in case they're #included by C++ and using Open MPI, or is there a more MPI-implementation-agnostic way to do it?

No, they should definitely not be disabling the C++ bindings inside HDF5 -- that would be a situation worse than the current one. Consider the case where an application uses both HDF5 and the C++ MPI bindings. It includes hdf5.h before mpi.h. The hdf5.h includes mpi.h, without the C++ bindings. The application then includes mpi.h, wanting the C++ bindings. But the multiple-inclusion protection in mpi.h means nothing happens, so no C++ bindings.

My comment about HDF5 was that it would be easiest if it protected its declarations with extern "C" when used from C++. This is what most packages that might be used with C++ do, and it works pretty well. I'd actually be surprised if modern versions of HDF5 didn't already do that.

Now that it's not New Year's Eve, I thought of what's probably the easiest solution for you: just include mpi.h (outside your extern "C" block) before hdf5.h. The multiple-inclusion protection in mpi.h will mean that the preprocessor removes everything from the mpi.h that's included from hdf5.h.
So the extern "C" around hdf5.h shouldn't be too much of a problem.

Hope this helps,
Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Can't compile C++ program with extern "C" { #include mpi.h }
On Dec 31, 2007, at 7:26 PM, Adam C Powell IV wrote:

Okay, fair enough for this test example. But the Salomé case is more complicated:

extern "C" {
#include <hdf5.h>
}

What to do here? The hdf5 prototypes must be in an extern "C" block, but hdf5.h #includes a file which #includes mpi.h... Thanks for the quick reply!

Yeah, this is a complicated example, mostly because HDF5 should really be covering this problem for you. I think your only option at that point would be to use the #define to not include the C++ code. The problem is that the MPI standard *requires* mpi.h to include both the C and C++ interface declarations if you're using C++. There's no way for the preprocessor to determine whether there's a currently active extern "C" block, so there's really not much we can do. Best hope would be to get the HDF5 guys to properly protect their code from C++...

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Can't compile C++ program with extern "C" { #include mpi.h }
On Dec 31, 2007, at 7:12 PM, Adam C Powell IV wrote:

I'm trying to build the Salomé engineering simulation tool, and am having trouble compiling with Open MPI. The full text of the error is at http://lyre.mit.edu/~powell/salome-error . The crux of the problem can be reproduced by trying to compile a C++ file with:

extern "C" {
#include "mpi.h"
}

At the end of mpi.h, the C++ headers get loaded while in extern "C" mode, and the result is a vast list of errors.

Yes, it will. Similar to other external packages (like system headers), you absolutely should not include mpi.h from an extern "C" block. It will fail, as you've noted. The proper solution is to not be in an extern "C" block when including mpi.h.

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] Compiling 1.2.4 using Intel Compiler 10.1.007 on Leopard
I finally had a chance to look at this (since the same things are happening with LAM as well). The base issue is that Intel's compiler is completely borked. I can't fathom how a company could release a product that fundamentally broken. That would be all well and good, except that recent versions of Autoconf expect the compiler to at least kind of work without special CFLAGS for some of its tests (like "does the compiler understand -g" or "does -E invoke the preprocessor") -- not an unreasonable assumption. Configure is getting those answers wrong because that's suddenly not true. It then makes a wrong choice about requiring some Autoconf compatibility scripts to build, which don't work on OS X (probably because they aren't usually needed, so they're not well tested).

A hackish fix is to set CC to "icc -no-multibyte-chars" and CXX to "icpc -no-multibyte-chars" instead of putting -no-multibyte-chars in CFLAGS. With those parameters, I was able to successfully build Open MPI (and applications against Open MPI). Hopefully Intel can fix their compilers before this causes too many more issues. How you ship an (expensive!) compiler that just flat out doesn't work is beyond me.

Brian

On Dec 12, 2007, at 11:18 AM, Warner Yuen wrote:

Hi Jeff,

It seems that the problems are partially the compiler's fault; maybe the updated compilers didn't catch all the problems filed against the last release? Why else would I need to add the "-no-multibyte-chars" flag for pretty much everything that I build with ICC? Also, it's odd that I have to use /lib/cpp when using Intel ICC/ICPC, whereas with GCC things just find their way correctly. Again, IFORT and GCC together seem fine. Lastly... not that I use these... but MPICH-2.1 and MPICH-1.2.7 for Myrinet built just fine.
Here are the output files: Warner Yuen Scientific Computing Consultant Apple Computer email: wy...@apple.com Tel: 408.718.2859 Fax: 408.715.0133 On Dec 12, 2007, at 9:00 AM, users-requ...@open-mpi.org wrote: Message: 1 Date: Wed, 12 Dec 2007 06:50:03 -0500 From: Jeff Squyres <jsquy...@cisco.com> Subject: Re: [OMPI users] Problems compiling 1.2.4 using Intel Compiler 10.1.006 on Leopard To: Open MPI Users <us...@open-mpi.org> My primary work platform is a MacBook Pro, but I don't specifically develop for OS X, so I don't have any special compilers. Sorry to ask this because I think the information was sent before, but could you send all the compile/failure information? http://www.open-mpi.org/community/help/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Open MPI 1.2.4 verbosity w.r.t. osc pt2pt
On Oct 16, 2007, at 11:56 AM, Jeff Squyres wrote: On Oct 16, 2007, at 11:20 AM, Brian Granger wrote: Wow, that is quite a study of the different options. I will spend some time looking over things to better understand the (complex) situation. I will also talk with Lisandro Dalcin about what he thinks the best approach is for mpi4py. One question though. You said that nothing had changed in this respect from 1.2.3 to 1.2.4, but 1.2.3 doesn't show the problem. Does this make sense? I wondered about that as well. Is there any chance that you simply weren't using the one-sided MPI functionality between your different versions? Or are you using the same version of your software with v1.2.3 and v1.2.4 of OMPI? If so, I'm kinda at a loss. :-( FWIW: our one-sided support loads lazily; it doesn't load during MPI_INIT like most of the rest of the plugins that we have. Since not many MPI applications use it, we decided to make it only load the osc plugins the first time an MPI window is created. Actually, I never wrote the lazy open code, so we load the components during MPI_INIT. They aren't initialized until first use, but they are loaded. Just to verify, I did a build of 1.2.3 and of 1.2.4 and there's no difference in the list of undefined symbols or library references between the pt2pt osc components in the two builds. Still at a loss for why the change between releases -- I would not have expected it to work with 1.2.3. Brian.
Re: [OMPI users] Open MPI 1.2.4 verbosity w.r.t. osc pt2pt
On Oct 10, 2007, at 1:27 PM, Dirk Eddelbuettel wrote: | Does this happen for all MPI programs (potentially only those that | use the MPI-2 one-sided stuff), or just your R environment? This is the likely winner. It seems indeed due to R's Rmpi package. Running a simple mpitest.c shows no error message. We will look at the Rmpi initialization to see what could cause this. Does rmpi link in libmpi.so or dynamically load it at run-time? The pt2pt one-sided component uses the MPI-1 point-to-point calls for communication (hence, the pt2pt name). If those symbols were unavailable (say, because libmpi.so was dynamically loaded) I could see how this would cause problems. The pt2pt component (rightly) does not have a -lmpi in its link line. The other components that use symbols in libmpi.so (wrongly) do have a -lmpi in their link line. This can cause some problems on some platforms (Linux tends to do dynamic linking / dynamic loading better than most). That's why only the pt2pt component fails. My guess is that Rmpi is dynamically loading libmpi.so, but not specifying the RTLD_GLOBAL flag. This means that libmpi.so is not available to the components the way it should be, and all goes downhill from there. It only mostly works because we do something silly with how we link most of our components, and Linux is just smart enough to cover our rears (thankfully). Solutions: - Someone could make the pt2pt osc component link in libmpi.so like the rest of the components and hope that no one ever tries this on a non-friendly platform. - Debian (and all Rmpi users) could configure Open MPI with the --disable-dlopen flag and ignore the problem. - Someone could fix Rmpi to dlopen libmpi.so with the RTLD_GLOBAL flag and fix the problem properly. I think it's clear I'm in favor of Option 3. Brian
Re: [OMPI users] aclocal.m4 booboo?
On Sep 27, 2007, at 6:44 PM, Mostyn Lewis wrote: Today's SVN. A generated configure has this in it: I'm not able to replicate this using an SVN checkout of the trunk -- you might want to make sure you have a proper install of all the autotools. If you are using another branch from SVN, you can not use recent CVS copies of Libtool, you'll have to use the same version specified here: http://www.open-mpi.org/svn/building.php Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Open MPI on 64 bits intel Mac OS X
On Sep 28, 2007, at 4:56 AM, Massimo Cafaro wrote: Dear all, when I try to compile my MPI code on 64-bit Intel Mac OS X the build fails since the Open MPI library has been compiled using 32 bits. Can you please provide in the next version the ability at configure time to choose between 32 and 64 bits, or even better compile by default using both modes? To reproduce the problem, simply compile on 64-bit Intel Mac OS X an MPI application using mpicc -arch x86_64. The 64-bit linker complains as follows: ld64 warning: in /usr/local/mpi/lib/libmpi.dylib, file is not of required architecture ld64 warning: in /usr/local/mpi/lib/libopen-rte.dylib, file is not of required architecture ld64 warning: in /usr/local/mpi/lib/libopen-pal.dylib, file is not of required architecture and a number of undefined symbols is shown, one for each MPI function used in the application. This is already possible. Simply use the configure options: ./configure ... CFLAGS="-arch x86_64" CXXFLAGS="-arch x86_64" OBJCFLAGS="-arch x86_64" Also set FFLAGS and FCFLAGS to "-m64" if you have the gfortran or g95 compilers installed. The common installs of either don't speak the -arch option, so you have to use the more traditional -m64. Hope this helps, Brian
Re: [OMPI users] readv failed with errno=104
On Sep 25, 2007, at 4:25 AM, Rayne wrote: Hi all, I'm using the SGE system on my school network, and would like to know if the errors I received below means there's something wrong with my MPI_Recv function. [0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104 [0,1,2][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104 Generally, these indicate that the remote process has died. Generally, that means an abnormal termination due to segmentation faults or the like. You might want to run the code under a debugger to see if it shows anything useful. If your cluster doesn't have a parallel debugger like TotalView or DDT available, you can (for small numbers of processes) get away with using xterm and gdb, something like: mpirun -np X -d xterm -e gdb It'll open X xterms, each with a gdb running one instance of the application. Good luck, Brian
Re: [OMPI users] OpenMPI on Cray XT4 CNL
On Sep 25, 2007, at 1:37 PM, Richard Graham wrote: Josh Hursey did the port of Open MPI to CNL. Here is the config line I have used to build on the Cray XT4: ./configure CC=/opt/xt-pe/default/bin/snos64/linux-pgcc CXX=/opt/xt-pe/default/bin/snos64/linux-pgCC F77=/opt/xt-pe/default/bin/snos64/linux-pgf90 FC=/opt/xt-pe/default/bin/snos64/linux-pgf77 CFLAGS=-I/opt/xt-pe/default/include/ CPPFLAGS=-I/opt/xt-pe/default/include/ FCFLAGS=-I/opt/xt-pe/default/include/ FFLAGS=-I/opt/xt-pe/default/include/ LDFLAGS=-L/opt/xt-mpt/default/lib/snos64/ LIBS="-lpct -lalpslli -lalpsutil" --build=x86_64-unknown-linux-gnu --host=x86_64-cray-linux-gnu --with-platform=../contrib/platform/cray_xt3_romio --with-io-romio-flags=--disable-aio build_alias=x86_64-unknown-linux-gnu host_alias=x86_64-cray-linux-gnu --enable-ltdl-convenience --no-recursion --prefix=/na2_apps/OpenMPI/xt-2.0.20/1.2/ompi/P2 I believe, however, that you need to use one of the Open MPI 1.2.4 release candidates or the nightly tarballs from the 1.2 or trunk branches. There are some known issues with the 1.2.3 release on the Cray XT platform that have since been resolved. Brian
Re: [OMPI users] another mpirun + xgrid question
On Sep 10, 2007, at 1:35 PM, Lev Givon wrote: When launching an MPI program with mpirun on an xgrid cluster, is there a way to cause the program being run to be temporarily copied to the compute nodes in the cluster when executed (i.e., similar to what the xgrid command line tool does)? Or is it necessary to make the program being run available on every compute node (e.g., using NFS data partions)? This is functionality we never added to our XGrid support. It certainly could be added, but we have an extremely limited supply of developer cycles for the XGrid support at the moment. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] running jobs on a remote XGrid cluster via mpirun
On Aug 28, 2007, at 10:59 AM, Lev Givon wrote: Received from Brian Barrett on Tue, Aug 28, 2007 at 12:22:29PM EDT: On Aug 27, 2007, at 3:14 PM, Lev Givon wrote: I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate Mac client that I am using to submit jobs to the head (controller) node of the cluster. The cluster's compute nodes are all connected to the head node via a private network and are not running any firewalls. When I try running jobs with mpirun directly on the cluster's head node, they execute successfully; if I attempt to submit the jobs from the client (which can run jobs on the cluster using the xgrid command line tool) with mpirun, however, they appear to hang indefinitely (i.e., a job ID is created, but the mpirun itself never returns or terminates). Is it necessary to configure the firewall on the submission client to grant access to the cluster head node in order to remotely submit jobs to the cluster's head node? Currently, every node on which an MPI process is launched must be able to open a connection to a random port on the machine running mpirun. So in your case, you'd have to configure the network on the cluster to be able to connect back to your workstation (and the workstation would have to allow connections from all your cluster nodes). Far from ideal, but it's what it is. Brian Can this be avoided by submitting the "mpirun -n 10 myProg" command directly to the controller node with the xgrid command line tool? For some reason, sending the above command to the cluster results in a "task: failed with status 255" error even though I can successfully run other programs or commands on the cluster with the xgrid tool. I know that OpenMPI on the cluster is running properly because I can run programs with mpirun successfully when logged into the controller node itself. Open MPI was designed to be the one calling XGrid's scheduling algorithm, so I'm pretty sure that you can't submit a job that just runs Open MPI's mpirun. 
That wasn't really in our original design space as an option. Brian
Re: [OMPI users] running jobs on a remote XGrid cluster via mpirun
On Aug 27, 2007, at 3:14 PM, Lev Givon wrote: I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate Mac client that I am using to submit jobs to the head (controller) node of the cluster. The cluster's compute nodes are all connected to the head node via a private network and are not running any firewalls. When I try running jobs with mpirun directly on the cluster's head node, they execute successfully; if I attempt to submit the jobs from the client (which can run jobs on the cluster using the xgrid command line tool) with mpirun, however, they appear to hang indefinitely (i.e., a job ID is created, but the mpirun itself never returns or terminates). Is it necessary to configure the firewall on the submission client to grant access to the cluster head node in order to remotely submit jobs to the cluster's head node? Currently, every node on which an MPI process is launched must be able to open a connection to a random port on the machine running mpirun. So in your case, you'd have to configure the network on the cluster to be able to connect back to your workstation (and the workstation would have to allow connections from all your cluster nodes). Far from ideal, but it's what it is. Brian
Re: [OMPI users] failure to link on macosx
On Aug 24, 2007, at 10:57 AM, Marwan Darwish wrote: I keep on getting the following link error when compiling lam-mpi on a macosx (in the release mode) would moving to open-mpi resolve such issues, anybody with experience in this Moving to Open MPI will work around this issue. Another option (if you're not using Myrinet/GM) would be to compile LAM with the --without-memory-manager option to configure. Hope this helps, Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] MPI_FILE_NULL
On Aug 23, 2007, at 4:33 AM, Bernd Schubert wrote: I need to compile a benchmarking program and absolutely so far do not have any experience with any MPI. However, this looks like a general open-mpi problem, doesn't it? bschubert@lanczos MPI_IO> make cp ../globals.f90 ./; mpif90 -O2 -c ../globals.f90 mpif90 -O2 -c main.f90 mpif90 -O2 -c reader.f90 fortcom: Error: reader.f90, line 24: This name does not have a type, and must have an explicit type. [MPI_FILE_NULL] call MPI_File_set_errhandler (MPI_FILE_NULL, MPI_ERRORS_ARE_FATAL, ierror) Yeah, that looks like a mistake on our part. It will be fixed in Open MPI 1.2.4. Your quick fix should work until then. Thanks, Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] Error: error in configure (maybe libtool)
On Aug 22, 2007, at 2:35 PM, Higor de Padua Vieira Neto wrote: At the end of the output file, just show this: " (...lot of output ...) config.status: creating opal/include/opal_config.h config.status: creating orte/include/orte_config.h config.status: orte/include/orte_config.h is unchanged config.status: creating ompi/include/ompi_config.h config.status: ompi/include/ompi_config.h is unchanged config.status: creating ompi/include/mpi.h config.status: ompi/include/mpi.h is unchanged config.status: executing depfiles commands config.status: executing libtool commands /bin/rm: cannot lstat `libtoolT': No such file or directory" end. I don't know why this happened, but I've read the output file and I didn't find anything strange. Wow, that's pretty cool -- I haven't seen this one before. Two things come to mind -- if you're on a shared file system, are all your clocks synchronized? And any chance you ran out of disk space? Can you try running configure again and see if the same thing happens again? Thanks, Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] building static and shared OpenMPI libraries on MacOSX
On Aug 21, 2007, at 10:52 PM, Lev Givon wrote: (Running ompi_info after installing the build confirms the absence of said components). My concern, unsurprisingly, is motivated by a desire to use OpenMPI on an xgrid cluster (i.e., not with rsh/ssh); unless I am misconstruing the above observations, building OpenMPI with --enable-static seems to preclude this. Should xgrid functionality still be present when OpenMPI is built with --enable-static? Ah, yes. Due to some issues with our build system, you have to build shared libraries to use the XGrid support. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] building static and shared OpenMPI libraries on MacOSX
On Aug 21, 2007, at 3:32 PM, Lev Givon wrote: configure: WARNING: *** Shared libraries have been disabled (-- disable-shared) configure: WARNING: *** Building MCA components as DSOs automatically disabled checking which components should be static... none checking for projects containing MCA frameworks... opal, orte, ompi Specifying --enable-shared --enable-static results in the same behavior, incidentally. Is the above to be expected? Yes, this is expected. This is just a warning that we build components into the library rather than as run-time loadable components when static libraries are enabled. This is probably not technically necessary on Linux and OS X, but in general is the easiest thing for us to do. So you should have a perfectly working build with this setup. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] values of mca parameters whilst running program
On Aug 2, 2007, at 4:22 PM, Glenn Carver wrote: Hopefully an easy question to answer... is it possible to get at the values of mca parameters whilst a program is running? What I had in mind was either an open-mpi function to call which would print the current values of mca parameters or a function to call for specific mca parameters. I don't want to interrupt the running of the application. Bit of background. I have a large F90 application running with OpenMPI (as Sun Clustertools 7) on Opteron CPUs with an IB network. We're seeing swap thrashing occurring on some of the nodes at times and having searched the archives and read the FAQ believe we may be seeing the problem described in: http://www.open-mpi.org/community/lists/users/2007/01/2511.php where the udapl free list is growing to a point where lockable memory runs out. Problem is, I have no feel for the kinds of numbers that "btl_udapl_free_list_max" might safely get up to? Hence the request to print mca parameter values whilst the program is running to see if we can tie in high values of this parameter to when we're seeing swap thrashing. Good news, the answer is easy. Bad news is, it's not the one you want. btl_udapl_free_list_max is the *greatest* the list will ever be allowed to grow to, not its current size. So if you don't specify a value and use the default of -1, it will return -1 for the life of the application, regardless of how big those free lists actually get. If you specify value X, it'll return X for the life of the application, as well. There is not a good way for a user to find out the current size of a free list or the largest it got for the life of an application (currently those two will always be the same, but that's another story). Your best bet is to set the parameter to some value (say, 128 or 256) and see if that helps with the swapping. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] Problem building openmpi 1.2.3 on RHEL 5
On Jul 26, 2007, at 7:43 PM, Mathew Binkley wrote: ../../libtool: line 460: CDPATH: command not found libtool: Version mismatch error. This is libtool 2.1a, but the libtool: definition of this LT_INIT comes from an older release. libtool: You should recreate aclocal.m4 with macros from libtool 2.1a libtool: and run autoconf again. make[2]: *** [asm.lo] Error 1 make[2]: Leaving directory It kind of looks like the Makefiles decided to regenerate configure and ended up with a bad build. Did you start with a clean tarball? If not, can you try from a clean tarball and record the entire output of make? Thanks! Brian
Re: [OMPI users] MPI_File_set_view rejecting subarray views.
On Jul 19, 2007, at 3:24 PM, Moreland, Kenneth wrote: I've run into a problem with the File I/O with openmpi version 1.2.3. It is not possible to call MPI_File_set_view with a datatype created from a subarray. Instead of letting me set a view of this type, it gives an invalid datatype error. I have attached a simple program that demonstrates the problem. In particular, the following sequence of function calls should be supported, but they are not. MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_BYTE, &view); MPI_File_set_view(fd, 20, MPI_BYTE, view, "native", MPI_INFO_NULL); After poking around in the source code a bit, I discovered that the I/O implementation actually supports the subarray data type, but there is a check that is issuing an error before the underlying I/O layer (ROMIO) has a chance to handle the request. You need to commit the datatype after calling MPI_Type_create_subarray. If you add: MPI_Type_commit(&view); after the Type_create, but before File_set_view, the code will run to completion. Well, the code will then complain about a Barrier after MPI_Finalize due to an error in how we shut down when there are files that have been opened but not closed (you should also add a call to MPI_File_close after the set_view, but I'm assuming it's not there because this is a test code). This is something we need to fix, but also signifies a user error. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
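Assembled into a complete sequence, Brian's fix looks like the following sketch. The dimensions, offsets, and file name are illustrative only, and the program must be compiled against an MPI installation (e.g. with mpicxx) and launched under mpirun:

```cpp
// Sketch of the corrected call sequence: commit the subarray type
// before using it in MPI_File_set_view, and close the file before
// MPI_Finalize.
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int sizes[3]    = {4, 4, 4};   // illustrative full-array extents
    int subsizes[3] = {2, 2, 2};   // illustrative subarray extents
    int starts[3]   = {0, 0, 0};
    MPI_Datatype view;

    MPI_Type_create_subarray(3, sizes, subsizes, starts,
                             MPI_ORDER_FORTRAN, MPI_BYTE, &view);
    MPI_Type_commit(&view);        // the missing step: commit before use

    MPI_File fd;
    MPI_File_open(MPI_COMM_WORLD, "data.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fd);
    MPI_File_set_view(fd, 20, MPI_BYTE, view, "native", MPI_INFO_NULL);
    MPI_File_close(&fd);           // close before finalize, as the reply notes

    MPI_Type_free(&view);
    MPI_Finalize();
    return 0;
}
```

Committing is required for any derived datatype before it is used in communication or I/O; the uncommitted type is what triggered the invalid-datatype check ahead of ROMIO.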
Re: [OMPI users] Problems running openmpi under os x
ir=/usr/lib --build=powerpc-apple-darwin8 --with-arch=nocona --with-tune=generic --program-prefix= --host=i686-apple-darwin8 --target=i686-apple-darwin8 Thread model: posix gcc version 4.0.1 (Apple Computer, Inc. build 5367) Tim On 12/07/2007, at 7:57 AM, Brian Barrett wrote: That's unexpected. If you run the command 'ompi_info --all', it should list (towards the top) things like the Bindir and Libdir. Can you see if those have sane values? If they do, can you try running a simple hello, world type MPI application (there's one in the OMPI tarball). It almost looks like memory is getting corrupted, which would be very unexpected that early in the process. I'm unable to duplicate the problem with 1.2.3 on my Mac Pro, making it all the more strange. Another random thought -- Which compilers did you use to build Open MPI? Brian On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote: Open MPI: 1.2.3 Open MPI SVN revision: r15136 Open RTE: 1.2.3 Open RTE SVN revision: r15136 OPAL: 1.2.3 OPAL SVN revision: r15136 Prefix: /usr/local Configured architecture: i386-apple-darwin8.10.1 Hi Brian, 1.2.3 downloaded and built from source. Tim On 12/07/2007, at 12:50 AM, Brian Barrett wrote: Which version of Open MPI are you using? Thanks, Brian On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote: I have a problem running openmpi under OS 10.4.10. My program runs fine under debian x86_64 on an opteron but under OS X on a number of Mac Book and Mac Book Pros, I get the following immediately on startup. This smells like a common problem but I couldn't find anything relevant anywhere. Can anyone provide a hint or better yet a solution? Thanks, Tim Program received signal EXC_BAD_ACCESS, Could not access memory. 
Reason: KERN_PROTECTION_FAILURE at address: 0x000c 0x04510412 in free () (gdb) where #0 0x04510412 in free () #1 0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "${prefix}") at base/installdirs_base_expand.c:67 #2 0x05d24584 in opal_installdirs_base_open () at base/installdirs_base_components.c:94 #3 0x05d01a40 in opal_init_util () at runtime/opal_init.c:150 #4 0x05d01b24 in opal_init () at runtime/opal_init.c:200 #5 0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74, requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219 #6 0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at init.c:71 #7 0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1, argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83 #8 0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc:155 -- Tim Cornwell, Australia Telescope National Facility, CSIRO Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122, AUSTRALIA Post: PO Box 76, Epping, NSW 1710, AUSTRALIA Phone: +61 2 9372 4261 Fax: +61 2 9372 4450 or 4310 Mobile: +61 4 3366 5399 Email: tim.cornw...@csiro.au URL: http://www.atnf.csiro.au/people/tim.cornwell
Re: [OMPI users] DataTypes with "holes" for writing files
I wouldn't worry about it. 1.2.3 has no ROMIO fixes over 1.2.2. Brian On Jul 16, 2007, at 9:42 AM, jody wrote: Brian, I am using OpenMPI 1.2.2, so i am lagging a bit behind. Should i update to 1.2.3 and do the test again? Thanks for the info Jody On 7/16/07, Brian Barrett <bbarr...@lanl.gov> wrote: Jody - I usually update the ROMIO package before each major release (1.0, 1.1, 1.2, etc.) and then only within a major release series when a bug is found that requires an update. This seems to be one of those times ;). Just to make sure we're all on the same page, which version of Open MPI are you currently using? I've filed a bug report (you'll get an e-mail about it) about updating ROMIO for the 1.2 series. I'm not sure if it will make 1.2.4, but it could. Thanks, Brian On Jul 16, 2007, at 12:45 AM, jody wrote: > Rob, thanks for your info. > > Do you know whether OpenMPI will use a newer version > of ROMIO sometimes soon? > > Jody > > On 7/13/07, Robert Latham <r...@mcs.anl.gov> wrote:On Tue, Jul 10, > 2007 at 04:36:01PM +, jody wrote: > > Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks > > [aim-nano_02:9] MPI_ABORT invoked on rank 0 in communicator > > MPI_COMM_WORLD with errorcode 1 > > Hi Jody: > > OpenMPI uses an old version of ROMIO. You get this error because the > ADIOI_Count_contiguous_blocks routine in this version of ROMIO does > not understand all MPI datatypes. > > You can verify that this is the case by building your test against > MPICH2, which should succeed. 
> > ==rob > > -- > Rob Latham > Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF > Argonne National Lab, IL USA B29D F333 664A 4280 315B
Re: [OMPI users] DataTypes with "holes" for writing files
Jody - I usually update the ROMIO package before each major release (1.0, 1.1, 1.2, etc.) and then only within a major release series when a bug is found that requires an update. This seems to be one of those times ;). Just to make sure we're all on the same page, which version of Open MPI are you currently using? I've filed a bug report (you'll get an e-mail about it) about updating ROMIO for the 1.2 series. I'm not sure if it will make 1.2.4, but it could. Thanks, Brian On Jul 16, 2007, at 12:45 AM, jody wrote: Rob, thanks for your info. Do you know whether OpenMPI will use a newer version of ROMIO sometime soon? Jody On 7/13/07, Robert Latham wrote: On Tue, Jul 10, 2007 at 04:36:01PM +, jody wrote: > Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks > [aim-nano_02:9] MPI_ABORT invoked on rank 0 in communicator > MPI_COMM_WORLD with errorcode 1 Hi Jody: OpenMPI uses an old version of ROMIO. You get this error because the ADIOI_Count_contiguous_blocks routine in this version of ROMIO does not understand all MPI datatypes. You can verify that this is the case by building your test against MPICH2, which should succeed. ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B
Re: [OMPI users] end-to-end data reliability
On Jul 15, 2007, at 10:05 PM, Isaac Huang wrote: Hello, I read from the FAQ that current Open MPI releases don't support end-to-end data reliability. But I still have some confusion that can't be resolved by googling or reading the FAQ: 1. I read from "MPI - The Complete Reference" that "MPI provides the user with reliable message transmission. A message sent is always received correctly, and the user does not need to check for transmission errors, timeouts, or other error conditions." But the standard is sort of vague about what exactly this "reliable message transmission" is. Does it at least require reliable delivery? Or, does Open MPI notice and re-transmit lost data? Yes, the MPI standard guarantees messages are reliably delivered in order. MPI implementations have taken this to mean that if the transport is "reliable", then the MPI doesn't have to do anything special. So we assume that TCP delivers data into our headers properly and same for shared memory, Myrinet, and InfiniBand (the RC protocol, anyway). We also assume that any data sent arrives on the other side. We have an experimental point-to-point engine, DR, that provides reliable transportation even for networks that have corruption and/or packet loss. The engine isn't available in a stable release, as it is still in the experimental phase. Checksums and timers are used to detect message corruption and recover. This allows us to play with non-reliable network protocols such as UDP or InfiniBand's UD protocol. In truth, however, the reliability guaranteed by the transports currently in use by Open MPI is more than enough to meet the needs of almost all users. Most of the supported networks have some type of error detection or correction that provides protection only slightly statistically worse than what we could provide within Open MPI, but at a much lower cost. 2. When a data corruption happens (in message data), is the data in the message envelope still reliable? 
Or, does Open MPI or the MPI standard guarantee data integrity of message envelopes? I'm particularly interested in MPI_TAG which I use to encode things. In my opinion, any guarantee that applies to the message applies to the meta-data (tag, source, length) as well. The DR component will provide the same level of protection to the headers as it does to the payload. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] Problems running openmpi under os x
That's unexpected. If you run the command 'ompi_info --all', it should list (towards the top) things like the Bindir and Libdir. Can you see if those have sane values? If they do, can you try running a simple hello, world type MPI application (there's one in the OMPI tarball). It almost looks like memory is getting corrupted, which would be very unexpected that early in the process. I'm unable to duplicate the problem with 1.2.3 on my Mac Pro, making it all the more strange. Another random thought -- Which compilers did you use to build Open MPI? Brian On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote: Open MPI: 1.2.3 Open MPI SVN revision: r15136 Open RTE: 1.2.3 Open RTE SVN revision: r15136 OPAL: 1.2.3 OPAL SVN revision: r15136 Prefix: /usr/local Configured architecture: i386-apple-darwin8.10.1 Hi Brian, 1.2.3 downloaded and built from source. Tim On 12/07/2007, at 12:50 AM, Brian Barrett wrote: Which version of Open MPI are you using? Thanks, Brian On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote: I have a problem running openmpi under OS 10.4.10. My program runs fine under debian x86_64 on an opteron but under OS X on a number of Mac Book and Mac Book Pros, I get the following immediately on startup. This smells like a common problem but I couldn't find anything relevant anywhere. Can anyone provide a hint or better yet a solution? Thanks, Tim Program received signal EXC_BAD_ACCESS, Could not access memory. 
Reason: KERN_PROTECTION_FAILURE at address: 0x000c 0x04510412 in free () (gdb) where #0 0x04510412 in free () #1 0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "${prefix}") at base/installdirs_base_expand.c:67 #2 0x05d24584 in opal_installdirs_base_open () at base/installdirs_base_components.c:94 #3 0x05d01a40 in opal_init_util () at runtime/opal_init.c:150 #4 0x05d01b24 in opal_init () at runtime/opal_init.c:200 #5 0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74, requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219 #6 0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at init.c:71 #7 0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1, argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83 #8 0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc:155 -- Tim Cornwell, Australia Telescope National Facility, CSIRO Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122, AUSTRALIA Post: PO Box 76, Epping, NSW 1710, AUSTRALIA Phone: +61 2 9372 4261 Fax: +61 2 9372 4450 or 4310 Mobile: +61 4 3366 5399 Email: tim.cornw...@csiro.au URL: http://www.atnf.csiro.au/people/tim.cornwell
Re: [OMPI users] Connection to HNP lost
What Ralph said is generally true. If your application completed, this is nothing to worry about. It means that an error occurred on the socket between mpirun and some other process. However, combined with the tavor0 errors in the log files, it could mean that your IPoIB network is acting flaky. That would have me slightly concerned. Enough that I'd consider running some TCP stress tests on the network to make sure it's acting normally. Hope this helps, Brian On Jul 10, 2007, at 11:32 AM, Ralph H Castain wrote: On 7/10/07 11:08 AM, "Glenn Carver" wrote: Hi, I'd be grateful if someone could explain the meaning of this error message to me and whether it indicates a hardware problem or application software issue: [node2:11881] OOB: Connection to HNP lost [node1:09876] OOB: Connection to HNP lost This message is nothing to be concerned about - all it indicates is that mpirun exited before our daemon on your backend nodes did. It's relatively harmless and probably should be eliminated in some future version (except when developers are running in debug mode). The message can appear when the timing changes between front and backend nodes. What happens is: 1. mpirun detects that your processes have all completed. It then orders the shutdown of the daemons on your backend nodes. 2. each daemon does an orderly shutdown. Just before it terminates, it tells mpirun that it is done cleaning up and is about to exit. 3. when mpirun hears that all daemons are done cleaning up, it exits itself. This is where the timing issue comes into play - if mpirun exits before the daemon, then you get that error message as the daemon is terminating. So it's all a question of whether mpirun completes the last few steps to exit before the daemons do. In most cases, the daemons complete first as they have less to do. Sometimes, mpirun manages to get out first, and you get the message. I doubt it has anything to do with your hardware issues. 
Personally, I would just ignore the message - I'll see that it gets removed in later releases to avoid unnecessary confusion. Hope that helps Ralph I have a small cluster which until last week was just fine. Unfortunately we were hit by a sudden power dip which brought the cluster down and did significant damage to other servers (blew power supplies and disk). Although the cluster machines and the Infiniband link are up and running jobs, I am now getting these errors in user applications which we've never had before. The system messages file reports (for node2):
Jul 5 12:08:28 node1 genunix: [ID 408789 kern.notice] NOTICE: tavor0: fault cleared external to device; service available
Jul 5 12:08:28 node1 genunix: [ID 451854 kern.notice] NOTICE: tavor0: port 1 up
Jul 7 16:18:32 node1 genunix: [ID 408114 kern.info] /pci@1,0/pci1022,7450@2/pci15b3,5a46@1/pci15b3,5a44@0 (tavor0) online
Jul 7 16:18:32 node1 ib: [ID 842868 kern.info] IB device: daplt@0, daplt0
Jul 7 16:18:32 node1 genunix: [ID 936769 kern.info] daplt0 is /ib/daplt@0
Jul 7 16:18:32 node1 genunix: [ID 408114 kern.info] /ib/daplt@0 (daplt0) online
Jul 7 16:18:32 node1 genunix: [ID 834635 kern.info] /ib/daplt@0 (daplt0) multipath status: degraded, path /pci@1,0/pci1022,7450@2/pci15b3,5a46@1/pci15b3,5a44@0 (tavor0) to target address: daplt,0 is online Load balancing: round-robin
I wonder if these messages are indicative of a hardware problem, possibly on the Infiniband switch or the host adapters on the cluster machines. The cluster software has not been altered, but there have been small changes to the application codes. But I want to rule out hardware issues because of the power dip first. Anyone seen this message before and know whether to investigate hardware first? I did check the archives but it didn't help. More info provided below. Any help appreciated, thanks. Glenn -- Details: Cluster uses a mix of Sun's X4100/X4200 machines linked with Sun-supplied Infiniband and host adapters. 
All machines are running Solaris 10_x86 (11/06) with latest kernel patches Software is Sun Clustertools 7. Node2 $ ifconfig ibd1 ibd1: flags=1000843 mtu 2044 index 3 inet 192.168.50.202 netmask ff00 broadcast 192.168.50.255 Node1 $ ifconfig ibd1 ibd1: flags=1000843 mtu 2044 index 3 inet 192.168.50.201 netmask ff00 broadcast 192.168.50.255 ompi_info -a Open MPI: 1.2.1r14096-ct7b030r1838 Open MPI SVN revision: 0 Open RTE: 1.2.1r14096-ct7b030r1838 Open RTE SVN revision: 0 OPAL: 1.2.1r14096-ct7b030r1838 OPAL SVN revision: 0 MCA backtrace: printstack (MCA v1.0, API v1.0, Component v1.2.1) MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.2.1) MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.1)
Re: [OMPI users] warning:regcache incompatible with malloc
On Jul 10, 2007, at 11:40 AM, Scott Atchley wrote: On Jul 10, 2007, at 1:14 PM, Christopher D. Maestas wrote: Has anyone seen the following message with Open MPI: --- warning:regcache incompatible with malloc --- --- We don't see this message with mpich-mx-1.2.7..4 MX has an internal registration cache that can be enabled with MX_RCACHE=1 or disabled with MX_RCACHE=0 (the default before MX-1.2.1 was off, and starting with 1.2.1 the default is on). If it is on, MX checks to see if the application is trying to override malloc() and other memory handling functions. If so, it prints the error that you are seeing and fails to use the registration cache. Open MPI can use the regcache if you set MX_RCACHE=2. This tells MX to skip the malloc() check and use the cache regardless. In the case of Open MPI, this is believed to be safe. That will not be true for all applications. MPICH-MX does not manage memory, so MX_RCACHE=1 is safe to use unless the user's application manages memory. Scott - I'm having trouble getting the warning to go away with Open MPI. I've disabled our copy of ptmalloc2, so we're not providing a malloc anymore. I'm wondering if there's also something with the use of DSOs to load libmyriexpress? Is your belief that MX_RCACHE=2 is safe just for the BTL or for the MTL as well? Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
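A sketch of the knobs discussed above (the application name is a placeholder, not from the thread):

```
# MX_RCACHE=2: skip MX's malloc-override check and keep the
# registration cache on (believed safe for Open MPI).
MX_RCACHE=2 mpirun -np 4 ./myapp

# MX_RCACHE=0: disable the registration cache entirely if the
# warning (or misbehavior) persists.
MX_RCACHE=0 mpirun -np 4 ./myapp
```

These are launch-line fragments rather than a runnable script; the right value depends on whether the application overrides malloc().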
Re: [OMPI users] Unable to find any HCAs ..
On Jul 4, 2007, at 8:21 PM, Graham Jenkins wrote: I'm using the openmpi-1.1.1-5.el5.x86_64 RPM on a Scientific Linux 5 cluster, with no installed HCAs. And a simple MPI job submitted to that cluster runs OK .. except that it issues messages for each node like the one shown below. Is there some way I can suppress these, perhaps by an appropriate entry in /etc/openmpi-mca-params.conf ? -- libibverbs: Fatal: couldn't open sysfs class 'infiniband_verbs'. -- [0,1,0]: OpenIB on host localhost was unable to find any HCAs. Another transport will be used instead, although this may result in lower performance. Yes, there is a line you can add to /etc/openmpi-mca-params.conf: btl=^openib will tell Open MPI to use any available BTLs (our network transport layer) except openib. Hope this helps, Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] making all library components static (questions about --enable-mcs-static)
On Jun 7, 2007, at 9:04 PM, Code Master wrote: function `_int_malloc': multiple definition of `_int_malloc' /usr/lib/libopen-pal.a(lt1-malloc.o)(.text+0x18a0): openmpi-1.2.2/opal/mca/memory/ptmalloc2/malloc.c:3954: first defined here /usr/bin/ld: Warning: size of symbol `_int_malloc' changed from 1266 in /usr/lib/libopen-pal.a(lt1-malloc.o) to 1333 in /home/490_research/490/src/mpi.optimized_profiling//lib/libopen-pal.a(lt1-malloc.o) so what could go wrong here? Is it because openmpi has internal implementations of system-provided functions (such as malloc) that are also used in my program, but the one the client program uses is provided by the system whereas the one in the library has a different internal implementation? In such a case, how could I do the static linking in my client program? I really need static linking as far as possible to do the profiling. Yup, you guessed right. The easiest solution is to compile Open MPI without the memory manager code. This disables some optimizations for InfiniBand (OpenFabrics and MVAPI) and Myrinet/GM, but for other networks has no impact. You can disable the memory manager with the --without-memory-manager option to configure. Hope this helps, Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] v1.2.2 mca base unable to open pls/ras tm
Or tell Open MPI not to build Torque support, which can be done at configure time with the --without-tm option. Open MPI tries to build support for whatever it finds in the default search paths, plus whatever things you specify the location of. Most of the time, this is what the user wants. In this case, however, it's not what you wanted, so you'll have to add the --without-tm option. Hope this helps, Brian On Jun 8, 2007, at 1:08 PM, Cupp, Matthew R wrote: So I either have to uninstall torque, make the shared libraries available on all nodes, or have torque as static libraries on the head node? __ Matt Cupp Battelle Memorial Institute Statistics and Information Analysis -Original Message- From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Jeff Squyres Sent: Friday, June 08, 2007 2:21 PM To: Open MPI Users Subject: Re: [OMPI users] v1.2.2 mca base unable to open pls/ras tm On Jun 8, 2007, at 2:06 PM, Cupp, Matthew R wrote: Yes. But the /opt/torque directory is just the source, not the actual installed directory. The actual installed directory on the head node is the default location of /usr/lib/something. And that is not accessible by every node. But should it matter if it's not accessible if I don't specify --with-tm? I was wondering if ./configure detects torque has been installed, and then builds the associated components under the assumption that it's available. This is what OMPI does. However, if you only have static libraries for Torque, the issue should be moot -- the relevant bits should be statically linked into the OMPI tm plugins. But if your Torque libraries are shared, then you do need to have them available on all nodes for OMPI to be able to leverage native Torque/TM support. Make sense? -- Jeff Squyres Cisco Systems
Re: [OMPI users] Typo in r14829?
On Jun 1, 2007, at 12:15 PM, Bert Wesarg wrote: Hello, is the 'EGREP' a typo in the first hunk of r14829: https://svn.open-mpi.org/trac/ompi/changeset/14829/trunk/config/ cxx_find_template_repository.m4 Gah! Yes, it is. Should be $GREP. I'll fix this evening. Thanks, Brian
Re: [OMPI users] forcing MPI to bind all sockets to 127.0.0.1
Bill - This is a known issue in all released versions of Open MPI. I have a patch that hopefully will fix this issue in 1.2.3. It's currently waiting on people in the Open MPI team to verify I didn't do something stupid. Brian On May 29, 2007, at 9:59 PM, Bill Saphir wrote: George, This is one of the things I tried, and setting the oob interface did not work, with the error message below. Also, per this thread: http://www.open-mpi.org/community/lists/users/2007/05/3319.php I believe it is oob_tcp_include, not oob_tcp_if_include. The latter is silently ignored in 1.2, as far as I can tell. Interestingly, telling the MPI layer to use lo0 (or to not use tcp at all) works fine. But when I try to do the same for the OOB layer, it complains. The full error is: [mymac.local:07001] [0,0,0] mca_oob_tcp_init: invalid address '' returned for selected oob interfaces. [mymac.local:07001] [0,0,0] ORTE_ERROR_LOG: Error in file oob_tcp.c at line 1196 mpirun actually hangs at this point and no processes are spawned. I have to ^C to stop it. I see this behavior on both Mac OS and on Linux with 1.2.2. Bill George Bosilca wrote: There are 2 sets of sockets: one for the oob layer and one for the MPI layer (at least if TCP support is enabled). Therefore, in order to achieve what you're looking for you should add to the command line "--mca oob_tcp_if_include lo0 --mca btl_tcp_if_include lo0". On May 29, 2007, at 3:58 PM, Bill Saphir wrote: - original message below --- We have run into the following problem: - start up Open MPI application on a laptop - disconnect from network - application hangs I believe that the problem is that all sockets created by Open MPI are bound to the external network interface. For example, when I start up a 2 process MPI job on my Mac (no hosts specified), I get the following tcp connections. 192.168.5.2 is an address on my LAN. 
tcp4 0 0 192.168.5.2.49459 192.168.5.2.49463 ESTABLISHED tcp4 0 0 192.168.5.2.49463 192.168.5.2.49459 ESTABLISHED tcp4 0 0 192.168.5.2.49456 192.168.5.2.49462 ESTABLISHED tcp4 0 0 192.168.5.2.49462 192.168.5.2.49456 ESTABLISHED tcp4 0 0 192.168.5.2.49456 192.168.5.2.49460 ESTABLISHED tcp4 0 0 192.168.5.2.49460 192.168.5.2.49456 ESTABLISHED tcp4 0 0 192.168.5.2.49456 192.168.5.2.49458 ESTABLISHED tcp4 0 0 192.168.5.2.49458 192.168.5.2.49456 ESTABLISHED Since this application is confined to a single machine, I would like it to use 127.0.0.1, which will remain available as the laptop moves around. I am unable to force it to bind sockets to this address, however. Some of the things I've tried are: - explicitly setting the hostname to 127.0.0.1 (--host 127.0.0.1) - turning off the tcp btl (--mca btl ^tcp) and other variations (--mca btl self,sm) - using --mca oob_tcp_include lo0 The first two have no effect. The last one results in an error message of: [myhost.locall:05830] [0,0,0] mca_oob_tcp_init: invalid address '' returned for selected oob interfaces. Is there any way to force Open MPI to bind all sockets to 127.0.0.1? As a side question -- I'm curious what all of these tcp connections are used for. As I increase the number of processes, it looks like there are 4 sockets created per MPI process, without using the tcp btl. Perhaps stdin/out/err + control? Bill
Re: [OMPI users] OpenMPI on shared memory.
On May 29, 2007, at 12:25 PM, smai...@ksu.edu wrote: I am doing a research on parallel computing on shared memory with NUMA architecture. The system is a 4 node AMD opteron with each node being a dual-core. I am testing an OpenMPI program with MPI-nodes <= MAX cores available on system (in my case 4*2=8). Can someone tell me whether: a) In such cases (where MPI-nodes<=MAX cores on shared-memory), OpenMPI implements MPI-nodes as processes or threads? If yes, then how can it be determined at run-time? I am wondering because processes have more overhead than light-weight threads. In Open MPI, different MPI ranks are always different processes. This is what users expect, and I'd be hesitant to change that for the over-subscription case. Brian
Re: [OMPI users] Weird interaction with modem under OS X
On May 22, 2007, at 7:52 PM, Tom Clune wrote: For example, if it is ppp0, try: mpirun -np 1 -mca oob_tcp_exclude ppp0 uptime This seems to at least produce a bit of output before hanging: LM000953070:~ tlclune$ mpirun -np 1 -mca oob_tcp_exclude ppp0 uptime [153.sub-70-211-6.myvzw.com:07562] [0,0,0] mca_oob_tcp_init: invalid address '' returned for selected oob interfaces. [153.sub-70-211-6.myvzw.com:07562] [0,0,0] ORTE_ERROR_LOG: Error in file oob_tcp.c at line 1216 Tom - I managed to track this down a bit. We try to use the ppp0 interface (the cell phone device) for network connectivity, as it's the only non-localhost address up at the time. Unfortunately, we can't use the address to route messages that way and Open MPI hangs. The problem is made worse due to a bug that I'm still trying to track down in Open MPI. When you tell Open MPI to not use a device (like ppp0), it should just use whatever other devices are available. In your case, that would be localhost, which is what you're using when you don't have any network connectivity at all. But it appears that this instead causes Open MPI to segfault / hang. I'm looking into exactly why this is happening and should have a fix in the next day or so. Brian -- Brian W. Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] Weird interaction with modem under OS X
On May 21, 2007, at 7:40 PM, Tom Clune wrote: Executive summary: mpirun hangs when laptop is connected via cellular modem. Longer description: Under ordinary circumstances mpirun behaves as expected on my OS X (Intel-duo) laptop. I only want to be using the shared-memory mechanism - i.e. not sending packets across any networks. When my laptop is connected to the internet via ethernet or wireless (or not connected to the network at all) mpirun works just fine, but if I connect via my nifty new cellular modem (Verizon in case it matters), mpirun hangs at launch. I.e. my application never even starts, and I have to use an interrupt to regain a prompt. I'd like to be able to engage in other activities (mail, cvs, skype) while executing mpi code in the background, so I'm really hoping there is a simple switch to fix this. I am launching with the command: "mpirun -np 2 ./gx". I have also tried "mpirun --mca btl self,sm -np 2 ./gx" but that did not seem to improve the situation. I have attached the output from "ompi_info --all". The output does not seem to depend on whether I am connected via the modem or not. If you run "mpirun -np 1 uptime" with your cell modem up, does that work? This isn't one of those corner cases we test very often :). If it doesn't work, could you send the output of 'ifconfig'? One thing to try would be telling Open MPI to not use the network device for the modem. For example, if it is ppp0, try: mpirun -np 1 -mca oob_tcp_exclude ppp0 uptime Good luck, Brian
Re: [OMPI users] AlphaServers & OpenMPI
On May 13, 2007, at 6:23 AM, Bert Wesarg wrote: Even better: is there a patch available to fix this in the 1.2.1 tarball, so that I can set the full path again with CC? The patch is quite trivial, but requires a rebuild of the build system (autoheader, autoconf, automake, ...); see here: https://svn.open-mpi.org/trac/ompi/changeset/14610 but you can try to hack the current configure script, just by searching for the affected line As Bert said, the patch to fix the bug causes a bunch of files to rebuild using tools you probably don't want to bother installing. The easiest solution for now is to use the 1.2.2rc1 pre-release, available on the download page: http://www.open-mpi.org/software/ompi/v1.2/ It fixes a bunch of small issues found in the 1.2 and 1.2.1 releases, including the CC-with-a-path-in-it bug that you've stumbled upon. Brian -- Brian W. Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] newbie question
I fixed the OOB. I also mucked some things up with it interface wise that I need to undo :). Anyway, I'll have a look at fixing up the TCP component in the next day or two. Brian On May 10, 2007, at 6:07 PM, Jeff Squyres wrote: Brian -- Didn't you add something to fix exactly this problem recently? I have a dim recollection of seeing a commit go by about this...? (I advised Steve in IM to use --disable-ipv6 in the meantime) On May 10, 2007, at 1:25 PM, Steve Wise wrote: I'm trying to run a job specifically over tcp and the eth1 interface. It seems to be barfing on trying to listen via ipv6. I don't want ipv6. How can I disable it? Here's my mpirun line: [root@vic12-10g ~]# mpirun --n 2 --host vic12,vic20 --mca btl self,tcp -mca btl_tcp_if_include eth1 /root/IMB_2.3/src/IMB-MPI1 sendrecv [vic12][0,1,0][btl_tcp_component.c: 489:mca_btl_tcp_component_create_listen] socket() failed: Address family not supported by protocol (97) [vic12-10g:15771] mca_btl_tcp_component: IPv6 listening socket failed [vic20][0,1,1][btl_tcp_component.c: 489:mca_btl_tcp_component_create_listen] socket() failed: Address family not supported by protocol (97) [vic20-10g:23977] mca_btl_tcp_component: IPv6 listening socket failed Thanks, Steve. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_TYPE_STRUCT Not
On May 14, 2007, at 10:21 AM, Nym wrote: I am trying to use MPI_TYPE_STRUCT in a 64 bit Fortran 90 program. I'm using the Intel Fortran Compiler 9.1.040 (and C/C++ compilers 9.1.045). If I try to call MPI_TYPE_STRUCT with the array of displacements that are of type INTEGER(KIND=MPI_ADDRESS_KIND), then I get a compilation error: fortcom: Error: ./test_basic.f90, line 34: There is no matching specific subroutine for this generic subroutine call. [MPI_TYPE_STRUCT] CALL MPI_TYPE_STRUCT(numTypes, blockLengths, displacements, oldTypes & ---^ compilation aborted for ./test_basic.f90 (code 1) Attached is a small test program to demonstrate this. I thought according to the MPI specs that the displacement array should be of type MPI_ADDRESS_KIND. Am I wrong? Have a look at the last paragraph of Section 10.2.2 of the MPI-2 standard. Functions from MPI-1 that take address-sized arguments use INTEGER in Fortran. This was obviously a problem, which is why the functions from MPI-1 that take an address-sized argument are deprecated in favor of new functions in MPI-2 that take proper address kind arguments. Two options: 1) Use MPI_TYPE_STRUCT with INTEGER arguments 2) Use MPI_TYPE_CREATE_STRUCT with ADDRESS_KIND arguments Hope this helps, Brian -- Brian W. Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] openmpi-1.2.1 mpicc error
This was a regression in Open MPI 1.2.1. We improperly handle the situation where CC has a path in it. We will have this fixed in Open MPI 1.2.2. For now, your options are to use Open MPI 1.2 or specify a $CC without a path, such as CC=icc, and make sure $PATH is set properly. Brian On May 7, 2007, at 1:12 PM, Paul Van Allsburg wrote: I just completed the install of release 1.2.1 and I get an error attempting to compile with mpicc. The install was done with: source /opt/intel/fce/9.1.045/bin/ifortvars.sh source /opt/intel/cce/9.1.049/bin/iccvars.sh ./configure --prefix=/usr/local/openmpi-1.2.1_intel \ --with-tm=/usr/local \ --enable-static \ --disable-shared \ CC=/opt/intel/cce/9.1.049/bin/icc \ CXX=/opt/intel/cce/9.1.049/bin/icpc \ FC=/opt/intel/fce/9.1.045/bin/ifort make all install I tried to compile my hello program with $ source /opt/intel/fce/9.1.045/bin/ifortvars.sh $ source /opt/intel/cce/9.1.049/bin/iccvars.sh $ PATH="/usr/local/openmpi-1.2.1_intel/bin:$PATH";export PATH $ mpicc hello.c -o hello -g ld: dummy: No such file: No such file or directory I installed 1.2 exactly the same and it works fine. Any suggestions? Thanks! Paul -- Paul Van Allsburg Computational Science & Modeling Facilitator Natural Sciences Division, Hope College 35 East 12th Street Holland, Michigan 49423 616-395-7292 ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] 1.2.1 configure bug report: set CC variable may produce broken *wrapper-data.txt
Thanks for the bug report. I'm able to replicate your problem, and it will be fixed in the 1.2.2 release. Brian On May 7, 2007, at 6:10 AM, livelfs wrote: Hi all I have observed a regression between 1.2 and 1.2.1: if CC is assigned an absolute path (i.e. export CC=/opt/gcc/gcc-3.4.4/bin/gcc like in attached logs), the */tools/wrappers/*-wrapper-data.txt files produced by the configure script then have a broken libs macro definition: libs=-lmpi -lopen-rte -lopen-pal -ldl dummy ranlib instead of libs=-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl Regards, Stephane Rouberol
Re: [OMPI users] Call to MPI_Init affects errno
Yup, it does. There's nothing in the standard that says it isn't allowed to. Given the number of system/libc calls involved in doing communication, pretty much every MPI function is going to change the value of errno. If you expect otherwise, I'd modify your application. Most cluster-based MPI implementations are going to randomly change the errno on you. Brian On May 2, 2007, at 12:18 PM, Chudin, Eugene wrote: I am trying to experiment with openmpi, and the following trivial code (although it runs) affects the value of errno: #include <mpi.h> #include <iostream> int main(int argc, char** argv) { int _procid, _np; std::cout << "errno=\t" << errno << std::endl; MPI_Init(&argc, &argv); std::cout << "errno=\t" << errno << "\tafter MPI_Init()\t" << std::endl; MPI_Comm_rank (MPI_COMM_WORLD, &_procid); MPI_Comm_size (MPI_COMM_WORLD, &_np); std::cout << "errno msg=\t" << strerror(errno) << "\tprocessor=\t" << _procid << std::endl; MPI_Finalize(); return 0; } Compiled with mpiCC -Wall test.cpp -o test Produces the following output when run on just a single processor using mpirun -np 1 --prefix /toolbox/openmpi ./test errno= 0 errno= 2 after MPI_Init() errno msg= No such file or directory processor= 0 When run on two processors using mpirun -np 2 --prefix /toolbox/openmpi ./test errno= 0 errno= 0 errno= 11 after MPI_Init() errno= 115 after MPI_Init() errno msg= Operation now in progress processor= 0 errno msg= Resource temporarily unavailable processor= 1 The output of ompi_info --all is attached <> -- Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station, New Jersey, USA 08889), and/or its affiliates (which may be known outside the United States as Merck Frosst, Merck Sharp & Dohme or MSD and in Japan, as Banyu - direct contact information for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential, proprietary copyrighted and/or legally privileged. 
It is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please notify us immediately by reply e-mail and then delete it from your system.
Re: [OMPI users] orte_init failed
That's very odd. The usual cause for this is /tmp being unwritable by the user or full. Can you check to see if either of those conditions are true? Thanks, Brian On Apr 13, 2007, at 2:44 AM, Christine Kreuzer wrote: Hi, I run openmpi on a AMD Opteron with two dualcore processors an SLE10, until today everything worked fine but than I got the following error message: [computername:20612][0,0,0] ORTE_ERROR_LOG: Error in file ../../orte/runtime/orte_init_stage1.c at line 302 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_session_dir failed --> Returned value -1 instead of ORTE_SUCCESS -- [computername:20612] [0,0,0] ORTE_ERROR_LOG: Error in file ../../orte/runtime/orte_system_init.c at line 42 [computername:20612] [0,0,0] ORTE_ERROR_LOG: Error in file ../../orte/runtime/orte_init.c at line 49 -- Open RTE was unable to initialize properly. The error occured while attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS. -- I would appreciate any help or ideas to solve this problem. Thanks in advance! Regards, Christine -- Universität des Saarlandes AG Prof. Dr. Christoph Becher Fachrichtung 7.3 (Technische Physik) Geb. E2.6, Zimmer 2.04 D-66123 Saarbrücken Phone:+49(0)681 302 3418 Fax: +49(0)681 302 4676 E-mail: c.kreu...@mx.uni-saarland.de ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] OpenMPI 1.2 on MacOSX Intel Fails
On Apr 7, 2007, at 12:59 AM, Brian Powell wrote: Greetings, I turn to the assistance of the OpenMPI wizards. I have compiled v1.2 using gcc and ifort (see the attached config.log) with a variety of options. The compilation finishes (side note: I had to define NM otherwise the configure script failed) and installs. I try to run ompi_info and get the following: -- A library call unexpectedly failed. This is a terminal error; please show this message to an Open MPI wizard: Library call: mca_base_open Source file: ompi_info.cc Source line number: 139 Aborting... -- For reasons I can't duplicate, you're getting an out of memory error when trying to initialize our component system. I haven't seen this one before, but I noticed a couple of things in the config.log that made me think there might be an underlying problem... 1) You should never have to specify NM on OS X, even when cross-compiling. Can you send information on why this is necessary? 2) This might actually be the answer to #1, but you shouldn't specify --build=i386. The --build argument takes a complete config.guess-style architecture. In the case of Mac OS X 10.4.9 on i386, that would be: i386-apple-darwin8.9.1. But unless you're cross compiling, that argument should not be necessary. 3) The sysroot stuff is really only necessary if you are cross-compiling. I wouldn't use it in other cases, as it seems to make things more fragile. If you're going to specify a sysroot, you need to specify it in CFLAGS, CXXFLAGS, and OBJCFLAGS. You also should specify the -arch i386 in CXXFLAGS and OBJCFLAGS if you are going to specify it in CFLAGS and FFLAGS, if only for consistency. If you could try recompiling without the --build argument and let me know if that fixes the problem, I'd appreciate it. Brian
Re: [OMPI users] Issues with Get/Put and IRecv
Mike - In Open MPI 1.2, one-sided is implemented over point-to-point, so I would expect it to be slower. This may or may not be addressed in a future version of Open MPI (I would guess so, but don't want to commit to it). Were you using multiple threads? If so, how? On the good news front, I think your call stack looked similar to what I was seeing, so hopefully I can make some progress on a real solution. Brian On Mar 20, 2007, at 8:54 PM, Mike Houston wrote: Well, I've managed to get a working solution, but I'm not sure how I got there. I built a test case that looked like a nice simple version of what I was trying to do and it worked, so I moved the test code into my implementation and lo and behold it works. I must have been doing something a little funky in the original pass, likely causing a stack smash somewhere or trying to do a get/put out of bounds. If I have any more problems, I'll let y'all know. I've tested pretty heavy usage up to 128 MPI processes across 16 nodes and things seem to be behaving. I did notice that one-sided transfers seem to be a little slower than explicit send/recv, at least on GigE. Once I do some more testing, I'll bring things up on IB and see how things are going. -Mike Mike Houston wrote: Brian Barrett wrote: On Mar 20, 2007, at 3:15 PM, Mike Houston wrote: If I only do gets/puts, things seem to be working correctly with version 1.2. However, if I have a posted Irecv on the target node and issue a MPI_Get against that target, MPI_Test on the posted IRecv causes a segfault: Anyone have suggestions? Sadly, I need to have IRecv's posted. I'll attempt to find a workaround, but it looks like the posted IRecv is getting all the data of the MPI_Get from the other node. It's like the message tagging is getting ignored. I've never tried posting two different IRecv's with different message tags either... Hi Mike - I've spent some time this afternoon looking at the problem and have some ideas on what could be happening. 
I don't think it's a data mismatch (the data intended for the IRecv getting delivered to the Get), but more a problem with the call to MPI_Test perturbing the progress flow of the one-sided engine. I can see one or two places where it's possible this could happen, although I'm having trouble replicating the problem with any test case I can write. Is it possible for you to share the code causing the problem (or some small test case)? It would make me feel considerably better if I could really understand the conditions required to end up in a seg fault state. Thanks, Brian Well, I can give you a linux x86 binary if that would do it. The code is huge as it's part of a much larger system, so there is no such thing as a simple case at the moment, and the code is in pieces and largely unrunnable now with all the hacking... I basically have one thread spinning on an MPI_Test on a posted IRecv while being used as the target to the MPI_Get. I'll see if I can hack together a simple version that breaks late tonight. 
I've just played with posting a send to that IRecv, issuing the MPI_Get, handshaking and then posting another IRecv, and the MPI_Test continues to eat it, but in a memcpy:

#0  0x001c068c in memcpy () from /lib/libc.so.6
#1  0x00e412d9 in ompi_convertor_pack (pConv=0x83c1198, iov=0xa0, out_size=0xaffc1fd8, max_data=0xaffc1fdc) at convertor.c:254
#2  0x00ea265d in ompi_osc_pt2pt_replyreq_send (module=0x856e668, replyreq=0x83c1180) at osc_pt2pt_data_move.c:411
#3  0x00ea0ebe in ompi_osc_pt2pt_component_fragment_cb (pt2pt_buffer=0x8573380) at osc_pt2pt_component.c:582
#4  0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
#5  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#6  0x00ea59e5 in ompi_osc_pt2pt_passive_unlock (module=0x856e668, origin=1, count=1) at osc_pt2pt_sync.c:60
#7  0x00ea0cd2 in ompi_osc_pt2pt_component_fragment_cb (pt2pt_buffer=0x856f300) at osc_pt2pt_component.c:688
#8  0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
#9  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#10 0x00e33f05 in ompi_request_test (rptr=0xaffc2430, completed=0xaffc2434, status=0xaffc23fc) at request/req_test.c:82
#11 0x00e61770 in PMPI_Test (request=0xaffc2430, completed=0xaffc2434, status=0xaffc23fc) at ptest.c:52

-Mike
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Installation fails on Mac Os
On Mar 25, 2007, at 11:20 AM, Daniele Avitabile wrote: Hi everybody, I am trying to install Open MPI on a Mac OS X server, and the make all command exits with an error, as you can see from the output files I attached (openmpi_install_failed.tar.gz). Some comments that may be helpful: 1) I am not root on the machine, but I have permissions to write in /usr/local/applications/, which is the directory in which I want to install openmpi. 2) In the same directory there is already an openmpi 1.1.2 installation, with gcc-4.0.1 compilers. I want to install the current version of openmpi and use a different compiler, namely the gcc compilers optimised for Apple Intel. They reside in the folder /usr/local/bin, and I pass them in the make command, as you can see from the attached file. Any idea as to why I receive that error? Short answer: You need to either use the system-provided GCC or rebuild your version of GCC to use /usr/bin/libtool instead of /usr/bin/ld to link. Long answer: There are some things that are a little complicated to do with Mach-O if you want library versioning and plug-ins and all that to work properly. GNU Libtool (and therefore Open MPI) assumes that if you are using GCC, it can emit options to the linker that are meant for /usr/bin/libtool, the library creation helper for OS X. -compatibility_version is one of those things. Your version of GCC is instead invoking /usr/bin/ld directly, so things are going wrong. You can still use the "intel optimized" version of GCC to compile your application, as long as it doesn't use GNU libtool, of course. Just use the system GCC to compile Open MPI and all will be fine. Hope this helps, Brian
Re: [OMPI users] Issues with Get/Put and IRecv
On Mar 20, 2007, at 3:15 PM, Mike Houston wrote: If I only do gets/puts, things seem to be working correctly with version 1.2. However, if I have a posted Irecv on the target node and issue an MPI_Get against that target, MPI_Test on the posted IRecv causes a segfault: Anyone have suggestions? Sadly, I need to have IRecv's posted. I'll attempt to find a workaround, but it looks like the posted IRecv is getting all the data of the MPI_Get from the other node. It's like the message tagging is getting ignored. I've never tried posting two different IRecv's with different message tags either... Hi Mike - I've spent some time this afternoon looking at the problem and have some ideas on what could be happening. I don't think it's a data mismatch (the data intended for the IRecv getting delivered to the Get), but more a problem with the call to MPI_Test perturbing the progress flow of the one-sided engine. I can see one or two places where it's possible this could happen, although I'm having trouble replicating the problem with any test case I can write. Is it possible for you to share the code causing the problem (or some small test case)? It would make me feel considerably better if I could really understand the conditions required to end up in a seg fault state. Thanks, Brian
Re: [OMPI users] open-mpi 1.2 build failure under Mac OS X 10.3.9
Hi - Thanks for the bug report. I've fixed the problem in SVN and it will likely be part of the 1.2.1 release (whenever that happens). In the meantime, I've attached a patch that should apply to the 1.2 tarball that will also fix the problem. The environment variables you want for specifying the Fortran compilers are F77 for Fortran 77 and FC for Fortran 90/95/03. Hope this helps, Brian On Mar 16, 2007, at 5:42 PM, Marius Schamschula wrote: Hi all, I was building open-mpi 1.2 on my G4 running Mac OS X 10.3.9 and had a build failure with the following: depbase=`echo runtime/ompi_mpi_preconnect.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \ if /bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF "$depbase.Tpo" -c -o runtime/ompi_mpi_preconnect.lo runtime/ompi_mpi_preconnect.c; \ then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exit 1; fi libtool: compile: gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF runtime/.deps/ompi_mpi_preconnect.Tpo -c runtime/ompi_mpi_preconnect.c -fno-common -DPIC -o runtime/.libs/ompi_mpi_preconnect.o runtime/ompi_mpi_preconnect.c: In function `ompi_init_do_oob_preconnect': runtime/ompi_mpi_preconnect.c:74: error: storage size of `msg' isn't known make[2]: *** [runtime/ompi_mpi_preconnect.lo] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all-recursive] Error 1 $ gcc -v Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs Thread model: posix gcc version 3.3 20030304 (Apple Computer, Inc. 
build 1495) $ g77 -v Reading specs from /usr/local/lib/gcc/powerpc-apple-darwin7.3.0/3.5.0/specs Configured with: ../gcc/configure --enable-threads=posix --enable-languages=f77 Thread model: posix gcc version 3.5.0 20040429 (experimental) (g77 from hpc.sf.net) Note: I had no such problem under Mac OS X 10.4.9 with my ppc and x86 builds. However, I did notice that the configure script did not detect g95 from g95.org correctly: *** Fortran 90/95 compiler checking for gfortran... no checking for f95... no checking for fort... no checking for xlf95... no checking for ifort... no checking for ifc... no checking for efc... no checking for pgf95... no checking for lf95... no checking for f90... no checking for xlf90... no checking for pgf90... no checking for epcf90... no checking whether we are using the GNU Fortran compiler... no configure --help doesn't give any hint about specifying F95. Attachment: ompi_1.2_osx_10.3.diff
Re: [OMPI users] Still having problems building 1.2 on Mac OSX
On Feb 27, 2007, at 3:26 PM, Iannetti, Anthony C. ((GRC-RTB0)) wrote: Dear Open-MPI: I am still having problems building OpenMPI 1.2 (now 1.2b4) on MacOSX 10.4 PPC 64. In a message a while back, you gave me a hack to override this problem. I believe it was a problem with Libtool, or something like that. Well, it looks like I still have to use that hack. Thanks for bringing this to my attention. A patch was accidentally not moved into the v1.2 release branch. We'll try to get that fixed right away. Brian
Re: [OMPI users] 64-bit Open-mpi on Intel Mac OS X? (opal_if error)
This was fixed in 1.1.4, along with some shared memory performance issues on Intel Macs (32 or 64 bit builds). Brian On Feb 5, 2007, at 1:22 PM, Jason Martin wrote: Hi All, Using openmpi-1.1.3b3, I've been attempting to build Open-MPI in 64-bit mode on a Mac Pro (dual Xeon 5150 2.66GHz with 1G RAM). Using the following configuration options: ./configure --prefix=/usr/local/openmpi-1.1.3b3 \ --build=x86_64-apple-darwin \ CFLAGS=-m64 CXXFLAGS=-m64 \ LDFLAGS=-m64 The make goes fine, but in "make check" it hits an error in the "opal_if" test. Searching the source code in opal/util/if.c shows that the error is occurring with the ioctl(sd, SIOCGIFCONF, ) call never returning a valid result (I tried increasing MAX_IFCONF_SIZE, but that didn't help). There's a comment at the top of the file that mentions some compiler magic (align=power, etc.) for the 64-bit PPC version, but I'm at a loss about using it on a 64-bit Intel platform. Has anyone else had any experience with this? (Note that 32-bit binaries compile and pass make check.) Thanks, jason -- Jason Worth Martin Asst. Prof. of Mathematics James Madison University http://www.math.jmu.edu/~martin phone: (+1) 540-568-5101 fax: (+1) 540-568-6857 "Ever my heart rises as we draw near the mountains. There is good rock here." -- Gimli, son of Gloin -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] Can't run simple job with openmpi using the Intel compiler
This is very odd. The two error messages you are seeing are side effects of the real problem, which is that Open MPI is segfaulting when built with the Intel compiler. We've had some problems with bugs in various versions of the Intel compiler -- just to be on the safe side, can you make sure that the machine has the latest bug fixes from Intel applied? From there, if possible, it would be extremely useful to have a stack trace from a core file, or even to know whether it's mpirun or one of our "orte daemons" that are segfaulting. If you can get a core file, you should be able to figure out which process is causing the segfault. Brian On Feb 2, 2007, at 4:07 PM, Dennis McRitchie wrote: When I submit a simple job (described below) using PBS, I always get one of the following two errors: 1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104 2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=3770) The program does a uname and prints out results to standard out. The only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize. I have tried it with both openmpi v 1.1.2 and 1.1.4, built with Intel C compiler 9.1.045, and get the same results. But if I build the same versions of openmpi using gcc, the test program always works fine. The app itself is built with mpicc. It runs successfully if run from the command line with "mpiexec -n X ", where X is 1 to 8, but if I wrap it in the following qsub command file: --- #PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00 #PBS -m abe # #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout # #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr cd /home/dmcr/my_mpi/openmpi echo "About to call mpiexec" module list mpiexec -n 1 uname_test.intel echo "After call to mpiexec" it fails on any number of processors from 1 to 8, and the application segfaults. 
The complete standard error of an 8-processor job follows (note that mpiexec ran on adroit-31, but usually there is no info about adroit-31 in standard error): - Currently Loaded Modulefiles: 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32 3) intel/9.1/32/Iidb/9.1.045 Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x5 [0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b] *** End of error message *** [adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104 [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104 [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=3770) -- The complete standard error of a 1-processor job follows: -- Currently Loaded Modulefiles: 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32 3) intel/9.1/32/Iidb/9.1.045 Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2 [0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847] *** End of error message *** [adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=8840) --- Any thoughts as to why this might be failing? Thanks, Dennis Dennis McRitchie Computational Science and Engineering Support (CSES) Academic Services Department Office of Information Technology Princeton University -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI users] mac os x 10.3 openmpi won't compile hello world
Ah, are you using Open MPI 1.1.x, by chance? The wrapper compilers need to be able to find a text file in $prefix/share/openmpi/, where $prefix is the prefix you gave when you configured Open MPI. If that path is different on two hosts, the wrapper compilers can't find the text file, and things fall apart. There's supposed to be an error message from the wrapper compilers when this occurs. Unfortunately, there is a bug in the 1.1.x wrapper compilers such that they just exit with a non-zero exit status without printing that error message. Not friendly, unfortunately. Brian On Oct 18, 2006, at 9:37 AM, Dan Cardin wrote: I found my problem. I installed my openmpi onto an NFS share that resides on another machine. If I log in to the machine where the NFS share physically resides, I can compile and run the hello world. This is my first cluster build. Does anyone have a suggestion how I can keep this on an NFS share and make it work? Thank you Mac os x 10.3 cluster -dan On Tue, 2006-10-17 at 22:15 -0600, Brian Barrett wrote: On Oct 17, 2006, at 6:41 PM, Dan Cardin wrote: Hello all, I have installed openmpi on a small apple panther cluster. The install went smoothly but when I compile a program with mpicc helloworld.c -o hello No files or message are ever generated. Any help would be appreciated. What version of Open MPI are you using? Also, what is the output of: mpicc -showme Thanks, Brian
Re: [OMPI users] mac os x 10.3 openmpi won't compile hello world
On Oct 17, 2006, at 6:41 PM, Dan Cardin wrote: Hello all, I have installed openmpi on a small apple panther cluster. The install went smoothly but when I compile a program with mpicc helloworld.c -o hello No files or message are ever generated. Any help would be appreciated. What version of Open MPI are you using? Also, what is the output of: mpicc -showme Thanks, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] C --> LOGICAL
On Tue, 2006-09-26 at 14:45 -0400, Brock Palen wrote: > I have a code that requires that it be compiled (with the pgi > compilers) with the -i8 flag. > > From the pgf90 man page: > > -i8  Treat default INTEGER and LOGICAL variables as eight bytes. > For operations involving integers, use 64 bits for computations. > > But I get the following from configure: > > checking size of Fortran 77 LOGICAL... 8 > checking for C type corresponding to LOGICAL... not found > configure: WARNING: *** Did not find corresponding C type > configure: error: Cannot continue > > > This is with openmpi-1.1.1 > I also have the same problem with openmpi-1.1.2rc1 > > The application is vasp, you can see the notes on the problem here: > http://cms.mpi.univie.ac.at/vasp-forum/forum_viewtopic.php?2.1255 It looks like we assumed that LOGICAL would never be larger than an int, which clearly isn't the case when that setting is used. I've filed a bug in our tracker about the issue and should have a fix committed this evening. It should be able to make the 1.1.2 release, but I can't promise at this point. Thanks, Brian
Re: [OMPI users] Dynamic loading of libmpi.dylib on Mac OS X
Brian - Sorry for the slow reply, I've been on vacation for a while and am still digging out from all the back e-mail. Anyway, that makes sense. Open MPI's default build mode is to dlopen() the driver components needed for things like the various interconnects and process starters we support. Since libmpi was dlopen()'ed with RTLD_LOCAL, the symbols needed in libmpi were not available to those components when OMPI tried to dlopen() them. I was a little confused initially by why the symbols in our other support libraries were found (everything seemed to work until the MPI level -- the run-time stuff initialized properly). But apparently this makes sense as well, as there's something about how shared libraries that are dependencies of the dlopen()'ed object are loaded that puts those symbols in the global namespace. One solution, of course, is to specify RTLD_GLOBAL when opening libmpi. The other possibility is to build Open MPI with the --disable-dlopen option, which will cause all the components to be built into libmpi, avoiding the whole namespacing issue. We'll add some information to the FAQ on this issue. Thanks for bringing it to our attention. Brian On Fri, 2006-09-08 at 10:51 -0600, Brian E Granger wrote: > Brian, > > I think I have figured this one out. By default ctypes calls dlopen > with mode = RTLD_LOCAL (except on Mac OS 10.3). When I instruct > ctypes to set mode = RTLD_GLOBAL it works fine on 10.4. Based on the > dlopen man page: > > RTLD_GLOBAL  Symbols exported from this image (dynamic library or bundle) > will be available to any images built with the -flat_namespace option to > ld(1) or to calls to dlsym() when using a special handle. > > RTLD_LOCAL   Symbols exported from this image (dynamic library or bundle) > are generally hidden and only available to dlsym() when directly using > the handle returned by this call to dlopen(). 
If neither RTLD_GLOBAL nor RTLD_LOCAL is > specified, the default is RTLD_GLOBAL. > > This behavior makes sense. Thus the following works on 10.4: > > from ctypes import * > mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL) > f = pythonapi.Py_GetArgcArgv > argc = c_int() > argv = POINTER(c_char_p)() > f(byref(argc), byref(argv)) > mpi.MPI_Init(byref(argc), byref(argv)) > mpi.MPI_Finalize() > > So I am not sure this is a defect in OpenMPI, but it sure is a subtle > aspect of using it. I will probably document this somewhere in the > package I am creating. > > Thanks > > Brian > > On Sep 6, 2006, at 9:00 AM, Brian Barrett wrote: > > Thanks for the information. I've filed a bug in our bug tracker on > > this issue. It appears that for some reason, when libmpi is dlopened() > > by python, the objects it then dlopens are not able to find symbols in > > libmpi. It will probably take me a bit of time to track this issue > > down, but you will be notified by the bug tracker when the issue is > > resolved. > > > > Brian > > > > On Thu, 2006-08-31 at 17:27 -0600, Brian E Granger wrote: > > > Brian, > > > > > > Sure, but my example will probably seem a little odd. I am > > > calling the mpi shared library from Python using ctypes. > > > > > > The dependencies for doing things this way are: > > > > > > 1. Python built with --enable-shared > > > 2. The ctypes python package > > > 3. 
OpenMPI configured with --enable-shared > > > > > > > > > > > > > > > Once you have this, the following python script will cause the > > > problem > > > on Mac OS X: > > > > > > > > > > > > > > > from ctypes import * > > > > > > > > > > > > > > > f = pythonapi.Py_GetArgcArgv > > > argc = c_int() > > > argv = POINTER(c_char_p)() > > > f(byref(argc), byref(argv)) > > > mpi = cdll.LoadLibrary('libmpi.0.dylib') > > > mpi.MPI_Init(byref(argc), byref(argv)) > > > > > > > > > > > > > > > I will try this on Linux as well to see if I get the same error. > > > One > > > important piece of the puzzle is
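The RTLD_GLOBAL behavior discussed in this thread can be demonstrated with any shared library, not just libmpi. The sketch below is illustrative only and uses libm on Linux as a stand-in for libmpi.0.dylib (the library name and path are assumptions, not part of the original exchange):

```python
import ctypes

# Loading with mode=RTLD_GLOBAL exports the library's symbols into the
# process-wide namespace, so anything dlopen()ed later (e.g. Open MPI's
# plugin components) can resolve symbols against it.  With the ctypes
# default of RTLD_LOCAL, those symbols stay private to this handle.
libm = ctypes.CDLL("libm.so.6", mode=ctypes.RTLD_GLOBAL)

# Calling through the handle itself works either way; the mode flag only
# matters for later loads that need these symbols.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(9.0))  # -> 3.0
```

The same two-argument CDLL call is what makes the libmpi script in this thread work: `CDLL('libmpi.0.dylib', RTLD_GLOBAL)`.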
Re: [OMPI users] linux alpha ev6 openmpi 1.1.1
On Sep 8, 2006, at 8:18 PM, Nuno Sucena Almeida wrote: Hello, while trying to compile openmpi 1.1.1 on a linux alpha ev6 (tsunami) gentoo system, I had to add the following lines to config/ompi_config_asm.m4: alphaev6-*) ompi_cv_asm_arch="ALPHA" OMPI_ASM_SUPPORT_64BIT=1 OMPI_GCC_INLINE_ASSIGN='"bis zero,zero,%0" : "="(ret)' ;; since my system was being detected as such, and not alpha-*. I forgot to mention -- I've committed a fix for this part of the issue in the SVN trunk. It should eventually be migrated into the branch for the 1.2 release once we sort out the other Alpha issues. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] linux alpha ev6 openmpi 1.1.1
On Sep 8, 2006, at 8:18 PM, Nuno Sucena Almeida wrote: The other issue is the one described in http://www.mail-archive.com/debian-bugs-dist@lists.debian.org/ msg229867.html (...) gcc -O3 -DNDEBUG -fno-strict-aliasing -pthread -o .libs/opal_wrapper opal_wrapper.o -Wl,--export-dynamic ../../../opal/.libs/libopal.so -ldl -lnsl -lutil -lm -Wl,--rpath -Wl,/opt/openmpi-1.1.1/lib ../../../opal/.libs/libopal.so: undefined reference to `opal_atomic_cmpset_acq_32' ../../../opal/.libs/libopal.so: undefined reference to `opal_atomic_cmpset_32' (...) Can you send the config.log file generated by Open MPI's configure, with your bis $31,$31 change? Thanks, Brian
Re: [OMPI users] Probable MPI2 bug?
On Mon, 2006-09-04 at 11:01 -0700, Tom Rosmond wrote: > Attached is some error output from my tests of 1-sided message > passing, plus my info file. Below are two copies of a simple fortran > subroutine that mimics mpi_allgatherv using mpi-get calls. The top > version fails, the bottom runs OK. It seems clear from these > examples, plus the 'self_send' phrases in the error output, that there > is a problem internally with a processor sending data to itself. I > know that your 'mpi_get' implementation is simply a wrapper around > 'send/recv' calls, so clearly this shouldn't happen. However, the > problem does not happen in all cases; I tried to duplicate it in a > simple stand-alone program with mpi_get calls and was unable to make > it fail. Go figure. That is an odd failure and at first glance it does look like there is something wrong with our one-sided implementation. I've filed a bug in our tracker about the issue and you should get updates on the ticket as we work on the issue. Thanks, Brian
Re: [OMPI users] Dynamic loading of libmpi.dylib on Mac OS X
Thanks for the information. I've filed a bug in our bug tracker on this issue. It appears that for some reason, when libmpi is dlopened() by python, the objects it then dlopens are not able to find symbols in libmpi. It will probably take me a bit of time to track this issue down, but you will be notified by the bug tracker when the issue is resolved. Brian On Thu, 2006-08-31 at 17:27 -0600, Brian E Granger wrote: > Brian, > > Sure, but my example will probably seem a little odd. I am calling > the mpi shared library from Python using ctypes. > > The dependencies for doing things this way are: > > 1. Python built with --enable-shared > 2. The ctypes python package > 3. OpenMPI configured with --enable-shared > > Once you have this, the following python script will cause the problem > on Mac OS X: > > from ctypes import * > > f = pythonapi.Py_GetArgcArgv > argc = c_int() > argv = POINTER(c_char_p)() > f(byref(argc), byref(argv)) > mpi = cdll.LoadLibrary('libmpi.0.dylib') > mpi.MPI_Init(byref(argc), byref(argv)) > > I will try this on Linux as well to see if I get the same error. One > important piece of the puzzle is that if I configure openmpi with the > --disable-dlopen flag, I don't have the problem. I will do some > further testing on different systems and get back to you. > > Thanks for looking at this. > > Brian > > On Aug 31, 2006, at 4:20 PM, Brian Barrett wrote: > > This is quite strange, and we're having some trouble figuring out > > exactly why the opening is failing. Do you have a (somewhat?) easy > > list of instructions so that I can try to reproduce this? > > > > Thanks, > > > > Brian > > > > On Tue, 2006-08-22 at 20:58 -0600, Brian Granger wrote: > > > HI, > > > > > > I am trying to dynamically load mpi.dylib on Mac OS X (using > > > ctypes in python). It seems to > > > load fine, but when I call MPI_Init(), I get the error shown > > > below.
I > > > can call other functions just fine (like MPI_Initialized). > > > > > > > > > Also, my mpi install is seeing all the needed components and I can > > > load them myself without error using dlopen. I can also compile > > > and > > > run mpi programs and I build openmpi with shared library support. > > > > > > > > > [localhost:00973] mca: base: component_find: unable to open: > > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so, > > > 9): > > > Symbol not found: _ompi_free_list_item_t_class > > > Referenced from: > > > /usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so > > > Expected in: flat namespace > > > (ignored) > > > [localhost:00973] mca: base: component_find: unable to open: > > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so, 9): > > > Symbol > > > not found: _ompi_free_list_item_t_class > > > Referenced > > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so > > > Expected in: flat namespace > > > (ignored) > > > [localhost:00973] mca: base: component_find: unable to open: > > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so, 9): > > > Symbol > > > not found: _mca_allocator_base_components > > > Referenced > > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so > > > Expected in: flat namespace > > > (ignored) > > > [localhost:00973] mca: base: component_find: unable to open: > > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so, 9): > > > Symbol > > > not found: _ompi_free_list_item_t_class > > > Referenced > > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so > > > Expected in: flat namespace > > > (ignored) > > > [localhost:00973] mca: base: component_find: unable to open: > > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so, 9): > > > Symbol not found: _mca_pml > > > Referenced > > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so > > > Expected in: flat namespace > > > (ignored) > > > [localhost:00973] mca: base: component_find: unable to 
open: > > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so, > > > 9): > > > Symbol not found: _ompi_mpi_op_max > > > Referenced > > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_
Re: [OMPI users] question about passing MPI communicator
Your example is pretty close to spot on. You want to convert the Fortran handle (integer) into a C handle (something else). Then use the C handle to call C functions. The one thing of note is that you should use the type MPI_Fint instead of int for the type of the Fortran handles. So your parallel_info function's prototype would be: void parallel_info_(int *rank, MPI_Fint *comm); Hope this helps, Brian On Fri, 2006-09-01 at 09:26 -0400, Wang, Peng wrote: > Hello, I am wondering how the passing of an MPI communicator > from Fortran to C is handled in openmpi. Assuming I have a Fortran 90 subroutine > calling a C function passing MPI_COMM_WORLD in, in the C function, do I > need to first do MPI_Comm_f2c > to convert to an MPI handle, then use that handle afterward? Or is there > any better way to do this? Here is some test code: > > Fortran 90: > > program test1 > > include 'mpif.h' > > integer myrank,ierr > > call MPI_Init(ierr) > > call parallel_info(myrank,MPI_COMM_WORLD) > write(*,*) 'hello, I am process #',myrank > > call MPI_Finalize(ierr) > > end program test1 > > C: > > #include <mpi.h> > > void parallel_info_(int * rank, int * comm) > { > MPI_Comm ccomm; > > ccomm=MPI_Comm_f2c(*comm); > MPI_Comm_rank(ccomm, rank); > } > > void parallel_info(int * rank, int * comm) > { > MPI_Comm ccomm; > > ccomm=MPI_Comm_f2c(*comm); > > MPI_Comm_rank(ccomm, rank); > } > > Thanks, > Peng
Re: [OMPI users] Dynamic loading of libmpi.dylib on Mac OS X
This is quite strange, and we're having some trouble figuring out exactly why the opening is failing. Do you have a (somewhat?) easy list of instructions so that I can try to reproduce this? Thanks, Brian On Tue, 2006-08-22 at 20:58 -0600, Brian Granger wrote: > HI, > > I am trying to dynamically load mpi.dylib on Mac OS X (using ctypes in > python). It seems to > load fine, but when I call MPI_Init(), I get the error shown below. I > can call other functions just fine (like MPI_Initialized). > > Also, my mpi install is seeing all the needed components and I can > load them myself without error using dlopen. I can also compile and > run mpi programs and I build openmpi with shared library support. > > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so, 9): > Symbol not found: _ompi_free_list_item_t_class > Referenced from: > /usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so, 9): Symbol > not found: _ompi_free_list_item_t_class > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so, 9): Symbol > not found: _mca_allocator_base_components > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so, 9): Symbol > not found: _ompi_free_list_item_t_class > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > 
dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so, 9): > Symbol not found: _mca_pml > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so, 9): > Symbol not found: _ompi_mpi_op_max > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_sm.so, 9): Symbol > not found: _ompi_mpi_local_convertor > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_sm.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_tuned.so, 9): > Symbol not found: _mca_pml > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_tuned.so > Expected in: flat namespace > (ignored) > [localhost:00973] mca: base: component_find: unable to open: > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_osc_pt2pt.so, 9): Symbol > not found: _ompi_request_t_class > Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_osc_pt2pt.so > Expected in: flat namespace > (ignored) > -- > No available pml components were found! > > This means that there are no components of this type installed on your > system or all the components reported that they could not be used. > > This is a fatal error; your MPI process is likely to abort. Check the > output of the "ompi_info" command and ensure that components of this > type are available on your system. You may also wish to check the > value of the "component_path" MCA parameter and ensure that it has at > least one directory that contains valid MCA components. > > -- > [localhost:00973] PML ob1 cannot be selected > > Any Ideas? 
> > Thanks > > Brian Granger > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] little endian - big endian conversion
Correct. With the exception of MPI_BOOL / MPI_LOGICAL, we do not handle sending datatypes that are different sizes on the sender and receiver. So sending an MPI_LONG from a 32 bit machine to a 64 bit machine will not work correctly. Brian On Wed, 2006-08-30 at 10:33 -0400, Jeff Squyres wrote: > Oops! My mistake -- thanks for the correction... > > I am still correct in thinking that we do not properly handle *size* > endianness, right? Meaning that if sizeof(long) on one node is different > than sizeof(long) on another, running an MPI job across those two nodes will > cause Bad Things to occur if you try to exchange MPI_LONGs between the MPI > processes, right? (and similar for other datatypes that are different > sizes) > > > On 8/30/06 9:38 AM, "Brian Barrett" <brbar...@open-mpi.org> wrote: > > > Actually, Jeff is incorrect. As of Open MPI 1.1, we do support endian > > conversion between peers. It has not been as well tested as the rest of > > the code base, but it should work. Please let us know if you have any > > issues with that mode and we'll work to resolve them. > > > > Brian > > > > > > On Wed, 2006-08-30 at 06:36 -0400, Jeff Squyres wrote: > >> Open MPI does not yet support endian conversion between peers in a single > >> MPI job. It's on the to-do list, but it's been a lower priority than some > >> other features and issues. > >> > >> > >> > >> On 8/30/06 4:12 AM, "Eng. A.A. Isola" <alfonso.is...@tin.it> wrote: > >> > >>> Hi everybody, > >>> > >>> I have one doubt in Open MPI. Suppose > >>> I > >>> run the application on different systems with different data formats > >>> (little endian & big endian)... Will Open MPI convert from little > >>> endian > >>> to big endian (if it is sending data from, e.g., a Linux PC to > >>> Solaris)? > >>> > >>> If it isn't able to do this, will it be able to in > >>> future releases? (is it on your to-do list?) > >>> > >>> Thanking you for your response, > >>> > >>> A.A.
Isola
Re: [OMPI users] Testing 1-sided MPI again
On Tue, 2006-08-15 at 14:24 -0700, Tom Rosmond wrote: > I am continuing to test the MPI-2 features of 1.1, and have run into > some puzzling behavior. I wrote a simple F90 program to test 'mpi_put' > and 'mpi_get' on a coordinate transformation problem on a two dual-core > processor Opteron workstation running the PGI 6.1 compiler. The program > runs correctly for a variety of problem sizes and processor counts. > > However, my main interest is a large global weather prediction model > that has been running in production with 1-sided message passing on an > SGI Origin 3000 for several years. This code does not run with OMPI > 1-sided message passing. I have investigated the difference between this > code and the test program and noticed a critical difference. Both > programs call 'mpi_win_create' to create an integer 'handle' to the RMA > window used by 'mpi_put' and 'mpi_get'. In the test program this > 'handle' returns with a value of '1', but in the large code the 'handle' > returns with value '0'. Subsequent synchronization calls to > 'mpi_win_fence' succeed in the small program (error status eq 0), while > in the large code they fail (error status ne 0), and the transfers fail > also (no data is passed). > > Do you have any suggestions on what could cause this difference in > behavior between the two codes, specifically why the 'handles' have > different values? Are there any diagnostics I could produce that would > provide information? The difference in handle values is irrelevant to the failures you are seeing. Our handle 0 is MPI_WIN_NULL, so you should never see that returned from MPI_WIN_CREATE. Unfortunately, when I wrote the one-sided implementation, I didn't add useful debugging messages the user can enable. I can add some and make a tarball, if you would be willing to give it a try. What error messages are coming out of the large code? 
By the way, just to make sure your expectations are set correctly, Open MPI's one-sided performance in v1.1 and v1.2 is bad, as it's implemented over the point-to-point engine. You're not going to get Origin-like performance out of the current implementation. Brian
Re: [OMPI users] pvfs2 and romio
On Mon, 2006-08-14 at 10:57 -0400, Brock Palen wrote: > We will be evaluating pvfs2 (www.pvfs.org) in the future. Are there > any special considerations to take to get romio support with openmpi > with pvfs2? > I have the following from ompi_info > > MCA io: romio (MCA v1.0, API v1.0, Component v1.1) > > Does OMPI have to be built pointing at the pvfs2 libs? If so, how? I > remember there was a strange way of needing to do this with lam. > > Guidance is much appreciated. Yeah, some minor trickery is required. I believe you can just do something like: ./configure --with-file-system=panfs+nfs+ufs but it's probably safest to do: ./configure --with-io-romio-flags="--with-file-system=panfs+nfs+ufs" Changing the filesystems you want to include, of course. Brian
Re: [OMPI users] Compiling MPI with pgf90
On Mon, 2006-07-31 at 13:12 -0400, James McManus wrote: > I'm trying to compile MPI with pgf90. I use the following configure > settings: > > ./configure --prefix=/usr/local/mpi F90=pgf90 F77=pgf77 > > However, the compiler is set to gfortran: > > *** Fortran 90/95 compiler > checking for gfortran... gfortran > checking whether we are using the GNU Fortran compiler... yes > checking whether gfortran accepts -g... yes > checking if Fortran compiler works... yes > checking whether pgf90 and gfortran compilers are compatible... no > configure: WARNING: *** Fortran 77 and Fortran 90 compilers are not link > compatible > > I do have gfortran, with its binary in /usr/bin/gfortran. However, I > have removed all path information to it, in .bash_profile and .bashrc, > and have replaced it with path information to pgf90. MPI is still > configured with gfortran as the FC compiler. > > I am using an evaluation version of the pgi compilers. Try: ./configure --prefix=/usr/local/mpi FC=pgf90 F77=pgf77 Autoconf looks at the FC variable to choose the modern Fortran compiler, not the F90 variable. Brian
Re: [OMPI users] bug report: wrong reference in mpi.h to mpicxx.h
On Wed, 2006-07-19 at 14:57 +0200, Paul Heinzlreiter wrote: > After that I tried to compile VTK (http://www.vtk.org) with MPI support > using OpenMPI. > > The compilation process issued the following error message: > > /home/ph/local/openmpi/include/mpi.h:1757:33: ompi/mpi/cxx/mpicxx.h: No > such file or directory Sven sent instructions on how to best build VTK, but I wanted to explain what you are seeing. Open MPI actually requires two -I options to use the C++ bindings: -I<prefix>/include and -I<prefix>/include/openmpi. Generally, the wrapper compilers (mpicc, mpiCC, mpif77, etc.) are used to build Open MPI applications and the -I flags are automatically added without any problem (a bunch of other flags that might be required on your system may also be added). You can use the "mpiCC -showme" option to the wrapper compiler to see exactly which flags it might add when compiling / linking / etc. Hope this helps, Brian
Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR
On Jul 16, 2006, at 4:13 PM, Eric Thibodeau wrote: Now that I have that out of the way, I'd like to know how I am supposed to compile my apps so that they can run on a homogeneous network with mpi. Here is an example: kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpicc -L/usr/X/lib -lm -lX11 -O3 mandelbrot-mpi.c -o mandelbrot-mpi kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun --hostfile hostlist -np 3 ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi -- Could not execute the executable "/home/kyron/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi": Exec format error This could mean that your PATH or executable name is wrong, or that you do not have the necessary permissions. Please ensure that the executable is able to be found and executed. -- As can be seen with the uname -a that was run previously, I have 2 "local nodes" on the x86_64 and two i686 nodes. I tried to find examples in the docs on how to compile applications correctly for such a setup without compromising performance, but I came up short of an example. From the sound of it, you have a heterogeneous configuration -- some nodes are x86_64 and some are x86. Because of this, you either have to compile your application twice, once for each platform, or compile your application for the lowest common denominator. My guess would be that it is easier and more foolproof if you compiled everything in 32 bit mode. If you run in a mixed mode, using application schemas (see the mpirun man page) will be the easiest way to make things work. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
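For the mixed-mode route, an application schema is just a text file with one mpirun-style line per application context. A hedged sketch (the file name, host names, and per-architecture binary names below are ours, not from the thread; check the mpirun man page for the exact syntax of your version):

```
# appfile: one application context per line
# 32-bit binary on the i686 nodes, 64-bit binary on the x86_64 head
-np 2 -host node1,node2 /home/kyron/mandelbrot-mpi.i686
-np 2 -host headless /home/kyron/mandelbrot-mpi.x86_64
```

This would then be launched with something like `mpirun --app appfile`, letting each architecture run a binary compiled for it.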
Re: [OMPI users] Problem compiling OMPI with Intel C compiler on Mac OS X
On Jul 14, 2006, at 10:35 AM, Warner Yuen wrote: I'm having trouble compiling Open MPI with Mac OS X v10.4.6 with the Intel C compiler. Here are some details: 1) I upgraded to the latest versions of Xcode including GCC 4.0.1 build 5341. 2) I installed the latest Intel update (9.1.027) as well. 3) Open MPI compiles fine using GCC and IFORT. 4) Open MPI fails with ICC and IFORT. 5) MPICH-2.1.0.3 compiles fine with ICC and IFORT (I just had to find out if my compiler worked... sorry!) 6) My Open MPI configuration was using: ./configure --with-rsh=/usr/bin/ssh --prefix=/usr/local/ompi11icc 7) Should I have included my config.log? It looks like there are some problems with GNU libtool's support for the Intel compiler on OS X. I can't tell if it's a problem with the Intel compiler or libtool. A quick fix is to build Open MPI with static libraries rather than shared libraries. You can do this by adding: --disable-shared --enable-static to the configure line for Open MPI (if you're building in the same directory where you've already run configure, you want to run make clean before building again). I unfortunately don't have access to an Intel Mac machine with the Intel compilers installed, so I can't verify this issue. I believe one of the other developers does have such a configuration, so I'll ask him when he's available (might be a week or two -- I believe he's on vacation). This issue seems to be unique to your exact configuration -- it doesn't happen with GCC on the Intel Mac nor on Linux with the Intel compilers. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR
On Jul 15, 2006, at 2:58 PM, Eric Thibodeau wrote: But, for some reason, on the Athlon node (in their image on the server I should say) OpenMPI still doesn't seem to be built correctly since it crashes as follows: kyron@node0 ~ $ mpirun -np 1 uptime Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:(nil) [0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f] [1] func:[0xe440] [2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1+0x1d7) [0xb7fa0227] [3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init+0x23) [0xb7fa3683] [4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f) [0xb7f9ff7f] [5] func:mpirun(orterun+0x255) [0x804a015] [6] func:mpirun(main+0x22) [0x8049db6] [7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b] [8] func:mpirun [0x8049d11] *** End of error message *** Segmentation fault The crash happens both in the chrooted env and on the nodes. I configured both systems to have Linux and POSIX threads, though I see openmpi is calling the POSIX version (a message on the mailing list had hinted on keeping the Linux threads around... I have to anyways since some apps like Matlab extensions still depend on this...). The following is the output for the libc info. That's interesting... We regularly build Open MPI on 32 bit Linux machines (and in 32 bit mode on Opteron machines) without too much issue. It looks like we're jumping into a NULL pointer, which generally means that an ORTE framework failed to initialize itself properly. It would be useful if you could rebuild with debugging symbols (just add -g to CFLAGS when configuring) and run mpirun in gdb. If we can determine where the error is occurring, that would definitely help in debugging your problem. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Openmpi, LSF and GM
On Jul 16, 2006, at 6:12 AM, Keith Refson wrote: The compile of openmpi 1.1 was without problems and appears to have correctly built the GM btl. $ ompi_info -a | egrep "\bgm\b|_gm_" MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1) MCA btl: gm (MCA v1.0, API v1.0, Component v1.1) Ok, so GM support is definitely built into your build of Open MPI, which is a good start. However I have been unable to set up a parallel run which uses gm. If I start a run using the openmpi mpirun command, the program executes correctly in parallel. However the timings appear to suggest that it is using tcp, and the command executed on the node looks like: orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename scarf-cn001.rl.ac.uk --universe cse0...@scarf-cn001.rl.ac.uk:default-universe-28588 --nsreplica "0.0.0;tcp://192.168.1.1:52491;tcp://130.246.142.1:52491" --gprreplica "0.0.0;tcp://192.168.1.1:52491;t Right, orted is just a starter for the MPI processes -- the information on interconnects to use and that kind of stuff is passed through the out-of-band communication mechanism. orted doesn't really care which interconnect the MPI process is going to use, so we don't pass it on the command line. Furthermore, if I attempt to start with the mpirun arguments "--mca btl gm,self,^tcp" the run aborts at the MPI_INIT call. Q1: Is there anything else I have to do to get openmpi to use gm? The command line you want is: mpirun -np X -mca btl gm,sm,self If this causes an error during MPI_INIT or early in your application, it would be useful to see all the output from the parallel run. That likely indicates that there is something wrong with the initialization of the interconnect. Q2: Is there any way of diagnosing which btl is actually being used and why?
Neither the "-v" option to mpirun, "-mca btl btl_base_verbose", nor "-mca btl btl_gm_debug=1" makes any difference or produces any more output. The arguments you want would look like: mpirun -np X -mca btl gm,sm,self -mca btl_base_verbose 1 -mca btl_gm_debug 1 Q3: Is there a way to make openmpi work with the LSF commands? So far I have constructed a hostfile from the LSF environment variable LSB_HOSTS and used the openmpi mpirun command to start the parallel executable. Currently, we do not have tight LSF integration for Open MPI, like we do for PBS, SLURM, and BProc. This is mainly because the only LSF machines the development team regularly uses are BProc machines, which do not use the traditional startup and allocation mechanisms of LSF. I believe it is on our feature request list, but I also don't believe we have a timeline for implementation. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] debugging with mpirun
On Jul 6, 2006, at 8:27 PM, Manal Helal wrote: I am trying to debug my mpi program, but printf debugging is not doing much, and I need something that can show me variable values, and which line of execution (and where it is called from), something like gdb with mpi, is there anything like that? There are a couple of options. The first (works best with ssh, but can be made to work with most starting mechanisms) is to start a bunch of gdb sessions in xterms. Something like: mpirun -np XX -d xterm -e gdb The '-d' option is necessary so that mpirun doesn't close the ssh sessions, severing its X11 forwarding channel. This has the advantage of being free, but has the disadvantage of being a major pain. A better option is to try a real parallel debugger, such as TotalView or Portland Group's PGDBG. This has the advantage of working very well (I use TotalView whenever possible), but has the disadvantage of generally not being free. Hope this helps, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] MPI_Recv, is it possible to switch on/off aggresive mode during runtime?
On Jul 5, 2006, at 8:54 AM, Marcin Skoczylas wrote: I saw some posts ago almost the same question as I have, but it didn't give me a satisfactory answer. I have a setup like this: GUI program on some machine (e.g. a laptop) Head listening on a tcpip socket for commands from the GUI. Workers waiting for commands from the Head / processing the data. And now it's problematic. For passing the commands from the Head I'm using: while(true) { MPI_Recv... do whatever head said (process small portion of the data, return result to head, wait for another commands) } So in the idle time workers are stuck in MPI_Recv and have 100% CPU usage, even if they are just waiting for the commands from the Head. Normally, I would not prefer to have this situation as I sometimes have to share the cluster with others. I would prefer not to stop the whole mpi program, but just go into 'idle' mode, and thus make it run again soon. Also I would like to have this aggressive MPI_Recv approach switched on when I'm alone on the cluster. So is it possible somehow to switch this mode on/off during runtime? Thank you in advance! Currently, there is not a way to do this. Obviously, there's not going to be a way that is portable (i.e., compiles with MPICH), but it may be possible to add this in the future. It likely won't happen for the v1.1 release series, and I can't really speak for releases past that at this point. I'll file an enhancement request in our internal bug tracker, and add you to the list of people to be notified when the ticket is updated. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
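Until such a switch exists, the usual application-level workaround is to replace the blocking MPI_Recv with a polling loop that backs off while idle. The sketch below shows just the backoff pattern in plain C; in a real worker the check function would wrap MPI_Iprobe (or MPI_Test on a posted MPI_Irecv) rather than our toy countdown, and the names here are ours, not an Open MPI API:

```c
#include <unistd.h>

/* Generic "polite" polling loop: test a condition, and back off with
   short sleeps while it stays false.  Returns 0 once the condition holds. */
int poll_with_backoff(int (*check_fn)(void *), void *arg,
                      unsigned max_sleep_us)
{
    unsigned sleep_us = 1;
    while (!check_fn(arg)) {
        usleep(sleep_us);          /* yield the CPU instead of spinning */
        if (sleep_us < max_sleep_us)
            sleep_us *= 2;         /* exponential backoff, capped */
    }
    return 0;
}

/* Toy stand-in for "a message has arrived": true after a few polls. */
int countdown_done(void *arg)
{
    int *remaining = (int *)arg;
    return (*remaining)-- <= 0;
}
```

A worker would call poll_with_backoff with a check function that does an MPI_Iprobe for the head's tag and then issue the matching MPI_Recv. Raising max_sleep_us makes the worker nicer to other cluster users at the cost of a little latency, which is roughly the run-time on/off switch Marcin is asking for.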
Re: [OMPI users] error in running openmpi on remote node
On Jul 4, 2006, at 1:53 AM, Chengwen Chen wrote: Dear openmpi users, I am using openmpi-1.0.2 on Redhat linux. I can successfully run mpirun on a single PC with 2 np, but it fails on a remote node. Can you give me some advice? Thank you very much in advance. [say@wolf45 tmp]$ mpirun -np 2 /tmp/test.x [say@wolf45 tmp]$ mpirun -np 2 --host wolf45,wolf46 /tmp/test.x say@wolf46's password: orted: Command not found. [wolf45:11357] ERROR: A daemon on node wolf46 failed to start as expected. [wolf45:11357] ERROR: There may be more information available from [wolf45:11357] ERROR: the remote shell (see above). [wolf45:11357] ERROR: The daemon exited unexpectedly with status 1. Kefeng is correct that you should set up your ssh keys so that you aren't prompted for a password, but that isn't the cause of your failure. The problem appears to be that orted (one of the Open MPI commands) is not in your path on the remote node. You should take a look at one of the other FAQ sections on the setup required for Open MPI in an rsh/ssh type environment. http://www.open-mpi.org/faq/?category=running Hope this helps, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
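The FAQ's fix boils down to making the install prefix visible to non-interactive logins on every node. A sketch, assuming an install under /opt/openmpi (substitute your real prefix) and bash as the login shell:

```shell
# In ~/.bashrc on each node -- non-interactive ssh logins typically do not
# read ~/.bash_profile, which is the usual trap:
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

# Then verify from the head node that a non-interactive shell finds orted:
ssh wolf46 which orted
```

If the ssh test prints the orted path, mpirun should be able to start its daemon on that node.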
Re: [OMPI users] Compilation problem
On Jul 3, 2006, at 11:49 PM, Samuel Wieczorek wrote: Hi, I tried to install Open MPI on Mac OS X (10.4), but the compilation step failed due to undefined symbols. Here are the compressed output files. Any ideas to help me? This is very odd, but it appears that /usr/bin/find isn't executable on your machine. This results in the libraries in Open MPI not being built properly. There were many lines like this in your log file: ../libtool: line 1: /usr/bin/find: cannot execute binary file I'm not sure how this could happen, but fixing your 'find' command should fix the Open MPI build. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Testing one-sided message passing with 1.1
On Jun 29, 2006, at 5:23 PM, Tom Rosmond wrote: I am testing the one-sided message passing (mpi_put, mpi_get) that is now supported in the 1.1 release. It seems to work OK for some simple test codes, but when I run my big application, it fails. This application is a large weather model that runs operationally on the SGI Origin 3000, using the native one-sided message passing that has been supported on that system for many years. At least on that architecture, the code always runs correctly for processor numbers up to 480. On the O3K a requirement for the one-sided communication to work correctly is to use 'mpi_win_create' to define the RMA 'windows' in symmetric locations on all processors, i.e. the same 'place' in memory on each processor. This can be done with static memory, i.e., in common; or on the 'symmetric heap', which is defined via environment variables. In my application the latter method is used. I define several of these 'windows' on the symmetric heap, each with a unique handle. Before I spend my time trying to diagnose this problem further, I need as much information about the OpenMPI one-sided implementation as available. Do you have a similar requirement or criteria for symmetric memory for the RMA windows? Are there runtime parameters that I should be using that are unique to one-sided message passing with OpenMPI? Any other information will certainly be appreciated. There are no requirements on the one-sided windows in terms of buffer pointers. Our current implementation is over point-to-point so it's kinda slow compared to real one-sided implementations, but has the advantage of working with arbitrary window locations. There are only two parameters to tweak in the current implementation: osc_pt2pt_eager_send: If this is 1, we try to start progressing the put/get before the synchronization point. The default is 0. This is not well tested, so I recommend leaving it 0. It's safer at this point.
osc_pt2pt_fence_sync_method: This one might be worth playing with, but I doubt it could cause your problems. This is the collective we use to implement MPI_FENCE. Options are reduce_scatter (default), allreduce, alltoall. Again, I doubt it will make any difference, but would be interesting to confirm that. You can set the parameters at mpirun time: mpirun -np XX -mca osc_pt2pt_fence_sync_method reduce_scatter ./test_code Our one-sided implementation has not been as well tested as the rest of the code (as this is our first release with one-sided support). If you can share any details on your application or, better yet, a test case, we'd appreciate it. There is one known issue with the implementation. It does not support using MPI_ACCUMULATE with user-defined datatypes, even if they are entirely composed of one predefined datatype. We plan on fixing this in the near future, and an error message will be printed if this situation occurs. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] SEGV_MAPERR during execution
On Thu, 2006-06-15 at 13:46 -0700, Anoop Rajendra wrote: > I'm trying to run a simple pi program compiled using openmpi. > > My command line and error message is > > [mpiuser@Pebble-anoop ~]$ mpirun -n 2 -hostfile /opt/openmpi/openmpi/ > etc/openmpi-default-hostfile /home/mpiuser/cpi2 > Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) > Failing at addr:0x6 > *** End of error message *** > [0] func:/opt/openmpi/openmpi/lib/libopal.so.0 [0xceb6dd] > [1] func:/lib/tls/libpthread.so.0 [0xd44880] > [2] func:/opt/openmpi/openmpi/lib/openmpi/mca_btl_tcp.so [0x746d23] > [3] func:/opt/openmpi/openmpi/lib/openmpi/mca_btl_tcp.so > (mca_btl_tcp_add_procs+0x140) [0x744094] > [4] func:/opt/openmpi/openmpi/lib/openmpi/mca_bml_r2.so > (mca_bml_r2_add_procs+0x202) [0x96add6] > [5] func:/opt/openmpi/openmpi/lib/openmpi/mca_pml_ob1.so > (mca_pml_ob1_add_procs+0x85) [0x134259] > [6] func:/opt/openmpi/openmpi/lib/libmpi.so.0(ompi_mpi_init+0x385) > [0x70ca7d] > [7] func:/opt/openmpi/openmpi/lib/libmpi.so.0(MPI_Init+0x8c) [0x6fb724] > [8] func:/home/mpiuser/cpi2(main+0x56) [0x804890d] > [9] func:/lib/tls/libc.so.6(__libc_start_main+0xd3) [0xaf3e23] > [10] func:/home/mpiuser/cpi2 [0x8048819] Which version of Open MPI are you using? There were some problems with the 1.0 series when certain networking configurations were found (particularly with machines that had multiple active networks). We believe we have these fixed in the upcoming 1.1 release (there is a beta available on the download page) and in the nightly snapshots of the upcoming 1.0.3 release, which can be downloaded here: http://www.open-mpi.org/software/ompi/v1.1/ http://www.open-mpi.org/nightly/v1.0/ Let us know if these help / don't help your problem. Thanks, Brian
Re: [OMPI users] Trouble with open MPI and Slurm
On Wed, 2006-06-14 at 10:05 -0700, Doolittle, Joshua wrote: > I am running Open MPI version 1.0.2 and slurm 1.1.0. I can run slurm > jobs, and I can run mpi jobs. However, when I run a mpi job in slurm > batch mode with 4 processes, the processes do not talk to each other. > They act like they are the only process. I'm running these in slurm > batch mode. The job that I'm running is a simple mpi optimized hello > world. I'm running these on an opteron (x86_64) blade system from a > head node. Any help would be greatly appreciated. How are you running your batch job? Unlike some MPI implementations, Open MPI jobs can not be started under SLURM without the use of mpirun. You can either run mpirun under an interactive session: srun -N 4 -A mpirun -np 4 ./foobar or from a batch script: echo "mpirun -np 4 ./foobar" > foo.sh chmod +x foo.sh srun -N 4 -b foo.sh But you can't submit your application directly without mpirun. This is a feature we would like to support in the future, but there are some licensing issues (we would have to link with their GPL'ed libraries, which wouldn't work so well for us). Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] pnetcdf and OpenMPI
On Tue, 2006-06-13 at 10:51 -0700, Ken Mighell wrote: > On May 6, 2006, Dries Kimpe reported a solution to getting > pnetcdf to compile correctly with OpenMPI. > A patch was given for the file > mca/io/romio/romio/adio/common/flatten.c > Has this fix been implemented in the nightly series? Yes, this has been fixed in the v1.1 and trunk nightly builds and in the 1.1b1 beta release. It looks like we never followed up on the user's mailing list with that information, but we fixed it around the time he posted that fix. Brian
Re: [OMPI users] Why does openMPI abort processes?
On Sun, 2006-06-11 at 04:26 -0700, imran shaik wrote: > Hi, > I sometimes get this error message: > "2 additional processes aborted, possibly by Open MPI" > > Sometimes 2 processes, sometimes even more. Is it due to overload or > program error? > > Why does Open MPI actually abort a few processes? > > Can anyone explain? Generally, this is because multiple processes in your job aborted (exited with a signal or before MPI_FINALIZE) and mpirun only prints the first abort message. You can modify how many abort status messages you want to receive with the -aborted X option to mpirun, where X is the number of process abort messages you want to see. The message generally includes some information on what happened to your process. Hope this helps, Brian
Re: [OMPI users] error for open-mpi application
On Jun 7, 2006, at 8:20 AM, Weihua Li wrote: CPU: AMD Opteron Linux86-64 I used the following command to configure open-mpi-1.0.2: ./configure --prefix=/home/ytang/gdata/whli/openmpi CC=pgcc CXX=pgCC F90=gpf90 --with-openib The F90 environment variable doesn't do anything to configure. You need to set F77 (for Fortran 77) and FC (for Fortran 90). Most likely, configure picked up gfortran for your Fortran 90 compiler, causing the error messages. I know it must be something wrong with the installation of Open MPI, but I don't know where it is. I think part of it is the Fortran 90 compiler name. The rest, as Hugh mentioned, is that you really should use the wrapper compilers or look at the wrapper compiler configuration output to see what flags and libraries the Open MPI installation deems necessary. You can do this by running: mpif90 -showme Hope this helps, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Open MPI 1.0.2 and np >=64
On May 31, 2006, at 9:59 AM, Troy Telford wrote: On Tue, 30 May 2006 20:32:44 -0600, Brian Barrett <brbarret@open-mpi.org> wrote: Also, it would be useful to know more about the platform (32 / 64 bit, etc) and how you configured Open MPI. Yeah, I was a bit terse there, wasn't I? Sorry 'bout that... The system: 64 bit (Opteron, two dual-core processors) PCI Express IB HCA's Myrinet 10G (MX10G) Gigabit Ethernet configured and built with (both) GCC 3.4 and 4.0 -- didn't seem to make much difference. ./configure --enable-cxx-exceptions (Note, I use LDFLAGS and CFLAGS to point to the MX & InfiniBand headers.) Did you happen to have a chance to try to run the 1.0.3 or 1.1 nightly tarballs? I'm 50/50 on whether we've fixed these issues already. Brian
Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.
On May 31, 2006, at 12:41 PM, Justin Bronder wrote: On 5/31/06, Brian W. Barrett <brbar...@open-mpi.org> wrote: A quick workaround is to edit opal/include/opal_config.h and change the #defines for OMPI_CXX_GCC_INLINE_ASSEMBLY and OMPI_CC_GCC_INLINE_ASSEMBLY from 1 to 0. That should allow you to build Open MPI with those XL compilers. Hopefully IBM will fix this in a future version ;). Well I actually edited include/ompi_config.h and set both OMPI_C_GCC_INLINE_ASSEMBLY and OMPI_CXX_GCC_INLINE_ASSEMBLY to 0. This worked until libtool tried to create a shared library: Ah, yes, sorry about that. We reorganized our directory structure a little bit since we released 1.0.2 and I listed the new path. Of course, I've been told that directly linking with ld isn't such a great idea in the first place. Ideas? I've had some issues building shared libraries with the XL compilers. Libtool doesn't seem to do a good job of supporting them. Your best bet is to build Open MPI with static libraries. The options --enable-static --disable-shared will build static libraries instead of shared libraries. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] openmpi-1.1a7 on solaris10 opteron
On May 29, 2006, at 5:46 AM, Francoise Roch wrote: I still have a problem to select an interface with openmpi-1.1a7 on solaris opteron. I compile in 64 bit mode, with Studio11 compilers I attempted to force interface exclusion without success. This problem is critical for us because we'll soon have Infiniband interfaces for mpi traffic. roch@n15 ~/MPI > mpirun --mca btl_tcp_if_exclude bge1 -np 2 -host p15,p27 all2all Process 0 is alive on n15 Process 1 is alive on n27 [n27:05110] *** An error occurred in MPI_Barrier [n27:05110] *** on communicator MPI_COMM_WORLD [n27:05110] *** MPI_ERR_INTERN: internal error [n27:05110] *** MPI_ERRORS_ARE_FATAL (goodbye) 1 process killed (possibly by Open MPI) The code works without mca btl_tcp_if_exclude option. It took me a while to realize what is going on. Normally, btl_tcp_if_exclude excludes the lo devices so that they won't be used for the btl transport. When you explicitly set btl_tcp_if_exclude, you have to include lo0 (for Solaris) in the list or things go down hill. I can replicate Françoise's problem on his cluster. However, if I instead do: mpirun --mca btl_tcp_if_exclude bge0,lo0 -np 2 --host n15,n27 ./ ring the routing issues are resolved and everything runs to completion. I'll make sure to update the documentation for 1.1 so that this hopefully doesn't confuse too many more people. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Thread Safety
On May 26, 2006, at 11:31 PM, imran shaik wrote: I have installed the OpenMPI alpha 7 release. I created an MPI program with pthreads and ran it with just 6 processes, each thread making MPI calls concurrently with the main thread. Things work fine. I use a TCP network. Sometimes I get a strange error message, and sometimes not; in about 7 runs I get it once. But I get the output properly and the program works fine. I just wanted to know why that occurred. We just released alpha 8, which should include a fix for a problem that sounds very similar to what you are seeing. Can you try upgrading and see if that solves your problem? Another thing: I tried to get verbose output from "mpirun", but couldn't, even with "mpiexec". I was using the same command, mpirun -v -np 6 myprogram; in LAM, I used to get verbose output saying which process is running where. Here nothing happens. What is the problem? Otherwise how can I know what process is running on what node? Any suggestions? We don't currently have a good way of dealing with this. You can get lots of debugging information from the -d option to mpirun, but it would be difficult to get exactly what you are looking for from the debugging output. Your best bet would probably be to use gethostname() and MPI_Comm_rank() inside your MPI application and print the results to stdout / stderr. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.
On May 28, 2006, at 8:48 AM, Justin Bronder wrote: Brian Barrett wrote: On May 27, 2006, at 10:01 AM, Justin Bronder wrote: I've attached the required logs. Essentially the problem seems to be that the XL Compilers fail to recognize "__asm__ __volatile__" in opal/include/sys/powerpc/atomic.h when building 64-bit. I've tried using various xlc wrappers such as gxlc and xlc_r to no avail. The current log uses xlc_r_64 which is just a one line shell script forcing the -q64 option. The same works flawlessly with gcc-4.1.0. I'm using the nightly build in order to link with Torque's new shared libraries. Any help would be greatly appreciated. For reference here are a few other things that may provide more information. Can you send the config.log file generated by configure? What else is in the xlc_r_64 shell script, other than the -q64 option? I've attached the config.log, and here's what all of the *_64 scripts look like. Can you try compiling without the -qkeyword=__volatile__? It looks like XLC now has some support for GCC-style inline assembly, but it doesn't seem to be working in this case. If that doesn't work, try setting CFLAGS and CXXFLAGS to include -qnokeyword=asm, which should disable GCC inline assembly entirely. I don't have access to a linux cluster with the XL compilers, so I can't verify this. But it should work. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.
On May 27, 2006, at 10:01 AM, Justin Bronder wrote: I've attached the required logs. Essentially the problem seems to be that the XL Compilers fail to recognize "__asm__ __volatile__" in opal/include/sys/powerpc/atomic.h when building 64-bit. I've tried using various xlc wrappers such as gxlc and xlc_r to no avail. The current log uses xlc_r_64 which is just a one line shell script forcing the -q64 option. The same works flawlessly with gcc-4.1.0. I'm using the nightly build in order to link with Torque's new shared libraries. Any help would be greatly appreciated. For reference here are a few other things that may provide more information. Can you send the config.log file generated by configure? What else is in the xlc_r_64 shell script, other than the -q64 option? Thanks, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] Fortran support not installing
The last line of your make.out file was:

90 > mpi-f90-interfaces.h
***
* Compiling the mpi.f90 file may take a few minutes.
* This is quite normal -- do not be alarmed if the compile
* process seems to 'hang' at this point for several minutes.
***
g95 -I../../../ompi/include -I. -I. -c -I. -o mpi.o mpi.f90

Was there some other output not included in the file? If nothing happened for a while, don't assume it failed. That file takes a very, very long time to compile. Brian

On May 25, 2006, at 1:46 PM, Terry Reeves wrote: Hello. I tried configure with FCFLAGS=-lSystemStubs and with both FCFLAGS=-lSystemStubs and LDFLAGS=-lSystemStubs. Again it died during configure both times. I can provide configure output if desired. I also decided to try version 1.1a7. With LDFLAGS=-lSystemStubs, with or without FCFLAGS=-lSystemStubs, it gets through configure but fails in "make all". Since that seems to be progress I have included that output.

Date: Thu, 25 May 2006 10:02:08 -0400 From: "Jeff Squyres (jsquyres)" Subject: Re: [OMPI users] Fortran support not installing To: "Open MPI Users"

I actually had to set FCFLAGS, not LDFLAGS, to get arbitrary flags passed down to the Fortran tests in configure. Can you try that? (I'm not 100% sure -- you may need to specify LDFLAGS *and* FCFLAGS...?) We have made substantial improvements to the configure tests with regards to the MPI F90 bindings in the upcoming 1.1 release. Most of the work is currently off in a temporary branch in our code repository (meaning that it doesn't show up yet in the nightly trunk tarballs), but it will hopefully be brought back to the trunk soon.

Terry Reeves 2-1013 - reeve...@osu.edu Computing Services Office of Information Technology The Ohio State University ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI and OpenIB
On May 11, 2006, at 10:10 PM, Gurhan Ozen wrote: Brian, Thanks for the very clear answers. I did change my code to include fflush() calls after printf() ... And I did try with --mca btl ib,self . Interesting result: with --mca btl ib,self, hello_world works fine, but broadcast hangs after I enter the vector length. At any rate though, with --mca btl ib,self it looks like the traffic goes over the ethernet device .. I couldn't find any documentation on the "self" argument of mca; does it mean to explore alternatives if the desired btl (in this case ib) doesn't work? No, self is the loopback device, for sending messages to self. It is never used for message routing outside of the current process, but is required for almost all transports, as send to self can be a sticky issue. You are specifying openib, not ib, as the argument to mpirun, correct? Either way, I'm not really sure how data could be going over TCP -- the TCP transport would definitely be disabled in that case. At this point, I don't know enough about the Open IB driver to be of help -- one of the other developers is going to have to jump in and provide assistance. Speaking of documentation, it looks like Open MPI didn't come with a man page for mpirun; I thought I had seen in one of the slides of the Open MPI developer's workshop that it did have mpirun.1. Do I need to check it out from svn? That's one option, or wait for us to release Open MPI 1.0.3 / 1.1. Brian On 5/11/06, Brian Barrett <brbar...@open-mpi.org> wrote: On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote: My ultimate goal is to get Open MPI working with the openIB stack. First, I had installed lam-mpi; I know it doesn't have support for openIB but it's still relevant to some of the questions I will ask.. Here is the set up I have: Yes, keep in mind throughout that while Open MPI does support MVAPI, LAM/MPI will fall back to using IP over IB for communication. I have two machines, pe830-01 and pe830-02 .. Both have an ethernet interface and an HCA interface.
The IP addresses follow:

             eth0         ib0
pe830-01     10.12.4.32   192.168.1.32
pe830-02     10.12.4.34   192.168.1.34

So this has worked even though the lamhosts file is configured to use the ib0 interfaces. I further verified with the tcpdump command that none of this went to eth0 .. Anyhow, if I change the lamhosts file to use the eth0 IPs, things work just the same with no issues. And in that case I see some traffic on eth0 with tcpdump. Ok, so at least it sounds like your TCP network is sanely configured. Now, when I installed and used Open MPI, things didn't work as easily.. Here is what happens. After recompiling the sources with the mpicc that comes with open-mpi:

$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32 /path/to/hello_world
Hello, world, I am 0 of 2 and this is on: pe830-02.
Hello, world, I am 1 of 2 and this is on: pe830-01.

So far so good, using eth0 interfaces.. hello_world works just fine. Now, when I try the broadcast program: In reality, you always need to include two BTLs when specifying. You need both the one you want to use (mvapi,openib,tcp,etc.) and "self". You can run into issues otherwise.

$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32 /path/to/broadcast

It just hangs there; it doesn't prompt me the "Enter the vector length:" string. So I just enter a number anyway, since I know the behavior of the program:

10
Enter the vector length: i am: 0 , and i have 5 vector elements i am: 1 , and i have 5 vector elements [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00 [0] 10.00

So, that's the first bump with the openmpi.. Now, if I try to use the ib0 interfaces instead of the eth0 ones, I get: I'm actually surprised this worked in LAM/MPI, to be honest.
There should be an fflush() after the printf() to make sure that the output is actually sent out of the application.

$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl openib -np 2 --host 192.168.1.34,192.168.1.32 /path/to/hello_world
--
No available btl components were found!
This means that there are no components of this type installed on your system or all the components reported that they could not be used.
This is a fatal error; your MPI process is likely to abort. Check the output of the "ompi_info" command and ensure that components of this type are available on your system. You may also wish to check the
Re: [OMPI users] Open MPI and OpenIB
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
PML add procs failed
--> Returned value -2 instead of OMPI_SUCCESS
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
error ...

This makes it sound like Open IB is failing to set up properly. I'm a bit out of my league on this one -- is there any application you can run

4 - How come the behavior of broadcast.c was different on Open MPI than it is on lam/mpi? I think I answered this one already.

5 - Any ideas as to why I am getting the "no btl component" error when I want to use openib, even though ompi_info shows it? If it helps any further, I have the following openib modules: This usually (but not always) indicates that something is going wrong with initializing the hardware interface. ompi_info only tries to load the module, but does not initialize the network device. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/