Re: [OMPI users] Pointers for understanding failure messages on NetBSD
kevin.buck...@ecs.vuw.ac.nz writes:

> Cc: to the OpenMPI list as the otfdump clash might be of interest
> elsewhere.
>
>> I attach a patch, but it doesn't work and I don't see where the
>> error lies now. It may be that I'm doing something stupid.
>> It produces a working OpenMPI-1.3.4 package on Dragonfly, though.
>
> Ok, I'll try and merge it in to the working stuff we have here.
> I, obviously, just #ifdef'd for NetBSD as that is all I have to
> try stuff out against.

No need for that, actually; we can do it later. I was using Dragonfly as a platform where it works out of the box.

>> Kevin, I've tried your chunk but it doesn't make anything better.
>> Do you really have a working OpenMPI on NetBSD?
>
> Oh yes!
>
> I have placed the tar of current patches from our pkgsrc build in
>
> http://www.ecs.vuw.ac.nz/~kevin/forMPI/openmpi-1.3.4-20091208-netbsd.tar.gz
>
> in case you want to try something out from an actual NetBSD build.

I'm looking at your patches now.

>> (What conflict do you observe with the pkgsrc-wip package, by the way?)
>
> That was detailed in another email, but basically the Open Trace Format
> that the VampirTrace (VT) stuff is looking to install tries to install:
>
> ${LOCALBASE}/bin/otfdump
>
> and that binary is already installed there as part of another
> package.
>
> You can get around this for a NetBSD OpenMPI deployment by adding this
> patch to the pkgsrc Makefile, which just removes the VT toolkit:
>
> 26a27
> > CONFIGURE_ARGS+=  --enable-contrib-no-build=vt
>
> I have no idea how NetBSD will go about resolving such clashes in the
> long term, though.
I've disabled it the same way for this time; my local package differs from what's in wip:

--- PLIST 3 Dec 2009 10:18:00 - 1.5
+++ PLIST 9 Dec 2009 08:29:31 -
@@ -1,17 +1,11 @@
 @comment $NetBSD$
 bin/mpiCC
-bin/mpiCC-vt
 bin/mpic++
-bin/mpic++-vt
 bin/mpicc
-bin/mpicc-vt
 bin/mpicxx
-bin/mpicxx-vt
 bin/mpiexec
 bin/mpif77
-bin/mpif77-vt
 bin/mpif90
-bin/mpif90-vt
 bin/mpirun
 bin/ompi-checkpoint
 bin/ompi-clean
@@ -21,28 +15,11 @@
 bin/ompi-server
 bin/ompi_info
 bin/opal_wrapper
-bin/opari
 bin/orte-clean
 bin/orte-iof
 bin/orte-ps
 bin/orted
 bin/orterun
-bin/otfaux
-bin/otfcompress
-bin/otfconfig
-bin/otfdecompress
-bin/otfdump
-bin/otfinfo
-bin/otfmerge
-bin/vtcc
-bin/vtcxx
-bin/vtf77
-bin/vtf90
-bin/vtfilter
-bin/vtunify
-etc/openmpi-default-hostfile
-etc/openmpi-mca-params.conf
-etc/openmpi-totalview.tcl
 include/mpi.h
 include/mpif-common.h
 include/mpif-config.h
@@ -79,40 +56,12 @@
 include/openmpi/ompi/mpi/cxx/topology_inln.h
 include/openmpi/ompi/mpi/cxx/win.h
 include/openmpi/ompi/mpi/cxx/win_inln.h
-include/vampirtrace/OTF_CopyHandler.h
-include/vampirtrace/OTF_Definitions.h
-include/vampirtrace/OTF_File.h
-include/vampirtrace/OTF_FileManager.h
-include/vampirtrace/OTF_Filenames.h
-include/vampirtrace/OTF_HandlerArray.h
-include/vampirtrace/OTF_MasterControl.h
-include/vampirtrace/OTF_RBuffer.h
-include/vampirtrace/OTF_RStream.h
-include/vampirtrace/OTF_Reader.h
-include/vampirtrace/OTF_WBuffer.h
-include/vampirtrace/OTF_WStream.h
-include/vampirtrace/OTF_Writer.h
-include/vampirtrace/OTF_inttypes.h
-include/vampirtrace/OTF_inttypes_unix.h
-include/vampirtrace/opari_omp.h
-include/vampirtrace/otf.h
-include/vampirtrace/pomp_lib.h
-include/vampirtrace/vt_user.h
-include/vampirtrace/vt_user.inc
-include/vampirtrace/vt_user_comment.h
-include/vampirtrace/vt_user_comment.inc
-include/vampirtrace/vt_user_count.h
-include/vampirtrace/vt_user_count.inc
 lib/libmca_common_sm.la
 lib/libmpi.la
 lib/libmpi_cxx.la
 lib/libmpi_f77.la
 lib/libopen-pal.la
 lib/libopen-rte.la
-lib/libotf.la
-lib/libvt.a
-lib/libvt.fmpi.a
-lib/libvt.mpi.a
 lib/openmpi/libompi_dbg_msgq.la
 lib/openmpi/mca_allocator_basic.la
 lib/openmpi/mca_allocator_bucket.la
@@ -503,6 +452,9 @@
 man/man7/orte_hosts.7
 man/man7/orte_snapc.7
 share/openmpi/amca-param-sets/example.conf
+share/openmpi/examples/openmpi-default-hostfile
+share/openmpi/examples/openmpi-mca-params.conf
+share/openmpi/examples/openmpi-totalview.tcl
 share/openmpi/help-coll-sync.txt
 share/openmpi/help-dash-host.txt
 share/openmpi/help-ess-base.txt
@@ -548,36 +500,9 @@
 share/openmpi/help-plm-rsh.txt
 share/openmpi/help-ras-base.txt
 share/openmpi/help-rmaps_rank_file.txt
-share/openmpi/mpiCC-vt-wrapper-data.txt
 share/openmpi/mpiCC-wrapper-data.txt
-share/openmpi/mpic++-vt-wrapper-data.txt
 share/openmpi/mpic++-wrapper-data.txt
-share/openmpi/mpicc-vt-wrapper-data.txt
 share/openmpi/mpicc-wrapper-data.txt
-share/openmpi/mpicxx-vt-wrapper-data.txt
 share/openmpi/mpicxx-wrapper-data.txt
-share/openmpi/mpif77-vt-wrapper-data.txt
 share/openmpi/mpif77-wrapper-data.txt
-share/openmpi/mpif90-vt-wrapper-data.txt
 share/openmpi/mpif90-wrapper-data.txt
-share/vampirtrace/FILTER.SPEC
-share/vampirtrace/GROUPS.SPEC
-share/vampirtrace/METRICS.SPEC
-share/vampirtrace/doc/ChangeLog
-share/vampirtrace/doc/LICENSE
-share/vampirtrace/doc/UserManual.html
-share/vampirtrace/doc/UserManual.pdf
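For anyone hitting the same otfdump clash, the pkgsrc workaround Kevin describes boils down to one line in the package Makefile (a sketch; the exact line number varies between pkgsrc revisions):

```make
# pkgsrc openmpi Makefile fragment: skip building the VampirTrace
# contrib (and its bundled OTF tools), avoiding the otfdump conflict
# with the package that already installs ${LOCALBASE}/bin/otfdump.
CONFIGURE_ARGS+=	--enable-contrib-no-build=vt
```

This is a config fragment, not a runnable program; the PLIST then shrinks accordingly, as the diff above shows.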
[OMPI users] Hanging vs Stopping behaviour in communication failures
Dear all,

Sometimes when running Open MPI jobs, the application hangs. Looking at the output, I get the following error message:

[ic17][[34562,1],74][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113)

I would expect Open MPI to eventually quit with an error in such situations. Is the observed behaviour (i.e. hanging) the intended one? If so, what would be the reason(s) for choosing hanging over stopping?

Best Regards,
-- Constantinos
Re: [OMPI users] mpirun only works when -np <4
On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
> There are 8 physical cores, or 16 with hyperthreading enabled.

That should be meaty enough.

> 1st of all, let me say that when I specify that -np is less than 4
> processors (1, 2, or 3), both programs seem to work as expected. Also,
> the non-mpi version of each of them works fine.

Presumably the non-mpi version is serial, however? That doesn't mean the program is bug-free or that the parallel version isn't broken. There are any number of apps that don't work above N processes; in fact, probably all programs break for some value of N. It's normally a little higher than 3, however.

> Thus, I am pretty sure that this is a problem with MPI rather than
> with the program code or something else.
>
> What happens is simply that the program hangs..

I presume you mean here that the output stops? The program continues to use CPU cycles but no longer appears to make any progress? I'm of the opinion that this is most likely an error in your program; I would start by using either valgrind or padb.

You can run the app under valgrind using the following mpirun options. This will give you four files named v.log.0 to v.log.3, which you can check for errors in the normal way. The "--mca btl tcp,self" option disables shared memory, which can create false positives.

mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%q{OMPI_COMM_WORLD_RANK}

Alternatively, you can run the application, wait for it to hang, and then in another window run my tool, padb, which will show you the MPI message queues and stack traces that should show you where it's hung. Instructions and sample output are on this page:

http://padb.pittman.org.uk/full-report.html

> There are no error messages, and there is no clue from anything else
> (system working fine otherwise - no RAM issues, etc). It does not hang
> at the same place every time: sometimes in the very beginning, sometimes
> near the middle..
>
> Could this be an issue with hyperthreading? A conflict with something?

Unlikely: if there were a problem in OMPI running more than 3 processes, it would have been found by now. I regularly run 8-process applications on my dual-core netbook alongside all my desktop processes without issue; it runs fine, a little slowly, but fine.

All this talk about binding and affinity won't help either. Process binding is about squeezing the last 15% of performance out of a system and making performance reproducible; it has no bearing on correctness or scalability. If you're not running on a dedicated machine (and with Firefox running, I guess you aren't), there would be a good case for leaving it off anyway.

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] mpirun only works when -np <4
Hi Matthew,

I just had the same problem with my application when using more than 4 cores. However, the program didn't hang; it crashed, and I got an error message of 'address not mapped'. As you say, it happened at different places in the code: sometimes in the beginning, sometimes in the middle, sometimes at the end.

I wrote to the list about it, and also got the suggestion that the cause could probably be found in my own application. And it could! I realized that all the different places where the crash happened were the same places in the code where I got compiler warnings during compilation. Most of the warnings dealt with type mismatches of variables used in different places in the code. I cleaned the code to remove the warnings, and after that I've had no problems using more than 4 cores.

It may be worth a try for you.

Best regards,
Iris Lohmann

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ashley Pittman
Sent: 09 December 2009 11:38
To: Open MPI Users
Subject: Re: [OMPI users] mpirun only works when -np <4

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] orte error
Hi,

I've installed Trilinos using the openmpi 1.3.3 libraries. I'm configuring openmpi as follows:

./configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc F77=/usr/local/bin/gfortran --prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC

Trilinos compiles without problem but the tests fail (see below). I'm running a Mac with OS X 10.6 (Snow Leopard). The mpi tests seem to run fine:

bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
Hello, world, I am 0 of 2
Hello, world, I am 1 of 2

I'm convinced that the problem has to do with the paths and the different versions of mpi lurking on the Mac. I don't want to use the version of openmpi that comes bundled with the Mac, for a different reason.

Any help would be most appreciated.

Andrew

Start testing: Dec 09 12:18 SAST
--
1/534 Testing: Teuchos_BLAS_test_MPI_1
1/534 Test: Teuchos_BLAS_test_MPI_1
Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe" "-v"
Directory: /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
"Teuchos_BLAS_test_MPI_1" start time: Dec 09 12:18 SAST
Output:
--
[macs-mac.local:71058] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[macs-mac.local:71058] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--
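A quick way to test the "different versions of mpi lurking" theory is to check which binaries the shell actually resolves before pointing the test harness at an explicit path (a sketch; the prefix matches the --prefix from the configure line above):

```shell
# Hypothetical install prefix taken from the configure line.
OMPI_PREFIX="$HOME/lib/openmpi-1.3.3/MAC"

# Which mpirun/mpicc come first on PATH?  Snow Leopard ships its own
# MPI in /usr/bin, a common source of exactly this kind of mismatch.
which mpirun mpicc 2>/dev/null || true

# Does PATH already prefer the intended install?
case ":$PATH:" in
  *":$OMPI_PREFIX/bin:"*) echo "using local Open MPI" ;;
  *)                      echo "PATH does not prefer $OMPI_PREFIX/bin" ;;
esac
```

If the system MPI wins the PATH race, mpiexec and the libraries linked into Teuchos_BLAS_test.exe can come from different installs, which is one known way to get "orte_ess_base_select failed".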
Re: [OMPI users] orte error
You need to set your LD_LIBRARY_PATH to ~/lib/openmpi-1.3.3/MAC/lib, and your PATH to ~/lib/openmpi-1.3.3/MAC/bin.

It should then run fine.

On Wed, Dec 9, 2009 at 6:29 AM, Andrew McBride wrote:
> Hi
>
> I've installed trilinos using the openmpi 1.3.3 libraries. I'm configuring
> openmpi as follows:
> ./configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc
> F77=/usr/local/bin/gfortran --prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC
>
> Trilinos compiles without problem but the tests fail (see below). I'm
> running a Mac with OS X 10.6 (Snow Leopard). The mpi tests seem to run fine.
>
> I'm convinced that the problem has to do with the paths and the different
> versions of mpi lurking on the Mac. I don't want to use the version of
> openmpi that comes bundled with the Mac for a different reason.
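Ralph's suggestion in shell form (a sketch; the prefix is the hypothetical one from Andrew's configure line, and note that on Mac OS X the dynamic linker consults DYLD_LIBRARY_PATH rather than LD_LIBRARY_PATH):

```shell
# Hypothetical prefix from the --prefix used at configure time.
OMPI_PREFIX="$HOME/lib/openmpi-1.3.3/MAC"

export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib:$LD_LIBRARY_PATH"
# Mac OS X reads DYLD_LIBRARY_PATH for dynamic libraries:
export DYLD_LIBRARY_PATH="$OMPI_PREFIX/lib:$DYLD_LIBRARY_PATH"

# Sanity check: the intended install should now be found first.
case ":$PATH:" in
  *"openmpi-1.3.3"*) echo "path set" ;;
esac
```

Put these exports in the shell startup file so the CTest harness inherits them too; otherwise only interactive shells see the right mpiexec.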
Re: [OMPI users] orte error
Thanks for your quick response, Ralph.

The errors I now get are of a completely different nature and have to do with, presumably, calling delete on an unallocated pointer. Now, this probably has little to do with openmpi and more to do with the compilers used to create openmpi?

I used gcc version 4.5.0 20090910 when compiling openmpi.

Does anyone have any ideas?

Regards,
Andrew

Start testing: Dec 09 15:53 SAST
--
1/534 Testing: Teuchos_BLAS_test_MPI_1
1/534 Test: Teuchos_BLAS_test_MPI_1
Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe" "-v"
Directory: /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
"Teuchos_BLAS_test_MPI_1" start time: Dec 09 15:53 SAST
Output:
--
Teuchos_BLAS_test.exe(72504) malloc: *** error for object 0x100727c00: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
[macs-mac:72504] *** Process received signal ***
[macs-mac:72504] Signal: Abort trap (6)

On 09 Dec 2009, at 3:32 PM, Ralph Castain wrote:

> You need to set your LD_LIBRARY_PATH to ~/lib/openmpi-1.3.3/MAC/lib, and
> your PATH to ~/lib/openmpi-1.3.3/MAC/bin
>
> It should then run fine.
Re: [OMPI users] orte error
Can you run simple MPI applications, like sending a message around in a ring?

On Dec 9, 2009, at 10:18 AM, Andrew McBride wrote:

> Thanks for your quick response, Ralph.
>
> The errors I now get are of a completely different nature and have to do
> with, presumably, calling delete on an unallocated pointer.
>
> Teuchos_BLAS_test.exe(72504) malloc: *** error for object 0x100727c00:
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> [macs-mac:72504] *** Process received signal ***
> [macs-mac:72504] Signal: Abort trap (6)

-- 
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] orte error
Seemingly. Here is the output of ring:

bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicxx ring_cxx.cc
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

and here is the output of hello:

bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
Hello, world, I am 0 of 2
Hello, world, I am 1 of 2

I presume this output is correct? I guess the issue I have lies elsewhere, then?

Andrew

On 09 Dec 2009, at 5:44 PM, Jeff Squyres wrote:

> Can you run simple MPI applications, like sending a message around in a
> ring?
Re: [OMPI users] orte error
On Dec 9, 2009, at 10:59 AM, Andrew McBride wrote:

> seemingly. here is the output of ring:
>
> I presume this output is correct? I guess the issue I have lies elsewhere
> then?

Yes -- the output looks correct.

Never say "never", but it would *seem* that the error lies in your app somewhere. Can you double check that you're not freeing things that you shouldn't?

-- 
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] ompi-restart using different nodes
So I tried to reproduce this problem today, and everything worked fine for me using the trunk. I haven't tested v1.3/v1.4 yet. I tried checkpointing with one hostfile then restarting with each of the following: - No hostfile - a hostfile with completely different machines - a hostfile with the same machines in the opposite order I suspect that the problem is not with Open MPI, but your system interacting with BLCR. Usually when people cannot restart on a different node they have problems with the 'prelink' feature on Linux. BLCR has a FAQ item on this: https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink So if this is your problem then you will probably not be able to checkpoint a single process (non-MPI) application on one node and restart on another. Sorry I didn't mention it before, must have slipped my mind. If this turns out to not be the problem, let me know and I'll take another look. Also send me any error messages that are displayed. -- Josh On Dec 8, 2009, at 1:39 PM, Jonathan Ferland wrote: I did the same test using 1.3.4 and still the same issue I also tried to use the tm interface instead of specifying the hostfile, same result. thanks, Jonathan Josh Hursey wrote: Though I do not test this scenario (using hostfiles) very often, it used to work. The ompi-restart command takes a --hostfile (or -- machinefile) argument that is passed directly to the mpirun command. I wonder if something broke recently with this handoff. I can certainly checkpoint with one set of nodes/allocation and restart with another, but most/all of my testing occurs in a SLURM environment, so no need for an explicit hostfile. I'll take a look to see if I can reproduce, but probably will not be until next week. -- Josh On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote: Hi, I am trying to use BLCR checkpointing in mpi. I am currently able to run my application using some hostfile, checkpoint the run, and then restart the application using the same hostfile. 
The thing I would like to do is to restart the application with a different hostfile, but this leads to a segfault using 1.3.3. Is it possible to restart the application using a different hostfile (we are using PBS to create the hostfile, so each new restart might be on different nodes)? If so, how can we do that? If not, do you plan to include this in a future release?

thanks

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jonathan Ferland, scientific computing analyst
RQCHP (Réseau québécois de calcul de haute performance)
office S-252, Roger-Gaudry building, Université de Montréal
phone: 514 343-6111 ext. 8852
fax: 514 343-2155
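Josh's suggested sanity check - checkpointing a single non-MPI process on one node and restarting it on a different one - can be sketched with BLCR's command-line tools. This is a hypothetical sketch: `myapp`, `nodeB`, and the paths are placeholders, and it assumes BLCR is installed identically on both nodes.

```shell
# on node A: start the program under BLCR's checkpoint support
cr_run ./myapp &
APP_PID=$!

# take a checkpoint and terminate the process; writes context.<pid>
cr_checkpoint --term $APP_PID

# copy the context file to another node and try to restart it there
scp context.$APP_PID nodeB:/tmp/
ssh nodeB cr_restart /tmp/context.$APP_PID
```

If the restart succeeds on the original node but fails on the second one, the prelink issue described in the BLCR FAQ is the likely culprit.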
Re: [OMPI users] ompi-restart using different nodes
Hi Josh,

Thanks for helping. That solved the problem!!!

cheers, Jonathan

--
Jonathan Ferland, scientific computing analyst
RQCHP (Réseau québécois de calcul de haute performance)
office S-252, Roger-Gaudry building, Université de Montréal
phone: 514 343-6111 ext. 8852
fax: 514 343-2155
Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
Hi Gus,

Interestingly, the results for the connectivity_c test... it works fine with -np <8. For -np >8 it works some of the time; other times it HANGS. I have got to believe that this is a big clue!! Also, when it hangs, sometimes I get the message "mpirun was unable to cleanly terminate the daemons on the nodes shown below". Note that NO nodes are shown below. Once, I got -np 250 to pass the connectivity test, but I was not able to replicate this reliably, so I'm not sure if it was a fluke, or what. Here is a link to a screenshot of top when connectivity_c is hung with -np 14.. I see that 2 processes are only at 50% CPU usage.. Hmmm:
http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw=directlink

The other tests, ring_c and hello_c, as well as the cxx versions of these guys, work with all values of -np. Using -mca mpi_paffinity_alone 1 I get the same behavior.

I agree that I should worry about the mismatch between where the libraries are installed versus where I am telling my programs to look for them. Would this type of mismatch cause behavior like what I am seeing, i.e. working with a small number of processors, but failing with larger? It seems like a mismatch would have the same effect regardless of the number of processors used. Maybe I am mistaken. Anyway, to address this: which mpirun gives me /usr/local/bin/mpirun.. so to configure: ./configure --with-mpi=/usr/local/bin/mpirun, and to run: /usr/local/bin/mpirun -np X ... This should be right.

uname -a gives me:
Linux macmanes 2.6.31-16-generic #52-Ubuntu SMP Thu Dec 3 22:07:16 UTC 2006 x86_64 GNU/Linux

Matt

On Dec 8, 2009, at 8:50 PM, Gus Correa wrote:

> Hi Matthew
>
> Please see comments/answers inline below.
>
> Matthew MacManes wrote:
>> Hi Gus, Thanks for your ideas.. I have a few questions, and will try to
>> answer yours in hopes of solving this!!
> A simple way to test OpenMPI on your system is to run the
> test programs that come with the OpenMPI source code:
> hello_c.c, connectivity_c.c, and ring_c.c:
> http://www.open-mpi.org/
>
> Get the tarball from the OpenMPI site, gzip and untar it,
> and look for them in the "examples" directory.
> Compile with /your/path/to/openmpi/bin/mpicc hello_c.c
> Run with /your/path/to/openmpi/bin/mpiexec -np X a.out
> using X = 2, 4, 8, 16, 32, 64, ...
>
> This will tell you if your OpenMPI is functional,
> and if you can run on many Nehalem cores,
> even with oversubscription perhaps.
> It will also set the stage for further investigation of your
> actual programs.
>
>> Should I worry about setting things like --num-cores --bind-to-cores? This,
>> I think, gets at your questions about processor affinity.. Am I right? I
>> could not exactly figure out the -mca mpi_paffinity_alone stuff...
>
> I use the simple-minded -mca mpi_paffinity_alone 1.
> This is probably the easiest way to assign a process to a core.
> There are more complex ways in OpenMPI, but I haven't tried them.
> Indeed, -mca mpi_paffinity_alone 1 does improve performance of
> our programs here.
> There is a chance that without it the 16 virtual cores of
> your Nehalem get confused with more than 3 processes
> (you reported that -np > 3 breaks).
>
> Did you try adding just -mca mpi_paffinity_alone 1 to
> your mpiexec command line?
>
>> 1. Additional load: nope. nothing else, most of the time not even firefox.
>
> Good.
> Turn off firefox, etc, to make it even better.
> Ideally, use runlevel 3, no X, like a computer cluster node,
> but this may not be required.
>
>> 2. RAM: no problems apparent when monitoring through top. Interestingly, I did
>> wonder about oversubscription, so I tried the option --nooversubscription,
>> but this gave me an error message.
> Oversubscription from your program would only happen if
> you asked for more processes than available cores, i.e.,
> -np > 8 (or, with "virtual" cores in the case of Nehalem hyperthreading,
> -np > 16).
> Since you have -np=4 there is no oversubscription,
> unless you have other external load (e.g. Matlab, etc),
> but you said you don't.
>
> Yet another possibility would be if your program is threaded
> (e.g. using OpenMP along with MPI), but considering what you
> said about OpenMP I would guess the programs don't use it.
> For instance, you launch the program with 4 MPI processes,
> and each process decides to start, say, 8 OpenMP threads.
> You end up with 32 threads and 8 (real) cores (or 16 hyperthreaded
> ones on Nehalem).
>
> What else does top say?
> Any hog processes (memory- or CPU-wise)
> besides your program processes?
>
>> 3. I have not tried other MPI flavors. I've been speaking to the authors of
>> the programs, and they are both using OpenMPI.
>
> I was not trying to convince you to use another MPI.
> I use MPICH2 also, but OpenMPI reigns here.
> The idea of trying it with MPICH2 was just to check whether OpenMPI
> is causing the problem, but I don't think it is.
>
>> 4. I don't think that this
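Gus's suggested sanity check, spelled out as commands. This is a sketch under assumptions: the /usr/local prefix matches Matt's reported installation, and the source tree path is illustrative; adjust both, and the -np values, as needed.

```shell
# build and run the test programs shipped in the Open MPI "examples" directory
cd openmpi-1.3.4/examples
/usr/local/bin/mpicc connectivity_c.c -o connectivity_c
/usr/local/bin/mpiexec -np 8 ./connectivity_c

# repeat with process-to-core binding enabled
/usr/local/bin/mpiexec -np 8 -mca mpi_paffinity_alone 1 ./connectivity_c
```

Note the parameter name uses underscores throughout (mpi_paffinity_alone); a misspelled MCA parameter is silently ignored.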
Re: [OMPI users] Problem with mpirun -preload-binary option
I verified that the preload functionality works on the trunk. It seems to be broken on the v1.3/v1.4 branches. The version of this code has changed significantly between the v1.3/v1.4 and the trunk/v1.5 versions. I filed a bug about this so it does not get lost:
https://svn.open-mpi.org/trac/ompi/ticket/2139

Can you try this again with either the trunk or v1.5 to see if that helps with the preloading? However, you need to fix the password-less login issue before anything else will work. If mpirun is prompting you for a password, then it will not work properly.

-- Josh

On Nov 12, 2009, at 3:50 PM, Qing Pang wrote:

Now that I have passwordless ssh set up in both directions, and verified it working, I still have the same problem. I'm able to run ssh/scp on both master and client nodes (at this point, they are pretty much the same) without being asked for a password. And mpirun works fine if I have the executable put in the same directory on both nodes. But when I tried the preload-binary option, I still have the same problem: it asked me for the password of the node running mpirun, and then tells me that scp failed.

--- Josh Wrote:

Though the --preload-binary option was created while building the checkpoint/restart functionality, it does not depend on the checkpoint/restart function in any way (just a side effect of the initial development). The problem you are seeing is a result of the computing environment setup of password-less ssh. The --preload-binary command uses 'scp' (at the moment) to copy the files from the node running mpirun to the compute nodes. The compute nodes are the ones that call 'scp', so you will need to set up password-less ssh in both directions.

-- Josh

On Nov 11, 2009, at 8:38 AM, Ralph Castain wrote:

I'm no expert on the preload-binary option - but I would suspect that is the case given your observations. That option was created to support checkpoint/restart, not for what you are attempting to do.
Like I said, you -should- be able to use it for that purpose, but I expect you may hit a few quirks like this along the way.

On Nov 11, 2009, at 9:16 AM, Qing Pang wrote:

> Thank you very much for your help! I believe I do have password-less ssh set up, at least from master node to client node (desktop -> laptop in my case). If I type >ssh node1 on my desktop terminal, I am able to get to the laptop node without being asked for a password. And as I mentioned, if I copy the example executable from desktop to the laptop node using scp, then I am able to run it from desktop using both nodes.
> Back to the preload-binary problem - I am asked for the password of my master node - the node I am working on - not the remote client node. Do you mean that I should set up password-less ssh in both directions? Does the client node need to access the master node through password-less ssh to make the preload-binary option work?
>
> Ralph Castain Wrote:
>
> It -should- work, but you need password-less ssh setup. See our FAQ
> for how to do that, if you are unfamiliar with it.
>
> On Nov 10, 2009, at 2:02 PM, Qing Pang wrote:
>
>> I'm having problems getting the mpirun "preload-binary" option to work.
>>
>> I'm using Ubuntu 8.10 with openmpi 1.3.3, nodes connected with Ethernet cable.
>> If I copy the executable to client nodes using scp, then do mpirun, everything works.
>>
>> But I really want to avoid the copying, so I tried the -preload-binary option.
>>
>> When I typed the command on my master node as below (gordon-desktop is my master node, and gordon-laptop is the client node):
>>
>> --
>> gordon_at_gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun
>> -machinefile machine.linux -np 2 --preload-binary $(pwd)/hello_c.out
>> --
>>
>> I got the following:
>>
>> gordon_at_gordon-desktop's password: (I entered my password here; why am I asked for the password? I am working under this account anyway)
>>
>> WARNING: Remote peer ([[18118,0],1]) failed to preload a file.
>>
>> Exit Status: 256
>> Local File: /tmp/openmpi-sessions-gordon_at_gordon-laptop_0/18118/0/hello_c.out
>> Remote File: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
>> Command:
>> scp gordon-desktop:/home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
>> /tmp/openmpi-sessions-gordon_at_gordon-laptop_0/18118/0/hello_c.out
>>
>> Will continue attempting to launch the process(es).
>> --
>> --
>> mpirun was unable to launch the specified application as it could not access
>> or execute an executable:
>>
>> Executable:
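Josh's point about needing password-less ssh "in both directions" can be set up as follows. This is a hypothetical sketch using the hostnames from this thread; run the same two commands on the laptop as well, pointing back at the desktop.

```shell
# on gordon-desktop: create a key pair (if none exists) and install it on the laptop
ssh-keygen -t rsa -N ""              # accept the default ~/.ssh/id_rsa
ssh-copy-id gordon@gordon-laptop     # appends the public key to the remote authorized_keys

# verify: this must complete without a password prompt
ssh gordon@gordon-laptop true
```

Because --preload-binary makes the compute nodes scp *from* the node running mpirun, the reverse direction (laptop -> desktop) is the one that matters for the password prompt seen above.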
Re: [OMPI users] mpirun only works when -np <4
Thanks Ashley, I'll try your tool.. I would think that this is an error in the programs I am trying to use, too, but this is a problem with 2 different programs, written by 2 different groups.. One of them might be bad, but both.. seems unlikely.

Interestingly, the results for the connectivity_c test that is included with OMPI... it works fine with -np <8. For -np >8 it works some of the time; other times it HANGS. I have got to believe that this is a big clue!! Also, when it hangs, sometimes I get the message "mpirun was unable to cleanly terminate the daemons on the nodes shown below". Note that NO nodes are shown below. Once, I got -np 250 to pass the connectivity test, but I was not able to replicate this reliably, so I'm not sure if it was a fluke, or what. Here is a link to a screenshot of top when connectivity_c is hung with -np 14.. I see that 2 processes are only at 50% CPU usage.. Hmmm:
http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw=directlink

The other tests, ring_c and hello_c, as well as the cxx versions of these guys, work with all values of -np.

Unfortunately, I could not get valgrind to work...

Thanks, Matt

On Dec 9, 2009, at 2:37 AM, Ashley Pittman wrote:

> On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
>> There are 8 physical cores, or 16 with hyperthreading enabled.
>
> That should be meaty enough.
>
>> 1st of all, let me say that when I specify that -np is less than 4
>> processors (1, 2, or 3), both programs seem to work as expected. Also,
>> the non-mpi version of each of them works fine.
>
> Presumably the non-mpi version is serial, however? This doesn't mean
> the program is bug-free or that the parallel version isn't broken.
> There are any number of apps that don't work above N processes; in fact
> probably all programs break for some value of N. It's normally a little
> higher than 3, however.
>
>> Thus, I am pretty sure that this is a problem with MPI rather than
>> with the program code or something else.
>>
>> What happens is simply that the program hangs..
>
> I presume you mean here that the output stops? The program continues to use
> CPU cycles but no longer appears to make any progress?
>
> I'm of the opinion that this is most likely an error in your program. I
> would start by using either valgrind or padb.
>
> You can run the app under valgrind using the following mpirun options;
> this will give you four files named v.log.0 to v.log.3 which you can
> check for errors in the normal way. The "--mca btl tcp,self" option
> will disable shared memory, which can create false positives.
>
> mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%q{OMPI_COMM_WORLD_RANK}
>
> Alternatively you can run the application, wait for it to hang and then
> in another window run my tool, padb, which will show you the MPI message
> queues and stack traces, which should show you where it's hung;
> instructions and sample output are on this page.
>
> http://padb.pittman.org.uk/full-report.html
>
>> There are no error messages, and there is no clue from anything else
>> (system working fine otherwise - no RAM issues, etc). It does not hang
>> at the same place every time; sometimes in the very beginning, sometimes
>> near the middle..
>>
>> Could this be an issue with hyperthreading? A conflict with something?
>
> Unlikely; if there was a problem in OMPI running more than 3 processes
> it would have been found by now. I regularly run 8-process applications
> on my dual-core netbook alongside all my desktop processes without
> issue; it runs fine, a little slowly, but fine.
>
> All this talk about binding and affinity won't help either; process
> binding is about squeezing the last 15% of performance out of a system
> and making performance reproducible. It has no bearing on correctness or
> scalability.
> If you're not running on a dedicated machine - which, with
> firefox running, I guess you aren't - then there would be a good case for
> leaving it off anyway.
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

_
Matthew MacManes
PhD Candidate
University of California, Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/
Re: [OMPI users] checkpoint opempi-1.3.3+sge62
On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:

Hi Josh, You were right. The main problem was the /tmp. SGE uses a scratch directory in which the jobs keep temporary files. Setting TMPDIR to /tmp, checkpoint works! However, when I try to restart it... I get the following error (see ERROR1). With option -v I get these lines (see ERROR2).

It is concerning that ompi-restart is segfaulting when it errors out. The error message is being generated between the launch of the opal-restart starter command and when we try to exec(cr_restart). Usually the failure is related to a corruption of the metadata stored in the checkpoint. Can you send me the file below:

ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data

I was able to reproduce the segv (at least I think it is the same one). We failed to check the validity of a string when we parse the metadata. I committed a fix to the trunk in r22290, and requested that the fix be moved to the v1.4 and v1.5 branches. If you are interested in seeing when they get applied, you can follow these tickets:

https://svn.open-mpi.org/trac/ompi/ticket/2140
https://svn.open-mpi.org/trac/ompi/ticket/2141

Can you try the trunk to see if the problem goes away? The development trunk and v1.5 series have a bunch of improvements to the C/R functionality that were never brought over to the v1.3/v1.4 series.

I was trying to use ssh instead of rsh, but it was impossible. By default it should use ssh and, if it finds a problem, fall back to rsh. It seems that ssh doesn't work, because it always uses rsh. If I change this MCA parameter, it still uses rsh. If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to use ssh and doesn't work. I get "bash: orted: command not found" and the MPI process dies.
The command it tries to execute is the following, and I haven't yet found the reason why this command cannot find orted, because I set the PATH in /etc/bashrc in order to always get the right path, and I have the right path in my application (see ERROR4).

This seems like an SGE-specific issue, so a bit out of my domain. Maybe others have suggestions here.

-- Josh

Many thanks!, Sergio

P.S. Sorry about these long emails. I just try to show you useful information to identify my problems.

ERROR 1

> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
> --
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> --
> [compute-3-18:28792] *** Process received signal ***
> [compute-3-18:28792] Signal: Segmentation fault (11)
> [compute-3-18:28792] Signal code: (128)
> [compute-3-18:28792] Failing at address: (nil)
> [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
> [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28792] [ 5] opal-restart [0x40312a]
> [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28792] [ 7] opal-restart [0x40272a]
> [compute-3-18:28792] *** End of error message ***
> [compute-3-18:28793] *** Process received signal ***
> [compute-3-18:28793] Signal: Segmentation fault (11)
> [compute-3-18:28793] Signal code: (128)
> [compute-3-18:28793] Failing at address: (nil)
> [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x33bb669135]
> [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28793] [ 5] opal-restart [0x40312a]
> [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28793] [ 7] opal-restart [0x40272a]
> [compute-3-18:28793] *** End of error message ***
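Two things worth trying for the "orted: command not found" symptom, sketched below with paths assumed from this thread. plm_rsh_agent is the 1.3-series parameter for choosing the remote launcher, and --prefix tells mpirun where the Open MPI installation lives on the remote nodes, so orted can be found without relying on the remote shell's startup files.

```shell
# force ssh as the remote launcher instead of qrsh/rsh
mpirun --mca plm_rsh_agent ssh -np 2 ./a.out

# or set it in the environment for all subsequent runs
export OMPI_MCA_plm_rsh_agent=ssh

# point mpirun at the remote installation prefix so orted is found
mpirun --prefix /opt/cesga/openmpi-1.3.3 -np 2 ./a.out
```

Note that ssh launched non-interactively may not read /etc/bashrc at all, which would explain why setting PATH there did not help; --prefix sidesteps that entirely.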
Re: [OMPI users] Changing location where checkpoints are saved
I took a look at the checkpoint staging and preload functionality. It seems that the combination of the two is broken on the v1.3 and v1.4 branches. I filed a bug about it so that it would not get lost:
https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual fix is much more involved. I don't know when I'll get around to finishing this bug fix for that branch. :( However, the current development trunk and v1.5 are known to have a working version of this feature. Can you try the trunk or v1.5 and see if this fixes the problem?

-- Josh

P.S. If you are interested, we have a slightly better version of the documentation, hosted at the link below:
http://osl.iu.edu/research/ft/ompi-cr/

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:

Josh Hursey wrote:

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:

Thanks for the reply! Concerning the MCA options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1 values?
- in priority options (e.g.: crs_blcr_priority) do lower numbers indicate higher priority?

By searching the archives of the mailing list I found two interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php (for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php (for restarting)

Following the indications given in [1], I tried to make each process checkpoint itself in its local /tmp and centralize the resulting checkpoints in /tmp or $HOME.

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid

OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044]   PID 17036
[ic85:17044]   Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt

OUTPUT of MPIRUN
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--
WARNING: Could not preload specified file: File already exists.
Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85
Will continue attempting to launch the process.
--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054

This is a warning about creating the global snapshot directory (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It seems to indicate that the directory existed when the file gather started.
A couple of things to check:
- Did you clean out the /tmp on all of the nodes of any files starting with "opal" or "ompi"?
- Does the error go away when you set (snapc_base_global_snapshot_dir=$HOME)?
- Could you try running against a v1.3 release? (I wonder if this feature has been broken on the trunk.)

Let me know what you find. In the next couple of days, I'll try to test the trunk again with this feature to make sure that it is still working on my test machines.

-- Josh

Hello Josh,

I have switched to v1.3 and re-run with snapc_base_global_snapshot_dir=/tmp or $HOME with a clean /tmp. In both cases I get the same error as before :-( I don't know if
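Josh's first check - clearing stale session files from every node's /tmp - can be scripted over the same machinefile used by the mpirun command above. A hypothetical sketch, assuming password-less ssh to each host listed in 'machines':

```shell
# remove leftover Open MPI / OPAL session files from each node
while read host; do
  ssh "$host" 'rm -rf /tmp/opal* /tmp/ompi*'
done < machines
```

Run this between checkpoint attempts, since the "File already exists" warning suggests a previous snapshot directory was left behind.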
Re: [OMPI users] Pointers for understanding failure messages on NetBSD
>> 26a27
>>> CONFIGURE_ARGS+= --enable-contrib-no-build=vt
>>
>> I have no idea how NetBSD go about resolving such clashes in the long
>> term though?
>
> I've disabled it the same way for this time; my local package differs
> from what's in wip:
>
> --- PLIST 3 Dec 2009 10:18:00 - 1.5
> +++ PLIST 9 Dec 2009 08:29:31 -
> @@ -1,17 +1,11 @@
> @comment $NetBSD$
> bin/mpiCC
> -bin/mpiCC-vt
> bin/mpic++
> -bin/mpic++-vt

I am surprised that you are still installing binaries and other files with the -vt extension after disabling the vt stuff?

> I can commit my development patches into wip right now,
> if that helps you.

If your stuff now works then that's ideal. If your build is still failing after applying my patches then probably not. Given that we have something that does work, it would make sense to try and merge the two as far as possible before proceeding any further. As discussed before, there is no real reason to have two getifaddrs loops separating out IPv6 and non-IPv6 addresses - that could all be done in one loop.

> Some patches should be there anyway, since OpenMPI doesn't help with
> installation of configuration files into the example directory anyway.

OK, as you are the person within the NetBSD community looking after OpenMPI, I'll happily work with whatever is in the NetBSD repository and patch locally as needed, because I have people here who want to use stuff that requires OpenMPI now.

Are you going to upgrade the NetBSD port to build against OpenMPI 1.4 now that it is available? Might be a good time to check the fuzz in the existing patches.

Kevin

--
Kevin M. Buckley
Room: CO327
Phone: +64 4 463 5971
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
[OMPI users] Problem building OpenMPI with PGI compilers
Hi all,

My first ever attempt to build OpenMPI. Platform is Sun Fire x4600 M2 servers, running Scientific Linux version 5.3. Trying to build OpenMPI 1.4 (as of today; same problems yesterday with 1.3.4). Trying to use PGI version 10.0. As a first attempt, I set CC, CXX, F77, and FC, then did "configure" and "make". Make ends with:

libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir .libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o .libs/win.o .libs/file.o -Wl,--rpath -Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/ompi/.libs -Wl,--rpath -Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs -Wl,--rpath -Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs -Wl,--rpath -Wl,/global/common/tesla/usg/openmpi/1.4/lib -L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs -L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs ../../../ompi/.libs/libmpi.so /project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs/libopen-rte.so /project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lpthread
pgCC-Error-Unknown switch: --instantiation_dir
make[2]: *** [libmpi_cxx.la] Error 1

So I Googled "instantiation_dir openmpi", which led me to:
http://cia.vc/stats/project/OMPI?s_message=3
where I see:

There's still something wrong with the C++ support, however; I get errors about a template directory switch when compiling the C++ MPI bindings (doesn't happen with PGI 9.0). Still working on this...
it feels like it's still a Libtool issue, because OMPI is not putting in this compiler flag as far as I can tell:

{{{
/bin/sh ../../../libtool --tag=CXX --mode=link pgCC -g -version-info 0:0:0 -export-dynamic -o libmpi_cxx.la -rpath /home/jsquyres/bogus/lib mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo ../../../ompi/libmpi.la -lnsl -lutil -lpthread
libtool: link: tpldir=Template.dir
libtool: link: rm -rf Template.dir
libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir .libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o .libs/win.o .libs/file.o -Wl,--rpath -Wl,/users/jsquyres/svn/ompi-1.3/ompi/.libs -Wl,--rpath -Wl,/users/jsquyres/svn/ompi-1.3/orte/.libs -Wl,--rpath -Wl,/users/jsquyres/svn/ompi-1.3/opal/.libs -Wl,--rpath -Wl,/home/jsquyres/bogus/lib -L/users/jsquyres/svn/ompi-1.3/orte/.libs -L/users/jsquyres/svn/ompi-1.3/opal/.libs ../../../ompi/.libs/libmpi.so /users/jsquyres/svn/ompi-1.3/orte/.libs/libopen-rte.so /users/jsquyres/svn/ompi-1.3/opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lpthread
pgCC-Error-Unknown switch: --instantiation_dir
make: *** [libmpi_cxx.la] Error 1
}}}

I noticed the comment "doesn't happen with PGI 9.0", so I re-did the entire process with PGI 9.0 instead of 10.0, but I get the same error! Any suggestions? Let me know if I should provide full copies of the configure and make output. Thanks!

--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax: (510) 486-4316
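One hedged workaround, if the C++ MPI bindings are not needed: Open MPI's configure accepts --disable-mpi-cxx, which skips building libmpi_cxx, the library whose link step triggers the pgCC --instantiation_dir switch. This is a sketch, not a fix for the underlying Libtool issue; the install prefix is taken from the build log above and should be adjusted.

```shell
./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
    --disable-mpi-cxx \
    --prefix=/global/common/tesla/usg/openmpi/1.4
make all install
```

C and Fortran codes build and run normally with this configuration; only programs using the C++ bindings (MPI::) would be affected.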
[OMPI users] OpenMPI 1.4 RPM Spec file problem
Hi all:

I'm trying to build openmpi-1.4 RPMs using my normal (complex) rpmbuild commands, but it's failing. I'm running into two errors.

One (on gcc only): the D_FORTIFY_SOURCE build failure. I've had to move the 'if test "$using_gcc" = 0; then' line down to after the RPM_OPT_FLAGS= that includes D_FORTIFY_SOURCE; otherwise the compile blows up.

The second, and in my opinion more major, RPM spec file bug is something with the files specification. I build multiple versions of OpenMPI to accommodate the collection of compilers I use (on this machine, I have Intel 10.1 and GCC, and will have to add 9.1 per user request); on others, I use PGI and GCC. In any case, here's my build command for Intel:

CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb --define 'install_in_opt 1' --define 'install_modulefile 1' --define 'modules_rpm_name Modules' --define 'build_all_in_one_rpm 0' --define 'configure_options --with-tm=/opt/torque' --define '_name openmpi-intel' openmpi-1.4.spec

Unfortunately, the file spec is somehow broken and it ends up missing most (all?)
the files, and failing in the final stage of RPM creation:

---
Processing files: openmpi-intel-docs-1.4-1
Finding Provides: /usr/lib/rpm/find-provides openmpi-intel
Finding Requires: /usr/lib/rpm/find-requires openmpi-intel
Finding Supplements: /usr/lib/rpm/find-supplements openmpi-intel
Requires(rpmlib): rpmlib(PayloadFilesHavePrefix) <= 4.0-1 rpmlib(CompressedFileNames) <= 3.0.4-1
Requires: openmpi-intel-runtime
Checking for unpackaged file(s): /usr/lib/rpm/check-files /var/tmp/openmpi-intel-1.4-1-root
error: Installed (but unpackaged) file(s) found:
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
   /opt/openmpi-intel/1.4/bin/mpiCC-vt
   /opt/openmpi-intel/1.4/bin/mpic++-vt
   /opt/openmpi-intel/1.4/bin/mpicc-vt
   /opt/openmpi-intel/1.4/bin/mpicxx-vt
   /opt/openmpi-intel/1.4/bin/mpif77-vt
   /opt/openmpi-intel/1.4/bin/mpif90-vt
   /opt/openmpi-intel/1.4/bin/ompi-checkpoint
   /opt/openmpi-intel/1.4/bin/ompi-clean
   /opt/openmpi-intel/1.4/bin/ompi-iof
   /opt/openmpi-intel/1.4/bin/ompi-ps
   /opt/openmpi-intel/1.4/bin/ompi-restart
   /opt/openmpi-intel/1.4/bin/ompi-server
   /opt/openmpi-intel/1.4/bin/opari
   /opt/openmpi-intel/1.4/bin/orte-clean
   /opt/openmpi-intel/1.4/bin/orte-iof
   /opt/openmpi-intel/1.4/bin/orte-ps
   /opt/openmpi-intel/1.4/bin/otfdecompress
   /opt/openmpi-intel/1.4/bin/vtcc
   /opt/openmpi-intel/1.4/bin/vtcxx
   /opt/openmpi-intel/1.4/bin/vtf77
   /opt/openmpi-intel/1.4/bin/vtf90
   /opt/openmpi-intel/1.4/bin/vtfilter
   /opt/openmpi-intel/1.4/bin/vtunify
   /opt/openmpi-intel/1.4/etc/openmpi-default-hostfile
   /opt/openmpi-intel/1.4/etc/openmpi-mca-params.conf
   /opt/openmpi-intel/1.4/etc/openmpi-totalview.tcl
   /opt/openmpi-intel/1.4/share/FILTER.SPEC
   /opt/openmpi-intel/1.4/share/GROUPS.SPEC
   /opt/openmpi-intel/1.4/share/METRICS.SPEC
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/ChangeLog
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/LICENSE
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.html
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.pdf
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/ChangeLog
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/LICENSE
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/Readme.html
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.pdf
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.ps.gz
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/opari-logo-100.gif
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/ChangeLog
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/LICENSE
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/otftools.pdf
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/specification.pdf
   /opt/openmpi-intel/1.4/share/vtcc-wrapper-data.txt
   /opt/openmpi-intel/1.4/share/vtcxx-wrapper-data.txt
   /opt/openmpi-intel/1.4/share/vtf77-wrapper-data.txt
   /opt/openmpi-intel/1.4/share/vtf90-wrapper-data.txt

RPM build errors:
Installed (but unpackaged) file(s) found:
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
   /opt/openmpi-intel/1.4/bin/mpiCC-vt
   /opt/openmpi-intel/1.4/bin/mpic++-vt
   /opt/openmpi-intel/1.4/bin/mpicc-vt
   /opt/openmpi-intel/1.4/bin/mpicxx-vt
   /opt/openmpi-intel/1.4/bin/mpif77-vt
   /opt/openmpi-intel/1.4/bin/mpif90-vt
   /opt/openmpi-intel/1.4/bin/ompi-checkpoint
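Every unpackaged file in the list above comes from the VampirTrace/OTF contrib. One possible workaround (untested here; it mirrors the `--enable-contrib-no-build=vt` approach used for the NetBSD package elsewhere on this list) is to keep those files from being installed at all, by adding the option to the configure arguments passed through rpmbuild:

```shell
# Sketch: the same build command as above, with the VampirTrace contrib
# disabled so the VT/OTF files are never installed (and so never go
# unpackaged).  Whether the docs subpackage then builds cleanly with
# build_all_in_one_rpm 0 is untested.
CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb \
    --define 'install_in_opt 1' \
    --define 'install_modulefile 1' \
    --define 'modules_rpm_name Modules' \
    --define 'build_all_in_one_rpm 0' \
    --define 'configure_options --with-tm=/opt/torque --enable-contrib-no-build=vt' \
    --define '_name openmpi-intel' openmpi-1.4.spec
```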
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Hi David,

Last I tried, OpenMPI 1.3.2, PGI (8.0-4) was problematic, particularly for C and C++. I eventually settled on a hybrid build: gcc, g++, and pgf90 (for both the OpenMPI F77 and F90 bindings). Even this required a trick to keep the "-pthread" flag from being inserted among the pgf90 flags (where it doesn't belong). Yes, libtool was also part of the problem back then. You may find my postings about it in this list's archives (early 2009), along with Jeff Squyres' solution for the problem.

I also built a full GNU version (gcc, g++, gfortran, gfortran) of OpenMPI that works well. Intel and hybrid GNU (gcc, g++) + Intel (ifort for F77 and F90) versions of OpenMPI also work right. We need multiple compiler support here anyway.

My $0.02,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

David Turner wrote:
> [original message quoted in full above; snipped here]
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Fascinating. I've not had any real problems building it from scratch with PGI. We are using the PGI 9 compilers, though, for that.

gerry

Gus Correa wrote:
> [quoted text snipped]
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Just to set the record straight: it's a Libtool problem with PGI version 10 (all PGI versions below 10 work fine). This has been reported to the GNU Libtool folks, and patches have already been applied upstream. However, there hasn't been a new Libtool release yet with these patches, so we have to patch during the Open MPI build (hence, the solution is in our autogen.sh script, which sets up the configure/build system).

On Dec 9, 2009, at 4:58 PM, Gerald Creager wrote:

> Fascinating. I've not had any real problems building it from scratch
> with PGI. We are using the PGI 9 compilers, though, for that.
>
> gerry
>
> Gus Correa wrote:
> > [earlier quoted text snipped]
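Not mentioned in this thread, but one possible stopgap while waiting for a Libtool release that carries the fix, assuming the C++ MPI bindings can be sacrificed, is to skip building them entirely, since libmpi_cxx.la is the target that triggers the failing prelink step:

```shell
# Untested sketch: configure Open MPI without the C++ bindings so libtool
# never runs the pgCC --instantiation_dir prelink step that PGI 10 rejects.
# --disable-mpi-cxx is a standard Open MPI 1.x configure option; the PGI
# compiler names are the usual pgcc/pgCC/pgf77/pgf90.
./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 --disable-mpi-cxx
make && make install
```

Obviously this only helps codes that don't use the C++ bindings; the autogen.sh patching route is the complete fix.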
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Hi All,

As I stated in my original posting, I haven't compiled OpenMPI since 1.3.2. I am just trying to be of help, based on previous, and maybe too old, experiences.

The problem I referred to happened with PGI 8.0-4 and OpenMPI 1.3. Most likely the issue is superseded already by the newer OpenMPI configuration scripts, but it did exist and it did involve libtool as well, although it seems to be different from what David Turner just reported with PGI 10, and apparently with PGI 9 also (so he wrote).

These threads document the problem I had, with one solution provided by Jeff Squyres and another by Orion Poplawski:

http://www.open-mpi.org/community/lists/users/2009/04/8724.php
http://www.open-mpi.org/community/lists/users/2009/04/8911.php

Those workarounds may no longer be required, considering what Jeff and Gerry wrote, which is good news, of course.

Thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jeff Squyres wrote:
> Just to set the record straight: it's a Libtool problem with PGI version 10
> (all PGI versions below 10 work fine).
> [rest of quoted thread snipped]
Re: [OMPI users] OpenMPI 1.4 RPM Spec file problem
By the way, if I set build_all_in_one_rpm to 1, it works fine...

--Jim

On Wed, Dec 9, 2009 at 1:47 PM, Jim Kusznir wrote:
> Hi all:
>
> I'm trying to build openmpi-1.4 rpms using my normal (complex) rpm
> build commands, but it's failing. I'm running into two errors:
> [rest of quoted message, including the unpackaged-file list, snipped]
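For reference, the variant reported above as working differs from the failing command in exactly one define:

```shell
# Working variant per the message above: produce a single all-in-one RPM
# instead of the split runtime/devel/docs subpackages.
CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb \
    --define 'install_in_opt 1' \
    --define 'install_modulefile 1' \
    --define 'modules_rpm_name Modules' \
    --define 'build_all_in_one_rpm 1' \
    --define 'configure_options --with-tm=/opt/torque' \
    --define '_name openmpi-intel' openmpi-1.4.spec
```

This suggests the bug is in how the spec file splits files among the subpackages (the VT/OTF files are in none of their %files lists), not in the build itself.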
Re: [OMPI users] Pointers for understanding failure messages on NetBSD
kevin.buck...@ecs.vuw.ac.nz writes:

 CONFIGURE_ARGS+= --enable-contrib-no-build=vt
>>>
>>> I have no idea how NetBSD go about resolving such clashes in the long
>>> term though?
>>
>> I've disabled it the same way for this time; my local package differs
>> from what's in wip:
>>
>> --- PLIST 3 Dec 2009 10:18:00 - 1.5
>> +++ PLIST 9 Dec 2009 08:29:31 -
>> @@ -1,17 +1,11 @@
>> @comment $NetBSD$
>> bin/mpiCC
>> -bin/mpiCC-vt
>> bin/mpic++
>> -bin/mpic++-vt
>
> I am surprised that you are still installing binaries and other files
> with the -vt extension after disabling the vt stuff?

I don't commit that part, since I consider it my own local problem that I have a conflicting package. Other people may have none. You can add the CONFIGURE_ARGS and regenerate the PLIST in the regular way.

>> I can commit my development patches into wip right now,
>> if that helps you.
>
> If your stuff now works then that's ideal. If your build is still
> failing after applying my patches then probably not.
>
> Given that we have something that does work, it would make sense
> to try and merge the two as far as possible before proceeding any
> further.

The benchmark I use to test MPICH and OpenMPI (parallel/skampi) still doesn't work for me. It may be that I have a somewhat unusual network configuration; I'm looking at it.

> As discussed before, there is no real reason to have two getifaddrs
> loops separating out IPv6 and non-IPv6 - that could all be in one
> loop.

Sure, I think that we can do it a bit later.

>> Some patches should be there anyway, since OpenMPI doesn't help with
>> installation of configuration files into the example directory.
>
> OK, as you are the person within the NetBSD community looking
> after OpenMPI, I'll happily work with whatever is in the NetBSD
> repository and patch locally as needed, because I have people here
> who want to use stuff that requires OpenMPI now.
>
> Are you going to upgrade the NetBSD port to build against OpenMPI 1.4
> now that it is available?
> Might be a good time to check the fuzz in the
> existing patches.

http://pkgsrc-wip.cvs.sourceforge.net/viewvc/pkgsrc-wip/wip/openmpi/Makefile

--
HE CE3OH...
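The pkgsrc housekeeping discussed above — regenerating the PLIST after disabling VT, and checking the existing patches for fuzz before updating to 1.4 — might look like this (a sketch; paths assume a standard pkgsrc tree with the wip/openmpi package, and `make print-PLIST` is the usual pkgsrc target for regenerating packing lists):

```shell
# Sketch, assuming a checked-out pkgsrc tree containing wip/openmpi.
cd /usr/pkgsrc/wip/openmpi

# After adding "CONFIGURE_ARGS+= --enable-contrib-no-build=vt" to the
# Makefile, rebuild and regenerate the packing list so the -vt and otf*
# entries disappear:
make clean && make && make print-PLIST > PLIST

# Dry-run the existing patches against a freshly extracted 1.4 tree to
# see which ones apply with fuzz or fail outright (path is illustrative):
cd /tmp/openmpi-1.4
for p in /usr/pkgsrc/wip/openmpi/patches/patch-*; do
    patch --dry-run -p0 < "$p"
done
```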
Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
Hi Gus and List,

First of all, Gus, I want to say thanks... you have been a huge help, and when I get this fixed, I owe you big time! However, the problems continue...

I formatted the HD and reinstalled the OS to make sure that I was working from scratch. I did your step A, which seemed to go fine:

macmanes@macmanes:~$ which mpicc
/home/macmanes/apps/openmpi1.4/bin/mpicc
macmanes@macmanes:~$ which mpirun
/home/macmanes/apps/openmpi1.4/bin/mpirun

Good stuff there... I then compiled the example files:

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
Connectivity test on 8 processes PASSED.

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
..HANGS..NO OUTPUT

This is maddening, because ring_c works... and connectivity_c worked the first time, but not the second. I did it 10 times, and it worked twice. Here is the TOP screenshot:

http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

What is the difference between connectivity_c and ring_c? Under what circumstances should one fail and not the other? I'm off to the Linux forums to see about the Nehalem kernel issues..
Matt

On Wed, Dec 9, 2009 at 13:25, Gus Correa wrote:
> Hi Matthew
>
> There is no point in trying to troubleshoot MrBayes and ABySS
> if not even the OpenMPI test programs run properly.
> You must straighten them out first.
>
> **
>
> Suggestions:
>
> **
>
> A) While you are at OpenMPI, do yourself a favor
> and install it from source in a separate directory.
> Who knows if the OpenMPI package distributed with Ubuntu
> works right on Nehalem?
> Better to install OpenMPI yourself from source code.
> It is not a big deal, and may save you further trouble.
>
> Recipe:
>
> 1) Install gfortran and g++, if you don't have them, using apt-get.
> 2) Put the OpenMPI tarball in, say, /home/matt/downloads/openmpi
> 3) Make another install directory *not in the system directory tree*.
> Something like "mkdir /home/matt/apps/openmpi-X.Y.Z/" (X.Y.Z = version)
> will work.
> 4) cd /home/matt/downloads/openmpi
> 5) ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>    --prefix=/home/matt/apps/openmpi-X.Y.Z
> (Use the prefix flag to install in the directory of item 3.)
> 6) make
> 7) make install
> 8) At the bottom of your /home/matt/.bashrc or .profile file
> put these lines:
>
> export PATH=/home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> export MANPATH=/home/matt/apps/openmpi-X.Y.Z/share/man:`man -w`
> export LD_LIBRARY_PATH=/home/matt/apps/openmpi-X.Y.Z/lib:${LD_LIBRARY_PATH}
>
> (If you use csh/tcsh, use instead:
> setenv PATH /home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> etc.)
>
> 9) Log out and log in again to freshen up the environment variables.
> 10) Do "which mpicc" to check that it is pointing to your newly
> installed OpenMPI.
> 11) Recompile and rerun the OpenMPI test programs
> with 2, 4, 8, 16 processors.
> Use full path names to mpicc and to mpirun
> if the change of PATH above doesn't work right.
>
> **
>
> B) Nehalem is quite new hardware.
> I don't know if the Ubuntu kernel 2.6.31-16 fully supports all > of Nehalem features, particularly hyperthreading, and NUMA, > which are used by MPI programs. > I am not the right person to give you advice about this. > I googled out but couldn't find a clear information about > minimal kernel age/requirements to have Nehalem fully supported. > Some Nehalem owner in the list could come forward and tell. > > ** > > C) On the top screenshot you sent me, please try it again > (after you do item A) but type "f" and "j" to show the processors > that are running each process. > > ** > > D) Also, the screeshot shows 20GB of memory. > This sounds not as a optimal memory for Nehalem, > which tend to be 6GB, 12GB, 24GB, 48GB. > Did you put together the system, or upgraded the memory yourself, > of did you buy the computer as is? > However, this should not break MPI anyway. > > ** > > E) Answering your question: > It is true that different flavors of MPI > used to compile (mpicc) and run (mpiexec) a program would probably > break right away, regardless of the number of processes. > However, when it comes to different versions of the > same MPI flavor (say OpenMPI 1.3.4 and OpenMPI 1.3.3) > I am not
[OMPI users] OMPI 1.4: connectivity_c fails, ring_c and hello_c work
What is the difference between connectivity_c and ring_c or hello_c? Under what circumstances should one fail and not the others?

I am having a huge problem with OpenMPI, and am trying to get to the bottom of it by understanding the differences between the example files: connectivity, hello, and ring.

First off, ring_c and hello_c seem to work fine with up to -np 250. connectivity_c works reliably when -np <5, but less than 30% of the time when -np >6. When it does not work, it just hangs, with no output. Here is a screenshot of TOP with "mpirun -np 8 connectivity_c" hanging:

http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

Under what circumstances should this happen? I am using Ubuntu 9.10, kernel 2.6.31-16, with Nehalem processors. Hyperthreading is enabled.

Thanks!

Matthew MacManes
PhD Candidate
University of California, Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/
Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
Hi Matthew

Save any misinterpretation I may have made of the code:

Hello_c has no real communication, except for a final Barrier synchronization. Each process prints "hello world" and that's it.

Ring probes a little more, with processes Send(ing) and Recv(eiving) messages. Ring just passes a message sequentially along all process ranks, then back to rank 0, and repeats the game 10 times. Rank 0 is in charge of counting turns, decrementing the counter, and printing it (nobody else prints). With 4 processes: 0->1->2->3->0->1... 10 times.

In connectivity, every pair of processes exchanges a message. Therefore it probes all pairwise connections. In verbose mode you can see that.

These programs shouldn't hang at all, if the system were sane. Actually, they should even run with a significant level of oversubscription; say, -np 128 should work easily for all three programs on a powerful machine like yours.

**

Suggestions:

1) Stick to the OpenMPI you compiled.

**

2) You can run connectivity_c in verbose mode:

/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v

(Note the trailing "-v".) It should tell more about who's talking to whom.

**

3) I wonder if there are any BIOS settings that may be required (and perhaps not in place) to make the Nehalem hyperthreading work properly on your computer. You reach the BIOS settings by typing the setup key when the computer boots up. The key varies by BIOS and computer vendor, but shows quickly on the bootup screen. You may ask the computer vendor about the recommended BIOS settings. If you haven't done this before, be careful to change and save only what really needs to change (if anything really needs to change), or the result may be worse. (Overclocking is for gamers, not for genome researchers ... :) )

**

4) What I read about Nehalem DDR3 memory is that it is optimal in configurations that are multiples of 3GB per CPU. Common configs in dual-CPU machines like yours are 6, 12, 24 and 48GB.
The sockets where you install the memory modules also matter. Your computer has 20GB. Did you build the computer, or upgrade the memory yourself? Do you know how the memory is installed, and in which memory sockets? What does the vendor have to say about it? See this:

http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx

**

5) As I said before, typing "f" then "j" on "top" will add a column (labeled "P") that shows on which core each process is running. This will let you observe how the Linux scheduler is distributing the MPI load across the cores. Hopefully it is load-balanced, and different processes go to different cores.

***

It is very disconcerting when MPI processes hang. You are not alone. The reasons are not always obvious. At least in your case there is no network involved or to troubleshoot.

**

I hope it helps,
Gus Correa

-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Matthew MacManes wrote:

Hi Gus and List, 1st of all Gus, I want to say thanks.. you have been a huge help, and when I get this fixed, I owe you big time! However, the problems continue...

I formatted the HD, reinstalled OS to make sure that I was working from scratch. I did your step A, which seemed to go fine:

macmanes@macmanes:~$ which mpicc
/home/macmanes/apps/openmpi1.4/bin/mpicc
macmanes@macmanes:~$ which mpirun
/home/macmanes/apps/openmpi1.4/bin/mpirun

Good stuff there... I then compiled the example files:

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
Connectivity test on 8 processes PASSED.

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
..HANGS..NO OUTPUT

this is maddening because ring_c works.. and connectivity_c worked the 1st time, but not the second... I did it 10 times, and it worked twice.. here is the TOP screenshot:

http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

What is the difference between connectivity_c and ring_c? Under what circumstances should one fail and not the other... I'm