Re: [OMPI users] Pointers for understanding failure messages on NetBSD

2009-12-09 Thread Aleksej Saushev
kevin.buck...@ecs.vuw.ac.nz writes:

> Cc: to the OpenMPI list as the otfdump clash might be of interest
> elsewhere.
>
>> I attach a patch, but it doesn't work and I don't see where the
>> error lies now. It may be that I'm doing something stupid.
>> It produces working OpenMPI-1.3.4 package on Dragonfly though.
>
> Ok, I'll try and merge it in to the working stuff we have here.
> I, obviously, just #ifdef'd for NetBSD as that is all I have to
> try stuff out against.

No need for that actually, we can do it later.
I was using Dragonfly as a platform where it works out of the box.

>> Kevin, I've tried your chunk but it doesn't make things any better.
>> Do you really have working OpenMPI on NetBSD?
>
> Oh yes!
>
> I have placed the tar of current patches from our PkgSrc build in
>
> http://www.ecs.vuw.ac.nz/~kevin/forMPI/openmpi-1.3.4-20091208-netbsd.tar.gz
>
> in case you want to try something out from an actual NetBSD build.

I'm looking at your patches now.

>> (What conflict do you observe with pkgsrc-wip package by the way?)
>>
>
> That was detailed in another email but basically the Open Trace Format
> that the VampirTrace (VT) stuff is looking to install tries to install:
>
> ${LOCALBASE}/bin/otfdump
>
> and that binary is already installed there as part of another
> package.
>
> You can get around this for a NetBSD OpenMPI deployment by adding this
> patch to the PkgSrc Makefile which just removes the VT toolkit:
>
>
> 26a27
>> CONFIGURE_ARGS+=  --enable-contrib-no-build=vt
>
> I have no idea how NetBSD goes about resolving such clashes in the long
> term, though.
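
Outside of pkgsrc, the same effect can be had by passing that flag straight to
Open MPI's configure script; a sketch only, with the install prefix as a
placeholder:

  ./configure --enable-contrib-no-build=vt --prefix=/usr/pkg ...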

I've disabled it the same way for now; my local package differs
from what's in wip:

--- PLIST   3 Dec 2009 10:18:00 -   1.5
+++ PLIST   9 Dec 2009 08:29:31 -
@@ -1,17 +1,11 @@
 @comment $NetBSD$
 bin/mpiCC
-bin/mpiCC-vt
 bin/mpic++
-bin/mpic++-vt
 bin/mpicc
-bin/mpicc-vt
 bin/mpicxx
-bin/mpicxx-vt
 bin/mpiexec
 bin/mpif77
-bin/mpif77-vt
 bin/mpif90
-bin/mpif90-vt
 bin/mpirun
 bin/ompi-checkpoint
 bin/ompi-clean
@@ -21,28 +15,11 @@
 bin/ompi-server
 bin/ompi_info
 bin/opal_wrapper
-bin/opari
 bin/orte-clean
 bin/orte-iof
 bin/orte-ps
 bin/orted
 bin/orterun
-bin/otfaux
-bin/otfcompress
-bin/otfconfig
-bin/otfdecompress
-bin/otfdump
-bin/otfinfo
-bin/otfmerge
-bin/vtcc
-bin/vtcxx
-bin/vtf77
-bin/vtf90
-bin/vtfilter
-bin/vtunify
-etc/openmpi-default-hostfile
-etc/openmpi-mca-params.conf
-etc/openmpi-totalview.tcl
 include/mpi.h
 include/mpif-common.h
 include/mpif-config.h
@@ -79,40 +56,12 @@
 include/openmpi/ompi/mpi/cxx/topology_inln.h
 include/openmpi/ompi/mpi/cxx/win.h
 include/openmpi/ompi/mpi/cxx/win_inln.h
-include/vampirtrace/OTF_CopyHandler.h
-include/vampirtrace/OTF_Definitions.h
-include/vampirtrace/OTF_File.h
-include/vampirtrace/OTF_FileManager.h
-include/vampirtrace/OTF_Filenames.h
-include/vampirtrace/OTF_HandlerArray.h
-include/vampirtrace/OTF_MasterControl.h
-include/vampirtrace/OTF_RBuffer.h
-include/vampirtrace/OTF_RStream.h
-include/vampirtrace/OTF_Reader.h
-include/vampirtrace/OTF_WBuffer.h
-include/vampirtrace/OTF_WStream.h
-include/vampirtrace/OTF_Writer.h
-include/vampirtrace/OTF_inttypes.h
-include/vampirtrace/OTF_inttypes_unix.h
-include/vampirtrace/opari_omp.h
-include/vampirtrace/otf.h
-include/vampirtrace/pomp_lib.h
-include/vampirtrace/vt_user.h
-include/vampirtrace/vt_user.inc
-include/vampirtrace/vt_user_comment.h
-include/vampirtrace/vt_user_comment.inc
-include/vampirtrace/vt_user_count.h
-include/vampirtrace/vt_user_count.inc
 lib/libmca_common_sm.la
 lib/libmpi.la
 lib/libmpi_cxx.la
 lib/libmpi_f77.la
 lib/libopen-pal.la
 lib/libopen-rte.la
-lib/libotf.la
-lib/libvt.a
-lib/libvt.fmpi.a
-lib/libvt.mpi.a
 lib/openmpi/libompi_dbg_msgq.la
 lib/openmpi/mca_allocator_basic.la
 lib/openmpi/mca_allocator_bucket.la
@@ -503,6 +452,9 @@
 man/man7/orte_hosts.7
 man/man7/orte_snapc.7
 share/openmpi/amca-param-sets/example.conf
+share/openmpi/examples/openmpi-default-hostfile
+share/openmpi/examples/openmpi-mca-params.conf
+share/openmpi/examples/openmpi-totalview.tcl
 share/openmpi/help-coll-sync.txt
 share/openmpi/help-dash-host.txt
 share/openmpi/help-ess-base.txt
@@ -548,36 +500,9 @@
 share/openmpi/help-plm-rsh.txt
 share/openmpi/help-ras-base.txt
 share/openmpi/help-rmaps_rank_file.txt
-share/openmpi/mpiCC-vt-wrapper-data.txt
 share/openmpi/mpiCC-wrapper-data.txt
-share/openmpi/mpic++-vt-wrapper-data.txt
 share/openmpi/mpic++-wrapper-data.txt
-share/openmpi/mpicc-vt-wrapper-data.txt
 share/openmpi/mpicc-wrapper-data.txt
-share/openmpi/mpicxx-vt-wrapper-data.txt
 share/openmpi/mpicxx-wrapper-data.txt
-share/openmpi/mpif77-vt-wrapper-data.txt
 share/openmpi/mpif77-wrapper-data.txt
-share/openmpi/mpif90-vt-wrapper-data.txt
 share/openmpi/mpif90-wrapper-data.txt
-share/vampirtrace/FILTER.SPEC
-share/vampirtrace/GROUPS.SPEC
-share/vampirtrace/METRICS.SPEC
-share/vampirtrace/doc/ChangeLog
-share/vampirtrace/doc/LICENSE
-share/vampirtrace/doc/UserManual.html
-share/vampirtrace/doc/UserManual.pdf

[OMPI users] Hanging vs Stopping behaviour in communication failures

2009-12-09 Thread Constantinos Makassikis

Dear all,

Sometimes when running Open MPI jobs, the application hangs. Looking at the
output, I get the following error message:

[ic17][[34562,1],74][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv
] mca_btl_tcp_frag_recv: readv failed: No route to host (113)


I would expect Open MPI to eventually quit with an error in such situations.
Is the observed behaviour (i.e. hanging) the intended one?

If so, what would be the reason(s) behind choosing hanging over stopping?
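
As an aside, when the underlying cause is that some peers try to connect over
an interface that is not reachable from every node, restricting the TCP BTL to
a known-good interface sometimes avoids the failure altogether. A sketch, where
eth0 and the application name are placeholders:

  mpirun --mca btl_tcp_if_include eth0 -np 64 ./my_app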



Best Regards,

--
Constantinos


Re: [OMPI users] mpirun only works when -np <4

2009-12-09 Thread Ashley Pittman
On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
> There are 8 physical cores, or 16 with hyperthreading enabled. 

That should be meaty enough.

> 1st of all, let me say that when I specify that -np is less than 4
> processors (1, 2, or 3), both programs seem to work as expected. Also,
> the non-mpi version of each of them works fine.

Presumably the non-mpi version is serial, however? If so, this doesn't mean
the program is bug-free or that the parallel version isn't broken.
There are any number of apps that don't work above N processes; in fact
probably all programs break for some value of N, although it's normally a
little higher than 3.

> Thus, I am pretty sure that this is a problem with MPI rather that
> with the program code or something else.  
> 
> What happens is simply that the program hangs..

I presume you mean here the output stops?  The program continues to use
CPU cycles but no longer appears to make any progress?

I'm of the opinion that this is most likely an error in your program; I
would start by using either valgrind or padb.

You can run the app under valgrind using the following mpirun options,
this will give you four files named v.log.0 to v.log.3 which you can
check for errors in the normal way.  The "--mca btl tcp,self" option
will disable shared memory which can create false positives.

mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%q{OMPI_COMM_WORLD_RANK}

Alternatively you can run the application, wait for it to hang and then
in another window run my tool, padb, which will show you the MPI message
queues and stack traces which should show you where it's hung,
instructions and sample output are on this page.

http://padb.pittman.org.uk/full-report.html

> There are no error messages, and there is no clue from anything else
> (system working fine otherwise- no RAM issues, etc). It does not hang
> at the same place everytime, sometimes in the very beginning, sometime
> near the middle..  
> 
> Could this an issue with hyperthreading? A conflict with something?

Unlikely; if there were a problem in OMPI running more than 3 processes
it would have been found by now.  I regularly run 8-process applications
on my dual-core netbook alongside all my desktop processes without
issue; it runs fine, a little slowly, but fine.

All this talk about binding and affinity won't help either; process
binding is about squeezing the last 15% of performance out of a system
and making performance reproducible, and it has no bearing on correctness
or scalability.  If you're not running on a dedicated machine (which, with
Firefox running, I guess you aren't), there would be a good case for
leaving it off anyway.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] mpirun only works when -np <4

2009-12-09 Thread Iris Pernille Lohmann
Hi Matthew,

I just had the same problem with my application when using more than 4 cores - 
however, the program didn't hang, it crashed, and I got an error message of 
'address not mapped'. As you say, it happened in different places in the code,
sometimes in the beginning, sometimes in the middle, sometimes at the end. I 
wrote to the list about it, and also got the suggestion that the cause could 
probably be found in my own application. And it could!

I realized that all the different places where the crash happened were the same 
places in the code where I got compiler warnings during compilation. Most of 
the warnings dealt with type mismatches of variables used in different places in the
code.

I cleaned the code to remove the warnings, and after that I've had no problems 
using more than 4 cores.

It may be worth a try for you.

Best regards,
Iris Lohmann



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ashley Pittman
Sent: 09 December 2009 11:38
To: Open MPI Users
Subject: Re: [OMPI users] mpirun only works when -np <4

On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
> There are 8 physical cores, or 16 with hyperthreading enabled. 

That should be meaty enough.

> 1st of all, let me say that when I specify that -np is less than 4
> processors (1, 2, or 3), both programs seem to work as expected. Also,
> the non-mpi version of each of them works fine.

Presumably the non-mpi version is serial however? this this doesn't mean
the program is bug-free or that the parallel version isn't broken.
There are any number of apps that don't work above N processes, in fact
probably all programs break for some value of N, it's normally a little
higher then 3 however.

> Thus, I am pretty sure that this is a problem with MPI rather that
> with the program code or something else.  
> 
> What happens is simply that the program hangs..

I presume you mean here the output stops?  The program continues to use
CPU cycles but no longer appears to make any progress?

I'm of the opinion that this is most likely a error in your program, I
would start by using either valgrind or padb.

You can run the app under valgrind using the following mpirun options,
this will give you four files named v.log.0 to v.log.3 which you can
check for errors in the normal way.  The "--mca btl tcp,self" option
will disable shared memory which can create false positives.

mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%
q{OMPI_COMM_WORLD_RANK} 

Alternatively you can run the application, wait for it to hang and then
in another window run my tool, padb, which will show you the MPI message
queues and stack traces which should show you where it's hung,
instructions and sample output are on this page.

http://padb.pittman.org.uk/full-report.html

> There are no error messages, and there is no clue from anything else
> (system working fine otherwise- no RAM issues, etc). It does not hang
> at the same place everytime, sometimes in the very beginning, sometime
> near the middle..  
> 
> Could this an issue with hyperthreading? A conflict with something?

Unlikely, if there was a problem in OMPI running more than 3 processes
it would have been found by now.  I regularly run 8 process applications
on my dual-core netbook alongside all my desktop processes without
issue, it runs fine, a little slowly but fine.

All this talk about binding and affinity won't help either, process
binding is about squeezing the last 15% of performance out of a system
and making performance reproducible, it has no bearing on correctness or
scalability.  If you're not running on a dedicated machine which with
firefox running I guess you aren't then there would be a good case for
leaving it off anyway.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





[OMPI users] orte error

2009-12-09 Thread Andrew McBride
Hi

I've installed trilinos using the openmpi 1.3.3 libraries. I'm configuring 
openmpi as follows:
./configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc
F77=/usr/local/bin/gfortran --prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC

Trilinos compiles without problem but the tests fail (see below). I'm running a
Mac with OS X 10.6 (Snow Leopard). The MPI tests seem to run fine:

bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c 
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out 
Hello, world, I am 0 of 2
Hello, world, I am 1 of 2

I'm convinced that the problem has to do with the paths and different versions 
of mpi lurking on the mac. I don't want to use the version of openmpi that 
comes bundled with the mac for a different reason. 

Any help would be most appreciated

Andrew


Start testing: Dec 09 12:18 SAST
--
1/534 Testing: Teuchos_BLAS_test_MPI_1
1/534 Test: Teuchos_BLAS_test_MPI_1
Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
"/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
 "-v"
Directory: 
/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
"Teuchos_BLAS_test_MPI_1" start time: Dec 09 12:18 SAST
Output:
--
[macs-mac.local:71058] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[macs-mac.local:71058] Abort before MPI_INIT completed successfully; not able 
to guarantee that all other processes were killed!
--




Re: [OMPI users] orte error

2009-12-09 Thread Ralph Castain
You need to set your LD_LIBRARY_PATH to ~/lib/openmpi-1.3.3/MAC/lib, and
your PATH to ~/lib/openmpi-1.3.3/MAC/bin

It should then run fine.
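
For example, in a bash login shell (a sketch; on OS X the dynamic linker also
honours DYLD_LIBRARY_PATH, so it does not hurt to set that as well):

  export PATH=$HOME/lib/openmpi-1.3.3/MAC/bin:$PATH
  export LD_LIBRARY_PATH=$HOME/lib/openmpi-1.3.3/MAC/lib:$LD_LIBRARY_PATH
  export DYLD_LIBRARY_PATH=$HOME/lib/openmpi-1.3.3/MAC/lib:$DYLD_LIBRARY_PATH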

On Wed, Dec 9, 2009 at 6:29 AM, Andrew McBride wrote:

> Hi
>
> I've installed trilinos using the openmpi 1.3.3 libraries. I'm configuring
> openmpi as follows:
> /configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc
> F77=/usr/local/bin/gfortran -
> prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC
>
> Trilinos compiles without problem but the test fail (see below). I'm
> running a Mac with OSX10.6 (snow leopard). The mpi tests seem to run fine:
>
> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
> Hello, world, I am 0 of 2
> Hello, world, I am 1 of 2
>
> I'm convinced that the problem has to do with the paths and different
> versions of mpi lurking on the mac. I don't want to use the version of
> openmpi that comes bundled with the mac for a different reason.
>
> Any help would be most appreciated
>
> Andrew
>
>
> Start testing: Dec 09 12:18 SAST
> --
> 1/534 Testing: Teuchos_BLAS_test_MPI_1
> 1/534 Test: Teuchos_BLAS_test_MPI_1
> Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1"
> "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
> "-v"
> Directory:
> /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
> "Teuchos_BLAS_test_MPI_1" start time: Dec 09 12:18 SAST
> Output:
> --
> [macs-mac.local:71058] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
> file runtime/orte_init.c at line 125
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>  orte_ess_base_select failed
>  --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [macs-mac.local:71058] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> --
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] orte error

2009-12-09 Thread Andrew McBride
Thanks for your quick response Ralph. 

The errors I get now are of a completely different nature and have to do
with, presumably, calling delete on an unallocated pointer. Now, this probably
has little to do with openmpi and more to do with the compilers used to create
openmpi?

I used gcc version 4.5.0 20090910 when compiling openmpi.

Does anyone have any ideas?

Regards
Andrew

Start testing: Dec 09 15:53 SAST
--
1/534 Testing: Teuchos_BLAS_test_MPI_1
1/534 Test: Teuchos_BLAS_test_MPI_1
Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
"/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
 "-v"
Directory: 
/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
"Teuchos_BLAS_test_MPI_1" start time: Dec 09 15:53 SAST
Output:
--
Teuchos_BLAS_test.exe(72504) malloc: *** error for object 0x100727c00: pointer 
being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
[macs-mac:72504] *** Process received signal ***
[macs-mac:72504] Signal: Abort trap (6)

On 09 Dec 2009, at 3:32 PM, Ralph Castain wrote:

> You need to set your LD_LIBRARY_PATH to ~/lib/openmpi-1.3.3/MAC/lib, and your 
> PATH to ~/lib/openmpi-1.3.3/MAC/bin
> 
> It should then run fine.
> 
> On Wed, Dec 9, 2009 at 6:29 AM, Andrew McBride  
> wrote:
> Hi
> 
> I've installed trilinos using the openmpi 1.3.3 libraries. I'm configuring 
> openmpi as follows:
> /configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc 
> F77=/usr/local/bin/gfortran - 
> prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC
> 
> Trilinos compiles without problem but the test fail (see below). I'm running 
> a Mac with OSX10.6 (snow leopard). The mpi tests seem to run fine:
> 
> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
> Hello, world, I am 0 of 2
> Hello, world, I am 1 of 2
> 
> I'm convinced that the problem has to do with the paths and different 
> versions of mpi lurking on the mac. I don't want to use the version of 
> openmpi that comes bundled with the mac for a different reason.
> 
> Any help would be most appreciated
> 
> Andrew
> 
> 
> Start testing: Dec 09 12:18 SAST
> --
> 1/534 Testing: Teuchos_BLAS_test_MPI_1
> 1/534 Test: Teuchos_BLAS_test_MPI_1
> Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
> "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
>  "-v"
> Directory: 
> /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
> "Teuchos_BLAS_test_MPI_1" start time: Dec 09 12:18 SAST
> Output:
> --
> [macs-mac.local:71058] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> runtime/orte_init.c at line 125
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  orte_ess_base_select failed
>  --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [macs-mac.local:71058] Abort before MPI_INIT completed successfully; not able 
> to guarantee that all other processes were killed!
> --
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] orte error

2009-12-09 Thread Jeff Squyres
Can you run simple MPI applications, like sending a message around in a ring?

On Dec 9, 2009, at 10:18 AM, Andrew McBride wrote:

> Thanks for your quick response Ralph. 
> 
> The errors I get is now are of a completely different nature and have to do 
> with, presumably, calling delete on an unallocated pointer. Now, this 
> probably has little to do with openmpi and more to do with compilers used to 
> create openmpi?
> 
> I used gcc version 4.5.0 20090910 when compiling openmpi.
> 
> Does anyone have any ideas?
> 
> Regards
> Andrew
> 
> Start testing: Dec 09 15:53 SAST
> --
> 1/534 Testing: Teuchos_BLAS_test_MPI_1
> 1/534 Test: Teuchos_BLAS_test_MPI_1
> Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
> "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
>  "-v"
> Directory: 
> /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
> "Teuchos_BLAS_test_MPI_1" start time: Dec 09 15:53 SAST
> Output:
> --
> Teuchos_BLAS_test.exe(72504) malloc: *** error for object 0x100727c00: 
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> [macs-mac:72504] *** Process received signal ***
> [macs-mac:72504] Signal: Abort trap (6)
> 
> On 09 Dec 2009, at 3:32 PM, Ralph Castain wrote:
> 
>> You need to set your LD_LIBRARY_PATH to ~/lib/openmpi-1.3.3/MAC/lib, and 
>> your PATH to ~/lib/openmpi-1.3.3/MAC/bin
>> 
>> It should then run fine.
>> 
>> On Wed, Dec 9, 2009 at 6:29 AM, Andrew McBride  
>> wrote:
>> Hi
>> 
>> I've installed trilinos using the openmpi 1.3.3 libraries. I'm configuring 
>> openmpi as follows:
>> /configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc 
>> F77=/usr/local/bin/gfortran - 
>> prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC
>> 
>> Trilinos compiles without problem but the test fail (see below). I'm running 
>> a Mac with OSX10.6 (snow leopard). The mpi tests seem to run fine:
>> 
>> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
>> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
>> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
>> Hello, world, I am 0 of 2
>> Hello, world, I am 1 of 2
>> 
>> I'm convinced that the problem has to do with the paths and different 
>> versions of mpi lurking on the mac. I don't want to use the version of 
>> openmpi that comes bundled with the mac for a different reason.
>> 
>> Any help would be most appreciated
>> 
>> Andrew
>> 
>> 
>> Start testing: Dec 09 12:18 SAST
>> --
>> 1/534 Testing: Teuchos_BLAS_test_MPI_1
>> 1/534 Test: Teuchos_BLAS_test_MPI_1
>> Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
>> "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
>>  "-v"
>> Directory: 
>> /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
>> "Teuchos_BLAS_test_MPI_1" start time: Dec 09 12:18 SAST
>> Output:
>> --
>> [macs-mac.local:71058] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
>> runtime/orte_init.c at line 125
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>  orte_ess_base_select failed
>>  --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [macs-mac.local:71058] Abort before MPI_INIT completed successfully; not 
>> able to guarantee that all other processes were killed!
>> --
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] orte error

2009-12-09 Thread Andrew McBride
Seemingly. Here is the output of ring:

bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicxx ring_cxx.cc
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out 
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

and here is the output of hello:
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
Hello, world, I am 0 of 2
Hello, world, I am 1 of 2

I presume this output is correct? I guess the issue I have lies elsewhere then?

Andrew

On 09 Dec 2009, at 5:44 PM, Jeff Squyres wrote:

> Can you run simple MPI applications, like sending a message around in a ring?
> 
> On Dec 9, 2009, at 10:18 AM, Andrew McBride wrote:
> 
>> Thanks for your quick response Ralph. 
>> 
>> The errors I get is now are of a completely different nature and have to do 
>> with, presumably, calling delete on an unallocated pointer. Now, this 
>> probably has little to do with openmpi and more to do with compilers used to 
>> create openmpi?
>> 
>> I used gcc version 4.5.0 20090910 when compiling openmpi.
>> 
>> Does anyone have any ideas?
>> 
>> Regards
>> Andrew
>> 
>> Start testing: Dec 09 15:53 SAST
>> --
>> 1/534 Testing: Teuchos_BLAS_test_MPI_1
>> 1/534 Test: Teuchos_BLAS_test_MPI_1
>> Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
>> "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
>>  "-v"
>> Directory: 
>> /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
>> "Teuchos_BLAS_test_MPI_1" start time: Dec 09 15:53 SAST
>> Output:
>> --
>> Teuchos_BLAS_test.exe(72504) malloc: *** error for object 0x100727c00: 
>> pointer being freed was not allocated
>> *** set a breakpoint in malloc_error_break to debug
>> [macs-mac:72504] *** Process received signal ***
>> [macs-mac:72504] Signal: Abort trap (6)
>> 
>> On 09 Dec 2009, at 3:32 PM, Ralph Castain wrote:
>> 
>>> You need to set your LD_LIBRARY_PATH to ~/lib/openmpi-1.3.3/MAC/lib, and 
>>> your PATH to ~/lib/openmpi-1.3.3/MAC/bin
>>> 
>>> It should then run fine.
>>> 
>>> On Wed, Dec 9, 2009 at 6:29 AM, Andrew McBride  
>>> wrote:
>>> Hi
>>> 
>>> I've installed trilinos using the openmpi 1.3.3 libraries. I'm configuring 
>>> openmpi as follows:
>>> /configure CXX=/usr/local/bin/g++ CC=/usr/local/bin/gcc 
>>> F77=/usr/local/bin/gfortran - 
>>> prefix=/Users/andrewmcbride/lib/openmpi-1.3.3/MAC
>>> 
>>> Trilinos compiles without problem but the test fail (see below). I'm 
>>> running a Mac with OSX10.6 (snow leopard). The mpi tests seem to run fine:
>>> 
>>> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpicc hello_c.c
>>> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 hello_
>>> bash-3.2$ ~/lib/openmpi-1.3.3/MAC/bin/mpirun -np 2 a.out
>>> Hello, world, I am 0 of 2
>>> Hello, world, I am 1 of 2
>>> 
>>> I'm convinced that the problem has to do with the paths and different 
>>> versions of mpi lurking on the mac. I don't want to use the version of 
>>> openmpi that comes bundled with the mac for a different reason.
>>> 
>>> Any help would be most appreciated
>>> 
>>> Andrew
>>> 
>>> 
>>> Start testing: Dec 09 12:18 SAST
>>> --
>>> 1/534 Testing: Teuchos_BLAS_test_MPI_1
>>> 1/534 Test: Teuchos_BLAS_test_MPI_1
>>> Command: "/Users/andrewmcbride/lib/openmpi-1.3.3/MAC/bin/mpiexec" "-np" "1" 
>>> "/Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS/Teuchos_BLAS_test.exe"
>>>  "-v"
>>> Directory: 
>>> /Users/andrewmcbride/lib/trilinos-10.0.2-Source/MAC_SL/packages/teuchos/test/BLAS
>>> "Teuchos_BLAS_test_MPI_1" start time: Dec 09 12:18 SAST
>>> Output:
>>> --
>>> [macs-mac.local:71058] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in 
>>> file runtime/orte_init.c at line 125
>>> --
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems.  This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>> 
>>> orte_ess_base_select failed
>>> --> Returned value Not found (-13) instead of 

Re: [OMPI users] orte error

2009-12-09 Thread Jeff Squyres
On Dec 9, 2009, at 10:59 AM, Andrew McBride wrote:

> seemingly. here is the output of ring:
> 
> I presume this output is correct? I guess the issue I have lies elsewhere 
> then?

Yes -- the output looks correct.

Never say "never", but it would *seem* that the error lies in your app 
somewhere.  Can you double check that you're not freeing things that you 
shouldn't?
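
One way to catch the bad free is to run the test under gdb and break on the
routine mentioned in the output above (a sketch; Open MPI lets a single MPI
process run without mpirun, so the -np 1 case can be debugged directly):

  gdb --args ./Teuchos_BLAS_test.exe -v
  (gdb) break malloc_error_break
  (gdb) run
  (gdb) bt    # when it stops, the backtrace shows the offending free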

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] ompi-restart using different nodes

2009-12-09 Thread Josh Hursey
So I tried to reproduce this problem today, and everything worked fine  
for me using the trunk. I haven't tested v1.3/v1.4 yet.


I tried checkpointing with one hostfile then restarting with each of  
the following:

 - No hostfile
 - a hostfile with completely different machines
 - a hostfile with the same machines in the opposite order
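
For reference, the commands involved are roughly the following (a sketch;
<pid> stands for the PID of the original mpirun, and the snapshot name is
whatever ompi-checkpoint printed):

  ompi-checkpoint <pid>
  ompi-restart --hostfile other_hosts ompi_global_snapshot_<pid>.ckpt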


I suspect that the problem is not with Open MPI, but your system  
interacting with BLCR. Usually when people cannot restart on a  
different node they have problems with the 'prelink' feature on Linux.  
BLCR has a FAQ item on this:

  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

So if this is your problem then you will probably not be able to  
checkpoint a single process (non-MPI) application on one node and  
restart on another. Sorry I didn't mention it before, must have  
slipped my mind.
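
If prelink does turn out to be the cause, the usual workaround on Red
Hat-style systems is roughly the following; this is only a sketch, please
double-check it against the BLCR FAQ above, and it needs root:

  # keep prelink from rewriting libraries in the future
  sed -i 's/^PRELINKING=.*/PRELINKING=no/' /etc/sysconfig/prelink
  # undo the prelinking already applied to installed binaries and libraries
  /usr/sbin/prelink --undo --all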


If this turns out to not be the problem, let me know and I'll take  
another look. Also send me any error messages that are displayed.


-- Josh


On Dec 8, 2009, at 1:39 PM, Jonathan Ferland wrote:

I did the same test using 1.3.4 and still the same issue  I also  
tried to use the tm interface instead of specifying the hostfile,  
same result.


thanks,

Jonathan

Josh Hursey wrote:
Though I do not test this scenario (using hostfiles) very often, it  
used to work. The ompi-restart command takes a --hostfile (or -- 
machinefile) argument that is passed directly to the mpirun  
command. I wonder if something broke recently with this handoff. I  
can certainly checkpoint with one set of nodes/allocation and  
restart with another, but most/all of my testing occurs in a SLURM  
environment, so no need for an explicit hostfile.


I'll take a look to see if I can reproduce, but probably will not  
be until next week.


-- Josh

On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote:


Hi,

I am trying to use BLCR checkpointing in mpi. I am currently able  
to run my application using some hostfile, checkpoint the run, and  
then restart the application using the same hostfile. The thing I  
would like to do is to restart the application with a different  
hostfile. But this leads to a segfault using 1.3.3.


Is it possible to restart the application using a different  
hostfile (we are using pbs to create the hostfile, so each new  
restart might be on different nodes), how can we do that? If no,  
do you plan to include this in a future release?


thanks

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--




--
Jonathan Ferland, analyste en calcul scientifique
RQCHP (Réseau québécois de calcul de haute performance)

bureau S-252, pavillon Roger-Gaudry, Université de Montréal
téléphone   : 514 343-6111 poste 8852
télécopieur : 514 343-2155
--

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] ompi-restart using different nodes

2009-12-09 Thread Jonathan Ferland

Hi Josh,

Thanks for helping. That solved the problem!!!

cheers,

Jonathan

Josh Hursey wrote:
So I tried to reproduce this problem today, and everything worked fine 
for me using the trunk. I haven't tested v1.3/v1.4 yet.


I tried checkpointing with one hostfile then restarting with each of 
the following:

 - No hostfile
 - a hostfile with completely different machines
 - a hostfile with the same machines in the opposite order


I suspect that the problem is not with Open MPI, but your system 
interacting with BLCR. Usually when people cannot restart on a 
different node they have problems with the 'prelink' feature on Linux. 
BLCR has a FAQ item on this:

  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

So if this is your problem then you will probably not be able to 
checkpoint a single process (non-MPI) application on one node and 
restart on another. Sorry I didn't mention it before, must have 
slipped my mind.


If this turns out to not be the problem, let me know and I'll take 
another look. Also send me any error messages that are displayed.


-- Josh


On Dec 8, 2009, at 1:39 PM, Jonathan Ferland wrote:

I did the same test using 1.3.4 and still the same issue  I also 
tried to use the tm interface instead of specifying the hostfile, 
same result.


thanks,

Jonathan

Josh Hursey wrote:
Though I do not test this scenario (using hostfiles) very often, it 
used to work. The ompi-restart command takes a --hostfile (or 
--machinefile) argument that is passed directly to the mpirun 
command. I wonder if something broke recently with this handoff. I 
can certainly checkpoint with one set of nodes/allocation and 
restart with another, but most/all of my testing occurs in a SLURM 
environment, so no need for an explicit hostfile.


I'll take a look to see if I can reproduce, but probably will not be 
until next week.


-- Josh

On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote:


Hi,

I am trying to use BLCR checkpointing in mpi. I am currently able 
to run my application using some hostfile, checkpoint the run, and 
then restart the application using the same hostfile. The thing I 
would like to do is to restart the application with a different 
hostfile. But this leads to a segfault using 1.3.3.


Is it possible to restart the application using a different 
hostfile (we are using pbs to create the hostfile, so each new 
restart might be on different nodes), how can we do that? If no, do 
you plan to include this in a future release?


thanks

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--




--
Jonathan Ferland, analyste en calcul scientifique
RQCHP (Réseau québécois de calcul de haute performance)

bureau S-252, pavillon Roger-Gaudry, Université de Montréal
téléphone   : 514 343-6111 poste 8852
télécopieur : 514 343-2155
--

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--




--
Jonathan Ferland, analyste en calcul scientifique
RQCHP (Réseau québécois de calcul de haute performance)

bureau S-252, pavillon Roger-Gaudry, Université de Montréal
téléphone   : 514 343-6111 poste 8852
télécopieur : 514 343-2155
--



Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Matthew MacManes
Hi Gus,

Interestingly, the connectivity_c test works fine with -np
<8. For -np >8 it works some of the time; other times it HANGS. I have got to
believe that this is a big clue!! Also, when it hangs, sometimes I get the
message "mpirun was unable to cleanly terminate the daemons on the nodes shown
below". Note that NO nodes are shown below. Once, I got -np 250 to pass the
connectivity test, but I was not able to replicate this reliably, so I'm not
sure if it was a fluke, or what. Here is a link to a screenshot of top when
connectivity_c is hung with -np 14; I see that 2 processes are only at 50% CPU
usage.. Hmm.

http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw=directlink

The other tests, ring_c and hello_c, as well as the cxx versions of these guys,
work with all values of -np.

Using -mca mpi-paffinity_alone 1 I get the same behavior. 

I agree that I should worry about the mismatch between where the libraries
are installed versus where I am telling my programs to look for them. Would
this type of mismatch cause behavior like what I am seeing, i.e. working with
a small number of processors but failing with larger? It seems like a
mismatch would have the same effect regardless of the number of processors
used. Maybe I am mistaken. Anyway, to address this: which mpirun gives me
/usr/local/bin/mpirun, so I configure with ./configure
--with-mpi=/usr/local/bin/mpirun and run with /usr/local/bin/mpirun -np X ...
This should

uname -a gives me: Linux macmanes 2.6.31-16-generic #52-Ubuntu SMP Thu Dec 3 
22:07:16 UTC 2006 x86_64 GNU/Linux
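
A quick way to check for that kind of mismatch is to compare what the shell
picks up against what the wrappers report (a sketch):

  which mpirun mpicc
  mpirun --version
  mpicc -showme
  ompi_info | head

If those all point at the same installation (/usr/local in my case), the
library mismatch is probably not the culprit.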

Matt

On Dec 8, 2009, at 8:50 PM, Gus Correa wrote:

> Hi Matthew
> 
> Please see comments/answers inline below.
> 
> Matthew MacManes wrote:
>> Hi Gus, Thanks for your ideas.. I have a few questions, and will try to 
>> answer yours in hopes of solving this!!
> 
> A simple way to test OpenMPI on your system is to run the
> test programs that come with the OpenMPI source code,
> hello_c.c, connectivity_c.c, and ring_c.c:
> http://www.open-mpi.org/
> 
> Get the tarball from the OpenMPI site, gzip and untar it,
> and look for it in the "examples" directory.
> Compile it with /your/path/to/openmpi/bin/mpicc hello_c.c
> Run it with /your/path/to/openmpi/bin/mpiexec -np X a.out
> using X = 2, 4, 8, 16, 32, 64, ...
> 
> This will tell if your OpenMPI is functional,
> and if you can run on many Nehalem cores,
> even with oversubscription perhaps.
> It will also set the stage for further investigation of your
> actual programs.
> 
> 
>> Should I worry about setting things like --num-cores --bind-to-cores?  This, 
>> I think, gets at your questions about processor affinity.. Am I right? I 
>> could not exactly figure out the -mca mpi-paffinity_alone stuff...
> 
> I use the simple minded -mca mpi-paffinity_alone 1.
> This is probably the easiest way to assign a process to a core.
> There more complex  ways in OpenMPI, but I haven't tried.
> Indeed, -mca mpi-paffinity_alone 1 does improve performance of
> our programs here.
> There is a chance that without it the 16 virtual cores of
> your Nehalem get confused with more than 3 processes
> (you reported that -np > 3 breaks).
> 
> Did you try adding just -mca mpi-paffinity_alone 1  to
> your mpiexec command line?
> 
> 
>> 1. Additional load: nope. nothing else, most of the time not even firefox. 
> 
> Good.
> Turn off firefox, etc, to make it even better.
> Ideally, use runlevel 3, no X, like a computer cluster node,
> but this may not be required.
> 
>> 2. RAM: no problems apparent when monitoring through TOP. Interesting, I did 
>> wonder about oversubscription, so I tried the option --nooversubscription, 
>> but this gave me an error mssage.
> 
> Oversubscription from your program would only happen if
> you asked for more processes than available cores, i.e.,
> -np > 8 (or "virtual" cores, in case of Nehalem hyperthreading,
> -np > 16).
> Since you have -np=4 there is no oversubscription,
> unless you have other external load (e.g. Matlab, etc),
> but you said you don't.
> 
> Yet another possibility would be if your program is threaded
> (e.g. using OpenMP along with MPI), but considering what you
> said about OpenMP I would guess the programs don't use it.
> For instance, you launch the program with 4 MPI processes,
> and each process decides to start, say, 8 OpenMP threads.
> You end up with 32 threads and 8 (real) cores (or 16 hyperthreaded
> ones on Nehalem).
> 
> 
> What else does top say?
> Any hog processes (memory- or CPU-wise)
> besides your program processes?
> 
>> 3. I have not tried other MPI flavors.. Ive been speaking to the authors of 
>> the programs, and they are both using openMPI.  
> 
> I was not trying to convince you to use another MPI.
> I use MPICH2 also, but OpenMPI reigns here.
> The idea or trying it with MPICH2 was just to check whether OpenMPI
> is causing the problem, but I don't think it is.
> 
>> 4. I don't think that this 

Re: [OMPI users] Problem with mpirun -preload-binary option

2009-12-09 Thread Josh Hursey
I verified that the preload functionality works on the trunk. It seems  
to be broken on the v1.3/v1.4 branches. The version of this code has  
changed significantly between the v1.3/v1.4 and the trunk/v1.5  
versions. I filed a bug about this so it does not get lost:

  https://svn.open-mpi.org/trac/ompi/ticket/2139

Can you try this again with either the trunk or v1.5 to see if that  
helps with the preloading?


However, you need to fix the password-less login issue before anything
else will work. If mpirun is prompting you for a password, then it
will not work properly.
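
Setting up password-less ssh in both directions is the usual public key plus
authorized_keys setup, run on each machine towards the other. A sketch, using
the user and host names from the messages below:

  # on gordon-desktop
  ssh-keygen -t rsa
  ssh-copy-id gordon@gordon-laptop

  # on gordon-laptop
  ssh-keygen -t rsa
  ssh-copy-id gordon@gordon-desktop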


-- Josh

On Nov 12, 2009, at 3:50 PM, Qing Pang wrote:

Now that I have passwordless-ssh set up both directions, and  
verified working - I still have the same problem.
I'm able to run ssh/scp on both master and client nodes - (at this  
point, they are pretty much the same), without being asked for  
password. And mpirun works fine if I have the executable put in the  
same directory on both nodes.


But when I tried the preload-binary option, I still have the same  
problem - it asked me for the password of the node running mpirun,  
and then tells that scp failed.


---


Josh Wrote:

Though the --preload-binary option was created while building the  
checkpoint/restart functionality it does not depend on checkpoint/ 
restart function in any way (just a side effect of the initial  
development).


The problem you are seeing is a result of the computing environment  
setup of password-less ssh. The --preload-binary command uses  
'scp' (at the moment) to copy the files from the node running mpirun  
to the compute nodes. The compute nodes are the ones that call  
'scp', so you will need to setup password-less ssh in both directions.


-- Josh

On Nov 11, 2009, at 8:38 AM, Ralph Castain wrote:


I'm no expert on the preload-binary option - but I would suspect that

is the case given your observations.


That option was created to support checkpoint/restart, not for what
you are attempting to do. Like I said, you -should- be able to use  
it for that purpose, but I expect you may hit a few quirks like this  
along the way.


On Nov 11, 2009, at 9:16 AM, Qing Pang wrote:

> Thank you very much for your help! I believe I do have password- 
less
ssh set up, at least from master node to client node (desktop ->  
laptop in my case). If I type >ssh node1 on my desktop terminal, I  
am able to get to the laptop node without being asked for password.  
And as I mentioned, if I copy the example executable from desktop to  
the laptop node using scp, then I am able to run it from desktop  
using both nodes.

> Back to the preload-binary problem - I am asked for the password of
my master node - the node I am working on - not the remote client  
node. Do you mean that I should set up password-less ssh in both  
direction? Does the client node need to access master node through  
password-less ssh to make the preload-binary option work?

>
>
> Ralph Castain Wrote:
>
> It -should- work, but you need password-less ssh setup. See our FAQ
> for how to do that, if you are unfamiliar with it.
>
> On Nov 10, 2009, at 2:02 PM, Qing Pang wrote:
>
> I'm having problem getting the mpirun "preload-binary" option to  
work.

>>
>> I'm using ubutu8.10 with openmpi 1.3.3, nodes connected with

Ethernet cable.
>> If I copy the executable to client nodes using scp, then do  
mpirun,

everything works.

>>
>> But I really want to avoid the copying, so I tried the

-preload-binary option.

>>
>> When I typed the command on my master node as below (gordon- 
desktop

is my master node, and gordon-laptop is the client node):

>>
>>

--

>> gordon_at_gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun
>> -machinefile machine.linux -np 2 --preload-binary $(pwd)/ 
hello_c.out

>>

--

>>
>> I got the following:
>>
>> gordon_at_gordon-desktop's password: (I entered my password here,
why am I asked for the password? I am working under this account  
anyway)

>>
>>
>> WARNING: Remote peer ([[18118,0],1]) failed to preload a file.
>>
>> Exit Status: 256
>> Local File:

/tmp/openmpi-sessions-gordon_at_gordon-laptop_0/18118/0/hello_c.out
>> Remote File: /home/gordon/Desktop/openmpi-1.3.3/examples/ 
hello_c.out

>> Command:
>> scp

gordon-desktop:/home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
>> /tmp/openmpi-sessions-gordon_at_gordon-laptop_0/18118/0/ 
hello_c.out

>>
>> Will continue attempting to launch the process(es).
>>

--

>>

--

>> mpirun was unable to launch the specified application as it could

not access

>> or execute an executable:
>>
>> Executable: 

Re: [OMPI users] mpirun only works when -np <4

2009-12-09 Thread Matthew MacManes
Thanks Ashley,  I'll try your tool..

I would think that this is an error in the programs I am trying to use, too, 
but this is a problem with 2 different programs, written by 2 different 
groups.. One of them might be bad, but both.. seems unlikely. 

Interestingly, the connectivity_c test that is included with OMPI works fine
with -np <8. For -np >8 it works some of the time; other times it HANGS. I
have got to believe that this is a big clue!! Also, when it hangs, sometimes I
get the message "mpirun was unable to cleanly terminate the daemons on the
nodes shown below". Note that NO nodes are shown below. Once, I got -np 250 to
pass the connectivity test, but I was not able to replicate this reliably, so
I'm not sure if it was a fluke, or what. Here is a link to a screenshot of top
when connectivity_c is hung with -np 14; I see that 2 processes are only at 50%
CPU usage.. Hmm.

http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw=directlink

The other tests, ring_c and hello_c, as well as the cxx versions of these guys,
work with all values of -np.

Unfortunately, I could not get valgrind to work...

Thanks, Matt



On Dec 9, 2009, at 2:37 AM, Ashley Pittman wrote:

> On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
>> There are 8 physical cores, or 16 with hyperthreading enabled. 
> 
> That should be meaty enough.
> 
>> 1st of all, let me say that when I specify that -np is less than 4
>> processors (1, 2, or 3), both programs seem to work as expected. Also,
>> the non-mpi version of each of them works fine.
> 
> Presumably the non-mpi version is serial however? this this doesn't mean
> the program is bug-free or that the parallel version isn't broken.
> There are any number of apps that don't work above N processes, in fact
> probably all programs break for some value of N, it's normally a little
> higher then 3 however.
> 
>> Thus, I am pretty sure that this is a problem with MPI rather that
>> with the program code or something else.  
>> 
>> What happens is simply that the program hangs..
> 
> I presume you mean here the output stops?  The program continues to use
> CPU cycles but no longer appears to make any progress?
> 
> I'm of the opinion that this is most likely a error in your program, I
> would start by using either valgrind or padb.
> 
> You can run the app under valgrind using the following mpirun options,
> this will give you four files named v.log.0 to v.log.3 which you can
> check for errors in the normal way.  The "--mca btl tcp,self" option
> will disable shared memory which can create false positives.
> 
> mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%
> q{OMPI_COMM_WORLD_RANK} 
> 
> Alternatively you can run the application, wait for it to hang and then
> in another window run my tool, padb, which will show you the MPI message
> queues and stack traces which should show you where it's hung,
> instructions and sample output are on this page.
> 
> http://padb.pittman.org.uk/full-report.html
> 
>> There are no error messages, and there is no clue from anything else
>> (system working fine otherwise- no RAM issues, etc). It does not hang
>> at the same place everytime, sometimes in the very beginning, sometime
>> near the middle..  
>> 
>> Could this an issue with hyperthreading? A conflict with something?
> 
> Unlikely, if there was a problem in OMPI running more than 3 processes
> it would have been found by now.  I regularly run 8 process applications
> on my dual-core netbook alongside all my desktop processes without
> issue, it runs fine, a little slowly but fine.
> 
> All this talk about binding and affinity won't help either, process
> binding is about squeezing the last 15% of performance out of a system
> and making performance reproducible, it has no bearing on correctness or
> scalability.  If you're not running on a dedicated machine which with
> firefox running I guess you aren't then there would be a good case for
> leaving it off anyway.
> 
> Ashley,
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

_
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/







Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-09 Thread Josh Hursey


On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:


Hi Josh,

You were right. The main problem was the /tmp. SGE uses a scratch  
directory in which the jobs have temporary files. Setting TMPDIR to /tmp,
checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). The -v option adds these lines (see ERROR2).


It is concerning that ompi-restart is segfault'ing when it errors out.  
The error message is being generated between the launch of the opal-restart
starter command and when we try to exec(cr_restart). Usually
the failure is related to a corruption of the metadata stored in the  
checkpoint.


Can you send me the file below:
  ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data


I was able to reproduce the segv (at least I think it is the same  
one). We failed to check the validity of a string when we parse the  
metadata. I committed a fix to the trunk in r22290, and requested that  
the fix be moved to the v1.4 and v1.5 branches. If you are interested  
in seeing when they get applied you can follow the following tickets:

  https://svn.open-mpi.org/trac/ompi/ticket/2140
  https://svn.open-mpi.org/trac/ompi/ticket/2141

Can you try the trunk to see if the problem goes away? The development  
trunk and v1.5 series have a bunch of improvements to the C/R  
functionality that were never brought over to the v1.3/v1.4 series.




I was trying to use ssh instead of rsh, but it was impossible. By
default it should use ssh and, if it finds a problem, fall back to
rsh. It seems that ssh doesn't work, because it always uses rsh.

If I change this MCA parameter, it still uses rsh.
If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to use
ssh and doesn't work. I got --> "bash: orted: command not found"
and the MPI process dies.
The command it tries to execute is the following, and I haven't yet
found the reason why it doesn't find orted, because I set /etc/bashrc
so that the right path is always set, and I have the right path
in my application. (see ERROR4)


This seems like an SGE specific issue, so a bit out of my domain.  
Maybe others have suggestions here.
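
Independent of SGE, two things that often help with "bash: orted: command not
found" over ssh (a sketch only; the install path is taken from the logs below
and the application name is a placeholder):

  # tell mpirun where Open MPI lives on the remote nodes
  mpirun --prefix /opt/cesga/openmpi-1.3.3 -am ft-enable-cr -np 2 ./a.out

  # or bake the prefix in when configuring Open MPI itself
  ./configure --enable-orterun-prefix-by-default ...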


-- Josh




Many thanks!,
Sergio

P.S. Sorry about these long emails. I just try to show you useful  
information to identify my problems.



ERROR 1

> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
>  
--
> Error: Unable to obtain the proper restart command to restart from  
the

>checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
>  
--
>  
--
> Error: Unable to obtain the proper restart command to restart from  
the

>checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
>  
--

> [compute-3-18:28792] *** Process received signal ***
> [compute-3-18:28792] Signal: Segmentation fault (11)
> [compute-3-18:28792] Signal code:  (128)
> [compute-3-18:28792] Failing at address: (nil)
> [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)  
[0x33bb669135]
> [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen- 
pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen- 
pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen- 
pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]

> [compute-3-18:28792] [ 5] opal-restart [0x40312a]
> [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6(__libc_start_main 
+0xdb) [0x33bb61c3fb]

> [compute-3-18:28792] [ 7] opal-restart [0x40272a]
> [compute-3-18:28792] *** End of error message ***
> [compute-3-18:28793] *** Process received signal ***
> [compute-3-18:28793] Signal: Segmentation fault (11)
> [compute-3-18:28793] Signal code:  (128)
> [compute-3-18:28793] Failing at address: (nil)
> [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0 [0x33bbf0c430]
> [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25)  
[0x33bb669135]
> [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/libopen- 
pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/libopen- 
pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/libopen- 
pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]

> [compute-3-18:28793] [ 5] opal-restart [0x40312a]
> [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6(__libc_start_main 
+0xdb) [0x33bb61c3fb]

> [compute-3-18:28793] [ 7] opal-restart [0x40272a]
> [compute-3-18:28793] *** End of error message ***
>  

Re: [OMPI users] Changing location where checkpoints are saved

2009-12-09 Thread Josh Hursey
I took a look at the checkpoint staging and preload functionality. It  
seems that the combination of the two is broken on the v1.3 and v1.4  
branches. I filed a bug about it so that it would not get lost:

  https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual  
fix is much more involved. I don't know when I'll get around to  
finishing this bug fix for that branch. :(


However, the current development trunk and v1.5 are known to have a  
working version of this feature. Can you try the trunk or v1.5 and see  
if this fixes the problem?


-- Josh

P.S. If you are interested, we have a slightly better version of the  
documentation, hosted at the link below:

  http://osl.iu.edu/research/ft/ompi-cr/

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:


Josh Hursey wrote:

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and  
1 values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers  
indicate higher priority ?


By searching in the archives of the mailing list I found two  
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/ 
2008/09/6534.php (for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/ 
2009/05/9385.php (for restarting)


Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid
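
For completeness, the same settings can also be passed directly on the
mpirun command line with -mca flags instead of mca-params.conf (a sketch,
using the values above):

mpirun -n 2 -machinefile machines -am ft-enable-cr \
    -mca snapc_base_store_in_place 0 \
    -mca crs_base_snapshot_dir /tmp \
    -mca snapc_base_global_snapshot_dir $HOME \
    a.out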



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node  
Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint  
of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command  
message.

[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot  
Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command  
message.

[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot  
Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command  
message.

[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot  
Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command  
message.

[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot  
Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command  
message.

[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot  
Reference: ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with  
status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with  
status 3

--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in  
file ../../../../../orte/mca/snapc/full/snapc_full_global.c at  
line 1054


This is a warning about creating the global snapshot directory  
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0).  
It seems to indicate that the directory existed when the file  
gather started.


A couple things to check:
- Did you clean out the /tmp on all of the nodes with any files  
starting with "opal" or "ompi"?
- Does the error go away when you set  
(snapc_base_global_snapshot_dir=$HOME)?
- Could you try running against a v1.3 release? (I wonder if this  
feature has been broken on the trunk)


Let me know what you find. In the next couple days, I'll try to  
test the trunk again with this feature to make sure that it is  
still working on my test machines.


-- Josh

Hello Josh,

I have switched to v1.3 and re-run with  
snapc_base_global_snapshot_dir=/tmp or $HOME

with a clean /tmp.

In both cases I get the same error as before :-(

I don't know if 

Re: [OMPI users] Pointers for understanding failure messages on NetBSD

2009-12-09 Thread Kevin . Buckley
>> 26a27
>>> CONFIGURE_ARGS+=  --enable-contrib-no-build=vt
>>
>> I have no idea how NetBSD go about resolving such clashes in the long
>> term though?
>
> I've disabled it the same way for this time, my local package differs
> from what's in wip:
>
> --- PLIST 3 Dec 2009 10:18:00 -   1.5
> +++ PLIST 9 Dec 2009 08:29:31 -
> @@ -1,17 +1,11 @@
>  @comment $NetBSD$
>  bin/mpiCC
> -bin/mpiCC-vt
>  bin/mpic++
> -bin/mpic++-vt

I am surprised that you are still installing binaries and other files
with the -vt extension after disabling the vt stuff ?

> I can commit my development patches into wip right now,
> if that helps you.

If your stuff now works then that's ideal. If your build is still
failing after applying my patches then probably not.

Given that we have something that does work, it would make sense
to try and merge the two as far as possible before proceeding any
further.

As discussed before, there is no real reason to have two getifaddrs
loops separating out IPv6 and non-IPv6 - that could all be in one
loop.
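
For illustration, a minimal standalone sketch of such a single loop
(illustrative only, not the actual opal/util code; it just prints what
it finds):

#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    struct ifaddrs *ifap, *cur;

    if (getifaddrs(&ifap) != 0)
        return 1;

    /* one pass over all interfaces, handling both address families */
    for (cur = ifap; cur != NULL; cur = cur->ifa_next) {
        if (cur->ifa_addr == NULL)
            continue;
        switch (cur->ifa_addr->sa_family) {
        case AF_INET:
            printf("%s: IPv4\n", cur->ifa_name);
            break;
        case AF_INET6:
            printf("%s: IPv6\n", cur->ifa_name);
            break;
        default:
            break;  /* skip other address families */
        }
    }
    freeifaddrs(ifap);
    return 0;
}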

> Some patches should be there anyway, since OpenMPI doesn't help with
> installation of configuration files into example directory anyway.

OK, as you are the person within the NetBSD community looking
after OpenMPI, I'll happily work with whatever is in the NetBSD
repository and patch locally as needed, because I have people here
who want to use stuff that requires OpenMPI now.

Are you going to upgrade the NetBSD port to build against OpenMPI 1.4
now that it is available? Might be a good time to check the fuzz in the
existing patches.

Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



[OMPI users] Problem building OpenMPI with PGI compilers

2009-12-09 Thread David Turner

Hi all,

My first ever attempt to build OpenMPI.  Platform is Sun Sunfire x4600
M2 servers, running Scientific Linux version 5.3.  Trying to build
OpenMPI 1.4 (as of today; same problems yesterday with 1.3.4).
Trying to use PGI version 10.0.

As a first attempt, I set CC, CXX, F77, and FC, then did "configure"
and "make".  Make ends with:

libtool: link:  pgCC --prelink_objects --instantiation_dir Template.dir 
  .libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o 
.libs/win.o .libs/file.o   -Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/ompi/.libs 
-Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs 
-Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs 
-Wl,--rpath -Wl,/global/common/tesla/usg/openmpi/1.4/lib 
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs 
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs 
../../../ompi/.libs/libmpi.so 
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs/libopen-rte.so 
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs/libopen-pal.so 
-ldl -lnsl -lutil -lpthread

pgCC-Error-Unknown switch: --instantiation_dir
make[2]: *** [libmpi_cxx.la] Error 1

So I Googled "instantiation_dir openmpi", which led me to:

http://cia.vc/stats/project/OMPI?s_message=3

where I see:

There's still something wrong with the C++ support, however; I get
errors about a template directory switch when compiling the C++ MPI
bindings (doesn't happen with PGI 9.0). Still working on this... it
feels like it's still a Libtool issue because OMPI is not putting in
this compiler flag as far as I can tell:

{{{
/bin/sh ../../../libtool --tag=CXX --mode=link pgCC -g -version-info 
0:0:0 -export-dynamic -o libmpi_cxx.la -rpath /home/jsquyres/bogus/lib 
mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo 
../../../ompi/libmpi.la -lnsl -lutil -lpthread

libtool: link: tpldir=Template.dir
libtool: link: rm -rf Template.dir
libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir 
.libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o 
.libs/win.o .libs/file.o -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/ompi/.libs -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/orte/.libs -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/opal/.libs -Wl,--rpath 
-Wl,/home/jsquyres/bogus/lib -L/users/jsquyres/svn/ompi-1.3/orte/.libs 
-L/users/jsquyres/svn/ompi-1.3/opal/.libs ../../../ompi/.libs/libmpi.so 
/users/jsquyres/svn/ompi-1.3/orte/.libs/libopen-rte.so 
/users/jsquyres/svn/ompi-1.3/opal/.libs/libopen-pal.so -ldl -lnsl -lutil 
-lpthread

pgCC-Error-Unknown switch: --instantiation_dir
make: *** [libmpi_cxx.la] Error 1
}}}

I noticed the comment "doesn't happen with PGI 9.0", so I re-did the
entire process with PGI 9.0 instead of 10.0, but I get the same error!

Any suggestions?  Let me know if I should provide full copies of the
configure and make output.  Thanks!

--
Best regards,

David Turner
User Services Group      email: dptur...@lbl.gov
NERSC Division           phone: (510) 486-4027
Lawrence Berkeley Lab    fax:   (510) 486-4316


[OMPI users] OpenMPI 1.4 RPM Spec file problem

2009-12-09 Thread Jim Kusznir
Hi all:

I'm trying to build openmpi-1.4 rpms using my normal (complex) rpm
build commands, but it's failing.  I'm running into two errors:

One (on gcc only): the D_FORTIFY_SOURCE build failure.  I've had to
move the 'if test "$using_gcc" = 0; then' line down to after the
RPM_OPT_FLAGS= assignment that includes D_FORTIFY_SOURCE; otherwise the
compile blows up.
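
Roughly, the reordering amounts to this (a simplified shell sketch, not
the verbatim spec; the flag value is a placeholder):

# distro opt flags get assigned first (they may include -D_FORTIFY_SOURCE)
RPM_OPT_FLAGS="-O2 -g -D_FORTIFY_SOURCE=2"

# the non-gcc check sits *below* that assignment, so the flag is
# actually stripped for icc/pgcc builds instead of being re-added
if test "$using_gcc" = 0; then
    RPM_OPT_FLAGS=`echo $RPM_OPT_FLAGS | sed -e 's@-D_FORTIFY_SOURCE[^ ]*@@g'`
fi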

The second, and in my opinion more serious, rpm spec file bug is
something in the files specification.  I build multiple versions of
OpenMPI to accommodate the collection of compilers I use (on this
machine, I have Intel 10.1 and GCC, and will have to add 9.1 per user
request); on others, I use PGI and GCC.  In any case, here's my build
command for Intel:

CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb --define
'install_in_opt 1' --define 'install_modulefile 1' --define
'modules_rpm_name Modules' --define 'build_all_in_one_rpm 0'  --define
'configure_options --with-tm=/opt/torque' --define '_name
openmpi-intel' openmpi-1.4.spec

Unfortunately, the filespec is somehow broken and it ends up missing
most (all?) of the files, failing in the final stage of RPM creation:

---
Processing files: openmpi-intel-docs-1.4-1
Finding  Provides: /usr/lib/rpm/find-provides openmpi-intel
Finding  Requires: /usr/lib/rpm/find-requires openmpi-intel
Finding  Supplements: /usr/lib/rpm/find-supplements openmpi-intel
Requires(rpmlib): rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(CompressedFileNames) <= 3.0.4-1
Requires: openmpi-intel-runtime
Checking for unpackaged file(s): /usr/lib/rpm/check-files
/var/tmp/openmpi-intel-1.4-1-root
error: Installed (but unpackaged) file(s) found:
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
   /opt/openmpi-intel/1.4/bin/mpiCC-vt
   /opt/openmpi-intel/1.4/bin/mpic++-vt
   /opt/openmpi-intel/1.4/bin/mpicc-vt
   /opt/openmpi-intel/1.4/bin/mpicxx-vt
   /opt/openmpi-intel/1.4/bin/mpif77-vt
   /opt/openmpi-intel/1.4/bin/mpif90-vt
   /opt/openmpi-intel/1.4/bin/ompi-checkpoint
   /opt/openmpi-intel/1.4/bin/ompi-clean
   /opt/openmpi-intel/1.4/bin/ompi-iof
   /opt/openmpi-intel/1.4/bin/ompi-ps
   /opt/openmpi-intel/1.4/bin/ompi-restart
   /opt/openmpi-intel/1.4/bin/ompi-server
   /opt/openmpi-intel/1.4/bin/opari
   /opt/openmpi-intel/1.4/bin/orte-clean
   /opt/openmpi-intel/1.4/bin/orte-iof
   /opt/openmpi-intel/1.4/bin/orte-ps
   /opt/openmpi-intel/1.4/bin/otfdecompress
   /opt/openmpi-intel/1.4/bin/vtcc
   /opt/openmpi-intel/1.4/bin/vtcxx
   /opt/openmpi-intel/1.4/bin/vtf77
   /opt/openmpi-intel/1.4/bin/vtf90
   /opt/openmpi-intel/1.4/bin/vtfilter
   /opt/openmpi-intel/1.4/bin/vtunify
   /opt/openmpi-intel/1.4/etc/openmpi-default-hostfile
   /opt/openmpi-intel/1.4/etc/openmpi-mca-params.conf
   /opt/openmpi-intel/1.4/etc/openmpi-totalview.tcl
   /opt/openmpi-intel/1.4/share/FILTER.SPEC
   /opt/openmpi-intel/1.4/share/GROUPS.SPEC
   /opt/openmpi-intel/1.4/share/METRICS.SPEC
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/ChangeLog
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/LICENSE
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.html
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.pdf
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/ChangeLog
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/LICENSE
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/Readme.html
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.pdf
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.ps.gz
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/opari-logo-100.gif
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/ChangeLog
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/LICENSE
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/otftools.pdf
   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/specification.pdf
   /opt/openmpi-intel/1.4/share/vtcc-wrapper-data.txt
   /opt/openmpi-intel/1.4/share/vtcxx-wrapper-data.txt
   /opt/openmpi-intel/1.4/share/vtf77-wrapper-data.txt
   /opt/openmpi-intel/1.4/share/vtf90-wrapper-data.txt


RPM build errors:
Installed (but unpackaged) file(s) found:
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
   /opt/openmpi-intel/1.4/bin/mpiCC-vt
   /opt/openmpi-intel/1.4/bin/mpic++-vt
   /opt/openmpi-intel/1.4/bin/mpicc-vt
   /opt/openmpi-intel/1.4/bin/mpicxx-vt
   /opt/openmpi-intel/1.4/bin/mpif77-vt
   /opt/openmpi-intel/1.4/bin/mpif90-vt
   /opt/openmpi-intel/1.4/bin/ompi-checkpoint
   

Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-09 Thread Gus Correa

Hi David

Last time I tried, with OpenMPI 1.3.2, PGI (8.0-4) was problematic,
particularly for C and C++.

I eventually settled on a hybrid of gcc, g++, and pgf90
(for both the OpenMPI F77 and F90 bindings).
Even this required a trick to keep the "-pthread" flag
from being inserted among the pgf90 flags (where it doesn't belong).
Yes, libtool was also part of the problem back then.
You may find my postings about it in this list's archives - early 2009 -,
along with Jeff Squyres' solution for the problem.
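
A sketch of such a hybrid configure line (install prefix assumed, not the
exact command from back then):

./configure CC=gcc CXX=g++ F77=pgf90 FC=pgf90 \
    --prefix=/opt/openmpi-1.3.2-gcc-pgf90
make all install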


I also built a full Gnu version (gcc, g++, gfortran, gfortran)
of OpenMPI that works well.
Intel and hybrid Gnu(gcc,g++)+Intel(ifort for F77 and F90)
versions of OpenMPI also work right.
We need multiple compiler support here anyway.

My $0.02
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




David Turner wrote:

Hi all,

My first ever attempt to build OpenMPI.  Platform is Sun Sunfire x4600
M2 servers, running Scientific Linux version 5.3.  Trying to build
OpenMPI 1.4 (as of today; same problems yesterday with 1.3.4).
Trying to use PGI version 10.0.

As a first attempt, I set CC, CXX, F77, and FC, then did "configure"
and "make".  Make ends with:

libtool: link:  pgCC --prelink_objects --instantiation_dir Template.dir 
  .libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o 
.libs/win.o .libs/file.o   -Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/ompi/.libs 
-Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs 
-Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs 
-Wl,--rpath -Wl,/global/common/tesla/usg/openmpi/1.4/lib 
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs 
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs 
../../../ompi/.libs/libmpi.so 
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs/libopen-rte.so 
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs/libopen-pal.so 
-ldl -lnsl -lutil -lpthread

pgCC-Error-Unknown switch: --instantiation_dir
make[2]: *** [libmpi_cxx.la] Error 1

So I Googled "instantiation_dir openmpi", which led me to:

http://cia.vc/stats/project/OMPI?s_message=3

where I see:

There's still something wrong with the C++ support, however; I get
errors about a template directory switch when compiling the C++ MPI
bindings (doesn't happen with PGI 9.0). Still working on this... it
feels like it's still a Libtool issue because OMPI is not putting in
this compiler flag as far as I can tell:

{{{
/bin/sh ../../../libtool --tag=CXX --mode=link pgCC -g -version-info 
0:0:0 -export-dynamic -o libmpi_cxx.la -rpath /home/jsquyres/bogus/lib 
mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo 
../../../ompi/libmpi.la -lnsl -lutil -lpthread

libtool: link: tpldir=Template.dir
libtool: link: rm -rf Template.dir
libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir 
.libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o 
.libs/win.o .libs/file.o -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/ompi/.libs -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/orte/.libs -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/opal/.libs -Wl,--rpath 
-Wl,/home/jsquyres/bogus/lib -L/users/jsquyres/svn/ompi-1.3/orte/.libs 
-L/users/jsquyres/svn/ompi-1.3/opal/.libs ../../../ompi/.libs/libmpi.so 
/users/jsquyres/svn/ompi-1.3/orte/.libs/libopen-rte.so 
/users/jsquyres/svn/ompi-1.3/opal/.libs/libopen-pal.so -ldl -lnsl -lutil 
-lpthread

pgCC-Error-Unknown switch: --instantiation_dir
make: *** [libmpi_cxx.la] Error 1
}}}

I noticed the comment "doesn't happen with PGI 9.0", so I re-did the
entire process with PGI 9.0 instead of 10.0, but I get the same error!

Any suggestions?  Let me know if I should provide full copies of the
configure and make output.  Thanks!





Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-09 Thread Gerald Creager
Fascinating. I've not had any real problems building it from scratch 
with PGI. We are using the PGI 9 compilers, though, for that.


gerry

Gus Correa wrote:

Hi David

Last I tried, OpenMPI 1.3.2, PGI (8.0-4) was problematic,
particularly for C and C++.

I eventually settled down with a hybrid gcc, g++, and pgf90
(for both OpenMPI F77 and F90 bindings).
Even this required a trick to avoid the "-pthread" flag
to be inserted among the pgf90 flags (where it doesn't belong).
Yes, libtool was also part of the problem back then.
You may find my postings about on this list archives - early 2009 -,
along with Jeff Squyres' solution for the problem.


I also built a full Gnu version (gcc, g++, gfortran, gfortran)
of OpenMPI that works well.
Intel and hybrid Gnu(gcc,g++)+Intel(ifort for F77 and F90)
versions of OpenMPI also work right.
We need multiple compiler support here anyway.

My $0.02
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




David Turner wrote:

Hi all,

My first ever attempt to build OpenMPI.  Platform is Sun Sunfire x4600
M2 servers, running Scientific Linux version 5.3.  Trying to build
OpenMPI 1.4 (as of today; same problems yesterday with 1.3.4).
Trying to use PGI version 10.0.

As a first attempt, I set CC, CXX, F77, and FC, then did "configure"
and "make".  Make ends with:

libtool: link:  pgCC --prelink_objects --instantiation_dir 
Template.dir   .libs/mpicxx.o .libs/intercepts.o .libs/comm.o 
.libs/datatype.o .libs/win.o .libs/file.o   -Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/ompi/.libs 
-Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs 
-Wl,--rpath 
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs 
-Wl,--rpath -Wl,/global/common/tesla/usg/openmpi/1.4/lib 
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs 
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs 
../../../ompi/.libs/libmpi.so 
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs/libopen-rte.so 
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs/libopen-pal.so 
-ldl -lnsl -lutil -lpthread

pgCC-Error-Unknown switch: --instantiation_dir
make[2]: *** [libmpi_cxx.la] Error 1

So I Googled "instantiation_dir openmpi", which led me to:

http://cia.vc/stats/project/OMPI?s_message=3

where I see:

There's still something wrong with the C++ support, however; I get
errors about a template directory switch when compiling the C++ MPI
bindings (doesn't happen with PGI 9.0). Still working on this... it
feels like it's still a Libtool issue because OMPI is not putting in
this compiler flag as far as I can tell:

{{{
/bin/sh ../../../libtool --tag=CXX --mode=link pgCC -g -version-info 
0:0:0 -export-dynamic -o libmpi_cxx.la -rpath /home/jsquyres/bogus/lib 
mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo 
../../../ompi/libmpi.la -lnsl -lutil -lpthread

libtool: link: tpldir=Template.dir
libtool: link: rm -rf Template.dir
libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir 
.libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o 
.libs/win.o .libs/file.o -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/ompi/.libs -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/orte/.libs -Wl,--rpath 
-Wl,/users/jsquyres/svn/ompi-1.3/opal/.libs -Wl,--rpath 
-Wl,/home/jsquyres/bogus/lib -L/users/jsquyres/svn/ompi-1.3/orte/.libs 
-L/users/jsquyres/svn/ompi-1.3/opal/.libs 
../../../ompi/.libs/libmpi.so 
/users/jsquyres/svn/ompi-1.3/orte/.libs/libopen-rte.so 
/users/jsquyres/svn/ompi-1.3/opal/.libs/libopen-pal.so -ldl -lnsl 
-lutil -lpthread

pgCC-Error-Unknown switch: --instantiation_dir
make: *** [libmpi_cxx.la] Error 1
}}}

I noticed the comment "doesn't happen with PGI 9.0", so I re-did the
entire process with PGI 9.0 instead of 10.0, but I get the same error!

Any suggestions?  Let me know if I should provide full copies of the
configure and make output.  Thanks!



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-09 Thread Jeff Squyres
Just to set the record straight: it's a Libtool problem with PGI version 10 
(all PGI versions below 10 work fine).

This has been reported to the GNU Libtool folks and patches have already been 
applied upstream.  However, there hasn't been a new Libtool release yet with 
these patches, so we have to patch during the Open MPI build (hence, the 
solution is in our autogen.sh script, which sets up the configure/build system).


On Dec 9, 2009, at 4:58 PM, Gerald Creager wrote:

> Fascinating. I've not had any real problems building it from scratch
> with PGI. We are using the PGI 9 compilers, though, for that.
> 
> gerry
> 
> Gus Correa wrote:
> > Hi David
> >
> > Last I tried, OpenMPI 1.3.2, PGI (8.0-4) was problematic,
> > particularly for C and C++.
> >
> > I eventually settled down with a hybrid gcc, g++, and pgf90
> > (for both OpenMPI F77 and F90 bindings).
> > Even this required a trick to avoid the "-pthread" flag
> > to be inserted among the pgf90 flags (where it doesn't belong).
> > Yes, libtool was also part of the problem back then.
> > You may find my postings about on this list archives - early 2009 -,
> > along with Jeff Squyres' solution for the problem.
> >
> >
> > I also built a full Gnu version (gcc, g++, gfortran, gfortran)
> > of OpenMPI that works well.
> > Intel and hybrid Gnu(gcc,g++)+Intel(ifort for F77 and F90)
> > versions of OpenMPI also work right.
> > We need multiple compiler support here anyway.
> >
> > My $0.02
> > Gus Correa
> > -
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > -
> >
> >
> >
> >
> > David Turner wrote:
> >> Hi all,
> >>
> >> My first ever attempt to build OpenMPI.  Platform is Sun Sunfire x4600
> >> M2 servers, running Scientific Linux version 5.3.  Trying to build
> >> OpenMPI 1.4 (as of today; same problems yesterday with 1.3.4).
> >> Trying to use PGI version 10.0.
> >>
> >> As a first attempt, I set CC, CXX, F77, and FC, then did "configure"
> >> and "make".  Make ends with:
> >>
> >> libtool: link:  pgCC --prelink_objects --instantiation_dir
> >> Template.dir   .libs/mpicxx.o .libs/intercepts.o .libs/comm.o
> >> .libs/datatype.o .libs/win.o .libs/file.o   -Wl,--rpath
> >> -Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/ompi/.libs
> >> -Wl,--rpath
> >> -Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs
> >> -Wl,--rpath
> >> -Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs
> >> -Wl,--rpath -Wl,/global/common/tesla/usg/openmpi/1.4/lib
> >> -L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs
> >> -L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs
> >> ../../../ompi/.libs/libmpi.so
> >> /project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs/libopen-rte.so
> >> /project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs/libopen-pal.so
> >> -ldl -lnsl -lutil -lpthread
> >> pgCC-Error-Unknown switch: --instantiation_dir
> >> make[2]: *** [libmpi_cxx.la] Error 1
> >>
> >> So I Googled "instantiation_dir openmpi", which led me to:
> >>
> >> http://cia.vc/stats/project/OMPI?s_message=3
> >>
> >> where I see:
> >>
> >> There's still something wrong with the C++ support, however; I get
> >> errors about a template directory switch when compiling the C++ MPI
> >> bindings (doesn't happen with PGI 9.0). Still working on this... it
> >> feels like it's still a Libtool issue because OMPI is not putting in
> >> this compiler flag as far as I can tell:
> >>
> >> {{{
> >> /bin/sh ../../../libtool --tag=CXX --mode=link pgCC -g -version-info
> >> 0:0:0 -export-dynamic -o libmpi_cxx.la -rpath /home/jsquyres/bogus/lib
> >> mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo
> >> ../../../ompi/libmpi.la -lnsl -lutil -lpthread
> >> libtool: link: tpldir=Template.dir
> >> libtool: link: rm -rf Template.dir
> >> libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir
> >> .libs/mpicxx.o .libs/intercepts.o .libs/comm.o .libs/datatype.o
> >> .libs/win.o .libs/file.o -Wl,--rpath
> >> -Wl,/users/jsquyres/svn/ompi-1.3/ompi/.libs -Wl,--rpath
> >> -Wl,/users/jsquyres/svn/ompi-1.3/orte/.libs -Wl,--rpath
> >> -Wl,/users/jsquyres/svn/ompi-1.3/opal/.libs -Wl,--rpath
> >> -Wl,/home/jsquyres/bogus/lib -L/users/jsquyres/svn/ompi-1.3/orte/.libs
> >> -L/users/jsquyres/svn/ompi-1.3/opal/.libs
> >> ../../../ompi/.libs/libmpi.so
> >> /users/jsquyres/svn/ompi-1.3/orte/.libs/libopen-rte.so
> >> /users/jsquyres/svn/ompi-1.3/opal/.libs/libopen-pal.so -ldl -lnsl
> >> -lutil -lpthread
> >> pgCC-Error-Unknown switch: --instantiation_dir
> >> make: *** [libmpi_cxx.la] Error 1
> >> }}}
> >>
> >> I noticed the comment "doesn't happen with PGI 9.0", so I re-did the
> >> entire process with PGI 9.0 instead of 10.0, but I get 

Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-09 Thread Gus Correa

Hi All

As I stated in my original posting,
I haven't compiled OpenMPI since 1.3.2.
I'm just trying to be of help, based on previous,
and maybe too old, experiences.

The problem I referred to happened with PGI 8.0-4 and OpenMPI 1.3.
Most likely the issue is superseded already by the newer
OpenMPI configuration scripts, but it did exist
and it did involve libtool as well,
although it seems to be different from what
David Turner just reported with PGI 10,
and apparently with PGI 9 also (so he wrote).

These threads document the problem I had,
with one solution provided by Jeff Squyres,
and another by Orion Poplawski:

http://www.open-mpi.org/community/lists/users/2009/04/8724.php
http://www.open-mpi.org/community/lists/users/2009/04/8911.php

Those workarounds may no longer be required, considering what Jeff
and Gerry wrote, which is good news, of course.

Thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Jeff Squyres wrote:

Just to set the record straight: it's a Libtool problem with PGI version 10 
(all PGI versions below 10 work fine).

This has been reported to the GNU Libtool folks and patches have already been 
applied upstream.  However, there hasn't been a new Libtool release yet with 
these patches, so we have to patch during the Open MPI build (hence, the 
solution is in our autogen.sh script, which sets up the configure/build system).


On Dec 9, 2009, at 4:58 PM, Gerald Creager wrote:


Fascinating. I've not had any real problems building it from scratch
with PGI. We are using the PGI 9 compilers, though, for that.

gerry

Gus Correa wrote:

Hi David

Last I tried, OpenMPI 1.3.2, PGI (8.0-4) was problematic,
particularly for C and C++.

I eventually settled down with a hybrid gcc, g++, and pgf90
(for both OpenMPI F77 and F90 bindings).
Even this required a trick to avoid the "-pthread" flag
to be inserted among the pgf90 flags (where it doesn't belong).
Yes, libtool was also part of the problem back then.
You may find my postings about on this list archives - early 2009 -,
along with Jeff Squyres' solution for the problem.


I also built a full Gnu version (gcc, g++, gfortran, gfortran)
of OpenMPI that works well.
Intel and hybrid Gnu(gcc,g++)+Intel(ifort for F77 and F90)
versions of OpenMPI also work right.
We need multiple compiler support here anyway.

My $0.02
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




David Turner wrote:

Hi all,

My first ever attempt to build OpenMPI.  Platform is Sun Sunfire x4600
M2 servers, running Scientific Linux version 5.3.  Trying to build
OpenMPI 1.4 (as of today; same problems yesterday with 1.3.4).
Trying to use PGI version 10.0.

As a first attempt, I set CC, CXX, F77, and FC, then did "configure"
and "make".  Make ends with:

libtool: link:  pgCC --prelink_objects --instantiation_dir
Template.dir   .libs/mpicxx.o .libs/intercepts.o .libs/comm.o
.libs/datatype.o .libs/win.o .libs/file.o   -Wl,--rpath
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/ompi/.libs
-Wl,--rpath
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs
-Wl,--rpath
-Wl,/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs
-Wl,--rpath -Wl,/global/common/tesla/usg/openmpi/1.4/lib
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs
-L/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs
../../../ompi/.libs/libmpi.so
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/orte/.libs/libopen-rte.so
/project/projectdirs/mpccc/usg/software/tnt/openmpi/openmpi-1.4/opal/.libs/libopen-pal.so
-ldl -lnsl -lutil -lpthread
pgCC-Error-Unknown switch: --instantiation_dir
make[2]: *** [libmpi_cxx.la] Error 1

So I Googled "instantiation_dir openmpi", which led me to:

http://cia.vc/stats/project/OMPI?s_message=3

where I see:

There's still something wrong with the C++ support, however; I get
errors about a template directory switch when compiling the C++ MPI
bindings (doesn't happen with PGI 9.0). Still working on this... it
feels like it's still a Libtool issue because OMPI is not putting in
this compiler flag as far as I can tell:

{{{
/bin/sh ../../../libtool --tag=CXX --mode=link pgCC -g -version-info
0:0:0 -export-dynamic -o libmpi_cxx.la -rpath /home/jsquyres/bogus/lib
mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo
../../../ompi/libmpi.la -lnsl -lutil -lpthread
libtool: link: tpldir=Template.dir
libtool: link: rm -rf Template.dir
libtool: link: pgCC --prelink_objects --instantiation_dir Template.dir
.libs/mpicxx.o .libs/intercepts.o .libs/comm.o 

Re: [OMPI users] OpenMPI 1.4 RPM Spec file problem

2009-12-09 Thread Jim Kusznir
By the way, if I set build_all_in_one_rpm to 1, it works fine...

--Jim

On Wed, Dec 9, 2009 at 1:47 PM, Jim Kusznir  wrote:
> Hi all:
>
> I'm trying to build openmpi-1.4 rpms using my normal (complex) rpm
> build commands, but its failing.  I'm running into two errors:
>
> One (on gcc only): the D_FORTIFY_SOURCE build failure.  I've had to
> move the if test "$using_gcc" = 0; then line down to after the
> RPM_OPT_FLAGS= that includes D_FORTIFY_SOURCE; otherwise the compile
> blows up.
>
> The second, and in my opinion, more major rpm spec file bug is
> something with the files specification.  I build multiple versions of
> OpenMPI to accomidate the collection of compilers I use (on this
> machine, I have intel 10.1 and GCC, and will have to add 9.1 per user
> request); on others, I use PGI and GCC.  In any case, here's my build
> command for Intel:
>
> CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb --define
> 'install_in_opt 1' --define 'install_modulefile 1' --define
> 'modules_rpm_name Modules' --define 'build_all_in_one_rpm 0'  --define
> 'configure_options --with-tm=/opt/torque' --define '_name
> openmpi-intel' openmpi-1.4.spec
>
> Unfortunately, the filespec is somehow broke and it ends up missing
> most (all?) the files, and failing in the final stage of RPM creation:
>
> ---
> Processing files: openmpi-intel-docs-1.4-1
> Finding  Provides: /usr/lib/rpm/find-provides openmpi-intel
> Finding  Requires: /usr/lib/rpm/find-requires openmpi-intel
> Finding  Supplements: /usr/lib/rpm/find-supplements openmpi-intel
> Requires(rpmlib): rpmlib(PayloadFilesHavePrefix) <= 4.0-1
> rpmlib(CompressedFileNames) <= 3.0.4-1
> Requires: openmpi-intel-runtime
> Checking for unpackaged file(s): /usr/lib/rpm/check-files
> /var/tmp/openmpi-intel-1.4-1-root
> error: Installed (but unpackaged) file(s) found:
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
>   /opt/openmpi-intel/1.4/bin/mpiCC-vt
>   /opt/openmpi-intel/1.4/bin/mpic++-vt
>   /opt/openmpi-intel/1.4/bin/mpicc-vt
>   /opt/openmpi-intel/1.4/bin/mpicxx-vt
>   /opt/openmpi-intel/1.4/bin/mpif77-vt
>   /opt/openmpi-intel/1.4/bin/mpif90-vt
>   /opt/openmpi-intel/1.4/bin/ompi-checkpoint
>   /opt/openmpi-intel/1.4/bin/ompi-clean
>   /opt/openmpi-intel/1.4/bin/ompi-iof
>   /opt/openmpi-intel/1.4/bin/ompi-ps
>   /opt/openmpi-intel/1.4/bin/ompi-restart
>   /opt/openmpi-intel/1.4/bin/ompi-server
>   /opt/openmpi-intel/1.4/bin/opari
>   /opt/openmpi-intel/1.4/bin/orte-clean
>   /opt/openmpi-intel/1.4/bin/orte-iof
>   /opt/openmpi-intel/1.4/bin/orte-ps
>   /opt/openmpi-intel/1.4/bin/otfdecompress
>   /opt/openmpi-intel/1.4/bin/vtcc
>   /opt/openmpi-intel/1.4/bin/vtcxx
>   /opt/openmpi-intel/1.4/bin/vtf77
>   /opt/openmpi-intel/1.4/bin/vtf90
>   /opt/openmpi-intel/1.4/bin/vtfilter
>   /opt/openmpi-intel/1.4/bin/vtunify
>   /opt/openmpi-intel/1.4/etc/openmpi-default-hostfile
>   /opt/openmpi-intel/1.4/etc/openmpi-mca-params.conf
>   /opt/openmpi-intel/1.4/etc/openmpi-totalview.tcl
>   /opt/openmpi-intel/1.4/share/FILTER.SPEC
>   /opt/openmpi-intel/1.4/share/GROUPS.SPEC
>   /opt/openmpi-intel/1.4/share/METRICS.SPEC
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/ChangeLog
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/LICENSE
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.html
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.pdf
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/ChangeLog
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/LICENSE
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/Readme.html
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.pdf
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.ps.gz
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/opari-logo-100.gif
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/ChangeLog
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/LICENSE
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/otftools.pdf
>   /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/specification.pdf
>   /opt/openmpi-intel/1.4/share/vtcc-wrapper-data.txt
>   /opt/openmpi-intel/1.4/share/vtcxx-wrapper-data.txt
>   /opt/openmpi-intel/1.4/share/vtf77-wrapper-data.txt
>   /opt/openmpi-intel/1.4/share/vtf90-wrapper-data.txt
>
>
> RPM build errors:
>    Installed (but unpackaged) file(s) found:
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
>   /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
>   

Re: [OMPI users] Pointers for understanding failure messages on NetBSD

2009-12-09 Thread Aleksej Saushev
kevin.buck...@ecs.vuw.ac.nz writes:

 CONFIGURE_ARGS+=  --enable-contrib-no-build=vt
>>>
>>> I have no idea how NetBSD go about resolving such clashes in the long
>>> term though?
>>
>> I've disabled it the same way for this time, my local package differs
>> from what's in wip:
>>
>> --- PLIST3 Dec 2009 10:18:00 -   1.5
>> +++ PLIST9 Dec 2009 08:29:31 -
>> @@ -1,17 +1,11 @@
>>  @comment $NetBSD$
>>  bin/mpiCC
>> -bin/mpiCC-vt
>>  bin/mpic++
>> -bin/mpic++-vt
>
> I am surprised that you are still installing binaries and other files
> with the -vt extension after disabling the vt stuff ?

I didn't commit that part, since I consider the conflicting package my
own local problem. Other people may not have it.
You can add the CONFIGURE_ARGS line and regenerate the PLIST in the
regular way.
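
For example, something along these lines (paths assumed):

# in wip/openmpi/Makefile
CONFIGURE_ARGS+=        --enable-contrib-no-build=vt

# then rebuild and regenerate the packing list
cd /usr/pkgsrc/wip/openmpi && make print-PLIST > PLIST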

>> I can commit my development patches into wip right now,
>> if that helps you.
>
> If your stuff now works then that's ideal. If your build is still
> failing after applying my patches then probably not.
>
> Given that we have something that does work, it would make sense
> to try and merge the two as far as possible before proceeding any
> further.

The benchmark I use to test MPICH and OpenMPI (parallel/skampi) still
doesn't work for me. It may be that I have a somewhat unusual network
configuration; I'm looking into it.

> As discussed before, there is no real reason to have two getifaddrs
> loops seperating out IPv6 and non-IPv6 - that could all be in one
> loop.

Sure, I think that we can do it a bit later.

>> Some patches should be there anyway, since OpenMPI doesn't help with
>> installation of configuration files into example directory anyway.
>
> OK, as you are the person within the NetBSD community looking
> after OpenMPI, I'll happily work with whatever is in the NetBSD
> repository and patch locally as needed, because I have poeple here
> who want to use stuff that requires OpenMPI now.
>
> Are you going to upgrade the NetBSD port to build against OpenMPI 1.4
> now that it available ? Might be a good time to check the fuzzz in the
> existing patches.

http://pkgsrc-wip.cvs.sourceforge.net/viewvc/pkgsrc-wip/wip/openmpi/Makefile


-- 
HE CE3OH...


Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Matthew MacManes
Hi Gus and List,

1st of all Gus, I want to say thanks.. you have been a huge help, and when I
get this fixed, I owe you big time!

However, the problems continue...

I formatted the HD, reinstalled OS to make sure that I was working from
scratch.  I did your step A, which seemed to go fine:

macmanes@macmanes:~$ which mpicc
/home/macmanes/apps/openmpi1.4/bin/mpicc
macmanes@macmanes:~$ which mpirun
/home/macmanes/apps/openmpi1.4/bin/mpirun

Good stuff there...

I then compiled the example files:

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
Connectivity test on 8 processes PASSED.
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
..HANGS..NO OUTPUT

this is maddening because ring_c works.. and connectivity_c worked the 1st
time, but not the second... I did it 10 times, and it worked twice.. here is
the TOP screenshot:

http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

What is the difference between connectivity_c and ring_c? Under what
circumstances should one fail and not the other...

I'm off to the Linux forums to see about the Nehalem kernel issues..

Matt



On Wed, Dec 9, 2009 at 13:25, Gus Correa  wrote:

> Hi Matthew
>
> There is no point in trying to troubleshoot MrBayes and ABySS
> if not even the OpenMPI test programs run properly.
> You must straighten them out first.
>
> **
>
> Suggestions:
>
> **
>
> A) While you are at OpenMPI, do yourself a favor,
> and install it from source on a separate directory.
> Who knows if the OpenMPI package distributed with Ubuntu
> works right on Nehalem?
> Better install OpenMPI yourself from source code.
> It is not a big deal, and may save you further trouble.
>
> Recipe:
>
> 1) Install gfortran and g++ if you don't have them using apt-get.
> 2) Put the OpenMPI tarball in, say /home/matt/downolads/openmpi
> 3) Make another install directory *not in the system directory tree*.
> Something like "mkdir /home/matt/apps/openmpi-X.Y.Z/" (X.Y.Z=version)
> will work
> 4) cd /home/matt/downolads/openmpi
> 5) ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran  \
> --prefix=/home/matt/apps/openmpi-X.Y.Z
> (Use the prefix flag to install in the directory of item 3.)
> 6) make
> 7) make install
> 8) At the bottom of your /home/matt/.bashrc or .profile file
> put these lines:
>
> export PATH=/home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> export MANPATH=/home/matt/apps/openmpi-X.Y.Z/share/man:`man -w`
> export LD_LIBRARY_PATH=home/matt/apps/openmpi-X.Y.Z/lib:${LD_LIBRARY_PATH}
>
> (If you use csh/tcsh use instead:
> setenv PATH /home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> etc)
>
> 9) Logout and login again to freshen um the environment variables.
> 10) Do "which mpicc"  to check that it is pointing to your newly
> installed OpenMPI.
> 11) Recompile and rerun the OpenMPI test programs
> with 2, 4, 8, 16,  processors.
> Use full path names to mpicc and to mpirun,
> if the change of PATH above doesn't work right.
>
> 
>
> B) Nehalem is quite new hardware.
> I don't know if the Ubuntu kernel 2.6.31-16 fully supports all
> of Nehalem features, particularly hyperthreading, and NUMA,
> which are used by MPI programs.
> I am not the right person to give you advice about this.
> I googled out but couldn't find a clear information about
> minimal kernel age/requirements to have Nehalem fully supported.
> Some Nehalem owner in the list could come forward and tell.
>
> **
>
> C) On the top screenshot you sent me, please try it again
> (after you do item A) but type "f" and "j" to show the processors
> that are running each process.
>
> **
>
> D) Also, the screeshot shows 20GB of memory.
> This sounds not as a optimal memory for Nehalem,
> which tend to be 6GB, 12GB, 24GB, 48GB.
> Did you put together the system, or upgraded the memory yourself,
> of did you buy the computer as is?
> However, this should not break MPI anyway.
>
> **
>
> E) Answering your question:
> It is true that different flavors of MPI
> used to compile (mpicc) and run (mpiexec) a program would probably
> break right away, regardless of the number of processes.
> However, when it comes to different versions of the
> same MPI flavor (say OpenMPI 1.3.4 and OpenMPI 1.3.3)
> I am not 

[OMPI users] OMPI 1.4: connectivity_c fails, ring_c and hello_c work

2009-12-09 Thread Matthew MacManes
What is the difference between connectivity_c and ring_c or hello_c? Under
what circumstances should one fail and not the others...

I am having a huge problem with openMPI, and trying to get to the bottom of
it by understanding the differences between the example files, connectivity,
hello, and ring.

1st off, ring_c and hello_c seem to work fine with up to -np 250

connectivity_c works reliably when -np <5, but less than 30% of the time
when -np >6.  When it does not work, it just hangs, with no output.  Here is a
screenshot of TOP with mpirun -np 8 connectivity_c hanging:

http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

Under what circumstances should this happen?

I am using Ubuntu 9.10, kernel 2.6.31-16, Nehalem processors.
Hyperthreading is enabled.

Thanks!
_
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/


Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Gus Correa

Hi Matthew

Barring any misinterpretation I may have made of the code:

Hello_c has no real communication, except for a final Barrier
synchronization.
Each process prints "hello world" and that's it.

Ring probes a little more, with processes Send(ing) and
Recv(eiving) messages.
Ring just passes a message sequentially along all process
ranks, then back to rank 0, and repeats the game 10 times.
Rank 0 is in charge of counting turns, decrementing the counter,
and printing it (nobody else prints).
With 4 processes:
0->1->2->3->0->1... 10 times

In connectivity every pair of processes exchange a message.
Therefore it probes all pairwise connections.
In verbose mode you can see that.
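
A minimal sketch of that pairwise exchange (illustrative only, not the
shipped connectivity_c source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i, j, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every pair (i, j) with i < j exchanges one message in each direction */
    for (i = 0; i < size; ++i) {
        for (j = i + 1; j < size; ++j) {
            if (rank == i) {
                MPI_Send(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == j) {
                MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            }
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("Connectivity test on %d processes PASSED.\n", size);
    MPI_Finalize();
    return 0;
}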

These programs shouldn't hang at all, if the system were sane.
Actually, they should even run with a significant level of
oversubscription, say,
-np 128  should work easily for all three programs on a powerful
machine like yours.


**

Suggestions

1) Stick to the OpenMPI you compiled.

**

2) You can run connectivity_c in verbose mode:

home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v

(Note the trailing "-v".)

It should tell more about who's talking to who.

**

3) I wonder if there are any BIOS settings that may be required
(and perhaps not in place) to make the Nehalem hyperthreading
work properly in your computer.

You reach the BIOS settings by typing  or 
when the computer boots up.
The key varies by
BIOS and computer vendor, but shows quickly on the bootup screen.

You may ask the computer vendor about the recommended BIOS settings.
If you haven't done this before, be careful to change and save only
what really needs to change (if anything really needs to change),
or the result may be worse.
(Overclocking is for gamers, not for genome researchers ... :) )

**

4) What I read about Nehalem DDR3 memory is that it is optimal
on configurations that are multiples of 3GB per CPU.
Common configs. in dual CPU machines like yours are
6, 12, 24 and 48GB.
The sockets where you install the memory modules also matter.

Your computer has 20GB.
Did you build the computer or upgrade the memory yourself?
Do you know how the memory is installed, in which memory sockets?
What does the vendor have to say about it?

See this:
http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx

**

5) As I said before, typing "f" then "j" on "top" will add
a column (labeled "P") that shows in which core each process is running.
This will let you observe how the Linux scheduler is distributing
the MPI load across the cores.
Hopefully it is load-balanced, and different processes go to different
cores.

***

It is very disconcerting when MPI processes hang.
You are not alone.
The reasons are not always obvious.
At least in your case there is no network involved to troubleshoot.


**

I hope it helps,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-





Matthew MacManes wrote:

Hi Gus and List,

1st of all Gus, I want to say thanks.. you have been a huge help, and 
when I get this fixed, I owe you big time!


However, the problems continue...

I formatted the HD, reinstalled OS to make sure that I was working from 
scratch.  I did your step A, which seemed to go fine:


macmanes@macmanes:~$ which mpicc
/home/macmanes/apps/openmpi1.4/bin/mpicc
macmanes@macmanes:~$ which mpirun
/home/macmanes/apps/openmpi1.4/bin/mpirun

Good stuff there...

I then compiled the example files:

macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ 
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c

Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ 
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c

Connectivity test on 8 processes PASSED.
macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ 
/home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c

..HANGS..NO OUTPUT

this is maddening because ring_c works.. and connectivity_c worked the 
1st time, but not the second... I did it 10 times, and it worked twice.. 
here is the TOP screenshot:


http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394

What is the difference between connectivity_c and ring_c? Under what 
circumstances should one fail and not the other...


I'm