Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Brian Barrett
Almost assuredly, the MTL is not thread safe, and such support is  
unlikely to happen in the short term.  You might be better off  
concentrating on the BTL, as George has done significant work on that  
front.


Brian


On Jun 11, 2009, at 12:20 PM, François Trahay wrote:

The stack trace is from the MX MTL (I attach the backtraces I get  
with both MX MTL and MX BTL)
Here is the program that I use. It is quite simple. It runs ping  
pongs concurrently (with one thread per node, then with two threads  
per node, etc.)

The error occurs when two threads run concurrently.

Francois
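
(A minimal sketch of the pattern described above -- not the attached test
program; it assumes exactly two ranks, one pthread per tag, and the function
names are illustrative:)

  #include <mpi.h>
  #include <pthread.h>
  #include <stdio.h>

  /* Each thread runs its own ping-pong on its own tag. */
  static void *pingpong(void *arg)
  {
      int tag = *(int *) arg, rank, i;
      char buf[64] = "ping";
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      int peer = 1 - rank;                       /* assumes exactly 2 ranks */
      for (i = 0; i < 1000; i++) {
          if (rank == 0) {
              MPI_Send(buf, 64, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
              MPI_Recv(buf, 64, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else {
              MPI_Recv(buf, 64, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, 64, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
          }
      }
      return NULL;
  }

  int main(int argc, char **argv)
  {
      int provided, i, tags[2] = { 1, 2 };
      pthread_t th[2];
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE) {
          fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
          MPI_Abort(MPI_COMM_WORLD, 1);
      }
      /* The reported crash appears once two threads run concurrently. */
      for (i = 0; i < 2; i++) pthread_create(&th[i], NULL, pingpong, &tags[i]);
      for (i = 0; i < 2; i++) pthread_join(th[i], NULL);
      MPI_Finalize();
      return 0;
  }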

Scott Atchley wrote:

Brian and George,

I do not know if the stack trace is complete, but I do not see any  
mx_* functions called which would indicate a crash inside MX due to  
multiple threads trying to complete the same request. It does show  
an assert failed.


Francois, is the stack trace from the MX MTL or BTL? Can you send a  
small program that reproduces this abort?


Scott


On Jun 11, 2009, at 12:25 PM, Brian Barrett wrote:

Neither the CM PML nor the MX MTL has been looked at for thread
safety.  There's not much code to cause problems in the CM PML.   
The MX MTL would likely need some work to ensure the restrictions  
Scott mentioned are met (currently, there's no such guarantee in  
the MX MTL).


Brian

On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:

The comment on the FAQ (and on the other thread) is only true for
some BTLs (TCP, SM and MX).  I don't have the resources to test the
other BTLs; it is their developers' responsibility to do the
required modifications to make them thread safe.


In addition, I have to confess that I never tested the MTL for
thread safety.  It is a completely different implementation of
message passing, supposed to map directly onto the underlying
network capabilities.  However, there are clearly a few places where
thread safety should be enforced in the MTL layer, and I don't know
if this is the case.


george.

On Jun 11, 2009, at 09:35 , Scott Atchley wrote:


Francois,

For threads, the FAQ has:

http://www.open-mpi.org/faq/?category=supported-systems#thread-support

It mentions that thread support is designed in, but lightly  
tested. It is also possible that the FAQ is out of date and  
MPI_THREAD_MULTIPLE is fully supported.


The stack trace below shows:

opal_free()
opal_progress()
MPI_Recv()

I do not know this code, but the problem may be in the higher-level
code that calls the BTLs and/or MTLs; that would be a place to see
whether that code handles the TCP BTL differently than the MX BTL/MTL.


MX is thread safe with the caveat that two threads may not try  
to complete the same request at the same time. This includes  
calling mx_test(), mx_wait(), mx_test_any() and/or mx_wait_any()  
where the latter two have match bits and a match mask that could
complete a request being tested/waited on by another thread.
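
(At the MPI level, the analogous discipline is for each thread to post and
complete only its own requests on a thread-unique tag.  A sketch, not code
from this thread; the function name and arguments are illustrative:)

  #include <mpi.h>

  /* The request is created, tested and waited on only by the thread that
     owns it, and my_tag is unique to that thread, so no other thread's
     test/wait can complete it. */
  static void thread_exchange(void *buf, int count, int peer, int my_tag)
  {
      MPI_Request req;
      MPI_Irecv(buf, count, MPI_BYTE, peer, my_tag, MPI_COMM_WORLD, &req);
      /* ... work done by this thread only ... */
      MPI_Wait(&req, MPI_STATUS_IGNORE);  /* completed by the owning thread */
  }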


Scott

On Jun 11, 2009, at 6:00 AM, François Trahay wrote:

Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php 
), threads are supported in OpenMPI.
The program I am trying to run works with the TCP stack, and the MX driver
is thread-safe, so I guess the problem comes from the MX BTL or
MTL.


Francois


Scott Atchley wrote:

Hi Francois,

I am not familiar with the internals of the OMPI code. Are you  
sure, however, that threads are fully supported yet? I was  
under the impression that thread support was still partial.


Can anyone else comment?

Scott

On Jun 8, 2009, at 8:43 AM, François Trahay wrote:


Hi,
I'm encountering some issues when running a multithreaded  
program with
OpenMPI (trunk rev. 21380, configured with --enable-mpi-threads)
My program (included in the tar.bz2) uses several pthreads  
that perform
ping pongs concurrently (thread #1 uses tag #1, thread #2  
uses tag #2, etc.)
This program crashes over MX (either btl or mtl) with the  
following

backtrace:

concurrent_ping_v2: pml_cm_recvreq.c:53:
mca_pml_cm_recv_request_completion: Assertion `0 ==
((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete'

failed.
[joe0:01709] *** Process received signal ***
[joe0:01709] *** Process received signal ***
[joe0:01709] Signal: Segmentation fault (11)
[joe0:01709] Signal code: Address not mapped (1)
[joe0:01709] Failing at address: 0x1238949c4
[joe0:01709] Signal: Aborted (6)
[joe0:01709] Signal code:  (-6)
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
[joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
[joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
[joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
[joe0:01709] [ 3] /home/ftrahay/sources/openmpi/tr

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Brian Barrett
Neither the CM PML nor the MX MTL has been looked at for thread
safety.  There's not much code to cause problems in the CM PML.  The  
MX MTL would likely need some work to ensure the restrictions Scott  
mentioned are met (currently, there's no such guarantee in the MX MTL).


Brian

On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:

The comment on the FAQ (and on the other thread) is only true for
some BTLs (TCP, SM and MX).  I don't have the resources to test the
other BTLs; it is their developers' responsibility to do the required
modifications to make them thread safe.


In addition, I have to confess that I never tested the MTL for
thread safety.  It is a completely different implementation of
message passing, supposed to map directly onto the underlying
network capabilities.  However, there are clearly a few places where
thread safety should be enforced in the MTL layer, and I don't know
if this is the case.


 george.

On Jun 11, 2009, at 09:35 , Scott Atchley wrote:


Francois,

For threads, the FAQ has:

http://www.open-mpi.org/faq/?category=supported-systems#thread-support


It mentions that thread support is designed in, but lightly tested.  
It is also possible that the FAQ is out of date and  
MPI_THREAD_MULTIPLE is fully supported.


The stack trace below shows:

opal_free()
opal_progress()
MPI_Recv()

I do not know this code, but the problem may be in the higher-level
code that calls the BTLs and/or MTLs; that would be a place to see
whether that code handles the TCP BTL differently than the MX BTL/MTL.


MX is thread safe with the caveat that two threads may not try to  
complete the same request at the same time. This includes calling  
mx_test(), mx_wait(), mx_test_any() and/or mx_wait_any() where the  
latter two have match bits and a match mask that could complete a
request being tested/waited on by another thread.


Scott

On Jun 11, 2009, at 6:00 AM, François Trahay wrote:

Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php 
), threads are supported in OpenMPI.
The program I am trying to run works with the TCP stack, and the MX driver is
thread-safe, so I guess the problem comes from the MX BTL or MTL.


Francois


Scott Atchley wrote:

Hi Francois,

I am not familiar with the internals of the OMPI code. Are you  
sure, however, that threads are fully supported yet? I was under  
the impression that thread support was still partial.


Can anyone else comment?

Scott

On Jun 8, 2009, at 8:43 AM, François Trahay wrote:


Hi,
I'm encountering some issues when running a multithreaded  
program with

OpenMPI (trunk rev. 21380, configured with --enable-mpi-threads)
My program (included in the tar.bz2) uses several pthreads that  
perform
ping pongs concurrently (thread #1 uses tag #1, thread #2 uses  
tag #2, etc.)
This program crashes over MX (either btl or mtl) with the  
following

backtrace:

concurrent_ping_v2: pml_cm_recvreq.c:53:
mca_pml_cm_recv_request_completion: Assertion `0 ==
((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete'

failed.
[joe0:01709] *** Process received signal ***
[joe0:01709] *** Process received signal ***
[joe0:01709] Signal: Segmentation fault (11)
[joe0:01709] Signal code: Address not mapped (1)
[joe0:01709] Failing at address: 0x1238949c4
[joe0:01709] Signal: Aborted (6)
[joe0:01709] Signal code:  (-6)
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
[joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
[joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
[joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
[joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
[joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
[joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
[joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
[joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
[joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
[joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
[joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
[joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
[joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
[joe0:01709] *** End of error message ***
[joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/

Re: [OMPI users] HPL with OpenMPI: Do I have a memory leak?

2009-05-04 Thread Brian Barrett
S (Linux CentOS 5.2), HPL is running alone.

   The cluster has Infiniband.
   However, I am running on a single node.

   The surprising thing is that if I run on shared memory only
   (-mca btl sm,self) there is no memory problem,
   the memory use is stable at about 13.9GB,
   and the run completes.
   So, there is a way around to run on a single node.
   (Actually shared memory is presumably the way to go on a  
single node.)


   However, if I introduce IB (-mca btl openib,sm,self)
   among the MCA btl parameters, then memory use blows up.

   This is bad news for me, because I want to extend the experiment
   to run HPL also across the whole cluster using IB,
   which is actually the ultimate goal of HPL, of course!
   It also suggests that the problem is somehow related to  
Infiniband,

   maybe hidden under OpenMPI.

   Here is the mpiexec command I use (with and without openib):

   /path/to/openmpi/bin/mpiexec \
  -prefix /the/run/directory \
  -np 8 \
  -mca btl [openib,]sm,self \
  xhpl


   Any help, insights, suggestions, reports of previous  
experiences,

   are much appreciated.

   Thank you,
   Gus Correa



--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Relocating an Open MPI installation using OPAL_PREFIX

2009-01-06 Thread Brian Barrett

Sorry I haven't jumped in this thread earlier -- I've been a bit behind.

The multi-lib support worked at one time, and I can't think of why it  
would have changed.  The one condition is that libdir, includedir,  
etc. *MUST* be specified relative to $prefix for it to work.  It looks  
like you were defining them as absolute paths, so you'd have to set  
libdir directly, which will never work in multi-lib because mpirun and  
the app likely have different word sizes and therefore different  
libdirs.  More information is on the multilib page in the wiki:


  https://svn.open-mpi.org/trac/ompi/wiki/MultiLib

There is actually one condition we do not handle properly, the prefix  
flag to mpirun.  The LD_LIBRARY_PATH will only be set for the word  
size of mpirun, and not the executable.  Really, both would have to be
added (so that orted, which is likely always 32-bit in a multilib
situation, and the app can both find their libraries).



Brian

On Jan 5, 2009, at 6:02 PM, Jeff Squyres wrote:

I honestly haven't thought through the ramifications of doing a  
multi-lib build with OPAL_PREFIX et al. :-\


If you setenv OPAL_LIBDIR, it'll use whatever you set it to, so it  
doesn't matter what you configured --libdir with.  Additionally
mca/installdirs/config/install_dirs.h has this by default:


#define OPAL_LIBDIR "${exec_prefix}/lib"

Hence, if you use a default --libdir and setenv OPAL_PREFIX, then  
the libdir should pick up the right thing (because it's based on the  
prefix).  But if you use --libdir that is *not* based on
${exec_prefix}, then you might run into problems.


Perhaps you can '--libdir="${exec_prefix}/lib64"' so that you can  
have your custom libdir, but still have it dependent upon the prefix  
that gets expanded at run time...?


(again, I'm not thinking all of this through -- just offering a few  
suggestions off the top of my head that you'll need to test / trace  
the code to be sure...)



On Jan 5, 2009, at 1:35 PM, Ethan Mallove wrote:


On Thu, Dec/25/2008 08:12:49AM, Jeff Squyres wrote:
It's quite possible that we don't handle this situation properly.
Won't you need two libdirs (one for the 32-bit OMPI executables, and
one for the 64-bit MPI apps)?


I don't need an OPAL environment variable for the executables, just a
single OPAL_LIBDIR var for the libraries. (One set of 32-bit
executables runs with both 32-bit and 64-bit libraries.) I'm guessing
OPAL_LIBDIR will not work for you if you configure with a non-standard
--libdir option.

-Ethan




On Dec 23, 2008, at 3:58 PM, Ethan Mallove wrote:


I think the problem is that I am doing a multi-lib build. I have
32-bit libraries in lib/, and 64-bit libraries in lib/64. I  
assume I

do not see the issue for 32-bit tests, because all the dependencies
are where Open MPI expects them to be. For the 64-bit case, I tried
setting OPAL_LIBDIR to /opt/openmpi-relocated/lib/lib64, but no  
luck.
Given the below configure arguments, what do my OPAL_* env vars  
need
to be? (Also, could using --enable-orterun-prefix-by-default  
interfere

with OPAL_PREFIX?)

 $ ./configure CC=cc CXX=CC F77=f77 FC=f90  --with-openib
--without-udapl --disable-openib-ibcm --enable-heterogeneous
--enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default
--with-sge --enable-mpi-f90 --with-mpi-f90-size=small
--disable-mpi-threads --disable-progress-threads   --disable-debug
CFLAGS="-m32 -xO5" CXXFLAGS="-m32 -xO5" FFLAGS="-m32 -xO5" FCFLAGS="-m32 -xO5"
--prefix=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install
--mandir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/man
--libdir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/lib
--includedir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/include
--without-mx --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
--with-contrib-vt-flags="--prefix=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install
--mandir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/man
--libdir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/lib
--includedir=/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/include
LDFLAGS=-R/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/DGQx/install/lib"


 $ ./configure CC=cc CXX=CC F77=f77 FC=f90  --with-openib
--without-udapl --disable-openib-ibcm --enable-heterogeneous
--enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default
--with-sge --enable-mpi-f90 --with-mpi-f90-size=small
--disable-mpi-threads --disable-progress-threads   --disable-debug
CFLAGS="-m64 -xO5" CXXFLAGS="-m64 -xO5" FFLAGS="-m64 -xO5"

Re: [OMPI users] Crash in _int_malloc via MPI_Init

2008-06-15 Thread Brian Barrett

On Jun 15, 2008, at 2:20 PM, Dirk Eddelbuettel wrote:


Yup: I still suspect compiler / linker changes in Ubuntu between Gutsy
(released Oct 2007) and Hardy (April 2008).

Why? Because the exact same source package for Open MPI (as
maintained by
Manuel and myself for Debian) works for me on Ubuntu Hardy __if I  
compile it

on Ubuntu Gutsy__.

Now, I reported this to Ubuntu ... for no answer.  Lucas and  
Christoph at
Debian today released a feature allowing us Debian maintainers to see
which of our packages have bug reports in Ubuntu.  It was only through this
mechanism
that I learned that the segfault I saw with Rmpi (using Open MPI)  
had been
experienced by someone else, and that a similar bug occurs with  
Python use on

top of Open MPI.

But still no tangible answer from Canonical / Ubuntu other than some
reshuffling of bug reports titles and numbers.  Very disappointing.

I am CCing Steffen and Andreas who've seen similar bugs and are  
awaiting
answers too.  I am also CCing Cesare at Ubuntu who did the bug  
rearrangement,

maybe he will find a moment to share their plans with us.


I suppose I'm glad that it doesn't look like an Open MPI problem.  Due  
to continual problems with the ptmalloc2 code in Open MPI, we've  
decided that for v1.3, we'll extract that code out into its own  
library.  Users who need the malloc hooks for InfiniBand support
(only a small number of applications really benefit from it) will have  
to explicitly link in the extra library.  Hopefully, this will resolve  
some of these headaches.


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Memory manager

2008-05-20 Thread Brian Barrett

Terry -

Would you be willing to do an experiment with the memory allocator?   
There are two values we change to try to make IB run faster (at the  
cost of corner cases you're hitting).  I'm not sure one is strictly  
necessary, and I'm concerned that it's the one causing problems.  If  
you don't mind recompiling again, would you change line 64 in
opal/mca/memory/ptmalloc2/malloc.c from:


#define DEFAULT_MMAP_THRESHOLD (2*1024*1024)

to:

#define DEFAULT_MMAP_THRESHOLD (128*1024)

And then recompile with the memory manager, obviously.  That will make  
the mmap / sbrk cross-over point the same as the default allocator in  
Linux.  There's still one other tweak we do, but I'm almost 100%  
positive it's the threshold causing problems.
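
(An aside, not part of the suggestion above: ptmalloc2 is the same allocator
family as glibc's malloc, which exposes this cross-over point at run time via
mallopt().  Assuming a glibc-style malloc, the equivalent application-level
experiment would be roughly:)

  #include <malloc.h>   /* glibc-specific header */

  int main(void)
  {
      /* Same 128 KB mmap/sbrk cross-over as the DEFAULT_MMAP_THRESHOLD
         change above, but set at run time instead of compile time. */
      mallopt(M_MMAP_THRESHOLD, 128 * 1024);
      return 0;
  }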



Brian


On May 19, 2008, at 8:17 PM, Terry Frankcombe wrote:


To tell you all what noone wanted to tell me, yes, it does seem to be
the memory manager.  Compiling everything with
--with-memory-manager=none returns the vmem use to the more reasonable
~100MB per process (down from >8GB).

I take it this may affect my peak bandwidth over infiniband.  What's  
the

general feeling about how bad this is?


On Tue, 2008-05-13 at 13:12 +1000, Terry Frankcombe wrote:

Hi folks

I'm trying to run an MPI app on an infiniband cluster with OpenMPI
1.2.6.

When run on a single node, this app is grabbing large chunks of  
memory
(total per process ~8.5GB, including strace showing a single 4GB  
grab)
but not using it.  The resident memory use is ~40MB per process.   
When
this app is compiled in serial mode (with conditionals to remove  
the MPI

calls) the memory use is more like what you'd expect, 40MB res and
~100MB vmem.

Now I didn't write it so I'm not sure what extra stuff the MPI  
version

does, and we haven't tracked down the large memory grabs.

Could it be that this vmem is being grabbed by the OpenMPI memory
manager rather than directly by the app?

Ciao
Terry







--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Memory question and possible bug in 64bit addressing under Leopard!

2008-04-25 Thread Brian Barrett

On Apr 25, 2008, at 2:06 PM, Gregory John Orris wrote:


produces a core dump on a machine with 12Gb of RAM.

and the error message

mpiexec noticed that job rank 0 with PID 75545 on node mymachine.com
exited on signal 4 (Illegal instruction).

However, substituting in

float *X = new float[n];
for
float X[n];

Succeeds!



You're running off the end of the stack, because of the large amount  
of data you're trying to put there.  OS X by default has a tiny stack  
size, so codes that run on Linux (which defaults to a much larger  
stack size) sometimes show this problem.  Your best bets are either to  
increase the max stack size or (more portably) just allocate  
everything on the heap with malloc/new.
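
(A sketch of the heap-based version -- the size n is arbitrary and only
meant to be far larger than the default stack:)

  int main()
  {
      const long n = 500L * 1000 * 1000;  /* far larger than the ~8 MB
                                             default stack on OS X        */
      /* float X[n];                         would overflow the stack      */
      float *X = new float[n];            /* heap allocation instead       */
      X[0] = 1.0f;
      delete [] X;
      return 0;
  }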


Hope this helps,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Problems using Intel MKL with OpenMPI and Pathscale

2008-04-14 Thread Brian Barrett

On Apr 14, 2008, at 12:30 AM, Åke Sandgren wrote:

On Sun, 2008-04-13 at 08:00 -0400, Jeff Squyres wrote:

Do you get the same error if you disable the memory handling in Open
MPI?  You can configure OMPI with:

--disable-memory-manager


Doesn't help, it still compiles ptmalloc2 and trying to turn off
ptmaloc2 during runtime doesn't help either.


Jeff had the option slightly wrong.  It's actually:

  --without-memory-manager

Because this is a link-time decision, turning the memory manager code
on / off at runtime won't change anything in terms of interfering with
a compiler's own memory management code.


Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/





Re: [OMPI users] OSX undefined symbols when compiling hello world in cpp but not in c

2008-04-03 Thread Brian Barrett
QqJJlF.o
  typeinfo for MPI::Info in ccQqJJlF.o
  "MPI::Comm::Set_errhandler(MPI::Errhandler const&)", referenced from:
  vtable for MPI::Comm in ccQqJJlF.o
  vtable for MPI::Intracomm in ccQqJJlF.o
  vtable for MPI::Cartcomm in ccQqJJlF.o
  vtable for MPI::Graphcomm in ccQqJJlF.o
  vtable for MPI::Intercomm in ccQqJJlF.o
  "MPI::Win::Free()", referenced from:
  vtable for MPI::Win in ccQqJJlF.o
  "operator delete(void*)", referenced from:
  MPI::Datatype::~Datatype() in ccQqJJlF.o
  MPI::Datatype::~Datatype() in ccQqJJlF.o
  MPI::Status::~Status() in ccQqJJlF.o
  MPI::Status::~Status() in ccQqJJlF.o
  MPI::Request::~Request() in ccQqJJlF.o
  MPI::Request::~Request() in ccQqJJlF.o
  MPI::Request::~Request() in ccQqJJlF.o
  MPI::Prequest::~Prequest() in ccQqJJlF.o
  MPI::Prequest::~Prequest() in ccQqJJlF.o
  MPI::Grequest::~Grequest() in ccQqJJlF.o
  MPI::Grequest::~Grequest() in ccQqJJlF.o
  MPI::Group::~Group() in ccQqJJlF.o
  MPI::Group::~Group() in ccQqJJlF.o
  MPI::Comm_Null::~Comm_Null() in ccQqJJlF.o
  MPI::Comm_Null::~Comm_Null() in ccQqJJlF.o
  MPI::Comm_Null::~Comm_Null() in ccQqJJlF.o
  MPI::Win::~Win() in ccQqJJlF.o
  MPI::Win::~Win() in ccQqJJlF.o
  MPI::Errhandler::~Errhandler() in ccQqJJlF.o
  MPI::Errhandler::~Errhandler() in ccQqJJlF.o
  MPI::Comm::~Comm() in ccQqJJlF.o
  MPI::Comm::~Comm() in ccQqJJlF.o
  MPI::Comm::~Comm() in ccQqJJlF.o
  MPI::Intracomm::~Intracomm() in ccQqJJlF.o
  MPI::Intracomm::~Intracomm() in ccQqJJlF.o
  MPI::Intracomm::~Intracomm() in ccQqJJlF.o
  MPI::Info::~Info() in ccQqJJlF.o
  MPI::Info::~Info() in ccQqJJlF.o
  MPI::Intercomm::~Intercomm() in ccQqJJlF.o
  MPI::Intercomm::~Intercomm() in ccQqJJlF.o
  MPI::Intracomm::Clone() const in ccQqJJlF.o
  MPI::Cartcomm::~Cartcomm() in ccQqJJlF.o
  MPI::Cartcomm::~Cartcomm() in ccQqJJlF.o
  MPI::Graphcomm::~Graphcomm() in ccQqJJlF.o
  MPI::Graphcomm::~Graphcomm() in ccQqJJlF.o
  MPI::Cartcomm::Clone() const in ccQqJJlF.o
  MPI::Graphcomm::Clone() const in ccQqJJlF.o
  MPI::Op::~Op()  in ccQqJJlF.o
  MPI::Op::~Op()  in ccQqJJlF.o
  "MPI::FinalizeIntercepts()", referenced from:
  MPI::Finalize() in ccQqJJlF.o
  "MPI::COMM_WORLD", referenced from:
  __ZN3MPI10COMM_WORLDE$non_lazy_ptr in ccQqJJlF.o
ld: symbol(s) not found
collect2: ld returned 1 exit status


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] What architecture? X86_64, that's what architecture!

2008-03-14 Thread Brian Barrett

On Mar 10, 2008, at 9:15 PM, Jim Hill wrote:

I'm trying to build a 64-bit 1.2.5 on an 8-core Xeon Mac Pro running  
OS X 10.4.11, with the Portland Group's PGI Workstation 7.1-5  
tools.  The configure script works its magic with a couple of  
modifications to account for PGI's tendency to freak out about F90  
modules.  Upon make, though, I end up dying with a "What
architecture?" error in
opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c:128
because (I presume) a 64-bit Xeon
build isn't a PPC, a PPC64, or an X86.


Is this something that's been seen by others?  I'm not the world's  
greatest software stud and this is just a step along the path to my  
real objective, which is making my own software run on this beast  
machine of mine.


Suggestions, tips, and clever insults are welcome.  Thanks,


The configure script should have prevented that from happening (and  
indeed does with the GNU compilers).  I don't have a copy of the PGI  
compilers for OS X to test with, so I can't debug this without some  
more information.  What changes did you make to configure, what  
options did you specify to configure, and what was the full output of  
configure?


Thanks,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] OpenMPI 1.2.5 configure bug for POWERPC64 target

2008-03-02 Thread Brian Barrett

On Feb 27, 2008, at 8:34 AM, Jeff Squyres wrote:


On Feb 23, 2008, at 10:05 AM, Mathias PUETZ wrote:


2. Could someone explain, why configure might determine a different
ompi_cv_asm_format
   than stated in the asm-data.txt database ?
   Maybe the meaning of the cryptic assmebler format string is
explained somewhere.
   If so, could someone point me to the explanation ?



I have to defer to Brian on this one...



Sorry about the slow reply -- unfortunately, I don't have as much time  
to look at OPen MPI issues as I once did.


I have no idea -- likely the test doesn't cover some corner case.  The  
first question that needs to be asked is for the AIX / Power PC  
machine you're running on, what is the right answer (as an IBM  
employee, you're certainly more qualified to answer that than I am...).


Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Can't compile C++ program with extern "C" { #include mpi.h }

2008-01-01 Thread Brian Barrett

On Jan 1, 2008, at 12:47 AM, Adam C Powell IV wrote:


On Mon, 2007-12-31 at 20:01 -0700, Brian Barrett wrote:



Yeah, this is a complicated example, mostly because HDF5 should
really be covering this problem for you.  I think your only option at
that point would be to use the #define to not include the C++ code.

The problem is that the MPI standard *requires* mpi.h to include both
the C and C++ interface declarations if you're using C++.  There's no
way for the preprocessor to determine whether there's a currently
active extern "C" block, so there's really not much we can do.  Best
hope would be to get the HDF5 guys to properly protect their code
from C++...


Okay.  So in HDF5, since they call MPI from C, they're just using  
the C

interface, right?  So should they define OMPI_SKIP_MPICXX just in case
they're #included by C++ and using OpenMPI, or is there a more MPI
implementation-agnostic way to do it?



No, they should definitely not be disabling the C++ bindings inside
HDF5 -- that would be a situation worse than the current one.   
Consider the case where an application uses both HDF5 and the C++ MPI  
bindings.  It includes hdf5.h before mpi.h.  The hdf5.h includes  
mpi.h, without the C++ bindings.  The application then includes mpi.h,  
wanting the C++ bindings.  But the multiple inclusion protection in  
mpi.h means nothing happens, so no C++ bindings.


My comment about HDF5 was that it would be easiest if it protected its  
declarations with extern "C" when using C++.  This is what most  
packages that might be used with C++ do, and it works pretty well.   
I'd actually be surprised if modern versions of HDF5 didn't already do  
that.


Now that it's not New Years eve, I thought of what's probably the  
easiest solution for you.  Just include mpi.h (outside your extern "C"  
block) before hdf5.h.  The multiple inclusion protection in mpi.h will  
mean that the preprocessor removes everything from the mpi.h that's  
included from hdf5.h.  So the extern "C" around the hdf5.h shouldn't  
be too much of a problem.
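
(Concretely, the workaround is a sketch like the following, assuming the
usual hdf5.h header is available:)

  #include <mpi.h>     /* first, and outside any extern "C" block, so the
                          C++ bindings are declared normally               */
  extern "C" {
  #include <hdf5.h>    /* its internal include of mpi.h is now a no-op,
                          thanks to mpi.h's multiple-inclusion guard       */
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      /* ... HDF5 + MPI code ... */
      MPI_Finalize();
      return 0;
  }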


Hope this helps,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Can't compile C++ program with extern "C" { #include mpi.h }

2007-12-31 Thread Brian Barrett

On Dec 31, 2007, at 7:26 PM, Adam C Powell IV wrote:


Okay, fair enough for this test example.

But the Salomé case is more complicated:
extern "C"
{
#include <hdf5.h>
}
What to do here?  The hdf5 prototypes must be in an extern "C" block,
but hdf5.h #includes a file which #includes mpi.h...

Thanks for the quick reply!


Yeah, this is a complicated example, mostly because HDF5 should  
really be covering this problem for you.  I think your only option at  
that point would be to use the #define to not include the C++ code.


The problem is that the MPI standard *requires* mpi.h to include both  
the C and C++ interface declarations if you're using C++.  There's no  
way for the preprocessor to determine whether there's a currently  
active extern "C" block, so there's really not much we can do.  Best  
hope would be to get the HDF5 guys to properly protect their code  
from C++...



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/





Re: [OMPI users] Can't compile C++ program with extern "C" { #include mpi.h }

2007-12-31 Thread Brian Barrett

On Dec 31, 2007, at 7:12 PM, Adam C Powell IV wrote:


I'm trying to build the Salomé engineering simulation tool, and am
having trouble compiling with OpenMPI.  The full text of the error  
is at
http://lyre.mit.edu/~powell/salome-error .  The crux of the problem  
can

be reproduced by trying to compile a C++ file with:

extern "C"
{
#include "mpi.h"
}

At the end of mpi.h, the C++ headers get loaded while in extern C  
mode,

and the result is a vast list of errors.


Yes, it will.  Similar to other external packages (like system  
headers), you absolutely should not include mpi.h from an extern "C"  
block.  It will fail, as you've noted.  The proper solution is to not  
be in an extern "C" block when including mpi.h.


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/





Re: [OMPI users] Compiling 1.2.4 using Intel Compiler 10.1.007 on Leopard

2007-12-15 Thread Brian Barrett
I finally had a chance to look at this (since the same things are  
happening with LAM as well).  The base issue is that Intel's compiler  
is completely borked.  I can't fathom how a company could release a  
product that fundamentally broken.  That's all good, except that  
recent versions of Autoconf expect that the compiler at least kind of  
works without special CFLAGS for some of its tests (like does the  
compiler understand -g or does -E invoke the preprocessor) -- not an  
unreasonable assumption.  Configure is getting those answers wrong  
because that's suddenly not true.  It makes a wrong choice about  
requiring some Autoconf compatibility scripts to build, which don't  
work on OS X (probably because they aren't usually needed, so not well  
tested)


A hackish fix is to set CC to "icc -no-multibyte-chars" and CXX to
"icpc -no-multibyte-chars" instead of setting -no-multibyte-chars in
CFLAGS.  With those parameters, I was able to successfully build Open
MPI (and applications against Open MPI).  Hopefully Intel can fix  
their compilers before this causes too many more issues.  How you ship  
an (expensive!) compiler that just flat out doesn't work is beyond me.


Brian

On Dec 12, 2007, at 11:18 AM, Warner Yuen wrote:


Hi Jeff,

It seems that the problems are partially the compiler's fault; maybe
the updated compilers didn't catch all the problems filed against
the last release? Why else would I need to add the "-no-multibyte-chars"
flag for pretty much everything that I build with ICC? Also,
it's odd that I have to use /lib/cpp when using Intel ICC/ICPC
whereas with GCC things just find their way correctly. Again, IFORT
and GCC together seem fine. Lastly... not that I use these... but
MPICH-2.1 and MPICH-1.2.7 for Myrinet built just fine.


Here are the output files:






Warner Yuen
Scientific Computing Consultant
Apple Computer
email: wy...@apple.com
Tel: 408.718.2859
Fax: 408.715.0133


On Dec 12, 2007, at 9:00 AM, users-requ...@open-mpi.org wrote:


--

Message: 1
Date: Wed, 12 Dec 2007 06:50:03 -0500
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Problems compiling 1.2.4 using Intel
Compiler10.1.006 on Leopard
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <43bb0bce-e328-4d3e-ae61-84991b27f...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

My primary work platform is a MacBook Pro, but I don't specifically
develop for OS X, so I don't have any special compilers.

Sorry to ask this because I think the information was sent before,  
but

could you send all the compile/failure information?  
http://www.open-mpi.org/community/help/





--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Open MPI 1.2.4 verbosity w.r.t. osc pt2pt

2007-10-16 Thread Brian Barrett

On Oct 16, 2007, at 11:56 AM, Jeff Squyres wrote:


On Oct 16, 2007, at 11:20 AM, Brian Granger wrote:


Wow, that is quite a study of the different options.  I will spend
some time looking over things to better understand the (complex)
situation.  I will also talk with Lisandro Dalcin about what he  
thinks

the best approach is for mpi4py.  One question though.  You said that
nothing had changed in this respect from 1.2.3 to 1.2.4, but 1.2.3
doesn't show the problem.  Does this make sense?


I wondered about that as well.  Is there any chance that you simply
weren't using the one-sided MPI functionality between your different
versions?

Or are you using the same version of your software with v1.2.3 and
v1.2.4 of OMPI?  If so, I'm kinda at a loss.  :-(

FWIW: our one-sided support loads lazily; it doesn't load during
MPI_INIT like most of the rest of the plugins that we have.  Since
not many MPI applications use it, we decided to make it only load the
osc plugins the first time an MPI window is created.


Actually, I never wrote the lazy open code, so we load the components  
during MPI_INIT.  They aren't initialized until first use, but they  
are loaded.  Just to verify, I did a build of 1.2.3 and of 1.2.4 and  
there's no difference in the list of undefined symbols or library  
references between the pt2pt osc components in the two builds.


Still at a loss for why the change between releases -- I would not  
have expected it to work with 1.2.3.


Brian.


Re: [OMPI users] Open MPI 1.2.4 verbosity w.r.t. osc pt2pt

2007-10-10 Thread Brian Barrett

On Oct 10, 2007, at 1:27 PM, Dirk Eddelbuettel wrote:

| Does this happen for all MPI programs (potentially only those that
| use the MPI-2 one-sided stuff), or just your R environment?

This is the likely winner.

It seems indeed due to R's Rmpi package. Running a simple mpitest.c  
shows no
error message. We will look at the Rmpi initialization to see what  
could

cause this.


Does rmpi link in libmpi.so or dynamically load it at run-time?  The  
pt2pt one-sided component uses the MPI-1 point-to-point calls for  
communication (hence, the pt2pt name). If those symbols were  
unavailable (say, because libmpi.so was dynamically loaded) I could  
see how this would cause problems.


The pt2pt component (rightly) does not have a -lmpi in its link  
line.  The other components that use symbols in libmpi.so (wrongly)  
do  have a -lmpi in their link line.  This can cause some problems on  
some platforms (Linux tends to do dynamic linking / dynamic loading  
better than most).  That's why only the pt2pt component fails.


My guess is that Rmpi is dynamically loading libmpi.so, but not  
specifying the RTLD_GLOBAL flag.  This means that libmpi.so is not  
available to the components the way it should be, and all goes  
downhill from there.  It only mostly works because we do something  
silly with how we link most of our components, and Linux is just  
smart enough to cover our rears (thankfully).


Solutions:

  - Someone could make the pt2pt osc component link in libmpi.so
like the rest of the components and hope that no one ever
tries this on a non-friendly platform.
  - Debian (and all Rmpi users) could configure Open MPI with the
 --disable-dlopen flag and ignore the problem.
  - Someone could fix Rmpi to dlopen libmpi.so with the RTLD_GLOBAL
flag and fix the problem properly.

I think it's clear I'm in favor of Option 3.
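
(For reference, option 3 amounts to something like the following in the
code that loads the MPI library -- a sketch only; the library name,
function name and error handling are illustrative:)

  #include <dlfcn.h>
  #include <stdio.h>

  static int load_libmpi(void)
  {
      /* RTLD_GLOBAL is the key: it makes libmpi's symbols visible to the
         components that Open MPI itself dlopen()s later. */
      void *handle = dlopen("libmpi.so.0", RTLD_NOW | RTLD_GLOBAL);
      if (handle == NULL) {
          fprintf(stderr, "dlopen failed: %s\n", dlerror());
          return -1;
      }
      return 0;
  }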

Brian


Re: [OMPI users] aclocal.m4 booboo?

2007-09-28 Thread Brian Barrett

On Sep 27, 2007, at 6:44 PM, Mostyn Lewis wrote:


Today's SVN.

A generated configure has this in it:


I'm not able to replicate this using an SVN checkout of the trunk --  
you might want to make sure you have a proper install of all the  
autotools.  If you are using another branch from SVN, you can not use  
recent CVS copies of Libtool, you'll have to use the same version  
specified here:


http://www.open-mpi.org/svn/building.php

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Open MPI on 64 bits intel Mac OS X

2007-09-28 Thread Brian Barrett


On Sep 28, 2007, at 4:56 AM, Massimo Cafaro wrote:


Dear all,

when I try to compile my MPI code on 64-bit Intel Mac OS X, the
build fails since the Open MPI library has been compiled in 32-bit
mode. Can you please provide in the next version the ability at
configure time to choose between 32 and 64 bits, or even better,
compile both modes by default?


To reproduce the problem, simply compile an MPI application on 64-bit
Intel Mac OS X using mpicc -arch x86_64. The 64-bit linker
complains as follows:


ld64 warning: in /usr/local/mpi/lib/libmpi.dylib, file is not of  
required architecture
ld64 warning: in /usr/local/mpi/lib/libopen-rte.dylib, file is not  
of required architecture
ld64 warning: in /usr/local/mpi/lib/libopen-pal.dylib, file is not  
of required architecture


and a number of undefined symbols is shown, one for each MPI  
function used in the application.


This is already possible.  Simply use the configure options:

  ./configure ... CFLAGS="-arch x86_64" CXXFLAGS="-arch x86_64"  
OBJCFLAGS="-arch x86_64"


also set FFLAGS and FCFLAGS to "-m64" if you have gfortran/g95  
compiler installed.  The common installs of either don't speak the
-arch option, so you have to use the more traditional -m64.


Hope this helps,

Brian


Re: [OMPI users] readv failed with errno=104

2007-09-25 Thread Brian Barrett

On Sep 25, 2007, at 4:25 AM, Rayne wrote:


Hi all, I'm using the SGE system on my school network,
and would like to know if the errors I received below
mean there's something wrong with my MPI_Recv
function.

[0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=104
[0,1,2][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=104


Generally, these indicate that the remote process has died.   
Generally, that means an abnormal termination due to segmentation  
faults or the like.  You might want to run the code under a debugger  
to see if it shows anything useful.  If your cluster doesn't have a  
parallel debugger like TotalView or DDT available, you can (for small  
numbers of processes) get away with using xterm and gdb, something like:


  mpirun -np X -d xterm -e gdb <your_app>

It'll open X xterms, each with a gdb running one instance of the  
application.



Good luck,

Brian


Re: [OMPI users] OpenMPI on Cray XT4 CNL

2007-09-25 Thread Brian Barrett

On Sep 25, 2007, at 1:37 PM, Richard Graham wrote:

Josh Hursey did the port of Open MPI to CNL.  Here is the config  
line I have used to build

 on the Cray XT4:

./configure CC=/opt/xt-pe/default/bin/snos64/linux-pgcc
CXX=/opt/xt-pe/default/bin/snos64/linux-pgCC
F77=/opt/xt-pe/default/bin/snos64/linux-pgf90
FC=/opt/xt-pe/default/bin/snos64/linux-pgf77
CFLAGS=-I/opt/xt-pe/default/include/ CPPFLAGS=-I/opt/xt-pe/default/include/
FCFLAGS=-I/opt/xt-pe/default/include/ FFLAGS=-I/opt/xt-pe/default/include/
LDFLAGS=-L/opt/xt-mpt/default/lib/snos64/ LIBS=-lpct -lalpslli -lalpsutil
--build=x86_64-unknown-linux-gnu --host=x86_64-cray-linux-gnu
--with-platform=../contrib/platform/cray_xt3_romio
--with-io-romio-flags=--disable-aio build_alias=x86_64-unknown-linux-gnu
host_alias=x86_64-cray-linux-gnu --enable-ltdl-convenience --no-recursion
--prefix=/na2_apps/OpenMPI/xt-2.0.20/1.2/ompi/P2


I believe, however, that you need to use one of the Open MPI 1.2.4  
release candidates or the nightly tarballs from the 1.2 or trunk  
branches.  There are some known issues with the 1.2.3 release on the  
Cray XT platform that have since been resolved.


Brian


Re: [OMPI users] another mpirun + xgrid question

2007-09-17 Thread Brian Barrett

On Sep 10, 2007, at 1:35 PM, Lev Givon wrote:


When launching an MPI program with mpirun on an xgrid cluster, is
there a way to cause the program being run to be temporarily copied to
the compute nodes in the cluster when executed (i.e., similar to  
what the

xgrid command line tool does)? Or is it necessary to make the program
being run available on every compute node (e.g., using NFS data
partions)?


This is functionality we never added to our XGrid support.  It  
certainly could be added, but we have an extremely limited supply of  
developer cycles for the XGrid support at the moment.



Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] running jobs on a remote XGrid cluster via mpirun

2007-08-28 Thread Brian Barrett

On Aug 28, 2007, at 10:59 AM, Lev Givon wrote:


Received from Brian Barrett on Tue, Aug 28, 2007 at 12:22:29PM EDT:

On Aug 27, 2007, at 3:14 PM, Lev Givon wrote:

I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate  
Mac
client that I am using to submit jobs to the head (controller)  
node of
the cluster. The cluster's compute nodes are all connected to the  
head

node via a private network and are not running any firewalls. When I
try running jobs with mpirun directly on the cluster's head node,  
they
execute successfully; if I attempt to submit the jobs from the  
client
(which can run jobs on the cluster using the xgrid command line  
tool)
with mpirun, however, they appear to hang indefinitely (i.e., a  
job ID
is created, but the mpirun itself never returns or terminates).  
Is it

nececessary to configure the firewall on the submission client to
grant access to the cluster head node in order to remotely submit  
jobs

to the cluster's head node?


Currently, every node on which an MPI process is launched must be
able to open a connection to a random port on the machine running
mpirun.  So in your case, you'd have to configure the network on the
cluster to be able to connect back to your workstation (and the
workstation would have to allow connections from all your cluster
nodes). Far from ideal, but it's what it is.

Brian


Can this be avoided by submitting the "mpirun -n 10 myProg" command
directly to the controller node with the xgrid command line tool? For
some reason, sending the above command to the cluster results in a
"task: failed with status 255" error even though I can successfully
run other programs or commands to the cluster with the xgrid tool.  I
know that OpenMPI on the cluster is running properly because I can run
programs with mpirun successfully when logged into the controller node
itself.


Open MPI was designed to be the one calling XGrid's scheduling  
algorithm, so I'm pretty sure that you can't submit a job that just  
runs Open MPI's mpirun.  That wasn't really in our original design  
space as an option.


Brian


Re: [OMPI users] running jobs on a remote XGrid cluster via mpirun

2007-08-28 Thread Brian Barrett

On Aug 27, 2007, at 3:14 PM, Lev Givon wrote:


I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate Mac
client that I am using to submit jobs to the head (controller) node of
the cluster. The cluster's compute nodes are all connected to the head
node via a private network and are not running any firewalls. When I
try running jobs with mpirun directly on the cluster's head node, they
execute successfully; if I attempt to submit the jobs from the client
(which can run jobs on the cluster using the xgrid command line tool)
with mpirun, however, they appear to hang indefinitely (i.e., a job ID
is created, but the mpirun itself never returns or terminates). Is it
nececessary to configure the firewall on the submission client to
grant access to the cluster head node in order to remotely submit jobs
to the cluster's head node?


Currently, every node on which an MPI process is launched must be  
able to open a connection to a random port on the machine running  
mpirun.  So in your case, you'd have to configure the network on the  
cluster to be able to connect back to your workstation (and the  
workstation would have to allow connections from all your cluster  
nodes).  Far from ideal, but it's what it is.


Brian


Re: [OMPI users] failure to link on macosx

2007-08-24 Thread Brian Barrett

On Aug 24, 2007, at 10:57 AM, Marwan Darwish wrote:

I keep on getting the following link error when compiling lam-mpi
on Mac OS X (in release mode).
Would moving to open-mpi resolve such issues? Anybody with
experience in this?


Moving to Open MPI will work around this issue.  Another option (if  
you're not using Myrinet/GM) would be to compile LAM with the
--without-memory-manager option to configure.


Hope this helps,

Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] MPI_FILE_NULL

2007-08-23 Thread Brian Barrett

On Aug 23, 2007, at 4:33 AM, Bernd Schubert wrote:

I need to compile a benchmarking program and so far have absolutely no
experience with any MPI.
However, this looks like a general open-mpi problem, doesn't it?

bschubert@lanczos MPI_IO> make
cp ../globals.f90 ./; mpif90 -O2 -c ../globals.f90
mpif90 -O2 -c main.f90
mpif90 -O2 -c reader.f90
fortcom: Error: reader.f90, line 24: This name does not have a  
type, and

must have an explicit type.   [MPI_FILE_NULL]
call MPI_File_set_errhandler (MPI_FILE_NULL, MPI_ERRORS_ARE_FATAL,  
ierror)


Yeah, that looks like a mistake on our part.  It will be fixed in  
Open MPI 1.2.4.  Your quick fix should work until then.



Thanks,

Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] Error: error in configure (maybe libtool)

2007-08-22 Thread Brian Barrett

On Aug 22, 2007, at 2:35 PM, Higor de Padua Vieira Neto wrote:


At the end of the output file, it just shows this:
" (...lot of output ...)
config.status: creating opal/include/opal_config.h
config.status: creating orte/include/orte_config.h
config.status: orte/include/orte_config.h is unchanged
config.status: creating ompi/include/ompi_config.h
config.status: ompi/include/ompi_config.h is unchanged
config.status: creating ompi/include/mpi.h
config.status: ompi/include/mpi.h is unchanged
config.status: executing depfiles commands
config.status: executing libtool commands
/bin/rm: cannot lstat `libtoolT': No such file or directory"
end.

I don't know why this happened, but I've read the output file and I  
didn't find anything strange.


Wow, that's pretty cool -- I haven't seen this one before.  Two  
things come to mind -- if you're on a shared file system, are all  
your clocks synchronized?  And any chance you ran out of disk space?   
Can you try running configure again and see if the same thing happens  
again?



Thanks,

Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] building static and shared OpenMPI libraries on MacOSX

2007-08-22 Thread Brian Barrett

On Aug 21, 2007, at 10:52 PM, Lev Givon wrote:


(Running ompi_info after installing the build confirms the absence of
said components). My concern, unsurprisingly, is motivated by a desire
to use OpenMPI on an xgrid cluster (i.e., not with rsh/ssh); unless I
am misconstruing the above observations, building OpenMPI with
--enable-static seems to preclude this. Should xgrid functionality
still be present when OpenMPI is built with --enable-static?


Ah, yes.  Due to some issues with our build system, you have to build
shared libraries to use the XGrid support.



Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] building static and shared OpenMPI libraries on MacOSX

2007-08-22 Thread Brian Barrett

On Aug 21, 2007, at 3:32 PM, Lev Givon wrote:

configure: WARNING: *** Shared libraries have been disabled (--disable-shared)
configure: WARNING: *** Building MCA components as DSOs automatically disabled

checking which components should be static... none
checking for projects containing MCA frameworks... opal, orte, ompi

Specifying --enable-shared --enable-static results in the same
behavior, incidentally. Is the above to be expected?


Yes, this is expected.  This is just a warning that we build  
components into the library rather than as run-time loadable  
components when static libraries are enabled.  This is probably not  
technically necessary on Linux and OS X, but in general is the  
easiest thing for us to do.  So you should have a perfectly working  
build with this setup.



Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] values of mca parameters whilst running program

2007-08-02 Thread Brian Barrett

On Aug 2, 2007, at 4:22 PM, Glenn Carver wrote:


Hopefully an easy question to answer... is it possible to get at the
values of mca parameters whilst a program is running?   What I had in
mind was either an open-mpi function to call which would print the
current values of mca parameters or a function to call for specific
mca parameters. I don't want to interrupt the running of the
application.

Bit of background. I have a large F90 application running with
OpenMPI (as Sun Clustertools 7) on Opteron CPUs with an IB network.
We're seeing swap thrashing occurring on some of the nodes at times
and having searched the archives and read the FAQ believe we may be
seeing the problem described in:
http://www.open-mpi.org/community/lists/users/2007/01/2511.php
where the udapl free list is growing to a point where lockable  
memory runs out.


Problem is, I have no feel for the kinds of numbers  that
"btl_udapl_free_list_max" might safely get up to?  Hence the request
to print mca parameter values whilst the program is running to see if
we can tie in high values of this parameter to when we're seeing swap
thrashing.


Good news, the answer is easy.  Bad news is, it's not the one you  
want.  btl_udapl_free_list_max is the *greatest* the list will ever  
be allowed to grow to, not its current size.  So if you don't
specify a value and use the default of -1, it will return -1 for the
life of the application, regardless of how big those free lists
actually get.  If you specify value X, it'll return X for the life of
the application, as well.


There is not a good way for a user to find out the current size of a  
free list or the largest it got for the life of an application  
(currently those two will always be the same, but that's another  
story).  Your best bet is to set the parameter to some value (say,  
128 or 256) and see if that helps with the swapping.



Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] Problem building openmpi 1.2.3 on RHEL 5

2007-07-26 Thread Brian Barrett

On Jul 26, 2007, at 7:43 PM, Mathew Binkley wrote:


../../libtool: line 460: CDPATH: command not found
libtool: Version mismatch error.  This is libtool 2.1a, but the
libtool: definition of this LT_INIT comes from an older release.
libtool: You should recreate aclocal.m4 with macros from libtool 2.1a
libtool: and run autoconf again.
make[2]: *** [asm.lo] Error 1
make[2]: Leaving directory


It kind of looks like the Makefiles decided to regenerate configure  
and ended up with a bad build.  Did you start with a clean tarball?   
If not, can you try from a clean tarball and record the entire output  
of make?


Thanks!

Brian


Re: [OMPI users] MPI_File_set_view rejecting subarray views.

2007-07-19 Thread Brian Barrett

On Jul 19, 2007, at 3:24 PM, Moreland, Kenneth wrote:


I've run into a problem with the File I/O with openmpi version 1.2.3.
It is not possible to call MPI_File_set_view with a datatype created
from a subarray.  Instead of letting me set a view of this type, it
gives an invalid datatype error.  I have attached a simple program  
that

demonstrates the problem.  In particular, the following sequence of
function calls should be supported, but they are not.

  MPI_Type_create_subarray(3, sizes, subsizes, starts,
                           MPI_ORDER_FORTRAN, MPI_BYTE, &view);
  MPI_File_set_view(fd, 20, MPI_BYTE, view, "native", MPI_INFO_NULL);

After poking around in the source code a bit, I discovered that the  
I/O
implementation actually supports the subarray data type, but there  
is a

check that is issuing an error before the underlying I/O layer (ROMIO)
has a chance to handle the request.


You need to commit the datatype after calling  
MPI_Type_create_subarray.  If you add:


  MPI_Type_commit(&view);

after the Type_create, but before File_set_view, the code will run to  
completion.


Well, the code will then complain about a Barrier after MPI_Finalize  
due to an error in how we shut down when there are files that have  
been opened but not closed (you should also add a call to  
MPI_File_close after the set_view, but I'm assuming it's not there  
because this is a test code).  This is something we need to fix, but  
also signifies a user error.
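
(Putting the pieces together, the working sequence looks roughly like this
sketch -- error checking omitted, the file name is illustrative, and
sizes/subsizes/starts are as in the attached test program:)

  #include <mpi.h>

  void write_view(int sizes[3], int subsizes[3], int starts[3])
  {
      MPI_Datatype view;
      MPI_File fd;

      MPI_File_open(MPI_COMM_WORLD, "data.bin",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fd);
      MPI_Type_create_subarray(3, sizes, subsizes, starts,
                               MPI_ORDER_FORTRAN, MPI_BYTE, &view);
      MPI_Type_commit(&view);                  /* the missing call          */
      MPI_File_set_view(fd, 20, MPI_BYTE, view, "native", MPI_INFO_NULL);
      /* ... MPI_File_write_all(...) ... */
      MPI_File_close(&fd);                     /* avoids the shutdown       */
      MPI_Type_free(&view);                    /* complaint mentioned above */
  }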



Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] Problems running openmpi under os x

2007-07-19 Thread Brian Barrett
ir=/usr/lib --build=powerpc-apple-darwin8 --with-
arch=nocona --with-tune=generic --program-prefix= --host=i686-apple-
darwin8 --target=i686-apple-darwin8
Thread model: posix
gcc version 4.0.1 (Apple Computer, Inc. build 5367)

Tim



On 12/07/2007, at 7:57 AM, Brian Barrett wrote:


That's unexpected.  If you run the command 'ompi_info --all', it
should list (towards the top) things like the Bindir and Libdir.   
Can
you see if those have sane values?  If they do, can you try  
running a

simple hello, world type MPI application (there's one in the OMPI
tarball).  It almost looks like memory is getting corrupted, which
would be very unexpected that early in the process.  I'm unable to
duplicate the problem with 1.2.3 on my Mac Pro, making it all the
more strange.

Another random thought -- Which compilers did you use to build
Open MPI?

Brian


On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote:



 Open MPI: 1.2.3
Open MPI SVN revision: r15136
 Open RTE: 1.2.3
Open RTE SVN revision: r15136
 OPAL: 1.2.3
    OPAL SVN revision: r15136
   Prefix: /usr/local
  Configured architecture: i386-apple-darwin8.10.1

Hi Brian,

1.2.3 downloaded and built from source.

Tim

On 12/07/2007, at 12:50 AM, Brian Barrett wrote:


Which version of Open MPI are you using?

Thanks,

Brian

On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote:



I have a problem running openmpi under OS 10.4.10. My program  
runs

fine under debian x86_64 on an opteron but under OS X on a number
of Mac Book and Mac Book Pros, I get the following immediately on
startup. This smells like a common problem but I couldn't find
anything relevant anywhere. Can anyone provide a hint or better
yet
a solution?

Thanks,

Tim


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000c
0x04510412 in free ()
(gdb) where
#0  0x04510412 in free ()
#1  0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "$
{prefix}") at base/installdirs_base_expand.c:67
#2  0x05d24584 in opal_installdirs_base_open () at base/
installdirs_base_components.c:94
#3  0x05d01a40 in opal_init_util () at runtime/opal_init.c:150
#4  0x05d01b24 in opal_init () at runtime/opal_init.c:200
#5  0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74,
requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219
#6  0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at
init.c:71
#7  0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1,
argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83
#8  0x4163 in main (argc=1, argv=0xbfffde74) at apps/
cimager.cc:
155


------------------------------------------------------------------------
Tim Cornwell,  Australia Telescope National Facility, CSIRO
Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122,
AUSTRALIA
Post: PO Box 76, Epping, NSW 1710, AUSTRALIA
Phone:+61 2 9372 4261   Fax:  +61 2 9372 4450 or 4310
Mobile:  +61 4 3366 5399
Email:tim.cornw...@csiro.au
URL:  http://www.atnf.csiro.au/people/tim.cornwell
------------------------------------------------------------------------



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] DataTypes with "holes" for writing files

2007-07-16 Thread Brian Barrett

I wouldn't worry about it.  1.2.3 has no ROMIO fixes over 1.2.2.

Brian

On Jul 16, 2007, at 9:42 AM, jody wrote:


Brian,

I am using OpenMPI 1.2.2, so I am lagging a bit behind.
Should I update to 1.2.3 and do the test again?

Thanks for the info

Jody


On 7/16/07, Brian Barrett <bbarr...@lanl.gov> wrote: Jody -

I usually update the ROMIO package before each major release (1.0,
1.1, 1.2, etc.) and then only within a major release series when a
bug is found that requires an update.  This seems to be one of those
times ;).  Just to make sure we're all on the same page, which
version of Open MPI are you currently using?  I've filed a bug report
(you'll get an e-mail about it) about updating ROMIO for the 1.2
series.  I'm not sure if it will make 1.2.4, but it could.

Thanks,

Brian

On Jul 16, 2007, at 12:45 AM, jody wrote:

> Rob, thanks for your info.
>
> Do you know whether OpenMPI will use a newer version
> of ROMIO sometime soon?
>
> Jody
>
> On 7/13/07, Robert Latham <r...@mcs.anl.gov> wrote:On Tue, Jul 10,
> 2007 at 04:36:01PM +, jody wrote:
> > Error: Unsupported datatype passed to  
ADIOI_Count_contiguous_blocks

> > [aim-nano_02:9] MPI_ABORT invoked on rank 0 in communicator
> > MPI_COMM_WORLD with errorcode 1
>
> Hi Jody:
>
> OpenMPI uses an old version of ROMIO.  You get this error because  
the

> ADIOI_Count_contiguous_blocks routine in this version of ROMIO does
> not understand all MPI datatypes.
>
> You can verify that this is the case by building your test against
> MPICH2, which should succeed.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science DivisionA215 0178 EA2D B059  
8CDF
> Argonne National Lab, IL USA B29D F333 664A 4280  
315B

> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] DataTypes with "holes" for writing files

2007-07-16 Thread Brian Barrett

Jody -

I usually update the ROMIO package before each major release (1.0,  
1.1, 1.2, etc.) and then only within a major release series when a  
bug is found that requires an update.  This seems to be one of those  
times ;).  Just to make sure we're all on the same page, which  
version of Open MPI are you currently using?  I've filed a bug report  
(you'll get an e-mail about it) about updating ROMIO for the 1.2  
series.  I'm not sure if it will make 1.2.4, but it could.


Thanks,

Brian

On Jul 16, 2007, at 12:45 AM, jody wrote:


Rob, thanks for your info.

Do you know whether OpenMPI will use a newer version
of ROMIO sometime soon?

Jody

On 7/13/07, Robert Latham  wrote:On Tue, Jul 10,  
2007 at 04:36:01PM +, jody wrote:

> Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks
> [aim-nano_02:9] MPI_ABORT invoked on rank 0 in communicator
> MPI_COMM_WORLD with errorcode 1

Hi Jody:

OpenMPI uses an old version of ROMIO.  You get this error because the
ADIOI_Count_contiguous_blocks routine in this version of ROMIO does
not understand all MPI datatypes.

You can verify that this is the case by building your test against
MPICH2, which should succeed.

==rob

--
Rob Latham
Mathematics and Computer Science DivisionA215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA B29D F333 664A 4280 315B
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] end-to-end data reliability

2007-07-16 Thread Brian Barrett

On Jul 15, 2007, at 10:05 PM, Isaac Huang wrote:


Hello, I read from the FAQ that current Open MPI releases don't
support end-to-end data reliability. But I still have some confusion
that can't be resolved by googling or reading the FAQ:

1. I read from "MPI - The Complete Reference" that "MPI provides the
user with reliable message transmission. A message sent is always
received correctly, and the user does not need to check for
transmission errors, timeouts, or other error conditions." But the
standard is sort of vague about what exactly this "reliable message
transmission" is. Does it at least require reliable delivery? Or, does
Open MPI notice and re-transmit lost data?


Yes, the MPI standard guarantees that messages are reliably delivered
in order.  MPI implementations have taken this to mean that if the
transport is "reliable", then the MPI doesn't have to do anything
special.  So we assume that TCP delivers data into our headers
properly, and the same for shared memory, Myrinet, and InfiniBand
(the RC protocol, anyway).  We also assume that any data sent arrives
on the other side.


We have an experimental point-to-point engine, DR, that provides  
reliable transportation even for networks that have corruption and/or  
packet loss.  The engine isn't available in a stable release, as it  
is still in the experimental phase.  Checksums and timers are used to  
detect message corruption and recover.  This allows us to play with  
non-reliable network protocols such as UDP or InfiniBand's UD protocol.
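
For anyone experimenting with a development build that includes it,
selection would presumably look something like the following; note
that the component name "dr" under the pml framework is my assumption
here, and again, it is not part of any stable release:

  mpirun --mca pml dr -np 2 ./a.out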


In truth, however, the reliability guaranteed by the transports
currently in use by Open MPI is more than enough to meet the needs
of almost all users.  Most of the supported networks have some type
of error detection or correction that provides protection only
slightly statistically worse than what we could provide within Open
MPI, but at a much lower cost.



2. When data corruption happens (in message data), is the data in
the message envelope still reliable? Or, does Open MPI or the MPI
standard guarantee data integrity of message envelopes? I'm
particularly interested in MPI_TAG which I use to encode things.


In my opinion, any guarantee that applies to the message applies to  
the meta-data (tag, source, length) as well.  The DR component will  
provide the same level of protection to the headers as it does to the  
payload.


Brian


--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] Problems running openmpi under os x

2007-07-11 Thread Brian Barrett
That's unexpected.  If you run the command 'ompi_info --all', it  
should list (towards the top) things like the Bindir and Libdir.  Can  
you see if those have sane values?  If they do, can you try running a  
simple hello, world type MPI application (there's one in the OMPI  
tarball).  It almost looks like memory is getting corrupted, which  
would be very unexpected that early in the process.  I'm unable to  
duplicate the problem with 1.2.3 on my Mac Pro, making it all the  
more strange.


Another random thought -- Which compilers did you use to build Open MPI?

Brian


On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote:



 Open MPI: 1.2.3
Open MPI SVN revision: r15136
 Open RTE: 1.2.3
Open RTE SVN revision: r15136
 OPAL: 1.2.3
OPAL SVN revision: r15136
   Prefix: /usr/local
  Configured architecture: i386-apple-darwin8.10.1

Hi Brian,

1.2.3 downloaded and built from source.

Tim

On 12/07/2007, at 12:50 AM, Brian Barrett wrote:


Which version of Open MPI are you using?

Thanks,

Brian

On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote:



I have a problem running openmpi under OS 10.4.10. My program runs
fine under debian x86_64 on an opteron but under OS X on a number
of Mac Book and Mac Book Pros, I get the following immediately on
startup. This smells like a common problem but I couldn't find
anything relevant anywhere. Can anyone provide a hint or better yet
a solution?

Thanks,

Tim


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000c
0x04510412 in free ()
(gdb) where
#0  0x04510412 in free ()
#1  0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "$
{prefix}") at base/installdirs_base_expand.c:67
#2  0x05d24584 in opal_installdirs_base_open () at base/
installdirs_base_components.c:94
#3  0x05d01a40 in opal_init_util () at runtime/opal_init.c:150
#4  0x05d01b24 in opal_init () at runtime/opal_init.c:200
#5  0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74,
requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219
#6  0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at
init.c:71
#7  0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1,
argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83
#8  0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc:
155


 
------------------------------------------------------------------------
Tim Cornwell,  Australia Telescope National Facility, CSIRO
Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122,  
AUSTRALIA

Post: PO Box 76, Epping, NSW 1710, AUSTRALIA
Phone:+61 2 9372 4261   Fax:  +61 2 9372 4450 or 4310
Mobile:  +61 4 3366 5399
Email:tim.cornw...@csiro.au
URL:  http://www.atnf.csiro.au/people/tim.cornwell
 
------------------------------------------------------------------------



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Brian Barrett
What Ralph said is generally true.  If your application completed,
this is nothing to worry about.  It means that an error occurred on
the socket between mpirun and some other process.  However, combined
with the tavor0 errors in the log files, it could mean that your
IPoIB network is acting flaky.  That would have me slightly
concerned.  Enough that I'd consider running some TCP stress tests on
the network to make sure it's acting normally.
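
For example, one quick sanity check, assuming iperf is available
(your IPoIB addresses below are 192.168.50.201/202), would be
something like:

  node1$ iperf -s
  node2$ iperf -c 192.168.50.201 -t 60

NetPIPE over TCP, or even a long transfer over the ibd1 interface,
would serve the same purpose.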


Hope this helps,

Brian

On Jul 10, 2007, at 11:32 AM, Ralph H Castain wrote:





On 7/10/07 11:08 AM, "Glenn Carver"   
wrote:



Hi,

I'd be grateful if someone could explain the meaning of this error
message to me and whether it indicates a hardware problem or
application software issue:

[node2:11881] OOB: Connection to HNP lost
[node1:09876] OOB: Connection to HNP lost


This message is nothing to be concerned about - all it indicates is  
that
mpirun exited before our daemon on your backend nodes did. It's  
relatively
harmless and probably should be eliminated in some future version  
(except

when developers are running in debug mode).

The message can appear when the timing changes between front and  
backend

nodes. What happens is:

1. mpirun detects that your processes have all completed. It then  
orders the

shutdown of the daemons on your backend nodes.

2. each daemon does an orderly shutdown. Just before it terminates,  
it tells

mpirun that it is done cleaning up and is about to exit

3. when mpirun hears that all daemons are done cleaning up, it  
exits itself.
This is where the timing issue comes into play - if mpirun exits  
before the

daemon, then you get that error message as the daemon is terminating.

So it's all a question of whether mpirun completes the last few  
steps to
exit before the daemons do. In most cases, the daemons complete  
first as
they have less to do. Sometimes, mpirun manages to get out first,  
and you

get the message.

I doubt it has anything to do with your hardware issues.  
Personally, I would
just ignore the message - I'll see it gets removed in later  
releases to

avoid unnecessary confusion.

Hope that helps
Ralph




I have a small cluster which until last week was just fine.
Unfortunately we were hit by a sudden power dip which brought the
cluster down and did significant damage to other servers (blew power
supplies and disk).  Although the cluster machines and the Infiniband
link are up and running jobs, I am now getting these errors in user
applications, which we've never had before.

The system messages file reports (for node2):
Jul  5 12:08:28 node1 genunix: [ID 408789 kern.notice] NOTICE:
tavor0: fault cleared external to device; service available
Jul  5 12:08:28 node1 genunix: [ID 451854 kern.notice] NOTICE:
tavor0: port 1 up
Jul  7 16:18:32 node1 genunix: [ID 408114 kern.info]
/pci@1,0/pci1022,7450@2/pci15b3,5a46@1/pci15b3,5a44@0 (tavor0) online
Jul  7 16:18:32 node1 ib: [ID 842868 kern.info] IB device:  
daplt@0, daplt0
Jul  7 16:18:32 node1 genunix: [ID 936769 kern.info] daplt0 is /ib/ 
daplt@0

Jul  7 16:18:32 node1 genunix: [ID 408114 kern.info] /ib/daplt@0
(daplt0) online
Jul  7 16:18:32 node1 genunix: [ID 834635 kern.info] /ib/daplt@0
(daplt0) multipath status: degraded, path
/pci@1,0/pci1022,7450@2/pci15
b3,5a46@1/pci15b3,5a44@0 (tavor0) to target address: daplt,0 is
online Load balancing: round-robin

I wonder if these messages are indicative of a hardware problem,
possibly on the Infiniband switch or the host adapters on the cluster
machines.  The cluster software has not been altered but there have
been small changes to the application codes. But I want to rule out
hardware issues because of the power dip first.

Anyone seen this message before and know whether to investigate
hardware first?  I did check the archives but it didn't help. More
info provided below.

Any help appreciate, thanks.

  Glenn

--
Details:
Cluster uses mix of Sun's X4100/X4200 machines linked with Sun
supplied Infiniband and host adapters. All machines are running
Solaris 10_x86 (11/06) with latest kernel patches
Software is Sun Clustertools 7.

Node2 $ ifconfig ibd1
ibd1: flags=1000843 mtu 2044  
index 3
 inet 192.168.50.202 netmask ff00 broadcast  
192.168.50.255


Node1 $ ifconfig ibd1
ibd1: flags=1000843 mtu 2044  
index 3
 inet 192.168.50.201 netmask ff00 broadcast  
192.168.50.255



ompi_info -a
 Open MPI: 1.2.1r14096-ct7b030r1838
Open MPI SVN revision: 0
 Open RTE: 1.2.1r14096-ct7b030r1838
Open RTE SVN revision: 0
 OPAL: 1.2.1r14096-ct7b030r1838
OPAL SVN revision: 0
MCA backtrace: printstack (MCA v1.0, API v1.0,  
Component v1.2.1)
MCA paffinity: solaris (MCA v1.0, API v1.0, Component  
v1.2.1)
MCA maffinity: first_use (MCA v1.0, API v1.0,  
Component v1.2.1)
   

Re: [OMPI users] warning:regcache incompatible with malloc

2007-07-10 Thread Brian Barrett

On Jul 10, 2007, at 11:40 AM, Scott Atchley wrote:


On Jul 10, 2007, at 1:14 PM, Christopher D. Maestas wrote:


Has anyone seen the following message with Open MPI:
---
warning:regcache incompatible with malloc
---



---

We don't see this message with mpich-mx-1.2.7..4


MX has an internal registration cache that can be enabled with
MX_RCACHE=1 or disabled with MX_RCACHE=0 (the default before MX-1.2.1
was off, and starting with 1.2.1 the default is on). If it is on, MX
checks to see if the application is trying to override malloc() and
other memory handling functions. If so, it prints the error that you
are seeing and fails to use the registration cache.

Open MPI can use the regcache if you set MX_RCACHE=2. This tells MX
to skip the malloc() check and use the cache regardless. In the case
of Open MPI, this is believed to be safe. That will not be true for
all applications.

MPICH-MX does not manage memory, so MX_RCACHE=1 is safe to use unless
the user's application manages memory.
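
For example, one way to export the variable to all ranks with Open
MPI, a sketch assuming mpirun's -x option for forwarding environment
variables:

  mpirun -x MX_RCACHE=2 -np 4 ./a.out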


Scott -

I'm having trouble getting the warning to go away with Open MPI.   
I've disabled our copy of ptmalloc2, so we're not providing a malloc  
anymore.  I'm wondering if there's also something with the use of  
DSOs to load libmyriexpress?  Is your belief that MX_RCACHE=2 is safe  
just for the BTL or for the MTL as well?


Brian


--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] Unable to find any HCAs ..

2007-07-05 Thread Brian Barrett

On Jul 4, 2007, at 8:21 PM, Graham Jenkins wrote:


I'm using the openmpi-1.1.1-5.el5.x86_64 RPM on a Scientific Linux 5
cluster, with no installed HCAs. And a simple MPI job submitted to  
that
cluster runs OK .. except that it issues messages for each node  
like the

one shown below.  Is there some way I can suppress these, perhaps by an
appropriate entry in /etc/openmpi-mca-params.conf ?

--
libibverbs: Fatal: couldn't open sysfs class 'infiniband_verbs'.
-- 


[0,1,0]: OpenIB on host localhost was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.


Yes, there is a line you can add to /etc/openmpi-mca-params.conf:

btl=^openib

will tell Open MPI to use any available btls (our network transport  
layer) except openib.
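
The same exclusion can also be given for a single run on the mpirun
command line, e.g.:

  mpirun --mca btl ^openib -np 4 ./a.out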


Hope this helps,

Brian


--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] making all library components static (questions about --enable-mcs-static)

2007-06-26 Thread Brian Barrett

On Jun 7, 2007, at 9:04 PM, Code Master wrote:


nction `_int_malloc':
: multiple definition of `_int_malloc'
/usr/lib/libopen-pal.a(lt1-malloc.o)(.text+0x18a0):openmpi-1.2.2/ 
opal/mca/memory/ptmalloc2/malloc.c:3954: first defined here
/usr/bin/ld: Warning: size of symbol `_int_malloc' changed from  
1266 in /usr/lib/libopen- pal.a(lt1-malloc.o) to 1333 in /home/ 
490_research/490/src/mpi.optimized_profiling//lib/libopen-pal.a(lt1- 
malloc.o)



so what could go wrong here?

Is it because openmpi has internal implementations of system-
provided functions (such as malloc) that are also used in my  
program, but the one the client program use is provided by the  
system whereas the one in the library has a different internal  
implementation?


In such case, how could I do the static linking in my client  
program?  I really need static linking as far as possible to do the  
profiling.


Yup, you guessed right.  The easiest solution is to compile Open MPI  
without the memory manager code.  This disables some optimizations  
for InfiniBand (OpenFabrics and MVAPI) and Myrinet/GM, but for other  
networks has no impact.  You can disable the memory manager with the
--without-memory-manager option to configure.
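
For example, a sketch of such a build (the prefix here is just a
placeholder; the static-linking switches match what you're already
after):

  ./configure --prefix=/usr/local/openmpi \
              --enable-static --disable-shared \
              --without-memory-manager
  make all install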



Hope this helps,

Brian

--
  Brian W. Barrett
  Networking Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] v1.2.2 mca base unable to open pls/ras tm

2007-06-08 Thread Brian Barrett
Or tell Open MPI not to build torque support, which can be done at  
configure time with the --without-tm option.


Open MPI tries to build support for whatever it finds in the default  
search paths, plus whatever things you specify the location of.  Most  
of the time, this is what the user wants.  In this case, however,  
it's not what you wanted so you'll have to add the --without-tm option.
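
For example (the prefix and any other options stay as you normally use
them):

  ./configure --prefix=/opt/openmpi --without-tm
  make all install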


Hope this helps,

Brian


On Jun 8, 2007, at 1:08 PM, Cupp, Matthew R wrote:


So I either have to uninstall torque, make the shared libraries
available on all nodes, or have torque as static libraries on the head
node?

__
Matt Cupp
Battelle Memorial Institute
Statistics and Information Analysis


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-bounces@open- 
mpi.org] On

Behalf Of Jeff Squyres
Sent: Friday, June 08, 2007 2:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] v1.2.2 mca base unable to open pls/ras tm

On Jun 8, 2007, at 2:06 PM, Cupp, Matthew R wrote:

Yes.  But the /opt/torque directory is just the source, not the  
actual

installed directory.  The actual installed directory on the head
node is
the default location of /usr/lib/something.  And that is not
accessible
by every node.

But should it matter if it's not accessible if I don't specify
--with-tm?  I was wondering if ./configure detects torque has been
installed, and then builds the associated components under the
assumption that it's available.


This is what OMPI does.

However, if you only have static libraries for Torque, the issue
should be moot -- the relevant bits should be statically linked into
the OMPI tm plugins.  But if your Torque libraries are shared, then
you do need to have them available on all nodes for OMPI to be able
to leverage native Torque/TM support.

Make sense?

--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Typo in r14829?

2007-06-01 Thread Brian Barrett

On Jun 1, 2007, at 12:15 PM, Bert Wesarg wrote:


Hello,

is the 'EGREP' a typo in the first hunk of r14829:

https://svn.open-mpi.org/trac/ompi/changeset/14829/trunk/config/ 
cxx_find_template_repository.m4


Gah!  Yes, it is.  Should be $GREP.  I'll fix this evening.


Thanks,

Brian


Re: [OMPI users] forcing MPI to bind all sockets to 127.0.0.1

2007-05-30 Thread Brian Barrett

Bill -

This is a known issue in all released versions of Open MPI.  I have a  
patch that hopefully will fix this issue in 1.2.3.  It's currently  
waiting on people in the Open MPI team to verify I didn't do
something stupid.


Brian

On May 29, 2007, at 9:59 PM, Bill Saphir wrote:



George,

This is one of the things I tried, and setting the oob
interface did not work,

with the error message below.

Also, per this thread:
http://www.open-mpi.org/community/lists/users/2007/05/3319.php
I believe it is oob_tcp_include, not oob_tcp_if_include. The latter  
is silently

ignored in 1.2, as far as I can tell.

Interestingly, telling the MPI layer to use lo0 (or to not use tcp  
at all) works fine.
But when I try to do the same for the OOB layer, it complains. The  
full error is:


[mymac.local:07001] [0,0,0] mca_oob_tcp_init: invalid address ''  
returned for selected oob interfaces.
[mymac.local:07001] [0,0,0] ORTE_ERROR_LOG: Error in file oob_tcp.c  
at line 1196


mpirun actually hangs at this point and no processes are spawned. I  
have to ^C to stop it.

I see this behavior on both Mac OS and on Linux with 1.2.2.

Bill


George Bosilca wrote:

There are 2 sets of sockets: one for the oob layer and one for the
MPI layer (at least if TCP support is enabled). Therefore, in order
to achieve what you're looking for you should add to the command line
"--mca oob_tcp_if_include lo0 --mca btl_tcp_if_include lo0".
On May 29, 2007, at 3:58 PM, Bill Saphir wrote:



- original message below ---


We have run into the following problem:

- start up Open MPI application on a laptop
- disconnect from network
- application hangs

I believe that the problem is that all sockets created by Open MPI  
are bound to the external network interface.
For example, when I start up a 2 process MPI job on my Mac (no  
hosts specified), I get the following tcp

connections. 192.168.5.2 is an address on my LAN.

tcp4   0  0  192.168.5.2.49459  192.168.5.2.49463   
ESTABLISHED
tcp4   0  0  192.168.5.2.49463  192.168.5.2.49459   
ESTABLISHED
tcp4   0  0  192.168.5.2.49456  192.168.5.2.49462   
ESTABLISHED
tcp4   0  0  192.168.5.2.49462  192.168.5.2.49456   
ESTABLISHED
tcp4   0  0  192.168.5.2.49456  192.168.5.2.49460   
ESTABLISHED
tcp4   0  0  192.168.5.2.49460  192.168.5.2.49456   
ESTABLISHED
tcp4   0  0  192.168.5.2.49456  192.168.5.2.49458   
ESTABLISHED
tcp4   0  0  192.168.5.2.49458  192.168.5.2.49456   
ESTABLISHED


Since this application is confined to a single machine, I would  
like it to use 127.0.0.1,
which will remain available as the laptop moves around. I am  
unable to force it to bind

sockets to this address, however.

Some of the things I've tried are:
- explicitly setting the hostname to 127.0.0.1 (--host 127.0.0.1)
- turning off the tcp btl (--mca btl ^tcp) and other variations (-- 
mca btl self,sm)

- using --mca oob_tcp_include lo0

The first two have no effect. The last one results in an error  
message of:
[myhost.locall:05830] [0,0,0] mca_oob_tcp_init: invalid address ''  
returned for selected oob interfaces.


Is there any way to force Open MPI to bind all sockets to 127.0.0.1?

As a side question -- I'm curious what all of these tcp  
connections are used for.  As I increase the number
of processes, it looks like there are 4 sockets created per MPI  
process, without using the tcp btl.

Perhaps stdin/out/err + control?

Bill




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI on shared memory.

2007-05-29 Thread Brian Barrett

On May 29, 2007, at 12:25 PM, smai...@ksu.edu wrote:


 I am doing research on parallel computing on shared memory with a
NUMA architecture. The system is a 4 node AMD opteron with each node
being a dual-core. I am testing an OpenMPI program with MPI-nodes <=
MAX cores available on system (in my case 4*2=8). Can someone tell me
whether:
a) In such cases (where MPI-nodes<=MAX cores on shared-memory),  
OpenMPI

implements MPI-nodes as processes or threads? If yes, then how can it
be determined at run-time? I am wondering because processes have more
overhead than light-weight threads.


In Open MPI, different MPI ranks are always different processes.   
This is what users expect, and I'd be hesitant to change that for the  
over-subscription case.
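
A quick way to confirm this at run time (a small standalone sketch,
not anything provided by Open MPI) is to have each rank print its
operating-system process id; distinct pids mean distinct processes:

  #include <stdio.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* each MPI rank reports its own OS process id */
      printf("rank %d is pid %d\n", rank, (int)getpid());
      MPI_Finalize();
      return 0;
  }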


Brian


Re: [OMPI users] Weird interaction with modem under OS X

2007-05-28 Thread Brian Barrett

On May 22, 2007, at 7:52 PM, Tom Clune wrote:


For example, if it is ppp0, try:

   mpirun -np 1 -mca oob_tcp_exclude ppp0 uptime


This seems to at least produce a bit of output before hanging:

LM000953070:~ tlclune$ mpirun -np 1 -mca oob_tcp_exclude ppp0 uptime
[153.sub-70-211-6.myvzw.com:07562] [0,0,0] mca_oob_tcp_init:  
invalid address '' returned for selected oob interfaces.
[153.sub-70-211-6.myvzw.com:07562] [0,0,0] ORTE_ERROR_LOG: Error in  
file oob_tcp.c at line 1216


Tom -

I managed to track this down a bit.  We try to use the ppp0 interface  
(the cell phone device) for network connectivity, as it's the only  
non-localhost address up at the time.  Unfortunately, we can't use  
the address to route messages that way and Open MPI hangs.  The  
problem is made worse due to a bug that I'm still trying to track  
down in Open MPI.  When you tell Open MPI to not use a device (like  
ppp0), it should just use whatever other devices are available.  In  
your case, that would be localhost, which is what you're using when  
you don't have any network connectivity at all.  But it appears that  
this instead causes Open MPI to segfault / hang.  I'm looking into  
exactly why this is happening and should have a fix in the next day  
or so.


Brian

--
  Brian W. Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] Weird interaction with modem under OS X

2007-05-21 Thread Brian Barrett


On May 21, 2007, at 7:40 PM, Tom Clune wrote:

Executive summary: mpirun hangs when laptop is connected via  
cellular modem.


Longer description: Under ordinary circumstances mpirun behaves as  
expected on my OS X (Intel-duo) laptop.  I only want to be using  
the shared-memory mechanism - i.e. not sending packets across any  
networks.   When my laptop is connected to the internet via  
ethernet or wireless (or not connected to the network at all)  
mpirun works just fine, but if I connect via my nifty new cellular  
modem (Verizon in case it matters), mpirun hangs at launch.  I.e.  
my application never even starts, and I have to use Ctrl-C to
interrupt to regain a prompt.   I'd like to be able to engage in  
other activities (mail, cvs, skype) while executing mpi code in the  
background, so I'm really hoping there is a simple switch to fix this.


I am launching with the command:  "mpirun -np 2 ./gx".   I have  
also tried "mpirun --mca btl self,sm -np 2 ./gx" but that did not  
seem to improve the situation.


I have attached the output from "ompi_info --all".  The output does  
not seem to depend on whether I am connected via the modem or not.


If you run "mpirun -np 1 uptime" with your cell modem up, does that  
work?  This isn't one of those corner cases we test very often :).   
If it doesn't work, could you send the output of 'ifconfig'?  One  
thing to try would be telling Open MPI to not use the network device  
for the modem.  For example, if it is ppp0, try:


  mpirun -np 1 -mca oob_tcp_exclude ppp0 uptime

Good luck,

Brian

Re: [OMPI users] AlphaServers & OpenMPI

2007-05-14 Thread Brian Barrett

On May 13, 2007, at 6:23 AM, Bert Wesarg wrote:



Even better: is there a patch available to fix this in the 1.2.1
tarball, so that
I can set the full path again with CC?
The patch is quite trivial, but requires a rebuild of the  build  
system

(autoheader, autoconf, automake,...)

see here:
https://svn.open-mpi.org/trac/ompi/changeset/14610

but you can try to hack the current configure script, just by
searching for the affected line


As Bert said, the patch to fix the bug causes a bunch of files to  
rebuild using tools you probably don't want to bother installing.   
The easiest solution for now is to use the 1.2.2rc1 pre-release,  
available on the download page:


   http://www.open-mpi.org/software/ompi/v1.2/

It fixes a bunch of small issues found in the 1.2 and 1.2.1 releases,
including the bug with a path in CC that you've stumbled upon.



Brian

--
  Brian W. Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] newbie question

2007-05-14 Thread Brian Barrett
I fixed the OOB.  I also mucked some things up interface-wise
that I need to undo :).  Anyway, I'll have a look at fixing up the
TCP component in the next day or two.


Brian

On May 10, 2007, at 6:07 PM, Jeff Squyres wrote:


Brian --

Didn't you add something to fix exactly this problem recently?  I
have a dim recollection of seeing a commit go by about this...?

(I advised Steve in IM to use --disable-ipv6 in the meantime)
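
(For reference, that is a configure-time switch, so it would look
something like:

  ./configure --disable-ipv6 ...

followed by a rebuild and reinstall.)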


On May 10, 2007, at 1:25 PM, Steve Wise wrote:


I'm trying to run a job specifically over tcp and the eth1 interface.
It seems to be barfing on trying to listen via ipv6.  I don't want
ipv6.
How can I disable it?

Here's my mpirun line:

[root@vic12-10g ~]# mpirun --n 2 --host vic12,vic20 --mca btl
self,tcp -mca btl_tcp_if_include eth1 /root/IMB_2.3/src/IMB-MPI1
sendrecv
[vic12][0,1,0][btl_tcp_component.c:
489:mca_btl_tcp_component_create_listen] socket() failed: Address
family not supported by protocol (97)
[vic12-10g:15771] mca_btl_tcp_component: IPv6 listening socket failed
[vic20][0,1,1][btl_tcp_component.c:
489:mca_btl_tcp_component_create_listen] socket() failed: Address
family not supported by protocol (97)
[vic20-10g:23977] mca_btl_tcp_component: IPv6 listening socket failed


Thanks,

Steve.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_TYPE_STRUCT Not

2007-05-14 Thread Brian Barrett

On May 14, 2007, at 10:21 AM, Nym wrote:


I am trying to use MPI_TYPE_STRUCT in a 64 bit Fortran 90 program. I'm
using the Intel Fortran Compiler 9.1.040 (and C/C++ compilers
9.1.045).

If I try to call MPI_TYPE_STRUCT with the array of displacements that
are of type INTEGER(KIND=MPI_ADDRESS_KIND), then I get a compilation
error:

fortcom: Error: ./test_basic.f90, line 34: There is no matching
specific subroutine for this generic subroutine call.
[MPI_TYPE_STRUCT]
 CALL MPI_TYPE_STRUCT(numTypes, blockLengths, displacements,  
oldTypes &

---^
compilation aborted for ./test_basic.f90 (code 1)

Attached is a small test program to demonstrate this. I thought
according to the MPI specs that the displacement array should be of
type MPI_ADDRESS_KIND. Am I wrong?


Have a look at the last paragraph of Section 10.2.2 of the MPI-2  
standard.  Functions from MPI-1 that take address-sized arguments use  
INTEGER in Fortran.  This was obviously a problem, which is why the  
functions from MPI-1 that take an address-sized argument are  
deprecated in favor of new functions in MPI-2 that take proper
address kind arguments.


Two options:

  1) Use MPI_TYPE_STRUCT with INTEGER arguments
  2) Use MPI_TYPE_CREATE_STRUCT with ADDRESS_KIND arguments

Hope this helps,

Brian

--
  Brian W. Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI users] openmpi-1.2.1 mpicc error

2007-05-07 Thread Brian Barrett
This was a regression in Open MPI 1.2.1.  We improperly handle the  
situation where CC has a path in it.  We will have this fixed in Open  
MPI 1.2.2.  For now, your options are to use Open MPI 1.2 or specify  
a $CC without a path, such as CC=icc, and make sure $PATH is set  
properly.
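
For example, reusing the paths from your configure line below,
something along these lines should work:

  source /opt/intel/fce/9.1.045/bin/ifortvars.sh
  source /opt/intel/cce/9.1.049/bin/iccvars.sh
  ./configure --prefix=/usr/local/openmpi-1.2.1_intel \
              --with-tm=/usr/local \
              --enable-static --disable-shared \
              CC=icc CXX=icpc FC=ifort
  make all install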


Brian

On May 7, 2007, at 1:12 PM, Paul Van Allsburg wrote:


I just completed the install of release 1.2.1 and I get an error
attempting to compile with mpicc.

The install was done with:

source /opt/intel/fce/9.1.045/bin/ifortvars.sh
source /opt/intel/cce/9.1.049/bin/iccvars.sh
./configure --prefix=/usr/local/openmpi-1.2.1_intel \
--with-tm=/usr/local   \
--enable-static   \
--disable-shared  \
CC=/opt/intel/cce/9.1.049/bin/icc \
CXX=/opt/intel/cce/9.1.049/bin/icpc \
FC=/opt/intel/fce/9.1.045/bin/ifort
make all install


I tried to compile my hello program with

$ source /opt/intel/fce/9.1.045/bin/ifortvars.sh
$ source /opt/intel/cce/9.1.049/bin/iccvars.sh
$ PATH="/usr/local/openmpi-1.2.1_intel/bin:$PATH";export PATH
$ mpicc hello.c  -o hello  -g
ld: dummy: No such file: No such file or directory


I installed 1.2 exactly the same and it works fine.

Any suggestions? Thanks!
Paul

--
Paul Van Allsburg
Computational Science & Modeling Facilitator
Natural Sciences Division,  Hope College
35 East 12th Street
Holland, Michigan 49423
616-395-7292



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] 1.2.1 configure bug report: set CC variable may produce broken *wrapper-data.txt

2007-05-07 Thread Brian Barrett
Thanks for the bug report.  I'm able to replicate your problem, and  
it will be fixed in the 1.2.2 release.


Brian

On May 7, 2007, at 6:10 AM, livelfs wrote:


Hi all

I have observed a regression between 1.2 and 1.2.1

if CC is assigned an absolute path (i.e. export
CC=/opt/gcc/gcc-3.4.4/bin/gcc like in attached logs),
the */tools/wrappers/*-wrapper-data.txt files
produced by configure script have then a broken libs macro definition:

libs=-lmpi -lopen-rte -lopen-pal   -ldl   dummy ranlib

instead of
libs=-lmpi -lopen-rte -lopen-pal   -ldl   -Wl,--export-dynamic -lnsl
-lutil -lm -ldl

Regards,
Stephane Rouberol


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Call to MPI_Init affects errno

2007-05-02 Thread Brian Barrett
Yup, it does.  There's nothing in the standard that says it isn't  
allowed to.  Given the number of system/libc calls involved in doing  
communication, pretty much every MPI function is going to change the  
value of errno.  If you expect otherwise, I'd modify your  
application.  Most cluster-based MPI implementations are going to  
randomly change the errno on you.
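
If the application really does need an errno value from one of its
own calls, the usual pattern is to capture it immediately, before any
MPI call can run.  A small standalone sketch (the file name is just a
placeholder to force a failure):

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int fd = open("/no/such/file", O_RDONLY);
      int saved_errno = (fd < 0) ? errno : 0;   /* capture right away */

      MPI_Barrier(MPI_COMM_WORLD);   /* may clobber errno internally */

      if (fd < 0)
          fprintf(stderr, "open failed: %s\n", strerror(saved_errno));

      MPI_Finalize();
      return 0;
  }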


Brian

On May 2, 2007, at 12:18 PM, Chudin, Eugene wrote:

I am trying to experiment with openmpi, and the following trivial code
(although it runs) affects the value of errno:


#include <mpi.h>
#include <iostream>
#include <cerrno>
#include <cstring>


int main(int argc, char** argv)
{
 int _procid, _np;
 std::cout << "errno=\t" << errno << std::endl;
 MPI_Init(&argc, &argv);
 std::cout << "errno=\t" << errno << "\tafter MPI_Init()\t" <<  
std::endl;

 MPI_Comm_rank (MPI_COMM_WORLD, &_procid);
 MPI_Comm_size (MPI_COMM_WORLD, &_np);
 std::cout << "errno msg=\t" << strerror(errno) << "\tprocessor= 
\t" << _procid << std::endl;

 MPI_Finalize();
 return 0;
}

Compiled with
mpiCC -Wall test.cpp -o test

Produces following output when run just on single processor using
mpirun -np 1 --prefix /toolbox/openmpi  ./test
errno=  0
errno=  2   after MPI_Init()
errno msg=  No such file or directory   processor=  0

When run on two processors using
mpirun -np 2 --prefix /toolbox/openmpi  ./test
errno=  0
errno=  0
errno=  11  after MPI_Init()
errno=  115 after MPI_Init()
errno msg=  Operation now in progress   processor=  0
errno msg=  Resource temporarily unavailable 
processor=  1


The output of ompi_info --all is attached




-- 


Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates (which may be known
outside the United States as Merck Frosst, Merck Sharp & Dohme or MSD
and in Japan, as Banyu - direct contact information for affiliates is
available at http://www.merck.com/contact/contacts.html) that may be
confidential, proprietary copyrighted and/or legally privileged. It is
intended solely for the use of the individual or entity named on this
message. If you are not the intended recipient, and have received this
message in error, please notify us immediately by reply e-mail and  
then

delete it from your system.

-- 



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] orte_init failed

2007-04-16 Thread Brian Barrett
That's very odd.  The usual cause for this is /tmp being unwritable  
by the user or full.  Can you check to see if either of those  
conditions are true?


Thanks,

Brian


On Apr 13, 2007, at 2:44 AM, Christine Kreuzer wrote:


Hi,

I run openmpi on an AMD Opteron with two dualcore processors and SLE10;
until today everything worked fine, but then I got the following error
message:

[computername:20612][0,0,0] ORTE_ERROR_LOG: Error in file
../../orte/runtime/orte_init_stage1.c at line 302
-- 

It looks like orte_init failed for some reason; your parallel  
process is

likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

   orte_session_dir failed
   --> Returned value -1 instead of ORTE_SUCCESS

-- 


[computername:20612] [0,0,0] ORTE_ERROR_LOG: Error in file
../../orte/runtime/orte_system_init.c at line 42
[computername:20612] [0,0,0] ORTE_ERROR_LOG: Error in file
../../orte/runtime/orte_init.c at line 49
-- 


Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
-- 



I would appreciate any help or ideas to solve this problem.
Thanks in advance!

Regards,
Christine
--

Universität des Saarlandes
AG Prof. Dr. Christoph Becher
Fachrichtung 7.3 (Technische Physik)
Geb. E2.6, Zimmer 2.04
D-66123 Saarbrücken

Phone:+49(0)681 302 3418
Fax: +49(0)681 302 4676
E-mail: c.kreu...@mx.uni-saarland.de


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] OpenMPI 1.2 on MacOSX Intel Fails

2007-04-09 Thread Brian Barrett

On Apr 7, 2007, at 12:59 AM, Brian Powell wrote:


Greetings,

I turn to the assistance of the OpenMPI wizards. I have compiled  
v1.2 using gcc and ifort (see the attached config.log) with a  
variety of options. The compilation finishes (side note: I had to  
define NM otherwise the configure script failed) and installs. I  
try to run ompi_info and get the following:


-- 


A library call unexpectedly failed.  This is a terminal error; please
show this message to an Open MPI wizard:

Library call: mca_base_open
 Source file: ompi_info.cc
  Source line number: 139

Aborting...
-- 



For reasons I can't duplicate, you're getting an out of memory error  
when trying to initialize our component system.  I haven't seen this  
one before, but I noticed a couple of things in the config.log that  
made me think there might be an underlying problem...


1) You should never have to specify NM on OS X, even when cross- 
compiling.  Can you send information on why this is necessary?


2) This might actually be the answer to #1, but you shouldn't specify  
--build=i386.  The --build argument takes a complete config.guess- 
style architecture.  In the case of Mac OS X 10.4.9 on i386, that  
would be: i386-apple-darwin8.9.1.  But unless you're cross compiling,  
that argument should not be necessary.


3) The sysroot stuff is really only necessary if you are cross-
compiling.  I wouldn't use it in other cases, as it seems to make  
things more fragile.  If you're going to specify a sysroot, you need  
to specify it in CFLAGS, CXXFLAGS, and OBJCFLAGS.  You also should  
specify the -arch i386 in CXXFLAGS and OBJCFLAGS if you are going to  
specify it in CFLAGS and FFLAGS, if only for consistency.
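
In other words, if you do keep those flags, keeping them consistent
across languages would look something like this (the SDK path here is
only an example):

  ./configure CFLAGS="-arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk" \
              CXXFLAGS="-arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk" \
              OBJCFLAGS="-arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk" \
              FFLAGS="-arch i386" ...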


If you could try recompiling without the --build argument and let me  
know if that fixes the problem, I'd appreciate it.


Brian


Re: [OMPI users] Issues with Get/Put and IRecv

2007-03-26 Thread Brian Barrett

Mike -

In Open MPI 1.2, one-sided is implemented over point-to-point, so I  
would expect it to be slower.  This may or may not be addressed in a  
future version of Open MPI (I would guess so, but don't want to  
commit to it).  Were you using multiple threads?  If so, how?


On the good news, I think your call stack looked similar to what I  
was seeing, so hopefully I can make some progress on a real solution.


Brian

On Mar 20, 2007, at 8:54 PM, Mike Houston wrote:

Well, I've managed to get a working solution, but I'm not sure how  
I got

there.  I built a test case that looked like a nice simple version of
what I was trying to do and it worked, so I moved the test code  
into my

implementation and low and behold it works.  I must have been doing
something a little funky in the original pass, likely causing a stack
smash somewhere or trying to do a get/put out of bounds.

If I have any more problems, I'll let y'all know.  I've tested pretty
heavy usage up to 128 MPI processes across 16 nodes and things seem to
be behaving.  I did notice that single sided transfers seem to be a
little slower than explicit send/recv, at least on GigE.  Once I do  
some

more testing, I'll bring things up on IB and see how things are going.

-Mike

Mike Houston wrote:

Brian Barrett wrote:


On Mar 20, 2007, at 3:15 PM, Mike Houston wrote:




If I only do gets/puts, things seem to be working correctly with
version
1.2.  However, if I have a posted Irecv on the target node and  
issue a

MPI_Get against that target, MPI_Test on the posted IRecv causes a
segfault:

Anyone have suggestions?  Sadly, I need to have IRecv's posted.   
I'll

attempt to find a workaround, but it looks like the posted IRecv is
getting all the data of the MPI_Get from the other node.  It's like
the
message tagging is getting ignored.  I've never tried posting two
different IRecv's with different message tags either...



Hi Mike -

I've spent some time this afternoon looking at the problem and have
some ideas on what could be happening.  I don't think it's a data
mismatch (the data intended for the IRecv getting delivered to the
Get), but more a problem with the call to MPI_Test perturbing the
progress flow of the one-sided engine.  I can see one or two places
where it's possible this could happen, although I'm having trouble
replicating the problem with any test case I can write.  Is it
possible for you to share the code causing the problem (or some  
small

test case)?  It would make me feel considerably better if I could
really understand the conditions required to end up in a seg fault
state.

Thanks,

Brian


Well, I can give you a linux x86 binary if that would do it.  The  
code
is huge as it's part of a much larger system, so there is no such  
thing

as a simple case at the moment, and the code is in pieces and largely
unrunnable now with all the hacking...

I basically have one thread spinning on an MPI_Test on a posted IRecv
while being used as the target to the MPI_Get.  I'll see if I can  
hack

together a simple version that breaks late tonight.  I've just played
with posting a send to that IRecv, issuing the MPI_Get,  
handshaking and
then posting another IRecv and the MPI_Test continues to eat it,  
but in

a memcpy:

#0  0x001c068c in memcpy () from /lib/libc.so.6
#1  0x00e412d9 in ompi_convertor_pack (pConv=0x83c1198, iov=0xa0,
out_size=0xaffc1fd8, max_data=0xaffc1fdc) at convertor.c:254
#2  0x00ea265d in ompi_osc_pt2pt_replyreq_send (module=0x856e668,
replyreq=0x83c1180) at osc_pt2pt_data_move.c:411
#3  0x00ea0ebe in ompi_osc_pt2pt_component_fragment_cb
(pt2pt_buffer=0x8573380) at osc_pt2pt_component.c:582
#4  0x00ea1389 in ompi_osc_pt2pt_progress () at  
osc_pt2pt_component.c:769

#5  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#6  0x00ea59e5 in ompi_osc_pt2pt_passive_unlock (module=0x856e668,
origin=1, count=1) at osc_pt2pt_sync.c:60
#7  0x00ea0cd2 in ompi_osc_pt2pt_component_fragment_cb
(pt2pt_buffer=0x856f300) at osc_pt2pt_component.c:688
#8  0x00ea1389 in ompi_osc_pt2pt_progress () at  
osc_pt2pt_component.c:769

#9  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#10 0x00e33f05 in ompi_request_test (rptr=0xaffc2430,
completed=0xaffc2434, status=0xaffc23fc) at request/req_test.c:82
#11 0x00e61770 in PMPI_Test (request=0xaffc2430,  
completed=0xaffc2434,

status=0xaffc23fc) at ptest.c:52

-Mike
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Installation fails on Mac Os

2007-03-26 Thread Brian Barrett

On Mar 25, 2007, at 11:20 AM, Daniele Avitabile wrote:


Hi everybody,

I am trying to install open mpi on a Mac Os XServer, and the make  
all command exits with the error


openmpi_install_failed.tar.gz

as you can see from the output files I attached.

Some comments that may be helpful:

1) I am not root on the machine, but I have permissions to write  
in /usr/local/applications/, which is the directory in which I want  
to install openmpi.


2) In the same directory there is already an openmpi 1.1.2  
installation, with gcc-4.0.1 compilers. I want to install the  
current version of openmpi
and use a different compiler, namely the gcc compilers optimised  
for apple intel. They reside in the folder /usr/local/bin, and I  
pass them in the make command, as you can see from the attached file.


Any idea as to why I receive that error?


Short answer:
You need to either use the system-provided GCC or rebuild your  
version of GCC to use /usr/bin/libtool instead of /usr/bin/ld to link.


Long answer:
There are some things that are a little complicated to do with Mach-O  
if you want library versioning and plug-ins and all that to work  
properly.  GNU Libtool (and therefore Open MPI) assumes that if you
are using GCC, it can emit options to the linker that are meant for
/usr/bin/libtool, the library creation helper for OS X.
-compatibility_version is one of those things.  Your version of GCC is
instead invoking /usr/bin/ld directly, so things are going wrong.


You can still use the "intel optimized" version of GCC to compile  
your application, as long as it doesn't use GNU libtool, of course.   
Just use the system GCC to compile Open MPI and all will be fine.
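
For example, something along these lines (using the install directory
you mentioned; the exact prefix is up to you):

  ./configure --prefix=/usr/local/applications/openmpi-1.2 \
              CC=/usr/bin/gcc CXX=/usr/bin/g++
  make all install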



Hope this helps,

Brian




Re: [OMPI users] Issues with Get/Put and IRecv

2007-03-20 Thread Brian Barrett

On Mar 20, 2007, at 3:15 PM, Mike Houston wrote:

If I only do gets/puts, things seem to be working correctly with  
version

1.2.  However, if I have a posted Irecv on the target node and issue a
MPI_Get against that target, MPI_Test on the posted IRecv causes a
segfault:


Anyone have suggestions?  Sadly, I need to have IRecv's posted.  I'll
attempt to find a workaround, but it looks like the posted IRecv is
getting all the data of the MPI_Get from the other node.  It's like  
the

message tagging is getting ignored.  I've never tried posting two
different IRecv's with different message tags either...


Hi Mike -

I've spent some time this afternoon looking at the problem and have  
some ideas on what could be happening.  I don't think it's a data  
mismatch (the data intended for the IRecv getting delivered to the  
Get), but more a problem with the call to MPI_Test perturbing the  
progress flow of the one-sided engine.  I can see one or two places  
where it's possible this could happen, although I'm having trouble  
replicating the problem with any test case I can write.  Is it  
possible for you to share the code causing the problem (or some small  
test case)?  It would make me feel considerably better if I could  
really understand the conditions required to end up in a seg fault  
state.


Thanks,

Brian


Re: [OMPI users] open-mpi 1.2 build failure under Mac OS X 10.3.9

2007-03-19 Thread Brian Barrett

Hi -

Thanks for the bug report.  I've fixed the problem in SVN and it will  
likely be part of the 1.2.1 release (whenever that happens).  In the  
mean time, I've attached a patch that should apply to the 1.2 tarball  
that will also fix the problem.


The environment variables you want for specifying the Fortran  
compilers are F77 for Fortran 77 and FC for Fortran 90/95/03.


Hope this helps,

Brian


On Mar 16, 2007, at 5:42 PM, Marius Schamschula wrote:


Hi all,

I was building open-mpi 1.2 on my G4 running Mac OS X 10.3.9 and  
had a build failure with the following:


depbase=`echo runtime/ompi_mpi_preconnect.lo | sed 's|[^/]*$|.deps/ 
&|;s|\.lo$||'`; \
if /bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H - 
I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ 
ompi/include   -I..  -D_REENTRANT  -O3 -DNDEBUG -finline-functions - 
fno-strict-aliasing  -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF  
"$depbase.Tpo" -c -o runtime/ompi_mpi_preconnect.lo runtime/ 
ompi_mpi_preconnect.c; \
then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f  
"$depbase.Tpo"; exit 1; fi
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include - 
I../orte/include -I../ompi/include -I../ompi/include -I.. - 
D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing - 
MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF runtime/.deps/ 
ompi_mpi_preconnect.Tpo -c runtime/ompi_mpi_preconnect.c  -fno- 
common -DPIC -o runtime/.libs/ompi_mpi_preconnect.o
runtime/ompi_mpi_preconnect.c: In function  
`ompi_init_do_oob_preconnect':
runtime/ompi_mpi_preconnect.c:74: error: storage size of `msg'  
isn't known

make[2]: *** [runtime/ompi_mpi_preconnect.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

$ gcc -v
Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs
Thread model: posix
gcc version 3.3 20030304 (Apple Computer, Inc. build 1495)

$ g77 -v
Reading specs from /usr/local/lib/gcc/powerpc-apple- 
darwin7.3.0/3.5.0/specs
Configured with: ../gcc/configure --enable-threads=posix --enable- 
languages=f77

Thread model: posix
gcc version 3.5.0 20040429 (experimental)

(g77 from hpc.sf.net)


Note: I had no such problem under Mac OS X 10.4.9 with my ppc and  
x86 builds. However, I did notice that the configure script did not  
detect g95 from g95.org correctly:


*** Fortran 90/95 compiler
checking for gfortran... no
checking for f95... no
checking for fort... no
checking for xlf95... no
checking for ifort... no
checking for ifc... no
checking for efc... no
checking for pgf95... no
checking for lf95... no
checking for f90... no
checking for xlf90... no
checking for pgf90... no
checking for epcf90... no
checking whether we are using the GNU Fortran compiler... no

configure --help doesn't give any hint about specifying F95.






ompi_1.2_osx_10.3.diff
Description: Binary data




Re: [OMPI users] Still having problems building 1.2 on Mac OSX

2007-02-27 Thread Brian Barrett

On Feb 27, 2007, at 3:26 PM, Iannetti, Anthony C. ((GRC-RTB0)) wrote:


Dear Open-MPI:

   I am still having problems building OpenMPI 1.2 (now 1.2b4) on
MacOSX 10.4 PPC 64.  In a message a while back, you gave me a hack
to override this problem.  I believe it was a problem with Libtool,
or something like that.  Well, it looks like I still have to use
that hack.


Thanks for bringing this to my attention.  A patch was accidentally not
moved into the v1.2 release branch.  We'll try to get that fixed  
right away.


Brian



Re: [OMPI users] 64-bit Open-mpi on Intel Mac OS X? (opal_if error)

2007-02-05 Thread Brian Barrett
This was fixed in 1.1.4, along with some shared memory performance  
issues on Intel Macs (32 or 64 bit builds).


Brian

On Feb 5, 2007, at 1:22 PM, Jason Martin wrote:


Hi All,

Using openmpi-1.1.3b3, I've been attempting to build Open-MPI in
64-bit mode on a Mac Pro (dual Xeon 5150 2.66GHz with 1G RAM).
Using the following configuration options:

./configure --prefix=/usr/local/openmpi-1.1.3b3 \
--build=x86_64-apple-darwin \
CFLAGS=-m64 CXXFLAGS=-m64 \
LDFLAGS=-m64

The make goes fine, but in "make check" it hits an error in the  
"opal_if" test.


Searching the source code in opal/util/if.c shows that the error is
occurring with the
   ioctl(sd, SIOCGIFCONF, )
call never returning a valid result (I tried increasing
MAX_IFCONF_SIZE, but that didn't help).  There's a comment at the top
of the file that mentions some compiler magic (align=power, etc.) for
the 64-bit PPC version, but I'm at a loss about using it on a 64-bit
Intel platform.

Has anyone else had any experience with this?

(Note that 32-bit binaries compile and pass make check.)

Thanks,
jason

--
Jason Worth Martin
Asst. Prof. of Mathematics
James Madison University
http://www.math.jmu.edu/~martin
phone: (+1) 540-568-5101
fax: (+1) 540-568-6857

"Ever my heart rises as we draw near the mountains.
There is good rock here." -- Gimli, son of Gloin
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory





Re: [OMPI users] Can't run simple job with openmpi using the Intel compiler

2007-02-05 Thread Brian Barrett
This is very odd.  The two error messages you are seeing are side  
effects of the real problem, which is that Open MPI is segfaulting  
when built with the Intel compiler.  We've had some problems with
bugs in various versions of the Intel compiler -- just to be on the  
safe side, can you make sure that the machine has the latest bug  
fixes from Intel applied?  From there, if possible, it would be  
extremely useful to have a stack trace from a core file, or even to  
know whether it's mpirun or one of our "orte daemons" that are  
segfaulting.  If you can get a core file, you should be able to  
figure out which process is causing the segfault.
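
For example, something along these lines usually works (the core file  
name below is illustrative; exact naming depends on your system):

   ulimit -c unlimited                # allow core files to be written
   mpiexec -n 1 uname_test.intel      # reproduce the crash
   gdb ./uname_test.intel core.<pid>  # substitute mpirun or orted if that is what dumped core
   (gdb) bt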


Brian

On Feb 2, 2007, at 4:07 PM, Dennis McRitchie wrote:

When I submit a simple job (described below) using PBS, I always  
get one

of the following two errors:
1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
recv() failed with errno=104

2) [adroit-30:03770] [0,0,3]-[0,0,0]  
mca_oob_tcp_peer_complete_connect:

connection failed (errno=111) - retrying (pid=3770)

The program does a uname and prints out results to standard out. The
only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank,  
and
MPI_Finalize. I have tried it with both openmpi v 1.1.2 and 1.1.4,  
built
with Intel C compiler 9.1.045, and get the same results. But if I  
build

the same versions of openmpi using gcc, the test program always works
fine. The app itself is built with mpicc.

It runs successfully if run from the command line with "mpiexec -n X
", where X is 1 to 8, but if I wrap it in the
following qsub command file:
---
#PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
#PBS -m abe
# #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
# #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr

cd /home/dmcr/my_mpi/openmpi
echo "About to call mpiexec"
module list
mpiexec -n 1 uname_test.intel
echo "After call to mpiexec"


it fails on any number of processors from 1 to 8, and the application
segfaults.

The complete standard error of an 8-processor job follows (note that
mpiexec ran on adroit-31, but usually there is no info about adroit-31
in standard error):
-
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040   5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x5
[0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0  
[0xb72c5b]

*** End of error message ***
^@[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
recv() failed with errno=104
[adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:  
recv()

failed with errno=104
[adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=3770)
--

The complete standard error of a 1-processor job follows:
--
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040   5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2
[0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0  
[0x27d847]

*** End of error message ***
^@[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=8840)
---

Any thoughts as to why this might be failing?

Thanks,
   Dennis

Dennis McRitchie
Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University



--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory





Re: [OMPI users] mac os x 10.3 openmpi won't compile hello world

2006-10-18 Thread Brian Barrett
Ah, are you using Open MPI 1.1.x, by chance?  The wrapper compilers  
need to be able to find a text file in $prefix/share/openmpi/, where  
$prefix is the prefix you gave when you configured Open MPI.  If that  
path is different on two hosts, the wrapper compilers can't find the  
text file, and things fall apart.


There's supposed to be an error message from the wrapper compilers  
when this occurs.  Unfortunately, there is a bug in the 1.1.x wrapper  
compilers such that they just exit with a non-zero exit status  
without printing that error message.  Not friendly, unfortunately.


Brian

On Oct 18, 2006, at 9:37 AM, Dan Cardin wrote:


I found my problem. I installed my openmpi onto an nfs share that
resides on another machine. If I login to the machine where the nfs
share is physically I can compile and run the hello world.

This is my first cluster build. Does anyone have a suggestion how I
can keep this on an NFS share and make it work? Thank you

Mac os x 10.3 cluster

-dan

On Tue, 2006-10-17 at 22:15 -0600, Brian Barrett wrote:

On Oct 17, 2006, at 6:41 PM, Dan Cardin wrote:

Hello all, I have installed openmpi on a small apple panther  
cluster.

The install went smoothly but when I compile a program with

mpicc helloworld.c -o hello

No files or message are ever generated. Any help would be  
appreciated.


What version of Open MPI are you using?  Also, what is the output of:

   mpicc -showme

Thanks,

Brian







Re: [OMPI users] mac os x 10.3 openmpi won't compile hello world

2006-10-18 Thread Brian Barrett

On Oct 17, 2006, at 6:41 PM, Dan Cardin wrote:


Hello all, I have installed openmpi on a small apple panther cluster.
The install went smoothly but when I compile a program with

mpicc helloworld.c -o hello

No files or message are ever generated. Any help would be appreciated.


What version of Open MPI are you using?  Also, what is the output of:

  mpicc -showme

Thanks,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] C --> LOGICAL

2006-09-26 Thread Brian Barrett
On Tue, 2006-09-26 at 14:45 -0400, Brock Palen wrote:
> I have a code that requires that it be compiled (with the pgi  
> compilers) with the -i8
> 
>  From the pgf90 man page:
> 
> -i8Treat default INTEGER and LOGICAL variables as eight bytes.   
> For operations
>involving integers, use 64-bits for computations.
> 
> But I get the following from configure:
> 
> checking size of Fortran 77 LOGICAL... 8
> checking for C type corresponding to LOGICAL... not found
> configure: WARNING: *** Did not find corresponding C type
> configure: error: Cannot continue
> 
> 
> This is with openmpi-1.1.1
> I also have the same problem with openmpi-1.1.2rc1
> 
> The application is vasp, you can see the notes on the problem here:
> http://cms.mpi.univie.ac.at/vasp-forum/forum_viewtopic.php?2.1255

It looks like we assumed that LOGICAL would never be larger than an int,
which clearly isn't the case when that setting is used.  I've filed a
bug in our tracker about the issue and should have a fix committed this
evening.  It should be able to make the 1.1.2 release, but I can't
promise at this point.


Thanks,

Brian




Re: [OMPI users] Dynamic loading of libmpi.dylib on Mac OS X

2006-09-22 Thread Brian Barrett
Brian -

Sorry for the slow reply, I've been on vacation for a while and am still
digging out from all the back e-mail.

Anyway, that makes sense.  Open MPI's default build mode is to dlopen()
the driver components needed for things like the various interconnects
and process starters we support.  Since libmpi was dlopen()'ed with
RTLD_LOCAL, the symbols needed in libmpi were not available to those
components when OMPI tried to dlopen() them.

I was a little confused initially by why the symbols in our other
support libraries were found (everything seemed to work until the
MPI-level -- the run-time stuff initialized properly).  But apparently
this makes sense as well, as there's something about how shared
libraries that are dependencies on the dlopen()'ed object are loaded
that puts those symbols in the global namespace.

One solution, of course, is to specify RTLD_GLOBAL when opening libmpi.
The other possibility is to build Open MPI with the --disable-dlopen
option, which will cause all the components to be built into libmpi,
avoiding the whole namespacing issue.
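
For a C caller the equivalent would look something like this (a sketch  
only, using the library name from the report below; not code taken from  
Open MPI itself):

   #include <dlfcn.h>
   #include <stdio.h>

   int main(void)
   {
       /* RTLD_GLOBAL makes libmpi's symbols visible to the MCA
          components that Open MPI dlopen()s later on. */
       void *handle = dlopen("libmpi.0.dylib", RTLD_NOW | RTLD_GLOBAL);
       if (NULL == handle) {
           fprintf(stderr, "dlopen failed: %s\n", dlerror());
           return 1;
       }
       /* ... look up MPI_Init and friends with dlsym() here ... */
       dlclose(handle);
       return 0;
   }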

We'll add some information to the FAQ on this issue.  Thanks for
bringing it to our attention.

Brian

On Fri, 2006-09-08 at 10:51 -0600, Brian E Granger wrote:
> Brian,
> 
> 
> I think I have figured this one out.  By default ctypes calls dlopen
> with mode = RTLD_LOCAL (except on Mac OS 10.3).  When I instruct
> ctypes to set mode = RTLD_GLOBAL it works fine on 10.4.  Based on the
> dlopen man page:
> 
> 
>  RTLD_GLOBAL   Symbols exported from this image (dynamic library or
>                bundle) will be available to any images built with the
>                -flat_namespace option to ld(1) or to calls to dlsym()
>                when using a special handle.
>
>  RTLD_LOCAL    Symbols exported from this image (dynamic library or
>                bundle) are generally hidden and only available to
>                dlsym() when directly using the handle returned by this
>                call to dlopen().  If neither RTLD_GLOBAL nor RTLD_LOCAL
>                is specified, the default is RTLD_GLOBAL.
> 
> 
> This behavior makes sense.  Thus the following works on 10.4:
> 
> 
> from ctypes import *
> mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
> f = pythonapi.Py_GetArgcArgv
> argc = c_int()
> argv = POINTER(c_char_p)()
> f(byref(argc), byref(argv))
> mpi.MPI_Init(byref(argc), byref(argv))
> mpi.MPI_Finalize()
> 
> 
> So I am not sure this is a defect in OpenMPI, but it sure is a subtle
> aspect of using it.  I will probably document this somewhere in the
> package I am creating.  
> 
> 
> Thanks
> 
> 
> Brian
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Sep 6, 2006, at 9:00 AM, Brian Barrett wrote:
> 
> > Thanks for the information.  I've filed a bug in our bug tracker on
> > this issue.  It appears that for some reason, when libmpi is dlopened()
> > by python, the objects it then dlopens are not able to find symbols in
> > libmpi.  It will probably take me a bit of time to track this issue
> > down, but you will be notified by the bug tracker when the issue is
> > resolved.
> > 
> > 
> > Brian
> > 
> > 
> > 
> > 
> > On Thu, 2006-08-31 at 17:27 -0600, Brian E Granger wrote:
> > > Brian,
> > > 
> > > 
> > > 
> > > 
> > > Sure, but my example will probably seem a little odd.  I am
> > > calling
> > > the mpi shared library from Python using ctypes.  
> > > 
> > > 
> > > 
> > > 
> > > The dependencies for doing things this way are:
> > > 
> > > 
> > > 
> > > 
> > > 1. Python built with --enable-shared
> > > 2. The ctypes python package
> > > 3. OpenMPI configured with --enable-shared
> > > 
> > > 
> > > 
> > > 
> > > Once you have this, the following python script will cause the
> > > problem
> > > on Mac OS X:
> > > 
> > > 
> > > 
> > > 
> > > from ctypes import *
> > > 
> > > 
> > > 
> > > 
> > > f = pythonapi.Py_GetArgcArgv
> > > argc = c_int()
> > > argv = POINTER(c_char_p)()
> > > f(byref(argc), byref(argv))
> > > mpi = cdll.LoadLibrary('libmpi.0.dylib')
> > > mpi.MPI_Init(byref(argc), byref(argv))
> > > 
> > > 
> > > 
> > > 
> > > I will try this on Linux as well to see if I get the same error.
> > > One
> > > important piece of the puzzle is 

Re: [OMPI users] linux alpha ev6 openmpi 1.1.1

2006-09-17 Thread Brian Barrett

On Sep 8, 2006, at 8:18 PM, Nuno Sucena Almeida wrote:


Hello,
	while trying to compile openmpi 1.1.1 on a linux alpha ev6  
(tsunami) gentoo

system, I had to add the following lines to config/ompi_config_asm.m4:

alphaev6-*)
ompi_cv_asm_arch="ALPHA"
OMPI_ASM_SUPPORT_64BIT=1
OMPI_GCC_INLINE_ASSIGN='"bis zero,zero,%0" : "="(ret)'
;;

since my system was being detected as such, and not alpha-*


I forgot to mention -- I've committed a fix for this part of the  
issue in the SVN trunk.  It should eventually be migrated into the  
branch for the 1.2 release once we sort out the other Alpha issues.


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] linux alpha ev6 openmpi 1.1.1

2006-09-17 Thread Brian Barrett

On Sep 8, 2006, at 8:18 PM, Nuno Sucena Almeida wrote:


The other issue is the one described in
	http://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg229867.html


(...)
gcc -O3 -DNDEBUG -fno-strict-aliasing -pthread -o .libs/opal_wrapper
opal_wrapper.o -Wl,--export-dynamic  ../../../opal/.libs/libopal.so  
-ldl -lnsl -lutil -lm -Wl,--rpath -Wl,/opt/openmpi-1.1.1/lib

../../../opal/.libs/libopal.so: undefined reference to
`opal_atomic_cmpset_acq_32'
../../../opal/.libs/libopal.so: undefined reference to  
`opal_atomic_cmpset_32'

(...)


Can you send the config.log file generated by Open MPI's configure,  
with your bis $31,$31 change?


Thanks,

Brian


Re: [OMPI users] Probable MPI2 bug?

2006-09-06 Thread Brian Barrett
On Mon, 2006-09-04 at 11:01 -0700, Tom Rosmond wrote:

> Attached is some error output from my tests of 1-sided message
> passing, plus my info file.  Below are two copies of a simple fortran
> subroutine that mimics mpi_allgatherv using  mpi-get calls.  The top
> version fails, the bottom runs OK.  It seems clear from these
> examples, plus the 'self_send' phrases in the error output, that there
> is a problem internally with a processor sending data to itself.  I
> know that your 'mpi_get' implementation is simply a wrapper around
> 'send/recv' calls, so clearly this shouldn't happen.  However, the
> problem does not happen in all cases; I tried to duplicate it in a
> simple stand-alone program with mpi_get calls and was unable to make
> it fail.  Go figure.

That is an odd failure and at first glance it does look like there is
something wrong with our one-sided implementation.  I've filed a bug in
our tracker about the issue and you should get updates on the ticket as
we work on the issue.

Thanks,

Brian



Re: [OMPI users] Dynamic loading of libmpi.dylib on Mac OS X

2006-09-06 Thread Brian Barrett
Thanks for the information.  I've filed a bug in our bug tracker on this
issue.  It appears that for some reason, when libmpi is dlopened() by
python, the objects it then dlopens are not able to find symbols in
libmpi.  It will probably take me a bit of time to track this issue
down, but you will be notified by the bug tracker when the issue is
resolved.

Brian


On Thu, 2006-08-31 at 17:27 -0600, Brian E Granger wrote:
> Brian,
> 
> 
> Sure, but my example will probably seem a little odd.  I am calling
> the mpi shared library from Python using ctypes.  
> 
> 
> The dependencies for doing things this way are:
> 
> 
> 1. Python built with --enable-shared
> 2. The ctypes python package
> 3. OpenMPI configured with --enable-shared
> 
> 
> Once you have this, the following python script will cause the problem
> on Mac OS X:
> 
> 
> from ctypes import *
> 
> 
> f = pythonapi.Py_GetArgcArgv
> argc = c_int()
> argv = POINTER(c_char_p)()
> f(byref(argc), byref(argv))
> mpi = cdll.LoadLibrary('libmpi.0.dylib')
> mpi.MPI_Init(byref(argc), byref(argv))
> 
> 
> I will try this on Linux as well to see if I get the same error.  One
> important piece of the puzzle is that if I configure openmpi with the
> --disable-dlopen flag, I don't have the problem.  I will do some
> further testing on different systems and get back to you.  
> 
> 
> Thanks for looking at this.
> 
> 
> Brian
> 
> 
> 
> On Aug 31, 2006, at 4:20 PM, Brian Barrett wrote:
> 
> > This is quite strange, and we're having some trouble figuring out
> > exactly why the opening is failing.  Do you have a (somewhat?) easy
> > list
> > of instructions so that I can try to reproduce this?
> > 
> > 
> > Thanks,
> > 
> > 
> > Brian
> > 
> > 
> > On Tue, 2006-08-22 at 20:58 -0600, Brian Granger wrote:
> > > HI,
> > > 
> > > 
> > > I am trying to dynamically load mpi.dylib on Mac OS X (using
> > > ctypes in 
> > > python).  It seems to
> > > load fine, but when I call MPI_Init(), I get the error shown
> > > below.  I
> > > can call other functions just fine (like MPI_Initialized).
> > > 
> > > 
> > > Also, my mpi install is seeing all the needed components and I can
> > > load them myself without error using dlopen.  I can also compile
> > > and
> > > run mpi programs and I build openmpi with shared library support.
> > > 
> > > 
> > > [localhost:00973] mca: base: component_find: unable to open:
> > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so,
> > > 9):
> > > Symbol not found: _ompi_free_list_item_t_class
> > >   Referenced from: 
> > > /usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so
> > >   Expected in: flat namespace
> > >   (ignored)
> > > [localhost:00973] mca: base: component_find: unable to open:
> > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so, 9):
> > > Symbol
> > > not found: _ompi_free_list_item_t_class
> > >   Referenced
> > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so
> > >   Expected in: flat namespace
> > >   (ignored)
> > > [localhost:00973] mca: base: component_find: unable to open:
> > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so, 9):
> > > Symbol
> > > not found: _mca_allocator_base_components
> > >   Referenced
> > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so
> > >   Expected in: flat namespace
> > >   (ignored)
> > > [localhost:00973] mca: base: component_find: unable to open:
> > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so, 9):
> > > Symbol
> > > not found: _ompi_free_list_item_t_class
> > >   Referenced
> > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so
> > >   Expected in: flat namespace
> > >   (ignored)
> > > [localhost:00973] mca: base: component_find: unable to open:
> > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so, 9):
> > > Symbol not found: _mca_pml
> > >   Referenced
> > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so
> > >   Expected in: flat namespace
> > >   (ignored)
> > > [localhost:00973] mca: base: component_find: unable to open:
> > > dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so,
> > > 9):
> > > Symbol not found: _ompi_mpi_op_max
> > >   Referenced
> > > from: /usr/local/openmpi-1.1/lib/openmpi/mca_

Re: [OMPI users] question about passing MPI communicator

2006-09-01 Thread Brian Barrett
Your example is pretty close to spot on.  You want to convert the
Fortran handle (integer) into a C handle (something else).  Then use the
C handle to call C functions.  The one thing of note is that you should
use the type MPI_Fint instead of int for the type of the Fortran
handles.  So your parallel_info function's prototype would be:

  void parallel_info_(int *rank, MPI_Fint *comm);
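
For illustration, the full routine might then look like this (a sketch  
based on the test code quoted below; the handle conversion is the only  
real change):

   #include <mpi.h>

   void parallel_info_(int *rank, MPI_Fint *comm)
   {
       /* Convert the Fortran handle to a C handle before using it. */
       MPI_Comm ccomm = MPI_Comm_f2c(*comm);
       MPI_Comm_rank(ccomm, rank);
   }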

Hope this helps,

Brian


On Fri, 2006-09-01 at 09:26 -0400, Wang, Peng wrote:
> Hello, I am wondering how the passing of an MPI communicator from
> Fortran to C is handled in openmpi? Assuming I have a Fortran 90 subroutine
> calling a C function passing MPI_COMM_WORLD in, in the C function, do I
> need to first do MPI_Comm_f2c to convert to an MPI handle, then use that
> handle afterward? Or is there any better way to do this? Here is some test code:
> 
> Fortran 90:
> 
> program test1
> 
> include 'mpif.h'
> 
> integer myrank,ierr
> 
> call MPI_Init(ierr)
> 
> call parallel_info(myrank,MPI_COMM_WORLD)
> write(*,*) 'hello, I am process #',myrank
> 
> call MPI_Finalize(ierr)
> 
> end program test1
> 
> 
> C:
> 
> #include <mpi.h>
> 
> void parallel_info_(int * rank, int* comm)
> {
> MPI_Comm ccomm;
> 
> ccomm=MPI_Comm_f2c(*comm);
> MPI_Comm_rank(ccomm, rank);
> }
> 
> void parallel_info(int * rank, int * comm)
> {
> MPI_Comm ccomm;
> 
> ccomm=MPI_Comm_f2c(*comm);
> 
> MPI_Comm_rank(ccomm, rank);
> }
> 
> 
> Thanks,
> Peng
> 
> 



Re: [OMPI users] Dynamic loading of libmpi.dylib on Mac OS X

2006-08-31 Thread Brian Barrett
This is quite strange, and we're having some trouble figuring out
exactly why the opening is failing.  Do you have a (somewhat?) easy list
of instructions so that I can try to reproduce this?

Thanks,

Brian

On Tue, 2006-08-22 at 20:58 -0600, Brian Granger wrote:
> HI,
> 
> I am trying to dynamically load mpi.dylib on Mac OS X (using ctypes in 
> python).  It seems to
> load fine, but when I call MPI_Init(), I get the error shown below.  I
> can call other functions just fine (like MPI_Initialized).
> 
> Also, my mpi install is seeing all the needed components and I can
> load them myself without error using dlopen.  I can also compile and
> run mpi programs and I build openmpi with shared library support.
> 
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so, 9):
> Symbol not found: _ompi_free_list_item_t_class
>   Referenced from: 
> /usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so, 9): Symbol
> not found: _ompi_free_list_item_t_class
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so, 9): Symbol
> not found: _mca_allocator_base_components
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so, 9): Symbol
> not found: _ompi_free_list_item_t_class
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so, 9):
> Symbol not found: _mca_pml
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so, 9):
> Symbol not found: _ompi_mpi_op_max
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_sm.so, 9): Symbol
> not found: _ompi_mpi_local_convertor
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_sm.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_tuned.so, 9):
> Symbol not found: _mca_pml
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_tuned.so
>   Expected in: flat namespace
>   (ignored)
> [localhost:00973] mca: base: component_find: unable to open:
> dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_osc_pt2pt.so, 9): Symbol
> not found: _ompi_request_t_class
>   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_osc_pt2pt.so
>   Expected in: flat namespace
>   (ignored)
> --
> No available pml components were found!
> 
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
> 
> This is a fatal error; your MPI process is likely to abort.  Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system.  You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> 
> --
> [localhost:00973] PML ob1 cannot be selected
> 
> Any Ideas?
> 
> Thanks
> 
> Brian Granger



Re: [OMPI users] little endian - big endian conversion

2006-08-30 Thread Brian Barrett
Correct.  With the exception of MPI_BOOL / MPI_LOGICAL, we do not handle
sending datatypes that are different sizes on the sender and receiver.
So sending an MPI_LONG from a 32 bit machine to a 64 bit machine will
not work correctly.

Brian

On Wed, 2006-08-30 at 10:33 -0400, Jeff Squyres wrote:
> Oops!  My mistake -- thanks for the correction...
> 
> I am still correct in thinking that we do not properly handle *size*
> endianness, right?  Meaning that if sizeof(long) on one node is different
> than sizeof(long) on another, running an MPI job across those two nodes will
> cause Bad Things to occur if you try to exchange MPI_LONGs between the MPI
> processes, right?  (and similar for other datatypes that are different
> sizes)
> 
> 
> On 8/30/06 9:38 AM, "Brian Barrett" <brbar...@open-mpi.org> wrote:
> 
> > Actually, Jeff is incorrect.  As of Open MPI 1.1, we do support endian
> > conversion between peers.  It has not been as well tested as the rest of
> > the code base, but it should work.  Please let us know if you have any
> > issues with that mode and we'll work to resolve them.
> > 
> > Brian
> > 
> > 
> > On Wed, 2006-08-30 at 06:36 -0400, Jeff Squyres wrote:
> >> Open MPI does not yet support endian conversion between peers in a single
> >> MPI job.  It's on the to-do list, but it's been a lower priority than some
> >> other features and issues.
> >> 
> >> 
> >> 
> >> On 8/30/06 4:12 AM, "Eng. A.A. Isola" <alfonso.is...@tin.it> wrote:
> >> 
> >>> Hi everybody,
> >>>
> >>> I have one doubt about OPEN-MPI. Suppose I run the application on
> >>> different systems with different data formats (little-endian & big
> >>> endian)... will OPEN-MPI convert from little-endian to big-endian
> >>> (if it is sending data from, e.g., a Linux PC to Solaris)?
> >>>
> >>> If it isn't able to do this, will it be able to in future releases?
> >>> (Is it on your to-do list?)
> >>>
> >>> Thanking you for your response,
> >>>
> >>> A.A. Isola
> >>> 
> >> 
> > 
> > 
> 
> 



Re: [OMPI users] Testing 1-sided MPI again

2006-08-15 Thread Brian Barrett
On Tue, 2006-08-15 at 14:24 -0700, Tom Rosmond wrote:
> I am continuing to test the MPI-2 features of 1.1, and have run into 
> some puzzling behavior. I wrote a simple F90 program to test 'mpi_put' 
> and 'mpi_get' on a coordinate transformation problem on a two dual-core 
> processor Opteron workstation running the PGI 6.1 compiler. The program 
> runs correctly for a variety of problem sizes and processor counts.
> 
> However, my main interest is a large global weather prediction model 
> that has been running in production with 1-sided message passing on an 
> SGI Origin 3000 for several years. This code does not run with OMPI 
> 1-sided message passing. I have investigated the difference between this 
> code and the test program and noticed a critical difference. Both 
> programs call 'mpi_win_create' to create an integer 'handle' to the RMA 
> window used by 'mpi_put' and 'mpi_get'. In the test program this 
> 'handle' returns with a value of '1', but in the large code the 'handle' 
> returns with value '0'. Subsequent synchronization calls to 
> 'mpi_win_fence' succeed in the small program (error status eq 0), while 
> in the large code they fail (error status ne 0), and the transfers fail 
> also (no data is passed).
> 
> Do you have any suggestions on what could cause this difference in 
> behavior between the two codes, specifically why the 'handles' have 
> different values? Are there any diagnostics I could produce that would 
> provide information?

The difference in handle values is irrelevant to the failures you are
seeing.  Our handle 0 is MPI_WIN_NULL, so you should never see that
returned from MPI_WIN_CREATE.

Unfortunately, when I wrote the one-sided implementation, I didn't add
useful debugging messages the user can enable.  I can add some and make
a tarball, if you would be willing to give it a try.  What error
messages are coming out of the large code?

By the way, just to make sure your expectations are set correctly, Open
MPI's one-sided performance in v1.1 and v1.2 is bad, as it's implemented
over the point-to-point engine.  You're not going to get Origin-like
performance out of the current implementation.

Brian




Re: [OMPI users] pvfs2 and romio

2006-08-14 Thread Brian Barrett
On Mon, 2006-08-14 at 10:57 -0400, Brock Palen wrote:
> We will be evaluating pvfs2 (www.pvfs.org) in the future.  Are there
> any special considerations to take to get romio support with openmpi
> with pvfs2?
> I have the following from ompi_info
> 
> MCA io: romio (MCA v1.0, API v1.0, Component v1.1)
> 
> Does OMPI have to be built pointing at the pvfs2 libs?  If so how?  I  
> remember there was a strange way of needing to do this with lam.
> 
> Guidance is much appreciated.

Yeah, some minor trickery is required.  I believe you can just do
something like:

  ./configure  --with-file-system=pvfs2+nfs+ufs

but it's probably safest to do:

  ./configure  --with-io-romio-flags="--with-file-system=pvfs2+nfs+ufs"

Changing the filesystems you want to include, of course.

Brian




Re: [OMPI users] Compiling MPI with pgf90

2006-07-31 Thread Brian Barrett
On Mon, 2006-07-31 at 13:12 -0400, James McManus wrote:
> I'm trying to compile MPI with pgf90. I use the following configure 
> settings:
> 
> ./configure --prefix=/usr/local/mpi F90=pgf90 F77=pgf77
> 
> However, the compiler is set to gfortran:
> 
> *** Fortran 90/95 compiler
> checking for gfortran... gfortran
> checking whether we are using the GNU Fortran compiler... yes
> checking whether gfortran accepts -g... yes
> checking if Fortran compiler works... yes
> checking whether pgf90 and gfortran compilers are compatible... no
> configure: WARNING: *** Fortran 77 and Fortran 90 compilers are not link 
> compatible
> 
> I do have gfortran, with its binary in /usr/bin/gfortran. However, I
> have removed all path information to it in .bash_profile and .bashrc,
> and have replaced it with path information to pgf90. MPI is still
> configured with gfortran as the FC compiler.
> 
> I am using a evaluation version of the pgi compilers.


Try:

 ./configure --prefix=/usr/local/mpi FC=pgf90 F77=pgf77

Autoconf looks at the FC variable to choose the modern Fortran compiler,
not the F90 variable.

Brian




Re: [OMPI users] bug report: wrong reference in mpi.h to mpicxx.h

2006-07-19 Thread Brian Barrett
On Wed, 2006-07-19 at 14:57 +0200, Paul Heinzlreiter wrote:

> After that I tried to compile VTK (http://www.vtk.org) with MPI support
> using OpenMPI.
> 
> The compilation process issued the following error message:
> 
> /home/ph/local/openmpi/include/mpi.h:1757:33: ompi/mpi/cxx/mpicxx.h: No
> such file or directory

Sven sent instructions on how to best build VTK, but I wanted to explain
what you are seeing.  Open MPI actually requires two -I options to use
the C++ bindings: -I<prefix>/include and -I<prefix>/include/openmpi.
Generally, the wrapper compilers (mpicc, mpiCC, mpif77, etc.) are used
to build Open MPI applications and the -I flags are automatically added
without any problem (a bunch of other flags that might be required on
your system may also be added).  

You can use the "mpiCC -showme" option to the wrapper compiler to see
exactly which flags it might add when compiling / linking / etc.


Hope this helps,

Brian



Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR

2006-07-16 Thread Brian Barrett

On Jul 16, 2006, at 4:13 PM, Eric Thibodeau wrote:
Now that I have that out of the way, I'd like to know how I am  
supposed to compile my apps so that they can run on an homogenous  
network with mpi. Here is an example:


kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpicc -L/usr/X/lib -lm -lX11 -O3 mandelbrot-mpi.c -o mandelbrot-mpi

kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun --hostfile hostlist -np 3 ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi


--
Could not execute the executable
"/home/kyron/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi":
Exec format error

This could mean that your PATH or executable name is wrong, or that
you do not have the necessary permissions. Please ensure that the
executable is able to be found and executed.
--



As can be seen from the uname -a that was run previously, I have two  
"local nodes" on the x86_64 and two i686 nodes. I tried to find  
examples in the docs on how to compile applications correctly for  
such a setup without compromising performance, but I came up short of  
an example.


From the sound of it, you have a heterogeneous configuration -- some  
nodes are x86_64 and some are x86.  Because of this, you either have  
to compile your application twice, once for each platform, or compile  
your application for the lowest common denominator.  My guess would  
be that it is easier and more foolproof if you compiled everything in  
32 bit mode.  If you run in mixed mode, using application schemas (see  
the mpirun man page) will be the easiest way to make things work.
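
As a sketch of the mixed-mode route (the appfile name and binary names  
below are made up, not from your setup), an application context file is  
just a text file with one mpirun-style line per binary:

   # appfile: one program context per line
   -np 2 -host headless ./mandelbrot-mpi.x86_64
   -np 2 -host node1,node2 ./mandelbrot-mpi.i686

which you would then launch with:

   mpirun --app appfile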



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Problem compiling OMPI with Intel C compiler on Mac OS X

2006-07-16 Thread Brian Barrett

On Jul 14, 2006, at 10:35 AM, Warner Yuen wrote:

I'm having trouble compiling Open MPI with Mac OS X v10.4.6 with  
the Intel C compiler. Here are some details:


1) I upgraded to the latest versions of Xcode including GCC 4.0.1  
build 5341.

2) I installed the latest Intel update (9.1.027) as well.
3) Open MPI compiles fine with using GCC and IFORT.
4) Open MPI fails with ICC and IFORT
5) MPICH-2.1.0.3 compiles fine with ICC and IFORT (I just had to  
find out if my compiler worked...sorry!)
6) My Open MPI configuration was using: ./configure --with-rsh=/usr/bin/ssh --prefix=/usr/local/ompi11icc

7) Should I have included my config.log?


It looks like there are some problems with GNU libtool's support for  
the Intel compiler on OS X.  I can't tell if it's a problem with the  
Intel compiler or libtool.  A quick fix is to build Open MPI with  
static libraries rather than shared libraries.  You can do this by  
adding:


  --disable-shared --enable-static

to the configure line for Open MPI (if you're building in the same  
directory where you've already run configure, you want to run make  
clean before building again).


I unfortunately don't have access to an Intel Mac machine with the  
Intel compilers installed, so I can't verify this issue.  I believe  
one of the other developers does have such a configuration, so I'll  
ask him when he's available (might be a week or two -- I believe he's  
on vacation).  This issue seems to be unique to your exact  
configuration -- it doesn't happen with GCC on the Intel Mac nor on  
Linux with the Intel compilers.



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR

2006-07-16 Thread Brian Barrett

On Jul 15, 2006, at 2:58 PM, Eric Thibodeau wrote:
But, for some reason, on the Athlon node (in their image on the  
server I should say) OpenMPI still doesn't seem to be built  
correctly since it crashes as follows:



kyron@node0 ~ $ mpirun -np 1 uptime
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f]
[1] func:[0xe440]
[2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1+0x1d7) [0xb7fa0227]
[3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init+0x23) [0xb7fa3683]
[4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f) [0xb7f9ff7f]
[5] func:mpirun(orterun+0x255) [0x804a015]
[6] func:mpirun(main+0x22) [0x8049db6]
[7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b]
[8] func:mpirun [0x8049d11]
*** End of error message ***
Segmentation fault


The crash happens both in the chrooted env and on the nodes. I  
configured both systems to have Linux and POSIX threads, though I  
see openmpi is calling the POSIX version (a message on the mailing  
list had hinted at keeping the Linux threads around... I have to  
anyway, since some apps like Matlab extensions still depend on  
this...). The following is the output for the libc info.


That's interesting...  We regularly build Open MPI on 32 bit Linux  
machines (and in 32 bit mode on Opteron machines) without too much  
issue.  It looks like we're jumping into a NULL pointer, which  
generally means that a ORTE framework failed to initialize itself  
properly.  It would be useful if you could rebuild with debugging  
symbols (just add -g to CFLAGS when configuring) and run mpirun in  
gdb.  If we can determine where the error is occurring, that would  
definitely help in debugging your problem.


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Openmpi, LSF and GM

2006-07-16 Thread Brian Barrett

On Jul 16, 2006, at 6:12 AM, Keith Refson wrote:


The compile of openmpi 1.1 was without problems and
appears to have correctly built the GM btl.
$ ompi_info -a | egrep "\bgm\b|_gm_"
   MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: gm (MCA v1.0, API v1.0, Component v1.1)


Ok, so GM support is definitely built into your build of Open MPI,  
which is a good start.



However, I have been unable to set up a parallel run which uses gm.
If I start a run using the openmpi mpirun command, the program  
executes
correctly in parallel. However the timings appear to suggest that  
it is

using tcp, and the command executed on the node looks  like:

  orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
    --nodename scarf-cn001.rl.ac.uk --universe
    cse0...@scarf-cn001.rl.ac.uk:default-universe-28588 --nsreplica
    "0.0.0;tcp://192.168.1.1:52491;tcp://130.246.142.1:52491" --gprreplica
    "0.0.0;tcp://192.168.1.1:52491;t


Right, orted is just a starter for the MPI processes -- the  
information on interconnects to use and that kind of stuff is passed  
through the out-of-band communication mechanism.  orted doesn't  
really care which interconnect the MPI process is going to use, so we  
don't pass it on the command line.



Furthermore if attempt to start with the mpirun arguments "--mca btl
gm,self,^tcp" the run aborts at the MPI_INIT call.

Q1:  Is there anything else I have to do to get openmpi to use gm?


The command line you want is:

  mpirun -np X -mca btl gm,sm,self 

If this causes an error during MPI_INIT or early in your application,  
it would be useful to see all the output form the parallel run.  That  
likely indicates that there is something wrong with the  
initialization of the interconnect.



Q2:  Is there any way of diagnosing which btl is actually being used
 and why?  Neither the "-v" option to mpirun, "-mca btl_base_verbose",
 nor "-mca btl_gm_debug=1" makes any difference or produces any
 more output.


The arguments you want would look like:

  mpirun -np X -mca btl gm,sm,self -mca btl_base_verbose 1 -mca  
btl_gm_debug 1 


Q3:  Is there a way to make openmpi work with the LSF commands?  So  
far

 I have constructed a hostfile from the LSF environment variable
 LSB_HOSTS and used the openmpi mpirun command to start the
 parallel executable.


Currently, we do not have tight LSF integration for Open MPI, like we  
do for PBS, SLURM, and BProc.  This is mainly because the only LSF  
machines the development team regularly uses are BProc machines,  
which do not use the traditional startup and allocation mechanisms of  
LSF.  I believe it is on our feature request list, but I also don't  
believe we have a timeline for implementation.



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] debugging with mpirun

2006-07-06 Thread Brian Barrett

On Jul 6, 2006, at 8:27 PM, Manal Helal wrote:


I am trying to debug my mpi program, but printf debugging is not doing
much, and I need something that can show me variable values, and which
line of execution (and where it is called from), something like gdb  
with

mpi,

is there anything like that?


There are a couple of options.  The first (works best with ssh, but  
can be made to work with most starting mechanisms) is to start a  
bunch of gdb sessions in xterms.  Something like:


  mpirun -np XX -d xterm -e gdb 

The '-d' option is necessary so that mpirun doesn't close the ssh  
sessions, severing its X11 forwarding channel.  This has the  
advantage of being free, but has the disadvantage of being a major  
pain.  A better option is to try a real parallel debugger, such as  
TotalView or Portland Group's PGDBG.  This has the advantage of  
working very well (I use TotalView whenever possible), but has the  
disadvantage of generally not being free.



Hope this helps,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] MPI_Recv, is it possible to switch on/off aggresive mode during runtime?

2006-07-06 Thread Brian Barrett

On Jul 5, 2006, at 8:54 AM, Marcin Skoczylas wrote:


I saw, some posts ago, almost the same question as I have, but it didn't
give me a satisfactory answer.
I have setup like this:

GUI program on some machine (f.e. laptop)
Head listening on tcpip socket for commands from GUI.
Workers waiting for commands from Head / processing the data.

And now it's problematic. For passing the commands from Head I'm  
using:

while(true)
{
MPI_Recv...

do whatever head said (process small portion of the data, return
result to head, wait for another commands)
}

So in the idle time workers are stuck in MPI_Recv and have 100% CPU
usage, even if they are just waiting for the commands from Head.
Normally, I would not prefer to have this situation as I sometimes  
have

to share the cluster with others. I would prefer not to stop whole mpi
program, but just go into 'idle' mode, and thus make it run again  
soon.

Also I would like to have this aggresive MPI_Recv approach switched on
when I'm alone on the cluster. So is it possible somehow to switch  
this

mode on/off during runtime? Thank you in advance!


Currently, there is not a way to do this.  Obviously, there's not  
going to be a way that is portable (ie, compiles with MPICH), but it  
may be possible to add this in the future.  It likely won't happen  
for the v1.1 release series, and I can't really speak for releases  
past that at this point.  I'll file an enhancement request in our  
internal bug tracker, and add you to the list of people to be  
notified when the ticket is updated.



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] error in running openmpi on remote node

2006-07-04 Thread Brian Barrett

On Jul 4, 2006, at 1:53 AM, Chengwen Chen wrote:


Dear openmpi users,

I am using openmpi-1.0.2 on Redhat linux. I can succussfully run  
mpirun in single PC with 2 np. But fail in remote node. Can you  
give me some advices? thank you very much in advance.


[say@wolf45 tmp]$ mpirun -np 2 /tmp/test.x

[say@wolf45 tmp]$ mpirun -np 2 --host wolf45,wolf46 /tmp/test.x
say@wolf46's password:
orted: Command not found.
[wolf45:11357] ERROR: A daemon on node wolf46 failed to start as  
expected.

[wolf45:11357] ERROR: There may be more information available from
[wolf45:11357] ERROR: the remote shell (see above).
[wolf45:11357] ERROR: The daemon exited unexpectedly with status 1.


Kefeng is correct that you should setup your ssh keys so that you  
aren't prompted for a password, but that isn't the cause of your  
failure.  The problem appears to be that orted (one of the Open MPI  
commands) is not in your path on the remote node.  You should take a  
look at one of the other FAQ sections on the setup required for Open  
MPI in an rsh/ssh type environment.


  http://www.open-mpi.org/faq/?category=running
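
As a rough sketch (the install prefix below is only an example -- use  
whatever "which mpirun" reports on wolf45), the usual fix is to make  
sure the shell startup files on every node set the paths for  
non-interactive logins as well:

   # e.g. in ~/.bashrc (or your shell's equivalent) on each node:
   export PATH=/opt/openmpi/bin:$PATH
   export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH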


Hope this helps,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Compilation problem

2006-07-04 Thread Brian Barrett

On Jul 3, 2006, at 11:49 PM, Samuel Wieczorek wrote:

Hi, I tried to install open-mpi on a Mac OS X (10.4), but the  
compilation

step failed due to undefined symbols.
Here is the compressed output files.

Any idea to help me ?


This is very odd, but it appears that /usr/bin/find isn't executable  
on your machine.  This results in the libraries in Open MPI not being  
built properly.  There were many lines like this in your log file:


  ../libtool: line 1: /usr/bin/find: cannot execute binary file

I'm not sure how this could happen, but fixing your 'find' command  
should fix the Open MPI build.



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Testing one-sided message passing with 1.1

2006-06-30 Thread Brian Barrett

On Jun 29, 2006, at 5:23 PM, Tom Rosmond wrote:

I am testing the one-sided message passing (mpi_put, mpi_get) that  
is now supported in the 1.1 release.  It seems to work OK for some  
simple test codes, but when I run my big application, it fails.   
This application is a large weather model that runs operationally  
on the SGI Origin 3000, using the native one-sided message passing  
that has been supported on that system for many years.  At least on  
that architecture, the code always runs correctly for processor  
numbers up to 480.  On the O3K a requirement for the one-sided  
communication to work correctly is to use 'mpi_win_create' to  
define the RMA 'windows' in symmetric locations on all processors,  
i.e. the same 'place' in memory on each processor.  This can be   
done with static memory, i.e. , in common; or on the 'symmetric  
heap', which is defined via environment variables.  In my  
application the latter method is used.  I define several of these  
'windows' on the symmetric heap, each with a unique handle.


Before I spend my time trying to diagnose this problem further, I  
need as much information about the OpenMPI one-sided implementation  
as available.  Do you have a similar requirement or criteria for  
symmetric memory for the RMA windows?  Are there runtime parameters  
that I should be using that are unique to one-sided message passing  
with OpenMPI?  Any other information will certainly be appreciated.


There are no requirements on the one-sided windows in terms of buffer  
pointers.  Our current implementation is over point-to-point so it's  
kinda slow compared to real one-sided implementations, but has the  
advantage of working with arbitrary window locations.
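
For reference, a minimal sketch of the fence-synchronized pattern being  
discussed (illustrative only, not taken from the weather model; note  
that the window buffer can be an ordinary local variable):

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, size, value, remote = -1;
       MPI_Win win;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       value = rank * 100;
       /* Expose one int per process; no symmetric placement needed. */
       MPI_Win_create(&value, sizeof(int), sizeof(int),
                      MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       MPI_Win_fence(0, win);
       if (0 == rank && size > 1) {
           /* Read the int exposed by rank 1. */
           MPI_Get(&remote, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
       }
       MPI_Win_fence(0, win);   /* the get is complete after this fence */

       if (0 == rank) {
           printf("got %d\n", remote);
       }
       MPI_Win_free(&win);
       MPI_Finalize();
       return 0;
   }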


There are only two parameters to tweak in the current implementation:

  osc_pt2pt_eager_send: If this is 1, we try to start progressing the
     put/get before the synchronization point.  The default is 0.  This
     is not well tested, so I recommend leaving it 0.  It's safer at
     this point.

  osc_pt2pt_fence_sync_method: This one might be worth playing with,
     but I doubt it could cause your problems.  This is the collective
     we use to implement MPI_FENCE.  Options are reduce_scatter
     (default), allreduce, alltoall.  Again, I doubt it will make any
     difference, but it would be interesting to confirm that.

You can set the parameters at mpirun time:

mpirun -np XX -mca osc_pt2pt_fence_sync_method reduce_scatter ./test_code


Our one-sided implementation has not been as well tested as the rest  
of the code (as this is our first release with one-sided support).   
If you can share any details on your application or, better yet, a  
test case, we'd appreciate it.


There is one known issue with the implementation.  It does not  
support using MPI_ACCUMULATE with user-defined datatypes, even if  
they are entirely composed of one predefined datatype.  We plan on  
fixing this in the near future, and an error message will be printed  
if this situation occurs.



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] SEGV_MAPERR during execution

2006-06-15 Thread Brian Barrett
On Thu, 2006-06-15 at 13:46 -0700, Anoop Rajendra wrote:

> I'm trying to run a simple pi program compiled using openmpi.
> 
> My command line and error message is
> 
> [mpiuser@Pebble-anoop ~]$ mpirun -n 2 -hostfile /opt/openmpi/openmpi/etc/openmpi-default-hostfile /home/mpiuser/cpi2
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x6
> *** End of error message ***
> [0] func:/opt/openmpi/openmpi/lib/libopal.so.0 [0xceb6dd]
> [1] func:/lib/tls/libpthread.so.0 [0xd44880]
> [2] func:/opt/openmpi/openmpi/lib/openmpi/mca_btl_tcp.so [0x746d23]
> [3] func:/opt/openmpi/openmpi/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x140) [0x744094]
> [4] func:/opt/openmpi/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x202) [0x96add6]
> [5] func:/opt/openmpi/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x85) [0x134259]
> [6] func:/opt/openmpi/openmpi/lib/libmpi.so.0(ompi_mpi_init+0x385) [0x70ca7d]
> [7] func:/opt/openmpi/openmpi/lib/libmpi.so.0(MPI_Init+0x8c) [0x6fb724]
> [8] func:/home/mpiuser/cpi2(main+0x56) [0x804890d]
> [9] func:/lib/tls/libc.so.6(__libc_start_main+0xd3) [0xaf3e23]
> [10] func:/home/mpiuser/cpi2 [0x8048819]

Which version of Open MPI are you using?  There were some problems with
the 1.0 series when certain networking configurations were found
(particularly with machines that had multiple active networks).  We
believe we have these fixed in the upcoming 1.1 release (there is a beta
available on the download page) and in the nightly snapshots of the
upcoming 1.0.3 release, which can be downloaded here:

 http://www.open-mpi.org/software/ompi/v1.1/
 http://www.open-mpi.org/nightly/v1.0/

Let us know if these help / don't help your problem.

Thanks,

Brian



Re: [OMPI users] Trouble with open MPI and Slurm

2006-06-15 Thread Brian Barrett
On Wed, 2006-06-14 at 10:05 -0700, Doolittle, Joshua wrote:
> I am running Open MPI version 1.0.2 and slurm 1.1.0.  I can run slurm
> jobs, and I can run mpi jobs.  However, when I run a mpi job in slurm
> batch mode with 4 processes, the processes do not talk to each other.
> They act like they are the only process.  I'm running these in slurm
> batch mode.  The job that I'm running is a simple mpi optimized hello
> world.  I'm running these on an opteron (x86_64) blade system from a
> head node.  Any help would be greatly appreciated.

How are you running your batch job?  Unlike some MPI implementations,  
Open MPI jobs  can not be started under SLURM without the use of  
mpirun.  You can either run mpirun under an interactive session:

   srun -N 4 -A
   mpirun -np 4 ./foobar

or from a batch script:

   echo "mpirun -np 4 ./foobar" > foo.sh
   chmod +x foo.sh
   srun -N 4 -b foo.sh

But you can't submit your application directly without mpirun.  This  
is a feature we would like to support in the future, but there are  
some licensing issues (we would have to link with their GPL'ed  
libraries, which wouldn't work so well for us).

Brian


-- 
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/



Re: [OMPI users] pnetcdf and OpenMPI

2006-06-13 Thread Brian Barrett
On Tue, 2006-06-13 at 10:51 -0700, Ken Mighell wrote:
> On May 6, 2006, Dries Kimpe reported a solution to getting 
> pnetcdf to compile correctly with OpenMPI.
> A patch was given for the file
> mca/io/romio/romio/adio/common/flatten.c
> Has this fix been implemented in the nightly series?

Yes, this has been fixed in the v1.1 and trunk nightly builds and in the
1.1b1 beta release.  It looks like we never followed up on the user's
mailing list with that information, but we fixed it around the time he
posted that fix.

Brian




Re: [OMPI users] Why does openMPI abort processes?

2006-06-12 Thread Brian Barrett
On Sun, 2006-06-11 at 04:26 -0700, imran shaik wrote:
> Hi,
> I some times get this error message.
> " 2 addtional processes aborted, possibly by openMPI"
>  
> Some times 2 processes, sometimes even more. Is it due to over load or
> program error?
>  
> Why does openMPI actually abort few processes?
>  
> Can anyone explain?

Generally, this is because multiple processes in your job aborted
(exited with a signal or before MPI_FINALIZE) and mpirun only prints the
first abort message.  You can modify how many abort status messages you
want to receive with the -aborted X option to mpirun, where X is the
number of process abort messages you want to see.  The message generally
includes some information on what happened to your process.


Hope this helps,

Brian




Re: [OMPI users] error for open-mpi application

2006-06-08 Thread Brian Barrett

On Jun 7, 2006, at 8:20 AM, Weihua Li wrote:


CPU: AMD Opteron, Linux86-64
 I used the following command to configure the open-mpi-1.0.2.
./configure --prefix=/home/ytang/gdata/whli/openmpi CC=pgcc  
CXX=pgCC F90=gpf90 --with-openib


The F90 environment variable doesn't do anything to configure.  You  
need to set F77 (for Fortran 77) and FC for Fortran 90.  Most likely,  
configure picked up gfortran for your Fortran 90 compiler, causing  
the error messages.


 I know there must be something wrong with the installation of
open-mpi, but I don't know where it is.


I think part of it is the Fortran 90 compiler name.  The rest, as  
Hugh mentioned, is that you really should use the wrapper compilers  
or look at the wrapper compiler configuration output to see what  
flags and libraries the Open MPI installation deems necessary.  You  
can do this by running:


  mpif90 -showme


Hope this helps,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Open MPI 1.0.2 and np >=64

2006-05-31 Thread Brian Barrett

On May 31, 2006, at 9:59 AM, Troy Telford wrote:

On Tue, 30 May 2006 20:32:44 -0600, Brian Barrett <brbarret@open- 
mpi.org>

wrote:


Also, it would be useful to know more about the platform (32 / 64
bit, etc) and how you configured Open MPI.


Yeah, I was a bit terse there, wasn't I?  Sorry 'bout that...

The system:
64 bit (Opteron, two dual-core processors)
PCI Express IB HCA's
Myrinet 10G (MX10G)
Gigabit Ethernet

configured and built with (both) GCC 3.4 and 4.0 -- didn't seem to
make much difference.
./configure --enable-cxx-exceptions

(Note, I use LDFLAGS and CFLAGS to point to the MX & InfiniBand  
headers.)


Did you happen to have a chance to try to run the 1.0.3 or 1.1  
nightly tarballs?  I'm 50/50 on whether we've fixed these issues  
already.


Brian


Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.

2006-05-31 Thread Brian Barrett

On May 31, 2006, at 12:41 PM, Justin Bronder wrote:

On 5/31/06, Brian W. Barrett <brbar...@open-mpi.org> wrote:
A quick workaround is to edit opal/include/opal_config.h and change the
#defines for OMPI_CXX_GCC_INLINE_ASSEMBLY and OMPI_CC_GCC_INLINE_ASSEMBLY

from 1 to 0.  That should allow you to build Open MPI with those XL
compilers.  Hopefully IBM will fix this in a future version ;).

Well I actually edited include/ompi_config.h and set both  
OMPI_C_GCC_INLINE_ASSEMBLY
and OMPI_CXX_GCC_INLINE_ASSEMBLY to 0.  This worked until libtool  
tried to create

a shared library:


Ah, yes, sorry about that.  We reorganized our directory structure a  
little bit since we released 1.0.2 and I listed the new path.




Of course, I've been told that directly linking with ld isn't such  
a great idea in the first

place.  Ideas?


I've had some issues building shared libraries with the XL  
compilers.  Libtool doesn't seem to do a good job of supporting  
them.  Your best bet is to build Open MPI with static libraries.  The  
options --enable-static --disable-shared will build static libraries  
instead of shared libraries.



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] openmpi-1.1a7 on solaris10 opteron

2006-05-31 Thread Brian Barrett

On May 29, 2006, at 5:46 AM, Francoise Roch wrote:


I still have a problem to select an interface with openmpi-1.1a7 on
solaris opteron.
I compile in 64 bit mode, with Studio11 compilers

I attempted to force interface exclusion without success.
This problem is critical for us because we'll soon have Infiniband
interfaces for mpi traffic.

roch@n15 ~/MPI > mpirun --mca btl_tcp_if_exclude bge1 -np 2 -host
p15,p27 all2all
Process 0 is alive on n15
Process 1 is alive on n27
[n27:05110] *** An error occurred in MPI_Barrier
[n27:05110] *** on communicator MPI_COMM_WORLD
[n27:05110] *** MPI_ERR_INTERN: internal error
[n27:05110] *** MPI_ERRORS_ARE_FATAL (goodbye)
1 process killed (possibly by Open MPI)

The code works without mca btl_tcp_if_exclude option.


It took me a while to realize what was going on.  Normally,
btl_tcp_if_exclude excludes the loopback (lo) devices so that they won't
be used for the BTL transport.  When you explicitly set
btl_tcp_if_exclude, you have to include lo0 (for Solaris) in the list or
things go downhill.  I can replicate Françoise's problem on the cluster.
However, if I instead do:


mpirun --mca btl_tcp_if_exclude bge0,lo0 -np 2 --host n15,n27 ./ring


the routing issues are resolved and everything runs to completion.

I'll make sure to update the documentation for 1.1 so that this  
hopefully doesn't confuse too many more people.
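(If in doubt about what the default exclude list is on a given platform,
something like the following should show it -- assuming your ompi_info
supports the --param option:)

  ompi_info --param btl tcp | grep if_exclude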




Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/





Re: [OMPI users] Thread Safety

2006-05-31 Thread Brian Barrett

On May 26, 2006, at 11:31 PM, imran shaik wrote:

I have installed the OpenMPI alpha 7 release.  I created an MPI program
with pthreads.  I ran with just 6 processes, each thread making MPI
calls concurrently with the main thread.  Things work fine.  I use a TCP
network.


Sometimes I get a strange error message.




Sometimes I get this error message, and sometimes not.  In a run of 7, I
get it once.  But I get the output properly and the program works fine.
I just wanted to know why that occurred.


We just released alpha 8, which should include a fix for a problem  
that sounds very similar to what you are seeing.  Can you try  
upgrading and see if that solves your problem?


Another thing: I tried to get verbose output from "mpirun", but
couldn't.  Same with "mpiexec".  In LAM, using the same command, mpirun
-v -np 6 myprogram, I used to get verbose output saying which process
was running where.  Here nothing happens.


What is the problem?  Otherwise, how can I know which process is running
on which node?  Any suggestions?


We don't currently have a good way of dealing with this.  You can get  
lots of debugging information from the -d option to mpirun, but it  
would be difficult to get exactly what you are looking for from the  
debugging output.
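(For example, something along these lines -- illustrative only:)

  mpirun -d -np 6 myprogram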


Your best bet would probably be to use gethostname() and
MPI_Comm_rank() inside your MPI application and print the results to
stdout / stderr.
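Something along these lines -- a minimal sketch, with error checking
omitted:

  #include <stdio.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank;
      char host[256];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* gethostname() is POSIX; truncate defensively. */
      gethostname(host, sizeof(host));
      host[sizeof(host) - 1] = '\0';

      printf("rank %d is running on %s\n", rank, host);
      fflush(stdout);

      MPI_Finalize();
      return 0;
  }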



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.

2006-05-31 Thread Brian Barrett

On May 28, 2006, at 8:48 AM, Justin Bronder wrote:


Brian Barrett wrote:

On May 27, 2006, at 10:01 AM, Justin Bronder wrote:



I've attached the required logs.  Essentially the problem seems to
be that the XL Compilers fail to recognize "__asm__ __volatile__" in
opal/include/sys/powerpc/atomic.h when building 64-bit.

I've tried using various xlc wrappers such as gxlc and xlc_r to
no avail.  The current log uses xlc_r_64 which is just a one line
shell script forcing the -q64 option.

The same works flawlessly with gcc-4.1.0.  I'm using the nightly
build in order to link with Torque's new shared libraries.

Any help would be greatly appreciated.  For reference here are
a few other things that may provide more information.



Can you send the config.log file generated by configure?  What else
is in the xlc_r_64 shell script, other than the -q64 option?



I've attached the config.log, and here's what all of the *_64 scripts
look like.


Can you try compiling without the -qkeyword=__volatile__?  It looks  
like XLC now has some support for GCC-style inline assembly, but it  
doesn't seem to be working in this case.  If that doesn't work, try  
setting CFLAGS and CXXFLAGS to include -qnokeyword=asm, which should  
disable GCC inline assembly entirely.  I don't have access to a linux  
cluster with the XL compilers, so I can't verify this.  But it should  
work.
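For example, something like this (a sketch only -- the compiler names
and the remaining configure arguments are placeholders for whatever your
*_64 wrappers currently do):

  ./configure CC=xlc_r CXX=xlC_r \
      CFLAGS="-q64 -qnokeyword=asm" CXXFLAGS="-q64 -qnokeyword=asm" \
      [your other options]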


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.

2006-05-27 Thread Brian Barrett

On May 27, 2006, at 10:01 AM, Justin Bronder wrote:


I've attached the required logs.  Essentially the problem seems to
be that the XL Compilers fail to recognize "__asm__ __volatile__" in
opal/include/sys/powerpc/atomic.h when building 64-bit.

I've tried using various xlc wrappers such as gxlc and xlc_r to
no avail.  The current log uses xlc_r_64 which is just a one line
shell script forcing the -q64 option.

The same works flawlessly with gcc-4.1.0.  I'm using the nightly
build in order to link with Torque's new shared libraries.

Any help would be greatly appreciated.  For reference here are
a few other things that may provide more information.


Can you send the config.log file generated by configure?  What else  
is in the xlc_r_64 shell script, other than the -q64 option?


Thanks,

Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Fortran support not installing

2006-05-26 Thread Brian Barrett

The last lines of your make.out file were:

90 > mpi-f90-interfaces.h
***
* Compiling the mpi.f90 file may take a few minutes.
* This is quite normal -- do not be alarmed if the compile
* process seems to 'hang' at this point for several minutes.
***
g95 -I../../../ompi/include -I. -I.  -c -I. -o mpi.o  mpi.f90

Was there some other output not included in the file?  If nothing  
happened for a while, don't assume it failed.  That file takes a  
very, very long time to compile.


Brian

On May 25, 2006, at 1:46 PM, Terry Reeves wrote:


Hello
	I tried configure with FCFLAGS=-lSystemStubs and with both
FCFLAGS=-lSystemStubs and LDFLAGS=-lSystemStubs.  Again it died during
configure both times.  I can provide configure output if desired.
	I also decided to try version 1.1a7.  With LDFLAGS=-lSystemStubs,
with or without FCFLAGS=-lSystemStubs, it gets through configure but
fails in "make all".  Since that seems to be progress I have included
that output.
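(The configure invocations were along these lines, with the rest of my
options omitted:)

  ./configure FCFLAGS=-lSystemStubs [other options]
  ./configure FCFLAGS=-lSystemStubs LDFLAGS=-lSystemStubs [other options]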




Date: Thu, 25 May 2006 10:02:08 -0400
From: "Jeff Squyres \(jsquyres\)" 
Subject: Re: [OMPI users] Fortran support not installing
To: "Open MPI Users" 

I actually had to set FCFLAGS, not LDFLAGS, to get arbitrary flags
passed down to the Fortran tests in configure.

Can you try that?  (I'm not 100% sure -- you may need to specify  
LDFLAGS

*and* FCFLAGS...?)

We have made substantial improvements to the configure tests with
regards to the MPI F90 bindings in the upcoming 1.1 release.  Most of
the work is currently off in a temporary branch in our code  
repository
(meaning that it doesn't show up yet in the nightly trunk  
tarballs), but

it will hopefully be brought back to the trunk soon.





Terry Reeves 2-1013 - reeve...@osu.edu
Computing Services
Office of Information Technology
The Ohio State University






Re: [OMPI users] Open MPI and OpenIB

2006-05-11 Thread Brian Barrett

On May 11, 2006, at 10:10 PM, Gurhan Ozen wrote:


Brian,
Thanks for the very clear answers.

I did change my code to include fflush() calls after printf() ...

And I did try with --mca btl ib,self.  Interesting result: with --mca
btl ib,self, hello_world works fine, but broadcast hangs after I enter
the vector length.

At any rate, with --mca btl ib,self it looks like the traffic goes over
the ethernet device.  I couldn't find any documentation on the "self"
argument of mca -- does it mean to explore alternatives if the desired
btl (in this case ib) doesn't work?


No, self is the loopback device, for sending messages to self.  It is  
never used for message routing outside of the current process, but is  
required for almost all transports, as send to self can be a sticky  
issue.


You are specifying openib, not ib, as the argument to mpirun,  
correct?  Either way, I'm not really sure how data could be going  
over TCP -- the TCP transport would definitely be disabled in that  
case.  At this point, I don't know enough about the Open IB driver to  
be of help -- one of the other developers is going to have to jump in  
and provide assistance.
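(For reference, the invocation I would expect to work looks something
like this, reusing the paths and hosts from your earlier mail:)

  $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
      --mca pls_rsh_agent ssh --mca btl openib,self -np 2 \
      --host 192.168.1.34,192.168.1.32 /path/to/hello_world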



Speaking of documentation, it looks like Open MPI didn't come with a man
page for mpirun.  I thought I had seen in one of the slides from the
Open MPI developers' workshop that it did have mpirun.1.  Do I need to
check it out from svn?


That's one option, or wait for us to release Open MPI 1.0.3 / 1.1.

Brian



On 5/11/06, Brian Barrett <brbar...@open-mpi.org> wrote:

On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote:


 My ultimate goal is to get Open MPI working with the OpenIB stack.
First, I had installed LAM/MPI; I know it doesn't have support for
OpenIB, but it's still relevant to some of the questions I will ask.
Here is the setup I have:


Yes, keep in mind throughout that while Open MPI does support MVAPI,
LAM/MPI will fall back to using IP over IB for communication.


 I have two machines, pe830-01 and pe830-02.  Both have an ethernet
interface and an HCA interface.  The IP addresses are:

              eth0          ib0
  pe830-01    10.12.4.32    192.168.1.32
  pe830-02    10.12.4.34    192.168.1.34

   So this has worked even though the lamhosts file is configured to use
   the ib0 interfaces.  I further verified with tcpdump that none of this
   traffic went to eth0.

   Anyhow, if I change the lamhosts file to use the eth0 IPs, things work
   just the same with no issues.  And in that case I see some traffic on
   eth0 with tcpdump.


Ok, so at least it sounds like your TCP network is sanely configured.


   Now, when I installed and used Open MPI, things didn't work as easily.
   Here is what happens.  After recompiling the sources with the mpicc
   that comes with Open MPI:

   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
       --host 10.12.4.34,10.12.4.32 /path/to/hello_world
   Hello, world, I am 0 of 2 and this is on : pe830-02.
   Hello, world, I am 1 of 2 and this is on: pe830-01.

   So far so good, using the eth0 interfaces -- hello_world works just
   fine.  Now, when I try the broadcast program:


In reality, you always need to include two BTLs when specifying them
explicitly: the one you want to use (mvapi, openib, tcp, etc.) and
"self".  You can run into issues otherwise.

   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
       --host 10.12.4.34,10.12.4.32 /path/to/broadcast

   It just hangs there; it doesn't prompt me with the "Enter the vector
   length:" string.  So I just enter a number anyway, since I know the
   behavior of the program:

   10
   Enter the vector length: i am: 0 , and i have 5 vector elements
   i am: 1 , and i have 5 vector elements
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00
   [0] 10.00

   So, that's the first bump with Open MPI.  Now, if I try to use the ib0
   interfaces instead of the eth0 ones, I get:


I'm actually surprised this worked in LAM/MPI, to be honest.  There
should be an fflush() after the printf() to make sure that the output
is actually sent out of the application.
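(A minimal sketch of the pattern I mean -- this is not the actual
broadcast.c, just the relevant idea:)

  #include <stdio.h>

  int main(void)
  {
      int len = 0;

      /* Flush explicitly so the prompt reaches the terminal (or mpirun's
         stdout forwarding) before we block waiting for input. */
      printf("Enter the vector length: ");
      fflush(stdout);

      if (scanf("%d", &len) == 1) {
          printf("vector length is %d\n", len);
          fflush(stdout);
      }
      return 0;
  }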

   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
       --mca pls_rsh_agent ssh --mca btl openib -np 2 \
       --host 192.168.1.34,192.168.1.32 /path/to/hello_world

 
--


   No available btl components were found!

   This means that there are no components of this type installed on your
   system or all the components reported that they could not be used.

   This is a fatal error; your MPI process is likely to abort.  Check the
   output of the "ompi_info" command and ensure that components of this
   type are available on your system.  You may also wish to check the


Re: [OMPI users] Open MPI and OpenIB

2006-05-11 Thread Brian Barrett
-- 

   It looks like MPI_INIT failed for some reason; your parallel process
   is likely to abort.  There are many reasons that a parallel process
   can fail during MPI_INIT; some of which are due to configuration or
   environment problems.  This failure appears to be an internal failure;
   here's some additional information (which may only be relevant to an
   Open MPI developer):

     PML add procs failed
     --> Returned value -2 instead of OMPI_SUCCESS

-- 


   *** An error occurred in MPI_Init
   *** before MPI was initialized
   *** MPI_ERRORS_ARE_FATAL (goodbye)

   error ...


This makes it sound like Open IB is failing to set up properly.  I'm a
bit out of my league on this one -- is there any application you can run
outside of MPI to verify that the Open IB stack is working?


   4 - How come the behavior of broadcast.c was different on Open MPI
   than it is on LAM/MPI?


I think I answered this one already.

   5 - Any ideas as to why I am getting the "no available btl components"
   error when I want to use openib, even though ompi_info shows it?  If
   it helps any further, I have the following openib modules:


This usually (but not always) indicates that something is going wrong
with initializing the hardware interface.  ompi_info only tries to load
the module; it does not initialize the network device.
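(So a check like the following only confirms that the component was
built and can be loaded, not that the hardware will initialize:)

  ompi_info | grep btl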



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/



