If you are not using iWARP or InfiniBand networking, try configuring Open MPI --without-memory-manager and see if that solves your problem. Issues like this can come up, especially in C++ codes, when the application (or one of its supporting libraries) has its own memory manager that conflicts with Open MPI's. We *only* use an internal memory manager to optimize benchmark-style performance on iWARP and IB networks, so if you're not using iWARP or IB, and/or your application doesn't repeatedly re-use the same buffers for MPI_SEND/MPI_RECV, then you don't need our memory manager.

To be 100% clear: OMPI's internal memory manager is only used for the "mpi_leave_pinned" behavior. OMPI runs fine without it, but apps that continually re-use the same buffers for MPI_SEND/MPI_RECV (i.e., benchmarks) will definitely see degraded performance.
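
For illustration, here's the kind of benchmark-style loop that mpi_leave_pinned targets (just a generic sketch, not anyone's actual code): the same buffer is posted to MPI_Send/MPI_Recv on every iteration, so caching its pinned registration avoids repeated register/deregister costs on IB/iWARP.

/* Generic sketch of the buffer re-use pattern that benefits from
 * mpi_leave_pinned: one buffer, sent/received many times.
 * Build: mpicc reuse_bench.c -o reuse_bench
 * Run:   mpirun -np 2 ./reuse_bench
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i;
    const int count = 1 << 20;   /* arbitrary message size (in doubles) */
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(count * sizeof(*buf));

    /* Same buffer every iteration -- with mpi_leave_pinned the
     * registration only has to happen once. */
    for (i = 0; i < 1000; ++i) {
        if (rank == 0) {
            MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}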

FYI: for these kinds of reasons, we're changing how we do mpi_leave_pinned in the upcoming v1.3 series, so hopefully you won't run into these issues there.


On Jul 24, 2008, at 2:39 PM, Adam C Powell IV wrote:

Greetings,

I'm seeing a segfault in a code on Ubuntu 8.04 with gcc 4.2.  I
recompiled the Debian lenny openmpi 1.2.7~rc2 package on Ubuntu, and
compiled the Debian lenny petsc and libmesh packages against that.

Everything works just fine in Debian lenny (gcc 4.3), but in Ubuntu
hardy it fails during MPI_Init:

[Thread debugging using libthread_db enabled]
[New Thread 0x7faceea6f6f0 (LWP 5376)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7faceea6f6f0 (LWP 5376)]
0x00007faceb265b8b in _int_malloc () from /usr/lib/libopen-pal.so.0
(gdb) backtrace
#0  0x00007faceb265b8b in _int_malloc () from /usr/lib/libopen-pal.so.0
#1  0x00007faceb266e58 in malloc () from /usr/lib/libopen-pal.so.0
#2  0x00007faceb248bfb in opal_class_initialize () from /usr/lib/libopen-pal.so.0
#3  0x00007faceb25ce2b in opal_malloc_init () from /usr/lib/libopen-pal.so.0
#4  0x00007faceb249d97 in opal_init_util () from /usr/lib/libopen-pal.so.0
#5  0x00007faceb249e76 in opal_init () from /usr/lib/libopen-pal.so.0
#6  0x00007faced05a723 in ompi_mpi_init () from /usr/lib/libmpi.so.0
#7  0x00007faced07c106 in PMPI_Init () from /usr/lib/libmpi.so.0
#8  0x00007facee144d92 in libMesh::init () from /usr/lib/libmesh.so.0.6.2
#9  0x0000000000411f61 in main ()

libMesh::init() just has an assertion and command line check before
MPI_Init, so I think it's safe to conclude this is an OpenMPI problem.
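
A standalone program that does nothing but call MPI_Init should confirm that (a minimal sketch of a hypothetical test program, not part of my application):

/* Minimal MPI_Init reproducer (hypothetical test program): if this also
 * segfaults inside opal_malloc_init on Ubuntu hardy, the crash is
 * independent of petsc/libmesh.
 * Build: mpicc init_test.c -o init_test
 * Run:   mpirun -np 1 ./init_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("MPI_Init succeeded\n");
    MPI_Finalize();
    return 0;
}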

How can I help to test and fix this?

This might be related to Vincent Rotival's problem in
http://www.open-mpi.org/community/lists/users/2008/04/5427.php or maybe to
http://www.open-mpi.org/community/lists/users/2008/05/5668.php. Regarding the
latter: I'm building the Debian package, which should have the LDFLAGS=""
fix. Hmm, nope, no LDFLAGS anywhere in the .diff.gz... The OpenMPI top-level
Makefile has
"LDFLAGS = -export-dynamic -Wl,-Bsymbolic-functions"

-Adam
--
GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Engineering consulting with open source tools
http://www.opennovation.com/


--
Jeff Squyres
Cisco Systems
