Brock Palen wrote:
OK, it looks like a bigger problem. The segfault is not related to
OMPI, because when I rebuild 1.2 or another version that we use with
IB all the time, it now fails with a segfault when forcing IB.
The old libs of the same version still work. They of course do not
have the flag to turn off early completion.
Was there an older version of OpenMPI that did not suffer from the
early completion problem?
The issue was fixed in the 1.3 branch; all versions before 1.3 have this
problem.
We have many versions installed, and for a quick test, having the latest
and greatest is not much of a concern while we track down the problem on
our end. We are on RHEL4 using the OFED provided by Red Hat. The error is
"address not mapped to object".
I think the best option for you is to install the Mellanox OFED
distribution, which already includes pre-built versions of Open MPI 1.2.6
for the Intel and PGI compilers:
http://www.mellanox.com/products/ofed.php
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Jul 3, 2008, at 8:38 AM, Jeff Squyres wrote:
On Jul 2, 2008, at 11:51 PM, Pavel Shamis (Pasha) wrote:
Building 1.2.6 with the PGI compilers produces an MPI library that
works with tcp and sm, but it segfaults on openib.
Both our Intel compiler build and our PGI build of 1.2.6 blow up
like this when we force IB. So this is a new issue.
I have ompi 1.2.6 installed on my machines with the Intel compiler
(version 10.1) and the PGI compiler (version 7.1-5), and both of them work
with IB without any problem. BTW, Mellanox provides a Mellanox OFED
binary distribution that includes Intel and PGI builds of Open MPI 1.2.6.
You can download it from here: http://www.mellanox.com/products/ofed.php
Is there a way to shut off early completion in 1.2.3?
Sure, just add "--mca pml_ob1_use_early_completion 0" to your
command line.
Note that this flag was not added until v1.2.6; it has no effect in
v1.2.3.
Or is the above a known issue, and I should use 1.2.7-pre or grab
a 1.3 snapshot?
1.2.6 should be ok.
The upcoming v1.3 series works a little differently; there's no need
to use this flag in the v1.3 series (i.e., this flag only exists in
the v1.2 series starting with v1.2.6).
--
Jeff Squyres
Cisco Systems
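
For reference, a minimal sketch of how the flag discussed above might be combined with forcing the openib transport on an mpirun command line. The process count (NP) and application path (APP) are hypothetical placeholders that do not come from the thread, and "--mca btl openib,self" is assumed here as the way IB is being forced. Per the note above, the early-completion parameter only exists in the v1.2 series starting with v1.2.6, so this would have no effect on v1.2.3.

```shell
#!/bin/sh
# Sketch only: NP and APP are hypothetical placeholders, not from the thread.
NP=4
APP=./my_mpi_app

# "--mca btl openib,self" restricts Open MPI to the InfiniBand BTL plus the
# loopback "self" BTL (assumed here to be what "forcing IB" means).
# "--mca pml_ob1_use_early_completion 0" is the flag quoted in the thread.
CMD="mpirun --mca btl openib,self --mca pml_ob1_use_early_completion 0 -np $NP $APP"

# Print the command for inspection instead of launching it.
echo "$CMD"
```

The script only prints the assembled command so it can be checked before use; paste the printed line into a shell on a node where the Open MPI 1.2.6 (or later 1.2.x) mpirun is first in PATH.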