On 9/21/2011 11:44 AM, Blosch, Edwin L wrote:
Follow-up to a mislabeled thread: "How could OpenMPI (or MVAPICH) affect
floating-point results?"
I have found a solution to my problem, but I would like to understand the
underlying issue better.
To rehash: An Intel-compiled executable linked with MVAPICH runs fine; linked
with OpenMPI fails. The earliest symptom I could see was some strange
difference in numerical values of quantities that should be unaffected by MPI
calls. Tim's advice led me to suspect memory corruption. Eugene's advice led
me to compare in detail how the two executables were compiled.
I observed that the MVAPICH mpif90 wrapper adds -fPIC.
I tried adding -fPIC and -mcmodel=medium when compiling the OpenMPI-linked
executable, and now it works fine. I haven't tried without -mcmodel=medium,
but my guess is that -fPIC did the trick.
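For reference, the flags each wrapper injects can be inspected directly;
something like the following (the paths and the solver.f90 name are just
placeholders, not my actual setup):

    # Show what each wrapper actually passes to the underlying ifort:
    /path/to/mvapich/bin/mpif90 -show
    /path/to/openmpi/bin/mpif90 --showme
    # What I added to the Open MPI build:
    mpif90 -fPIC -mcmodel=medium -c solver.f90 -o solver.o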
Does anyone know why compiling with -fPIC has helped? Does it suggest an
application problem or an OpenMPI problem?
To note: This is an Infiniband-based cluster. The application does pretty
basic MPI-1 operations: send, recv, bcast, reduce, allreduce, gather, isend,
irecv, waitall. There is one task that uses iprobe with MPI_ANY_TAG, but that
task is only involved in certain cases, including the failing one. Cases that
do not call iprobe have not yet been observed to crash, so I suspect iprobe is
part of the problem.
If you are making a .so, the included .o files should be built with
-fPIC or similar. Ideally, the configure and build tools would enforce this.
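Concretely, something like this (foo.f90 is just a placeholder):

    mpif90 -fPIC -c foo.f90             # position-independent object code
    mpif90 -shared -o libfoo.so foo.o   # objects going into a .so need -fPIC
    # Without -fPIC, the x86-64 linker will usually reject the second step
    # with an error along the lines of "relocation R_X86_64_32 can not be
    # used when making a shared object; recompile with -fPIC".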
--
Tim Prince