Without analyzing your source, it's hard to say. I will say that
OMPI may send fragments out of order, but we do, of course, provide
the same message ordering guarantees that MPI mandates. So let me
ask a few leading questions:
- Are you using any wildcards in your receives, such as
MPI_ANY_SOURCE or MPI_ANY_TAG? (See the first sketch after these
questions.)
- Have you run your code through a memory-checking debugger such as
valgrind?
- I don't know what Scali MPI uses, but MPICH and Intel MPI use
integers for MPI handles. Have you tried LAM/MPI as well? It, like
Open MPI, uses pointers for MPI handles. I mention this because apps
that unintentionally have code that depends on integer handles can
sometimes behave unpredictably when switching to a pointer-based MPI
implementation. (See the second sketch after these questions.)
- What network interconnect are you using between the two hosts?
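To illustrate the wildcard question: here is a minimal sketch (not
your code; it just assumes a generic master/slave exchange) showing
that MPI guarantees ordering between any single sender/receiver pair,
but an MPI_ANY_SOURCE receive can match messages from different
slaves in a different interleaving from run to run and from one MPI
implementation to another:

/* Hypothetical master/slave sketch.  Messages from each slave arrive
 * in the order that slave sent them (MPI's non-overtaking rule), but
 * a wildcard receive may interleave the slaves in any order. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* master */
        for (i = 0; i < 2 * (size - 1); ++i) {
            int work;
            MPI_Status status;
            /* Whichever slave's message is matched first wins; code
             * that assumes a fixed slave order can appear to work
             * with one MPI and fail with another. */
            MPI_Recv(&work, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            printf("master got %d from rank %d\n",
                   work, status.MPI_SOURCE);
        }
    } else {                              /* slaves */
        for (i = 0; i < 2; ++i) {
            int work = 10 * rank + i;
            MPI_Send(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}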
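To illustrate the handle question: a contrived sketch (the function
and variable names are made up, not taken from WINDUS) of code that
happens to work when MPI handles are integers but is undefined with
pointer-based handles:

/* Hypothetical example of unintentionally relying on integer handles.
 * With MPICH-style integer handles, stuffing an MPI_Comm into an int
 * happens to work; with Open MPI's (or LAM's) pointer handles the
 * value is truncated on 64-bit machines and later use is undefined. */
#include <mpi.h>

int saved_comm;                        /* wrong type: should be MPI_Comm */

void remember_comm(MPI_Comm comm)
{
    saved_comm = (int) (long) comm;    /* silently drops the upper bits
                                          of a pointer-sized handle */
}

void use_comm(int *buf, int count)
{
    /* Rebuilding the handle from an int is only valid when handles
     * really are integers; with pointer handles this is garbage. */
    MPI_Bcast(buf, count, MPI_INT, 0, (MPI_Comm) (long) saved_comm);
}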
On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:
Recently I wanted to try Open MPI for use with our CFD flow solver
WINDUS. The code uses a master/slave methodology where the master
handles I/O and issues tasks for the slaves to perform. The original
parallel implementation was done in 1993 using PVM, and in 1999 we
added support for MPI.
When testing the code with Open MPI 1.1.2 it ran fine on a single
machine. As soon as I ran on more than one machine I started getting
random errors right away (the attached tarball has a good and a bad
output). It looked like the messages were either out of order or were
intended for the other slave process. In the run mode used there is no
slave-to-slave communication. In the attached output the code died near
the beginning of the communication between master and slave; sometimes
it will run further before it fails.
I have included a tar file with the build and configuration info. The
two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am
running real-time (no queue) using the ssh starter with the following
appfile:
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
The above file fails, but the following works:
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
The first process is the master and the other two are the slaves. I am
not sure what is going wrong; the code runs fine with many other MPI
distributions (MPICH1/2, Intel, Scali...). I assume that either I built
it wrong or am not running it properly, but I cannot see what I am
doing wrong. Any help would be appreciated!
<mpipb.tgz>
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems