Jonathan Dursi wrote:
We have a couple of installations of OpenMPI 1.3.2 here, and
we are having real problems with single-node jobs randomly hanging
when using the shared-memory (sm) BTL, particularly (but perhaps not
only) with the build compiled with gcc 4.4.0.
The very trivial attached program, which just does a series of
MPI_SENDRECVs rightward through MPI_COMM_WORLD, hangs extremely
reliably when run like so on an 8-core box:
mpirun -np 6 -mca btl self,sm ./diffusion-mpi
(the test example is based on a simple Fortran code that parallelizes
the 1-D diffusion equation with MPI). The hang always seems to occur
within the first 500 or so iterations - sometimes between the 10th and
20th, sometimes not until the late 400s. It occurs both on a new
dual-socket quad-core Nehalem box and on an older Harpertown machine.
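For reference, the rightward SENDRECV ring described above can be
sketched roughly as below. This is a minimal hypothetical
reconstruction, not the actual attached program; the variable names,
buffer contents, and the 500-iteration count are assumptions chosen
only to match the description in this message.

```fortran
program ring_sendrecv
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, left, right, iter
    double precision :: sendbuf, recvbuf
    integer, parameter :: niter = 500   ! roughly where the hangs were seen

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! rightward neighbours in a ring over MPI_COMM_WORLD
    right = mod(rank + 1, nprocs)
    left  = mod(rank - 1 + nprocs, nprocs)

    sendbuf = dble(rank)
    do iter = 1, niter
        ! each rank sends right and receives from the left every iteration
        call MPI_Sendrecv(sendbuf, 1, MPI_DOUBLE_PRECISION, right, 0, &
                          recvbuf, 1, MPI_DOUBLE_PRECISION, left,  0, &
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        sendbuf = recvbuf
    end do

    call MPI_Finalize(ierr)
end program ring_sendrecv
```

Run with, e.g., `mpirun -np 6 -mca btl self,sm ./ring_sendrecv` to
exercise the same sm-BTL path as the failing case.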
Running without sm, however, seems to work fine:
mpirun -np 6 -mca btl self,tcp ./diffusion-mpi
never gives any problems.
Any suggestions? I notice a mention of "improved flow control in sm"
in the ChangeLog for 1.3.3; is that a significant bugfix?
I filed a trac ticket on this.
https://svn.open-mpi.org/trac/ompi/ticket/2043