Jonathan Dursi wrote:

We have a couple of installations of OpenMPI 1.3.2 on our systems, and we are having real problems with single-node jobs randomly hanging when using the shared memory BTL, particularly (but perhaps not only) with the version compiled with gcc 4.4.0.

The very trivial attached program, which just does a series of SENDRECVs rightwards through MPI_COMM_WORLD, hangs extremely reliably when run like so on an 8-core box:

mpirun -np 6 -mca btl self,sm ./diffusion-mpi

(The test example was based on a simple Fortran example of MPI-izing the 1D diffusion equation.) The hang always seems to occur within the first 500 or so iterations, sometimes between the 10th and 20th, and sometimes not until the late 400s. It occurs both on a new dual-socket quad-core Nehalem box and on an older Harpertown machine.
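For reference, the communication pattern in question is roughly the following. This is only a minimal sketch of the rightward-SENDRECV ring described above, not the attached program itself; the variable names, buffer contents, and iteration count are my own:

```fortran
! Sketch of the pattern: each rank repeatedly exchanges one value
! with its neighbours in MPI_COMM_WORLD via MPI_Sendrecv.
program ringtest
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, left, right, iter
    integer :: status(MPI_STATUS_SIZE)
    double precision :: sendbuf, recvbuf

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! periodic neighbours in the 1D ring
    left  = mod(rank - 1 + nprocs, nprocs)
    right = mod(rank + 1, nprocs)

    sendbuf = dble(rank)
    do iter = 1, 500
        ! send to the right-hand neighbour, receive from the left
        call MPI_Sendrecv(sendbuf, 1, MPI_DOUBLE_PRECISION, right, 1, &
                          recvbuf, 1, MPI_DOUBLE_PRECISION, left,  1, &
                          MPI_COMM_WORLD, status, ierr)
    end do

    call MPI_Finalize(ierr)
end program ringtest
```

Since MPI_Sendrecv is blocking, any lost or stalled message in the sm BTL would show up as exactly this kind of indefinite hang partway through the loop.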

Running without sm, however, seems to work fine:

mpirun -np 6 -mca btl self,tcp ./diffusion-mpi

never gives any problems.

Any suggestions? I notice a mention of "improved flow control in sm" in the ChangeLog for 1.3.3; is that a significant bugfix?

I filed a trac ticket on this.

https://svn.open-mpi.org/trac/ompi/ticket/2043
