Hi, Eugene:

Thanks for your efforts in reproducing this problem; glad to know it's not just us.

I think our solution for now is simply to migrate our users to MVAPICH2 and Intel MPI; those MPICH-based implementations have been extremely reliable for us and our users, and it looks like OpenMPI just isn't ready for real production use on our system.

        - Jonathan

On 2009-09-24, at 4:16PM, Eugene Loh wrote:

Jonathan Dursi wrote:

So to summarize:

OpenMPI 1.3.2 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s (a sketch of this pattern follows the results below):
 Default: always hangs in Sendrecv after a random number of iterations
 Turning off sm (-mca btl self,tcp): not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
 Using fewer than 5 fifos: hangs in Sendrecv after a random number of iterations, or in Finalize

OpenMPI 1.3.3 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:
 Default: sometimes (~20% of the time) hangs in Sendrecv after a random number of iterations
 Turning off sm (-mca btl self,tcp): not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
 Using fewer than 5 fifos but more than 2: not observed to hang
 Using 2 fifos: sometimes (~20% of the time) hangs in Finalize or in Sendrecv after a random number of iterations, but sometimes completes
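
For reference, the communication pattern in these tests is roughly the following (a minimal sketch, not the exact test code; the iteration count and buffers are placeholders):

    /* Sketch of the periodic Sendrecv ring: each rank sends to its right
       neighbour and receives from its left, with wraparound at rank 0. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nproc, left, right, sendbuf, recvbuf, iter;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Periodic neighbours: left neighbour of proc 0 is proc nproc-1. */
        left  = (rank + nproc - 1) % nproc;
        right = (rank + 1) % nproc;

        for (iter = 0; iter < 100000; iter++) {
            sendbuf = rank;
            MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                         &recvbuf, 1, MPI_INT, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }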

OpenMPI 1.3.2 + intel 11.0 compilers

We are seeing a problem that we believe is related: ~1% of certain single-node jobs hang; turning off sm or setting num_fifos to NP-1 eliminates it.
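
For anyone else hitting this, the workarounds above correspond to command lines roughly like these (assuming a 6-process single-node job; ./a.out stands in for the real application):

    # Turn off the sm BTL entirely:
    mpirun -np 6 -mca btl self,tcp ./a.out

    # Or keep sm but set btl_sm_num_fifos to NP-1 (5 for a 6-task job):
    mpirun -np 6 -mca btl_sm_num_fifos 5 ./a.out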

I can reproduce this with just Barriers, which keep all the processes in sync. So this has nothing to do with processes outrunning one another (which wasn't likely in the first place, given that you had Sendrecv calls).
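
The reproducer was essentially just a tight loop of barriers, something along these lines (the iteration count is arbitrary):

    /* Barrier-only reproducer sketch: all processes stay in lockstep. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int i;

        MPI_Init(&argc, &argv);
        for (i = 0; i < 1000000; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }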

The problem is fickle. E.g., building OMPI with -g seems to make the problem go away.

I did observe that the sm FIFO would fill up. That's weird, since there are never many in-flight messages. I tried adding a line of code that would make a process pause whenever it tried to write to a FIFO that seemed full. That pretty much made the problem go away. So, I guess it's a memory coherency problem: the receiver has drained the FIFO, but the writer still sees it as congested.
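
Conceptually, the idea was along these lines (illustration only, with a toy FIFO and made-up names; this is not the actual sm BTL code):

    /* Toy single-producer FIFO illustrating "pause if the FIFO looks full".
       Illustration only; not Open MPI code. */
    #include <stdatomic.h>
    #include <unistd.h>

    #define SLOTS 128

    struct toy_fifo {
        _Atomic unsigned head;   /* advanced by the writer */
        _Atomic unsigned tail;   /* advanced by the reader */
        void *slot[SLOTS];
    };

    static void toy_fifo_write(struct toy_fifo *f, void *frag)
    {
        /* Instead of failing or queueing when the FIFO looks full,
           pause briefly and re-check before writing. */
        while (atomic_load(&f->head) - atomic_load(&f->tail) >= SLOTS)
            usleep(10);

        unsigned h = atomic_load(&f->head);
        f->slot[h % SLOTS] = frag;        /* store the fragment... */
        atomic_store(&f->head, h + 1);    /* ...then publish it    */
    }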

I tried all sorts of GCC compilers. The problem seems to set in with 4.4.0. I don't know what's significant about that. GCC 4.4.0 requires moving to the 2.18 assembler, but I tried the 2.18 assembler with 4.3.3 and that worked okay. I'd think this has to do with GCC 4.4.x, but you say you see the problem with the Intel compilers as well. Hmm. Maybe an OMPI problem that's better exposed by GCC 4.4.x?

--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>



