Hi, Eugene:
Thanks for your efforts in reproducing this problem; glad to know it's
not just us.
I think our solution for now is just to migrate our users to MVAPICH2
and Intel MPI; those MPICH-based implementations have been extremely
reliable for us and our users, and it looks like OpenMPI just isn't
ready for real production use on our system.
- Jonathan
On 2009-09-24, at 4:16PM, Eugene Loh wrote:
Jonathan Dursi wrote:
So to summarize:

OpenMPI 1.3.2 + gcc 4.4.0, test problem with periodic Sendrecv()s
(left neighbour of proc 0 is proc N-1; a minimal sketch of the test
loop follows this summary):
- Default: always hangs in Sendrecv after a random number of iterations.
- Turning off sm (-mca btl self,tcp): not observed to hang.
- Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang.
- Using fewer than 5 fifos: hangs in Sendrecv after a random number of
  iterations, or in Finalize.

OpenMPI 1.3.3 + gcc 4.4.0, same test problem:
- Default: sometimes (~20% of the time) hangs in Sendrecv after a
  random number of iterations.
- Turning off sm (-mca btl self,tcp): not observed to hang.
- Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang.
- Using fewer than 5 fifos but more than 2: not observed to hang.
- Using 2 fifos: sometimes (~20% of the time) hangs in Finalize or in
  Sendrecv after a random number of iterations, but sometimes completes.

OpenMPI 1.3.2 + Intel 11.0 compilers:
- We are seeing a problem we believe to be related: ~1% of certain
  single-node jobs hang; turning off sm or setting num_fifos to NP-1
  eliminates this.
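For reference, here is a minimal sketch of the kind of test loop
described above (a hypothetical reconstruction, not the exact test
code): each rank repeatedly does a periodic-ring MPI_Sendrecv, sending
to its right neighbour and receiving from its left.

/* ring_sendrecv.c -- minimal sketch of the periodic-Sendrecv test
 * described above (hypothetical reconstruction, not the original
 * test code). The left neighbour of rank 0 is rank N-1.
 *
 * Build and run, e.g. with the workarounds discussed above:
 *   mpicc ring_sendrecv.c -o ring_sendrecv
 *   mpirun -np 6 ./ring_sendrecv                      # default: may hang
 *   mpirun -np 6 -mca btl self,tcp ./ring_sendrecv    # sm off: no hang seen
 *   mpirun -np 6 -mca btl_sm_num_fifos 5 ./ring_sendrecv
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;  /* periodic: left of 0 is N-1 */
    int right = (rank + 1) % size;

    for (iter = 0; iter < 100000; iter++) {
        int sendbuf = rank, recvbuf = -1;
        /* send to the right neighbour, receive from the left */
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                     &recvbuf, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("completed %d iterations\n", iter);
    MPI_Finalize();
    return 0;
}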
I can reproduce this with just Barriers, which keep all the processes
in sync. So, this has nothing to do with processes outrunning one
another (which wasn't likely in the first place, given that you had
Sendrecv calls).
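A barrier-only loop like this is enough to hit the hang (again, a
sketch rather than my exact code):

/* barrier_loop.c -- sketch of the barrier-only reproducer mentioned
 * above (hypothetical; not the exact code used). All ranks stay in
 * lockstep, so no process can outrun another. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1000000; i++)
        MPI_Barrier(MPI_COMM_WORLD);  /* hang appears after a random iteration */

    if (rank == 0)
        printf("all barriers completed\n");
    MPI_Finalize();
    return 0;
}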
The problem is fickle. E.g., building OMPI with -g seems to make
the problem go away.
I did observe that the sm FIFO would fill up. That's weird, since
there are never many in-flight messages. I tried adding a line of
code that makes a process pause whenever it tries to write to a FIFO
that seems full. That pretty much made the problem go away. So, I
guess it's a memory coherency problem: the receiver clears the FIFO,
but the writer still thinks it's congested.
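To illustrate the guess (a generic single-producer/single-consumer
FIFO sketch, not OMPI's actual sm BTL code): if the writer reads a
stale copy of the read index, it can conclude the queue is full even
though the receiver has already drained it.

/* Generic single-producer/single-consumer circular FIFO sketch (NOT
 * OMPI's actual sm BTL code), illustrating the coherency hypothesis. */
#include <stddef.h>

#define FIFO_SIZE 64

struct fifo {
    volatile size_t head;         /* next slot the writer fills  */
    volatile size_t tail;         /* next slot the reader drains */
    void *slots[FIFO_SIZE];
};

/* Writer side: returns 0 on success, -1 if the queue LOOKS full.
 * Without a memory barrier here, the writer may see a stale tail:
 * the reader has already advanced it, but the writer's cached view
 * still says "full" -- matching the congestion observed above. */
int fifo_write(struct fifo *f, void *msg)
{
    size_t next = (f->head + 1) % FIFO_SIZE;
    if (next == f->tail)          /* stale tail => spurious "full" */
        return -1;
    f->slots[f->head] = msg;
    f->head = next;               /* needs a write barrier before this */
    return 0;
}

A pause before retrying the write would give the writer's cached view
of the tail time to catch up, which is consistent with the pause
making the problem mostly go away.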
I tried all sorts of GCC versions. The problem seems to set in with
4.4.0, though I don't know what's significant about that. Moving to
4.4.0 requires the 2.18 assembler, but I tried the 2.18 assembler
with GCC 4.3.3 and that worked okay. I'd think this has to do with
GCC 4.4.x, but you say you see the problem with the Intel compilers
as well. Hmm. Maybe an OMPI problem that's better exposed by GCC
4.4.x?
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>