On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote:

> I'd suspect the filesystem too, except that it's hung up in an MPI
> call.  As I said before, the whole thing is bizarre.  It doesn't
> matter where the executable is, just what CWD is (i.e. I can do
> mpirun /scratch/exec or mpirun /home/bernstei/exec, but if it's
> sitting in /scratch it'll hang).  And I've been running other codes
> both from NFS and from scratch directories for months, and never had
> a problem.

That is indeed odd, but it shouldn't be too hard to track down.  How
often does the failure occur?  Presumably the three invocations of the
program you mention communicate via files -- is the location of those
files changing?

I assume you're certain it's actually hanging and not just failing to
converge?

Finally, if you could run it with something like "--mca btl ^openib" to
take the ofed stack out of the picture, that would be useful.  You'd
need to check the exact component name against your Open MPI
installation (e.g. with "ompi_info").
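For reference, a sketch of the invocation (the BTL component name
"openib" is my assumption for the InfiniBand transport; verify against
your build with "ompi_info | grep btl"):

```shell
# Sketch only: disable the InfiniBand/OpenFabrics BTL so Open MPI falls
# back to TCP + shared memory.  "openib" is an assumed component name --
# list the ones actually built in with:  ompi_info | grep btl
MCA_ARGS="--mca btl ^openib"

# Dry run: print the command rather than launching the job here.
echo mpirun $MCA_ARGS -np 8 ./your_exec
```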

> Using MVAPICH every process is stuck in a collective, but they're not
> all the same collective (see stack traces below).  The 2 processes on
> the head node are stuck on mpi_bcast, in various low level MPI
> routines.  The other 6 processes are stuck on an mpi_allreduce, again
> in various low level MPI routines.  I don't know enough about the
> code to tell whether they're all supposed to be part of the same
> communicator, and the fact that they're stuck on different
> collectives is suspicious.  I can look into that.

This isn't so suspicious: if there is a problem with some processes,
it's common for the other processes to continue until the next
collective call and block there.
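To illustrate why this is expected, here is a toy model -- plain Python
threads standing in for MPI ranks, not real MPI: one "rank" wedges
inside the bcast while the others complete it (a rooted bcast can
complete for some ranks without all of them) and then block in the next
collective, so a snapshot of the stack traces shows a mix of
collectives even though there is only one underlying fault.

```python
import threading
import time

# Toy model (not real MPI): each "rank" runs bcast then allreduce.
NRANKS = 4
stuck_in = {}                           # rank -> collective its stack trace would show
allreduce = threading.Barrier(NRANKS)   # allreduce completes only if all ranks arrive

def rank(r):
    if r == 0:
        stuck_in[r] = "mpi_bcast"       # rank 0 wedges inside the bcast
        time.sleep(1.0)                 # stand-in for a hung low-level call
        return
    # Healthy ranks finish the bcast and reach the *next* collective.
    stuck_in[r] = "mpi_allreduce"
    try:
        allreduce.wait(timeout=1.0)     # blocks: rank 0 never arrives
    except threading.BrokenBarrierError:
        pass

threads = [threading.Thread(target=rank, args=(r,)) for r in range(NRANKS)]
for t in threads:
    t.start()
time.sleep(0.2)
snapshot = dict(stuck_in)               # what "stack traces" show mid-hang
for t in threads:
    t.join()
print(snapshot)
```

The snapshot shows one rank in mpi_bcast and the rest in
mpi_allreduce -- the same mixed picture as in the traces below, with a
single stalled process as the cause.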

Ashley,
