--- On Thu, 6/19/08, Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il> wrote:
From: Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il>
Subject: Re: [OMPI users] Open MPI timeout problems.
To: pj...@cornell.edu, "Open MPI Users" <us...@open-mpi.org>
Date: Thursday, June 19, 2008, 5:20 AM
Usually a retry-exceeded error points to some network issue on your
cluster. I see from the logs that you still use MVAPI. If I remember
correctly, MVAPI includes the IBADM application, which should be able to
check and debug the network. BTW, I recommend that you update your MVAPI
driver to the latest OpenFabrics driver.
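
For reference, once the nodes are on an OpenFabrics/OFED stack, a quick
first pass over the fabric can be made with the standard InfiniBand
diagnostics that ship with OFED; the exact tool set varies with the OFED
release, so treat these as illustrative rather than a prescription:

  ibstat        # local HCA and port state, link width/speed
  ibv_devinfo   # device and firmware details seen by the verbs layer
  ibdiagnet     # fabric-wide sweep reporting bad links and error counters
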
Peter Diamessis wrote:
> Dear folks,
>
> I would appreciate your help on the following:
>
> I'm running a parallel CFD code on the Army Research Lab's MJM Linux
> cluster, which uses Open MPI. I've run the same code on other Linux
> clusters that use MPICH2 and had never run into this problem.
>
> I'm quite convinced that the bottleneck for my code is this data
> transposition routine, although I have not done any rigorous profiling
> to check on it. This is where 90% of the parallel communication takes
> place. I'm running a CFD code that uses a 3-D rectangular domain which
> is partitioned across processors in such a way that each processor
> stores vertical slabs that are contiguous in the x-direction but shared
> across processors in the y-dir. When a 2-D Fast Fourier Transform
> (FFT) needs to be done, data is transposed such that the vertical slabs
> are now contiguous in the y-dir. in each processor.
>
> The code would normally be run for about 10,000 timesteps. In the
> specific case which blocks, the job crashes after ~200 timesteps and at
> each timestep a large number of 2-D FFTs are performed. For a domain
> with resolution of Nx * Ny * Nz points and P processors, during one
> FFT, each processor performs P Sends and P Receives of a message of
> size (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such Sends/Receives.
>
> I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175.
> You can see that each FFT involves 2048 communications. I totally
> rewrote my data transposition routine to no longer use specific
> blocking/non-blocking Sends/Receives but to use MPI_ALLTOALL, which I
> would hope is optimized in the specific MPI implementation to do data
> transpositions. Unfortunately, my code still crashes with time-out
> problems like before.
>
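> (For concreteness, what follows is a stripped-down sketch of this kind
> of MPI_Alltoall-based slab transpose, not the actual routine from the
> code. It assumes Nx and Ny are divisible by the number of ranks P,
> double-precision data, and a [y-slab][x][z] storage order before the
> transpose and [x-slab][y][z] after it; all names are illustrative.)
>
>   #include <mpi.h>
>   #include <stdlib.h>
>
>   /* in : local_in [Ny/P][Nx][Nz]  (y-slabs: all of x, a chunk of y)
>    * out: local_out[Nx/P][Ny][Nz]  (x-slabs: all of y, a chunk of x) */
>   static void transpose_slabs(const double *local_in, double *local_out,
>                               int Nx, int Ny, int Nz, MPI_Comm comm)
>   {
>       int P;
>       MPI_Comm_size(comm, &P);
>
>       const int nxloc = Nx / P;             /* x-planes per rank after  */
>       const int nyloc = Ny / P;             /* y-planes per rank before */
>       const int block = nxloc * nyloc * Nz; /* doubles per rank pair    */
>
>       double *sendbuf = malloc((size_t)P * block * sizeof(double));
>       double *recvbuf = malloc((size_t)P * block * sizeof(double));
>
>       /* Pack: block d holds the x-range that rank d will own after the
>        * transpose, ordered [local y][local x][z]. */
>       for (int d = 0; d < P; ++d)
>         for (int j = 0; j < nyloc; ++j)
>           for (int i = 0; i < nxloc; ++i)
>             for (int k = 0; k < Nz; ++k)
>               sendbuf[(size_t)d*block + ((size_t)j*nxloc + i)*Nz + k] =
>                   local_in[((size_t)j*Nx + d*nxloc + i)*Nz + k];
>
>       /* Each rank sends one block to, and receives one block from,
>        * every rank. */
>       MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
>                    recvbuf, block, MPI_DOUBLE, comm);
>
>       /* Unpack: block s holds the y-range that rank s owned before the
>        * transpose. */
>       for (int s = 0; s < P; ++s)
>         for (int j = 0; j < nyloc; ++j)
>           for (int i = 0; i < nxloc; ++i)
>             for (int k = 0; k < Nz; ++k)
>               local_out[((size_t)i*Ny + s*nyloc + j)*Nz + k] =
>                   recvbuf[(size_t)s*block + ((size_t)j*nxloc + i)*Nz + k];
>
>       free(sendbuf);
>       free(recvbuf);
>   }
>
> (With this layout each rank exchanges one block of (Nx/P)*(Ny/P)*Nz
> doubles with every other rank per transpose, and the single collective
> lets the MPI library choose the exchange schedule instead of the
> hand-coded Sends/Receives.)
>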
> This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL code
> worked fine on a smaller cluster here. Note that in the future I would
> like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and P=128 or
> 256 procs. which will involve an order of magnitude more communication.
>
> Note that I ran the job by submitting it to an LSF queue system. I've
> attached the script file used for that. I basically enter
> bsub -x < script_openmpi at the command line.
>
> When I communicated with a consultant at ARL, he recommended I use
> 3 specific script files which I've attached. I believe these enable
> control over some of the MCA parameters. I've experimented with values
> of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still have this
> problem. I am still in contact with this consultant but thought it
> would be good to contact you folks directly.
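>
> (As a side note on mechanics: in Open MPI 1.2 an MCA parameter such as
> btl_mvapi_ib_timeout can be set in several equivalent ways, so it is
> worth verifying that the value set in the wrapper scripts actually
> reaches mpirun. The value 20 and the executable name below are just
> placeholders:
>
>   mpirun --mca btl_mvapi_ib_timeout 20 -np 32 ./my_cfd_code
>   export OMPI_MCA_btl_mvapi_ib_timeout=20
>   echo "btl_mvapi_ib_timeout = 20" >> ~/.openmpi/mca-params.conf
>
> ompi_info --param btl mvapi lists the parameter together with the value
> Open MPI resolves from the environment and parameter files.)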
>
> Note:
> a) echo $PATH returns:
>
> /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
> /opt/compiler/pgi/linux86-64/6.2/bin:
> /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/bin:
> /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/etc:
> /usr/cta/modules/3.1.6/bin:
> /usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
> .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
>
> b) echo $LD_LIBRARY_PATH returns:
> /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
> /opt/compiler/pgi/linux86-64/6.2/lib:
> /opt/compiler/pgi/linux86-64/6.2/libso:
> /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/lib
>
> I've attached the following files:
> 1) Gzipped versions of the .out & .err files of the failed job.
> 2) ompi_info.log: The output of ompi_info -all
> 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files provided
> to me by the ARL consultant. I store these in my home directory and
> experimented with the MCA parameter btl_mvapi_ib_timeout in mpirun.
> 4) The script file script_openmpi that I use to submit the job.
>
> I am unable to provide you with the config.log file as I cannot find it
> in the top level Open MPI directory.
>
> I am also unable to provide you with details on the specific cluster
> that I'm running on, in terms of the network. I know they use
> InfiniBand and some more detail may be found at:
>
> http://www.arl.hpc.mil/Systems/mjm.html
>
> Some other info:
> a) uname -a returns:
> Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
> x86_64 x86_64 x86_64 GNU/Linux
>
> b) ulimit -l returns: unlimited
>
> I cannot see a pattern as to which nodes are bad and which are good ...
>
>
> Note that I found in the mail archives that someone had a similar
> problem in transposing a matrix with 16 million elements. The only
> answer I found in the thread was to increase the value of
> btl_mvapi_ib_timeout to 14 or 16, something I've done already.
>
> I'm hoping that there must be a way out of this problem. I need to
> get my code running as I'm under pressure to produce results for a
> grant that's paying me.
>
> If you have any feedback I would be hugely grateful.
>
> Sincerely,
>
> Peter Diamessis
> Cornell University
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users