Mathieu Gontier wrote:
Dear OpenMPI users
I am dealing with an arithmetic problem. In fact, I have two variants
of my code: one in single precision, one in double precision. When I
compare the two executables built with MPICH, one can observe an
expected difference in performance: 115.7-sec in single precision
against 178.68-sec in double precision (+54%).
The thing is, when I use OpenMPI, the difference is much bigger:
238.5-sec in single precision against 403.19-sec double precision (+69%).
Our experience has already shown that OpenMPI is less efficient than
MPICH on Ethernet with a small number of processes. This explains the
differences between the first set of results with MPICH and the second
set with OpenMPI. (But if someone has more information about that or
even a solution, I am of course interested.)
But using OpenMPI also increases the difference between the two
arithmetics. Is it an accentuation of the OpenMPI+Ethernet loss of
performance, is it another issue in OpenMPI, or is there any option I
can use?
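The four timings quoted above can be compared directly. The short Python sketch below only redoes the arithmetic on the numbers from the message; the function and variable names are illustrative and not part of anyone's code.

```python
# Wall-clock timings in seconds, copied from the message above.
timings = {
    "MPICH": {"single": 115.7, "double": 178.68},
    "OpenMPI": {"single": 238.5, "double": 403.19},
}

def double_precision_overhead(t):
    """Relative cost of double over single precision, in percent."""
    return (t["double"] / t["single"] - 1.0) * 100.0

def slowdown(precision):
    """How many times slower the OpenMPI run was than the MPICH run."""
    return timings["OpenMPI"][precision] / timings["MPICH"][precision]

for name, t in timings.items():
    extra = t["double"] - t["single"]
    print(f"{name}: double costs +{double_precision_overhead(t):.0f}% "
          f"({extra:.2f} s more)")

for precision in ("single", "double"):
    print(f"OpenMPI / MPICH, {precision} precision: {slowdown(precision):.2f}x")
```

This reproduces the +54% and +69% figures, and also shows that the OpenMPI runs took roughly 2.06 times (single) and 2.26 times (double) as long as the MPICH runs.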
It is also unusual that the performance difference between MPICH and
OMPI is so large. You say that OMPI is slower than MPICH even at small
process counts. Can you confirm that this is because MPI calls are
slower? Some of the biggest performance differences I've seen between
MPI implementations had nothing to do with the performance of MPI calls
at all. They had to do with process binding or other factors that
impacted the computational (non-MPI) performance of the code. The
performance of MPI calls was basically irrelevant.
In this particular case, I'm not convinced that applies, since neither OMPI nor MPICH
binds processes by default.
Still, can you do some basic performance profiling to confirm what
aspect of your application is consuming so much time? Is it a
particular MPI call? If your application is spending almost all of its
time in MPI calls, do you have some way of judging whether the faster
performance is acceptable? That is, is 238 secs acceptable and 403 secs
slow? Or, are both timings unacceptable -- e.g., the code "should" be
running in about 30 secs.
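As a rough illustration of the kind of basic profiling meant here, the Python sketch below accumulates wall-clock time separately for computation and for communication. It is only a sketch under stated assumptions: `fake_compute` and `fake_exchange` are hypothetical stand-ins so that it runs without an MPI library. In a real application the same idea would wrap the actual solver work and the actual MPI calls, for instance with `MPI_Wtime` as the timer.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled region.
elapsed = defaultdict(float)

@contextmanager
def region(label):
    """Add the wall-clock time spent inside the with-block to elapsed[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed[label] += time.perf_counter() - start

def fake_compute():
    # Hypothetical stand-in for the application's local arithmetic.
    return sum(i * 0.5 for i in range(200_000))

def fake_exchange():
    # Hypothetical stand-in for a blocking MPI call (assumption: 10 ms each).
    time.sleep(0.01)

for _ in range(5):
    with region("compute"):
        fake_compute()
    with region("mpi"):
        fake_exchange()

total = sum(elapsed.values())
for label, seconds in sorted(elapsed.items()):
    print(f"{label:8s} {seconds:8.4f} s  {100.0 * seconds / total:5.1f}%")
```

If the "mpi" share turns out to be small under both MPI implementations, the slowdown is more likely to come from the computational side than from the MPI calls themselves.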