George Bosilca <bosi...@icl.utk.edu> writes:

>> On Apr 25, 2016, at 11:33 , Dave Love <d.l...@liverpool.ac.uk> wrote:
>>
>> George Bosilca <bosi...@icl.utk.edu> writes:
>>> I have recently reshuffled the tuned module to move all the algorithms
>>> into the base and therefore make them available to other collective
>>> modules (the code is available in master and 1.10, and the future
>>> 2.0). This move has the potential to allow different decision
>>> schemes to coexist, and to be dynamically selected at runtime based on
>>> network properties, network topology, or even application needs. I
>>> continue to hope that network vendors will eventually get
>>> interested in tailoring the collective selection to match their
>>> network capabilities, and provide their users with a performance boost
>>> by allowing for network-specific algorithm selection.
>>
>> That sounds useful, assuming the speed is generally dominated by the
>> basic fabric. What's involved in making the relevant measurements and
>> plugging them in? I did look at using OTPO(?) to check this sort of
>> thing once. I couldn't make it work in the time I had, but Periscope
>> might be a good alternative now.
>
> It is a multidimensional space optimization problem.

Sure, but it's not clear to me that I understand it well enough to
optimize in principle.

> The critical point is identifying the switching points between
> different algorithms based on their performance (taking into account,
> at least, physical topology, number of processes, and amount of data).

Runs of IMB don't necessarily reveal clear switch points (which I could
believe means there's something wrong with them...).

> The paper I sent in one of my previous emails discusses how we did
> the decision functions in the current implementation. There are
> certainly better ways, but the one we took at least did not involve
> any extra software, and was done using simple scripts.

I'd looked at it, but I couldn't see much about doing the measurements.
I thought there was a paper (from UTK?) on the OMPI web site which was
more about that, but I can't find it.
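[As an aside, the "manually finding the switch point" process George
describes can be sketched in a few lines. This is a toy illustration,
not OMPI code: the timing numbers are made up, and in practice they
would come from SKaMPI or IMB runs on the target network, per message
size and process count.]

```python
# Toy sketch of locating a switch point between two collective
# algorithms from per-message-size benchmark timings.

def find_switch_point(sizes, times_a, times_b):
    """Return the first message size at which algorithm B becomes
    faster than algorithm A, or None if A stays faster throughout."""
    for size, ta, tb in zip(sizes, times_a, times_b):
        if tb < ta:
            return size
    return None

# Made-up timings (microseconds) for two hypothetical algorithms
# at a fixed process count; real data would be measured.
sizes     = [1024, 4096, 16384, 65536, 262144]
algo_a    = [12.0, 30.0, 95.0, 400.0, 1800.0]
algo_b    = [25.0, 40.0, 80.0, 250.0, 900.0]

print(find_switch_point(sizes, algo_a, algo_b))  # -> 16384
```

The full problem is then repeating this over process counts and
topologies, which is where the "multidimensional" difficulty comes in.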
>> If it's fairly mechanical -- maybe even if not -- it seems like >> something that should just be done regardless of vendors. I'm sure >> plenty of people could measure QDR fat tree, for a start (at least where >> measurement isn’t frowned upon). > > Based on feedback from the user mailing list, several users did such > optimizations for their specific applications. That sort of thing is mainly what prompted me to ask. (And I see plenty of pretty useless benchmark-type "studies" that make more-or-less absolute statements about MPIs' relative speed without even saying what parameters were used.) One thing I don't know is whether this is likely to be significantly application specific, as I've seen suggested. Presumably there's m(va)pich work on this that might be useful if they use the same algorithms, but I couldn't find a relevant write-up. > This makes the > optimization problem much simpler, as some of the parameters have > discrete values (message size). If we assume a symmetric network, and > have a small number of message sizes of interest, it is enough to run > few benchmarks (skampi, to the IMB test on the collective of > interest), and manually finding the switch point is a relatively > simple process. I've looked at alltoallv, which is important for typical chemistry codes whose users have an insatiable appetite for cycles. To start with it's not clear how useful IMB is as it's not exercising the "v". Then for low-ish process counts I've seen the relative speed of the two algorithms all over the place. However, 2 appears best overall, but when I profiled an application, I got ~30% speedup by switching to 1. To a hard-bitten experimentalist, this just suggests too little understanding to make useful measurements, and that it would be useful to have a good review of the issues -- presumably for current sorts of interconnect. Does one exist?
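[For anyone wanting to reproduce the alltoallv comparison above: the
algorithm override goes through the tuned component's MCA parameters.
A sketch follows; the parameter names are as I understand them for
OMPI 1.10/2.x, and the authoritative list (and the meaning of each
algorithm number) should be checked with ompi_info on the installed
version.]

```shell
# Show the tuned component's parameters, including the per-collective
# algorithm choices and what each number means.
ompi_info --param coll tuned --level 9

# Override the decision function and force alltoallv algorithm 1;
# dynamic rules must be enabled for the forced value to take effect.
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_alltoallv_algorithm 1 \
       ./my_app          # my_app is a placeholder for the benchmark/application
```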