George Bosilca <bosi...@icl.utk.edu> writes:

>> On Apr 25, 2016, at 11:33 , Dave Love <d.l...@liverpool.ac.uk> wrote:
>>
>> George Bosilca <bosi...@icl.utk.edu> writes:
>>> I have recently reshuffled the tuned module to move all the algorithms
>>> into the base and therefore make them available to other collective
>>> modules (the code is available in master and 1.10, and the future
>>> 2.0). This move has the potential to allow different decision
>>> schemes to coexist, and to be dynamically selected at runtime based on
>>> network properties, network topology, or even application needs. I
>>> continue to hope that network vendors will eventually get
>>> interested in tailoring the collective selection to match their
>>> network capabilities, and provide their users with a performance boost
>>> by allowing for network-specific algorithm selection.
>>
>> That sounds useful, assuming the speed is generally dominated by the
>> basic fabric. What's involved in making the relevant measurements and
>> plugging them in? I did look at using OTPO(?) to check this sort of
>> thing once. I couldn't make it work in the time I had, but Periscope
>> might be a good alternative now.
>
> It is a multidimensional space optimization problem.

Sure, but it's not clear to me that I understand it well enough to
optimize in principle.

> The critical point is identifying the switching points between
> different algorithms based on their performance (taking into account,
> at least, physical topology, number of processes, and amount of data).

Runs of IMB don't necessarily reveal clear switch points (which I could
believe means there's something wrong with them...).

> The paper I sent in one of my previous emails discusses how we did
> the decision functions in the current implementation. There are
> certainly better ways, but the one we took at least did not involve
> any extra software, and was done using simple scripts.

I'd looked at it, but I couldn't see much about doing the measurements.
I thought there was a paper (from UTK?) on the OMPI web site which was
more about that, but I can't find it.
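[As an aside, the "manually finding the switch point" process George
describes can be sketched in a few lines. This is a toy illustration,
not OMPI code: the timing numbers are made up, and in practice they
would come from SKaMPI or IMB runs on the target network, per message
size and process count.]

```python
# Toy sketch of locating a switch point between two collective
# algorithms from per-message-size benchmark timings.

def find_switch_point(sizes, times_a, times_b):
    """Return the first message size at which algorithm B becomes
    faster than algorithm A, or None if A stays faster throughout."""
    for size, ta, tb in zip(sizes, times_a, times_b):
        if tb < ta:
            return size
    return None

# Made-up timings (microseconds) for two hypothetical algorithms
# at a fixed process count; real data would be measured.
sizes     = [1024, 4096, 16384, 65536, 262144]
algo_a    = [12.0, 30.0, 95.0, 400.0, 1800.0]
algo_b    = [25.0, 40.0, 80.0, 250.0, 900.0]

print(find_switch_point(sizes, algo_a, algo_b))  # -> 16384
```

The full problem is then repeating this over process counts and
topologies, which is where the "multidimensional" difficulty comes in.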
>> If it's fairly mechanical -- maybe even if not -- it seems like >> something that should just be done regardless of vendors. I'm sure >> plenty of people could measure QDR fat tree, for a start (at least where >> measurement isn’t frowned upon). > > Based on feedback from the user mailing list, several users did such > optimizations for their specific applications. That sort of thing is mainly what prompted me to ask. (And I see plenty of pretty useless benchmark-type "studies" that make more-or-less absolute statements about MPIs' relative speed without even saying what parameters were used.) One thing I don't know is whether this is likely to be significantly application specific, as I've seen suggested. Presumably there's m(va)pich work on this that might be useful if they use the same algorithms, but I couldn't find a relevant write-up. > This makes the > optimization problem much simpler, as some of the parameters have > discrete values (message size). If we assume a symmetric network, and > have a small number of message sizes of interest, it is enough to run > few benchmarks (skampi, to the IMB test on the collective of > interest), and manually finding the switch point is a relatively > simple process. I've looked at alltoallv, which is important for typical chemistry codes whose users have an insatiable appetite for cycles. To start with it's not clear how useful IMB is as it's not exercising the "v". Then for low-ish process counts I've seen the relative speed of the two algorithms all over the place. However, 2 appears best overall, but when I profiled an application, I got ~30% speedup by switching to 1. To a hard-bitten experimentalist, this just suggests too little understanding to make useful measurements, and that it would be useful to have a good review of the issues -- presumably for current sorts of interconnect. Does one exist?
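[For anyone wanting to reproduce the alltoallv comparison above: the
algorithm override goes through the tuned component's MCA parameters.
A sketch follows; the parameter names are as I understand them for
OMPI 1.10/2.x, and the authoritative list (and the meaning of each
algorithm number) should be checked with ompi_info on the installed
version.]

```shell
# Show the tuned component's parameters, including the per-collective
# algorithm choices and what each number means.
ompi_info --param coll tuned --level 9

# Override the decision function and force alltoallv algorithm 1;
# dynamic rules must be enabled for the forced value to take effect.
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_alltoallv_algorithm 1 \
       ./my_app          # my_app is a placeholder for the benchmark/application
```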