Eric,

A tree-based implementation for a gather is not that critical, because the root will always have to gather all of the data, so from its perspective going from a star to a tree-based topology is basically exchanging latencies for bandwidth (a little bit more complicated in practice). In fact, in terms of the amount of data injected into the network, the star topology is optimal for this particular collective.
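For concreteness, here is a minimal sketch of such a hand-rolled linear (star) gather, not taken from Eric's actual code: the root posts one MPI_Irecv per peer and uses MPI_Waitany to handle each contribution as soon as it arrives. The tag value and the process_chunk() hook are hypothetical placeholders.

/* Minimal sketch of a hand-rolled linear (star) gather: every rank sends
 * its chunk to the root; the root posts one MPI_Irecv per peer and handles
 * each contribution as soon as it completes via MPI_Waitany.
 * process_chunk() and GATHER_TAG are hypothetical placeholders. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define GATHER_TAG 42  /* arbitrary, assumed unused elsewhere */

void linear_gather(double *chunk, int chunk_len, double *all, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != 0) {
        MPI_Send(chunk, chunk_len, MPI_DOUBLE, 0, GATHER_TAG, comm);
        return;
    }

    /* Root: its own piece is already local. */
    memcpy(all, chunk, chunk_len * sizeof(double));

    MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
    for (int src = 1; src < size; ++src)
        MPI_Irecv(all + src * chunk_len, chunk_len, MPI_DOUBLE,
                  src, GATHER_TAG, comm, &reqs[src - 1]);

    for (int done = 0; done < size - 1; ++done) {
        int idx;
        MPI_Waitany(size - 1, reqs, &idx, MPI_STATUS_IGNORE);
        /* The piece from rank idx + 1 is complete: work on it while the
         * remaining messages are still in flight, e.g.
         * process_chunk(all + (idx + 1) * chunk_len); */
    }
    free(reqs);
}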
Moreover, with your homemade implementation you have the opportunity to execute something after each successful receive, which might improve your communication/computation overlap and have an impact on overall performance. I would carefully benchmark both approaches on some concrete example (taking into account the size of the communicator and the size of the data being transmitted). More inline.

> On Oct 22, 2015, at 12:25 , Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
>
> Hi Gilles and Josh,
>
> I think my reply applies to both of your answers, which I thank you for.
>
> On 21/10/15 08:31 PM, Gilles Gouaillardet wrote:
>> Eric,
>>
>> #2 maybe not ...
>> a tree based approach has O(log(n)) scaling
>> (compared to O(n) scaling with your linear method),
>> so at scale, MPI_Igather will hopefully scale better (and especially if
>> you are transmitting small messages)
>
> I see. Now, please don't blame me for not reading the MPI standard, but: is
> it or should it be guaranteed by the standard? If not, isn't it repetitive
> work for all MPI users to re-implement a (debugged) tree based approach in
> all their code?
>
> In other words, if everybody knows that to scale well you have to program a
> tree based approach for the communications, why isn't it in the standard?

The MPI standard defines an API and the outcome of each function, without dictating the way they are implemented in the different MPI implementations. Thus, the standard defines what a gather operation is, one process gathering data from all the others, without imposing how this goal should be reached (such as internal copies on intermediary nodes of a tree structure).

>> #3 difficult question ...
>> first, keep in mind there is currently no progress thread in Open MPI.
>> that means messages can be received only when MPI_Wait* or MPI_Test* is
>> invoked. you might hope messages are received when doing some
>
> ok! So it may be different with another MPI…?

Only the API is the same; each MPI has different perks and drawbacks. Moreover, each MPI implements the collective communications differently.

>> computation (overlap of computation and communication) but unfortunately,
>> that does not happen most of the time.
>>
>> linear gather does not scale well (see my previous comment), plus Open MPI
>> might malloc some space "under the hood", so MPI_Igather will hopefully
>> scale better.
>
> That is something I was asking myself about… will I over-allocate memory
> with all our MPI_Isend/Irecv...?

Only if you generate unexpected messages (the send has been posted, but the matching local receive has not). This is the most memory-conservative approach.

> A simple test with the code I sent in the first mail shows a small extra
> use of memory, but I didn't go very far with the test yet...
>
>> is there any hard reason why you are using non-blocking collectives?
>
> No. Our home-made non-blocking collective is just an initial design that is
> still used in our code, but I want a sufficient number of good reasons to
> change it or not, to non-blocking or blocking collective calls…

1. You should expect portability, and maybe even some performance portability.
2. Someone else has already debugged it.
3. It's trendy to say that you use igather.

  George.

>> if your application is known to be highly asynchronous and some message
>> might arrive (way) later than others, and computation is quite
>> expensive, then your approach might be a good fit.
>> if your application is pretty synchronous, and the cost of the computation
>> that might overlap with communication is not significant, your approach
>> might have little benefit and poor scalability, so MPI_Gather (not
>> MPI_Igather, since you might have no computation that could overlap with
>> communication) might be a better choice.
>>
>
> Good question. It is a finite element code. Some work may be highly
> asynchronous, but other work not...
>
> Do you have any suggestion for some good further reading about all this
> matter?
>
> Thanks,
>
> Eric
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/10/27917.php
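To make the trade-off Gilles describes above concrete, here is a minimal sketch of the MPI_Igather alternative: the collective is posted, independent computation runs, and MPI_Wait completes it. do_independent_work() and the buffer names are hypothetical placeholders; the blocking MPI_Gather form is shown in a comment for the case where there is nothing to overlap.

/* Minimal sketch of the non-blocking collective alternative: post
 * MPI_Igather, do computation that touches neither buffer, then complete
 * the operation with MPI_Wait.  do_independent_work() is a hypothetical
 * placeholder. */
#include <mpi.h>

void igather_with_overlap(double *chunk, int chunk_len, double *all,
                          MPI_Comm comm)
{
    MPI_Request req;
    MPI_Igather(chunk, chunk_len, MPI_DOUBLE,
                all, chunk_len, MPI_DOUBLE, 0, comm, &req);

    /* Independent computation may run here; note that without a progress
     * thread, progress on the collective is typically made only inside
     * MPI calls (MPI_Test*/MPI_Wait*).
     * do_independent_work(); */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* If there is no computation to overlap, the blocking form is simpler:
     * MPI_Gather(chunk, chunk_len, MPI_DOUBLE,
     *            all, chunk_len, MPI_DOUBLE, 0, comm); */
}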