Hi Jeff,

Thank you very much for your reply! I am now using a cluster at my university (https://www.arc.vt.edu/computing/newriver/). I cannot find any information about the use of Unified Communications X (UCX) there, so I would guess the cluster does not use it (though I am not certain).

I called MPI_Test at several places in my code where the communication was supposed to have finished, but the communication did not actually complete until the code reached MPI_WAITALL. I learned this from the NVIDIA profiler: the profile showed that the GPU kernel launched right after MPI_WAITALL only started once the CPUs had finished the communication, even though there was enough time for the CPUs to finish that work in the background before MPI_WAITALL. If the communication overhead is not hidden, then writing the code in an overlapping way gains nothing.

I am wondering whether the Open MPI on the cluster was compiled with asynchronous progression enabled, since "ompi_info" reports "OMPI progress: no, ORTE progress: yes". I do not know the difference between "OMPI progress" and "ORTE progress", as I am not a CS person. I am also wondering whether MVAPICH2 is worth trying, since it provides an environment variable to control the progression of operations, which would be easier.

I would greatly appreciate your help!
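For reference, the pattern I am describing is roughly the following sketch. It assumes a working MPI environment; compute_chunk(), nchunks, and the buffer arguments are placeholders standing in for my application's GPU work, not the actual code. The idea is that, without an asynchronous progress thread, each MPI_Testall call is the only opportunity the MPI library gets to advance the transfer:

```c
/* Sketch: overlapping a non-blocking exchange with computation by
 * polling MPI_Testall. Placeholder names (compute_chunk, nchunks)
 * are illustrative, not from the real code. */
#include <mpi.h>

void overlap_exchange(double *sendbuf, double *recvbuf, int count,
                      int peer, MPI_Comm comm, int nchunks)
{
    MPI_Request reqs[2];

    /* Post the exchange before starting the computation. */
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    for (int i = 0; i < nchunks; i++) {
        /* compute_chunk(i);  -- placeholder for a GPU kernel launch */

        /* Without hardware/OS-assisted asynchronous progress, this
         * call is what actually drives the transfer forward. */
        int flag;
        MPI_Testall(2, reqs, &flag, MPI_STATUSES_IGNORE);
    }

    /* Block only for whatever has not completed by now. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

What I observe with the profiler is that, despite the MPI_Testall calls inside the loop, the data does not move until the final MPI_Waitall.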
Best Regards,
Weicheng Xue

On Fri, Nov 16, 2018 at 5:37 PM Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote:

> On Nov 13, 2018, at 8:52 PM, Weicheng Xue <weic...@vt.edu> wrote:
> >
> > I am a student whose research work includes using MPI and OpenACC to accelerate our in-house research CFD code on multiple GPUs. I am having a big issue related to the "progression of operations in MPI" and am thinking your inputs can be very helpful.
>
> Someone asked me about an Open MPI + OpenACC issue this past week at the Supercomputing trade show.
>
> I'm not sure if anyone in the Open MPI development community is testing with Open MPI + OpenACC. I don't know much about it -- I would *hope* that it "just works", but I don't know that for sure.
>
> > I am now testing the performance of overlapping communication and computation for a code. Communication exists between hosts (CPUs) and computations are done on devices (GPUs). However, in my case, the actual communication always starts when the computations finish. Therefore, even though I wrote my code in an overlapping way, there is no overlapping because of the OpenMPI not supporting asynchronous progression. I found that MPI often does progress (i.e. actually send or receive the data) only if I am blocking in a call to MPI_Wait (Then no overlapping occurs at all). My purpose is to use overlapping to hide communication latency and thus improve the performance of my code. Is there a way you can suggest to me? Thank you very much!
>
> Nearly all transports in Open MPI support asynchronous progress -- but only some of them offer hardware- and/or OS-assisted asynchronous progress (which is probably what you are assuming). Specifically: I'm quibbling with your choice of wording, but the end effect you are observing is likely a) correct, and b) dependent upon the network transport that you are using.
>
> > I am now using PGI/17.5 compiler and openmpi/2.0.0. A 100 Gbps EDR-Infiniband is used for MPI traffic. If I use "ompi_info", then info about the thread support is "Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)".
>
> That's a little surprising -- IB should be one of the transports that actually supports asynchronous progress.
>
> Are you using UCX for the IB transport?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users