Hi Jeff,

Thank you very much for providing these useful suggestions! I may try MVAPICH2 first. In my case, I transferred different data two times; each transfer was 3.146 MB. I also tested problems of different sizes, and none of them worked as expected.
Best Regards,
Weicheng Xue

On Tue, Nov 27, 2018 at 6:32 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Sorry for the delay in replying; the SC'18 show and then the US Thanksgiving holiday got in the way. More below.
>
> > On Nov 16, 2018, at 10:50 PM, Weicheng Xue <weic...@vt.edu> wrote:
> >
> > Hi Jeff,
> >
> > Thank you very much for your reply! I am now using a cluster at my university (https://www.arc.vt.edu/computing/newriver/). I cannot find any info about the use of Unified Communications X (or UCX) there, so I would guess the cluster does not use it (not exactly sure though).
>
> You might want to try compiling UCX yourself (it's just a user-level library -- it can even be installed under your $HOME) and then try compiling Open MPI against it and using that. Make sure to configure/compile UCX with CUDA support -- I believe you need a very recent version of UCX for that.
>
> > Actually, I called MPI_Test functions at several places in my code where the communication activity was supposed to finish, but communication did not finish until the code finally called MPI_WAITALL.
>
> You might want to test calling MPI_TEST many times in a loop, just to see what is happening.
>
> Specifically: in Open MPI (and probably in other MPI implementations), MPI_TEST dips into the MPI progression engine (essentially) once, whereas MPI_WAIT dips into the MPI progression engine as many times as necessary in order to complete the request(s). So it's just a difference of looping.
>
> How large is the message you're sending?
>
> > I got to know this by using the Nvidia profiler (the profiling result showed that the kernel on the GPUs right after MPI_WAITALL only started after the CPUs finished communication; however, there is enough time for the CPUs to finish this task in the background before MPI_WAITALL). If the communication overhead is not hidden, then it does not make any sense to write the code in an overlapping way.
> > I am wondering whether the Open MPI on the cluster was compiled with asynchronous progression enabled, as "OMPI progress: no, ORTE progress: yes" is what "ompi_info" reports. I really do not know the difference between "OMPI progress" and "ORTE progress" as I am not a CS guy.
>
> I applaud your initiative to find that phrase in the ompi_info output!
>
> However, don't get caught up in it -- that phrase isn't specifically oriented to the exact issue you're discussing here (for lack of a longer explanation).
>
> > Also, I am wondering whether MVAPICH2 is worthwhile to try, as it provides an environment variable to control the progression of operations, which is easier. I would greatly appreciate your help!
>
> Sure, try MVAPICH2 -- that's kinda the strength of the MPI ecosystem (that there are multiple different MPI implementations to try).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
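[Editor's note: the looping-MPI_TEST pattern Jeff describes above can be sketched as follows. This is a minimal C sketch, not the poster's actual code: the ring-exchange pattern, message size, and the placeholder "useful work" slot are all illustrative assumptions.]

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Nonblocking ring exchange, polling MPI_Testall in a loop so that each
 * call "dips into" the MPI progression engine once, while independent
 * computation can be interleaved between dips.  MPI_Waitall would do
 * the same looping internally, but without the chance to compute. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;               /* placeholder message size */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        sendbuf[i] = (double)rank;

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    int done = 0;
    while (!done) {
        /* ... a slice of independent computation would go here ... */
        MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE); /* one progress dip */
    }

    printf("rank %d: exchange complete\n", rank);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Polling like this manually drives progression even when the MPI library has no asynchronous progress thread; whether the transfer truly overlaps with GPU work still depends on the transport (e.g. a CUDA-aware UCX) being able to move the data without blocking the host.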
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
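[Editor's note: a rough sketch of Jeff's suggestion to build UCX with CUDA support under $HOME and point Open MPI at it. Version numbers, install prefixes, and the CUDA path are illustrative assumptions; check the current releases and your site's module paths.]

```shell
# Build UCX with CUDA support into $HOME/ucx (X.Y.Z and paths are placeholders)
tar xzf ucx-X.Y.Z.tar.gz && cd ucx-X.Y.Z
./contrib/configure-release --prefix=$HOME/ucx --with-cuda=/usr/local/cuda
make -j8 && make install
cd ..

# Build Open MPI against that UCX, also with CUDA support
tar xzf openmpi-X.Y.Z.tar.gz && cd openmpi-X.Y.Z
./configure --prefix=$HOME/ompi --with-ucx=$HOME/ucx --with-cuda=/usr/local/cuda
make -j8 && make install

# Put the new build first on the search paths
export PATH=$HOME/ompi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/ompi/lib:$HOME/ucx/lib:$LD_LIBRARY_PATH
```

At run time, `mpirun --mca pml ucx ...` selects the UCX point-to-point layer explicitly rather than relying on auto-selection.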