Sorry for the delay in replying; the SC'18 show and then the US Thanksgiving holiday got in the way. More below.
> On Nov 16, 2018, at 10:50 PM, Weicheng Xue <weic...@vt.edu> wrote:
>
> Hi Jeff,
>
> Thank you very much for your reply! I am now using a cluster at my
> university (https://www.arc.vt.edu/computing/newriver/). I cannot find any
> info about the use of Unified Communications X (UCX) there, so I would
> guess the cluster does not use it (though I'm not exactly sure).

You might want to try compiling UCX yourself (it's just a user-level library -- it can even be installed under your $HOME) and then compiling Open MPI against it and using that. Make sure to configure/compile UCX with CUDA support -- I believe you need a very recent version of UCX for that.
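Roughly something like this -- a sketch, not an exact recipe: the version numbers, the install prefixes, and the /usr/local/cuda path are placeholders for your system, and you should double-check the CUDA-related configure options against the README of the UCX / Open MPI versions you download:

    # Build UCX under $HOME with CUDA support
    cd ucx-<version>
    ./configure --prefix=$HOME/ucx --with-cuda=/usr/local/cuda
    make -j 8 install

    # Build Open MPI against that UCX (and with CUDA support)
    cd openmpi-<version>
    ./configure --prefix=$HOME/openmpi --with-ucx=$HOME/ucx \
        --with-cuda=/usr/local/cuda
    make -j 8 install

You may also need to explicitly ask for the UCX PML at run time (e.g., "mpirun --mca pml ucx ..."), and make sure $HOME/openmpi/bin is first in your PATH so you pick up the right mpicc/mpirun.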
> Actually, I called MPI_Test functions at several places in my code where the
> communication activity was supposed to finish, but communication did not
> finish until the code finally called MPI_WAITALL.

You might want to try calling MPI_TEST in a loop many times, just to see what happens. Specifically: in Open MPI (and probably in other MPI implementations), MPI_TEST dips into the MPI progression engine (essentially) once per call, whereas MPI_WAIT dips into the progression engine as many times as necessary to complete the request(s). So it's really just a difference of looping.

How large are the messages you're sending?
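For example, something like this (a minimal sketch in C; progress_until_done and do_some_useful_work are hypothetical names for illustration, not anything from your code):

    #include <mpi.h>

    /* Hypothetical stand-in for whatever computation you can overlap. */
    static void do_some_useful_work(void) { /* ... */ }

    /* Instead of a single MPI_TEST, keep testing in a loop: every call
       gives the MPI progression engine another chance to move data. */
    static void progress_until_done(int n, MPI_Request reqs[])
    {
        int done = 0;
        while (!done) {
            MPI_Testall(n, reqs, &done, MPI_STATUSES_IGNORE);
            if (!done) {
                do_some_useful_work();
            }
        }
    }

If looping like this completes your requests well before the final MPI_WAITALL does in your current code, that tells you the problem is just that the progression engine isn't being given enough chances to run, not that the data isn't ready.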
> I got to know this by using the Nvidia profiler (the profiling result showed
> that the kernel on the GPUs right after MPI_WAITALL only started after the
> CPUs finished the communication, even though there was enough time for the
> CPUs to finish this task in the background before MPI_WAITALL). If the
> communication overhead is not hidden, then it does not make any sense to
> write the code in an overlapping way. I am wondering whether the Open MPI on
> the cluster was compiled with asynchronous progression enabled, as "OMPI
> progress: no, ORTE progress: yes" is what "ompi_info" reports. I really do
> not know the difference between "OMPI progress" and "ORTE progress", as I am
> not a CS guy.

I applaud your initiative in finding that phrase in the ompi_info output! However, don't get caught up in it -- that phrase isn't specifically oriented to the exact issue you're discussing here (for lack of a longer explanation).

> Also, I am wondering whether MVAPICH2 is worth trying, as it provides an
> environment variable to control the progression of operations, which is
> easier. I would greatly appreciate your help!

Sure, try MVAPICH2 -- that's kinda the strength of the MPI ecosystem: there are multiple different MPI implementations to try.

-- 
Jeff Squyres
jsquy...@cisco.com