Hi Jeff,

     Thank you very much for providing these useful suggestions! I may try
MVAPICH2 first. In my case, I transfer two different sets of data, each
3.146 MB in size. I also tested problems of different sizes, and none of
them overlapped as expected.
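
For reference, the communication/computation pattern I am trying to overlap
looks roughly like the sketch below (simplified C, not my actual code; the
do_*_work() routines and the buffer/peer names are just placeholders):

    #include <mpi.h>

    /* Placeholder routines standing in for my real GPU work. */
    void do_independent_gpu_work(void) { /* work that does not need recvbuf */ }
    void do_dependent_gpu_work(void)   { /* work that needs recvbuf */ }

    /* Post the two non-blocking transfers (~3.146 MB each), try to overlap
     * them with independent work, then wait for them to complete. */
    void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                              int peer, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

        do_independent_gpu_work();

        /* According to the profiler, the transfers only make progress here,
         * not during do_independent_gpu_work() as I had hoped. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        do_dependent_gpu_work();
    }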

Best Regards,

Weicheng Xue

On Tue, Nov 27, 2018 at 6:32 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Sorry for the delay in replying; the SC'18 show and then the US
> Thanksgiving holiday got in the way.  More below.
>
>
>
> > On Nov 16, 2018, at 10:50 PM, Weicheng Xue <weic...@vt.edu> wrote:
> >
> > Hi Jeff,
> >
> >      Thank you very much for your reply! I am now using a cluster at my
> > university (https://www.arc.vt.edu/computing/newriver/). I cannot find
> > any info about the use of Unified Communications X (UCX) there, so I
> > would guess the cluster does not use it (though I am not exactly sure).
>
> You might want to try compiling UCX yourself (it's just a user-level
> library -- it can even be installed under your $HOME) and then try
> compiling Open MPI against it and using that.  Make sure to
> configure/compile UCX with CUDA support -- I believe you need a very recent
> version of UCX for that.
>
> > Actually, I called MPI_Test functions at several places in my code where
> > the communication activity was supposed to finish, but communication did
> > not finish until the code finally called MPI_WAITALL.
>
> You might want to try calling MPI_TEST in a loop many times, just to see
> what is happening.
>
> Specifically: in Open MPI (and probably in other MPI implementations),
> MPI_TEST dips into the MPI progression engine (essentially) once, whereas
> MPI_WAIT dips into the MPI progression engine as many times as necessary in
> order to complete the request(s).  So it's just a difference of looping.
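>
> Something like the following (an untested sketch; "reqs" and "n" stand for
> whatever request array your code already passes to MPI_WAITALL) will show
> you how many test calls it takes for the requests to complete:
>
>     #include <mpi.h>
>
>     /* Poll the requests with MPI_Testall and count how many calls it
>      * takes before they all complete. */
>     static int poll_until_done(int n, MPI_Request reqs[])
>     {
>         int done = 0, ntests = 0;
>         while (!done) {
>             MPI_Testall(n, reqs, &done, MPI_STATUSES_IGNORE);
>             ++ntests;
>         }
>         /* A count of 1 means the requests were already complete; a large
>          * count suggests progress only happens while you call into MPI. */
>         return ntests;
>     }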
>
> How large is the message you're sending?
>
> > I got to know this by using the NVIDIA profiler: the profile showed that
> > the GPU kernel right after MPI_WAITALL only started after the CPUs had
> > finished the communication, even though there is enough time for the CPUs
> > to finish that communication in the background before MPI_WAITALL. If the
> > communication overhead is not hidden, then it does not make any sense to
> > write the code in an overlapping way. I am wondering whether the Open MPI
> > on the cluster was compiled with asynchronous progression enabled, as
> > "OMPI progress: no, ORTE progress: yes" is what "ompi_info" reports. I
> > really do not know the difference between "OMPI progress" and "ORTE
> > progress", as I am not a CS guy.
>
> I applaud your initiative to find that phrase in the ompi_info output!
>
> However, don't get caught up in it -- that phrase isn't specifically
> oriented to the exact issue you're discussing here (for lack of a longer
> explanation).
>
> > Also, I am wondering whether MVAPICH2 is worth trying, as it provides an
> > environment variable to control the progression of operations, which is
> > easier. I would greatly appreciate your help!
>
> Sure, try MVAPICH2 -- that's kinda the strength of the MPI ecosystem (that
> there are multiple different MPI implementations to try).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>