Hi, I'm running into a problem with OpenMPI 1.5.1 on a GPU cluster. My program uses MPI to exchange data between nodes, and cudaMemcpyAsync to move data between the host and the GPU within a node. When the MPI message size is less than 1MB everything works fine, but when the message size is larger than 1MB the program hangs: according to my trace, an MPI send never reaches its destination.
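
Below is a stripped-down sketch of the communication pattern (ranks, sizes, and buffer names are simplified for illustration; the host buffers are page-locked via cudaMallocHost, which I suspect matters here):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        size_t msg_size = 2 * 1024 * 1024;   /* > 1MB: the size at which it hangs */
        void *host_buf = NULL, *dev_buf = NULL;
        cudaStream_t stream;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMallocHost(&host_buf, msg_size);   /* page-locked host staging buffer */
        cudaMalloc(&dev_buf, msg_size);
        cudaStreamCreate(&stream);

        if (rank == 0) {
            /* stage data from the GPU into the pinned host buffer, then send it */
            cudaMemcpyAsync(host_buf, dev_buf, msg_size,
                            cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);
            MPI_Send(host_buf, (int)msg_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* with msg_size > 1MB this receive never completes */
            MPI_Recv(host_buf, (int)msg_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpyAsync(dev_buf, host_buf, msg_size,
                            cudaMemcpyHostToDevice, stream);
            cudaStreamSynchronize(stream);
        }

        cudaStreamDestroy(stream);
        cudaFreeHost(host_buf);
        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
    }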
I suspect the issue is related to contention for page-locked (pinned) memory between OpenMPI and CUDA. Has anyone run into this before, or does anyone know how to solve it? Which MCA parameters should I tune so that messages larger than 1MB get through without hanging?
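
To be concrete about what I mean, I know I can override MCA parameters on the mpirun command line, as in the example below, but I don't know which parameters are relevant here (mpi_leave_pinned is shown only to illustrate the syntax, and the process count is made up):

    mpirun -np 16 --mca mpi_leave_pinned 0 ./my_mpi_cuda_app

Any help would be appreciated. Thanks, Fengguang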