On 05/06/2011 00:15, Fengguang Song wrote:
> Hi,
>
> I'm confronting a problem when using OpenMPI 1.5.1 on a GPU cluster. My
> program uses MPI to exchange data between nodes, and uses cudaMemcpyAsync
> to exchange data between the host and GPU devices within a node.
> When the MPI message size is less than 1MB, everything works fine. However,
> when the message size is > 1MB, the program hangs (i.e., an MPI send never
> reaches its destination, based on my trace).
>
> The issue may be related to locked-memory contention between OpenMPI and
> CUDA. Does anyone have experience solving this problem? Which MCA
> parameters should I tune to allow message sizes > 1MB (to avoid the hang)?
> Any help would be appreciated.
>
> Thanks,
> Fengguang
Hello,

I may have seen the same problem when testing GPUDirect. Do you use the same
host buffer for copying from/to the GPU and for sending/receiving on the
network? If so, you need a GPUDirect-enabled kernel and Mellanox drivers, but
that only helps below 1MB. You can work around the problem with one of the
following solutions:
* add --mca btl_openib_flags 304 to force OMPI to always send/receive through
  an intermediate (internal) buffer, but this will decrease performance below
  1MB too
* use different host buffers for the GPU and the network and manually copy
  between them (see the PS below for a sketch)

I never got any reply from NVIDIA/Mellanox/here when I reported this problem
with GPUDirect and messages larger than 1MB:
http://www.open-mpi.org/community/lists/users/2011/03/15823.php

Brice
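PS: For the first workaround, the flag goes on the mpirun command line, e.g.:

    mpirun --mca btl_openib_flags 304 ./your_app

For the second workaround, below is a minimal sketch, not from the original
post: the message size, buffer names, and the two-rank send/recv pattern are
illustrative assumptions. The idea is to stage GPU data through a pinned
buffer used only with CUDA, then memcpy it into a separate plain host buffer
that is the only one MPI ever sees.

/* Sketch: separate host buffers for CUDA and MPI.
 * Assumptions: 4MB message, two ranks, rank 0 sends to rank 1. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <string.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)    /* > 1MB, where the hang shows up */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *d_buf;                        /* device buffer */
    void *h_staging;                    /* pinned buffer, touched only by CUDA */
    void *h_mpi = malloc(MSG_BYTES);    /* plain buffer, touched only by MPI */

    cudaMalloc(&d_buf, MSG_BYTES);
    cudaMallocHost(&h_staging, MSG_BYTES);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    if (rank == 0) {
        /* GPU -> pinned staging buffer */
        cudaMemcpyAsync(h_staging, d_buf, MSG_BYTES,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        /* staging -> separate MPI buffer, so the network stack never
         * works on memory that CUDA has already pinned */
        memcpy(h_mpi, h_staging, MSG_BYTES);
        MPI_Send(h_mpi, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_mpi, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        memcpy(h_staging, h_mpi, MSG_BYTES);
        /* pinned staging buffer -> GPU */
        cudaMemcpyAsync(d_buf, h_staging, MSG_BYTES,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_staging);
    cudaFree(d_buf);
    free(h_mpi);
    MPI_Finalize();
    return 0;
}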