FYI for others who have run into the same problem, see https://github.com/openucx/ucx/issues/3359. In short:

1. Use UCX 1.5 rather than 1.4 (I recommend updating https://www.open-mpi.org/faq/?category=buildcuda accordingly).
2. Dynamically link the cudart library (by default nvcc links it statically; see the commands below). Future UCX versions will fix a lingering bug that currently makes this necessary.
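For anyone who wants the concrete flag for point 2, something like the following should do it. The key piece is --cudart=shared; the file name and the rest of the command line are just illustrative (your build will need its own MPI include/library paths):

    # Ask nvcc for the shared CUDA runtime instead of its static default:
    nvcc --std=c++11 --cudart=shared -o mpi_test_ialltoall mpi_test_ialltoall.cu
    # Confirm libcudart.so now shows up as a shared dependency:
    ldd mpi_test_ialltoall | grep cudart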
With these changes, I was able to successfully run my application.

On Sun, Mar 3, 2019 at 9:49 AM Adam Sylvester <op8...@gmail.com> wrote:

> I'm running OpenMPI 4.0.0 built with gdrcopy 1.3 and UCX 1.4 per the
> instructions at https://www.open-mpi.org/faq/?category=buildcuda, built
> against CUDA 10.0 on RHEL 7. I'm running on a p2.xlarge instance in AWS
> (single NVIDIA K80 GPU). OpenMPI reports CUDA support:
>
> $ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
> mca:mpi:base:param:mpi_built_with_cuda_support:value:true
>
> I'm attempting to use MPI_Ialltoall() to overlap a block of GPU
> computations with network transfers, using MPI_Test() to nudge async
> transfers along. Based on Table 5 in
> https://www.open-mpi.org/faq/?category=runcuda, MPI_Ialltoall() should be
> supported (though I don't see MPI_Test() listed as either supported or
> unsupported; my example crashes with or without it). The behavior I'm
> seeing is that with a small number of elements, everything runs without
> issue. However, with a larger number of elements (where "large" is just a
> few hundred), I start to get errors like:
>
> cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 16000, error message Bad address
>
> Changing to a synchronous MPI_Alltoall() results in the program running
> successfully.
>
> I tried boiling my issue down to the simplest program that reproduces the
> crash. Note that it needs to be compiled with "--std=c++11". Running
> "mpirun -np 2 mpi_test_ialltoall 200 256 10" runs successfully; changing
> the 200 to a 400 results in a crash after a few blocks. Thanks for any
> thoughts.
>
> Code sample:
> https://gist.github.com/asylvest/7c9d5c15a3a044a0a2338cf9c828d2c3
>
> -Adam
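For context, the overlap pattern in question looks roughly like this. It's a minimal sketch, not Adam's actual gist; a CUDA-aware MPI is assumed, error checking is omitted, and the element count is illustrative (chosen to land in the "few hundred" regime where the UCX error appeared):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int numRanks = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &numRanks);

        // Device buffers sized for 'count' elements per rank pair
        const int count = 400;
        float* sendBuf = nullptr;
        float* recvBuf = nullptr;
        cudaMalloc(&sendBuf, numRanks * count * sizeof(float));
        cudaMalloc(&recvBuf, numRanks * count * sizeof(float));

        // Start the asynchronous all-to-all directly on device memory...
        MPI_Request request;
        MPI_Ialltoall(sendBuf, count, MPI_FLOAT,
                      recvBuf, count, MPI_FLOAT,
                      MPI_COMM_WORLD, &request);

        // ...and poke the MPI progress engine while GPU work would run
        int done = 0;
        while (!done)
        {
            // (launch or advance GPU kernels here)
            MPI_Test(&request, &done, MPI_STATUS_IGNORE);
        }

        cudaFree(sendBuf);
        cudaFree(recvBuf);
        MPI_Finalize();
        return 0;
    }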