FYI for others who have run into the same problem, see
https://github.com/openucx/ucx/issues/3359.  In short:
1. Use UCX 1.5 rather than 1.4 (I recommend updating
https://www.open-mpi.org/faq/?category=buildcuda)
2. Dynamically link the cudart library (by default nvcc links it
statically); see the example nvcc invocation below.  Future UCX versions
will fix the lingering bug that currently makes this necessary.
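
For example, with nvcc you can link against the shared CUDA runtime via
--cudart=shared.  A rough sketch (file names and the MPI_HOME install
prefix below are just placeholders for your own setup):

  nvcc --std=c++11 --cudart=shared mpi_test_ialltoall.cu -o mpi_test_ialltoall \
      -I$MPI_HOME/include -L$MPI_HOME/lib -lmpi

Alternatively, use nvcc only for compilation and do the final link with
mpicxx, adding -lcudart so the shared runtime gets picked up.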

With these changes, I was able to successfully run my application.

On Sun, Mar 3, 2019 at 9:49 AM Adam Sylvester <op8...@gmail.com> wrote:

> I'm running Open MPI 4.0.0, built against CUDA 10.0 with gdrcopy 1.3 and
> UCX 1.4 per the instructions at
> https://www.open-mpi.org/faq/?category=buildcuda, on RHEL 7.  I'm running
> on a p2.xlarge instance in AWS (single NVIDIA K80 GPU).  Open MPI reports
> CUDA support:
> $ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
> mca:mpi:base:param:mpi_built_with_cuda_support:value:true
>
> I'm attempting to use MPI_Ialltoall() to overlap a block of GPU
> computations with network transfers, using MPI_Test() to nudge the async
> transfers along.  Based on Table 5 in
> https://www.open-mpi.org/faq/?category=runcuda, MPI_Ialltoall() should be
> supported (MPI_Test() isn't explicitly listed as supported or unsupported,
> but my example crashes with or without it).  The behavior I'm seeing is
> that with a small number of elements, everything runs without issue.
> However, with a larger number of elements (where "large" is just a few
> hundred), I start to get errors like
> "cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 16000,
> error message Bad address".  Changing to the blocking MPI_Alltoall()
> results in the program running successfully.
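>
> For reference, the overlap pattern I'm using looks roughly like the
> sketch below (not the exact code from the gist; in the real application
> the buffers are CUDA device pointers and do_gpu_work() is a placeholder
> for the block of GPU computation):
>
>   #include <mpi.h>
>   #include <vector>
>
>   static void do_gpu_work() { /* placeholder for a block of GPU work */ }
>
>   int main(int argc, char** argv)
>   {
>       MPI_Init(&argc, &argv);
>
>       int size;
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>       const int count = 1000;  // elements exchanged with each rank
>       std::vector<float> sendbuf(count * size);
>       std::vector<float> recvbuf(count * size);
>
>       // Start the non-blocking all-to-all...
>       MPI_Request req;
>       MPI_Ialltoall(sendbuf.data(), count, MPI_FLOAT,
>                     recvbuf.data(), count, MPI_FLOAT, MPI_COMM_WORLD, &req);
>
>       // ...and overlap it with GPU work, using MPI_Test() to nudge the
>       // transfer along until it completes.
>       int done = 0;
>       while (!done) {
>           do_gpu_work();
>           MPI_Test(&req, &done, MPI_STATUS_IGNORE);
>       }
>
>       MPI_Finalize();
>       return 0;
>   }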
>
> I tried boiling my issue down to the simplest program that reproduces the
> crash.  Note that it needs to be compiled with "--std=c++11".  Running
> "mpirun -np 2 mpi_test_ialltoall 200 256 10" succeeds; changing the 200 to
> 400 results in a crash after a few blocks.  Thanks for any thoughts.
>
> Code sample:
> https://gist.github.com/asylvest/7c9d5c15a3a044a0a2338cf9c828d2c3
>
> -Adam
>