Hey all,

To test the performance of my application, I call the function that
launches the computation on the two GPUs five times in a row. During the
4th and 5th run, however, the algorithm yields different results (9
clusters instead of 20):

# datatype: double
# datapoints: 20000
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 *820* *9*
121.* 1000 *820* *9*

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both
compiled with CUDA awareness. The CUDA Toolkit version is 6.0.
Both GPUs are attached to the same CPU, so CUDA IPC can be used (no QPI
link has to be traversed).
Running the application with "mpirun -np 2 --mca
btl_smcuda_cuda_ipc_verbose 100" confirms that IPC is used.
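
For context, each MPI rank binds to one of the two GPUs before any
communication takes place. A minimal sketch of that binding (the actual
selection logic in my code is in the pastebin linked below and may
differ in detail):

#include <mpi.h>
#include <cuda_runtime.h>

// sketch: bind each of the two ranks to one of the two GPUs on the host
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // rank 0 -> GPU 0, rank 1 -> GPU 1 (both behind the same CPU/PCIe root)
    cudaSetDevice(rank % device_count);

    // ... computation and CUDA-aware MPI communication ...

    MPI_Finalize();
    return 0;
}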

I tracked the problem down to an MPI_Allgather that seems not to work:
the first GPU identifies 9 clusters and the second GPU identifies 11
clusters (20 clusters in total), and debugging the application shows
that all clusters are identified correctly; however, the exchange of the
identified clusters appears to fail. Each MPI process stores its
identified clusters in a buffer that both processes exchange using
MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
// in-place all-gather directly on the device buffer
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);
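
To make the failing pattern explicit, here is a minimal, self-contained
sketch of the exchange (value_type is double in this run; the cluster
detection itself is replaced by a placeholder fill, and columns is only
an example size):

#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/fill.h>

typedef double value_type;

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm communicator = MPI_COMM_WORLD;

    int rank = 0, size = 0;
    MPI_Comm_rank(communicator, &rank);
    MPI_Comm_size(communicator, &size);

    cudaSetDevice(rank); // one GPU per rank

    const int columns = 4; // example block size per rank
    thrust::device_vector<value_type> dec(columns * size, 0.0);

    // each rank fills its own block of the device buffer
    thrust::fill(dec.begin() + columns * rank,
                 dec.begin() + columns * (rank + 1),
                 value_type(rank + 1));

    // in-place all-gather directly on the device pointer (CUDA-aware MPI)
    value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  d_dec, columns, MPI_DOUBLE, communicator);

    // after the call every rank should see both blocks; in my application
    // the other rank's block is missing during the 4th and 5th run
    thrust::host_vector<value_type> check = dec;
    for (int i = 0; i < columns * size; ++i) {
        double v = check[i];
        printf("rank %d: dec[%d] = %f\n", rank, i, v);
    }

    MPI_Finalize();
    return 0;
}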

I later discovered that if I introduce a temporary host buffer that
receives the results of both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
// gather this rank's block from the device buffer into the host buffer
MPI_Allgather(d_dec + columns * comm.rank(), columns, MPI_DOUBLE,
              &h_dec[0], columns, MPI_DOUBLE, communicator);
dec = h_dec; // copy results back from host to device

This led me to the conclusion that something in Open MPI's CUDA IPC path
causes the problem (a synchronisation issue and/or a fail-silent error),
and indeed, disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 \
    -np 2 ./double_test ../data/similarities20000.double.-300 ex.20000.double.2.gpus 1000 1000 0.9

yields correct results:

# datatype: double
# datapoints: 20000
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 *807 20*
121.* 1000 *807 20*

Surprisingly, the wrong results _always_ occur during the 4th and 5th
run. Is there a way to force synchronisation? (I tried MPI_Barrier()
without success.) Has anybody observed similar problems?
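
For clarity, on the application side the kind of synchronisation I could
add would be an explicit device synchronisation around the collective,
roughly like this (just a sketch, assuming everything runs on the
default stream):

// explicit device synchronisation around the collective
cudaDeviceSynchronize(); // make sure dec has been fully written
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);
cudaDeviceSynchronize(); // make sure the received data is visible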

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph
