Hello all,

The very simple code below returns mpiRC = 15.

const std::array< double, 2 > rangeMin { minX, minY };
std::array< double, 2 > rangeTempRecv { 0.0, 0.0 };
int mpiRC = MPI_Allreduce( rangeMin.data(), rangeTempRecv.data(),
                           static_cast< int >( rangeMin.size() ),
                           MPI_DOUBLE, MPI_MIN, PETSC_COMM_WORLD );

Some information before my questions:

  1.  The environment I am running this code in has hundreds of compute 
nodes, each node with 4 MPI ranks.
  2.  It is running in the cloud, so it is tricky to get extra information "on 
the fly".
  3.  I am using OpenMPI 4.1.2 + PETSc 3.16.5 + GNU compilers.
  4.  The error happens consistently at the same point in the execution, at 
ranks 1 and 2 only (out of hundreds of MPI ranks).
  5.  By the time the execution reaches the code above, the application has 
already called PetscInitialize() and many MPI routines successfully.
  6.  Before the call to MPI_Allreduce() above, the code calls MPI_Barrier(), 
so all ranks call MPI_Allreduce().
  7.  At https://www.open-mpi.org/doc/current/man3/OpenMPI.3.php it is written 
"MPI_ERR_TRUNCATE          15      Message truncated on receive."
  8.  At https://www.open-mpi.org/doc/v4.1/man3/MPI_Allreduce.3.php, it is 
written "The reduction functions ( MPI_Op ) do not return an error value. As a 
result, if the functions detect an error, all they can do is either call 
MPI_Abort or silently skip the problem. Thus, if you change the error handler 
from MPI_ERRORS_ARE_FATAL to something else, for example, MPI_ERRORS_RETURN, 
then no error may be indicated." (A sketch of explicitly setting the error 
handler, to rule this out, follows this list.)
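
In view of item (8), here is a minimal sketch of how the error handler could 
be set explicitly on the communicator before the collective call (assuming 
PETSC_COMM_WORLD, as in the snippet above), so that any error from 
MPI_Allreduce() is returned to the caller rather than being fatal:

// Ask MPI to return error codes on this communicator instead of
// aborting the job, so mpiRC from MPI_Allreduce() can be inspected.
MPI_Comm_set_errhandler( PETSC_COMM_WORLD, MPI_ERRORS_RETURN );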

Questions:

  1.  Any ideas on what could cause the return code 15? The code is pretty 
simple and both buffers have a fixed size of 2.
  2.  In view of item (8), does this mean that the return code 15 described in 
item (7) might not be informative?
  3.  Once I get a return code != MPI_SUCCESS, is there any routine I can call 
in the application code to get extra information from MPI?
  4.  Once the application aborts (I throw an exception once a return code is 
!= MPI_SUCCESS; a sketch of that check follows these questions), is there some 
command I can run on all nodes in order to get extra info?
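
Regarding questions (3) and (4), this is roughly the check I have around the 
call; a minimal sketch only, assuming mpiRC is the return code from the 
snippet above and that <iostream> and <stdexcept> are included:

if ( mpiRC != MPI_SUCCESS ) {
  int errClass = 0;
  int errLen = 0;
  char errString[ MPI_MAX_ERROR_STRING ];
  MPI_Error_class( mpiRC, &errClass );            // e.g. 15 -> MPI_ERR_TRUNCATE
  MPI_Error_string( mpiRC, errString, &errLen );  // human-readable description
  std::cerr << "MPI_Allreduce() failed: class = " << errClass
            << ", message = '" << errString << "'" << std::endl;
  throw std::runtime_error( errString );          // abort the application
}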

Thank you in advance,

Ernesto.

