There are two ways MPI_Allreduce can return MPI_ERR_TRUNCATE:
1. It is propagated from one of the underlying point-to-point
communications, which means that at least one of the participants supplied
an input buffer with a larger size. I know you said the size is fixed, but
that only helps if all processes are in the same blocking MPI_Allreduce.
2. The code is not SPMD, and one of your processes calls a different
MPI_Allreduce on the same communicator.

There is no simple way to get more information about this issue. If you
have a version of OMPI compiled in debug mode, you can increase the
verbosity of the collective framework to see if you get more interesting
information.

George.


On Wed, Mar 9, 2022 at 2:23 PM Ernesto Prudencio via users <
users@lists.open-mpi.org> wrote:

> Hello all,
>
>
>
> The very simple code below returns mpiRC = 15.
>
>
>
> const std::array< double, 2 > rangeMin { minX, minY };
>
> std::array< double, 2 > rangeTempRecv { 0.0, 0.0 };
>
> int mpiRC = MPI_Allreduce( rangeMin.data(), rangeTempRecv.data(),
> rangeMin.size(), MPI_DOUBLE, MPI_MIN, PETSC_COMM_WORLD );
>
>
>
> Some information before my questions:
>
>    1. The environment I am running this code has hundreds of compute
>    nodes, each node with 4 MPI ranks.
>    2. It is running in the cloud, so it is tricky to get extra
>    information “on the fly”.
>    3. I am using OpenMPI 4.1.2 + PETSc 3.16.5 + GNU compilers.
>    4. The error happens consistently at the same point in the execution,
>    at ranks 1 and 2 only (out of hundreds of MPI ranks).
>    5. By the time the execution gets to the code above, it has already
>    called PetscInitialize() and many MPI routines successfully.
>    6. Before the call to MPI_Allreduce() above, the code calls
>    MPI_Barrier(), so all ranks call MPI_Allreduce().
>    7. At https://www.open-mpi.org/doc/current/man3/OpenMPI.3.php it is
>    written “MPI_ERR_TRUNCATE          15      Message truncated on receive.”
>    8. At https://www.open-mpi.org/doc/v4.1/man3/MPI_Allreduce.3.php, it
>    is written “The reduction functions ( *MPI_Op* ) do not return an
>    error value. As a result, if the functions detect an error, all they can do
>    is either call *MPI_Abort
>    <https://www.open-mpi.org/doc/v4.1/man3/MPI_Abort.3.php>* or silently
>    skip the problem. Thus, if you change the error handler from
>    *MPI_ERRORS_ARE_FATAL* to something else, for example,
>    *MPI_ERRORS_RETURN* , then no error may be indicated.”
>
>
>
> Questions:
>
>    1. Any ideas for what could be the cause for the return code 15? The
>    code is pretty simple and the buffers have fixed size = 2.
>    2. In view of item (8), does it mean that the return code 15 in item
>    (7) might not be informative?
>    3. Once I get a return code != MPI_SUCCESS, is there any routine I can
>    call, in the application code, to get extra information on MPI?
>    4. Once the application aborts (I throw an exception once a return
>    code is != MPI_SUCCESS), is there some command line I can run on all nodes
>    in order to get extra info?
>
>
>
> Thank you in advance,
>
>
>
> Ernesto.
>
>