Daniel,

Can you please post the full error message and share a reproducer for this issue?
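
In case it helps when putting the reproducer together, here is a minimal sketch of how the 4x4 layout quoted below maps onto the color and key arguments of MPI_Comm_split. It is plain C, so it runs without MPI. The names grid_pos and grid_coords and the row-major numbering are assumptions for illustration, not taken from your code.

```c
#include <stdio.h>

/* Hypothetical helper type: position of a rank in a row-major process grid. */
typedef struct {
    int row; /* index of the row communicator, e.g. 1 for row_comm1 */
    int col; /* index of the column communicator, e.g. 2 for col_comm2 */
} grid_pos;

/* Assumes ranks are numbered row by row, as in the quoted diagram. */
grid_pos grid_coords(int world_rank, int ncols)
{
    grid_pos p;
    p.row = world_rank / ncols;
    p.col = world_rank % ncols;
    return p;
}

/* Print which row/column communicator each rank would belong to.
 *
 * In an MPI program, each rank would then typically call:
 *   MPI_Comm_split(MPI_COMM_WORLD, p.row, p.col, &row_comm);
 *   MPI_Comm_split(MPI_COMM_WORLD, p.col, p.row, &col_comm);
 * (color selects the new communicator, key orders the ranks inside it).
 */
void print_grid(int nrows, int ncols)
{
    for (int rank = 0; rank < nrows * ncols; rank++) {
        grid_pos p = grid_coords(rank, ncols);
        printf("P%d: row_comm%d col_comm%d\n", rank, p.row, p.col);
    }
}
```

Calling print_grid(4, 4) lists all 16 ranks. A reproducer built on this only needs the two splits, the long-running step on one column communicator, and the MPI_Bcast/MPI_Barrier calls that fail.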

Cheers,

Gilles

On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users <users@lists.open-mpi.org> wrote:
>
> Hi all.
>
> I'm currently implementing an algorithm that creates a process grid and
> divides it into row and column communicators as follows:
>
>              col_comm0  col_comm1  col_comm2  col_comm3
> row_comm0    P0         P1         P2         P3
> row_comm1    P4         P5         P6         P7
> row_comm2    P8         P9         P10        P11
> row_comm3    P12        P13        P14        P15
>
> Then, every process works on its own column communicator and broadcasts data
> on row communicators.
> While column operations are being executed, processes not included in the
> current column communicator just wait for results.
>
> At some point, a column communicator may be split to create a temporary
> communicator so that only the relevant processes work on it.
>
> At the end of a step, a call to MPI_Barrier (on a duplicate of
> MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
>
> With a small amount of data (a small matrix) the MPI_Barrier call syncs
> correctly on the communicator that includes all processes and processing ends
> fine.
> But when the amount of data is increased (a big matrix), operations on
> column communicators take more time to finish and hence the waiting time also
> increases for the waiting processes.
>
> After a while, waiting processes return an error when they have not
> received the broadcast (MPI_Bcast) on row communicators or when they have
> finished their work and are at the sync point (MPI_Barrier). But when the
> operations on the current column communicator end, the still-active processes
> try to broadcast on row communicators and they fail because the waiting
> processes have returned an error. So all processes fail at different times.
>
> So my problem is that waiting processes "believe" that the current operations
> have failed (but they have not finished yet!) and they fail too.
>
> So I have a question about MPI_Bcast/MPI_Barrier:
>
> Is there a way to increase the timeout a process can wait for a broadcast or
> barrier to be completed?
>
> Here is my machine and OpenMPI info:
> - OpenMPI version: Open MPI 4.1.0u1a1
> - OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC
>   2020 x86_64 x86_64 x86_64 GNU/Linux
>
> Thanks in advance for reading my description/question.
>
> Best regards.
>
> --
> Daniel Torres
> LIPN - Université Sorbonne Paris Nord