Hi all.

I'm currently implementing an algorithm that creates a process grid and splits it into row and column communicators, as follows:

             col_comm0    col_comm1    col_comm2    col_comm3
row_comm0    P0           P1           P2           P3
row_comm1    P4           P5           P6           P7
row_comm2    P8           P9           P10          P11
row_comm3    P12          P13          P14          P15
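
To make the setup concrete, here is a simplified sketch of how such a grid can be built (illustrative names and a hard-coded 4x4 grid, not my exact code):

/* Sketch: split a 4x4 process grid (16 ranks) into row and
 * column communicators. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm row_comm, col_comm;
    int rank, grid_cols = 4;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int my_row = rank / grid_cols;   /* e.g. P5 -> row 1 */
    int my_col = rank % grid_cols;   /* e.g. P5 -> col 1 */

    /* Ranks sharing a grid row end up in the same row communicator,
     * ordered by column; likewise for columns. */
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

    /* ... algorithm steps go here ... */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}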

Then, every process works within its own column communicator and broadcasts data on its row communicator. While column operations are being executed, processes not included in the currently active column communicator just wait for the results.
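
A single step then looks roughly like this (continuing the sketch above; active_col, the buffer size and compute_on_column are placeholders, not my real names):

/* One step: the active column computes, then each of its members
 * broadcasts its result along its own row. */
double buf[1024];
if (my_col == active_col)
    compute_on_column(col_comm, buf);   /* hypothetical column kernel */

/* With the split keys used above, the member of the active column
 * has rank active_col inside every row communicator, so it is the
 * root of the row broadcast. */
MPI_Bcast(buf, 1024, MPI_DOUBLE, active_col, row_comm);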

At some point, a column communicator may be split to create a temporary communicator, so that only the appropriate processes work on it.
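
That split is done along these lines (sketch; is_selected stands for whatever predicate picks the right processes):

/* Carve a temporary communicator out of col_comm. Ranks that should
 * not participate pass MPI_UNDEFINED as the color and get
 * MPI_COMM_NULL back, so they skip the temporary work entirely. */
MPI_Comm tmp_comm;
int col_rank;
MPI_Comm_rank(col_comm, &col_rank);
MPI_Comm_split(col_comm,
               is_selected(col_rank) ? 0 : MPI_UNDEFINED,
               col_rank, &tmp_comm);
if (tmp_comm != MPI_COMM_NULL) {
    /* ... operations restricted to the selected processes ... */
    MPI_Comm_free(&tmp_comm);
}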

At the end of a step, a call to MPI_Barrier (on a duplicate of MPI_COMM_WORLD) is executed to synchronize all processes and avoid incorrect results.
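
Concretely (sketch):

/* Created once at startup: a private duplicate of MPI_COMM_WORLD, so
 * the end-of-step barrier does not interfere with other traffic. */
MPI_Comm world_dup;
MPI_Comm_dup(MPI_COMM_WORLD, &world_dup);

/* ... and at the end of each step: */
MPI_Barrier(world_dup);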

With a small amount of data (a small matrix), the MPI_Barrier call synchronizes correctly on the communicator that includes all processes and the processing ends fine. But when the amount of data increases (a big matrix), the operations on the column communicators take longer to finish, and the waiting time of the idle processes grows accordingly.

After some time, the waiting processes return an error, either while waiting for a broadcast (MPI_Bcast) they have not yet received on their row communicators, or while blocked at the synchronization point (MPI_Barrier) after finishing their own work. Then, when the operations on the current column communicator finally end, the still-active processes try to broadcast on the row communicators and fail as well, because the waiting processes have already returned an error. So all processes fail, each at a different moment in time.

So my problem is that the waiting processes "believe" that the current operations have failed (when in fact they just have not finished yet!) and so they fail too.

So I have a question about MPI_Bcast/MPI_Barrier:

Is there a way to increase the timeout a process will wait for a broadcast or a barrier to complete?

Here is my machine and Open MPI info:
- Open MPI version: 4.1.0u1a1
- OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance for reading my description/question.

Best regards.

--
Daniel Torres
LIPN - Université Sorbonne Paris Nord
