Hi all.
I'm currently implementing an algorithm that creates a process grid and
divides it into row and column communicators as follows:
           col_comm0  col_comm1  col_comm2  col_comm3
row_comm0  P0         P1         P2         P3
row_comm1  P4         P5         P6         P7
row_comm2  P8         P9         P10        P11
row_comm3  P12        P13        P14        P15
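The rank-to-communicator mapping in the grid above can be sketched as
follows. This is only an illustration under assumptions: the names
GRID_DIM, grid_pos, grid_position and print_membership are hypothetical,
and the row-major numbering is inferred from the diagram. It uses no MPI
calls, only the index arithmetic.

```c
#include <assert.h>
#include <stdio.h>

#define GRID_DIM 4  /* assumed: 4 x 4 grid of 16 processes, as in the diagram */

/* Hypothetical helper type: position of a process in the grid. */
typedef struct {
    int row;  /* index of the row communicator (row_comm<row>) */
    int col;  /* index of the column communicator (col_comm<col>) */
} grid_pos;

/* Map a world rank to its grid position, assuming row-major numbering. */
grid_pos grid_position(int rank)
{
    assert(rank >= 0 && rank < GRID_DIM * GRID_DIM);
    grid_pos p;
    p.row = rank / GRID_DIM;
    p.col = rank % GRID_DIM;
    return p;
}

/* Print which row and column communicator a rank belongs to. */
void print_membership(int rank)
{
    grid_pos p = grid_position(rank);
    printf("P%d -> row_comm%d, col_comm%d\n", rank, p.row, p.col);
}
```

In an MPI program, these indices would typically serve as the color
argument of MPI_Comm_split: p.row as the color (with p.col as the key)
for the row communicator, and p.col as the color (with p.row as the key)
for the column communicator. For example, P6 belongs to row_comm1 and
col_comm2.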
Then every process works on its own column communicator and broadcasts
data on the row communicators.
While column operations are being executed, processes not included in
the current column communicator just wait for results.
At some point, a column communicator may be split to create a temporary
communicator so that only the relevant processes work on it.
At the end of a step, a call to MPI_Barrier (on a duplicate of
MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
With a small amount of data (a small matrix) the MPI_Barrier call syncs
correctly on the communicator that includes all processes and processing
ends fine.
But when the amount of data increases (a big matrix), operations on
column communicators take more time to finish, and hence the waiting
time also increases for the waiting processes.
After a while, the waiting processes return an error, either while they
are still waiting for the broadcast (MPI_Bcast) on the row communicators
or while they are waiting at the sync point (MPI_Barrier) after finishing
their work. Then, when the operations on the current column communicator
end, the still-active processes try to broadcast on the row communicators
and fail because the waiting processes have already returned an error. So
all processes fail, at different moments in time.
So my problem is that waiting processes "believe" that the current
operations have failed (but they have not finished yet!) and they fail too.
So I have a question about MPI_Bcast/MPI_Barrier:
Is there a way to increase the timeout, i.e. how long a process can wait
for a broadcast or barrier to complete?
Here is my machine and Open MPI info:
- Open MPI version: 4.1.0u1a1
- OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00
UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Thanks in advance for reading my description/question.
Best regards.
--
Daniel Torres
LIPN - Université Sorbonne Paris Nord