Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-12 Thread Daniel Torres via users
Hi George and Gilles. Thanks a lot for taking the time to test the code I sent. Since Gilles mentioned that all the tests he ran worked perfectly, I decided to install a completely fresh *OMPI 4.1.0* and test again. Happily, the OOM killer is not shooting any process and all my experimentation worked

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread George Bosilca via users
MPI_ERR_PROC_FAILED is not yet a valid error in MPI. It comes from ULFM, an extension to MPI that is not yet in the OMPI master. Daniel, what version of Open MPI are you using? Are you sure you are not mixing multiple versions due to PATH/LD_LIBRARY_PATH? George. On Mon, Jan 11,

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread Gilles Gouaillardet via users
Daniel, the test works in my environment (1 node, 32 GB memory) with all the mentioned parameters. Did you check the memory usage on your nodes and make sure the OOM killer did not shoot any process? Cheers, Gilles On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users wrote: > > Hi. > >

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread Daniel Torres via users
Hi. Thanks for responding. I have taken the most important parts of my code and created a test that reproduces the behavior I described previously. I attach to this e-mail the compressed file "*test.tar.gz*". Inside it, you can find: 1.- The .c source code "test.c", which I compiled

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-08 Thread George Bosilca via users
Daniel, There are no timeouts in OMPI with the exception of the initial connection over TCP, where we use the socket timeout to prevent deadlocks. As you already did quite a few communicator duplications and other collective communications before you see the timeout, we need more info about this.

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-08 Thread Gilles Gouaillardet via users
Daniel, Can you please post the full error message and share a reproducer for this issue? Cheers, Gilles On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users wrote: > > Hi all. > > Currently I'm implementing an algorithm that creates a process grid and > divides it into row and column
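
For readers following the thread, the process-grid layout mentioned in the quoted message could be sketched roughly as follows. This is not Daniel's code (his reproducer is in the attachment and is not shown here); it is a minimal illustration that assumes a row-major grid. The names `grid_pos` and `grid_position` are hypothetical helpers introduced only for this example.

```c
#include <stddef.h>

/* Hypothetical type: position of one rank in a row-major process grid. */
typedef struct {
    int row;  /* ranks sharing this value would belong to the same row group */
    int col;  /* ranks sharing this value would belong to the same column group */
} grid_pos;

/*
 * Hypothetical helper: map a linear rank to (row, col) in an
 * nrows x ncols grid, assuming row-major ordering.
 * Returns 0 on success, -1 if the arguments do not describe a valid grid slot.
 */
int grid_position(int rank, int nrows, int ncols, grid_pos *out)
{
    if (out == NULL || nrows <= 0 || ncols <= 0)
        return -1;
    if (rank < 0 || rank >= nrows * ncols)
        return -1;

    out->row = rank / ncols;  /* integer division selects the row */
    out->col = rank % ncols;  /* remainder selects the column */
    return 0;
}
```

In an MPI program, one plausible use would be to pass `pos.row` as the color and `pos.col` as the key to `MPI_Comm_split` to obtain a row communicator, and to swap the two arguments to obtain a column communicator. Whether the original code builds its communicators this way is an assumption.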