Hi George and Gilles.
Thanks a lot for taking the time to test the code I sent.
As Gilles mentioned, all the tests he ran worked perfectly, so I decided to
install a completely fresh OMPI 4.1.0 and test again.
Happily, the OOM killer is not shooting any process anymore and all my
experiments worked.
MPI_ERR_PROC_FAILED is not yet a valid error in MPI. It is coming from
ULFM, an extension to MPI that is not yet in the OMPI master.

Daniel, what version of Open MPI are you using? Are you sure you are not
mixing multiple versions due to PATH/LD_LIBRARY_PATH?

George.
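A quick way to check for a mixed installation (a sketch; paths and output will vary per system) is to compare what the shell actually resolves against the library search path:

```shell
# Print which Open MPI launcher is first on PATH, and its version, if any
if command -v mpirun >/dev/null 2>&1; then
    command -v mpirun
    mpirun --version | head -n 1
else
    echo "mpirun not found on PATH"
fi

# Show the library search path; stale entries here can pull in an older libmpi
# than the one mpirun was built against
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<empty>}"
```

If the `mpirun` on PATH and the `libmpi` resolved through LD_LIBRARY_PATH come from different installation prefixes, errors from one version can surface while running another.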
On Mon, Jan 11,
Daniel,
the test works in my environment (1 node, 32 GB memory) with all the
mentioned parameters.
Did you check the memory usage on your nodes and make sure the OOM
killer did not shoot any process?
Cheers,
Gilles
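One way to verify whether the OOM killer intervened is to inspect the kernel log (a sketch; log locations vary by distribution, and reading the kernel ring buffer may require elevated privileges):

```shell
# Look for OOM-killer activity in the kernel ring buffer
dmesg 2>/dev/null | grep -iE "out of memory|oom-killer|killed process" \
    || echo "no OOM messages found (or dmesg not permitted)"

# On systemd-based systems the kernel journal keeps the same information
journalctl -k --no-pager 2>/dev/null | grep -i "oom" \
    || echo "no OOM entries in the kernel journal (or journalctl unavailable)"
```

A line such as "Killed process <pid> (a.out)" in either output confirms that a rank was terminated by the kernel rather than failing inside MPI.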
On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users
wrote:
Hi.
Thanks for responding. I have extracted the most important parts of my
code and created a test that reproduces the behavior I described
previously.
I attach the compressed file "test.tar.gz" to this e-mail. Inside it,
you can find:
1.- The .c source code "test.c", which I compiled
Daniel,
There are no timeouts in OMPI, with the exception of the initial connection
over TCP, where we use the socket timeout to prevent deadlocks. Since you
already performed quite a few communicator duplications and other collective
communications before seeing the timeout, we need more information about this.
Daniel,
Can you please post the full error message and share a reproducer for
this issue?
Cheers,
Gilles
On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
wrote:
>
> Hi all.
>
> Actually I'm implementing an algorithm that creates a process grid and
> divides it into row and column