Thanks, Gilles,

Yes, this binary was built a few years ago.

You mention a user error, but do you mean developer error?  I.e., it
would have to be in the code?

What does "--mca coll ^tuned" do?

Thx....John

On 2/15/16 4:03 PM, Gilles Gouaillardet wrote:
John,

the readv error is likely a consequence of the abort, and not the root cause of the issue.

an obvious user error is if not all MPI tasks call MPI_Bcast with compatible signatures.

the coll/tuned module is known to be broken when tasks use different but compatible signatures. for example, one process calls MPI_Bcast with one vector of N MPI_DOUBLE, and another process calls MPI_Bcast with N MPI_DOUBLE.

you can try to

mpirun --mca coll ^tuned ...

and see if it helps
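
For readers unfamiliar with the option: "--mca coll ^tuned" sets the MCA parameter "coll" to "^tuned". The leading "^" means "exclude", so Open MPI chooses its collective operations from the remaining coll components instead of tuned. Below is a minimal sketch of another way to set the same parameter, through the OMPI_MCA_coll environment variable. The process count and the ./my_app name are placeholders and are not taken from the report in this thread.

```shell
# Sketch only: "-np 4" and "./my_app" are hypothetical placeholders.

# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables,
# so this has the same effect as passing "--mca coll ^tuned" to mpirun.
# The single quotes stop shells that treat ^ specially from interpreting it.
export OMPI_MCA_coll='^tuned'

if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    # The tuned collective component is excluded for this run.
    mpirun -np 4 ./my_app
else
    echo "mpirun or ./my_app not found; OMPI_MCA_coll=$OMPI_MCA_coll"
fi
```

If the MPI_ERR_TRUNCATE error disappears with tuned excluded, that points toward the coll/tuned problem described above. If it persists, mismatched MPI_Bcast arguments in the application become the more likely cause.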

fwiw, OpenMPI 1.6.5 is quite old nowadays...

Cheers,

Gilles

On 2/16/2016 7:28 AM, JR Cary wrote:
We have distributed a binary to a person with a Linux cluster. When
he runs our binary, he gets

[server1:10978] *** An error occurred in MPI_Bcast
[server1:10978] *** on communicator MPI COMMUNICATOR 8 DUP FROM 7
[server1:10978] *** MPI_ERR_TRUNCATE: message truncated
[server1:10978] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[server2][[14125,1],2][/..../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Anyone have any ideas on how to debug this?

Thanks......John Cary
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/02/28534.php
