Hello Jeff, Ralph, All Open MPI folks,

We had an off-list discussion about an error in the Serpent program. Ralph said:

>We already have several tickets for that problem, each relating to a different scenario:
>https://svn.open-mpi.org/trac/ompi/ticket/2155
>https://svn.open-mpi.org/trac/ompi/ticket/2157
>https://svn.open-mpi.org/trac/ompi/ticket/2295

I've built a quite small reproducer for the original issue (though with a huge memory footprint) and have sent it to you.

The other week, another user ran into problems when using huge data sets.

A program which runs without any problem on smaller data sets (on the order of 24 GB of data in total and below) runs into trouble with huge data sets (on the order of 100 GB of data in total and more),
_if running over InfiniBand or IPoIB_.

The program essentially hangs, usually blocking the transport used; in some scenarios it crashes. The same program and data set run fine over Ethernet or shared memory (yes, we have computers with hundreds of GB of memory). The behaviour is reproducible.

Various errors are produced; some of them are listed below. Another thing: in most cases, when the program hangs, it also blocks the transport, i.e. other programs cannot run over the same interface (just as reported earlier).

More fun: we also found some '#procs x #nodes' combinations where the program runs fine.

For example:
30 and 60 processes over 6 nodes run through fine,
6 procs over 6 nodes - killed with an error message (see below),
12, 18, 36, 61, 62, 64, 66 procs over 6 nodes - hang and block the interface.

Well, we cannot guarantee that this isn't a bug in the program itself, because it is still under development. However, since the program works well with smaller data sets, over TCP, and over shared memory, it smells like an MPI library error, hence this mail.

Or could the puzzling behaviour be a consequence of a bug in the program itself? If so, what could it be and how could we try to find it?

I did not attach a reproducer to this mail because the user does not want to spread the code all over the world, but I can send it to you if you are interested in reproducing the issue. [The code transposes huge matrices and essentially calls MPI_Alltoallv; it is written as 'nice, well-structured' C++ code (nothing stays unwrapped), but is pretty small and readable.]
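
To give a rough idea of the communication pattern without spreading the actual code, here is a minimal hypothetical sketch of my own (buffer sizes, variable names and the data type are placeholders I made up, not taken from the user's program): each rank exchanges one large contiguous block with every other rank through a single MPI_Alltoallv call.

------------------------------------------------------------------------------
// Hypothetical sketch only - NOT the user's reproducer. It just illustrates
// the pattern: one huge MPI_Alltoallv per transpose step. In the failing runs
// the per-rank buffers are on the order of GBs, not the placeholder size here.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each rank sends one contiguous block to every other rank.
    const long blockElems = 1L << 20;                  // placeholder; huge in reality
    std::vector<double> sendbuf(blockElems * nprocs, (double)rank);
    std::vector<double> recvbuf(blockElems * nprocs, 0.0);

    std::vector<int> counts(nprocs, (int)blockElems);  // int counts: < 2^31 elements per block
    std::vector<int> displs(nprocs);
    for (int p = 0; p < nprocs; ++p)
        displs[p] = (int)(p * blockElems);

    // The kind of call that hangs/crashes over openib or IPoIB with huge data sets.
    MPI_Alltoallv(sendbuf.data(), counts.data(), displs.data(), MPI_DOUBLE,
                  recvbuf.data(), counts.data(), displs.data(), MPI_DOUBLE,
                  MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("MPI_Alltoallv completed\n");

    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------------------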

Ralph, Jeff, anybody - any interest in reproducing this issue?

Best wishes,
Paul Kapinos


P.S. Open MPI 1.5.3 was used - still waiting for 1.5.5 ;-)








Some error messages:

with 6 procs over 6 Nodes:
------------------------------------------------------------------------------
mlx4: local QP operation err (QPN 7c0063, WQE index 0, vendor syndrome 6f, opcode = 5e)
[[8771,1],5][btl_openib_component.c:3316:handle_wc] from linuxbdc07.rz.RWTH-Aachen.DE to: linuxbdc04 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 6afb70 opcode 0 vendor error 111 qp_idx 3
mlx4: local QP operation err (QPN 18005f, WQE index 0, vendor syndrome 6f, opcode = 5e)
[[8771,1],2][btl_openib_component.c:3316:handle_wc] from linuxbdc03.rz.RWTH-Aachen.DE to: linuxbdc02 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 6afb70 opcode 0 vendor error 111 qp_idx 3
[[8771,1],1][btl_openib_component.c:3316:handle_wc] from linuxbdc02.rz.RWTH-Aachen.DE to: linuxbdc01 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 6afb70 opcode 0 vendor error 111 qp_idx 3
mlx4: local QP operation err (QPN 340057, WQE index 0, vendor syndrome 6f, opcode = 5e)
------------------------------------------------------------------------------


with 61 processes using IPoIB:
mpiexec -mca btl ^openib -np 61 -host 1,2,3,4,5,6 a.out < dim100G.in
------------------------------------------------------------------------------
[linuxbdc02.rz.RWTH-Aachen.DE][[21403,1],1][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 134.61.208.202 failed: Connection timed out (110)
[linuxbdc01.rz.RWTH-Aachen.DE][[21403,1],18][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 134.61.208.203 failed: Connection timed out (110)
[linuxbdc01.rz.RWTH-Aachen.DE][[21403,1],18][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 134.61.208.203 failed: Connection timed out (110)
------------------------------------------------------------------------------


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
