Troy and I talked about this one off-list as well and traced the issue to problems with his local IB fabric. The moral of the story here is that Open MPI's error messages need to be a bit more descriptive (in this case, they should have said, "Help! The sky is falling, the sky is falling!").
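For anyone who hits these messages later: the bare "status 12" / "status 5" numbers in the logs below are libibverbs work-completion codes. Status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded) -- a classic signature of a bad link or fabric -- and the flood of status 5 (IBV_WC_WR_FLUSH_ERR) afterwards is just the queue being flushed once the connection goes into the error state, which lines up with the bad-fabric diagnosis above. Here is a minimal sketch of the kind of decoding a friendlier message could do; it is illustrative only (not Open MPI's actual error path) and assumes a libibverbs new enough to ship ibv_wc_status_str():

    /* decode_wc.c -- illustrative only, not Open MPI's error path.
     * Maps the numeric CQ statuses from the quoted logs to their
     * libibverbs names.  Build: gcc -std=c99 decode_wc.c -libverbs */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        /* The two codes that appear in the quoted logs. */
        enum ibv_wc_status seen[] = { IBV_WC_RETRY_EXC_ERR,   /* == 12 */
                                      IBV_WC_WR_FLUSH_ERR };  /* ==  5 */

        for (int i = 0; i < 2; i++)
            printf("status %d: %s\n", (int)seen[i],
                   ibv_wc_status_str(seen[i]));
        return 0;
    }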
> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Troy Telford
> Sent: Thursday, June 01, 2006 3:35 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
>
> > Did you happen to have a chance to try to run the 1.0.3 or 1.1
> > nightly tarballs? I'm 50/50 on whether we've fixed these issues
> > already.
>
> OK, for ticket #40:
>
> With Open MPI 1.0.3 (nightly downloaded/built May 31st).
> (This time using presta's 'laten', since the source code + comments
> is < 1k lines of code.)
>
> One note: there doesn't seem to be a specific number of nodes at
> which the error crops up. It almost seems like a case of probability:
> with -np 142, the test succeeds ~75% of the time. Lower -np values
> result in higher success rates; larger values of -np increase the
> probability of failure. -np 148 fails > 90% of the time, while
> -np 128 works pretty much all the time.
>
> Fiddling with the machinefile (to try to narrow it down to
> misbehaving hardware) -- for instance, using only a specific set of
> nodes -- had no effect.
>
> On to the results:
>
> [root@zartan1 tmp]# mpirun -v -prefix $MPIHOME -mca btl openib,sm,self -np 148 -machinefile machines /tmp/laten -o 10
>
> MPI Bidirectional latency test (Send/Recv)
> Processes   Max Latency (us)
> ------------------------------------------
>
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 12 for wr_id 47120798794424 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121337969156 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338002208 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338035260 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338068312 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338101364 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338134416 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338167468 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338200520 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121338233572 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 47121340387456 opcode 0
>
> If I use -np 145 (actually, any odd number of nodes; that may just be
> a case of running 'laten' incorrectly):
>
> MPI Bidirectional latency test (Send/Recv)
> Processes   Max Latency (us)
> ------------------------------------------
>  2           8.249
>  4          15.795
>  8          21.803
> 16          23.353
> 32          21.601
> 64          31.900
> [zartan75:06723] *** An error occurred in MPI_Group_incl
> [zartan75:06723] *** on communicator MPI_COMM_WORLD
> [zartan75:06723] *** MPI_ERR_RANK: invalid rank
> [zartan75:06723] *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> (*** and more of the same, from different nodes)
>
> 1 additional process aborted (not shown)
>
> ***************************
> With Open MPI 1.1:
>
> mpirun -v -np 150 -prefix $MPIHOME -mca btl openib,sm,self -machinefile machines laten -o 10
>
> MPI Bidirectional latency test (Send/Recv)
> Processes   Max Latency (us)
> ------------------------------------------
>  2          21.648
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 12 for wr_id 5775790 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 5865600 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 7954692 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 7967282 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 7979872 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 7992462 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 8005052 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 8017642 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 8030232 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 8042822 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 8055412 opcode 0
> --
> Troy Telford
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
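A note on Troy's odd-np aside: presta's latency test pairs tasks up, so the MPI_Group_incl abort at odd -np values is most likely a usage constraint of the benchmark rather than the ticket #40 bug itself. Here is a minimal hypothetical sketch (not presta's actual source) of how a pairwise test trips MPI_ERR_RANK when the task count is odd:

    /* pairs.c -- hypothetical sketch, not presta's code.  With an odd
     * np, the last task's "partner" rank equals np, which does not
     * exist, so MPI_Group_incl fails with MPI_ERR_RANK; under the
     * default MPI_ERRORS_ARE_FATAL handler that aborts the job,
     * matching the "invalid rank" output quoted above. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Group world_grp, pair_grp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

        /* Pair rank 2k with rank 2k+1.  If size is odd, the last
         * pair names rank `size`, which is out of range. */
        int base = rank - (rank % 2);
        int pair[2] = { base, base + 1 };

        MPI_Group_incl(world_grp, 2, pair, &pair_grp);  /* aborts here when np is odd */

        MPI_Group_free(&pair_grp);
        MPI_Group_free(&world_grp);
        MPI_Finalize();
        return 0;
    }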
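And since the -np 148 failure is probabilistic, the quickest way to put numbers on a success rate like Troy's "~75% at -np 142" is a tally loop along these lines (paths, machinefile, and flags copied from the quoted commands; it assumes mpirun exits nonzero when the job aborts):

    #!/bin/sh
    # Tally how often laten succeeds at each -np over ten trials.
    for np in 128 142 148; do
        ok=0
        for trial in 1 2 3 4 5 6 7 8 9 10; do
            if mpirun -prefix "$MPIHOME" -mca btl openib,sm,self \
                      -np "$np" -machinefile machines /tmp/laten -o 10 \
                      > /dev/null 2>&1; then
                ok=$((ok + 1))
            fi
        done
        echo "np=$np: $ok/10 runs succeeded"
    done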