Troy and I also talked about this one off-list and traced the issue to
problems with his local IB fabric.

The moral here is that Open MPI's error messages need to be a bit more
descriptive (in this case, they should have said, "Help! The sky is
falling, the sky is falling!").
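
As a rough illustration of what "more descriptive" could look like, here
is a minimal sketch that turns the bare status numbers from the quoted
output into readable text. It assumes the numbers printed by the openib
BTL are raw libibverbs completion statuses (enum ibv_wc_status), and the
values 5 and 12 are given as I recall them from <infiniband/verbs.h>, so
check them against your installed header. The names WC_STATUS and
describe_status are made up for this example and are not part of Open MPI.

```python
# Partial table, ASSUMED to match enum ibv_wc_status in libibverbs'
# <infiniband/verbs.h>. Only the codes seen in the quoted output are listed.
WC_STATUS = {
    0: ("IBV_WC_SUCCESS", "work request completed successfully"),
    5: ("IBV_WC_WR_FLUSH_ERR",
        "work request flushed because the queue pair was already in error"),
    12: ("IBV_WC_RETRY_EXC_ERR",
         "transport retry count exceeded; the remote side never acknowledged"),
}


def describe_status(status):
    """Return a readable description for a numeric completion status."""
    name, meaning = WC_STATUS.get(
        status, ("UNKNOWN", "status not in this partial table"))
    return f"status {status} ({name}): {meaning}"


if __name__ == "__main__":
    # The two codes that appear in the quoted runs.
    for code in (12, 5):
        print(describe_status(code))
```

Under that assumption, a retry-exceeded error on the first completion
points at the link or the remote peer rather than at the MPI library,
which is consistent with the fabric problems we found.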


> -----Original Message-----
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Troy Telford
> Sent: Thursday, June 01, 2006 3:35 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
> 
> > Did you happen to have a chance to try to run the 1.0.3 or 1.1
> > nightly tarballs?  I'm 50/50 on whether we've fixed these issues
> > already.
> 
> OK, for ticket #40:
> 
> With Open MPI 1.0.3 (nightly downloaded/built May 31st)
> (This time using presta's 'laten', since the source code + comments is
> < 1k lines of code)
> 
> One note:  There doesn't seem to be a specific number of nodes at which
> the error crops up.  It almost seems like a case of probability:  With
> -np 142, the test will succeed ~75% of the time.  Lower -np values
> result in higher success rates.  Larger values of -np increase the
> probability of failure.  -np 148 fails > 90% of the time.  -np 128
> works pretty much all the time.
> 
> Fiddling with the machinefile (to try to narrow it down to misbehaving
> hardware) -- for instance, using only a specific set of nodes, etc. --
> had no effect.
> 
> On to the results:
> 
> [root@zartan1 tmp]# mpirun -v -prefix $MPIHOME -mca btl openib,sm,self -np 148 -machinefile machines /tmp/laten -o 10
> 
> MPI Bidirectional latency test (Send/Recv)
>               Processes    Max Latency (us)
> ------------------------------------------
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 12 for wr_id 47120798794424 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121337969156 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338002208 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338035260 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338068312 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338101364 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338134416 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338167468 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338200520 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338233572 opcode 0
> 
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121340387456 opcode 0
> 
> If I use -np 145 (actually, any odd number of nodes; that may just be
> a case of running 'laten' incorrectly):
> 
> MPI Bidirectional latency test (Send/Recv)
>               Processes    Max Latency (us)
> ------------------------------------------
>                       2               8.249
>                       4              15.795
>                       8              21.803
>                      16              23.353
>                      32              21.601
>                      64              31.900
> [zartan75:06723] *** An error occurred in MPI_Group_incl
> [zartan75:06723] *** on communicator MPI_COMM_WORLD
> [zartan75:06723] *** MPI_ERR_RANK: invalid rank
> [zartan75:06723] *** MPI_ERRORS_ARE_FATAL (goodbye)
> 
> (***and more of the same, with different nodes)
> 
> 1 additional process aborted (not shown)
> 
> ***************************
> With Open MPI 1.1:
> mpirun -v -np 150 -prefix $MPIHOME -mca btl openib,sm,self -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
>               Processes    Max Latency (us)
> ------------------------------------------
>                       2              21.648
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 12 for wr_id 5775790 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 5865600 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7954692 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7967282 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7979872 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7992462 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8005052 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8017642 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8030232 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8042822 opcode 0
> 
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8055412 opcode 0
> --
> Troy Telford
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
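
In both quoted runs the first completion error carries status 12 and
every later one carries status 5. With output this noisy, it can help to
pull out just the first status and a count of the rest. The following is
a minimal sketch of that idea, not an Open MPI tool: the function name
summarize and the regular expression are made up for this example, and
the pattern is taken only from the message text shown above. It matches
the "error polling HP CQ" part of each message and ignores the process
name and source location, so it works whether or not the mail client
wrapped those onto separate lines.

```python
import re
from collections import Counter

# Matches only the message text seen in the quoted output.
PATTERN = re.compile(
    r"error polling HP CQ with status (\d+) for wr_id (\d+) opcode (\d+)")


def summarize(log_text):
    """Return (first_status, Counter of all statuses) for the log text.

    first_status is None when no matching line is found.
    """
    statuses = [int(m.group(1)) for m in PATTERN.finditer(log_text)]
    first = statuses[0] if statuses else None
    return first, Counter(statuses)


if __name__ == "__main__":
    # Three lines copied from the Open MPI 1.1 run quoted above.
    sample = """
    error polling HP CQ with status 12 for wr_id 5775790 opcode 0
    error polling HP CQ with status 5 for wr_id 5865600 opcode 0
    error polling HP CQ with status 5 for wr_id 7954692 opcode 0
    """
    first, counts = summarize(sample)
    print("first status:", first)
    print("counts:", dict(counts))
```

If these numbers are libibverbs completion statuses (an assumption on my
part), the first status is the one worth reading: the later ones are
usually just outstanding work requests being flushed after the queue pair
has already gone into the error state.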
