On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> 
> >Time to dig up diagnostics tools and look at port statistics.
> >  
> You can use the ibdiagnet tool for network debugging:
> http://linux.die.net/man/1/ibdiagnet (this tool is part of OFED).
> 
> Pasha.

Thanks Pasha!
ibdiagnet reports the following:

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0bffff4ced75 dev=25218 can not join
    due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter.  Now I just need to find which
system this port maps to.
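
One thing to keep in mind when tracking the port down: ibdiagnet prints
the LID in hex (0x00e2), while ibcheckerrors below prints LIDs in decimal.
A minimal Python sketch for pulling the fields out of the warning line and
converting the LID follows.  The regular expression and the function name
parse_port_warning are my own, written against this one line of output;
they are not an ibdiagnet format specification.

```python
import re

# First line of the ibdiagnet warning quoted above.
WARNING = ("-W- Port localhost/P1 lid=0x00e2 guid=0x001e0bffff4ced75 "
           "dev=25218 can not join")

# Assumption: pattern derived from this single sample line only.
PORT_RE = re.compile(
    r"Port (?P<name>\S+) lid=(?P<lid>0x[0-9a-fA-F]+) "
    r"guid=(?P<guid>0x[0-9a-fA-F]+) dev=(?P<dev>\d+)"
)


def parse_port_warning(line):
    """Return the port fields from a warning line, or None if no match."""
    m = PORT_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "lid_hex": m.group("lid"),
        "lid_dec": int(m.group("lid"), 16),  # decimal form, as ibcheckerrors prints it
        "guid": m.group("guid"),
        "dev": int(m.group("dev")),
    }


info = parse_port_warning(WARNING)
print(info)
```

The decimal LID and the GUID can then be matched against whatever node
list or topology dump is available for the fabric.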

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns.  Here is the tail end of the output, covering only the last few
ports reported:

#warn: counter SymbolErrors = 36905     (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23  (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641        (threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225   (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10         (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14:  FAILED 
#warn: counter LinkRecovers = 181       (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417  (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1:  FAILED 
#warn: counter LinkRecovers = 103       (threshold 10) lid 193 port 3
#warn: counter RcvErrors = 9035         (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670         (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3:  FAILED 
#warn: counter SymbolErrors = 13151     (threshold 10) lid 193 port 4
#warn: counter RcvErrors = 109  (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507   (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4:  FAILED 

## Summary: 209 nodes checked, 0 bad nodes found
##          716 ports checked, 103 ports have errors beyond threshold
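
With 103 ports over threshold, it may help to group the warnings per port
rather than read them line by line.  Below is a rough Python sketch that
does this for "#warn:" lines in the format shown above.  The regular
expression and the names summarize and SAMPLE are mine, and the pattern is
inferred only from this output, so other versions of ibcheckerrors may
need adjustments.

```python
import re
from collections import defaultdict

# Assumption: line layout inferred from the "#warn:" lines pasted above.
WARN_RE = re.compile(
    r"#warn: counter (?P<counter>\w+) = (?P<value>\d+)\s+"
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)


def summarize(text):
    """Map (lid, port) -> {counter name: value} for every #warn line."""
    ports = defaultdict(dict)
    for line in text.splitlines():
        m = WARN_RE.search(line)
        if m:
            key = (int(m.group("lid")), int(m.group("port")))
            ports[key][m.group("counter")] = int(m.group("value"))
    return dict(ports)


# A few lines copied from the output above.
SAMPLE = """\
#warn: counter SymbolErrors = 36905     (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23  (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641        (threshold 10) lid 193 port 14
#warn: counter LinkRecovers = 181       (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417  (threshold 100) lid 193 port 1
#warn: counter RcvSwRelayErrors = 64670         (threshold 100) lid 193 port 3
#warn: counter SymbolErrors = 13151     (threshold 10) lid 193 port 4
"""

summary = summarize(SAMPLE)
for (lid, port), counters in sorted(summary.items()):
    # One line per port, listing every counter that exceeded its threshold.
    details = ", ".join(f"{name}={value}" for name, value in counters.items())
    print(f"lid {lid} port {port}: {details}")
```

Feeding the full ibcheckerrors output through summarize instead of SAMPLE
gives one entry per failing port, which makes it easier to see whether the
errors cluster on a few switch ports or are spread across the fabric.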


I wonder if this is something that needs to be tuned in the InfiniBand
switch or if there is something in Open MPI/OpenIB that can be tuned.

Thanks,
Jan Lindheim
