Re: [OMPI users] GM + OpenMPI bug ...

José Ignacio Aliaga Estellés Mon, 31 May 2010 07:27:45 -0400

Hi,

We have run several tests to locate the problem. Some nodes don't work correctly when we run gm_allsize -v, and we have isolated them. On the good nodes, we ran our broadcast test with MPICH-1 and it works correctly, but with OpenMPI 1.4.2 it still fails.

We would like to activate the parity error check to test whether this option solves our problems, but we don't know how to do it. Below, we attach the output of the lspci command. We suppose that this error check is not enabled.


Best regards,

  José I. Aliaga

==================
$ /sbin/lspci -vvxxx
...

02:03.0 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 03)
        Subsystem: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping+ SERR+ FastB2B-
        Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 217
        Region 0: Memory at fb000000 (32-bit, prefetchable) [size=16M]
        Expansion ROM at fce80000 [disabled] [size=512K]
00: c1 14 43 80 d6 01 20 04 03 00 80 02 10 40 00 00
10: 08 00 00 fb 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 c1 14 43 80
30: 00 00 e8 fc 00 00 00 00 00 00 00 00 0a 01 00 00
...
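
As a cross-check on the flags above, the first row of the hex dump can be decoded by hand. The following is a minimal Python sketch, assuming the standard PCI configuration header layout (Command register at offset 0x04, Status register at offset 0x06, both little-endian). The table and function names are only illustrative. For this dump it reports the Parity Error Response bit (printed by lspci as ParErr on the Control line) as set, which would suggest that parity checking is already enabled on this device, and it shows no parity error recorded in Status.

```python
# Sketch: decode the PCI Command and Status registers from an lspci -x dump row.
# Bit positions follow the standard PCI configuration header; the helper names
# are illustrative and not part of any tool.

COMMAND_BITS = {
    0: "I/O",
    1: "Mem",
    2: "BusMaster",
    3: "SpecCycle",
    4: "MemWINV",
    5: "VGASnoop",
    6: "ParErr",      # Parity Error Response enable
    7: "Stepping",
    8: "SERR",        # SERR# enable
    9: "FastB2B",
}

STATUS_BITS = {
    4: "Cap",
    5: "66MHz",
    6: "UDF",
    7: "FastB2B",
    8: "ParErr",      # Master Data Parity Error
    11: ">TAbort",
    12: "<TAbort",
    13: "<MAbort",
    14: ">SERR",
    15: "<PERR",      # Detected Parity Error
}


def parse_row(row):
    """Turn '00: c1 14 ...' into a list of byte values."""
    _, _, data = row.partition(":")
    return [int(b, 16) for b in data.split()]


def decode(word, names):
    """Map each named bit of a 16-bit register to True/False."""
    return {name: bool(word >> bit & 1) for bit, name in names.items()}


row0 = "00: c1 14 43 80 d6 01 20 04 03 00 80 02 10 40 00 00"
cfg = parse_row(row0)

command = cfg[0x04] | cfg[0x05] << 8   # little-endian 16-bit value
status = cfg[0x06] | cfg[0x07] << 8

control_flags = decode(command, COMMAND_BITS)
status_flags = decode(status, STATUS_BITS)

# DEVSEL timing lives in Status bits 9-10.
devsel = ("fast", "medium", "slow", "reserved")[status >> 9 & 3]

for title, flags in (("Control", control_flags), ("Status", status_flags)):
    print(title + ":", " ".join(n + ("+" if v else "-") for n, v in flags.items()))
print("DEVSEL=" + devsel)
```

The decoded flags can be compared against the Control and Status lines that lspci prints for the same device.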

On 21/05/2010, at 19:57, Patrick Geoffray wrote:

Hi Jose,

On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:
> We have used the lspci -vvxxx and we have obtained:
>
> bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
> Ethernet Controller (Copper) (rev 02)

This is the output for the Intel GigE NIC; you should look at the one for the Myricom NIC and the PCI bridge above it (lspci -t to see the tree).

> bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-

A PERR- status means no parity error was detected when receiving data. Looking at the PERR status of the PCI bridge on the other side will show if there was any corruption on that bus.
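
To compare these flags across several devices or nodes, the flag lines printed by lspci -vv can be parsed mechanically. This is a minimal Python sketch, assuming the NAME+/NAME- token format seen in the output quoted in this thread; parse_flags and parity_error_seen are hypothetical helper names, not part of lspci or GM.

```python
import re

# Sketch: parse an lspci -vv "Status:" line into a dict of flag -> bool.
# Assumes each flag is printed as NAME+ (set) or NAME- (clear), as in the
# output quoted in this thread. Helper names are hypothetical.

FLAG_RE = re.compile(r"([<>]?[A-Za-z0-9/]+)([+-])(?=\s|$)")


def parse_flags(line):
    """Return {flag_name: True/False} for every NAME+/NAME- token in line."""
    return {name: sign == "+" for name, sign in FLAG_RE.findall(line)}


def parity_error_seen(flags):
    """True if either parity-related Status flag is set."""
    return flags.get("ParErr", False) or flags.get("<PERR", False)


line = ("Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium "
        ">TAbort- <TAbort- <MAbort- >SERR- <PERR-")
flags = parse_flags(line)
print(flags)
print("parity error recorded:", parity_error_seen(flags))
```

Running the same check on the Status line of the PCI bridge above the Myricom NIC would show whether that side of the bus has recorded a parity error.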

As a first step, you can see if you can reproduce errors with a simple test involving a single node at a time. You can run "gm_allsize --verify" on each machine: it will send packets to itself (loopback in the switch) and check for corruption. If you don't see errors after a while, that node is probably clean. If you see errors, you can look deeper at the lspci output to see if it's a PCI problem. If you are using a riser card, you can try without it.

I am not sure if OpenMPI has an option to enable a debug checksum, but it would also be useful to see if it detects anything.

> Additionally, if you know any software tool or methodology to check the
> hardware/software, please, could you send us how to do it?

You may want to look at the FAQ on GM troubleshooting:
http://www.myri.com/cgi-bin/fom.pl?file=425

Additionally, you can send email to h...@myri.com to open a ticket.

Patrick
