Hi,

We have run several tests to locate the problem. Some nodes fail when we run gm_allsize -v, and we have isolated them. On the good nodes, our broadcast test works correctly with MPICH-1, but it still fails with Open MPI 1.4.2.

We would like to activate the parity error check to see whether it solves our problems, but we don't know how to do it. Below we attach the output of the lspci command. We suppose that this error checking is not enabled.

Best regards,

  José I. Aliaga

==================
$ /sbin/lspci -vvxxx
...
02:03.0 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 03)
        Subsystem: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping+ SERR+ FastB2B-
        Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 217
        Region 0: Memory at fb000000 (32-bit, prefetchable) [size=16M]
        Expansion ROM at fce80000 [disabled] [size=512K]
00: c1 14 43 80 d6 01 20 04 03 00 80 02 10 40 00 00
10: 08 00 00 fb 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 c1 14 43 80
30: 00 00 e8 fc 00 00 00 00 00 00 00 00 0a 01 00 00
...
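
As a cross-check on the dump above, here is a minimal Python sketch that decodes the command and status registers from the "00:" row. It assumes the conventional PCI configuration header layout (command register at offset 0x04, status register at 0x06, both little-endian) and the usual bit positions for the parity-related flags. The function and table names are only illustrative.

```python
# Bit positions assumed from the conventional PCI configuration header.
COMMAND_BITS = {"ParErr": 6, "SERR": 8}    # parity error response, SERR# enable
STATUS_BITS = {"ParErr": 8, "<PERR": 15}   # master data parity error, detected parity error


def decode_header_row(row):
    """Return (command, status) decoded from the '00:' row of lspci -xxx."""
    offset, _, rest = row.partition(":")
    if int(offset, 16) != 0:
        raise ValueError("expected the row that starts at offset 00")
    data = bytes(int(byte, 16) for byte in rest.split())
    command = int.from_bytes(data[4:6], "little")
    status = int.from_bytes(data[6:8], "little")
    return command, status


def flags(value, bits):
    """Map each flag name to True if its bit is set in value."""
    return {name: bool((value >> bit) & 1) for name, bit in bits.items()}


row = "00: c1 14 43 80 d6 01 20 04 03 00 80 02 10 40 00 00"
command, status = decode_header_row(row)
print(hex(command), flags(command, COMMAND_BITS))
print(hex(status), flags(status, STATUS_BITS))
```

For this row the command register is 0x01d6, with the parity error response bit set, which agrees with the ParErr+ entry on the Control line. The status register is 0x0420, with neither parity flag set, which agrees with ParErr- and <PERR- on the Status line.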

On 21/05/2010, at 19:57, Patrick Geoffray wrote:

Hi Jose,

On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:
We have used the lspci -vvxxx and we have obtained:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
Ethernet Controller (Copper) (rev 02)

This is the output for the Intel GigE NIC, you should look at the one for the Myricom NIC and the PCI bridge above it (lspci -t to see the tree).

bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-

A PERR- status means no parity error was detected when receiving data. Looking at the PERR status of the PCI bridge on the other side will show whether there was any corruption on that bus.
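
To check these flags across many nodes without reading each line by eye, something like the following Python sketch could be used. It assumes the Control or Status line has already been joined into a single string (the output quoted above is wrapped across two lines), and the function name is only illustrative.

```python
def parse_flags(line):
    """Split an lspci 'Control:' or 'Status:' line into {flag_name: is_set}."""
    _, _, rest = line.partition(":")
    result = {}
    for token in rest.split():
        # Flags end in '+' (set) or '-' (clear); other fields such as
        # DEVSEL=medium are skipped.
        if token[-1] in "+-":
            result[token[:-1]] = token.endswith("+")
    return result


status_line = ("Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium "
               ">TAbort- <TAbort- <MAbort- >SERR- <PERR-")
status = parse_flags(status_line)
if status["<PERR"] or status["ParErr"]:
    print("parity error flagged on this device")
else:
    print("no parity error flagged on this device")
```

The same function can be run on the Status line of the Myricom NIC and of the PCI bridge above it, so that both sides of the bus are compared.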

As a first step, you can see if you can reproduce errors with a simple test involving a single node at a time. You can run "gm_allsize --verify" on each machine: it will send packets to itself (loopback in the switch) and check for corruption. If you don't see errors after a while, that node is probably clean. If you see errors, you can look deeper at lspci output to see if it's a PCI problem. If you are using a riser card, you can try without.

I am not sure if Open MPI has an option to enable a debug checksum, but it would also be useful to see if it detects anything.

Additionally, if you know of any software tool or methodology to check the
hardware/software, could you please tell us how to use it?

You may want to look at the FAQ on GM troubleshooting:
http://www.myri.com/cgi-bin/fom.pl?file=425

Additionally, you can send email to h...@myri.com to open a ticket.

Patrick


