Hi,

We ran lspci -vvxxx and obtained:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
bi00:   Subsystem: Intel Corporation PRO/1000 XT Server Adapter
bi00:   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
bi00:   Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
bi00:   Latency: 64 (63750ns min), Cache Line Size: 64 bytes
bi00:   Interrupt: pin A routed to IRQ 185
bi00:   Region 0: Memory at fe9e0000 (64-bit, non-prefetchable) [size=128K]
bi00:   Region 2: Memory at fe9d0000 (64-bit, non-prefetchable) [size=64K]
bi00:   Region 4: I/O ports at dc80 [size=32]
bi00:   Expansion ROM at fe9c0000 [disabled] [size=64K]
bi00:   Capabilities: [dc] Power Management version 2
bi00:     Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0 +,D1-,D2-,D3hot+,D3cold-)
bi00:     Status: D0 PME-Enable- DSel=0 DScale=1 PME-
bi00:   Capabilities: [e4] PCI-X non-bridge device
bi00:     Command: DPERE- ERO+ RBC=512 OST=1
bi00:     Status: Dev=04:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
bi00:   Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
bi00:     Address: 0000000000000000  Data: 0000
bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00
bi00: 10: 04 00 9e fe 00 00 00 00 04 00 9d fe 00 00 00 00
bi00: 20: 81 dc 00 00 00 00 00 00 00 00 00 00 86 80 07 11
bi00: 30: 00 00 9c fe dc 00 00 00 00 00 00 00 05 01 ff 00
bi00: 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 e4 22 48
bi00: e0: 00 20 00 40 07 f0 02 00 08 04 43 04 00 00 00 00
bi00: f0: 05 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00

We don't know how to interpret this information. If we have understood the Status line (" ... >SERR- <PERR-") correctly, we suppose that SERR and PERR are not activated. Could you confirm that? If so, could you tell us how to activate them?

Additionally, if you know of any software tool or methodology for checking the hardware/software, could you please tell us how to use it?

Thanks in advance.

Best regards,

  José I. Aliaga

On 20/05/2010, at 16:29, Patrick Geoffray wrote:

Hi Jose,

On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
I think that I have found a bug in the implementation of the GM collective routines included in OpenMPI. The version of the GM software is 2.0.30 for the PCI64 cards.

I get the same problems with both version 1.4.1 and version 1.4.2.
Could you help me? Thanks.

We have been running the test you provided on 8 nodes for 4 hours and haven't seen any errors. The setup used GM 2.0.30 and openmpi 1.4.2 on PCI-X cards (M3F-PCIXD-2 aka 'D' cards). We do not have PCI64 NICs anymore, and no machines with a PCI 64/66 slot.

One-bit errors are rarely a software problem; they are usually linked to hardware corruption. Old PCI has a simple parity check, but most machines/BIOSes of this era ignored reported errors. You may want to check the lspci output on your machines and see if SERR or PERR is set. You can also try to reseat each NIC in its PCI slot, or use a different slot if available.
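
A minimal sketch of that check, assuming the conventional PCI configuration header layout (16-bit Command word at offset 0x04, Status word at 0x06, both little-endian). It decodes the first row of an lspci -x hex dump. In the Command word, ParErr and SERR are enable bits; in the Status word, >SERR and <PERR are latched error events, so a "-" there means no error has been recorded, not that reporting is disabled. The helper names are hypothetical and not part of any tool.

```python
# Sketch: decode the PCI Command/Status words from the first row of an
# `lspci -x` hex dump. Bit positions assume the standard PCI header layout;
# the helper names are illustrative, not part of any tool.

COMMAND_BITS = {"ParErr": 6, "SERR": 8}      # enable bits (Control line)
STATUS_BITS = {">SERR": 14, "<PERR": 15}     # latched error bits (Status line)


def parse_hex_row(row):
    """Return the byte values of a row such as '00: 86 80 08 10 ...'."""
    data = row.rpartition(":")[2]   # drop the offset and any host prefix
    return [int(byte, 16) for byte in data.split()]


def decode_command_status(row):
    """Return (command, status, flags) for the row at offset 0x00."""
    b = parse_hex_row(row)
    command = b[0x04] | (b[0x05] << 8)   # little-endian 16-bit word at 0x04
    status = b[0x06] | (b[0x07] << 8)    # little-endian 16-bit word at 0x06
    flags = {}
    for name, bit in COMMAND_BITS.items():
        flags["Control " + name] = bool((command >> bit) & 1)
    for name, bit in STATUS_BITS.items():
        flags["Status " + name] = bool((status >> bit) & 1)
    return command, status, flags


# First hex row of the dump posted for device 04:01.0.
row = "bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00"
command, status, flags = decode_command_status(row)
print(f"Command=0x{command:04x} Status=0x{status:04x}")
for name, is_set in flags.items():
    print(f"{name}{'+' if is_set else '-'}")
```

For the dump posted above, this gives Command=0x0117 and Status=0x0230. SERR reporting is enabled in the Command word, parity error response (ParErr) is not, and neither error bit is latched in Status.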

Hope it helps.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


