Hi,
We ran lspci -vvxxx and obtained the following:
bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
bi00: Subsystem: Intel Corporation PRO/1000 XT Server Adapter
bi00: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
bi00: Latency: 64 (63750ns min), Cache Line Size: 64 bytes
bi00: Interrupt: pin A routed to IRQ 185
bi00: Region 0: Memory at fe9e0000 (64-bit, non-prefetchable) [size=128K]
bi00: Region 2: Memory at fe9d0000 (64-bit, non-prefetchable) [size=64K]
bi00: Region 4: I/O ports at dc80 [size=32]
bi00: Expansion ROM at fe9c0000 [disabled] [size=64K]
bi00: Capabilities: [dc] Power Management version 2
bi00: Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
bi00: Status: D0 PME-Enable- DSel=0 DScale=1 PME-
bi00: Capabilities: [e4] PCI-X non-bridge device
bi00: Command: DPERE- ERO+ RBC=512 OST=1
bi00: Status: Dev=04:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
bi00: Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
bi00: Address: 0000000000000000 Data: 0000
bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00
bi00: 10: 04 00 9e fe 00 00 00 00 04 00 9d fe 00 00 00 00
bi00: 20: 81 dc 00 00 00 00 00 00 00 00 00 00 86 80 07 11
bi00: 30: 00 00 9c fe dc 00 00 00 00 00 00 00 05 01 ff 00
bi00: 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 e4 22 48
bi00: e0: 00 20 00 40 07 f0 02 00 08 04 43 04 00 00 00 00
bi00: f0: 05 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00
We are not sure how to interpret this information. Judging from the
Status line (" ... >SERR- <PERR-"), we assume that SERR and PERR are
not enabled.
Could you confirm this? If so, could you tell us how to enable them?
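For reference, the Control and Status lines that lspci prints are decoded
from the 16-bit Command and Status registers at offsets 0x04 and 0x06 of
the PCI configuration header, i.e. bytes 17 01 30 02 in the 00: row of the
dump above. What follows is a minimal Python sketch of that decoding,
assuming the standard PCI header bit layout; the names decode_row0 and
flags are made up for illustration. Note that in the Status register
>SERR and <PERR are latched error indicators (an error was signaled or
detected), whereas the enable bits are ParErr and SERR in the Command
register.

```python
# Bit positions assumed from the standard PCI configuration header.
COMMAND_BITS = {
    "I/O": 0, "Mem": 1, "BusMaster": 2, "SpecCycle": 3, "MemWINV": 4,
    "VGASnoop": 5, "ParErr": 6, "Stepping": 7, "SERR": 8, "FastB2B": 9,
}
STATUS_BITS = {
    "Cap": 4, "66MHz": 5, "UDF": 6, "FastB2B": 7, "ParErr": 8,
    ">TAbort": 11, "<TAbort": 12, "<MAbort": 13, ">SERR": 14, "<PERR": 15,
}


def decode_row0(line):
    """Return (command, status) from the '00:' row of an lspci -x dump."""
    # The last 16 tokens are the byte values; anything before them
    # (host prefix, '00:' offset) is ignored.
    data = [int(tok, 16) for tok in line.split()[-16:]]
    command = data[4] | (data[5] << 8)  # little-endian word at 0x04
    status = data[6] | (data[7] << 8)   # little-endian word at 0x06
    return command, status


def flags(value, bits):
    """Map each flag name to True if its bit is set in value."""
    return {name: bool((value >> pos) & 1) for name, pos in bits.items()}


# First row of the hex dump shown above.
ROW0 = "bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00"

command, status = decode_row0(ROW0)
cmd = flags(command, COMMAND_BITS)
sts = flags(status, STATUS_BITS)

print(f"Command = 0x{command:04x}, Status = 0x{status:04x}")
print("Enabled in Command:", [n for n, on in cmd.items() if on])
print("Set in Status:     ", [n for n, on in sts.items() if on])
```

For this dump the sketch gives Command = 0x0117 and Status = 0x0230:
parity error response (ParErr) is disabled and SERR is enabled in the
Command register, and neither >SERR nor <PERR is latched in Status, which
agrees with the Control and Status lines printed by lspci.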
Additionally, if you know of any software tool or method for checking
the hardware/software, could you please tell us how to use it?
Thanks in advance.
Best regards,
José I. Aliaga
On 20/05/2010, at 16:29, Patrick Geoffray wrote:
Hi Jose,
On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
I think that I have found a bug in the implementation of the GM
collective routines included in OpenMPI. The version of the GM
software is 2.0.30 for the PCI64 cards.
I get the same problems with both version 1.4.1 and version 1.4.2.
Could you help me? Thanks.
We have been running the test you provided on 8 nodes for 4 hours
and haven't seen any errors. The setup used GM 2.0.30 and openmpi
1.4.2 on PCI-X cards (M3F-PCIXD-2 aka 'D' cards). We do not have
PCI64 NICs anymore, and no machines with a PCI 64/66 slot.
One-bit errors are rarely a software problem; they are usually
linked to hardware corruption. Old PCI has a simple parity check,
but most machines/BIOSes of this era ignored reported errors. You may
want to check the lspci output on your machines and see if SERR or
PERR is set. You can also try to reseat each NIC in its PCI slot, or
use a different slot if available.
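As a rough illustration of that check, the Python sketch below scans
lspci -vv text and collects, per device, any error flags shown with a
trailing "+". It is only a sketch under stated assumptions: the function
name latched_errors is made up, device addresses are assumed to be in
bus:dev.fn form without a PCI domain, and the second device in the sample
(04:02.0 with <PERR+) is invented purely to show what a hit looks like.

```python
import re

# Device header line, optionally preceded by a "host:" prefix such as "bi00:".
DEVICE_RE = re.compile(r"^(?:\w+:\s+)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# Error flags from the PCI Status line that lspci marks as set ("+").
ERROR_RE = re.compile(r"[<>](?:TAbort|MAbort|SERR|PERR)\+")


def latched_errors(lspci_text):
    """Map each device address to the error flags that are shown as set."""
    found = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            found.setdefault(current, [])
        if current is not None:
            found[current].extend(ERROR_RE.findall(line))
    return found


# First device: lines taken from the output earlier in this thread.
# Second device: hypothetical, invented to demonstrate a latched <PERR.
SAMPLE = """\
bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
bi00: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
bi00: 04:02.0 Hypothetical device (not from the real output)
bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+
"""

for device, errors in latched_errors(SAMPLE).items():
    print(device, "errors:", errors if errors else "none")
```

On a live machine the text could come from
subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout
instead of the SAMPLE string; running lspci as root may be needed for the
full flag output.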
Hope it helps.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com