Hi Jose,

On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
I think that I have found a bug in the implementation of the GM collective
routines included in Open MPI. The version of the GM software is 2.0.30
for the PCI64 cards.

I see the same problem with both version 1.4.1 and version 1.4.2.
Could you help me? Thanks.

We have been running the test you provided on 8 nodes for 4 hours and haven't seen any errors. The setup used GM 2.0.30 and Open MPI 1.4.2 on PCI-X cards (M3F-PCIXD-2, aka 'D' cards). We no longer have any PCI64 NICs, nor any machines with a PCI 64/66 slot.

One-bit errors are rarely a software problem; they are usually caused by hardware corruption. Old PCI has a simple parity check, but most machines and BIOSes of that era ignored the errors it reported. You may want to check the lspci output on your machines and see whether SERR or PERR is set. You can also try reseating each NIC in its PCI slot, or using a different slot if one is available.
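
If checking lspci by hand on every node is tedious, something like the rough sketch below could do it. It assumes the usual `lspci -vv` layout, where each device starts with an unindented header line and its indented "Status:" line carries flags such as ">SERR+" (signaled system error), "<PERR+" (detected parity error) or "ParErr+" (master data parity error). The names find_pci_errors and scan_local_machine are made up for this example, and the exact flag spelling may differ between pciutils versions, so compare against real output from your machines first.

```python
import re
import subprocess

# Error flags as they appear on the "Status:" line of `lspci -vv`.
# A trailing "+" means the bit is set, "-" means it is clear.
ERROR_FLAGS = re.compile(r"(>SERR|<PERR|ParErr)\+")


def find_pci_errors(lspci_text):
    """Map each device header to the error flags set on its Status line."""
    errors = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device entry.
            device = line.strip()
        elif device and line.strip().startswith("Status:"):
            # Only the status register line; "Control:" also mentions
            # ParErr/SERR, but there they are enable bits, not errors.
            flags = ERROR_FLAGS.findall(line)
            if flags:
                errors.setdefault(device, []).extend(flags)
    return errors


def scan_local_machine():
    """Run lspci on this host (assumes pciutils is installed) and report."""
    out = subprocess.run(["lspci", "-vv"], capture_output=True,
                         text=True, check=True).stdout
    for device, flags in find_pci_errors(out).items():
        print(device, "->", ", ".join(flags))
```

Calling scan_local_machine() on each node (root may be needed for complete lspci output) prints only the devices that have one of these bits set. A set bit on the Myrinet NIC or on the bridge above it would point to the slot or the card rather than to the MPI library.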

Hope it helps.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
