I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server (quad-core, 2.27 GHz). The interconnect is 4x QDR InfiniBand (Mellanox ConnectX).
I compiled and installed Open MPI 1.4.2. The kernel is 2.6.32.2, which I compiled myself, and I use Grid Engine 6.2u5. Open MPI was configured with "--with-libnuma --with-sge".

The problem is that I get very bad performance unless I explicitly exclude the "sm" BTL, and I can't figure out why. I have searched the web and the Open MPI mailing lists; there are reports of non-optimal performance, but my results are far worse than any I have found.

I run the "mpi_stress" program with different message lengths, on a single server using 8 slots so that all eight cores are occupied. With "-mca btl self,openib" I get pretty good results: between 450 MB/s and 700 MB/s depending on the message length. With "-mca btl self,sm" or "-mca btl self,sm,openib" I get only 25-30 MB/s for 1 MB messages, and around 5 MB/s for 10 kB messages. Things get about 20% faster if I set "-mca paffinity_alone 1".

What is going on? Any hints? I thought these CPUs had excellent shared-memory bandwidth over QuickPath; I expected several GB/s. Hyperthreading is enabled, if that is relevant. The locked-memory limit is 500 MB and the stack limit is 64 MB.

Please help! Thanks
/Oskar
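For reference, the invocations were along these lines (a sketch only; the process count matches the 8 cores per node, and the path to mpi_stress is assumed):

    # fast case: loopback + InfiniBand only, "sm" BTL excluded
    mpirun -np 8 -mca btl self,openib ./mpi_stress

    # slow cases: shared-memory BTL enabled
    mpirun -np 8 -mca btl self,sm ./mpi_stress
    mpirun -np 8 -mca btl self,sm,openib ./mpi_stress

    # about 20% faster with processor affinity enabled
    mpirun -np 8 -mca btl self,sm,openib -mca paffinity_alone 1 ./mpi_stress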