Hi! We have one user code that runs into lots of RNR (receiver not ready) errors and sometimes hangs. (The same code runs OK on another IB-based system with full connectivity, and on our Myrinet system.)
The IB network has a 7:3 oversubscription, i.e. 7 nodes per 3 IB links up to the main Cisco switch. In other words, we have 48 bladecenters with 14 blades (8 cores each) in each, with an IB switch per bladecenter and 2x3 IB links per bladecenter to the main Cisco switch.

Now to the question: do you have any good suggestions on parameters that will help us get around this problem? I tried changing the queue-pair settings, and that does affect the problem, but so far I haven't been able to fix it completely. The code usually works when running with nodes=8:ppn=8, but always fails sooner or later with nodes=16:ppn=8. Turning off leave_pinned also helps a bit.

The best settings I have so far are:

    -mca mpi_leave_pinned 0
    -mca btl_openib_receive_queues "P,128,512:S,2048,512,128,2:S,12288,512,128,2:S,65536,512,128,2"

I have tried almost anything I can think of and desperately need help here.

Building everything in debug mode helps somewhat, because the code gets so slow that the network can keep up a lot better, but it does not fix the problem completely.

OS: CentOS 5.3 (OFED 1.3.2 and 1.4.2 tested)
HW: Mellanox MT25208 InfiniHost III Ex (128MB)

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134
Fax: +46 90 7866126   Mobile: +46 70 7716134
WWW: http://www.hpc2n.umu.se
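
For readers unfamiliar with the format, the btl_openib_receive_queues value quoted above can be split into its fields roughly as in the following minimal Python sketch. The field names are an assumption based on the Open MPI FAQ description of this parameter (per-peer "P" queues and shared "S" queues only); the function name parse_receive_queues is hypothetical and not part of Open MPI.

```python
# Field names are assumptions taken from the Open MPI FAQ description of
# btl_openib_receive_queues; they are not identifiers used by Open MPI itself.
PER_PEER_FIELDS = ["buffer_size", "num_buffers", "low_watermark",
                   "credit_window", "reserved_for_credits"]
SHARED_FIELDS = ["buffer_size", "num_buffers", "low_watermark",
                 "max_pending_sends"]


def parse_receive_queues(spec):
    """Split a receive_queues string into one dict per queue (hypothetical helper)."""
    queues = []
    for entry in spec.split(":"):
        # Tolerate stray whitespace around the comma-separated values.
        parts = [p.strip() for p in entry.split(",")]
        kind, values = parts[0], [int(v) for v in parts[1:]]
        if kind == "P":
            names = PER_PEER_FIELDS
        elif kind == "S":
            names = SHARED_FIELDS
        else:
            raise ValueError(f"unknown queue type: {kind!r}")
        # Size and buffer count are required; the remaining fields are optional.
        if not 2 <= len(values) <= len(names):
            raise ValueError(f"bad field count in {entry!r}")
        queue = {"type": kind}
        queue.update(zip(names, values))
        queues.append(queue)
    return queues


SPEC = "P,128,512:S,2048,512,128,2:S,12288,512,128,2:S,65536,512,128,2"

if __name__ == "__main__":
    for q in parse_receive_queues(SPEC):
        print(q)
```

Read this way, the per-peer queue in the settings above gives only a buffer size and a buffer count, while each of the three shared queues also sets a low watermark of 128 and a maximum of 2 pending sends.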