Re: [OMPI devel] Dual rail IB card problem

2015-09-01 Thread Brice Goglin
On 01/09/2015 15:59, marcin.krotkiewski wrote: > Dear Rolf and Brice, > > Thank you very much for your help. I have now moved the 'dubious' IB > card from Slot 1 to Slot 5. It is now reported by hwloc as bound to a > separate NUMA node. In this case OpenMPI works as expected: > > - NUM

Re: [OMPI devel] Dual rail IB card problem

2015-09-01 Thread Brice Goglin
Hello. It's a float because we normalize to 1 on the diagonal (some AMD machines have values like 10 on the diagonal and 16 or 22 otherwise, so you get 1.0, 1.6 or 2.2 after normalization), and also because some users wanted to specify their own distance matrix. I'd like to clean up the distance API
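[Editor's note: a minimal sketch of the normalization Brice describes, not hwloc's actual code. The 4x4 matrix values (10 on the diagonal, 16/22 elsewhere) are the AMD example above and come out as 1.0/1.6/2.2 after dividing by the diagonal.]

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        /* illustrative SLIT-style raw latencies, as in the AMD example above */
        float dist[N][N] = {
            {10, 16, 16, 22},
            {16, 10, 22, 16},
            {16, 22, 10, 16},
            {22, 16, 16, 10},
        };
        float diag = dist[0][0];          /* assumes a uniform diagonal */

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                dist[i][j] /= diag;       /* 10 -> 1.0, 16 -> 1.6, 22 -> 2.2 */

        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%.1f ", dist[i][j]);
            printf("\n");
        }
        return 0;
    }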

Re: [OMPI devel] Dual rail IB card problem

2015-08-31 Thread Gilles Gouaillardet
Brice, as a side note, what is the rationale for defining the distance as a floating point number? I remember I had to fix a bug in ompi a while ago /* e.g. replace if (d1 == d2) with if (fabs(d1 - d2) < epsilon) */ Cheers, Gilles On 9/1/2015 5:28 AM, Brice Goglin wrote: The locality of mlx4_0 as
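[Editor's note: a small sketch of the tolerance-based comparison Gilles refers to, not the actual OMPI patch; the helper name and epsilon value are illustrative. The absolute value matters, since (d1 - d2) < epsilon alone would also accept any d1 much smaller than d2.]

    #include <math.h>
    #include <stdbool.h>

    #define DIST_EPSILON 1e-3f   /* illustrative tolerance; OMPI's actual value may differ */

    /* Compare two normalized hwloc distances (floats) for "equality"
     * without relying on exact floating-point comparison. */
    static bool same_distance(float d1, float d2)
    {
        return fabsf(d1 - d2) < DIST_EPSILON;   /* instead of d1 == d2 */
    }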

Re: [OMPI devel] Dual rail IB card problem

2015-08-31 Thread Brice Goglin
The locality of mlx4_0, as reported by lstopo, is "near the entire machine" (while mlx4_1 is reported near NUMA node #3). I would vote for buggy PCI-NUMA affinity being reported by the BIOS. But I am not very familiar with 4x E5-4600 machines, so please make sure this PCI slot is really attached to a
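[Editor's note: an illustrative way (not from the thread) to check the PCI-NUMA affinity the kernel exposes for the device in question. The sysfs path assumes a Linux system with the mlx4 driver loaded; a value of -1 typically means the firmware did not associate the slot with any NUMA node, which matches the "near the entire machine" locality lstopo then has to report.]

    #include <stdio.h>

    int main(void)
    {
        /* mlx4_0 is the device from the report above; adjust as needed */
        const char *path = "/sys/class/infiniband/mlx4_0/device/numa_node";
        FILE *f = fopen(path, "r");
        int node;

        if (!f) {
            perror(path);
            return 1;
        }
        if (fscanf(f, "%d", &node) != 1) {
            fclose(f);
            fprintf(stderr, "unexpected contents in %s\n", path);
            return 1;
        }
        fclose(f);

        printf("mlx4_0: NUMA node %d%s\n", node,
               node < 0 ? " (no PCI-NUMA affinity reported by the firmware)" : "");
        return 0;
    }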

Re: [OMPI devel] Dual rail IB card problem

2015-08-31 Thread Atchley, Scott
What is the output of /sbin/lspci -tv? On Aug 31, 2015, at 4:06 PM, Rolf vandeVaart wrote: > There was a problem reported on the users list about Open MPI always picking > one Mellanox card when there were two in the machine. > > http://www.open-mpi.org/community/lists/users/2015/08/27507.php

[OMPI devel] Dual rail IB card problem

2015-08-31 Thread Rolf vandeVaart
There was a problem reported on the users list about Open MPI always picking one Mellanox card when there were two in the machine. http://www.open-mpi.org/community/lists/users/2015/08/27507.php We dug a little deeper and I think this has to do with how hwloc is figuring out where one of the
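[Editor's note: a rough sketch, written against the hwloc 1.x API current at the time of this thread, of the kind of locality query at issue: enumerate the OpenFabrics OS devices (mlx4_0, mlx4_1, ...) and print the non-I/O object hwloc attaches them to. If one card's nearest non-I/O ancestor is the whole Machine rather than a NUMA node, that is the symptom discussed above. This is not Open MPI's actual selection code.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_obj_t osdev = NULL;

        hwloc_topology_init(&topo);
        /* hwloc 1.x flag to include I/O devices; filtering works differently in hwloc 2.x */
        hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
        hwloc_topology_load(topo);

        while ((osdev = hwloc_get_next_osdev(topo, osdev)) != NULL) {
            if (osdev->attr->osdev.type != HWLOC_OBJ_OSDEV_OPENFABRICS)
                continue;   /* keep only InfiniBand-style devices */
            hwloc_obj_t parent = hwloc_get_non_io_ancestor_obj(topo, osdev);
            char *cpuset_str;
            hwloc_bitmap_asprintf(&cpuset_str, parent->cpuset);
            printf("%s is near %s (cpuset %s)\n", osdev->name,
                   hwloc_obj_type_string(parent->type), cpuset_str);
            free(cpuset_str);
        }

        hwloc_topology_close(topo);
        return 0;
    }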