[OMPI users] Can't build openmpi-1.6.5 with latest FCA 2.5 release.

2014-01-30 Thread Brock Palen
I grabbed the latest FCA release from Mellanox's website.  We have been building 
against FCA 2.5 for a while, but it never worked right.  Today I tried to build 
against the latest (the version number is still 2.5, but I think we have updated 
our OFED since the last install).  We are running MOFED 1.5.3-4.0.42.

Open MPI 1.6.5 configures fine against the old 2.5 FCA library I have around (I don't 
recall what OFED it expected), but the new one, which claims to be for RHEL6.4 
OFED 1.5.3-4.0.42, fails in configure with:

/home/software/rhel6/fca/2.5-v2/lib/libfca.so: undefined reference to 
`smp_mkey_set@IBMAD_1.3'

libibmad is installed, but the symbol smp_mkey_set is not defined in it; the 
IBMAD_1.3 version node is, though.
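(For reference, one way to compare which versioned symbol the new libfca.so wants 
against what the installed libibmad actually exports - the libibmad path below is a 
guess for a stock RHEL6 install, adjust as needed:)

    # what does the new libfca.so require?
    objdump -T /home/software/rhel6/fca/2.5-v2/lib/libfca.so | grep -i mkey

    # what does the installed libibmad provide, and at which version?
    objdump -T /usr/lib64/libibmad.so.5 | grep -i mkey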

Any thoughts on what may cause this?  As far as I know our MOFED is from Mellanox 
and should match up fine with their release of FCA, so this has me scratching my 
head.

Thanks

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985







Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-30 Thread Tim Prince


On 1/29/2014 11:30 PM, Ralph Castain wrote:


On Jan 29, 2014, at 7:56 PM, Victor wrote:


Thanks for the insights Tim. I was aware that the CPUs will choke 
beyond a certain point. From memory on my machine this happens with 5 
concurrent MPI jobs with that benchmark that I am using.


My primary question was about scaling between the nodes. I was not 
getting close to double the performance when running MPI jobs across 
two 4-core nodes. It may be better now since I have Open-MX in place, 
but I have not repeated the benchmarks yet since I need to get one 
simulation job done asap.


Some of that may be due to expected loss of performance when you 
switch from shared memory to inter-node transports. While it is true 
about saturation of the memory path, what you reported could be more 
consistent with that transition - i.e., it isn't unusual to see 
applications perform better when run on a single node, depending upon 
how they are written, up to a certain size of problem (which your code 
may not be hitting).




Regarding your mention of setting affinities and MPI ranks, do you 
have specific (as in syntactically specific, since I am a novice and 
easily confused...) examples of how I might set affinities to get 
the Westmere node performing better?


mpirun --bind-to-core -cpus-per-rank 2 ...

will bind each MPI rank to 2 cores. Note that this will definitely 
*not* be a good idea if you are running more than two threads in your 
process - if you are, then set --cpus-per-rank to the number of 
threads, keeping in mind that you want things to break evenly across 
the sockets. In other words, if you have two 6-core/socket Westmeres 
on the node, then you either want to run 6 processes at cpus-per-rank=2 
if each process runs 2 threads, or 4 processes with cpus-per-rank=3 if 
each process runs 3 threads, or 2 processes with no cpus-per-rank but 
--bind-to-socket instead of --bind-to-core for any other thread count 
> 3.


You would not want to run any other number of processes on the node or 
else the binding pattern will cause a single process to split its 
threads across the sockets - which will definitely hurt performance.
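For concreteness, the three layouts above would look roughly like this (a sketch 
assuming a single node with two 6-core Westmere sockets and a placeholder binary 
./app; --report-bindings just prints the resulting binding for each rank):

    # 6 ranks, 2 threads each: one rank per pair of adjacent cores
    mpirun -np 6 --bind-to-core -cpus-per-rank 2 --report-bindings ./app

    # 4 ranks, 3 threads each: still splits evenly across the two sockets
    mpirun -np 4 --bind-to-core -cpus-per-rank 3 --report-bindings ./app

    # 2 ranks, more than 3 threads each: one rank per socket
    mpirun -np 2 --bind-to-socket --report-bindings ./app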



-cpus-per-rank 2 is an effective choice for this platform.  As Ralph 
said, it should work automatically for 2 threads per rank.
Ralph's point about not splitting a process across sockets is an 
important one.  Even splitting a process across internal buses, which 
would happen with 3 threads per process, seems problematic.


--
Tim Prince



Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-30 Thread Tim Prince


On 1/29/2014 10:56 PM, Victor wrote:
Thanks for the insights Tim. I was aware that the CPUs will choke 
beyond a certain point. From memory on my machine this happens with 5 
concurrent MPI jobs with that benchmark that I am using.


Regarding your mention of setting affinities and MPI ranks, do you have 
specific (as in syntactically specific, since I am a novice and 
easily ...) examples of how I might set affinities to get the 
Westmere node performing better?


ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, 
Component v1.6.5)


I haven't worked with current OpenMPI on Intel Westmere, although I do 
have a Westmere as my only dual-CPU platform.  Ideally, the current 
scheme OpenMPI uses for MPI/OpenMP hybrid affinity will make it easy to 
allocate adjacent pairs of cores to ranks: [0,1], [2,3], [4,5], ...
hwloc will not be able to see whether cores [0,1] and [2,3] are actually 
the pairs sharing an internal cache bus, and Intel never guaranteed it, 
but that is the only way I've seen it done (presumably controlled by BIOS).
If you had a requirement to run 1 rank per CPU, with 4 threads per CPU, 
you would pin a thread to each of the core pairs [0,1] and [2,3] 
(and [6,7], [8,9]).  If required to run 8 threads per CPU, using 
HyperThreading, you would pin 1 thread to each of the first 4 cores on 
each CPU and 2 threads each to the remaining cores (the ones which don't 
share cache paths).
Likewise, when you are testing pure MPI scaling, you would take care not 
to place a 2nd rank on a core pair which shares an internal bus until 
you are using all 4 internal bus resources, and you would load up the 2 
CPUs symmetrically.  You might find that 8 ranks with optimized 
placement gave nearly the performance of 12 ranks, and that you need an 
effective hybrid MPI/OpenMP scheme to get perhaps 25% additional performance by 
using the remaining cores.  I've never seen an automated scheme to deal 
with this.  If you ignored the placement requirements, you would find 
that 8 ranks on the 12-core platform didn't perform as well as on the 
similar 8-core platform.
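If you wanted to hand-place ranks like that, Open MPI 1.6 rankfiles are one way to 
do it.  A rough sketch for 8 ranks on a 2 x 6-core node follows; the host name and 
the choice of core numbers are placeholders, since (as noted above) hwloc cannot 
tell you which cores actually share a bus, so the pairing has to be worked out for 
your particular BIOS:

    # myrankfile - slot=<socket>:<core>
    rank 0=node1 slot=0:0
    rank 1=node1 slot=0:2
    rank 2=node1 slot=0:4
    rank 3=node1 slot=0:1
    rank 4=node1 slot=1:0
    rank 5=node1 slot=1:2
    rank 6=node1 slot=1:4
    rank 7=node1 slot=1:1

    mpirun -np 8 -rf myrankfile --report-bindings ./app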
Needless to say, these special requirements of this CPU model have 
eluded even experts, and led to it not being used to full 
effectiveness.  The reason we got into this is your remark that it 
seemed strange to you that you didn't gain performance when you added a 
rank, presumably a 2nd rank on a core pair sharing an internal bus.
You seem to have the impression that MPI performance scaling could be 
linear with the number of cores in use.  Such an expectation is 
unrealistic given that the point of multi-core platforms is to share 
memory and other resources and support more ranks without a linear 
increase in cost.
In your efforts to make an effective cluster out of nodes of dissimilar 
performance levels, you may need to explore means of evening up the 
performance per rank, such as more OpenMP threads per rank on the lower 
performance CPUs.  It really doesn't look like a beginner's project.
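As one hedged sketch of that idea, Open MPI's colon-separated app contexts let you 
launch different rank counts and thread counts per host (host names, rank counts 
and OMP_NUM_THREADS values below are made up for illustration - the point is 
simply fewer, fatter ranks on the slower box):

    mpirun -np 12 -host fastnode -x OMP_NUM_THREADS=1 ./app : \
           -np 2  -host slownode -x OMP_NUM_THREADS=2 ./app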


--
Tim Prince



Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-30 Thread Ralph Castain

On Jan 30, 2014, at 12:38 AM, Victor  wrote:

> I use htop and top, and until now I did not make the connection that each 
> listed process is actually a thread...
> 
> Thus the application that I am running is single threaded. How does that 
> affect the CPU affinity and rank settings?


It affects it very much - there is no value in running a rank on multiple cores 
in that case. Just do "mpirun --bind-to-core".

Threading the app is a mixed bag - it might help, but there is a penalty as 
well since you have to do all that thread locking. For now, just bind-to-core.
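For example (the executable name is a placeholder; --report-bindings prints where 
each rank ended up):

    mpirun -np 4 --bind-to-core --report-bindings ./your_app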


> <-- as mentioned earlier I am a novice, and very easily confused :-)
> 
> 
> 
> On 30 January 2014 15:59, John Hearns  wrote:
> Ps. 'htop' is a good tool for looking at where processes are running.
> 



Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-30 Thread Victor
I use htop and top, and until now I did not make the connection that each
listed process is actually a thread...

Thus the application that I am running is single threaded. How does that
affect the CPU affinity and rank settings? <-- as mentioned earlier I am a
novice, and very easily confused :-)



On 30 January 2014 15:59, John Hearns  wrote:

> Ps. 'htop' is a good tool for looking at where processes are running.


Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-30 Thread John Hearns
Ps. 'htop' is a good tool for looking at where processes are running.
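For instance, to check whether a given rank is really single-threaded (the process 
name is a placeholder):

    ps -eLf | grep your_app    # one line per thread (see the LWP/NLWP columns)
    top -H                     # or press 'H' inside a running top to toggle the per-thread view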


Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-30 Thread Victor
Thank you for the very detailed reply, Ralph. I will try what you say. I
will need to ask the developers whether the main solver process is
threaded.


On 30 January 2014 12:30, Ralph Castain  wrote:

>
> On Jan 29, 2014, at 7:56 PM, Victor  wrote:
>
> Thanks for the insights Tim. I was aware that the CPUs will choke beyond a
> certain point. From memory on my machine this happens with 5 concurrent MPI
> jobs with that benchmark that I am using.
>
> My primary question was about scaling between the nodes. I was not getting
> close to double the performance when running MPI jobs across two 4-core
> nodes. It may be better now since I have Open-MX in place, but I have not
> repeated the benchmarks yet since I need to get one simulation job done
> asap.
>
>
> Some of that may be due to expected loss of performance when you switch
> from shared memory to inter-node transports. While it is true about
> saturation of the memory path, what you reported could be more consistent
> with that transition - i.e., it isn't unusual to see applications perform
> better when run on a single node, depending upon how they are written, up
> to a certain size of problem (which your code may not be hitting).
>
>
> Regarding your mention of setting affinities and MPI ranks, do you have
> specific (as in syntactically specific, since I am a novice and easily
> confused...) examples of how I might set affinities to get the Westmere
> node performing better?
>
>
> mpirun --bind-to-core -cpus-per-rank 2 ...
>
> will bind each MPI rank to 2 cores. Note that this will definitely *not*
> be a good idea if you are running more than two threads in your process -
> if you are, then set --cpus-per-rank to the number of threads, keeping in
> mind that you want things to break evenly across the sockets. In other
> words, if you have two 6-core/socket Westmeres on the node, then you
> either want to run 6 processes at cpus-per-rank=2 if each process runs 2
> threads, or 4 processes with cpus-per-rank=3 if each process runs 3
> threads, or 2 processes with no cpus-per-rank but --bind-to-socket instead
> of --bind-to-core for any other thread number > 3.
>
> You would not want to run any other number of processes on the node or
> else the binding pattern will cause a single process to split its threads
> across the sockets - which will definitely hurt performance.
>
>
>
> ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0,
> Component v1.6.5)
>
> And finally to hybridisation... in a week or so I will get 4 AMD A10-6800
> machines with 8 GB each on loan and will attempt to make them work alongside the
> existing Intel nodes.
>
> Victor
>
>
> On 29 January 2014 22:03, Tim Prince  wrote:
>
>>
>> On 1/29/2014 8:02 AM, Reuti wrote:
>>
>>> Quoting Victor :
>>>
>>>  Thanks for the reply Reuti,

 There are two machines: Node1 with 12 physical cores (dual 6-core Xeon)
 and

>>>
>>> Do you have this CPU?
>>>
>>> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>>>
>>> -- Reuti
>>>
>> It's expected on the Xeon Westmere 6-core CPUs to see MPI performance
>> saturating when all 4 of the internal bus paths are in use.  For this
>> reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so
>> that each MPI rank has its own internal CPU bus, could out-perform plain
>> MPI on those CPUs.
>> That scheme of pairing cores on selected internal bus paths hasn't been
>> repeated.  Some influential customers learned to prefer the 4-core version
>> of that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
>> If you want to talk about "downright strange," start thinking about the
>> schemes to optimize performance of 8 threads with 2 threads assigned to
>> each internal CPU bus on that CPU model.  Or your scheme of trying to
>> balance MPI performance between very different CPU models.
>> Tim
>>
>>
>>>  Node2 with 4 physical cores (i5-2400).

 Regarding scaling on the single 12-core node, no, it is also not linear. In
 fact it is downright strange. I do not remember the numbers right now, but
 10 jobs are faster than 11, and 12 jobs are the fastest, with a peak performance
 of approximately 66 Msu/s, which is also far from triple the 4-core
 performance. This odd non-linear behaviour also happens at lower job
 counts on that 12-core node. I understand the decrease in scaling with
 increasing core count on the single node, as memory bandwidth is an
 issue.

 On the 4-core machine the scaling is progressive, i.e. every additional job
 brings an increase in performance. A single core delivers 8.1 Msu/s while 4
 cores deliver 30.8 Msu/s. This is almost linear.

 Since my original email I have also installed Open-MX and recompiled
 OpenMPI to use it. This has resulted in approximately 10% better
 performance using the existing GbE hardware.


 O