Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Eugene Loh




Gilbert Grosdidier wrote:
Any other suggestion ?
Can any more information be extracted from profiling?  Here is where I
think things left off:

Eugene Loh wrote:

  
  
Gilbert Grosdidier wrote:
#                          [time]         [calls]        <%mpi>   <%wall>
# MPI_Waitall              741683         7.91081e+07     77.96     21.58
# MPI_Allreduce            114057         2.53665e+07     11.99      3.32
# MPI_Isend                27420.6        6.53513e+08      2.88      0.80
# MPI_Irecv                464.616        6.53513e+08      0.05      0.01
###

It seems to my non-expert eye that MPI_Waitall is dominant among MPI
calls,
but not for the overall application,
Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend
calls and 8+ MPI_Irecv calls.  I think IPM gives some point-to-point
messaging information.  Maybe you can tell what the distribution is of
message sizes, etc.  Or, maybe you already know the characteristic
pattern.  Does a stand-alone message-passing test (without the
computational portion) capture the performance problem you're looking
for?

Do you know message lengths and patterns?  Can you confirm whether
non-MPI time is the same between good and bad runs?
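If it would help to build such a stand-alone test, below is a minimal sketch of the exchange pattern Gilbert describes elsewhere in this thread (per rank: 8 MPI_Isend + 8 MPI_Irecv followed by one MPI_Waitall). The neighbour list, message size and iteration count are placeholders, not the application's real values, so treat it as a starting point only:

/* waitall_test.c -- stand-alone reproduction of the communication pattern:
 * per iteration each rank posts 8 MPI_Isend + 8 MPI_Irecv to its neighbours
 * and then calls MPI_Waitall, with no compute in between.
 * The neighbourhood below (ranks at distance 1..4 in a ring) and MSG_BYTES
 * are placeholders, not the application's real values. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NNEIGH    8
#define MSG_BYTES (64 * 1024)   /* assumed size; sweep this to mimic real traffic */
#define NITER     1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sbuf = malloc(NNEIGH * MSG_BYTES);   /* contents irrelevant for timing */
    char *rbuf = malloc(NNEIGH * MSG_BYTES);
    MPI_Request req[2 * NNEIGH];

    double t0 = MPI_Wtime();
    for (int it = 0; it < NITER; it++) {
        for (int n = 0; n < NNEIGH; n++) {
            /* placeholder neighbourhood: +/-1, +/-2, +/-3, +/-4 in a ring */
            int delta = (n % 2 ? 1 : -1) * (n / 2 + 1);
            int peer  = ((rank + delta) % size + size) % size;
            MPI_Irecv(rbuf + n * MSG_BYTES, MSG_BYTES, MPI_BYTE, peer, 0,
                      MPI_COMM_WORLD, &req[2 * n]);
            MPI_Isend(sbuf + n * MSG_BYTES, MSG_BYTES, MPI_BYTE, peer, 0,
                      MPI_COMM_WORLD, &req[2 * n + 1]);
        }
        MPI_Waitall(2 * NNEIGH, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average exchange time: %g s\n", (t1 - t0) / NITER);
    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

If the slowdown shows up with a test like this at 1024/2048/4096 ranks under Open MPI but not under MPT, the problem can be chased without the full application.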




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
Unfortunately, I was unable to spot any striking difference in perfs  
when using --bind-to-core.


 Sorry. Any other suggestion ?

 Regards,Gilbert.



On Jan 7, 2011, at 16:32, Jeff Squyres wrote:

Well, bummer -- there goes my theory.  According to the hwloc info  
you posted earlier, this shows that OMPI is binding to the 1st  
hyperthread on each core; *not* to both hyperthreads on a single  
core.  :-\


It would still be slightly interesting to see if there's any  
difference when you run with --bind-to-core instead of  
paffinity_alone.




On Jan 7, 2011, at 9:56 AM, Gilbert Grosdidier wrote:


Yes, here it is :

mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001
0x0002
0x0004
0x0008
0x0010
0x0020
0x0040
0x0080

Gilbert.

On Jan 7, 2011, at 15:50, Jeff Squyres wrote:


Can you run with np=8?

On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:


Hi Jeff,

Thanks for taking care of this.

Here is what I got on a worker node:

mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001

Is this what is expected, please ? Or should I try yet another  
command ?


Thanks,   Regards,   Gilbert.



On Jan 7, 2011, at 15:35, Jeff Squyres wrote:


On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:


lstopo

Machine (35GB)
NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#8)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
  PU L#2 (P#1)
  PU L#3 (P#9)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
  PU L#4 (P#2)
  PU L#5 (P#10)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
  PU L#6 (P#3)
  PU L#7 (P#11)

[snip]


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

I'll very soon give Hyperthreading a try with our app,
and keep you posted about the improvements, if any.

 Our current cluster is made out of 4-core dual-socket Nehalem nodes.

 Cheers,Gilbert.


On Jan 7, 2011, at 16:17, Tim Prince wrote:


On 1/7/2011 6:49 AM, Jeff Squyres wrote:


My understanding is that hyperthreading can only be activated/ 
deactivated at boot time -- once the core resources are allocated  
to hyperthreads, they can't be changed while running.


Whether disabling the hyperthreads or simply telling Linux not to  
schedule on them makes a difference performance-wise remains to be  
seen.  I've never had the time to do a little benchmarking to  
quantify the difference.  If someone could rustle up a few cycles  
(get it?) to test out what the real-world performance difference is  
between disabling hyperthreading in the BIOS vs. telling Linux to  
ignore the hyperthreads, that would be awesome.  I'd love to see  
such results.


My personal guess is that the difference is in the noise.  But  
that's a guess.


Applications which depend on availability of full size instruction  
lookaside buffer would be candidates for better performance with  
hyperthreads completely disabled.  Many HPC applications don't  
stress ITLB, but some do.
Most of the important resources are allocated dynamically between  
threads, but the ITLB is an exception.
We reported results of an investigation on Intel Nehalem 4-core  
hyperthreading where geometric mean performance of standard  
benchmarks for certain commercial applications was 2% better with  
hyperthreading disabled at boot time, compared with best 1 rank per  
core scheduling with hyperthreading enabled.  Needless to say, the  
report wasn't popular with marketing.  I haven't seen an equivalent  
investigation for the 6-core CPUs, where various strange performance  
effects have been noted, so, as Jeff said, the hyperthreading effect  
could be "in the noise."



--
Tim Prince



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
Well, bummer -- there goes my theory.  According to the hwloc info you posted 
earlier, this shows that OMPI is binding to the 1st hyperthread on each core; 
*not* to both hyperthreads on a single core.  :-\

It would still be slightly interesting to see if there's any difference when 
you run with --bind-to-core instead of paffinity_alone.



On Jan 7, 2011, at 9:56 AM, Gilbert Grosdidier wrote:

> Yes, here it is :
> 
> > mpirun -np 8 --mca mpi_paffinity_alone 1 
> > /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
> 0x0001
> 0x0002
> 0x0004
> 0x0008
> 0x0010
> 0x0020
> 0x0040
> 0x0080
> 
>  Gilbert.
> 
> On Jan 7, 2011, at 15:50, Jeff Squyres wrote:
> 
>> Can you run with np=8?
>> 
>> On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
>> 
>>> Hi Jeff,
>>> 
>>> Thanks for taking care of this.
>>> 
>>> Here is what I got on a worker node:
>>> 
 mpirun --mca mpi_paffinity_alone 1 
 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
>>> 0x0001
>>> 
>>> Is this what is expected, please ? Or should I try yet another command ?
>>> 
>>> Thanks,   Regards,   Gilbert.
>>> 
>>> 
>>> 
>>> On Jan 7, 2011, at 15:35, Jeff Squyres wrote:
>>> 
 On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
 
>> lstopo
> Machine (35GB)
> NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>  L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>PU L#0 (P#0)
>PU L#1 (P#8)
>  L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>PU L#2 (P#1)
>PU L#3 (P#9)
>  L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>PU L#4 (P#2)
>PU L#5 (P#10)
>  L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>PU L#6 (P#3)
>PU L#7 (P#11)
 [snip]

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Tim Prince

On 1/7/2011 6:49 AM, Jeff Squyres wrote:


My understanding is that hyperthreading can only be activated/deactivated at 
boot time -- once the core resources are allocated to hyperthreads, they can't 
be changed while running.

Whether disabling the hyperthreads or simply telling Linux not to schedule on 
them makes a difference performance-wise remains to be seen.  I've never had 
the time to do a little benchmarking to quantify the difference.  If someone 
could rustle up a few cycles (get it?) to test out what the real-world 
performance difference is between disabling hyperthreading in the BIOS vs. 
telling Linux to ignore the hyperthreads, that would be awesome.  I'd love to 
see such results.

My personal guess is that the difference is in the noise.  But that's a guess.

Applications which depend on availability of full size instruction 
lookaside buffer would be candidates for better performance with 
hyperthreads completely disabled.  Many HPC applications don't stress 
ITLB, but some do.
Most of the important resources are allocated dynamically between 
threads, but the ITLB is an exception.
We reported results of an investigation on Intel Nehalem 4-core 
hyperthreading where geometric mean performance of standard benchmarks 
for certain commercial applications was 2% better with hyperthreading 
disabled at boot time, compared with best 1 rank per core scheduling 
with hyperthreading enabled.  Needless to say, the report wasn't popular 
with marketing.  I haven't seen an equivalent investigation for the 
6-core CPUs, where various strange performance effects have been noted, 
so, as Jeff said, the hyperthreading effect could be "in the noise."



--
Tim Prince



Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

Yes, here it is :

> mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001
0x0002
0x0004
0x0008
0x0010
0x0020
0x0040
0x0080

 Gilbert.

On Jan 7, 2011, at 15:50, Jeff Squyres wrote:


Can you run with np=8?

On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:


Hi Jeff,

Thanks for taking care of this.

Here is what I got on a worker node:

mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001

Is this what is expected, please ? Or should I try yet another  
command ?


Thanks,   Regards,   Gilbert.



On Jan 7, 2011, at 15:35, Jeff Squyres wrote:


On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:


lstopo

Machine (35GB)
NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
 L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
   PU L#0 (P#0)
   PU L#1 (P#8)
 L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
   PU L#2 (P#1)
   PU L#3 (P#9)
 L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
   PU L#4 (P#2)
   PU L#5 (P#10)
 L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
   PU L#6 (P#3)
   PU L#7 (P#11)

[snip]

Well, this might disprove my theory.  :-\  The OS indexing is not  
contiguous on the hyperthreads, so I might be wrong about what  
happened here.  Try this:


mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get

You can even run that on just one node; let's see what you get.   
This will tell us what each process is *actually* bound to.  hwloc- 
bind --get will report a bitmask of the P#'s from above.  So if we  
see 001, 010, 011, ...etc, then my theory of OMPI binding 1 proc  
per hyperthread (vs. 1 proc per core) is incorrect.


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



--
*-*
 Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
 LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
 Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
 B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*








--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
Can you run with np=8?

On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:

> Hi Jeff,
> 
>  Thanks for taking care of this.
> 
> Here is what I got on a worker node:
> 
> > mpirun --mca mpi_paffinity_alone 1 
> > /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
> 0x0001
> 
>  Is this what is expected, please ? Or should I try yet another command ?
> 
>  Thanks,   Regards,   Gilbert.
> 
> 
> 
> On Jan 7, 2011, at 15:35, Jeff Squyres wrote:
> 
>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>> 
 lstopo
>>> Machine (35GB)
>>> NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>> PU L#0 (P#0)
>>> PU L#1 (P#8)
>>>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>> PU L#2 (P#1)
>>> PU L#3 (P#9)
>>>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>> PU L#4 (P#2)
>>> PU L#5 (P#10)
>>>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>> PU L#6 (P#3)
>>> PU L#7 (P#11)
>> [snip]
>> 
>> Well, this might disprove my theory.  :-\  The OS indexing is not contiguous 
>> on the hyperthreads, so I might be wrong about what happened here.  Try this:
>> 
>> mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get
>> 
>> You can even run that on just one node; let's see what you get.  This will 
>> tell us what each process is *actually* bound to.  hwloc-bind --get will 
>> report a bitmask of the P#'s from above.  So if we see 001, 010, 011, 
>> ...etc, then my theory of OMPI binding 1 proc per hyperthread (vs. 1 proc 
>> per core) is incorrect.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
> 
> --
> *-*
>   Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
>   LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
>   Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
>   B.P. 34, F-91898 Orsay Cedex (FRANCE)
> *-*
> 
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 5:27 AM, John Hearns wrote:

> Actually, the topic of hyperthreading is interesting, and we should
> discuss it please.
> Hyperthreading is supposedly implemented better and 'properly' on
> Nehalem - I would be interested to see some genuine
> performance measurements with hyperthreading on/off on your machine Gilbert.

FWIW, from what I've seen, and from the recommendations I've heard from Intel, 
using hyperthreading is still a hit-or-miss proposition with HPC apps.  It's 
true that Nehalem (and later) hyperthreading is much better than it was before. 
 But hyperthreading is still designed to support apps that stall frequently (so 
the other hyperthread(s) can take over and do useful work while one is 
stalled).  Good HPC apps don't stall much, so hyperthreading still isn't a huge 
win.

Nehalem (and later) hyperthreading has been discussed on this list at least 
once or twice before; google through the archives to see if you can dig up the 
conversations.  I have dim recollections of people sending at least some 
performance numbers...?  (I could be wrong here, though)

> Also you don't need to reboot and change BIOS settings - there was a
> rather nifty technique on this list I think,
> where you disable every second CPU in Linux - which has the same
> effect as switching off hyperthreading.

Yes, you can disable all but one hyperthread on a processor in Linux by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

where X is an integer from the set listed in hwloc's lstopo output from the P# 
numbers (i.e., the OS index values, as opposed to the logical index values).  
Repeat for the 2nd P# value on each core in your machine.  You can run lstopo 
again to verify that they went offline.  You can "echo 1" to the same file to 
bring it back online.

Note that you can't offline X=0.
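For what it's worth, here is a small sketch that automates the repetition above by walking the Linux sysfs topology (it assumes the usual /sys/devices/system/cpu/cpuN/topology/thread_siblings_list layout, contiguously numbered and initially online CPUs, and must run as root); it keeps the first sibling of each core online and offlines the rest:

/* offline_siblings.c -- leave only the first hyperthread of each core online
 * by writing 0 to /sys/devices/system/cpu/cpuN/online (run as root).
 * Sketch only: check the list it prints before trusting it. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char path[128], buf[128];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                          /* assume no more CPUs */
        if (!fgets(buf, sizeof(buf), f)) {
            fclose(f);
            break;
        }
        fclose(f);
        /* the file reads e.g. "0,8" or "0-1"; the first number is the
         * sibling we keep online (cpu0 can never be offlined anyway) */
        int first = atoi(buf);
        if (cpu != first) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/online", cpu);
            FILE *g = fopen(path, "w");
            if (g) {
                fputs("0\n", g);
                fclose(g);
                printf("offlined cpu%d\n", cpu);
            }
        }
    }
    return 0;
}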

Note that this technique technically doesn't disable each hyperthread; it just 
causes Linux to avoid scheduling on it.  Disabling hyperthreading in the BIOS 
is slightly different; you are actually physically disabling all but one thread 
per core.

The difference is in how resources in a core are split between hyperthreads.  
When you disable hyperthreading in the BIOS, all the resources in the core are 
given to the first hyperthread and the 2nd is deactivated (i.e., the OS doesn't 
even see it at all).  When hyperthreading is enabled in the BIOS, the core 
resources are split between all hyperthreads.  

Specifically: causing the OS to simply not schedule on all but the first 
hyperthread doesn't give those resources back to the first hyperthread; it just 
effectively ignores all but the first hyperthread.

My understanding is that hyperthreading can only be activated/deactivated at 
boot time -- once the core resources are allocated to hyperthreads, they can't 
be changed while running.

Whether disabling the hyperthreads or simply telling Linux not to schedule on 
them makes a difference performance-wise remains to be seen.  I've never had 
the time to do a little benchmarking to quantify the difference.  If someone 
could rustle up a few cycles (get it?) to test out what the real-world 
performance difference is between disabling hyperthreading in the BIOS vs. 
telling Linux to ignore the hyperthreads, that would be awesome.  I'd love to 
see such results.  

My personal guess is that the difference is in the noise.  But that's a guess.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

Hi Jeff,

 Thanks for taking care of this.

Here is what I got on a worker node:

> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001

 Is this what is expected, please ? Or should I try yet another  
command ?


 Thanks,   Regards,   Gilbert.



On Jan 7, 2011, at 15:35, Jeff Squyres wrote:


On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:


lstopo

Machine (35GB)
NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
  L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#8)
  L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#9)
  L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#10)
  L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#11)

[snip]

Well, this might disprove my theory.  :-\  The OS indexing is not  
contiguous on the hyperthreads, so I might be wrong about what  
happened here.  Try this:


mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get

You can even run that on just one node; let's see what you get.   
This will tell us what each process is *actually* bound to.  hwloc- 
bind --get will report a bitmask of the P#'s from above.  So if we  
see 001, 010, 011, ...etc, then my theory of OMPI binding 1 proc per  
hyperthread (vs. 1 proc per core) is incorrect.


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:

> > lstopo
> Machine (35GB)
>  NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>  PU L#0 (P#0)
>  PU L#1 (P#8)
>L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>  PU L#2 (P#1)
>  PU L#3 (P#9)
>L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>  PU L#4 (P#2)
>  PU L#5 (P#10)
>L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>  PU L#6 (P#3)
>  PU L#7 (P#11)
[snip]

Well, this might disprove my theory.  :-\  The OS indexing is not contiguous on 
the hyperthreads, so I might be wrong about what happened here.  Try this:

mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get

You can even run that on just one node; let's see what you get.  This will tell 
us what each process is *actually* bound to.  hwloc-bind --get will report a 
bitmask of the P#'s from above.  So if we see 001, 010, 011, ...etc, then my 
theory of OMPI binding 1 proc per hyperthread (vs. 1 proc per core) is 
incorrect.
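A similar check can also be done from inside the application itself. Here is a minimal sketch, assuming Linux and glibc's sched_getaffinity(), that makes each rank print the OS processor IDs (the P#'s above) it is bound to:

/* print_binding.c -- each rank prints the CPUs it is actually bound to,
 * similar in spirit to running hwloc-bind --get under mpirun.
 * Sketch only: assumes Linux and glibc's sched_getaffinity(). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d bound to OS processor(s):", rank);
        for (int p = 0; p < CPU_SETSIZE; p++)     /* P# values from lstopo */
            if (CPU_ISSET(p, &mask))
                printf(" %d", p);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}

Running it with "mpirun -np 8 --mca mpi_paffinity_alone 1 ./print_binding" should show one distinct P# per rank; two ranks reporting P#'s that lstopo lists under the same core would confirm the double-booking theory.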

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread John Hearns
On 6 January 2011 21:10, Gilbert Grosdidier  wrote:
> Hi Jeff,
>
>  Where's located lstopo command on SuseLinux, please ?
> And/or hwloc-bind, which seems related to it ?

I was able to get hwloc to install quite easily on SuSE -
download/configure/make
Configure it to install to /usr/local/bin


Actually, the topic of hyperthreading is interesting, and we should
discuss it please.
Hyperthreading is supposedly implemented better and 'properly' on
Nehalem - I would be interested to see some genuine
performance measurements with hyperthreading on/off on your machine Gilbert.

Also you don't need to reboot and change BIOS settings - there was a
rather nifty technique on this list I think,
where you disable every second CPU in Linux - which has the same
effect as switching off hyperthreading.
Maybe you could try it?



Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Gilbert Grosdidier

Hi Jeff,

 Here is the output of lstopo on one of the workers (thanks 
Jean-Christophe) :


> lstopo
Machine (35GB)
  NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#8)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
  PU L#2 (P#1)
  PU L#3 (P#9)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
  PU L#4 (P#2)
  PU L#5 (P#10)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
  PU L#6 (P#3)
  PU L#7 (P#11)
  NUMANode L#1 (P#1 18GB) + Socket L#1 + L3 L#1 (8192KB)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
  PU L#8 (P#4)
  PU L#9 (P#12)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
  PU L#10 (P#5)
  PU L#11 (P#13)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
  PU L#12 (P#6)
  PU L#13 (P#14)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
  PU L#14 (P#7)
  PU L#15 (P#15)

 Tests with --bind-to-core are under way ...

 What is your conclusion, please ?

 Thanks,   G.




On Jan 6, 2011, at 23:16, Jeff Squyres wrote:

On Jan 6, 2011, at 5:07 PM, Gilbert Grosdidier wrote:


Yes Jeff, I'm pretty sure indeed that hyperthreading is enabled, since 16 CPUs
are visible in the /proc/cpuinfo pseudo-file, while it's an 8-core Nehalem node.

However, I always carefully checked that only 8 processes are running on each 
node.  Could it be that they are assigned to 8 hyperthreads but only 4 cores, 
for example ?  Is this actually possible with paffinity set to 1 ?

Yes.  I actually had this happen to another user recently; I should add this to 
the FAQ...  (/me adds to to-do list)

Here's what I'm guessing is happening: OMPI's paffinity_alone algorithm is 
currently pretty stupid.  It simply assigns the first MPI process on the node 
to OS processor ID 0.  It then assigns the second MPI process on the node to 
OS processor ID 1.  ...and so on.

However, if hyperthreading is enabled, OS processor ID's 0 and 1 might be 2 
hyperthreads on the same core.  And therefore OMPI has effectively just bound 2 
processes to the same core.  Ouch!

The output of lstopo can verify if this is happening: look to see if processor 
ID's 0 through 7 are on the same 4 cores.

Instead of paffinity_alone, use the mpirun --bind-to-core option; that should 
bind each MPI process to (the first hyperthread in) its own core.

Sidenote: many improvements are coming to our processor affinity system over 
the next few releases...  See my slides from the Open MPI BOF at SC'10 for some 
discussion of what's coming:

 http://www.open-mpi.org/papers/sc-2010/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Jeff Squyres
On Jan 6, 2011, at 5:07 PM, Gilbert Grosdidier wrote:

> Yes Jeff, I'm pretty sure indeed that hyperthreading is enabled, since 16 
> CPUs are visible in the /proc/cpuinfo pseudo-file, while it's an 8-core 
> Nehalem node.
> 
> However, I always carefully checked that only 8 processes are running on each 
> node.  Could it be that they are assigned to 8 hyperthreads but only 4 cores, 
> for example ?  Is this actually possible with paffinity set to 1 ?

Yes.  I actually had this happen to another user recently; I should add this to 
the FAQ...  (/me adds to to-do list)

Here's what I'm guessing is happening: OMPI's paffinity_alone algorithm is 
currently pretty stupid.  It simply assigns the first MPI process on the node 
to OS processor ID 0.  It then assigned the second MPI process on the node to 
OS processor ID 1.  ...and so on.

However, if hyperthreading is enabled, OS processor ID's 0 and 1 might be 2 
hyperthreads on the same core.  And therefore OMPI has effectively just bound 2 
processes to the same core.  Ouch!

The output of lstopo can verify if this is happening: look to see if processor 
ID's 0 through 7 are on the same 4 cores.

Instead of paffinity_alone, use the mpirun --bind-to-core option; that should 
bind each MPI process to (the first hyperthread in) its own core. 

Sidenote: many improvements are coming to our processor affinity system over 
the next few releases...  See my slides from the Open MPI BOF at SC'10 for some 
discussion of what's coming:

http://www.open-mpi.org/papers/sc-2010/

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Jeff Squyres
On Jan 6, 2011, at 4:10 PM, Gilbert Grosdidier wrote:

> Where's located lstopo command on SuseLinux, please ?

'fraid I don't know anything about Suse...  :-(

It may be named hwloc-ls...?

> And/or hwloc-bind, which seems related to it ?

hwloc-bind is definitely related, but it's a different utility:

  http://www.open-mpi.org/projects/hwloc/doc/v1.1/tools.php

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Gilbert Grosdidier

Hi Jeff,

 Where's located lstopo command on SuseLinux, please ?
And/or hwloc-bind, which seems related to it ?

 Thanks,   G.




On Jan 6, 2011, at 21:21, Jeff Squyres wrote:

(now that we're back from vacation)

Actually, this could be an issue.  Is hyperthreading enabled on your machine?

Can you send the text output from running hwloc's "lstopo" command on your 
compute nodes?

I ask because if hyperthreading is enabled, OMPI might be assigning one process 
per *hyperthread* (vs. one process per *core*).  And that could be disastrous 
for performance.



On Dec 22, 2010, at 2:25 PM, Gilbert Grosdidier wrote:


Hi David,

Yes, I set mpi_paffinity_alone to 1. Is that right and sufficient, please ?

Thanks for your help,   Best,   G.



On Dec 22, 2010, at 20:18, David Singleton wrote:

Is the same level of processes and memory affinity or binding being used?

On 12/21/2010 07:45 AM, Gilbert Grosdidier wrote:

Yes, there is definitely only 1 process per core with both MPI implementations.

Thanks, G.


On Dec 20, 2010, at 20:39, George Bosilca wrote:

Are your processes placed the same way with the two MPI implementations? 
Per-node vs. per-core ?

george.




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Jeff Squyres
(now that we're back from vacation)

Actually, this could be an issue.  Is hyperthreading enabled on your machine?

Can you send the text output from running hwloc's "lstopo" command on your 
compute nodes?

I ask because if hyperthreading is enabled, OMPI might be assigning one process 
per *hyperthread* (vs. one process per *core*).  And that could be disastrous 
for performance.



On Dec 22, 2010, at 2:25 PM, Gilbert Grosdidier wrote:

> Hi David,
> 
> Yes, I set mpi_paffinity_alone to 1. Is that right and sufficient, please ?
> 
> Thanks for your help,   Best,   G.
> 
> 
> 
> On Dec 22, 2010, at 20:18, David Singleton wrote:
>> 
>> Is the same level of processes and memory affinity or binding being used?
>> 
>> On 12/21/2010 07:45 AM, Gilbert Grosdidier wrote:
>>> Yes, there is definitely only 1 process per core with both MPI 
>>> implementations.
>>> 
>>> Thanks, G.
>>> 
>>> 
>>> On Dec 20, 2010, at 20:39, George Bosilca wrote:
 Are your processes placed the same way with the two MPI implementations? 
 Per-node vs. per-core ?
 
 george. 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-22 Thread Gilbert Grosdidier

Hi David,

Yes, I set mpi_paffinity_alone to 1. Is that right and sufficient, please ?

 Thanks for your help,   Best,   G.



On Dec 22, 2010, at 20:18, David Singleton wrote:


Is the same level of processes and memory affinity or binding being used?

On 12/21/2010 07:45 AM, Gilbert Grosdidier wrote:
Yes, there is definitely only 1 process per core with both MPI 
implementations.


Thanks, G.


On Dec 20, 2010, at 20:39, George Bosilca wrote:
Are your processes placed the same way with the two MPI 
implementations? Per-node vs. per-core ?


george. 




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Eugene Loh




Gilbert Grosdidier wrote:

  
Good evening Eugene,

Good morning at my end.
Here follows some output for a 1024-core run.

Assuming this corresponds meaningfully with your original e-mail, 1024
cores means performance of 700 vs 900.  So, that looks roughly
consistent with the 28% MPI time you show here.  That seems to imply
that the slowdown is due entirely to long MPI times (rather than slow
non-MPI times).  Just a sanity check.

Unfortunately, I'm yet unable to have the equivalent MPT chart.

That may be all right.  If one run clearly shows a problem (which is
perhaps the case here), then a "good profile" is not needed.  Here, a
"good profile" would perhaps be used only to confirm that near-zero MPI
time is possible.
 #IPMv0.983
# host    : r34i0n0/x86_64_Linux   mpi_tasks : 1024 on 128 nodes
# start   : 12/21/10/13:18:09  wallclock : 3357.308618 sec
# stop    : 12/21/10/14:14:06  %comm : 27.67
##
#
#                   [total]         <avg>         min           max
# wallclock         3.43754e+06     3356.98       3356.83       3357.31
# user              2.82831e+06     2762.02       2622.04       2923.37
# system            376230          367.412       174.603       492.919
# mpi               951328          929.031       633.137       1052.86
# %comm                             27.6719       18.8601       31.363
No glaring evidence here of load imbalance being the sole explanation,
but hard to tell from these numbers.  (If min comm time is 0%, then
that process is presumably holding everyone else up.)
#                          [time]         [calls]        <%mpi>   <%wall>
# MPI_Waitall              741683         7.91081e+07     77.96     21.58
# MPI_Allreduce            114057         2.53665e+07     11.99      3.32
# MPI_Isend                27420.6        6.53513e+08      2.88      0.80
# MPI_Irecv                464.616        6.53513e+08      0.05      0.01
###
  
It seems to my non-expert eye that MPI_Waitall is dominant among MPI
calls,
but not for the overall application,
If at 1024 cores, performance is 700 compared to 900, then whatever the
problem is still hasn't dominated the entire application performance. 
So, it looks like MPI_Waitall is the problem, even if it doesn't
dominate overall application time.

Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend
calls and 8+ MPI_Irecv calls.  I think IPM gives some point-to-point
messaging information.  Maybe you can tell what the distribution is of
message sizes, etc.  Or, maybe you already know the characteristic
pattern.  Does a stand-alone message-passing test (without the
computational portion) capture the performance problem you're looking
for?
On Dec 22, 2010, at 18:50, Eugene Loh wrote:
Can you isolate a bit more where the time is being spent?  The performance
effect you're describing appears to be drastic.  Have you profiled the
code?  Some choices of tools can be found in the FAQ http://www.open-mpi.org/faq/?category=perftools
code?  Some choices of tools can be found in the FAQ http://www.open-mpi.org/faq/?category=perftools 
The results may be "uninteresting" (all time spent in your MPI_Waitall
calls, for example), but it'd be good to rule out other possibilities
(e.g., I've seen cases where it's the non-MPI time that's the culprit).


If all the time is spent in MPI_Waitall, then I wonder if it would be
possible for you to reproduce the problem with just some
MPI_Isend|Irecv|Waitall calls that mimic your program.  E.g., "lots of
short messages", or "lots of long messages", etc.  It sounds like there
is some repeated set of MPI exchanges, so maybe that set can be
extracted and run without the complexities of the application. 





Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Gilbert Grosdidier

Good evening Eugene,

 First thanks for trying to help me.

 I already gave a try to some profiling tool, namely IPM, which is rather
simple to use. Here follows some output for a 1024 core run.
Unfortunately, I'm yet unable to have the equivalent MPT chart.

#IPMv0.983
#
# command : unknown (completed)
# host: r34i0n0/x86_64_Linux   mpi_tasks : 1024 on 128 nodes
# start   : 12/21/10/13:18:09  wallclock : 3357.308618 sec
# stop: 12/21/10/14:14:06  %comm : 27.67
# gbytes  : 0.0e+00 total  gflop/sec : 0.0e+00 total
#
##
# region  : *   [ntasks] =   1024
#
#                   [total]         <avg>         min           max
# entries           1024            1             1             1
# wallclock         3.43754e+06     3356.98       3356.83       3357.31
# user              2.82831e+06     2762.02       2622.04       2923.37
# system             376230         367.412       174.603       492.919
# mpi                951328         929.031       633.137       1052.86
# %comm                             27.6719       18.8601       31.363
# gflop/sec          0              0             0             0
# gbytes             0              0             0             0

#
#
#                          [time]         [calls]        <%mpi>   <%wall>
# MPI_Waitall              741683         7.91081e+07     77.96     21.58
# MPI_Allreduce            114057         2.53665e+07     11.99      3.32
# MPI_Recv                 40164.7        2048             4.22      1.17
# MPI_Isend                27420.6        6.53513e+08      2.88      0.80
# MPI_Barrier              25113.5        2048             2.64      0.73
# MPI_Sendrecv             2123.62        12992            0.22      0.06
# MPI_Irecv                464.616        6.53513e+08      0.05      0.01
# MPI_Reduce               215.447        171008           0.02      0.01
# MPI_Bcast                85.0198        1024             0.01      0.00
# MPI_Send                 0.377043       2048             0.00      0.00
# MPI_Comm_rank            0.000744925    4096             0.00      0.00
# MPI_Comm_size            0.000252183    1024             0.00      0.00

###

 It seems to my non-expert eye that MPI_Waitall is dominant among MPI calls,
but not for the overall application; however, I will have to compare with MPT
before concluding.

 Thanks again for your suggestions, that I'll address one by one.

 Best, G.




On Dec 22, 2010, at 18:50, Eugene Loh wrote:
Can you isolate a bit more where the time is being spent?  The 
performance effect you're describing appears to be drastic.  Have you 
profiled the code?  Some choices of tools can be found in the FAQ 
http://www.open-mpi.org/faq/?category=perftools  The results may be 
"uninteresting" (all time spent in your MPI_Waitall calls, for 
example), but it'd be good to rule out other possibilities (e.g., I've 
seen cases where it's the non-MPI time that's the culprit).


If all the time is spent in MPI_Waitall, then I wonder if it would be 
possible for you to reproduce the problem with just some 
MPI_Isend|Irecv|Waitall calls that mimic your program.  E.g., "lots of 
short messages", or "lots of long messages", etc.  It sounds like 
there is some repeated set of MPI exchanges, so maybe that set can be 
extracted and run without the complexities of the application.


Anyhow, some profiling might help guide one to the problem.

Gilbert Grosdidier wrote:


There is indeed a high rate of communication. But the buffer
size is always the same for a given pair of processes, and I thought
that mpi_leave_pinned should avoid freeing the memory in this case.
Am I wrong ?




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Gilbert Grosdidier

There is indeed a high rate of communication. But the buffer
size is always the same for a given pair of processes, and I thought
that mpi_leave_pinned should avoid freeing the memory in this case.
Am I wrong ?

 Thanks,  Best, G.


On Dec 21, 2010, at 18:52, Matthieu Brucher wrote:

Don't forget that MPT has some optimizations OpenMPI may not have, such as
"overriding" free(). This way, MPT can have a huge performance boost
if you're allocating and freeing memory, and the same happens if you
communicate often.

Matthieu

2010/12/21 Gilbert Grosdidier:

Hi George,
  Thanks for your help. The bottom line is that the processes are neatly
placed on the nodes/cores,
as far as I can tell from the map :
[...]
 Process OMPI jobid: [33285,1] Process rank: 4
 Process OMPI jobid: [33285,1] Process rank: 5
 Process OMPI jobid: [33285,1] Process rank: 6
 Process OMPI jobid: [33285,1] Process rank: 7
  Data for node: Name: r34i0n1   Num procs: 8
 Process OMPI jobid: [33285,1] Process rank: 8
 Process OMPI jobid: [33285,1] Process rank: 9
 Process OMPI jobid: [33285,1] Process rank: 10
 Process OMPI jobid: [33285,1] Process rank: 11
 Process OMPI jobid: [33285,1] Process rank: 12
 Process OMPI jobid: [33285,1] Process rank: 13
 Process OMPI jobid: [33285,1] Process rank: 14
 Process OMPI jobid: [33285,1] Process rank: 15
  Data for node: Name: r34i0n2   Num procs: 8
 Process OMPI jobid: [33285,1] Process rank: 16
 Process OMPI jobid: [33285,1] Process rank: 17
 Process OMPI jobid: [33285,1] Process rank: 18
 Process OMPI jobid: [33285,1] Process rank: 19
 Process OMPI jobid: [33285,1] Process rank: 20
[...]
  But the perfs are still very low ;-(
  Best,G.
On Dec 20, 2010, at 22:27, George Bosilca wrote:

That's a first step. My question was more related to the process overlay on
the cores. If the MPI implementation places one process per node, then rank k
and rank k+1 will always be on separate nodes, and the communications will
have to go over IB. Conversely, if the MPI implementation places the
processes per core, then rank k and k+1 will [mostly] be on the same node
and the communications will go over shared memory. Depending on how the
processes are placed and how you create the neighborhoods, the performance
can be drastically impacted.

There is a pretty good description of the problem at:
http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/

Some hints at
http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest
you play with the --byslot --bynode options to see how this affects the
performance of your application.

For the hardcore cases we provide a rankfile feature. More info at:
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Enjoy,
  george.



On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:

Yes, there is definitely only 1 process per core with both MPI
implementations.

Thanks,   G.


On Dec 20, 2010, at 20:39, George Bosilca wrote:

Are your processes placed the same way with the two MPI implementations?
Per-node vs. per-core ?

  george.

On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:

Hello,

I am now at a loss with my running of OpenMPI (namely 1.4.3)

on a SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.

After fixing several rather obvious failures with Ralph's, Jeff's, and John's help,

I am now facing the bottom of this story since :

- there are no more obvious failures with messages

- compared to the running of the application with SGI-MPT, the CPU
performances I get

are very low, decreasing when the number of cores increases (cf below)

- these performances are highly reproducible

- I tried a very high number of -mca parameters, to no avail

If I take as a reference the MPT CPU speed performance,

it is of about 900 (in some arbitrary unit), whatever the

number of cores I used (up to 8192).

But, when running with OMPI, I get:

- 700 with 1024 cores (which is already rather low)

- 300 with 2048 cores

- 60   with 4096 cores.

The computing loop, over which the above CPU performance is evaluated,
includes

a stack of MPI exchanges [per core : 8 x (MPI_Isend + MPI_Irecv) +
MPI_Waitall]

The application is of the 'domain partition' type,

and the performances, together with the memory footprint,

are virtually identical on all cores. The memory footprint is twice as high in

the OMPI case (1.5GB/core) than in the MPT case (0.7GB/core).

What could be wrong with all these, please ?

I provided (in attachment) the 'ompi_info -all ' output.

The config.log is in attachment as well.

I compiled OMPI with icc. I checked numa and affinity are OK.

I use the following command to run my OMPI app:

"mpiexec -mca btl_openib_rdma_pipeline_send_length 65536\

-mca btl_openib_rdma_pipeline_frag_size 65536\

-mca btl_openib_min_rdma_pipeline_size 65536\

-mca 

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-21 Thread Matthieu Brucher
Don't forget that MPT has some optimizations OpenMPI may not have, such as
"overriding" free(). This way, MPT can have a huge performance boost
if you're allocating and freeing memory, and the same happens if you
communicate often.

Matthieu

2010/12/21 Gilbert Grosdidier :
> Hi George,
>  Thanks for your help. The bottom line is that the processes are neatly
> placed on the nodes/cores,
> as far as I can tell from the map :
> [...]
>         Process OMPI jobid: [33285,1] Process rank: 4
>         Process OMPI jobid: [33285,1] Process rank: 5
>         Process OMPI jobid: [33285,1] Process rank: 6
>         Process OMPI jobid: [33285,1] Process rank: 7
>  Data for node: Name: r34i0n1   Num procs: 8
>         Process OMPI jobid: [33285,1] Process rank: 8
>         Process OMPI jobid: [33285,1] Process rank: 9
>         Process OMPI jobid: [33285,1] Process rank: 10
>         Process OMPI jobid: [33285,1] Process rank: 11
>         Process OMPI jobid: [33285,1] Process rank: 12
>         Process OMPI jobid: [33285,1] Process rank: 13
>         Process OMPI jobid: [33285,1] Process rank: 14
>         Process OMPI jobid: [33285,1] Process rank: 15
>  Data for node: Name: r34i0n2   Num procs: 8
>         Process OMPI jobid: [33285,1] Process rank: 16
>         Process OMPI jobid: [33285,1] Process rank: 17
>         Process OMPI jobid: [33285,1] Process rank: 18
>         Process OMPI jobid: [33285,1] Process rank: 19
>         Process OMPI jobid: [33285,1] Process rank: 20
> [...]
>  But the perfs are still very low ;-(
>  Best,    G.
> On Dec 20, 2010, at 22:27, George Bosilca wrote:
>
> That's a first step. My question was more related to the process overlay on
> the cores. If the MPI implementation places one process per node, then rank k
> and rank k+1 will always be on separate nodes, and the communications will
> have to go over IB. Conversely, if the MPI implementation places the
> processes per core, then rank k and k+1 will [mostly] be on the same node
> and the communications will go over shared memory. Depending on how the
> processes are placed and how you create the neighborhoods, the performance
> can be drastically impacted.
>
> There is a pretty good description of the problem at:
> http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/
>
> Some hints at
> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest
> you play with the --byslot --bynode options to see how this affects the
> performance of your application.
>
> For the hardcore cases we provide a rankfile feature. More info at:
> http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>
> Enjoy,
>  george.
>
>
>
> On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:
>
> Yes, there is definitely only 1 process per core with both MPI
> implementations.
>
> Thanks,   G.
>
>
> On Dec 20, 2010, at 20:39, George Bosilca wrote:
>
> Are your processes placed the same way with the two MPI implementations?
> Per-node vs. per-core ?
>
>  george.
>
> On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:
>
> Hello,
>
> I am now at a loss with my running of OpenMPI (namely 1.4.3)
>
> on a SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.
>
> After fixing several rather obvious failures with Ralph's, Jeff's, and John's help,
>
> I am now facing the bottom of this story since :
>
> - there are no more obvious failures with messages
>
> - compared to the running of the application with SGI-MPT, the CPU
> performances I get
>
> are very low, decreasing when the number of cores increases (cf below)
>
> - these performances are highly reproducible
>
> - I tried a very high number of -mca parameters, to no avail
>
> If I take as a reference the MPT CPU speed performance,
>
> it is of about 900 (in some arbitrary unit), whatever the
>
> number of cores I used (up to 8192).
>
> But, when running with OMPI, I get:
>
> - 700 with 1024 cores (which is already rather low)
>
> - 300 with 2048 cores
>
> - 60   with 4096 cores.
>
> The computing loop, over which the above CPU performance is evaluated,
> includes
>
> a stack of MPI exchanges [per core : 8 x (MPI_Isend + MPI_Irecv) +
> MPI_Waitall]
>
> The application is of the 'domain partition' type,
>
> and the performances, together with the memory footprint,
>
> are virtually identical on all cores. The memory footprint is twice as high in
>
> the OMPI case (1.5GB/core) than in the MPT case (0.7GB/core).
>
> What could be wrong with all these, please ?
>
> I provided (in attachment) the 'ompi_info -all ' output.
>
> The config.log is in attachment as well.
>
> I compiled OMPI with icc. I checked numa and affinity are OK.
>
> I use the following command to run my OMPI app:
>
> "mpiexec -mca btl_openib_rdma_pipeline_send_length 65536\
>
> -mca btl_openib_rdma_pipeline_frag_size 65536\
>
> -mca btl_openib_min_rdma_pipeline_size 65536\
>
> -mca btl_self_rdma_pipeline_send_length 262144\
>
> -mca btl_self_rdma_pipeline_frag_size 262144\
>
> -mca 

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-21 Thread Gilbert Grosdidier

Hi George,

 Thanks for your help. The bottom line is that the processes are  
neatly placed on the nodes/cores,

as far as I can tell from the map :

[...]
Process OMPI jobid: [33285,1] Process rank: 4
Process OMPI jobid: [33285,1] Process rank: 5
Process OMPI jobid: [33285,1] Process rank: 6
Process OMPI jobid: [33285,1] Process rank: 7

 Data for node: Name: r34i0n1   Num procs: 8
Process OMPI jobid: [33285,1] Process rank: 8
Process OMPI jobid: [33285,1] Process rank: 9
Process OMPI jobid: [33285,1] Process rank: 10
Process OMPI jobid: [33285,1] Process rank: 11
Process OMPI jobid: [33285,1] Process rank: 12
Process OMPI jobid: [33285,1] Process rank: 13
Process OMPI jobid: [33285,1] Process rank: 14
Process OMPI jobid: [33285,1] Process rank: 15

 Data for node: Name: r34i0n2   Num procs: 8
Process OMPI jobid: [33285,1] Process rank: 16
Process OMPI jobid: [33285,1] Process rank: 17
Process OMPI jobid: [33285,1] Process rank: 18
Process OMPI jobid: [33285,1] Process rank: 19
Process OMPI jobid: [33285,1] Process rank: 20
[...]

 But the perfs are still very low ;-(

 Best,G.

On Dec 20, 2010, at 22:27, George Bosilca wrote:

That's a first step. My question was more related to the process
overlay on the cores. If the MPI implementation places one process
per node, then rank k and rank k+1 will always be on separate nodes,
and the communications will have to go over IB. Conversely, if
the MPI implementation places the processes per core, then rank k
and k+1 will [mostly] be on the same node and the communications
will go over shared memory. Depending on how the processes are
placed and how you create the neighborhoods, the performance can be
drastically impacted.


There is a pretty good description of the problem at: 
http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/

Some hints at http://www.open-mpi.org/faq/?category=running#mpirun-scheduling 
. I suggest you play with the --byslot --bynode options to see how
this affects the performance of your application.


For the hardcore cases we provide a rankfile feature. More info at: 
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Enjoy,
 george.



On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:

Yes, there is definitely only 1 process per core with both MPI  
implementations.


Thanks,   G.


On Dec 20, 2010, at 20:39, George Bosilca wrote:
Are your processes placed the same way with the two MPI
implementations? Per-node vs. per-core ?


 george.

On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:


Hello,

I am now at a loss with my running of OpenMPI (namely 1.4.3)
on a SGI Altix cluster with 2048 or 4096 cores, running over  
Infiniband.


After fixing several rather obvious failures with Ralph's, Jeff's, and John's help,

I am now facing the bottom of this story since :
- there are no more obvious failures with messages
- compared to the running of the application with SGI-MPT, the  
CPU performances I get
are very low, decreasing when the number of cores increases (cf  
below)

- these performances are highly reproducible
- I tried a very high number of -mca parameters, to no avail

If I take as a reference the MPT CPU speed performance,
it is of about 900 (in some arbitrary unit), whatever the
number of cores I used (up to 8192).

But, when running with OMPI, I get:
- 700 with 1024 cores (which is already rather low)
- 300 with 2048 cores
- 60   with 4096 cores.

The computing loop, over which the above CPU performance is  
evaluated, includes
a stack of MPI exchanges [per core : 8 x (MPI_Isend + MPI_Irecv)  
+ MPI_Waitall]


The application is of the 'domain partition' type,
and the performances, together with the memory footprint,
are virtually identical on all cores. The memory footprint is twice as high in

the OMPI case (1.5GB/core) than in the MPT case (0.7GB/core).

What could be wrong with all these, please ?

I provided (in attachment) the 'ompi_info -all ' output.
The config.log is in attachment as well.
I compiled OMPI with icc. I checked numa and affinity are OK.

I use the following command to run my OMPI app:
"mpiexec -mca btl_openib_rdma_pipeline_send_length 65536\
-mca btl_openib_rdma_pipeline_frag_size 65536\
-mca btl_openib_min_rdma_pipeline_size 65536\
-mca btl_self_rdma_pipeline_send_length 262144\
-mca btl_self_rdma_pipeline_frag_size 262144\
-mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1\
-mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128\
-mca coll_tuned_pre_allocate_memory_comm_size_limit 128\
-mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128\
-mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0\
-mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1\
-mca btl sm,openib,self -mca btl_openib_want_fork_support 0\
-mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1\
-mca osc_rdma_no_locks 1\

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-20 Thread George Bosilca
That's a first step. My question was more related to the process overlay on the 
cores. If the MPI implementation places one process per node, then rank k and 
rank k+1 will always be on separate nodes, and the communications will have to 
go over IB. Conversely, if the MPI implementation places the processes per 
core, then rank k and k+1 will [mostly] be on the same node and the 
communications will go over shared memory. Depending on how the processes are 
placed and how you create the neighborhoods, the performance can be drastically 
impacted.
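One quick way to see how the ranks actually landed is to have every rank compare its host name with that of rank k+1; here is a minimal sketch (the host-name comparison and output format are just for illustration, not part of any tool):

/* node_of_rank.c -- print, for each rank, whether rank+1 lives on the same
 * node (shared memory path) or on a different one (IB path).  Sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &len);

    /* gather every rank's host name so each rank can look up its neighbour */
    char *all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    if (rank + 1 < size) {
        const char *next = all + (size_t)(rank + 1) * MPI_MAX_PROCESSOR_NAME;
        printf("rank %d (%s) -> rank %d (%s): %s\n",
               rank, name, rank + 1, next,
               strcmp(name, next) == 0 ? "same node" : "different node");
    }

    free(all);
    MPI_Finalize();
    return 0;
}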

There is a pretty good description of the problem at: 
http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/

Some hints at http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. 
I suggest you play with the --byslot --bynode options to see how this affects 
the performance of your application.

For the hardcore cases we provide a rankfile feature. More info at: 
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Enjoy,
  george.



On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:

> Yes, there is definitely only 1 process per core with both MPI 
> implementations.
> 
> Thanks,   G.
> 
> 
> On Dec 20, 2010, at 20:39, George Bosilca wrote:
>> Are your processes placed the same way with the two MPI implementations? 
>> Per-node vs. per-core ?
>> 
>>   george.
>> 
>> On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:
>> 
>>> Hello,
>>> 
>>>  I am now at a loss with my running of OpenMPI (namely 1.4.3)
>>> on a SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.
>>> 
>>>  After fixing several rather obvious failures with Ralph's, Jeff's, and John's help,
>>> I am now facing the bottom of this story since :
>>> - there are no more obvious failures with messages
>>> - compared to the running of the application with SGI-MPT, the CPU 
>>> performances I get
>>> are very low, decreasing when the number of cores increases (cf below)
>>> - these performances are highly reproducible
>>> - I tried a very high number of -mca parameters, to no avail
>>> 
>>>  If I take as a reference the MPT CPU speed performance,
>>> it is of about 900 (in some arbitrary unit), whatever the
>>> number of cores I used (up to 8192).
>>> 
>>>  But, when running with OMPI, I get:
>>> - 700 with 1024 cores (which is already rather low)
>>> - 300 with 2048 cores
>>> - 60   with 4096 cores.
>>> 
>>>  The computing loop, over which the above CPU performance is evaluated, 
>>> includes
>>> a stack of MPI exchanges [per core : 8 x (MPI_Isend + MPI_Irecv) + 
>>> MPI_Waitall]
>>> 
>>>  The application is of the 'domain partition' type,
>>> and the performances, together with the memory footprint,
>>> are virtually identical on all cores. The memory footprint is twice as high in
>>> the OMPI case (1.5GB/core) than in the MPT case (0.7GB/core).
>>> 
>>>  What could be wrong with all these, please ?
>>> 
>>>  I provided (in attachment) the 'ompi_info -all ' output.
>>> The config.log is in attachment as well.
>>> I compiled OMPI with icc. I checked numa and affinity are OK.
>>> 
>>> I use the following command to run my OMPI app:
>>> "mpiexec -mca btl_openib_rdma_pipeline_send_length 65536\
>>>  -mca btl_openib_rdma_pipeline_frag_size 65536\
>>>  -mca btl_openib_min_rdma_pipeline_size 65536\
>>>  -mca btl_self_rdma_pipeline_send_length 262144\
>>>  -mca btl_self_rdma_pipeline_frag_size 262144\
>>>  -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1\
>>>  -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128\
>>>  -mca coll_tuned_pre_allocate_memory_comm_size_limit 128\
>>>  -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128\
>>>  -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0\
>>>  -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1\
>>>  -mca btl sm,openib,self -mca btl_openib_want_fork_support 0\
>>>  -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1\
>>>  -mca osc_rdma_no_locks 1\
>>>  $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".
>>> 
>>>  OpenIB info:
>>> 
>>> 1) OFED-1.4.1, installed by SGI
>>> 
>>> 2) Linux xx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010 
>>> x86_64 x86_64 x86_64 GNU/Linux
>>> OS : SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200
>>> 
>>> 3) Running most probably an SGI subnet manager
>>> 
>>> 4)>  ibv_devinfo (on a worker node)
>>> hca_id:mlx4_0
>>> fw_ver:2.7.000
>>> node_guid:0030:48ff:ffcc:4c44
>>> sys_image_guid:0030:48ff:ffcc:4c47
>>> vendor_id:0x02c9
>>> vendor_part_id:26418
>>> hw_ver:0xA0
>>> board_id:SM_207101000
>>> phys_port_cnt:2
>>> port:1
>>> state:PORT_ACTIVE (4)
>>> max_mtu:2048 (4)
>>> active_mtu:2048 (4)
>>> sm_lid:1
>>> port_lid:6009
>>> port_lmc:0x00

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-20 Thread Gilbert Grosdidier
Yes, there is definitely only 1 process per core with both MPI 
implementations.


 Thanks,   G.


On Dec 20, 2010, at 20:39, George Bosilca wrote:

Are your processes placed the same way with the two MPI implementations? 
Per-node vs. per-core ?

   george.

On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:


Hello,

  I am now at a loss with my running of OpenMPI (namely 1.4.3)
on an SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.

  After fixing several rather obvious failures with Ralph's, Jeff's and John's help,
I have now reached the bottom of this story:
- there are no more obvious failures or error messages
- compared to running the application with SGI MPT, the CPU performance I get
is very low, and it decreases as the number of cores increases (see below)
- these performance figures are highly reproducible
- I have tried a very large number of -mca parameters, to no avail

  If I take the MPT CPU performance as a reference, it is about 900
(in some arbitrary unit), whatever the number of cores I use (up to 8192).

  But when running with OMPI, I get:
- 700 with 1024 cores (which is already rather low)
- 300 with 2048 cores
- 60 with 4096 cores.
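
Relative to MPT, that is roughly 78% at 1024 cores (700/900), 33% at 2048
(300/900) and 7% at 4096 (60/900): not a constant overhead, but a collapse as
the core count grows.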

  The computing loop, over which the above CPU performance is evaluated,
includes a stack of MPI exchanges
[per core: 8 x (MPI_Isend + MPI_Irecv) + MPI_Waitall].
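
In rough outline, one such exchange phase looks like the sketch below (buffer
names, message counts and the neighbour list are placeholders, not the actual
application code):

#include <mpi.h>

#define NNEIGH 8   /* 8 neighbouring domains per core */

/* Called once per iteration of the computing loop: post all receives and
 * sends to the 8 neighbours, then complete them with a single MPI_Waitall. */
void halo_exchange(double *sendbuf[NNEIGH], double *recvbuf[NNEIGH],
                   int count, const int neighbour[NNEIGH], MPI_Comm comm)
{
    MPI_Request req[2 * NNEIGH];
    int i;

    for (i = 0; i < NNEIGH; i++) {
        MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, neighbour[i], 0, comm,
                  &req[i]);
        MPI_Isend(sendbuf[i], count, MPI_DOUBLE, neighbour[i], 0, comm,
                  &req[NNEIGH + i]);
    }

    /* All 16 requests are completed here. */
    MPI_Waitall(2 * NNEIGH, req, MPI_STATUSES_IGNORE);
}

Timing this routine separately from the computation (e.g. with MPI_Wtime around
the call) makes it easier to see where the extra time goes as the core count
grows.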

  The application is of the 'domain partition' type,
and the performance, together with the memory footprint,
is virtually identical on all cores. The memory footprint, however, is about
twice as high in the OMPI case (1.5 GB/core) as in the MPT case (0.7 GB/core).

  What could be wrong with all this, please?

  I attached the output of 'ompi_info -all', together with the config.log.
I compiled OMPI with icc, and I checked that NUMA and affinity are OK.

I use the following command to run my OMPI app:
"mpiexec -mca btl_openib_rdma_pipeline_send_length 65536\
  -mca btl_openib_rdma_pipeline_frag_size 65536\
  -mca btl_openib_min_rdma_pipeline_size 65536\
  -mca btl_self_rdma_pipeline_send_length 262144\
  -mca btl_self_rdma_pipeline_frag_size 262144\
  -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1\
  -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128\
  -mca coll_tuned_pre_allocate_memory_comm_size_limit 128\
  -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128\
  -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0\
  -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1\
  -mca btl sm,openib,self -mca btl_openib_want_fork_support 0\
  -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1\
  -mca osc_rdma_no_locks 1\
  $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".

  OpenIB info:

1) OFED-1.4.1, installed by SGI

2) Linux xx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010 
x86_64 x86_64 x86_64 GNU/Linux
OS : SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200

3) Most probably running an SGI subnet manager

4) > ibv_devinfo (on a worker node)
hca_id: mlx4_0
        fw_ver:             2.7.000
        node_guid:          0030:48ff:ffcc:4c44
        sys_image_guid:     0030:48ff:ffcc:4c47
        vendor_id:          0x02c9
        vendor_part_id:     26418
        hw_ver:             0xA0
        board_id:           SM_207101000
        phys_port_cnt:      2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       6009
                        port_lmc:       0x00

                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       6010
                        port_lmc:       0x00

5) > ifconfig -a (on a worker node)
eth0  Link encap:Ethernet  HWaddr 00:30:48:CE:73:30
   inet adr:192.168.159.10  Bcast:192.168.159.255  Masque:255.255.255.0
   adr inet6: fe80::230:48ff:fece:7330/64 Scope:Lien
   UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
   RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
   TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 lg file transmission:1000
   RX bytes:11486224753 (10954.1 Mb)  TX bytes:16450996864 (15688.8 Mb)
   Mémoire:fbc6-fbc8

eth1  Link encap:Ethernet  HWaddr 00:30:48:CE:73:31
   BROADCAST MULTICAST  MTU:1500  Metric:1
   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 lg file transmission:1000
   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
   Mémoire:fbce-fbd0

ib0   Link encap:UNSPEC  HWaddr 
80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
   inet adr:10.148.9.198  Bcast:10.148.255.255  Masque:255.255.0.0
   adr 

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-20 Thread George Bosilca
Are your processes placed the same way with the two MPI implementations?
Per-node vs. per-core?

  george.

On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:

> Bonjour,
> 
>  I am now at a loss with my running of OpenMPI (namely 1.4.3)
> on an SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.
> 
> [snip]
> 
> 5) > ifconfig -a (on a worker node)
> eth0  Link encap:Ethernet  HWaddr 00:30:48:CE:73:30  
>   inet adr:192.168.159.10  Bcast:192.168.159.255  Masque:255.255.255.0
>   adr inet6: fe80::230:48ff:fece:7330/64 Scope:Lien
>   UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 lg file transmission:1000 
>   RX bytes:11486224753 (10954.1 Mb)  TX bytes:16450996864 (15688.8 Mb)
>   Mémoire:fbc6-fbc8 
> 
> eth1  Link encap:Ethernet  HWaddr 00:30:48:CE:73:31  
>   BROADCAST MULTICAST  MTU:1500  Metric:1
>   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 lg file transmission:1000 
>   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>   Mémoire:fbce-fbd0 
> 
> ib0   Link encap:UNSPEC  HWaddr 
> 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00  
>   inet adr:10.148.9.198  Bcast:10.148.255.255