Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Chris Samuel
On Friday, 23 February 2018 2:52:54 AM AEDT John Hearns via Beowulf wrote:

> Oh, and use the Adaptive computing HPL calculator to get your input file.
> Thanks Adaptive guys!

I think you mean Advanced Clustering... :-)

http://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
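
For reference, the sizing rule such calculators apply is roughly N =
sqrt(0.8 * total RAM in bytes / 8), rounded down to a multiple of NB, with
P x Q equal to the MPI rank count and P <= Q. A quick sketch of that
arithmetic (64 GB per node and NB=192 are just illustrative values, not
anything from Prentice's setup):

$ awk 'BEGIN { n = int(sqrt(0.80 * 64 * 2^30 / 8)); print n - n % 192 }'
82752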

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Chris Samuel
On Friday, 23 February 2018 1:45:00 AM AEDT Joe Landman wrote:

> 85% makes the assumption that you have the systems configured in an
> optimal manner, that the compiler doesn't do anything wonky, and that,
> to some degree, you isolate the OS portion of the workload off of most
> of the cores to reduce jitter.   Among other things.

Interesting. Purchases I've done for IB clusters have specified an HPL Rmax
of 80% of Rpeak as the acceptance-test requirement, and we've not had much
problem hitting it.

The worst issue we had was on SandyBridge, where the kernel ignored the UEFI
settings, said "hey, I know these CPUs, I'll enable ALL the power saving",
and that killed performance until we disabled those states via the kernel
boot parameters.

It can be worth running powertop to see what states your CPUs are sitting in 
whilst running HPL, and also "perf top" to see what the system is up to.
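
Something along these lines is usually enough to spot it (a sketch; the
sysfs path and tools are the stock cpufreq/cpuidle ones, adjust for your
distro):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # want "performance"
$ cpupower idle-info    # which C-states the kernel actually enabled
$ powertop              # C-state residency while xhpl is running
$ perf top              # confirm the cycles are going into the BLAS kernels

Kernel boot parameters along the lines of intel_idle.max_cstate=0
processor.max_cstate=1 are the sort of thing meant above; the right choice
depends on the platform.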

Good luck!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Prentice Bisbal

Joe,

Thanks for the link. Based on that, they should be pretty close in 
performance, and mine are not, so I must be doing something wrong with 
my OpenBLAS build. Since ACML is dead, I was hoping I could use OpenBLAS 
moving forward.


Prentice

On 02/22/2018 06:01 PM, Joe Landman wrote:
ACML is hand coded assembly.  Not likely that OpenBLAS will be much 
better.  Could be similar.  c.f. 
http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-and-ml.html




On 02/22/2018 05:48 PM, Prentice Bisbal wrote:
Just rebuilt OpenBLAS 0.2.20 locally on the test system with GCC 
6.1.0, and I'm only getting 91 GFLOPS. I'm pretty sure OpenBLAS 
performance should be close to ACML performance, if not better. I'll 
have to dig into this later. For now, I'm going to continue my 
testing using the ACML-based build and revisit the OpenBLAS 
performance later.


Prentice

On 02/22/2018 05:27 PM, Prentice Bisbal wrote:
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, 
and I'm now getting 197 GFLOPS, so clearly there's a problem with my 
OpenBLAS build. I'm going to try building OpenBLAS without the 
dynamic arch support on the machine where I plan on running my 
tests, and see if that version of the library is any better.


Prentice

On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?

I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!












Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Joe Landman
ACML is hand coded assembly.  Not likely that OpenBLAS will be much 
better.  Could be similar.  c.f. 
http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-and-ml.html




On 02/22/2018 05:48 PM, Prentice Bisbal wrote:
Just rebuilt OpenBLAS 0.2.20 locally on the test system with GCC 
6.1.0, and I'm only getting 91 GFLOPS. I'm pretty sure OpenBLAS 
performance should be close to ACML performance, if not better. I'll 
have to dig into this later. For now, I'm going to continue my testing 
using the ACML-based build and revisit the OpenBLAS performance later.


Prentice

On 02/22/2018 05:27 PM, Prentice Bisbal wrote:
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, 
and I'm now getting 197 GFLOPS, so clearly there's a problem with my 
OpenBLAS build. I'm going to try building OpenBLAS without the 
dynamic arch support on the machine where I plan on running my tests, 
and see if that version of the library is any better.


Prentice

On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?

I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!








--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman



Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Benson Muite

Consider trying:
https://github.com/amd/blis
https://github.com/clMathLibraries/clBLAS

as well.
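
A minimal sketch of the BLIS route (the configure target and install prefix
are just illustrative; clBLAS targets OpenCL devices, typically GPUs):

$ git clone https://github.com/amd/blis && cd blis
$ ./configure --prefix=$HOME/blis auto   # "auto" picks the kernels for the local CPU
$ make -j && make install

Then point the LAdir/LAinc/LAlib variables in HPL's Make.<arch> file at the
installed libblis instead of OpenBLAS.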

On 02/23/2018 12:48 AM, Prentice Bisbal wrote:
Just rebuilt OpenBLAS 0.2.20 locally on the test system with GCC 6.1.0, 
and I'm only getting 91 GFLOPS. I'm pretty sure OpenBLAS performance 
should be close to ACML performance, if not better. I'll have to dig 
into this later. For now, I'm going to continue my testing using the 
ACML-based build and revisit the OpenBLAS performance later.


Prentice

On 02/22/2018 05:27 PM, Prentice Bisbal wrote:
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, 
and I'm now getting 197 GFLOPS, so clearly there's a problem with my 
OpenBLAS build. I'm going to try building OpenBLAS without the dynamic 
arch support on the machine where I plan on running my tests, and see 
if that version of the library is any better.


Prentice

On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?

I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!











Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Prentice Bisbal
Just rebuilt OpenBLAS 0.2.20 locally on the test system with GCC 6.1.0, 
and I'm only getting 91 GFLOPS. I'm pretty sure OpenBLAS performance 
should be close to ACML performance, if not better. I'll have to dig 
into this later. For now, I'm going to continue my testing using the 
ACML-based build and revisit the OpenBLAS performance later.


Prentice

On 02/22/2018 05:27 PM, Prentice Bisbal wrote:
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, 
and I'm now getting 197 GFLOPS, so clearly there's a problem with my 
OpenBLAS build. I'm going to try building OpenBLAS without the dynamic 
arch support on the machine where I plan on running my tests, and see 
if that version of the library is any better.


Prentice

On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?

I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!








Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Prentice Bisbal
So I just rebuilt HPL using the ACML 6.1.0 libraries with GCC 6.1.0, and 
I'm now getting 197 GFLOPS, so clearly there's a problem with my 
OpenBLAS build. I'm going to try building OpenBLAS without the dynamic 
arch support on the machine where I plan on running my tests, and see if 
that version of the library is any better.
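
For reference, a sketch of what that non-dynamic build would look like
(dropping DYNAMIC_ARCH and naming the target instead; TARGET=BULLDOZER is
OpenBLAS's name for the Interlagos/Opteron 6200 family):

$ make TARGET=BULLDOZER CC=gcc FC=gfortran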


Prentice

On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?

I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!






Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Prentice Bisbal

For OpenBLAS, or HPL?

For HPL, I used GCC 6.1.0 with these flags:

$ egrep -i "flags|defs" Make.gcc-6.1.0_openblas-0.2.19
F2CDEFS  = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CCNOOPT  = $(HPL_DEFS)
OMP_DEFS = -openmp
CCFLAGS  = $(HPL_DEFS) -march=barcelona -O3 -Wall
LINKFLAGS    = $(CCFLAGS) $(OMP_DEFS)
ARFLAGS  = r

For OpenBLAS:

make DYNAMIC_ARCH=1 CC=gcc FC=gfortran

# This little summary is printed out at the end of the build:

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS   ... Linux
  Architecture ... x86_64
  BINARY   ... 64bit
  C compiler   ... GCC  (command line : gcc)
  Fortran compiler ... GFORTRAN  (command line : gfortran)
  Library Name ... libopenblasp-r0.2.19.a (Multi threaded; Max num-threads is 8)

Prentice

On 02/22/2018 11:58 AM, Joe Landman wrote:

which compiler are you using, and what options are you compiling it with?


On 02/22/2018 11:48 AM, Prentice Bisbal wrote:

On 02/22/2018 10:44 AM, Michael Di Domenico wrote:

I can't speak to AMD, but using HPL 2.1 on Intel with the Intel
compiler and the Intel MKL, I can hit 90% without issue; no major
tuning either.

If you're at 33% I would be suspicious of your math library.


I'm using OpenBLAS 0.2.19 with dynamic architecture support, but I'm
thinking of switching to ACML for this test, to rule out the possibility
that the problem is in my OpenBLAS build.








Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Prentice Bisbal

This is my source for those theoretical numbers:

http://dewaele.org/~robbe/thesis/writing/references/49747D_HPC_Processor_Comparison_v3_July2012.pdf

If those numbers are off, that makes my job a bit easier.  And it looks 
like you're right. In the text above the table, it does mention 2-socket 
servers, and then below the table in fine print, it states


"For AMD Opteron Processors, theoretical FLOPS = Core Count x Core 
Frequency x number of processors per server x 4."
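
For the 6274 that fine-print formula works out, per 2-socket server, as:

$ awk 'BEGIN { print 16 * 2.2 * 2 * 4 }'
281.6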


Why can't the table just show single-socket performance?

Regardless of bad marketing and graphic design, I'm still at square one. My
system has 2 sockets, and the best I've been able to do is ~115 GFLOPS, and
that's one of the 'instantaneous' values LINPACK spits out every few
seconds. At the end of the test, the actual GFLOPS result is more like
77 GFLOPS:


===
T/V               N      NB     P     Q        Time       Gflops
WR00L2L2      82775      40     4     8     4924.71    7.678e+01

This is a two socket system, so that's only 27% of theoretical max.
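
For the record, the arithmetic behind that figure:

$ awk 'BEGIN { printf "%.0f%%\n", 100 * 76.78 / 281.6 }'
27%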

Prentice

On 02/22/2018 01:18 PM, Dmitri Chubarov wrote:

Hi,

not sure if the 282 GFLOPS number is correct.

We have 16 Bulldozer/Interlagos cores at 2.2 GHz. Each pair of cores 
forms a CMT module. The two cores in the module share an FPU with 2 
128-bit FMAC units.


In terms of double-precision FLOPS that gives 16 * 2.2 GHz * 2
double-precision scalars per SIMD register * 2 FLOPS per FMA op = 140.8
GFLOPS.

It looks like the 282 GFLOPS number is for a 2P node.

Dima

On 22 February 2018 at 21:37, Prentice Bisbal wrote:


Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?

I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!

-- 
Prentice









Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Dmitri Chubarov
Hi,

not sure if the 282 GFLOPS number is correct.

We have 16 Bulldozer/Interlagos cores at 2.2 GHz. Each pair of cores forms
a CMT module. The two cores in the module share an FPU with 2 128-bit FMAC
units.

In terms of double-precision FLOPS that gives 16 * 2.2 GHz * 2
double-precision scalars per SIMD register * 2 FLOPS per FMA op = 140.8
GFLOPS.

It looks like the 282 GFLOPS number is for a 2P node.

Dima

On 22 February 2018 at 21:37, Prentice Bisbal  wrote:

> Beowulfers,
>
> In your experience, how close does actual performance of your processors
> match up to their theoretical performance? I'm investigating a performance
> issue on some of my nodes. These are older systems using AMD Opteron 6274
> processors. I found literature from AMD stating that the theoretical
> performance of these processors is 282 GFLOPS, and my LINPACK performance
> isn't coming close to that (I get approximately 33% of that). The number I
> often hear mentioned is that actual performance should be ~85% of
> theoretical performance. Is that a realistic number in your experience?
>
> I don't want this to be a discussion of what could be wrong at this point;
> we will get to that in future posts, I assure you!
>
> --
> Prentice
>


Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread David Mathog

On Thu, 22 Feb 2018 09:37:54 -0500 Prentice Bisbal wrote:


I found literature from AMD stating the theoretical performance of these
processors is 282 GFLOPS, and my LINPACK performance isn't coming close to
that (I get approximately 33% of that).


That does seem low.  Check the usual culprits:

1.  CPU frequency scaling locked to the lowest setting, or set to a
governor that adjusts on the fly and interacts poorly with the test
software.  The rated performance will have been measured with the CPU
locked to its highest frequency.  (See the sketch after this list.)


2.  something else running, especially something which forces the test 
program out of memory or file caches.  I wouldn't expect this sort of 
test to be IO bound to disk, but if it is, and hugepages are used, 
enormous performance drops may be observed when the system decides to 
move those around.  I wouldn't put it past AMD or Intel to run these 
sorts of tests with the test system stripped down to the bones.  No 
network, no logging, single user, etc.  That is, absolutely nothing that 
would compete for CPU time.  (Just checked on one of our big systems.  
ps -ef | wc shows 953 processes:  48 migration, 48 ksoftirqd, 49 
stopper, 49 watchdog, 49 kintegrityd, 49 kblockd, 49 ata_sff, 49 md, 49 
md_misc, 49 aio, 49 crypto, 49 kthrotld, 49 rpciod, 19 gdm (console 
processes, even with no display attached at the moment and nobody logged 
in there), 193 events, 12 of my processes, and 107 miscellaneous OS 
processes.)


3.  ulimit settings.  /etc/security/limits.conf settings.

4.  NUMA issues.  Multithreaded programs sometimes allocate one large
block of memory up front, which lands on one side of a NUMA system, and
then start some or all of their threads on the other side.  Threads on
the wrong side run a variable amount slower than those on the right
side.  If that is what is happening, locking all threads to the same side
of the system (if it has just two sides) can speed things up a bit,
assuming the run isn't supposed to use every core.  (See the pinning
sketch after this list.)


5.  Different compiler/optimization.  The vendor may have used a binary
tweaked to the Nth degree, perhaps even using profiling from earlier runs
to optimize the final run.  If you are using a benchmark number from AMD,
see if you can obtain the exact same version of the test software that
they used (it may be available), so that you can eliminate this variable.
Perhaps wherever they keep it they also have a detailed description of
the test system.
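
A sketch of quick checks for items 1 and 4 above (the tools are stock Linux
ones; node and core numbers are whatever your box reports, nothing specific
to Prentice's system):

# item 1: frequency scaling
$ cpupower frequency-info
$ grep MHz /proc/cpuinfo | sort | uniq -c

# item 4: NUMA layout, then a trial run pinned to one node
$ numactl --hardware
$ numactl --cpunodebind=0 --membind=0 ./xhpl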


Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Joe Landman

which compiler are you using, and what options are you compiling it with?


On 02/22/2018 11:48 AM, Prentice Bisbal wrote:

On 02/22/2018 10:44 AM, Michael Di Domenico wrote:

I can't speak to AMD, but using HPL 2.1 on Intel with the Intel
compiler and the Intel MKL, I can hit 90% without issue; no major
tuning either.

If you're at 33% I would be suspicious of your math library.


I'm using OpenBLAS 0.2.19 with dynamic architecture support, but I'm
thinking of switching to ACML for this test, to rule out the possibility
that the problem is in my OpenBLAS build.




--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman



Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Prentice Bisbal

On 02/22/2018 10:44 AM, Michael Di Domenico wrote:

I can't speak to AMD, but using HPL 2.1 on Intel with the Intel
compiler and the Intel MKL, I can hit 90% without issue; no major
tuning either.

If you're at 33% I would be suspicious of your math library.


I'm using OpenBLAS 0.2.19 with dynamic architecture support, but I'm
thinking of switching to ACML for this test, to rule out the possibility
that the problem is in my OpenBLAS build.




Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread John Hearns via Beowulf
Oh, and use the Adaptive computing HPL calculator to get your input file.
Thanks Adaptive guys!

On 22 February 2018 at 16:44, Michael Di Domenico wrote:

> I can't speak to AMD, but using HPL 2.1 on Intel with the Intel
> compiler and the Intel MKL, I can hit 90% without issue; no major
> tuning either.
>
> If you're at 33% I would be suspicious of your math library.
>
> On Thu, Feb 22, 2018 at 9:37 AM, Prentice Bisbal  wrote:
> > Beowulfers,
> >
> > In your experience, how close does actual performance of your processors
> > match up to their theoretical performance? I'm investigating a performance
> > issue on some of my nodes. These are older systems using AMD Opteron 6274
> > processors. I found literature from AMD stating that the theoretical
> > performance of these processors is 282 GFLOPS, and my LINPACK performance
> > isn't coming close to that (I get approximately 33% of that). The number I
> > often hear mentioned is that actual performance should be ~85% of
> > theoretical performance. Is that a realistic number in your experience?
> >
> > I don't want this to be a discussion of what could be wrong at this point;
> > we will get to that in future posts, I assure you!
> >
> > --
> > Prentice
> >


Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Michael Di Domenico
I can't speak to AMD, but using HPL 2.1 on Intel with the Intel
compiler and the Intel MKL, I can hit 90% without issue; no major
tuning either.

If you're at 33% I would be suspicious of your math library.

On Thu, Feb 22, 2018 at 9:37 AM, Prentice Bisbal  wrote:
> Beowulfers,
>
> In your experience, how close does actual performance of your processors
> match up to their theoretical performance? I'm investigating a performance
> issue on some of my nodes. These are older systems using AMD Opteron 6274
> processors. I found literature from AMD stating that the theoretical
> performance of these processors is 282 GFLOPS, and my LINPACK performance
> isn't coming close to that (I get approximately 33% of that). The number I
> often hear mentioned is that actual performance should be ~85% of
> theoretical performance. Is that a realistic number in your experience?
>
> I don't want this to be a discussion of what could be wrong at this point;
> we will get to that in future posts, I assure you!
>
> --
> Prentice
>


Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Benson Muite
There is a very nice and simple max-FLOPS code that requires much less
tuning than LINPACK. It is described on p. 57 of:


Rahman "Intel® Xeon Phi™ Coprocessor Architecture and Tools"
https://link.springer.com/book/10.1007%2F978-1-4302-5927-5

An example Fortran code is here:
https://github.com/bkmgit/intel-xeon-phi-coprocessor-architecture-tools/tree/master/ch05

On 02/22/2018 05:16 PM, John Hearns via Beowulf wrote:

Prentice, I echo what Joe says.
When doing benchmarking with HPL or SPEC benchmarks, I would optimise 
the BIOS settings to the highest degree I could.

Switch off processor C-states.
As Joe says, you need to look at what the OS is running in the
background. I would disable the Bright Cluster Manager daemon, for
instance.

85% of theoretical peak on an HPL run sounds reasonable to me, and I
would get figures in that ballpark.

For your AMDs I would start by choosing one system, no interconnect to
muddy the waters. See what you can get out of that.










On 22 February 2018 at 15:45, Joe Landman wrote:




On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?


85% makes the assumption that you have the systems configured in an
optimal manner, that the compiler doesn't do anything wonky, and
that, to some degree, you isolate the OS portion of the workload off
of most of the cores to reduce jitter.   Among other things.

At Scalable, I'd regularly hit 60-90 % of theoretical max computing
performance, with progressively more heroic tuning.   Storage, I'd
typically hit 90-95% of theoretical max (good architectures almost
always beat bad ones).  Networking, fairly similar, though tuning
per use case mattered significantly.


I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!


-- 
Joe Landman

t: @hpcjoe
w: https://scalability.org











Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread John Hearns via Beowulf
Prentice, I echo what Joe says.
When doing benchmarking with HPL or SPEC benchmarks, I would optimise the
BIOS settings to the highest degree I could.
Switch off processor C-states.
As Joe says, you need to look at what the OS is running in the background.
I would disable the Bright Cluster Manager daemon, for instance.

85% of theoretical peak on an HPL run sounds reasonable to me, and I would
get figures in that ballpark.

For your AMDs I would start by choosing one system, no interconnect to
muddy the waters. See what you can get out of that.
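
A single-node run is mostly a matter of matching P x Q in HPL.dat to the
local rank count; a sketch, assuming Open MPI and the 2 x 16-core Opterons
under discussion:

$ mpirun -np 32 --bind-to core ./xhpl   # with P=4, Q=8 in HPL.dat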









On 22 February 2018 at 15:45, Joe Landman  wrote:

>
>
> On 02/22/2018 09:37 AM, Prentice Bisbal wrote:
>
>> Beowulfers,
>>
>> In your experience, how close does actual performance of your processors
>> match up to their theoretical performance? I'm investigating a performance
>> issue on some of my nodes. These are older systems using AMD Opteron 6274
>> processors. I found literature from AMD stating that the theoretical
>> performance of these processors is 282 GFLOPS, and my LINPACK performance
>> isn't coming close to that (I get approximately 33% of that). The number I
>> often hear mentioned is that actual performance should be ~85% of
>> theoretical performance. Is that a realistic number in your experience?
>>
>
> 85% makes the assumption that you have the systems configured in an
> optimal manner, that the compiler doesn't do anything wonky, and that, to
> some degree, you isolate the OS portion of the workload off of most of the
> cores to reduce jitter.   Among other things.
>
> At Scalable, I'd regularly hit 60-90 % of theoretical max computing
> performance, with progressively more heroic tuning.   Storage, I'd
> typically hit 90-95% of theoretical max (good architectures almost always
> beat bad ones).  Networking, fairly similar, though tuning per use case
> mattered significantly.
>
>
>> I don't want this to be a discussion of what could be wrong at this point;
>> we will get to that in future posts, I assure you!
>>
>>
> --
> Joe Landman
> t: @hpcjoe
> w: https://scalability.org
>
>


Re: [Beowulf] Theoretical vs. Actual Performance

2018-02-22 Thread Joe Landman



On 02/22/2018 09:37 AM, Prentice Bisbal wrote:

Beowulfers,

In your experience, how close does actual performance of your processors
match up to their theoretical performance? I'm investigating a performance
issue on some of my nodes. These are older systems using AMD Opteron 6274
processors. I found literature from AMD stating that the theoretical
performance of these processors is 282 GFLOPS, and my LINPACK performance
isn't coming close to that (I get approximately 33% of that). The number I
often hear mentioned is that actual performance should be ~85% of
theoretical performance. Is that a realistic number in your experience?


85% makes the assumption that you have the systems configured in an 
optimal manner, that the compiler doesn't do anything wonky, and that, 
to some degree, you isolate the OS portion of the workload off of most 
of the cores to reduce jitter.   Among other things.
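
One illustrative way of doing that isolation (not necessarily Joe's exact
recipe; it assumes a 32-core node and that the MPI launcher handles the
pinning): boot with isolcpus=1-31 nohz_full=1-31 so kernel housekeeping
stays on core 0, then see what is actually running on the benchmark cores:

$ cat /sys/devices/system/cpu/isolated
$ ps -eo pid,psr,comm --sort=psr | awk '$2 != 0'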


At Scalable, I'd regularly hit 60-90 % of theoretical max computing 
performance, with progressively more heroic tuning.   Storage, I'd 
typically hit 90-95% of theoretical max (good architectures almost 
always beat bad ones).  Networking, fairly similar, though tuning per 
use case mattered significantly.




I don't want this to be a discussion of what could be wrong at this point;
we will get to that in future posts, I assure you!




--
Joe Landman
t: @hpcjoe
w: https://scalability.org

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf