Re: [Valgrind-users] DHAT speed

2020-05-27 Thread John Reiser

On 5/27/20 Paul FLOYD wrote:


Well, no real surprises. This is with a testcase that runs standalone in about 
5 seconds and under DHAT in about 200 seconds (so a reasonable slowdown of 40x).

# Overhead  Command   Shared Object
Symbol
#   ...  ..  

#
29.11%  dhat-amd64-linu  dhat-amd64-linux[.] interval_tree_Cmp
21.13%  dhat-amd64-linu  perf-26905.map  [.] 0x0010057a25f8
13.32%  dhat-amd64-linu  dhat-amd64-linux[.] vgPlain_lookupFM
 9.56%  dhat-amd64-linu  dhat-amd64-linux[.] dh_handle_read
 8.83%  dhat-amd64-linu  dhat-amd64-linux[.] vgPlain_nextIterFM
 4.66%  dhat-amd64-linu  dhat-amd64-linux[.] check_for_peak
 1.85%  dhat-amd64-linu  dhat-amd64-linux[.] vgPlain_disp_cp_xindir
 1.32%  dhat-amd64-linu  [kernel.kallsyms]   [k] 0x8103ec0a
 1.00%  dhat-amd64-linu  dhat-amd64-linux[.] dh_handle_write


To me this suggests two things:

1) investigate the coding of the 4 or 5 highest-use subroutines 
(interval_tree_Cmp,
vgPlain_lookupFM, dh_handle_read, vgPlain_nextIterFM)

2) see whether DHAT might recognize and use higher-level abstractions
than MemoryRead and MemoryWrite of individual addresses.  Similar to memcheck
intercepting and analyzing strlen (etc.) as a complete concept instead of as its
individual Reads and Writes, perhaps DHAT could intercept (and/or recognize)
vector linear search, vector addition, vector partial sum, other BLAS routines, 
etc.,
and then analyze the algorithm as a whole.


___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-27 Thread Paul FLOYD
[snip - perf]

Well, no real surprises. This is with a testcase that runs standalone in about 
5 seconds and under DHAT in about 200 seconds (so a reasonable slowdown of 40x).

# Overhead  Command   Shared Object 


   
Symbol
#   ...  ..  

#
   29.11%  dhat-amd64-linu  dhat-amd64-linux[.] interval_tree_Cmp   



   21.13%  dhat-amd64-linu  perf-26905.map  [.] 0x0010057a25f8  



   13.32%  dhat-amd64-linu  dhat-amd64-linux[.] vgPlain_lookupFM



9.56%  dhat-amd64-linu  dhat-amd64-linux[.] dh_handle_read  



8.83%  dhat-amd64-linu  dhat-amd64-linux[.] vgPlain_nextIterFM  



4.66%  dhat-amd64-linu  dhat-amd64-linux[.] check_for_peak  



1.85%  dhat-amd64-linu  dhat-amd64-linux[.] vgPlain_disp_cp_xindir  



1.32%  dhat-amd64-linu  [kernel.kallsyms]   [k] 0x8103ec0a  



1.00%  dhat-amd64-linu  dhat-amd64-linux[.] dh_handle_write  


A+
Paul

___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread John Reiser

 > So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100 
hours.

What I'm seeing is a DHAT-only slowdown that is much more than that.


Running 'perf' is likely to give data and strong hints about what is going on.
The overhead of 'perf' is only a few percent.

perf record valgrind --tool=DHAT <> ./my_app <>
perf report  > perf_output.txt

The "perf record ..." will stop shortly after the valgrind sub-process 
terminates.
You don't have to wait for DHAT to finish; just 'kill' it after a while.


___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread Paul FLOYD
> Message du 26/05/20 13:19> De : "John Reiser"  the ratio is about 1:50. So 
> right away, that's a hardware slowdown of 4X.Maybe more. The machine has 
> 12Mbyte of cache according to cpuinfo.> Valgrind runs every tool 
> single-threaded. So if your app averages 5 active threads,> then that is a 
> slowdown of 5X.I was running the application in single thread mode.> 
> Valgrind's JIT (Just-In-Time) instruction emulator has a slowdown. Assume 10X 
> (or measure nulgrind.)Yes, this is what I see with nulgrind, about a 11x 
> factor slowdown. However this will also account for a large part of the cache 
> overhead.> Finally we get to "useful work": the slowdown of the tool DHAT. 
> Assume 3X.> So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes 
> into 100 hours.What I'm seeing is a DHAT-only slowdown that is much more than 
> that.A+Paul___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread Paul FLOYD
 > That doesn't sound right. I use DHAT extensively and expect a slowdown of> 
perhaps 50:1, maybe less. What you're describing is a slowdown factor of> at 
least several thousand.>> Bear in mind though that (1) V sequentialises thread 
execution, which wil> make a big difference if the program is heavily 
multithreaded, and (2)> I suspect dhat's scheme of looking up all memory 
accesses in an AVL tree> (of malloc'd blocks) doesn't scale all that well if 
you have tens of> millions of blocks.>> Can you run it on a smaller workload?  
HiI'll try on something smaller and also get some info on the number of blocks 
of memory allocated.A+Paul___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread John Reiser

The server has 48Gbytes of RAM and only about 6Gbytes is being used.

The executable is quite big

text data bss dec hex filename
57369168 417156 20903108 78689432 4b0b498 [snip]

The run under DHAT is using about 2Gbytes virtual and 1.5Gbytes resident 
according to htop. Running standalone those are about 750M and 350M 
respectively.


Some hardware cache+memory delays:
   L1 hit3 cycles  ( 32KB size)
   L2 hit   11 cycles  (256KB size)
   L3 hit   25 cycles  (4MB to 40MB size)
   miss180 cycles
The dynamic RAM chips commonly used for main memory have stayed the same speed
for over 25 years: 60 nanoseconds from CAS (Column Address Strobe) to DataOut.
If the CPU runs at 3GHz, then a cache miss costs at least 180 cycles.
A quick estimate of (largest and slowest) cache size is given by the "cache 
size"
line from /proc/cpuinfo:
-
$ sed 9q  < /proc/cpuinfo  # on a 8-year old consumer-grade machine
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 42
model name  : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
stepping: 7
microcode   : 0x2f
cpu MHz : 1599.982
cache size  : 6144 KB
-

Assume that your "big" server has 30MB of L3 cache.  For a resident set size of
350MB then that is a ratio of about 1:12.  For a resident set size of 1.5GB then
the ratio is about 1:50.  So right away, that's a hardware slowdown of 4X.

Valgrind runs every tool single-threaded.  So if your app averages 5 active 
threads,
then that is a slowdown of 5X.

Valgrind's JIT (Just-In-Time) instruction emulator has a slowdown.  Assume 10X 
(or measure nulgrind.)

Finally we get to "useful work": the slowdown of the tool DHAT.  Assume 3X.

So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100 
hours.


___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread Paul FLOYD


On 5/26/20, Paul FLOYD wrote:

> > I'm running DHAT on what I consider to be a relatively small example. 
> > Standalone the executable runs in a bit under 10 minutes. Based on the CPU 
> > time that we print after every 10% of 
progress, under DHAT the same executable is going to take about 422 hours - 
about two and a half weeks.
> >
> > Does anyone have any ideas what could be causing it to be so slow?

> Please tell us about your environment.
> (EVERY query should supply this information!)

I just wanted to have an idea of the general expectation for DHAT overhead.

> Which version of DHAT? Which hardware? Which operating system and version?
> Which compiler and version built DHAT? Which compiler(s) built the program 
> under test?

Red Hat Enterprise Linux Workstation release 6.5 (Santiago)

gcc (GCC) 5.3.0 for the test executable.
gcc version 9.2.0 (GCC) for building Valgrind, built from git HEAD.

The server has 48Gbytes of RAM and only about 6Gbytes is being used.

The executable is quite big

text data bss dec hex filename
57369168 417156 20903108 78689432 4b0b498 [snip]

The run under DHAT is using about 2Gbytes virtual and 1.5Gbytes resident 
according to htop. Running standalone those are about 750M and 350M 
respectively.

I'll have to find a much smaller test before I consider doing any profiling of 
DHAT. I'll also try nullgrind to see if it really is the DHAT instrumentation 
overhead that is slowing things down.

A+
Paul


___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread Julian Seward


That doesn't sound right.  I use DHAT extensively and expect a slowdown of
perhaps 50:1, maybe less.  What you're describing is a slowdown factor of
at least several thousand.

Bear in mind though that (1) V sequentialises thread execution, which wil
make a big difference if the program is heavily multithreaded, and (2)
I suspect dhat's scheme of looking up all memory accesses in an AVL tree
(of malloc'd blocks) doesn't scale all that well if you have tens of
millions of blocks.

Can you run it on a smaller workload?

J

On 26/05/2020 09:21, Paul FLOYD wrote:

Hi
  
I'm running DHAT on what I consider to be a relatively small example. Standalone the executable runs in a bit under 10 minutes. Based on the CPU time that we print after every 10% of progress, under DHAT the same executable is going to take about 422 hours - about two and a half weeks.
  
Does anyone have any ideas what could be causing it to be so slow? Indeed, is this the sort of slowdown that I should be expecting with DHAT? The executable is intensive in both memory and floating point. Probably not helping matters, the data structures that I want to look at are over 1kB in size so I tweaked the HISTOGRAM_SIZE_LIMIT to bump it up to 2kb.
  
On the DHAT side, I have thought of trying to use some macro hackery to try to inline the avl comparator function calls. Otherwise I don't have much in the way of other ideas, and DHAT doesn't have any cli options to tweak things.
  
A+

Paul
  




___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users





___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


Re: [Valgrind-users] DHAT speed

2020-05-26 Thread John Reiser

On 5/26/20, Paul FLOYD wrote:


I'm running DHAT on what I consider to be a relatively small example. 
Standalone the executable runs in a bit under 10 minutes. Based on the CPU time 
that we print after every 10% of progress, under DHAT the same executable is 
going to take about 422 hours - about two and a half weeks.

Does anyone have any ideas what could be causing it to be so slow? 


Please tell us about your environment.
(EVERY query should supply this information!)

Which version of DHAT?  Which hardware?  Which operating system and version?
Which compiler and version built DHAT?  Which compiler(s) built the program 
under test?

How big is the program?  On a Linux system, run "size ./my_app; ldd ./my_app"
for static size, and "ps alx | grep my_app" for total VIRTual and RESident size.
How much RAM is available for use?  How much demand paging is occurring?

There are standard performance tools that apply to EVERY executable program,
not just DHAT.  On a Linux system with a DHAT that was built with symbol
information retained ("-g" command-line option to gcc or clang, and
no symbol stripping during static binding by /usr/bin/ld or /usr/bin/strip),
what does 'perf' say?
perf record valgrind --tool=DHAT <> ./my_app <>
perf report  > perf_output.txt

A smart user would supply such information in the first post.


___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users


[Valgrind-users] DHAT speed

2020-05-26 Thread Paul FLOYD
Hi
 
I'm running DHAT on what I consider to be a relatively small example. 
Standalone the executable runs in a bit under 10 minutes. Based on the CPU time 
that we print after every 10% of progress, under DHAT the same executable is 
going to take about 422 hours - about two and a half weeks.
 
Does anyone have any ideas what could be causing it to be so slow? Indeed, is 
this the sort of slowdown that I should be expecting with DHAT? The executable 
is intensive in both memory and floating point. Probably not helping matters, 
the data structures that I want to look at are over 1kB in size so I tweaked 
the HISTOGRAM_SIZE_LIMIT to bump it up to 2kb.
 
On the DHAT side, I have thought of trying to use some macro hackery to try to 
inline the avl comparator function calls. Otherwise I don't have much in the 
way of other ideas, and DHAT doesn't have any cli options to tweak things.
 
A+
Paul
 ___
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users