The server has 48Gbytes of RAM and only about 6Gbytes is being used.

The executable is quite big:

text data bss dec hex filename
57369168 417156 20903108 78689432 4b0b498 [snip]

The run under DHAT is using about 2Gbytes virtual and 1.5Gbytes resident 
according to htop. Running standalone those are about 750M and 350M 
respectively.

Some hardware cache+memory delays:
   L1 hit    3 cycles  ( 32KB size)
   L2 hit   11 cycles  (256KB size)
   L3 hit   25 cycles  (4MB to 40MB size)
   miss    180 cycles
The dynamic RAM chips commonly used for main memory have stayed the same speed
for over 25 years: 60 nanoseconds from CAS (Column Address Strobe) to DataOut.
If the CPU runs at 3GHz, then a cache miss costs at least 180 cycles.
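
The cycle count is just latency multiplied by clock rate, since 1 GHz is one
cycle per nanosecond. Here is a minimal Python sketch of that arithmetic using
the 60 ns and 3 GHz figures above. The function name is purely illustrative.

```python
def miss_cost_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Cycles stalled waiting for DRAM: nanoseconds times cycles per nanosecond."""
    # 1 GHz == 1 cycle per ns, so ns * GHz yields cycles directly.
    return latency_ns * clock_ghz


if __name__ == "__main__":
    # 60 ns CAS-to-DataOut at 3 GHz, the figures used in the text.
    print(miss_cost_cycles(60, 3.0))  # 180.0
    # The same DRAM latency costs more cycles on a faster clock.
    print(miss_cost_cycles(60, 4.0))  # 240.0
```

Because DRAM latency has stayed roughly fixed while clocks went up, the miss
penalty measured in cycles grows with clock speed.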
A quick estimate of the (largest and slowest) cache size is given by the
"cache size" line from /proc/cpuinfo:
-----
$ sed 9q < /proc/cpuinfo  # on an 8-year-old consumer-grade machine
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
stepping        : 7
microcode       : 0x2f
cpu MHz         : 1599.982
cache size      : 6144 KB
-----
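
To get that number from a script rather than by eye, one option is to pull out
the first "cache size" line. The following is a minimal Python sketch that
assumes the x86 Linux format shown above ("cache size : NNNN KB"). The line may
be missing on other architectures, in which case the function returns None.
The helper name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Matches e.g. "cache size\t: 6144 KB" at the start of a line.
_CACHE_RE = re.compile(r"^cache size\s*:\s*(\d+)\s*KB", re.MULTILINE)


def cache_size_kb(cpuinfo_text: str) -> Optional[int]:
    """Return the first 'cache size' value (in KB) in /proc/cpuinfo text, or None."""
    match = _CACHE_RE.search(cpuinfo_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():  # Linux only
        print(cache_size_kb(path.read_text()), "KB")
```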

Assume that your "big" server has 30MB of L3 cache.  For a resident set size of
350MB, that is a ratio of about 1:12.  For a resident set size of 1.5GB, the
ratio is about 1:50.  So right away, that's a hardware slowdown of about 4X.
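
The same ratios can be computed in a short Python sketch. It uses the assumed
30MB L3 and treats 1.5GB as 1500MB. The rule that slowdown scales with the
RSS-to-cache ratio is the rough model from the text, not a measured law.

```python
L3_MB = 30.0  # assumed L3 size for the server, as in the text


def rss_to_cache_ratio(rss_mb: float, cache_mb: float = L3_MB) -> float:
    """How many times larger the resident set is than the last-level cache."""
    return rss_mb / cache_mb


if __name__ == "__main__":
    native = rss_to_cache_ratio(350)       # ~11.7, i.e. roughly 1:12
    under_dhat = rss_to_cache_ratio(1500)  # 50.0, i.e. 1:50
    print(round(native), under_dhat, round(under_dhat / native, 1))  # 12 50.0 4.3
```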

Valgrind runs every tool single-threaded.  So if your app averages 5 active
threads, then that is a slowdown of 5X.
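
The "average active threads" figure can be estimated from a native run as
(user + sys CPU time) / wall-clock time. The following Unix-only Python sketch
assumes the app runs as a child process to completion. It uses the standard
resource and subprocess modules. The script and function names are
hypothetical.

```python
import resource
import subprocess
import sys
import time


def avg_active_threads(user_s: float, sys_s: float, wall_s: float) -> float:
    """CPU seconds burned per wall-clock second, i.e. average busy threads."""
    return (user_s + sys_s) / wall_s


def measure(cmd: list) -> float:
    # RUSAGE_CHILDREN covers children that have terminated and been waited for;
    # subprocess.run waits, so the delta below belongs to this one command.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return avg_active_threads(after.ru_utime - before.ru_utime,
                              after.ru_stime - before.ru_stime,
                              wall)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(f"average active threads: {measure(sys.argv[1:]):.2f}")
    else:
        print("usage: threads.py ./app [args...]")
```

Since Valgrind serializes the threads, this number is roughly the factor lost
to single-threading.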

Valgrind's JIT (Just-In-Time) instruction emulator has its own slowdown.
Assume 10X (or measure it with Nulgrind: valgrind --tool=none).
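
One way to measure rather than assume is to time the app natively and again
under "valgrind --tool=none". The following is a minimal Python sketch that
assumes valgrind is on PATH. The measured ratio also includes the
single-threading penalty, because Nulgrind serializes threads too. For a
single-threaded app it approximates the JIT factor alone. The script name is
hypothetical.

```python
import subprocess
import sys
import time


def wall_time(cmd: list) -> float:
    """Wall-clock seconds for one run of cmd."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start


def slowdown(tool_s: float, native_s: float) -> float:
    return tool_s / native_s


if __name__ == "__main__":
    app = sys.argv[1:]
    if app:
        native = wall_time(app)
        under_none = wall_time(["valgrind", "--tool=none", *app])
        print(f"nulgrind slowdown: {slowdown(under_none, native):.1f}X")
    else:
        print("usage: nulgrind_slowdown.py ./app [args...]")
```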

Finally we get to "useful work": the slowdown of the tool DHAT.  Assume 3X.

So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100
hours.
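
The estimate treats the four factors as independent and multiplies them. The
following Python sketch uses the assumed values from above. The labels are only
descriptive, and measured numbers can be substituted for any of them.

```python
from math import prod

# Rough, assumed factors from the text; replace with measurements if available.
factors = {
    "cache (RSS/L3 ratio)": 4,
    "single-threaded (5 active threads)": 5,
    "JIT emulation (nulgrind)": 10,
    "DHAT instrumentation": 3,
}


def estimated_hours(native_minutes: float, factors: dict) -> float:
    """Native run time scaled by the product of all slowdown factors, in hours."""
    return native_minutes * prod(factors.values()) / 60


if __name__ == "__main__":
    print(prod(factors.values()), estimated_hours(10, factors))  # 600 100.0
```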


_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users
