Re: [Valgrind-users] DHAT speed
On 5/27/20 Paul FLOYD wrote: Well, no real surprises. This is with a testcase that runs standalone in about 5 seconds and under DHAT in about 200 seconds (so a reasonable slowdown of 40x). # Overhead Command Shared Object Symbol # ... .. # 29.11% dhat-amd64-linu dhat-amd64-linux[.] interval_tree_Cmp 21.13% dhat-amd64-linu perf-26905.map [.] 0x0010057a25f8 13.32% dhat-amd64-linu dhat-amd64-linux[.] vgPlain_lookupFM 9.56% dhat-amd64-linu dhat-amd64-linux[.] dh_handle_read 8.83% dhat-amd64-linu dhat-amd64-linux[.] vgPlain_nextIterFM 4.66% dhat-amd64-linu dhat-amd64-linux[.] check_for_peak 1.85% dhat-amd64-linu dhat-amd64-linux[.] vgPlain_disp_cp_xindir 1.32% dhat-amd64-linu [kernel.kallsyms] [k] 0x8103ec0a 1.00% dhat-amd64-linu dhat-amd64-linux[.] dh_handle_write To me this suggests two things: 1) investigate the coding of the 4 or 5 highest-use subroutines (interval_tree_Cmp, vgPlain_lookupFM, dh_handle_read, vgPlain_nextIterFM) 2) see whether DHAT might recognize and use higher-level abstractions than MemoryRead and MemoryWrite of individual addresses. Similar to memcheck intercepting and analyzing strlen (etc.) as a complete concept instead of as its individual Reads and Writes, perhaps DHAT could intercept (and/or recognize) vector linear search, vector addition, vector partial sum, other BLAS routines, etc., and then analyze the algorithm as a whole. ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
[snip - perf] Well, no real surprises. This is with a testcase that runs standalone in about 5 seconds and under DHAT in about 200 seconds (so a reasonable slowdown of 40x). # Overhead Command Shared Object Symbol # ... .. # 29.11% dhat-amd64-linu dhat-amd64-linux[.] interval_tree_Cmp 21.13% dhat-amd64-linu perf-26905.map [.] 0x0010057a25f8 13.32% dhat-amd64-linu dhat-amd64-linux[.] vgPlain_lookupFM 9.56% dhat-amd64-linu dhat-amd64-linux[.] dh_handle_read 8.83% dhat-amd64-linu dhat-amd64-linux[.] vgPlain_nextIterFM 4.66% dhat-amd64-linu dhat-amd64-linux[.] check_for_peak 1.85% dhat-amd64-linu dhat-amd64-linux[.] vgPlain_disp_cp_xindir 1.32% dhat-amd64-linu [kernel.kallsyms] [k] 0x8103ec0a 1.00% dhat-amd64-linu dhat-amd64-linux[.] dh_handle_write A+ Paul ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
> So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100 hours. What I'm seeing is a DHAT-only slowdown that is much more than that. Running 'perf' is likely to give data and strong hints about what is going on. The overhead of 'perf' is only a few percent. perf record valgrind --tool=DHAT <> ./my_app <> perf report > perf_output.txt The "perf record ..." will stop shortly after the valgrind sub-process terminates. You don't have to wait for DHAT to finish; just 'kill' it after a while. ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
> Message du 26/05/20 13:19> De : "John Reiser" the ratio is about 1:50. So > right away, that's a hardware slowdown of 4X.Maybe more. The machine has > 12Mbyte of cache according to cpuinfo.> Valgrind runs every tool > single-threaded. So if your app averages 5 active threads,> then that is a > slowdown of 5X.I was running the application in single thread mode.> > Valgrind's JIT (Just-In-Time) instruction emulator has a slowdown. Assume 10X > (or measure nulgrind.)Yes, this is what I see with nulgrind, about a 11x > factor slowdown. However this will also account for a large part of the cache > overhead.> Finally we get to "useful work": the slowdown of the tool DHAT. > Assume 3X.> So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes > into 100 hours.What I'm seeing is a DHAT-only slowdown that is much more than > that.A+Paul___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
> That doesn't sound right. I use DHAT extensively and expect a slowdown of> perhaps 50:1, maybe less. What you're describing is a slowdown factor of> at least several thousand.>> Bear in mind though that (1) V sequentialises thread execution, which wil> make a big difference if the program is heavily multithreaded, and (2)> I suspect dhat's scheme of looking up all memory accesses in an AVL tree> (of malloc'd blocks) doesn't scale all that well if you have tens of> millions of blocks.>> Can you run it on a smaller workload? HiI'll try on something smaller and also get some info on the number of blocks of memory allocated.A+Paul___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
The server has 48Gbytes of RAM and only about 6Gbytes is being used. The executable is quite big text data bss dec hex filename 57369168 417156 20903108 78689432 4b0b498 [snip] The run under DHAT is using about 2Gbytes virtual and 1.5Gbytes resident according to htop. Running standalone those are about 750M and 350M respectively. Some hardware cache+memory delays: L1 hit3 cycles ( 32KB size) L2 hit 11 cycles (256KB size) L3 hit 25 cycles (4MB to 40MB size) miss180 cycles The dynamic RAM chips commonly used for main memory have stayed the same speed for over 25 years: 60 nanoseconds from CAS (Column Address Strobe) to DataOut. If the CPU runs at 3GHz, then a cache miss costs at least 180 cycles. A quick estimate of (largest and slowest) cache size is given by the "cache size" line from /proc/cpuinfo: - $ sed 9q < /proc/cpuinfo # on a 8-year old consumer-grade machine processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz stepping: 7 microcode : 0x2f cpu MHz : 1599.982 cache size : 6144 KB - Assume that your "big" server has 30MB of L3 cache. For a resident set size of 350MB then that is a ratio of about 1:12. For a resident set size of 1.5GB then the ratio is about 1:50. So right away, that's a hardware slowdown of 4X. Valgrind runs every tool single-threaded. So if your app averages 5 active threads, then that is a slowdown of 5X. Valgrind's JIT (Just-In-Time) instruction emulator has a slowdown. Assume 10X (or measure nulgrind.) Finally we get to "useful work": the slowdown of the tool DHAT. Assume 3X. So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100 hours. ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
On 5/26/20, Paul FLOYD wrote: > > I'm running DHAT on what I consider to be a relatively small example. > > Standalone the executable runs in a bit under 10 minutes. Based on the CPU > > time that we print after every 10% of progress, under DHAT the same executable is going to take about 422 hours - about two and a half weeks. > > > > Does anyone have any ideas what could be causing it to be so slow? > Please tell us about your environment. > (EVERY query should supply this information!) I just wanted to have an idea of the general expectation for DHAT overhead. > Which version of DHAT? Which hardware? Which operating system and version? > Which compiler and version built DHAT? Which compiler(s) built the program > under test? Red Hat Enterprise Linux Workstation release 6.5 (Santiago) gcc (GCC) 5.3.0 for the test executable. gcc version 9.2.0 (GCC) for building Valgrind, built from git HEAD. The server has 48Gbytes of RAM and only about 6Gbytes is being used. The executable is quite big text data bss dec hex filename 57369168 417156 20903108 78689432 4b0b498 [snip] The run under DHAT is using about 2Gbytes virtual and 1.5Gbytes resident according to htop. Running standalone those are about 750M and 350M respectively. I'll have to find a much smaller test before I consider doing any profiling of DHAT. I'll also try nullgrind to see if it really is the DHAT instrumentation overhead that is slowing things down. A+ Paul ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
That doesn't sound right. I use DHAT extensively and expect a slowdown of perhaps 50:1, maybe less. What you're describing is a slowdown factor of at least several thousand. Bear in mind though that (1) V sequentialises thread execution, which wil make a big difference if the program is heavily multithreaded, and (2) I suspect dhat's scheme of looking up all memory accesses in an AVL tree (of malloc'd blocks) doesn't scale all that well if you have tens of millions of blocks. Can you run it on a smaller workload? J On 26/05/2020 09:21, Paul FLOYD wrote: Hi I'm running DHAT on what I consider to be a relatively small example. Standalone the executable runs in a bit under 10 minutes. Based on the CPU time that we print after every 10% of progress, under DHAT the same executable is going to take about 422 hours - about two and a half weeks. Does anyone have any ideas what could be causing it to be so slow? Indeed, is this the sort of slowdown that I should be expecting with DHAT? The executable is intensive in both memory and floating point. Probably not helping matters, the data structures that I want to look at are over 1kB in size so I tweaked the HISTOGRAM_SIZE_LIMIT to bump it up to 2kb. On the DHAT side, I have thought of trying to use some macro hackery to try to inline the avl comparator function calls. Otherwise I don't have much in the way of other ideas, and DHAT doesn't have any cli options to tweak things. A+ Paul ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] DHAT speed
On 5/26/20, Paul FLOYD wrote: I'm running DHAT on what I consider to be a relatively small example. Standalone the executable runs in a bit under 10 minutes. Based on the CPU time that we print after every 10% of progress, under DHAT the same executable is going to take about 422 hours - about two and a half weeks. Does anyone have any ideas what could be causing it to be so slow? Please tell us about your environment. (EVERY query should supply this information!) Which version of DHAT? Which hardware? Which operating system and version? Which compiler and version built DHAT? Which compiler(s) built the program under test? How big is the program? On a Linux system, run "size ./my_app; ldd ./my_app" for static size, and "ps alx | grep my_app" for total VIRTual and RESident size. How much RAM is available for use? How much demand paging is occurring? There are standard performance tools that apply to EVERY executable program, not just DHAT. On a Linux system with a DHAT that was built with symbol information retained ("-g" command-line option to gcc or clang, and no symbol stripping during static binding by /usr/bin/ld or /usr/bin/strip), what does 'perf' say? perf record valgrind --tool=DHAT <> ./my_app <> perf report > perf_output.txt A smart user would supply such information in the first post. ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
[Valgrind-users] DHAT speed
Hi I'm running DHAT on what I consider to be a relatively small example. Standalone the executable runs in a bit under 10 minutes. Based on the CPU time that we print after every 10% of progress, under DHAT the same executable is going to take about 422 hours - about two and a half weeks. Does anyone have any ideas what could be causing it to be so slow? Indeed, is this the sort of slowdown that I should be expecting with DHAT? The executable is intensive in both memory and floating point. Probably not helping matters, the data structures that I want to look at are over 1kB in size so I tweaked the HISTOGRAM_SIZE_LIMIT to bump it up to 2kb. On the DHAT side, I have thought of trying to use some macro hackery to try to inline the avl comparator function calls. Otherwise I don't have much in the way of other ideas, and DHAT doesn't have any cli options to tweak things. A+ Paul ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users