I tried the below (the toggle version, translated to c) on x86,
with gcc 4.9.2.

Looking at the printf call, I see:
 0x080483e2 <+178>:     push   $0x7a314
 0x080483e7 <+183>:     push   $0x80485d0
 0x080483ec <+188>:     call   0x80482f0 <printf@plt>
So, you see that 0x7a314 (so 500500 in decimal) is pushed.
In other words, gcc optimiser has detected what you are doing
and has computed the result at compilation time :).
The loop is then only executed to do the CALLGRIND start/stop.
This is somewhat confirmed by gdb:
(gdb) list 10,10
10          n += i;
(gdb) info line 10
Line 10 of "toggle.c" is at address 0x8048391 <main+97> but contains no code.
(gdb) 

I guess then that the difference between toggle and start/stop version
depends on the very fine details of what/when exactly the start/stop
and/or the toggle client requests are taking effect.

E.g. I think that stop instrumentation implies to flush the translation
cache. But the currently executed block will not be really flushed till
the block is exited.

I think that at least START/STOP is not designed/not usable to measure 
ultra small pieces of code, as the code is only re-instrumented when
'exiting from the current executing block'.
(for more info, see the long comment in coregrind/m_transtab.c around
line 224).

Toggle collect on the other hand might work
better as this does not have to flush the translation cache : the
JITted code is instrumented, the callgrind 'helpers' will be called 
by the JITted code but will just do nothing if collect is off.
Then why is the nr of instruction 'big' ?
Well, as you have seen, the addition itself is 0 instruction :).
So, what is measured is not very clear, maybe/probably measuring
a part of the TOGGLE call. At least, when using --dump-instr=yes
and then using kcachegrind, the executed instructions are associated
with the TOGGLE macro.

In summary: do not use STOP/START for very small pieces of code
such as loop bodies, very unlikely to work precisely/correctly.
TOGGLE collect for very small pieces of code in a loop
are likely to measure (partially) the TOGGLE call
(in particular if gcc optimiser has just removed the 'real' loop code).

Philippe



On Thu, 2015-08-27 at 06:42 +0000, galexand...@nc.rr.com wrote:
> I would like to count instructions for a specific part of my code.
> 
> I found a valgrind-users thread, "A great trick for counting cycles using 
> callgrind - but what about massif?" 
> (http://sourceforge.net/p/valgrind/mailman/message/33476105/), that gives a 
> way to use Callgrind to do this using the CALLGRIND_START_INSTRUMENTATION and 
> CALLGRIND_STOP_INSTRUMENTATION macros.  This seemed to work but was taking 
> way too long on "real world tests" to be of practical use.
> 
> So I did some more searching and found a Stack Overflow question, "Callgrind: 
> Profile a specific part of my code" 
> (http://stackoverflow.com/questions/13688185/callgrind-profile-a-specific-part-of-my-code),
>  that addresses the performance issue.  It suggests using the 
> CALLGRIND_TOGGLE_COLLECT macro instead the CALLGRIND_START_INSTRUMENTATION / 
> CALLGRIND_STOP_INSTRUMENTATION macro.  This worked in that performance was 
> much better.  However, the two approaches give significantly different 
> instruction counts, which leads me to wonder if either is reporting a 
> "correct" instruction count.
> 
> I created two simple test programs illustrating the difference.  Here's a 
> simple program using the CALLGRIND_START_INSTRUMENTATION / 
> CALLGRIND_STOP_INSTRUMENTATION approach:
> 
>    #include <cstdlib>
>    #include <iostream>
>    #include <string>
>    #include <valgrind/callgrind.h>
> 
>    int main(int argc, char* argv[]) {
>      int n = 0;
>      for (int i = 1; i <= 1000; ++i) {
>        CALLGRIND_START_INSTRUMENTATION;
>        n += i;
>        CALLGRIND_STOP_INSTRUMENTATION;
>      }
>      std::cout << "n: " << n << std::endl;
>      CALLGRIND_DUMP_STATS;
>      exit(0);
>    }
> 
> I compiled this using g++ 4.8.2 on openSUSE 13.2 x86-64 as follows:
> 
>    g++ -O3 -o Test1 Test1.cpp   
> 
> I then ran Callgrind as follows:
> 
>    valgrind --tool=callgrind --instr-atstart=no Test1
> 
> which gave
> 
>    ==7371== Callgrind, a call-graph generating cache profiler
>    ==7371== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et 
> al.
>    ==7371== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
>    ==7371== Command: Test1
>    ==7371== 
>    ==7371== For interactive control, run 'callgrind_control -h'.
>    n: 500500
>    ==7371== 
>    ==7371== Events    : Ir
>    ==7371== Collected : 0
>    ==7371== 
>    ==7371== I   refs:      0
> 
> The question is why is the instruction count 0? (I do get a nonzero 
> instruction count when profiling my real code using the 
> CALLGRIND_START_INSTRUMENTATION / CALLGRIND_STOP_INSTRUMENTATION approach, 
> but the reported instruction count is significantly less than the instruction 
> count reported using the CALLGRIND_TOGGLE_COLLECT approach.)
> 
> Here's the program using the CALLGRIND_TOGGLE_COLLECT approach:
> 
>    #include <cstdlib>
>    #include <iostream>
>    #include <string>
>    #include <valgrind/callgrind.h>
> 
>    int main(int argc, char* argv[]) {
>      int n = 0;
>      for (int i = 1; i <= 10000; ++i) {
>        CALLGRIND_TOGGLE_COLLECT;
>        n += i;
>        CALLGRIND_TOGGLE_COLLECT;
>      }
>      std::cout << "n: " << n << std::endl;
>      CALLGRIND_DUMP_STATS;
>      exit(0);
>   }
> 
> I compiled the program as follows:
> 
>    g++ -O3 -o Test2 Test2.cpp
> 
> and ran Callgrind as follows:
> 
>     valgrind --tool=callgrind --collect-atstart=no Test2
> 
> which gave
> 
>    ==7378== Callgrind, a call-graph generating cache profiler
>    ==7378== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et 
> al.
>    ==7378== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
>    ==7378== Command: Test2
>    ==7378== 
>    ==7378== For interactive control, run 'callgrind_control -h'.
>    n: 50005000
>    ==7378== 
>    ==7378== Events    : Ir
>    ==7378== Collected : 90008
>    ==7378== 
>    ==7378== I   refs:      90,008
> 
> The reported instruction count is 90008, which amounts to an amortized 
> instruction count of approximately nine instructions for each "n += i;".  
> Nine instructions seems high for a single integer addition even on an x86-64 
> CPU.  Is the reported instruction count really a "correct" measure of the "n 
> += i;" statement?
> 
> Is there a better way to profile instruction count for a specific part of 
> code?
> 
> Thanks,
> Geoff Alexander
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Valgrind-users mailing list
> Valgrind-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/valgrind-users



------------------------------------------------------------------------------
_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users

Reply via email to