[Valgrind-users] Usability of callgrind_annotate (Re: binary debugging and generate source of which that running)

Josef Weidendorfer Fri, 08 Jul 2011 10:56:23 -0700

On Friday 08 July 2011, John Reiser wrote:
> What I see is an instance of EPIC FAIL for Usability.


Thanks for the detailed reply. I really appreciate it.

The first thing I must note here is that "callgrind_annotate" is meant
as fallback solution for people who can/do not want to use a GUI. 
KCachegrind (the GUI for callgrind output)
- shows "Instruction Fetch" as explanation for "Ir",
- shows a call count column in the function list.

> The legend paragraph
> at the beginning of the output does not define "Ir", the output displayed
> by the command-line parameter "--help" does not define "Ir",

Ok. The same is true for cachegrind/cg_annotate. We should do something
about that.

> and the default
> output of a tool whose name begins with "call" has nothing to do with
> the number of calls to each subroutine.

Point taken.

> The legend paragraph contains these instances of "Ir":
>    Events recorded:  Ir
>    Events shown:     Ir
>    Event sort order: Ir
> The output of "callgrind_annotate --help" does not contain "Ir".
> The MINIMUM I expect is for BOTH places to contain an additional annotation
> such as "Possible events:  Ir (Instruction Read)".
> 
> The only explanation of "Ir" is in the callgrind manual
>    share/doc/valgrind/html/cl-manual.html
> and the cachegrind manual
>    share/doc/valgrind/html/cg-manual.html

> Neither manual explains why "Ir" (Instruction Read) is used instead of
> something like "Ix" (Instruction eXecuted). Until some years ago most
> x86 chips read [and decode] many more instructions than they execute:
> in particular, some [effectively] non-executed instructions which reside
> after every taken branch.  Some newer x86 chips read-decode-translate
> instruction bytes into microcode, often blurring the boundaries between
> architectural instructions, and execute only the translated microcode.
> The translations are cached; a loop of up to several dozen instructions
> can be executed many times after the first translation.  So the
> relationship between Instruction Read and Instruction eXecute is murky.

Indeed. The "Ir" naming is part of cachegrind history.
We equally could use "Instructions Executed", and yes, this should be
explained in the manual. It would be even better to talk about
"Instructions retired" (works as abbrevation to Ir).
 
> > Actually, callgrind_annotate only prints the number of calls for call arcs,
> > which are displayed e.g. with "--tree=caller":
> > 
> >> callgrind_annotate --tree=caller callgrind.out*
> > ...
> >  7,060  < ???:0x0000000000401d60 (1x) [/bin/date]
> >     73  *  /build/buildd/eglibc-2.13/stdlib/exit.c:exit 
> > [/lib/x86_64-linux-gnu/libc-2.13.so]
> > ...
> > 
> > This excerpt means that a function at "0x0000000000401d60" called
> > "exit" one time. It probably would be good to add a total call count number
> > for every function.
> 
> Yes.  Total incoming calls by function is the zero-level statistic which is
> useful and expected by both novice and experienced users.

Good.

> >> For another instance, using the exact suggestion:
> >> $ valgrind --tool=callgrind --ct-verbose=1 /bin/date
> >> gives output such as:
> >> -----
> >>    .   .   .   > check_match.10789(0x3d, 0x5b, ...) [ld-2.13.so / 
> >> 0x3d25a08a20]
> >>    .   .   .   .> strcmp(0x3d, 0x5b, ...) [ld-2.13.so / 0x3d25a16ac0]
> >>    .   .   .   . > strcmp(0x3d, 0x5b, ...) [ld-2.13.so / 0x3d25a16ac0]
> >> -----
> >> where the indicated parameters are nonsense,
> > 
> > Sure, it would be a nice feature!
> > But "--ct-verbose=..." is actually only for internal debugging, and
> > the output is not documented. 
> 
> If something such as "strcmp(0x3d, 0x5b, ...)" is _not_ an indication
> of parameters, then this is another poor choice,

This is undocumented debug output. We can change it any time. I do not see
a problem here.

> especially as a
> recommendation to a new user.

The original poster asked for a trace of called functions. You suggested
a quite elaborated wrapper method, which did not seem to be easy to implement.
I just commented that callgrind can print out exactly that list of
called functions using a hidden "feature", to show him a shortcut.
The only thing I may have added: "Warning: includes a lot of totally misleading 
crap".

> >> and a function whose body
> >> is a loop displays as a recursion.
> > 
> > Please tell me a way how to distinguish loops from tail recursion 
> > optimization
> > at machine code level, when the loop jumps back to the first instruction
> > of a function.
> 
> After entry to a function and before the corresponding return, if there has
> been no write to the register or memory location which is designated to hold
> the return address, and if the target is within the same function (as 
> determined
> by the available symbol table and perhaps by implementation details such as
> choice of opcode and instruction encoding), then it's a loop and not
> a recursion.

That is exactly what callgrind is doing, unless ...

> Sure, at the _lowest_ level a branch back to the beginning [or even after the
> prologue] cannot be distinguished between a loop and a recursion.

Right. Unfortunately, callgrind only can see the lowest level. Do you
know any debug information that can help here?

> However,
> programmers who use explicit tail recursion to the same function _know_
> that it is equivalent to a loop, and won't be surprised by such a display;
> whereas the coder of a loop might be confused by seeing a recursion.

That is arguable.

Callgrind only translates a jump into recursion if it is back to the first
instruction. This mostly works fine, as loop bodies do not include function
prologues.  The "wrongly detected" strcmp really is a rare exception, as
with the x86_64 calling convention, this function actually has no prologue:

   0x7ffff7df36f0:      mov    (%rdi),%al
   0x7ffff7df36f2:      cmp    (%rsi),%al
   0x7ffff7df36f4:      jne    0x7ffff7df3703
   0x7ffff7df36f6:      inc    %rdi
   0x7ffff7df36f9:      inc    %rsi
   0x7ffff7df36fc:      test   %al,%al
   0x7ffff7df36fe:      jne    0x7ffff7df36f0
   0x7ffff7df3700:      xor    %eax,%eax
   0x7ffff7df3702:      retq   
   0x7ffff7df3703:      mov    $0x1,%eax
   0x7ffff7df3708:      mov    $0xffffffff,%ecx
   0x7ffff7df370d:      cmovb  %ecx,%eax
   0x7ffff7df3710:      retq   

And you can try yourself:

 int mysc1(char* a, char* b)
 {
  if (*a == *b) {
    if (*a != 0) return mysc1(a+1,b+1);
    return 0;
  }
  return (*a < *b) ? -1:1;
 }

 int mysc2(char* a, char* b)
 {
  while(1) {
    if (*a != *b) break;
    if (*a == 0) return 0;
    a++, b++;  
  }
  return (*a < *b) ? -1:1;
 }


$> gcc -v
...
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
$> gcc -O3 -Os -c mysc{1,2}.c; objdump -S mysc{1,2}.o

0000000000000000 <mysc>:
   0:   8a 07                   mov    (%rdi),%al
   2:   3a 06                   cmp    (%rsi),%al
   4:   75 0c                   jne    12 <mysc+0x12>
   6:   84 c0                   test   %al,%al
   8:   74 13                   je     1d <mysc+0x1d>
   a:   48 ff c6                inc    %rsi
   d:   48 ff c7                inc    %rdi
  10:   eb ee                   jmp    0 <mysc>
  12:   0f 9d c0                setge  %al
  15:   0f b6 c0                movzbl %al,%eax
  18:   8d 44 00 ff             lea    -0x1(%rax,%rax,1),%eax
  1c:   c3                      retq   
  1d:   31 c0                   xor    %eax,%eax
  1f:   c3                      retq   

Both give exactly the same machine code as result with GCC 4.5.2!

> Also,
> callgrind should take advantage of having an event horizon that is larger
> than one instruction.

Please explain.

> Adopting the "subroutine outlook" also favors a loop over recursion.
> The "subroutine outlook" knows only what is [scheduled] to be done
> in the future; what has happened in the past is not relevant and
> cannot be known.  Thus the current recursion level always equals
> the number of pending return addresses.  The disadvantage is that
> traceback "through" a tail recursion elides some names, in much the same
> way that visible state at a debugger breakpoint shows only the _next_
> instruction to be executed, and not the previous instruction pointer.

Sorry, I don't get your point here.

Do you suggest for callgrind to better always play dumb, and do not even
try to reconstruct a potential tail recursion into a real recursion?

There is a similar scenario: machine code can have jumps between
functions (e.g. as result of a tail recursion optimization, or hand crafted
assembler). Callgrind output is about call relationship. There is no way
to reason about jumps between functions. Thus, callgrind has to map a
jump between functions either to a "call" (with an additional return when
returning from the function jumped to) or a "return/call" pair.

> > Please tell me the reason why you started with the assumption that the
> > discovered output must be buggy.
> 
> I started with the assumption that a tool with a name such as "callgrind"
> would tell me how many times each subroutine was called.  The output was
> not so.  The output legend had no explanation.

I am really sorry about that experience. I obviously need to better care
about CLI users.
I will open a bug report for that.

Josef

> The --help output had
> no explanation.  The output used notation "strcmp(0x3d, 0x5b, ...)"
> for something other than displaying [correctly] a function call with 
> parameters.
> After all that experience, then I decided that I should be wary of callgrind.

------------------------------------------------------------------------------
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security 
threats, fraudulent activity, and more. Splunk takes this data and makes 
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2
_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users

[Valgrind-users] Usability of callgrind_annotate (Re: binary debugging and generate source of which that running)

Reply via email to