On 9/19/06, Andres Kroonmaa <[EMAIL PROTECTED]> wrote:
On 19 Sep 2006 at 21:10, Adrian Chadd wrote:
> On Tue, Sep 19, 2006, Gonzalo Arana wrote:
>
> > Is hires profiling *that* heavy? I've used it in my production squids
> > (while I've used squid3) and the overhead was negligible.
>
> It doesn't seem to be that heavy.
hires profiling was designed for the lightest possible
overhead. I added special probes to measure its own
overhead, run regularly from the event loop (PROF_OVERHEAD),
and it shows around 100 clock ticks on average on Adrian's
2.6 GHz CPU.
> > There is a comment in profiling.h claiming that rdtsc (for x86 arch)
> > stalls CPU pipes. That's not what Intel documentation says (page 213
> > -numbered as 4-209- of the Intel Architecture Software Developer
> > Manual, volume 2b, Instruction Reference N-Z).
> >
> > So, it should be harmless to profile as much code as possible, am I right?
>
> That's what I'm thinking! Things like perfsuite seem to do a pretty good job
> of it without requiring re-compilation as well.
Well, this is a somewhat mixed issue. Intel documented the
usage of rdtsc (at the time I coded this) as requiring a
cpuid+rdtsc pair. Cpuid flushes all prefetch pipes and
serializes execution, and that is what they meant by
stalled pipes.

I guess that's because the time stamp counter may have different
values on different CPUs. Am I right?

In my implementation I played around with both and found
that including cpuid added unnecessary overhead - it indeed
stalls the pipes. On netburst Intel cores that can become
quite notable overhead.

If so, the timestamp counters seem to be in sync across
multiple processors.

So I excluded it and use the plain rdtsc instruction alone.
This, on the other hand, by definition causes some precision
error due to possible out-of-order execution on superscalar
CPUs, but I went on with the assumption that as long as
probe start and probe stop are similar pieces of code, the
added time uncertainty largely cancels out, as we are
measuring time *between* two invocations.

Sounds reasonable to me: both the start and the stop could have (on
average) the same offset error, so they would cancel each other out.
I just wonder whether branch prediction introduces some bias here.
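The two variants being compared could be sketched like this. The function names are hypothetical and the inline asm is GCC/x86-specific; this is an illustration of the trade-off, not the Squid implementation:

```c
/* Sketch of the two TSC-read variants discussed above.
 * Hypothetical names, GCC/x86 inline asm. */
#include <stdint.h>

/* Bare rdtsc: cheap, but the read may be reordered by a
 * superscalar, out-of-order core. */
static inline uint64_t tsc_fast(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* cpuid first: serializes execution (flushes the prefetch pipes),
 * so the timestamp is taken exactly at this point -- at the cost
 * described above, notable on netburst cores. */
static inline uint64_t tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\trdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)
                         : "ebx", "ecx");
    return ((uint64_t)hi << 32) | lo;
}
```

If both the probe-start and probe-stop sites use tsc_fast(), the reordering error is similar at both ends and largely cancels in the difference, which is the assumption stated above.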
> > We could build something like gprof call graph (with some
> > limitations). Adding this shouldn't be *that* difficult, right?
> >
> > Is there interest in improving the profiling code this way? (i.e.:
> > somewhat automated probe collection & adding call graph support).
>
> It'd be a pretty interesting experiment. gprof seems good enough
> to obtain call graph information (and call graph information only)
> and I'd rather we put our efforts towards fixing what we can find
> and porting over the remaining stuff from 2.6 into 3. We really
> need to concentrate on fixing up -3 rather than adding shinier things.
> Yet :)
How would you go on with adding call graphs without
adding too much overhead? I think it would be hard to
beat gprof on that one.
I'm sorry, by 'adding call graph' I meant 'adding some basic call tree
information' (I used the wrong words, sorry). We could have a
matrix of profile information:
M[caller][callee]. If you wish to go deeper, just add a new
dimension to the 'profile matrix': M[grandfather][father][callee].
Anyway, if we trust and rely on gprof's call tree, there is no point
in doing any of this.
Just as a note: why don't we trust gprof's profile information, but
we do trust its call graph?
Regards,
--
Gonzalo A. Arana