On 19 Sep 2006 at 21:10, Adrian Chadd wrote:

> On Tue, Sep 19, 2006, Gonzalo Arana wrote:
> > Is hires profiling *that* heavy? I've used it in my production squids
> > (while I've used squid3) and the overhead was negligible.
>
> It doesn't seem to be that heavy.
hires profiling was designed for the lightest possible overhead. I added
special probes to measure its own overhead, run regularly in events
(PROF_OVERHEAD), and it shows around 100 clock ticks on average for Adrian
on a 2.6 GHz CPU.

> > There is a comment in profiling.h claiming that rdtsc (for x86 arch)
> > stalls CPU pipes. That's not what Intel documentation says (page 213
> > -numbered as 4-209- of the Intel Architecture Software Developer
> > Manual, volume 2b, Instruction Reference N-Z).
> >
> > So, it should be harmless to profile as much code as possible, am I right?
>
> That's what I'm thinking! Things like perfsuite seem to do a pretty good
> job of it without requiring re-compilation as well.

Well, this is a somewhat mixed issue. At the time I wrote this code, Intel
documented rdtsc usage as requiring a cpuid+rdtsc pair. cpuid flushes all
prefetch pipes and serializes execution, and that is what was meant by
"stalled pipes". In my implementation I tried both and found that including
cpuid added unnecessary overhead - it does indeed stall the pipes, and on
netburst-era Intel cores that can become quite a notable cost. So I dropped
it and use a plain rdtsc alone. This, on the other hand, by definition
introduces some precision error due to possible out-of-order execution on
superscalar CPUs, but I went with the assumption that as long as probe
start and probe stop are similar pieces of code, the added timing
uncertainty largely cancels out, since we are measuring the time *between*
two invocations.

I notice that someone has added an rdtsc macro for the Windows platform and
went with the documented cpuid+rdtsc pair. My suggestion would be to omit
the cpuid: it gains nothing in precision, because stalling the pipes adds
more overhead than the error introduced by omitting it.

> > We could build something like gprof call graph (with some
> > limitations). Adding this shouldn't be *that* difficult, right?
> >
> > Is there interest in improving the profiling code this way? (i.e.:
> > somewhat automated probe collection & adding call graph support).
>
> It'd be a pretty interesting experiment. gprof seems good enough
> to obtain call graph information (and call graph information only)
> and I'd rather we put our efforts towards fixing what we can find
> and porting over the remaining stuff from 2.6 into 3. We really
> need to concentrate on fixing up -3 rather than adding shinier things.
> Yet :)

How would you go about adding call graphs without adding too much overhead?
I think it would be hard to beat gprof on that one.

------------------------------------
 Andres Kroonmaa
 Elion
