Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-09-05 Thread Milian Wolff
On Tuesday, September 5, 2017 5:40:58 AM CEST Andi Kleen wrote:
> > The cycle value gets associated with a sample via its period value, which
> > is used by `perf report` in the analysis. If I get a single "broken"
> > sample with
>
> I always thought it just used the number of samples?

No, that is not the case. It uses the cycle period as "weight" by default. 
Note also the recent patch set for `perf annotate` by Taeung Song, which adds 
the ability to switch between sample-count and period-based fractions. I'm 
actually not sure whether that exists for `perf report` yet.
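
For reference, a quick way to compare both views side by side (a hedged 
sketch from memory of the perf report options; exact column labels may 
differ between versions):

$ perf report -n --show-total-period
# -n adds a raw sample-count column, --show-total-period adds the summed
# period column, next to the default period-weighted "Overhead" percentage.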

In some situations where the cycle period values associated with the samples 
are correct, I have actually also seen how this is useful: I have seen perf 
data files where tons of samples got recorded around syscall entry/exit, but 
most of these samples only had tiny cycle values associated with them. If one 
only looks at the number of samples, it looks like the syscalls are expensive, 
while in reality way more cycles are executed in userspace. This issue was/is 
apparent when comparing `perf report` values with the FlameGraph visualization 
created by the normal `stackcollapse-perf.pl` script. The latter only looks at 
the number of samples, the former takes the sample period value. Hotspot also 
used the latter approach, but I have since adapted it to follow perf's behavior.

> > a cycle count of, say, 1E14 and then a million other samples, each with a
> > "sane" cycle count of, say, 1E5, then the one broken sample will hold
> > nearly all (~99.9%) of the total measured cycles, since the million sane
> > samples only amount to 1E11 cycles combined. So perf report will show that
> > the function the broken sample points to has a cost of ~99.9%.
> 
> I don't think I've seen such a situation. Did you?

I have seen this situation; it is what the current revival of this thread is 
all about. Without such cases, I wouldn't claim it is such a serious issue.
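
To make the weighting effect concrete, here is a purely illustrative 
calculation with the made-up numbers from the quoted example (plain awk 
arithmetic, nothing perf-specific):

$ awk 'BEGIN {
    broken = 1e14; sane = 1e6 * 1e5;   # one broken sample vs. 1E11 sane cycles
    printf("period weighting: %.1f%% of all cycles in one sample\n",
           100 * broken / (broken + sane));
    printf("sample counting:  %.4f%% of all samples\n", 100 * 1 / (1e6 + 1));
  }'
period weighting: 99.9% of all cycles in one sample
sample counting:  0.0001% of all samples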

> BTW I'm not arguing against fixing it, but typically I just
> recommend to avoid frequency mode. The fast sampling at the beginning
> has caused a range of low level PMU bugs and it is hard to reason about
> because of its complex behavior. Also it has no protection against
> synchronizing with repeated patterns in the execution, which
> can cause bad shadowing effects.  If you use the Intel
> event aliases they have all sensible periods set by default.

I think we both agree that the frequency mode as-is should not be the default. 
But it is, and this is a serious issue in my opinion. We will need to find a 
sensible default for the event period and use that mode by default. Nowadays I 
always add something like `-c 3000000`, which gives me roughly 1k samples per 
second on a ~3GHz machine. It's just a ballpark figure and of course gets 
influenced by frequency scaling, but it's good enough for me. We could use a 
similar approach to find a period based on the maximum CPU clock rate 
automatically. Of course that would only work for cycles, and not for 
instructions or any of the other fancy event counters. But since the frequency 
mode is probably similarly broken there, it should not be the default either. 
Better to ask the user for explicit values than to do something automatically 
which can lead to broken results.
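
A rough sketch of that idea (hypothetical and untested; it assumes the usual 
sysfs cpufreq layout, targets ~1000 samples per second, and uses a placeholder 
workload):

$ max_khz=$(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq)  # kHz
$ target=1000                              # desired samples per second
$ period=$(( max_khz * 1000 / target ))    # cycles per sample, ~3000000 at 3GHz
$ perf record -c "$period" --call-graph dwarf -- ./my_app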

Cheers

-- 
Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH & Co. KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts



Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-09-04 Thread Andi Kleen
> The cycle value gets associated with a sample via its period value, which is 
> used by `perf report` in the analysis. If I get a single "broken" sample with 

I always thought it just used the number of samples?

> a cycle count of, say, 1E14 and then a million other samples, each with a 
> "sane" cycle count of, say, 1E5, then the one broken sample will hold nearly 
> all (~99.9%) of the total measured cycles, since the million sane samples 
> only amount to 1E11 cycles combined. So perf report will show that the 
> function the broken sample points to has a cost of ~99.9%.

I don't think I've seen such a situation. Did you?

BTW I'm not arguing against fixing it, but typically I just
recommend avoiding frequency mode. The fast sampling at the beginning
has caused a range of low-level PMU bugs, and it is hard to reason about
because of its complex behavior. It also has no protection against
synchronizing with repeated patterns in the execution, which
can cause bad shadowing effects. If you use the Intel
event aliases, they all have sensible periods set by default.
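
A minimal illustration of the two modes (the workload is a placeholder):

$ perf record -F 1000 -- ./my_app     # frequency mode: ~1000 samples/s, the
                                      # kernel keeps re-estimating the period
$ perf record -c 3000000 -- ./my_app  # fixed period: one sample every 3M
                                      # cycles, no adaptive re-estimation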

-Andi



Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-09-04 Thread Milian Wolff
On Friday, September 1, 2017 6:48:20 PM CEST Andi Kleen wrote:
> Milian Wolff  writes:
> > do you have any input on this issue? I really wonder why noone else is
> > complaining about the frequency mode being unreliable or right out broken
> > in many situations. Since it's the default mode, I think this urgently
> > needs to be investigated - it makes perf unusable for a large group of
> > users who want to use it but don't know about `-c N` as a workaround...
> 
> It's likely related to the frequency algorithm starting with 0. So
> at the beginning the samples come very fast (with periods as small as 1
> cycle), and likely something breaks down in perf or in your frequency
> calculation for such short periods.
> 
> Also for inherited events this happens on every fork. If you
> trace fork events too you'll likely see it correlated.
> If you use -a and disable inheritance (no-inherit=1) it will
> also likely be only at the beginning.
> 
> However I fail to see what it would actually break. The frequency
> just needs to be roughly accurate over the whole measurement
> period to get good sampling coverage. But there's nothing
> in the profile that uses the actual frequency. It's just a means
> to get samples, not a measurement by itself.

The cycle value gets associated with a sample via its period value, which is 
used by `perf report` in the analysis. If I get a single "broken" sample with 
a cycle count of, say, 1E14 and then a million other samples, each with a 
"sane" cycle count of, say, 1E5, then the one broken sample will hold nearly 
all (~99.9%) of the total measured cycles, since the million sane samples only 
amount to 1E11 cycles combined. So perf report will show that the function the 
broken sample points to has a cost of ~99.9%.

To me, this is clearly a really big issue. Don't you think so too?

Thanks

-- 
Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH & Co. KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts



Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-09-01 Thread Andi Kleen
Milian Wolff  writes:
>
> do you have any input on this issue? I really wonder why no one else is 
> complaining about the frequency mode being unreliable or outright broken in 
> many situations. Since it's the default mode, I think this urgently needs to 
> be investigated - it makes perf unusable for a large group of users who want 
> to use it but don't know about `-c N` as a workaround...

It's likely related to the frequency algorithm starting with 0. So
at the beginning the samples come very fast (with periods as small as 1
cycle), and likely something breaks down in perf or in your frequency
calculation for such short periods.

Also, for inherited events this happens on every fork. If you
trace fork events too, you'll likely see it correlated.
If you use -a and disable inheritance (no-inherit=1), it will
also likely happen only at the beginning.
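
A hedged sketch of that suggestion, using the command-line spelling of the 
option (system-wide recording with inheritance disabled, so the fast period 
re-estimation only happens once at startup; workload and duration are 
placeholders):

$ perf record -a --no-inherit -e cycles -- sleep 10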

However I fail to see what it would actually break. The frequency
just needs to be roughly accurate over the whole measurement
period to get good sampling coverage. But there's nothing
in the profile that uses the actual frequency. It's just a means
to get samples, not a measurement by itself.

-Andi


Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-09-01 Thread Milian Wolff
On Monday, August 28, 2017 19:28:08 CEST Andi Kleen wrote:
> Adding Peter.
> 
> The original thread starts at
> https://www.spinics.net/lists/linux-perf-users/msg03486.html

Peter, Arnaldo,

do you have any input on this issue? I really wonder why no one else is 
complaining about the frequency mode being unreliable or outright broken in 
many situations. Since it's the default mode, I think this urgently needs to 
be investigated - it makes perf unusable for a large group of users who want 
to use it but don't know about `-c N` as a workaround...

> On Mon, Aug 28, 2017 at 04:40:43PM +0200, Milian Wolff wrote:
> > On Monday, August 28, 2017 4:08:47 PM CEST Milian Wolff wrote:
> > > On Sunday, June 18, 2017 9:53:05 PM CEST Milian Wolff wrote:
> > > > On Sunday, June 18, 2017 06:22:19 CEST Andi Kleen wrote:
> > > > > Milian Wolff  writes:
> > > > > > But when I look at the naively calculated first derivative, to
> > > > > > visualize
> > > > > > CPU load, i.e. CPU clock rate in Hz, then things start to become
> > > > > > somewhat
> > > > > > confusing:
> > > > > > 
> > > > > > 
> > > > > > perf script -F time,period | awk 'BEGIN {lastTime = -1;} { time =
> > > > > > $1 +
> > > > > > 0.0; if (lastTime != -1) {printf("%.6f\t%f\n", time, $2 / (time -
> > > > > > lastTime));} lastTime = time; }' | gnuplot -p -e "plot '-' with
> > > > > > linespoints"
> > > > > > 
> > > > > 
> > > > > The perf time stamps approach the maximum precision of double (12 vs
> > > > > 15 digits). Likely the division loses too many digits, which may
> > > > > cause
> > > > > the bogus results. I've run into similar problems before.
> > > > 
> > > > I don't think so, just look at the raw values:
> > > > 
> > > > $ perf script -F time,period --ns
> > > > 71789.438122347:  1
> > > > 71789.438127160:  1
> > > > 71789.438129599:  7
> > > > 71789.438131844: 94
> > > > 71789.438134282:   1391
> > > > 71789.438139871:  19106
> > > > 71789.438156426: 123336
> > > > ...
> > > > 
> > > > $ qalc '123336/(71789.438156426s - 71789.438139871s) to Hz'
> > > > 123336 / ((71789.438 * second) - (71789.438 * second)) = approx.
> > > > 7.4500755E9 Hz
> > > > 
> > > > > One way around this is to normalize the time stamps first so that
> > > > > they start with 0, but this only works for shorter traces.
> > > > > Or use some bignum float library
> > > > 
> > > > I take the time delta between two samples, so a normalization of the
> > > > individual times to 0 would not affect my calculations - the delta
> > > > stays
> > > > the same after all.
> > > > 
> > > > Also, using bignum in my calculations wouldn't change anything either.
> > > > If
> > > > perf tells me that 123336 cycles have been executed in 16.555 us, then
> > > > it
> > > > will always be larger than any expected value. At 3.2GHz it should be
> > > > maximally 52976 cycles in such a short timeframe...
> > > > 
> > > > > Also at the beginning of frequency mode the periods are very small, and
> > > > > the default us resolution will give big jumps for such a
> > > > > calculation.
> > > > 
> > > > OK, but who/what measures the large cycle values then? Is this a PMU
> > > > limitation? Or is this an issue with the interaction with the kernel,
> > > > when
> > > > the algorithm tries to find a good frequency at the beginning?
> > > > 
> > > > > It's better to use the script --ns option then, but that makes the
> > > > > double precision problem even worse.
> > > > 
> > > > See above, using `--ns` doesn't change anything. And qalc e.g. already
> > > > uses
> > > > bignum internally.
> > > > 
> > > > > In general you get better results by avoiding frequency mode
> > > > > and always specifying a fixed period.
> > > > 
> > > > This indeed removes the spikes at the beginning:
> > > > 
> > > > perf record --switch-events --call-graph dwarf -P -c 50
> > > > 
> > > > The value is chosen to give a similar sample count to frequency mode.
> > > 
> > > Hey all,
> > > 
> > > I want to revive the above discussion as I'm still completely puzzled by
> > > the observation. The tl;dr: for those who have not followed the
> > > previous discussion:
> > > 
> > > perf record in frequency mode (i.e. "record ~1000 samples per second")
> > > sometimes reports excessively large cycle counts.
> > > 
> > > In the previous mails I have outlined how to visualize this issue
> > > graphically with gnuplot, by drawing the first derivative of the cycles
> > > over time which gives nicely comparable numbers to one's CPU clock speed.
> > > 
> > > Sometimes, this value goes up to 10GHz and beyond. Sometimes the values
> > > are
> > > so broken (i.e. so high), that they completely break the analysis with
> > > perf
> > > report or similar, as they completely skew the total event count and
> > > thereby drastically influence the fractional cost reported by perf. E.g.
> > > just now I ran `perf record` on another application and got this result:
> > > 
> > > ~
> > 

Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-08-28 Thread Andi Kleen

Adding Peter.

The original thread starts at
https://www.spinics.net/lists/linux-perf-users/msg03486.html

On Mon, Aug 28, 2017 at 04:40:43PM +0200, Milian Wolff wrote:
> On Monday, August 28, 2017 4:08:47 PM CEST Milian Wolff wrote:
> > On Sunday, June 18, 2017 9:53:05 PM CEST Milian Wolff wrote:
> > > On Sunday, June 18, 2017 06:22:19 CEST Andi Kleen wrote:
> > > > Milian Wolff  writes:
> > > > > But when I look at the naively calculated first derivative, to
> > > > > visualize
> > > > > CPU load, i.e. CPU clock rate in Hz, then things start to become
> > > > > somewhat
> > > > > confusing:
> > > > > 
> > > > > 
> > > > > perf script -F time,period | awk 'BEGIN {lastTime = -1;} { time = $1 +
> > > > > 0.0; if (lastTime != -1) {printf("%.6f\t%f\n", time, $2 / (time -
> > > > > lastTime));} lastTime = time; }' | gnuplot -p -e "plot '-' with
> > > > > linespoints"
> > > > > 
> > > > 
> > > > The perf time stamps approach the maximum precision of double (12 vs
> > > > 15 digits). Likely the division loses too many digits, which may cause
> > > > the bogus results. I've run into similar problems before.
> > > 
> > > I don't think so, just look at the raw values:
> > > 
> > > $ perf script -F time,period --ns
> > > 71789.438122347:  1
> > > 71789.438127160:  1
> > > 71789.438129599:  7
> > > 71789.438131844: 94
> > > 71789.438134282:   1391
> > > 71789.438139871:  19106
> > > 71789.438156426: 123336
> > > ...
> > > 
> > > $ qalc '123336/(71789.438156426s - 71789.438139871s) to Hz'
> > > 123336 / ((71789.438 * second) - (71789.438 * second)) = approx.
> > > 7.4500755E9 Hz
> > > 
> > > > One way around this is to normalize the time stamps first so that
> > > > they start with 0, but this only works for shorter traces.
> > > > Or use some bignum float library
> > > 
> > > I take the time delta between two samples, so a normalization of the
> > > individual times to 0 would not affect my calculations - the delta stays
> > > the same after all.
> > > 
> > > Also, using bignum in my calculations wouldn't change anything either. If
> > > perf tells me that 123336 cycles have been executed in 16.555 us, then it
> > > will always be larger than any expected value. At 3.2GHz it should be
> > > maximally 52976 cycles in such a short timeframe...
> > > 
> > > > Also at the beginning of frequency mode the periods are very small, and
> > > > the default us resolution will give big jumps for such a calculation.
> > > 
> > > OK, but who/what measures the large cycle values then? Is this a PMU
> > > limitation? Or is this an issue with the interaction with the kernel, when
> > > the algorithm tries to find a good frequency at the beginning?
> > > 
> > > > It's better to use the script --ns option then, but that makes the
> > > > double precision problem even worse.
> > > 
> > > See above, using `--ns` doesn't change anything. And qalc e.g. already
> > > uses
> > > bignum internally.
> > > 
> > > > In general you get better results by avoiding frequency mode
> > > > and always specifying a fixed period.
> > > 
> > > This indeed removes the spikes at the beginning:
> > > 
> > > perf record --switch-events --call-graph dwarf -P -c 50
> > > 
> > > The value is chosen to give a similar sample count to frequency mode.
> > 
> > Hey all,
> > 
> > I want to revive the above discussion as I'm still completely puzzled by the
> > observation. The tl;dr: for those who have not followed the previous
> > discussion:
> > 
> > perf record in frequency mode (i.e. "record ~1000 samples per second")
> > sometimes reports excessively large cycle counts.
> > 
> > In the previous mails I have outlined how to visualize this issue
> > graphically with gnuplot, by drawing the first derivative of the cycles
> > over time which gives nicely comparable numbers to one's CPU clock speed.
> > 
> > Sometimes, this value goes up to 10GHz and beyond. Sometimes the values are
> > so broken (i.e. so high), that they completely break the analysis with perf
> > report or similar, as they completely skew the total event count and
> > thereby drastically influence the fractional cost reported by perf. E.g.
> > just now I ran `perf record` on another application and got this result:
> > 
> > ~
> > $ perf script | grep page_remove_rmap -C 10
> >  QXcbEventReader 23866 605019.879412: 128193 cycles:ppp: 
> > bb23fbf5 __fget_light (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
> >   lab_mandelbrot 23865 605019.879469:  1 cycles:ppp: 
> > bb064504 native_write_msr
> > (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
> >  QXcbEventReader 23866 605019.879471:1810360 cycles:ppp: 
> > bb1cb4ec find_vma (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
> >   lab_mandelbrot 23865 605019.879472:  1 cycles:ppp: 
> > bb064504 native_write_msr
> > (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
> >   lab_mandelbrot 23865 605019.879474: 10 cycles:ppp: 
> > 

Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-08-28 Thread Milian Wolff
On Monday, August 28, 2017 4:08:47 PM CEST Milian Wolff wrote:
> On Sunday, June 18, 2017 9:53:05 PM CEST Milian Wolff wrote:
> > On Sunday, June 18, 2017 06:22:19 CEST Andi Kleen wrote:
> > > Milian Wolff  writes:
> > > > But when I look at the naively calculated first derivative, to
> > > > visualize
> > > > CPU load, i.e. CPU clock rate in Hz, then things start to become
> > > > somewhat
> > > > confusing:
> > > > 
> > > > 
> > > > perf script -F time,period | awk 'BEGIN {lastTime = -1;} { time = $1 +
> > > > 0.0; if (lastTime != -1) {printf("%.6f\t%f\n", time, $2 / (time -
> > > > lastTime));} lastTime = time; }' | gnuplot -p -e "plot '-' with
> > > > linespoints"
> > > > 
> > > 
> > > The perf time stamps approach the maximum precision of double (12 vs
> > > 15 digits). Likely the division loses too many digits, which may cause
> > > the bogus results. I've run into similar problems before.
> > 
> > I don't think so, just look at the raw values:
> > 
> > $ perf script -F time,period --ns
> > 71789.438122347:  1
> > 71789.438127160:  1
> > 71789.438129599:  7
> > 71789.438131844: 94
> > 71789.438134282:   1391
> > 71789.438139871:  19106
> > 71789.438156426: 123336
> > ...
> > 
> > $ qalc '123336/(71789.438156426s - 71789.438139871s) to Hz'
> > 123336 / ((71789.438 * second) - (71789.438 * second)) = approx.
> > 7.4500755E9 Hz
> > 
> > > One way around this is to normalize the time stamps first so that
> > > they start with 0, but this only works for shorter traces.
> > > Or use some bignum float library
> > 
> > I take the time delta between two samples, so a normalization of the
> > individual times to 0 would not affect my calculations - the delta stays
> > the same after all.
> > 
> > Also, using bignum in my calculations wouldn't change anything either. If
> > perf tells me that 123336 cycles have been executed in 16.555 us, then it
> > will always be larger than any expected value. At 3.2GHz it should be
> > maximally 52976 cycles in such a short timeframe...
> > 
> > > Also at the beginning of frequency mode the periods are very small, and
> > > the default us resolution will give big jumps for such a calculation.
> > 
> > OK, but who/what measures the large cycle values then? Is this a PMU
> > limitation? Or is this an issue with the interaction with the kernel, when
> > the algorithm tries to find a good frequency at the beginning?
> > 
> > > It's better to use the script --ns option then, but that makes the
> > > double precision problem even worse.
> > 
> > See above, using `--ns` doesn't change anything. And qalc e.g. already
> > uses
> > bignum internally.
> > 
> > > In general you get better results by avoiding frequency mode
> > > and always specifying a fixed period.
> > 
> > This indeed removes the spikes at the beginning:
> > 
> > perf record --switch-events --call-graph dwarf -P -c 50
> > 
> > The value is chosen to give a similar sample count to frequency mode.
> 
> Hey all,
> 
> I want to revive the above discussion as I'm still completely puzzled by the
> observation. The tl;dr: for those who have not followed the previous
> discussion:
> 
> perf record in frequency mode (i.e. "record ~1000 samples per second")
> sometimes reports excessively large cycle counts.
> 
> In the previous mails I have outlined how to visualize this issue
> graphically with gnuplot, by drawing the first derivative of the cycles
> over time which gives nicely comparable numbers to one's CPU clock speed.
> 
> Sometimes, this value goes up to 10GHz and beyond. Sometimes the values are
> so broken (i.e. so high), that they completely break the analysis with perf
> report or similar, as they completely skew the total event count and
> thereby drastically influence the fractional cost reported by perf. E.g.
> just now I ran `perf record` on another application and got this result:
> 
> ~
> $ perf script | grep page_remove_rmap -C 10
>  QXcbEventReader 23866 605019.879412: 128193 cycles:ppp: 
> bb23fbf5 __fget_light (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>   lab_mandelbrot 23865 605019.879469:  1 cycles:ppp: 
> bb064504 native_write_msr
> (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>  QXcbEventReader 23866 605019.879471:1810360 cycles:ppp: 
> bb1cb4ec find_vma (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>   lab_mandelbrot 23865 605019.879472:  1 cycles:ppp: 
> bb064504 native_write_msr
> (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>   lab_mandelbrot 23865 605019.879474: 10 cycles:ppp: 
> bb064504 native_write_msr
> (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>   lab_mandelbrot 23865 605019.879475:216 cycles:ppp: 
> bb064504 native_write_msr
> (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>   lab_mandelbrot 23865 605019.879486:   5106 cycles:ppp: 
> bb064504 native_write_msr
> (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
>   lab_mandelbrot 23865 

broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

2017-08-28 Thread Milian Wolff
On Sunday, June 18, 2017 9:53:05 PM CEST Milian Wolff wrote:
> On Sunday, June 18, 2017 06:22:19 CEST Andi Kleen wrote:
> > Milian Wolff  writes:
> > > But when I look at the naively calculated first derivative, to visualize
> > > CPU load, i.e. CPU clock rate in Hz, then things start to become
> > > somewhat
> > > confusing:
> > > 
> > > 
> > > perf script -F time,period | awk 'BEGIN {lastTime = -1;} { time = $1 +
> > > 0.0; if (lastTime != -1) {printf("%.6f\t%f\n", time, $2 / (time -
> > > lastTime));} lastTime = time; }' | gnuplot -p -e "plot '-' with
> > > linespoints"
> > > 
> > 
> > The perf time stamps approach the maximum precision of double (12 vs
> > 15 digits). Likely the division loses too many digits, which may cause
> > the bogus results. I've run into similar problems before.
> 
> I don't think so, just look at the raw values:
> 
> $ perf script -F time,period --ns
> 71789.438122347:  1
> 71789.438127160:  1
> 71789.438129599:  7
> 71789.438131844: 94
> 71789.438134282:   1391
> 71789.438139871:  19106
> 71789.438156426: 123336
> ...
> 
> $ qalc '123336/(71789.438156426s - 71789.438139871s) to Hz'
> 123336 / ((71789.438 * second) - (71789.438 * second)) = approx. 7.4500755E9
> Hz
> 
> > One way around this is to normalize the time stamps first so that
> > they start with 0, but this only works for shorter traces.
> > Or use some bignum float library
> 
> I take the time delta between two samples, so a normalization of the
> individual times to 0 would not affect my calculations - the delta stays the
> same after all.
> 
> Also, using bignum in my calculations wouldn't change anything either. If
> perf tells me that 123336 cycles have been executed in 16.555 us, then it
> will always be larger than any expected value. At 3.2GHz it should be
> maximally 52976 cycles in such a short timeframe...
> 
> > Also at the beginning of frequency mode the periods are very small, and
> > the default us resolution will give big jumps for such a calculation.
> 
> OK, but who/what measures the large cycle values then? Is this a PMU
> limitation? Or is this an issue with the interaction with the kernel, when
> the algorithm tries to find a good frequency at the beginning?
> 
> > It's better to use the script --ns option then, but that makes the
> > double precision problem even worse.
> 
> See above, using `--ns` doesn't change anything. And qalc e.g. already uses
> bignum internally.
> 
> > In general you get better results by avoiding frequency mode
> > and always specifying a fixed period.
> 
> This indeed removes the spikes at the beginning:
> 
> perf record --switch-events --call-graph dwarf -P -c 50
> 
> The value is chosen to give a similar sample count to frequency mode.

Hey all,

I want to revive the above discussion as I'm still completely puzzled by the 
observation. The tl;dr: for those who have not followed the previous 
discussion:

perf record in frequency mode (i.e. "record ~1000 samples per second") 
sometimes reports excessively large cycle counts.

In the previous mails I have outlined how to visualize this issue graphically 
with gnuplot, by drawing the first derivative of the cycles over time which 
gives nicely comparable numbers to one's CPU clock speed.
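
For reference, a variant of the pipeline quoted above (a sketch, not verified 
here) that prints the implied clock rate in GHz per sample, so outliers far 
above the real clock stand out immediately:

$ perf script -F time,period --ns | awk '
    { t = $1 + 0.0; p = $2 + 0.0;
      if (last) printf("%.9f\t%.3f\n", t, p / (t - last) / 1e9);  # GHz
      last = t }' | gnuplot -p -e "plot '-' with linespoints"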

Sometimes, this value goes up to 10GHz and beyond. Sometimes the values are so 
broken (i.e. so high), that they completely break the analysis with perf 
report or similar, as they completely skew the total event count and thereby 
drastically influence the fractional cost reported by perf. E.g. just now I 
ran `perf record` on another application and got this result:

~
$ perf script | grep page_remove_rmap -C 10
 QXcbEventReader 23866 605019.879412: 128193 cycles:ppp:  bb23fbf5 
__fget_light (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879469:  1 cycles:ppp:  bb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
 QXcbEventReader 23866 605019.879471:1810360 cycles:ppp:  bb1cb4ec 
find_vma (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879472:  1 cycles:ppp:  bb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879474: 10 cycles:ppp:  bb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879475:216 cycles:ppp:  bb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879486:   5106 cycles:ppp:  bb064504 
native_write_msr (/lib/modules/4.12.8-2-ARCH/build/vmlinux)
  lab_mandelbrot 23865 605019.879489:  19381 cycles:ppp:  7f85beae64ce 
QWaitCondition::wait (/usr/lib/libQt5Core.so.5.9.1)
  lab_mandelbrot 23865 605019.879495: 251346 cycles:ppp:  7f85bf2567c1 
QScreen::~QScreen (/usr/lib/libQt5Gui.so.5.9.1)
  lab_mandelbrot 23865 605019.880722:
