Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-26 Thread Nicolai Stange
Nicolai Stange  writes:

> Thomas Gleixner  writes:
>
>> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> Thomas Gleixner  writes:
>>> 
>>> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> >> Thomas Gleixner  writes:
>>> >> > Have you ever measured the overhead of the extra work which has to be 
>>> >> > done
>>> >> > in clockevents_adjust_all_freqs() ?
>>> >> 
>>> >> Not exactly, I had a look at its invocation frequency which seems to
>>> >> decay exponentially with uptime, presumably because the NTP error
>>> >> approaches zero.
>>> >> 
>>> >> However, I've just gathered a function_graph ftrace on my Intel
>>> >> i7-4800MQ (Haswell, 8HTs):
>>> >> 
>>> >> # TIMECPU  DURATION  FUNCTION CALLS
>>> >> #  |  | |   | |   |   |   |
>>> >>85.287027 |   0)   0.899 us|  clockevents_adjust_all_freqs();
>>> >>85.288026 |   0)   0.759 us|  clockevents_adjust_all_freqs();
>>> >>85.289026 |   0)   0.735 us|  clockevents_adjust_all_freqs();
>>> >>85.290026 |   0)   0.671 us|  clockevents_adjust_all_freqs();
>>> >>   149.503656 |   2)   2.477 us|  clockevents_adjust_all_freqs();
>>> >
>>> > That's not that bad. Though I'd like to see numbers for ARM (especially 
>>> > the
>>> > less powerful SoCs) as well.
>>> 
>>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>>
>> So why is the variance that high?
>
> I think this is because the histogram has two peaks, cf. [1].
>
> Also, the 11us maximum is not isolated: a flat tail reaches up to
> this point, which I admittedly can't explain.

It turned out that the linux-next kernel always ran the RPi2B at what
apparently is its minimum speed.

lmbench3's mhz gave me 560MHz, and lat_mem_rd reported a memory latency
of 120ns on linux-next.

Compare this to an "official" kernel from the Raspberry Pi Foundation
obtained from [2]: mhz says that the CPU runs at 900MHz and, according
to lat_mem_rd, the memory latency is 50ns.

Especially the high memory latency killed my benchmark: both the second
peak and the long tail stemmed from cache misses.

In order to verify this, I separated the tracing data from linux-next
into those samples that do not have any other calls to
clockevents_adjust_all_freqs() within a time span of 100ms before them
("first of run") and those that do ("not first of run"). The result can
be seen at [3]: both the second peak and the long tail are generated
exclusively by the "first of run" samples.


Unfortunately I was not able to get this RPi2B running at its full
capabilities with linux-next. So I applied this series on top of the RPi
Foundation's kernel and did further benchmarking there. The results can
be found at [4]: no second peak, no particularly long tail.
Some statistics (durations in us):
 0%   25%   50%   75%  100% 
  1.250 1.511 1.667 1.927 7.031

  Mean: 1.89
  sd: 0.69 

Much better IMHO. Good enough?


A side note: during tracing, I noticed that the adjustment had better
skip CLOCK_EVT_FEAT_DUMMY devices. v8 will do this. Both measurements
above already include that change.


>> You have an outlier on that intel as well which might be caused by
>> NMI, but it might also be a systematic issue depending on the input
>> parameters.
>
> AFAICT, the "algorithmic" runtime should be constant per CED, so it
> should not be dependent on any input parameters.

Well, this is not exactly true: __do_div64() on ARM is implemented in
software, and that algorithm's runtime depends on the position of the
dividend's MSB. However, the range of the "adj" dividend as given by
__clockevents_calc_adjust_freq() should be relatively narrow.
I traced __do_div64() and there weren't any apparent abnormalities.


>> 11 us on that ARM worries me.

The maximum is at 7us now. Also, it isn't nearly as connected to the
rest of the histogram as that 11us sample before, so it *might* be an
outlier now. I can't tell for sure though.


Thanks,

Nicolai


> [1] https://nicst.de/cev-freqadjust/adjust_all_freqs-function_graph_hist.png

[2] https://github.com/raspberrypi/linux
[3] https://nicst.de/cev-freqadjust/hist-adjust-smp.pdf
[4] https://nicst.de/cev-freqadjust/hist-adjust-official-smp.pdf


Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-22 Thread Nicolai Stange
Thomas Gleixner  writes:

> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>> Thomas Gleixner  writes:
>> 
>> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
>> >> Thomas Gleixner  writes:
>> >> > Have you ever measured the overhead of the extra work which has to be 
>> >> > done
>> >> > in clockevents_adjust_all_freqs() ?
>> >> 
>> >> Not exactly, I had a look at its invocation frequency which seems to
>> >> decay exponentially with uptime, presumably because the NTP error
>> >> approaches zero.
>> >> 
>> >> However, I've just gathered a function_graph ftrace on my Intel
>> >> i7-4800MQ (Haswell, 8HTs):
>> >> 
>> >> # TIMECPU  DURATION  FUNCTION CALLS
>> >> #  |  | |   | |   |   |   |
>> >>85.287027 |   0)   0.899 us|  clockevents_adjust_all_freqs();
>> >>85.288026 |   0)   0.759 us|  clockevents_adjust_all_freqs();
>> >>85.289026 |   0)   0.735 us|  clockevents_adjust_all_freqs();
>> >>85.290026 |   0)   0.671 us|  clockevents_adjust_all_freqs();
>> >>   149.503656 |   2)   2.477 us|  clockevents_adjust_all_freqs();
>> >
>> > That's not that bad. Though I'd like to see numbers for ARM (especially the
>> > less powerful SoCs) as well.
>> 
>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>
> So why is the variance that high?

I think this is because the histogram has two peaks, cf. [1].

Also, the 11us maximum is not isolated: a flat tail reaches up to
this point, which I admittedly can't explain.

> You have an outlier on that intel as well which might be caused by
> NMI, but it might also be a systematic issue depending on the input
> parameters.

AFAICT, the "algorithmic" runtime should be constant per CED, so it
should not be dependent on any input parameters.

> 11 us on that ARM worries me.

I'll try to do some more tracing tomorrow in order to find the reason
for that histogram's long tail. But I have to admit that I don't really
know what to look for other than NMIs. Any hints?
What might be remarkable in this context is that the dataset's minimum
is at 2.24us. Perhaps I'm actually seeing the distribution of the
clockevents_lock acquisition times?


Thanks,

Nicolai



[1] https://nicst.de/cev-freqadjust/adjust_all_freqs-function_graph_hist.png


Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-22 Thread Thomas Gleixner
On Wed, 21 Sep 2016, Nicolai Stange wrote:
> Thomas Gleixner  writes:
> 
> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
> >> Thomas Gleixner  writes:
> >> > Have you ever measured the overhead of the extra work which has to be 
> >> > done
> >> > in clockevents_adjust_all_freqs() ?
> >> 
> >> Not exactly, I had a look at its invocation frequency which seems to
> >> decay exponentially with uptime, presumably because the NTP error
> >> approaches zero.
> >> 
> >> However, I've just gathered a function_graph ftrace on my Intel
> >> i7-4800MQ (Haswell, 8HTs):
> >> 
> >> # TIMECPU  DURATION  FUNCTION CALLS
> >> #  |  | |   | |   |   |   |
> >>85.287027 |   0)   0.899 us|  clockevents_adjust_all_freqs();
> >>85.288026 |   0)   0.759 us|  clockevents_adjust_all_freqs();
> >>85.289026 |   0)   0.735 us|  clockevents_adjust_all_freqs();
> >>85.290026 |   0)   0.671 us|  clockevents_adjust_all_freqs();
> >>   149.503656 |   2)   2.477 us|  clockevents_adjust_all_freqs();
> >
> > That's not that bad. Though I'd like to see numbers for ARM (especially the
> > less powerful SoCs) as well.
> 
> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.

So why is the variance that high? You have an outlier on that intel as well
which might be caused by NMI, but it might also be a systematic issue
depending on the input parameters. 11 us on that ARM worries me.

Thanks,

tglx


Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-21 Thread Nicolai Stange
Thomas Gleixner  writes:

> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>> Thomas Gleixner  writes:
>> > Have you ever measured the overhead of the extra work which has to be done
>> > in clockevents_adjust_all_freqs() ?
>> 
>> Not exactly, I had a look at its invocation frequency which seems to
>> decay exponentially with uptime, presumably because the NTP error
>> approaches zero.
>> 
>> However, I've just gathered a function_graph ftrace on my Intel
>> i7-4800MQ (Haswell, 8HTs):
>> 
>> # TIMECPU  DURATION  FUNCTION CALLS
>> #  |  | |   | |   |   |   |
>>85.287027 |   0)   0.899 us|  clockevents_adjust_all_freqs();
>>85.288026 |   0)   0.759 us|  clockevents_adjust_all_freqs();
>>85.289026 |   0)   0.735 us|  clockevents_adjust_all_freqs();
>>85.290026 |   0)   0.671 us|  clockevents_adjust_all_freqs();
>>   149.503656 |   2)   2.477 us|  clockevents_adjust_all_freqs();
>
> That's not that bad. Though I'd like to see numbers for ARM (especially the
> less powerful SoCs) as well.

On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
~5300 samples is 5.14+/-1.04us with a max of 11.15us.

Unfortunately, the invocation frequency doesn't calm down as much as it
did on x86_64: after an uptime of 45min, I'm still seeing approximately
one invocation per second. Right after boot, it was ~3/s.

Thanks,

Nicolai


Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-20 Thread Thomas Gleixner
On Wed, 21 Sep 2016, Nicolai Stange wrote:
> Thomas Gleixner  writes:
> > Have you ever measured the overhead of the extra work which has to be done
> > in clockevents_adjust_all_freqs() ?
> 
> Not exactly, I had a look at its invocation frequency which seems to
> decay exponentially with uptime, presumably because the NTP error
> approaches zero.
> 
> However, I've just gathered a function_graph ftrace on my Intel
> i7-4800MQ (Haswell, 8HTs):
> 
> # TIMECPU  DURATION  FUNCTION CALLS
> #  |  | |   | |   |   |   |
>85.287027 |   0)   0.899 us|  clockevents_adjust_all_freqs();
>85.288026 |   0)   0.759 us|  clockevents_adjust_all_freqs();
>85.289026 |   0)   0.735 us|  clockevents_adjust_all_freqs();
>85.290026 |   0)   0.671 us|  clockevents_adjust_all_freqs();
>   149.503656 |   2)   2.477 us|  clockevents_adjust_all_freqs();

That's not that bad. Though I'd like to see numbers for ARM (especially the
less powerful SoCs) as well.

Thanks,

tglx


Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-20 Thread Nicolai Stange
Thomas Gleixner  writes:

> On Fri, 16 Sep 2016, Nicolai Stange wrote:
>
>> Goal: avoid programming ced devices too early for large deltas, for
>>   details, c.f. the description of [21/23].
>> 
>> [21-23/23] Actually do the frequency adjustments.
>> 
>> Tested on x86_64 and next-20160916.
>
> Have you ever measured the overhead of the extra work which has to be done
> in clockevents_adjust_all_freqs() ?

Not exactly, I had a look at its invocation frequency which seems to
decay exponentially with uptime, presumably because the NTP error
approaches zero.

However, I've just gathered a function_graph ftrace on my Intel
i7-4800MQ (Haswell, 8HTs):

# tracer: function_graph
#
# TIMECPU  DURATION  FUNCTION CALLS
#  |  | |   | |   |   |   |
   85.287027 |   0)   0.899 us|  clockevents_adjust_all_freqs();
   85.288026 |   0)   0.759 us|  clockevents_adjust_all_freqs();
   85.289026 |   0)   0.735 us|  clockevents_adjust_all_freqs();
   85.290026 |   0)   0.671 us|  clockevents_adjust_all_freqs();
  149.503656 |   2)   2.477 us|  clockevents_adjust_all_freqs();
  149.507660 |   2)   2.308 us|  clockevents_adjust_all_freqs();
  149.511658 |   2)   2.651 us|  clockevents_adjust_all_freqs();
  149.545660 |   0)   2.268 us|  clockevents_adjust_all_freqs();
  149.564211 |   2)   2.321 us|  clockevents_adjust_all_freqs();
  214.351899 |   2)   1.520 us|  clockevents_adjust_all_freqs();
  214.354935 |   0)   1.053 us|  clockevents_adjust_all_freqs();
  279.026205 |   0)   2.289 us|  clockevents_adjust_all_freqs();
  279.030195 |   0)   2.190 us|  clockevents_adjust_all_freqs();
  279.034196 |   0)   2.381 us|  clockevents_adjust_all_freqs();
  279.047492 |   2)   2.390 us|  clockevents_adjust_all_freqs();
  344.250356 |   1)   2.727 us|  clockevents_adjust_all_freqs();
  408.879538 |   1)   2.235 us|  clockevents_adjust_all_freqs();
  473.125730 |   6)   1.513 us|  clockevents_adjust_all_freqs();
  473.129731 |   6)   1.650 us|  clockevents_adjust_all_freqs();
  538.387891 |   3)   2.305 us|  clockevents_adjust_all_freqs();
  538.391890 |   3)   2.300 us|  clockevents_adjust_all_freqs();
  668.257162 |   3)   2.691 us|  clockevents_adjust_all_freqs();
  668.261162 |   3)   2.306 us|  clockevents_adjust_all_freqs();
  733.459261 |   0)   1.066 us|  clockevents_adjust_all_freqs();
  733.463261 |   0)   1.233 us|  clockevents_adjust_all_freqs();
  733.467263 |   1)   1.382 us|  clockevents_adjust_all_freqs();
  863.398561 |   2)   2.218 us|  clockevents_adjust_all_freqs();
  863.402552 |   2)   2.792 us|  clockevents_adjust_all_freqs();
 1122.210001 |   3)   2.259 us|  clockevents_adjust_all_freqs();
 1122.214004 |   3)   2.165 us|  clockevents_adjust_all_freqs();
 1381.283287 |   2)   1.944 us|  clockevents_adjust_all_freqs();
 1895.664008 |   2)   1.940 us|  clockevents_adjust_all_freqs();
 1895.668009 |   2)   2.041 us|  clockevents_adjust_all_freqs();
 2930.385388 |   0)   1.067 us|  clockevents_adjust_all_freqs();
 2930.386390 |   5)   1.208 us|  clockevents_adjust_all_freqs();


Thanks,

Nicolai


Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-20 Thread Thomas Gleixner
On Fri, 16 Sep 2016, Nicolai Stange wrote:

> Goal: avoid programming ced devices too early for large deltas, for
>   details, c.f. the description of [21/23].
> 
> [21-23/23] Actually do the frequency adjustments.
> 
> Tested on x86_64 and next-20160916.

Have you ever measured the overhead of the extra work which has to be done
in clockevents_adjust_all_freqs() ?

Thanks,

tglx




[RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-16 Thread Nicolai Stange
Goal: avoid programming ced devices too early for large deltas; for
  details, cf. the description of [21/23].

Previous v6 can be found here:

  http://lkml.kernel.org/r/20160909200033.32103-1-nicsta...@gmail.com

Your objections [0] to v6 were both directed at
[1/23] ("clocksource: sh_cmt: compute rate before registration again"),
namely
- there was a coding style issue due to the removal of braces from an if
  statement
- and I carried the original mult/shift calculation over rather than
  using clockevents_calc_mult_shift() instead.

I fixed the first issue. However, I did nothing about the second one
because I wouldn't feel very confident doing that cleanup: I don't know
why the shift value is set the way it is, and thus I can't tell whether
changing it would break anything. If you still insist on it, I'd prefer
to send a patch separate from this series such that it could get
merged, dropped (or reverted) independently...


This series can be divided into logical subseries as follows:
[1-6/23]   Don't modify ced rate after registrations through mechanisms
   other than clockevents_update_freq().

[7-12/23]  Let all ced devices set their ->*_delta_ticks values and let
   the clockevent core do the rest.

[13/23]Introduce the CLOCK_EVT_FEAT_NO_ADJUST flag

[14-20/23] Fiddle around with the bound checking code in order to
   allow for non-atomic frequency updates from a CPU different
   than where the ced is programmed.

[21-23/23] Actually do the frequency adjustments.


Tested on x86_64 and next-20160916.


[0] http://lkml.kernel.org/r/alpine.DEB.2.20.1609101416420.32361@nanos



Changes to v6:
 Rebased against next-20160916.

 [1/23]  ("clocksource: sh_cmt: compute rate before registration again")
   Do not remove braces at if statement.


Changes to v5:
 [21/23] ("clockevents: initial support for mono to raw time conversion")
   Replace the max_t() in
 adj = max_t(u64, adj, mult_ce_raw / 8);
   by min_t(): mult_ce_raw / 8 actually sets an upper bound on the
   mult adjustments.

 [23/23] ("timekeeping: inform clockevents about freq adjustments")
   Move the clockevents_adjust_all_freqs() invocation from
   timekeeping_apply_adjustment() to timekeeping_freqadjust().
   Reason is given in the patch description.


Changes to v4:
 [1-12/23] Unchanged

 [13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
   New.

 [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
   New. Solves the overflow problem the former
   [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles")
   from v4 introduced.

   (Note that the former
[14/22] ("clockevents: clockevents_program_event(): turn clc into unsigned long")
from v4 has been purged.)

 [15/23] ("clockevents: do comparison of delta against minimum in terms of cycles")
   This is the former
   [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles"),
   but only for the ->min_delta_* -- the ->max_delta_* are handled by [14/23] now.

 [16/23] ("clockevents: clockevents_program_min_delta(): don't set ->next_event")
   Former [15/22] unchanged.

 [17/23] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta")
   Former [16/22]. Trivially fix compilation error with
   CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n.

 [18/22] ("clockevents: min delta increment: calculate min_delta_ns from ticks")
   Former [17/22] unchanged.

 [19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns 
dynamically")
   Corresponds to former
   [18/22] ("timer_list: print_tickdevice(): calculate ->*_delta_ns 
dynamically")
   from v4, but only for ->min_delta_ns. The changes required for the display of
   ->max_delta_ns are now being made in [14/23] already.

 [20/23] ("clockevents: purge ->min_delta_ns")
   Corresponds to former
   [19/22] ("clockevents: purge ->min_delta_ns and ->max_delta_ns"),
   but with ->max_delta_ns being kept.

 [21/23] ("clockevents: initial support for mono to raw time conversion")
   Former [20/22] with the following changes:
   - Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set.
   - Don't meld __clockevents_update_bounds() into __clockevents_adjust_freq()
 anymore: the bounds for those devices having CLOCK_EVT_FEAT_NO_ADJUST set
 must have got their bounds set as well.
   - In __clockevents_calc_adjust_freq(), make sure that the adjusted mult
 doesn't exceed the original by more than 12.5%. C.f. [14/23].
   - In timekeeping, define timekeeping_get_mono_mult() only for
 CONFIG_GENERIC_CLOCKEVENTS=y.

  [22/23] ("clockevents: make setting of ->mult and ->mult_adjusted atomic")
   Former [12/22], but with the description updated: previously, it said that
   this patch would introduce a new locking dependency. This is not true.

  [23/23] ("timekeeping: inform clockevents about freq adjustments")
Former [22/22] with the following changes:
- Don't adjust 

[RFC v7 00/23] adapt clockevents frequencies to mono clock

2016-09-16 Thread Nicolai Stange
Goal: avoid programming ced devices too early for large deltas; for
  details, c.f. the description of [21/23].

Previous v6 can be found here:

  http://lkml.kernel.org/r/20160909200033.32103-1-nicsta...@gmail.com

Your objections [0] to v6 were both directed at
[1/23] ("clocksource: sh_cmt: compute rate before registration again"),
namely:
- there was a coding style issue due to the removal of braces at an if
  statement,
- and I carried the original mult/shift calculation over rather than
  using clockevents_calc_mult_shift() instead.
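For context, the (mult, shift) pair such a helper derives lets the core
approximate cycles = ns * freq / NSEC_PER_SEC as (ns * mult) >> shift. A
minimal userspace sketch of such a derivation (my own illustrative code,
not the kernel's actual implementation):

```c
#include <stdint.h>

/*
 * Illustrative sketch (not the kernel's implementation): derive a
 * (mult, shift) pair so that (value * mult) >> shift approximates
 * value * to / from without overflowing u64 for up to maxsec seconds
 * worth of input.
 */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
                            uint32_t from, uint32_t to, uint32_t maxsec)
{
    uint64_t tmp;
    uint32_t sft, sftacc = 32;

    /* How many bits do maxsec seconds worth of input consume? */
    tmp = ((uint64_t)maxsec * from) >> 32;
    while (tmp) {
        tmp >>= 1;
        sftacc--;
    }

    /* Pick the largest shift whose mult still fits the headroom. */
    for (sft = 32; sft > 0; sft--) {
        tmp = (uint64_t)to << sft;
        tmp += from / 2;        /* round to nearest */
        tmp /= from;
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = (uint32_t)tmp;
    *shift = sft;
}
```

For a 1 MHz clockevent device (from = NSEC_PER_SEC, to = 10^6) this yields
a conversion where 1 ms, i.e. 1000000 ns, maps to roughly 1000 cycles.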

I fixed the first issue. However, I did nothing about the second one
because I don't feel confident about doing this cleanup: I don't know
why the shift value is set the way it is and thus can't tell whether
changing it would break anything. If you still insist on this cleanup,
I'd prefer to send it as a patch separate from this series so that it
could get merged, dropped (or reverted) independently.


This series can be divided into logical subseries as follows:
[1-6/23]   Don't modify ced rate after registrations through mechanisms
   other than clockevents_update_freq().

[7-12/23]  Let all ced devices set their ->*_delta_ticks values and let
   the clockevent core do the rest.

[13/23]    Introduce the CLOCK_EVT_FEAT_NO_ADJUST flag.

[14-20/23] Fiddle around with the bound checking code in order to
   allow for non-atomic frequency updates from a CPU different
   than where the ced is programmed.

[21-23/23] Actually do the frequency adjustments.


Tested on x86_64 and next-20160916.


[0] http://lkml.kernel.org/r/alpine.DEB.2.20.1609101416420.32361@nanos



Changes to v6:
 Rebased against next-20160916.

 [1/23]  ("clocksource: sh_cmt: compute rate before registration again")
   Do not remove braces at if statement.


Changes to v5:
 [21/23] ("clockevents: initial support for mono to raw time conversion")
   Replace the max_t() in
 adj = max_t(u64, adj, mult_ce_raw / 8);
   by min_t(): mult_ce_raw / 8 actually sets an upper bound on the
   mult adjustments.
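That upper bound can be sketched as follows (a standalone function with
illustrative names of my own, not the patch's actual code): the adjustment
applied on top of the raw mult is capped at mult_ce_raw / 8, i.e. 12.5%:

```c
#include <stdint.h>

/*
 * Illustrative sketch: clamp a mult adjustment to at most 12.5% of the
 * device's raw mult, mirroring adj = min_t(u64, adj, mult_ce_raw / 8).
 * Not the patch's actual code.
 */
static uint32_t apply_bounded_adj(uint32_t mult_ce_raw, uint64_t adj)
{
    uint64_t max_adj = mult_ce_raw / 8;     /* upper bound: 12.5% */

    if (adj > max_adj)
        adj = max_adj;                      /* the min_t() in question */
    return mult_ce_raw + (uint32_t)adj;
}
```

With max_t() instead, mult_ce_raw / 8 would have acted as a floor and
large adjustments would have passed through unclamped.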

 [23/23] ("timekeeping: inform clockevents about freq adjustments")
   Move the clockevents_adjust_all_freqs() invocation from
   timekeeping_apply_adjustment() to timekeeping_freqadjust().
   Reason is given in the patch description.


Changes to v4:
 [1-12/23] Unchanged

 [13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
   New.

 [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
   New. Solves the overflow problem the former
   [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles")
   from v4 introduced.

   (Note that the former
   [14/22] ("clockevents: clockevents_program_event(): turn clc into unsigned long")
   from v4 has been purged.)
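The overflow in question comes from the ns-to-cycles conversion: with a
32-bit mult, the intermediate product delta_ns * mult no longer fits into
64 bits once delta_ns exceeds U64_MAX / mult, which is why the maximum
bound is better kept in the tick domain. A hedged check of my own
(illustrative, not kernel code):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Illustrative sketch: detect when converting delta_ns to cycles via
 * (delta_ns * mult) >> shift would overflow the 64-bit intermediate.
 */
static bool ns_to_cyc_would_overflow(uint64_t delta_ns, uint32_t mult)
{
    return mult != 0 && delta_ns > UINT64_MAX / mult;
}
```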

 [15/23] ("clockevents: do comparison of delta against minimum in terms of cycles")
   This is the former
   [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles"),
   but only for the ->min_delta_* -- the ->max_delta_* are handled by [14/23] now.

 [16/23] ("clockevents: clockevents_program_min_delta(): don't set ->next_event")
   Former [15/22] unchanged.

 [17/23] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta")
   Former [16/22]. Trivially fix compilation error with
   CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n.

 [18/23] ("clockevents: min delta increment: calculate min_delta_ns from ticks")
   Former [17/22] unchanged.

 [19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically")
   Corresponds to former
   [18/22] ("timer_list: print_tickdevice(): calculate ->*_delta_ns dynamically")
   from v4, but only for ->min_delta_ns. The changes required for the display of
   ->max_delta_ns are now being made in [14/23] already.
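Computing ->min_delta_ns on the fly is just the inverse conversion, ticks
back to nanoseconds; a simplified sketch in the spirit of the kernel's
clockevent_delta2ns() (my own illustrative code, ignoring the rounding and
clamping the real helper performs):

```c
#include <stdint.h>

/*
 * Illustrative sketch: convert a tick count back to nanoseconds, the
 * inverse of cyc = (ns * mult) >> shift. Assumes ticks << shift fits
 * into 64 bits, which holds for the small min-delta values at hand.
 */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t mult, uint32_t shift)
{
    return (ticks << shift) / mult;
}
```

For a ~1 MHz device (mult 4294967, shift 32), 1000 ticks come back out as
roughly 1 ms.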

 [20/23] ("clockevents: purge ->min_delta_ns")
   Corresponds to former
   [19/22] ("clockevents: purge ->min_delta_ns and ->max_delta_ns"),
   but with ->max_delta_ns being kept.

 [21/23] ("clockevents: initial support for mono to raw time conversion")
   Former [20/22] with the following changes:
   - Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set.
   - Don't meld __clockevents_update_bounds() into __clockevents_adjust_freq()
     anymore: devices having CLOCK_EVT_FEAT_NO_ADJUST set must get their
     bounds set as well.
   - In __clockevents_calc_adjust_freq(), make sure that the adjusted mult
 doesn't exceed the original by more than 12.5%. C.f. [14/23].
   - In timekeeping, define timekeeping_get_mono_mult() only for
 CONFIG_GENERIC_CLOCKEVENTS=y.

 [22/23] ("clockevents: make setting of ->mult and ->mult_adjusted atomic")
   Former [12/22], but with the description updated: previously, it said that
   this patch would introduce a new locking dependency. This is not true.

 [23/23] ("timekeeping: inform clockevents about freq adjustments")
   Former [22/22] with the following changes:
   - Don't adjust