Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
Nicolai Stange writes:

> Thomas Gleixner writes:
>
>> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> Thomas Gleixner writes:
>>>
>>> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> >> Thomas Gleixner writes:
>>> >> > Have you ever measured the overhead of the extra work which has
>>> >> > to be done in clockevents_adjust_all_freqs() ?
>>> >>
>>> >> Not exactly, I had a look at its invocation frequency which seems to
>>> >> decay exponentially with uptime, presumably because the NTP error
>>> >> approaches zero.
>>> >>
>>> >> However, I've just gathered a function_graph ftrace on my Intel
>>> >> i7-4800MQ (Haswell, 8 HTs):
>>> >>
>>> >> #     TIME        CPU  DURATION            FUNCTION CALLS
>>> >> #      |          |     |    |             |   |   |   |
>>> >>    85.287027 |  0)  0.899 us |  clockevents_adjust_all_freqs();
>>> >>    85.288026 |  0)  0.759 us |  clockevents_adjust_all_freqs();
>>> >>    85.289026 |  0)  0.735 us |  clockevents_adjust_all_freqs();
>>> >>    85.290026 |  0)  0.671 us |  clockevents_adjust_all_freqs();
>>> >>   149.503656 |  2)  2.477 us |  clockevents_adjust_all_freqs();
>>> >
>>> > That's not that bad. Though I'd like to see numbers for ARM
>>> > (especially the less powerful SoCs) as well.
>>>
>>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>>
>> So why is the variance that high?
>
> I think this is because the histogram has got two peaks, c.f. [1]
>
> Also, the 11us maximum is not isolated but a flat tail is reaching to
> this point which I admittedly can't explain.

It turned out that the linux-next kernel always ran the RPi2B at what
apparently is its minimum speed. lmbench3's mhz gave me 560MHz and
lat_mem_rd reports a memory latency of 120ns on linux-next.

Compare this to an "official" kernel from the Raspberry Pi Foundation
obtained from [2]: mhz says that the CPU runs at 900MHz and according to
lat_mem_rd, the memory latency is at 50ns.
Especially the high memory latency killed my benchmark: both the second
peak and the long tail stemmed from cache misses. In order to verify
this, I separated the tracing data from linux-next into those samples
that do not have any other calls to clockevents_adjust_all_freqs()
within a time span of 100ms before them ("first of run") and those that
do ("not first of run"). The result can be seen at [3]: the second peak
as well as the long tail is generated exclusively by the "first of
run"'s.

Unfortunately I was not able to get this RPi2B running at its full
capabilities with linux-next. So I applied this series on top of the RPi
Foundation's kernel and did further benchmarking there. The results can
be found at [4]: no second peak, no particularly long tail.

Some statistics (us):

    0%     25%    50%    75%    100%
  1.250  1.511  1.667  1.927  7.031

  Mean: 1.89  sd: 0.69

Much better IMHO. Good enough?

A random note: during tracing, I recognized that the adjustment had
better skip CLOCK_EVT_FEAT_DUMMY devices. v8 will do this. Both
measurements include that change already.

>> You have an outlier on that intel as well which might be caused by
>> NMI, but it might also be a systematic issue depending on the input
>> parameters.
>
> AFAICT, the "algorithmic" runtime should be constant per CED, so it
> should not be dependent on any input parameters.

Well, this is not exactly true: __do_div64() on ARM is implemented in
software. Basically, this algorithm's runtime depends on the position of
the dividend's MSB. However, the range of the "adj" dividend as given by
__clockevents_calc_adjust_freq() should be relatively narrow. I traced
__do_div64() and there haven't been any apparent abnormalities.

>> 11 us on that ARM worries me.

These are 7us now. Also, this max value isn't nearly as connected to the
rest of the histogram as that 11us sample before. So it *might* be an
outlier now. I can't tell for sure though.
Thanks,

Nicolai

[1] https://nicst.de/cev-freqadjust/adjust_all_freqs-function_graph_hist.png
[2] https://github.com/raspberrypi/linux
[3] https://nicst.de/cev-freqadjust/hist-adjust-smp.pdf
[4] https://nicst.de/cev-freqadjust/hist-adjust-official-smp.pdf
Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
Thomas Gleixner writes:

> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>> Thomas Gleixner writes:
>>
>> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
>> >> Thomas Gleixner writes:
>> >> > Have you ever measured the overhead of the extra work which has
>> >> > to be done in clockevents_adjust_all_freqs() ?
>> >>
>> >> Not exactly, I had a look at its invocation frequency which seems to
>> >> decay exponentially with uptime, presumably because the NTP error
>> >> approaches zero.
>> >>
>> >> However, I've just gathered a function_graph ftrace on my Intel
>> >> i7-4800MQ (Haswell, 8 HTs):
>> >>
>> >> #     TIME        CPU  DURATION            FUNCTION CALLS
>> >> #      |          |     |    |             |   |   |   |
>> >>    85.287027 |  0)  0.899 us |  clockevents_adjust_all_freqs();
>> >>    85.288026 |  0)  0.759 us |  clockevents_adjust_all_freqs();
>> >>    85.289026 |  0)  0.735 us |  clockevents_adjust_all_freqs();
>> >>    85.290026 |  0)  0.671 us |  clockevents_adjust_all_freqs();
>> >>   149.503656 |  2)  2.477 us |  clockevents_adjust_all_freqs();
>> >
>> > That's not that bad. Though I'd like to see numbers for ARM
>> > (especially the less powerful SoCs) as well.
>>
>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>
> So why is the variance that high?

I think this is because the histogram has got two peaks, c.f. [1]

Also, the 11us maximum is not isolated but a flat tail is reaching to
this point which I admittedly can't explain.

> You have an outlier on that intel as well which might be caused by
> NMI, but it might also be a systematic issue depending on the input
> parameters.

AFAICT, the "algorithmic" runtime should be constant per CED, so it
should not be dependent on any input parameters.

> 11 us on that ARM worries me.

I'll try to do some more tracing tomorrow in order to get the reason for
that histogram's long tail. But I have to admit that I don't really know
what to look for except for NMIs. Any hints?
What might be remarkable in this context is that the dataset's min is at
2.24us. Perhaps I'm actually seeing the distribution of the
clockevents_lock acquisition?

Thanks,

Nicolai

[1] https://nicst.de/cev-freqadjust/adjust_all_freqs-function_graph_hist.png
Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
On Wed, 21 Sep 2016, Nicolai Stange wrote:
> Thomas Gleixner writes:
>
> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
> >> Thomas Gleixner writes:
> >> > Have you ever measured the overhead of the extra work which has
> >> > to be done in clockevents_adjust_all_freqs() ?
> >>
> >> Not exactly, I had a look at its invocation frequency which seems to
> >> decay exponentially with uptime, presumably because the NTP error
> >> approaches zero.
> >>
> >> However, I've just gathered a function_graph ftrace on my Intel
> >> i7-4800MQ (Haswell, 8 HTs):
> >>
> >> #     TIME        CPU  DURATION            FUNCTION CALLS
> >> #      |          |     |    |             |   |   |   |
> >>    85.287027 |  0)  0.899 us |  clockevents_adjust_all_freqs();
> >>    85.288026 |  0)  0.759 us |  clockevents_adjust_all_freqs();
> >>    85.289026 |  0)  0.735 us |  clockevents_adjust_all_freqs();
> >>    85.290026 |  0)  0.671 us |  clockevents_adjust_all_freqs();
> >>   149.503656 |  2)  2.477 us |  clockevents_adjust_all_freqs();
> >
> > That's not that bad. Though I'd like to see numbers for ARM
> > (especially the less powerful SoCs) as well.
>
> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.

So why is the variance that high? You have an outlier on that intel as
well which might be caused by NMI, but it might also be a systematic
issue depending on the input parameters.

11 us on that ARM worries me.

Thanks,

	tglx
Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
Thomas Gleixner writes:

> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>> Thomas Gleixner writes:
>> > Have you ever measured the overhead of the extra work which has to
>> > be done in clockevents_adjust_all_freqs() ?
>>
>> Not exactly, I had a look at its invocation frequency which seems to
>> decay exponentially with uptime, presumably because the NTP error
>> approaches zero.
>>
>> However, I've just gathered a function_graph ftrace on my Intel
>> i7-4800MQ (Haswell, 8 HTs):
>>
>> #     TIME        CPU  DURATION            FUNCTION CALLS
>> #      |          |     |    |             |   |   |   |
>>    85.287027 |  0)  0.899 us |  clockevents_adjust_all_freqs();
>>    85.288026 |  0)  0.759 us |  clockevents_adjust_all_freqs();
>>    85.289026 |  0)  0.735 us |  clockevents_adjust_all_freqs();
>>    85.290026 |  0)  0.671 us |  clockevents_adjust_all_freqs();
>>   149.503656 |  2)  2.477 us |  clockevents_adjust_all_freqs();
>
> That's not that bad. Though I'd like to see numbers for ARM (especially
> the less powerful SoCs) as well.

On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
~5300 samples is 5.14+/-1.04us with a max of 11.15us.

Unfortunately, the invocation frequency doesn't calm down as much as it
did on x86_64: after an uptime of 45min, I'm still seeing approximately
one invocation per second. Right after boot, it was ~3/s.

Thanks,

Nicolai
Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
On Wed, 21 Sep 2016, Nicolai Stange wrote:
> Thomas Gleixner writes:
> > Have you ever measured the overhead of the extra work which has to
> > be done in clockevents_adjust_all_freqs() ?
>
> Not exactly, I had a look at its invocation frequency which seems to
> decay exponentially with uptime, presumably because the NTP error
> approaches zero.
>
> However, I've just gathered a function_graph ftrace on my Intel
> i7-4800MQ (Haswell, 8 HTs):
>
> #     TIME        CPU  DURATION            FUNCTION CALLS
> #      |          |     |    |             |   |   |   |
>    85.287027 |  0)  0.899 us |  clockevents_adjust_all_freqs();
>    85.288026 |  0)  0.759 us |  clockevents_adjust_all_freqs();
>    85.289026 |  0)  0.735 us |  clockevents_adjust_all_freqs();
>    85.290026 |  0)  0.671 us |  clockevents_adjust_all_freqs();
>   149.503656 |  2)  2.477 us |  clockevents_adjust_all_freqs();

That's not that bad. Though I'd like to see numbers for ARM (especially
the less powerful SoCs) as well.

Thanks,

	tglx
Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
Thomas Gleixner writes:

> On Fri, 16 Sep 2016, Nicolai Stange wrote:
>
>> Goal: avoid programming ced devices too early for large deltas, for
>> details, c.f. the description of [21/23].
>>
>> [21-23/23] Actually do the frequency adjustments.
>>
>> Tested on x86_64 and next-20160916.
>
> Have you ever measured the overhead of the extra work which has to be
> done in clockevents_adjust_all_freqs() ?

Not exactly, I had a look at its invocation frequency which seems to
decay exponentially with uptime, presumably because the NTP error
approaches zero.

However, I've just gathered a function_graph ftrace on my Intel
i7-4800MQ (Haswell, 8 HTs):

# tracer: function_graph
#
#     TIME        CPU  DURATION            FUNCTION CALLS
#      |          |     |    |             |   |   |   |
   85.287027 |  0)  0.899 us |  clockevents_adjust_all_freqs();
   85.288026 |  0)  0.759 us |  clockevents_adjust_all_freqs();
   85.289026 |  0)  0.735 us |  clockevents_adjust_all_freqs();
   85.290026 |  0)  0.671 us |  clockevents_adjust_all_freqs();
  149.503656 |  2)  2.477 us |  clockevents_adjust_all_freqs();
  149.507660 |  2)  2.308 us |  clockevents_adjust_all_freqs();
  149.511658 |  2)  2.651 us |  clockevents_adjust_all_freqs();
  149.545660 |  0)  2.268 us |  clockevents_adjust_all_freqs();
  149.564211 |  2)  2.321 us |  clockevents_adjust_all_freqs();
  214.351899 |  2)  1.520 us |  clockevents_adjust_all_freqs();
  214.354935 |  0)  1.053 us |  clockevents_adjust_all_freqs();
  279.026205 |  0)  2.289 us |  clockevents_adjust_all_freqs();
  279.030195 |  0)  2.190 us |  clockevents_adjust_all_freqs();
  279.034196 |  0)  2.381 us |  clockevents_adjust_all_freqs();
  279.047492 |  2)  2.390 us |  clockevents_adjust_all_freqs();
  344.250356 |  1)  2.727 us |  clockevents_adjust_all_freqs();
  408.879538 |  1)  2.235 us |  clockevents_adjust_all_freqs();
  473.125730 |  6)  1.513 us |  clockevents_adjust_all_freqs();
  473.129731 |  6)  1.650 us |  clockevents_adjust_all_freqs();
  538.387891 |  3)  2.305 us |  clockevents_adjust_all_freqs();
  538.391890 |  3)  2.300 us |  clockevents_adjust_all_freqs();
  668.257162 |  3)  2.691 us |  clockevents_adjust_all_freqs();
  668.261162 |  3)  2.306 us |  clockevents_adjust_all_freqs();
  733.459261 |  0)  1.066 us |  clockevents_adjust_all_freqs();
  733.463261 |  0)  1.233 us |  clockevents_adjust_all_freqs();
  733.467263 |  1)  1.382 us |  clockevents_adjust_all_freqs();
  863.398561 |  2)  2.218 us |  clockevents_adjust_all_freqs();
  863.402552 |  2)  2.792 us |  clockevents_adjust_all_freqs();
 1122.210001 |  3)  2.259 us |  clockevents_adjust_all_freqs();
 1122.214004 |  3)  2.165 us |  clockevents_adjust_all_freqs();
 1381.283287 |  2)  1.944 us |  clockevents_adjust_all_freqs();
 1895.664008 |  2)  1.940 us |  clockevents_adjust_all_freqs();
 1895.668009 |  2)  2.041 us |  clockevents_adjust_all_freqs();
 2930.385388 |  0)  1.067 us |  clockevents_adjust_all_freqs();
 2930.386390 |  5)  1.208 us |  clockevents_adjust_all_freqs();

Thanks,

Nicolai
Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
On Fri, 16 Sep 2016, Nicolai Stange wrote:
> Goal: avoid programming ced devices too early for large deltas, for
> details, c.f. the description of [21/23].
>
> [21-23/23] Actually do the frequency adjustments.
>
> Tested on x86_64 and next-20160916.

Have you ever measured the overhead of the extra work which has to be
done in clockevents_adjust_all_freqs() ?

Thanks,

	tglx
[RFC v7 00/23] adapt clockevents frequencies to mono clock
Goal: avoid programming ced devices too early for large deltas; for
details, c.f. the description of [21/23].

Previous v6 can be found here:

  http://lkml.kernel.org/r/20160909200033.32103-1-nicsta...@gmail.com

Your objections [0] to v6 have both been towards [1/23] ("clocksource:
sh_cmt: compute rate before registration again"), namely
- there was a coding style issue due to the removal of braces at an if
  statement
- and I carried the original mult/shift calculation over rather than
  using clockevents_calc_mult_shift() instead.

I fixed the first issue up. However, I did nothing regarding the second
one because I'd not feel very confident about doing this cleanup: I
don't know why the shift value is set the way it is and thus, I can't
tell whether this would break anything. If you still insist on me doing
this, I'd prefer to send a patch separate from this series such that it
could get merged, dropped (or reverted) independently...

This series can be divided into logical subseries as follows:

[1-6/23]   Don't modify ced rate after registrations through mechanisms
           other than clockevents_update_freq().
[7-12/23]  Let all ced devices set their ->*_delta_ticks values and let
           the clockevent core do the rest.
[13/23]    Introduce the CLOCK_EVT_FEAT_NO_ADJUST flag.
[14-20/23] Fiddle around with the bound checking code in order to allow
           for non-atomic frequency updates from a CPU different than
           where the ced is programmed.
[21-23/23] Actually do the frequency adjustments.

Tested on x86_64 and next-20160916.

[0] http://lkml.kernel.org/r/alpine.DEB.2.20.1609101416420.32361@nanos

Changes to v6:
  Rebased against next-20160916.

  [1/23] ("clocksource: sh_cmt: compute rate before registration again")
    Do not remove braces at if statement.

Changes to v5:
  [21/23] ("clockevents: initial support for mono to raw time conversion")
    Replace the max_t() in
      adj = max_t(u64, adj, mult_ce_raw / 8);
    by min_t(): mult_ce_raw / 8 actually sets an upper bound on the
    mult adjustments.
  [23/23] ("timekeeping: inform clockevents about freq adjustments")
    Move the clockevents_adjust_all_freqs() invocation from
    timekeeping_apply_adjustment() to timekeeping_freqadjust(). Reason
    is given in the patch description.

Changes to v4:
  [1-12/23] Unchanged.

  [13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
    New.

  [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
    New. Solves the overflow problem the former [13/22] ("clockevents:
    check a programmed delta's bounds in terms of cycles") from v4
    introduced. (Note that the former [14/22] ("clockevents:
    clockevents_program_event(): turn clc into unsigned long") from v4
    has been purged.)

  [15/23] ("clockevents: do comparison of delta against minimum in
           terms of cycles")
    This is the former [13/22] ("clockevents: check a programmed delta's
    bounds in terms of cycles"), but only for the ->min_delta_* -- the
    ->max_delta_* are handled by [14/23] now.

  [16/23] ("clockevents: clockevents_program_min_delta(): don't set
           ->next_event")
    Former [15/22], unchanged.

  [17/23] ("clockevents: use ->min_delta_ticks_adjusted to program
           minimum delta")
    Former [16/22]. Trivially fix compilation error with
    CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n.

  [18/23] ("clockevents: min delta increment: calculate min_delta_ns
           from ticks")
    Former [17/22], unchanged.

  [19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns
           dynamically")
    Corresponds to former [18/22] ("timer_list: print_tickdevice():
    calculate ->*_delta_ns dynamically") from v4, but only for
    ->min_delta_ns. The changes required for the display of
    ->max_delta_ns are now being made in [14/23] already.

  [20/23] ("clockevents: purge ->min_delta_ns")
    Corresponds to former [19/22] ("clockevents: purge ->min_delta_ns
    and ->max_delta_ns"), but with ->max_delta_ns being kept.

  [21/23] ("clockevents: initial support for mono to raw time conversion")
    Former [20/22] with the following changes:
    - Don't adjust mult for those ced's that have
      CLOCK_EVT_FEAT_NO_ADJUST set.
    - Don't meld __clockevents_update_bounds() into
      __clockevents_adjust_freq() anymore: the devices having
      CLOCK_EVT_FEAT_NO_ADJUST set must have got their bounds set as
      well.
    - In __clockevents_calc_adjust_freq(), make sure that the adjusted
      mult doesn't exceed the original by more than 12.5%. C.f. [14/23].
    - In timekeeping, define timekeeping_get_mono_mult() only for
      CONFIG_GENERIC_CLOCKEVENTS=y.

  [22/23] ("clockevents: make setting of ->mult and ->mult_adjusted
           atomic")
    Former [12/22], but with the description updated: previously, it
    said that this patch would introduce a new locking dependency. This
    is not true.

  [23/23] ("timekeeping: inform clockevents about freq adjustments")
    Former [22/22] with the following changes:
    - Don't adjust
[RFC v7 00/23] adapt clockevents frequencies to mono clock
Goal: avoid programming ced devices too early for large deltas, for details, c.f. the description of [21/23]. Previous v6 can be found here: http://lkml.kernel.org/r/20160909200033.32103-1-nicsta...@gmail.com Your objections [0] to v6 have both been towards [1/23] ("clocksource: sh_cmt: compute rate before registration again"), namely - there was a coding style issue due to the removal of braces at an if statement - and I carried the original mult/shift calculation over rather than using clockevents_calc_mult_shift() instead. I fixed the first issue up. However, I did nothing regarding the second one because I'd not feel very confident about doing this cleanup: I don't know why the shift value is set the way it is and thus, I can't tell whether this would break anything. If you still insist on me doing this, I'd prefer to send a patch separate from this series such that it could get merged, dropped (or reverted) independently... This series can be divided into logical subseries as follows: [1-6/23] Don't modify ced rate after registrations through mechanisms other than clockevents_update_freq(). [7-12/23] Let all ced devices set their ->*_delta_ticks values and let the clockevent core do the rest. [13/23]Introduce the CLOCK_EVT_FEAT_NO_ADJUST flag [14-20/23] Fiddle around with the bound checking code in order to allow for non-atomic frequency updates from a CPU different than where the ced is programmed. [21-23/23] Actually do the frequency adjustments. Tested on x86_64 and next-20160916. [0] http://lkml.kernel.org/r/alpine.DEB.2.20.1609101416420.32361@nanos Changes to v6: Rebased against next-20160916. [1/23] ("clocksource: sh_cmt: compute rate before registration again") Do not remove braces at if statement. Changes to v5: [21/23] ("clockevents: initial support for mono to raw time conversion") Replace the max_t() in adj = max_t(u64, adj, mult_ce_raw / 8); by min_t(): mult_ce_raw / 8 actually sets an upper bound on the mult adjustments. 
  [23/23] ("timekeeping: inform clockevents about freq adjustments")
    Move the clockevents_adjust_all_freqs() invocation from
    timekeeping_apply_adjustment() to timekeeping_freqadjust(). The
    reason is given in the patch description.

Changes to v4:
  [1-12/23]
    Unchanged.

  [13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
    New.

  [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
    New. Solves the overflow problem which the former [13/22]
    ("clockevents: check a programmed delta's bounds in terms of
    cycles") from v4 introduced.
    (Note that the former [14/22] ("clockevents:
    clockevents_program_event(): turn clc into unsigned long") from v4
    has been purged.)

  [15/23] ("clockevents: do comparison of delta against minimum in
           terms of cycles")
    This is the former [13/22] ("clockevents: check a programmed
    delta's bounds in terms of cycles"), but only for the
    ->min_delta_* -- the ->max_delta_* are handled by [14/23] now.

  [16/23] ("clockevents: clockevents_program_min_delta(): don't set
           ->next_event")
    Former [15/22], unchanged.

  [17/23] ("clockevents: use ->min_delta_ticks_adjusted to program
           minimum delta")
    Former [16/22]. Trivially fix a compilation error with
    CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n.

  [18/23] ("clockevents: min delta increment: calculate min_delta_ns
           from ticks")
    Former [17/22], unchanged.

  [19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns
           dynamically")
    Corresponds to the former [18/22] ("timer_list:
    print_tickdevice(): calculate ->*_delta_ns dynamically") from v4,
    but only for ->min_delta_ns. The changes required for the display
    of ->max_delta_ns are now made in [14/23] already.

  [20/23] ("clockevents: purge ->min_delta_ns")
    Corresponds to the former [19/22] ("clockevents: purge
    ->min_delta_ns and ->max_delta_ns"), but with ->max_delta_ns
    being kept.

  [21/23] ("clockevents: initial support for mono to raw time
           conversion")
    Former [20/22] with the following changes:
    - Don't adjust mult for those ceds that have
      CLOCK_EVT_FEAT_NO_ADJUST set.
    - Don't meld __clockevents_update_bounds() into
      __clockevents_adjust_freq() anymore: devices having
      CLOCK_EVT_FEAT_NO_ADJUST set must get their bounds set as well.
    - In __clockevents_calc_adjust_freq(), make sure that the adjusted
      mult doesn't exceed the original by more than 12.5%. C.f.
      [14/23].
    - In timekeeping, define timekeeping_get_mono_mult() only for
      CONFIG_GENERIC_CLOCKEVENTS=y.

  [22/23] ("clockevents: make setting of ->mult and ->mult_adjusted
           atomic")
    Former [21/22], but with the description updated: previously, it
    said that this patch would introduce a new locking dependency.
    This is not true.

  [23/23] ("timekeeping: inform clockevents about freq adjustments")
    Former [22/22] with the following changes:
    - Don't adjust