Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On Wed, Mar 14, 2018 at 12:55:20PM +, Jason Vas Dias wrote:

> > You could read the time using the group_fd's mmap() page. That actually
> > includes the TSC mult,shift,offset as used by perf clocks.
>
> Yes, but as mentioned earlier, that presupposes I want to use the mmap()
> sample method - I don't - I want to use the Group FD method, so
> that I can be sure the measurements are for the same code sequence
> over the same period of time.

You can use both: you can use the data from the mmap page to convert the
times obtained from the read() syscall back to raw TSC ticks for all I
care (in fact, that's what some people do). Then your userspace can use
raw RDTSC instructions and not worry about scaling anything.
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On Wed, Mar 14, 2018 at 12:55:20PM +, Jason Vas Dias wrote:

> > If you want to correlate to CLOCK_MONOTONIC_RAW you have to read
> > CLOCK_MONOTONIC_RAW and not some random other clock value.
>
> Exactly! Hence the need for the patch, so that users can get
> CLOCK_MONOTONIC_RAW values with low latency and correlate them
> with PERF CPU_CLOCK values.

No, you _CANNOT_ correlate CLOCK_MONOTONIC_RAW with CPU_CLOCK; that is
_BROKEN_. Yes, it 'works', but that's mostly a happy accident. There is
no guarantee that CLOCK_MONOTONIC_RAW runs off the TSC, and even if both
CPU_CLOCK and CLOCK_MONOTONIC_RAW use the TSC, they need not use the
same rate (and they didn't for a long time).

Do not mix clocks.
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On Wed, Mar 14, 2018 at 12:55:20PM +, Jason Vas Dias wrote:

> > While CPU_CLOCK is TSC based, there is no guarantee it has any
> > correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based).
> >
> > (although, I think I might have fixed that recently and it might just
> > work, but it's very much not guaranteed).
>
> Yes, I believe the CPU_CLOCK is effectively the converted TSC -
> it does appear to correlate well with the new CLOCK_MONOTONIC_RAW
> values from the patched VDSO.

It (now) runs at the same rate, but there is no guarantee for this; in
fact it didn't for a very long time. Relying on this is broken.
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On Wed, Mar 14, 2018 at 12:55:20PM +, Jason Vas Dias wrote:

> > So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and
> > just let them run and do:
> >
> >   read(group_fd, &buf_pre, size);
> >   /* your code section */
> >   read(group_fd, &buf_post, size);
> >
> >   /* compute buf_post - buf_pre */
> >
> > Which is only 2 system calls, not 4.
>
> But I can't, really - I am trying to restrict the performance counter
> measurements to only a subset of the code, and exclude performance
> measurement result processing - so the timeline is like:
>
>       struct timespec t_start, t_end;
>       perf_event_open(...);
>       thread_main_loop() { ... do {
> t       clock_gettime(CLOCK_MONOTONIC_RAW, &t_start);
> t+x     enable_perf();
>         total_work = do_some_work();
>         disable_perf();
>         clock_gettime(CLOCK_MONOTONIC_RAW, &t_end);
> t+y     read_perf_counters_and_store_results
>           ( perf_grp_fd, &results, total_work,
>             TS2T( &t_end ) - TS2T( &t_start ) );
>       } while ( ... );
>       }
>
> Now, here the bandwidth / performance results recorded by
> my 'read_perf_counters_and_store_results' method
> are very sensitive to the measurement of the OUTER
> elapsed time.

I still don't see why you have to do that enable_perf() / disable_perf()
stuff. What goes wrong if you just let them run and do 2 read_perf*()
things?
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks for the helpful comments, Peter -

re: On 14/03/2018, Peter Zijlstra wrote:
>
>> Yes, I am sampling perf counters,
>
> You're not in fact sampling, you're just reading the counters.

Correct, using Linux-ese terminology - but "sampling" in looser English.

>> Reading performance counters does involve 2 ioctls and a read(),
>
> So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and
> just let them run and do:
>
>   read(group_fd, &buf_pre, size);
>   /* your code section */
>   read(group_fd, &buf_post, size);
>
>   /* compute buf_post - buf_pre */
>
> Which is only 2 system calls, not 4.

But I can't, really - I am trying to restrict the performance counter
measurements to only a subset of the code, and exclude performance
measurement result processing - so the timeline is like:

      struct timespec t_start, t_end;
      perf_event_open(...);
      thread_main_loop() { ... do {
t       clock_gettime(CLOCK_MONOTONIC_RAW, &t_start);
t+x     enable_perf();
        total_work = do_some_work();
        disable_perf();
        clock_gettime(CLOCK_MONOTONIC_RAW, &t_end);
t+y     read_perf_counters_and_store_results
          ( perf_grp_fd, &results, total_work,
            TS2T( &t_end ) - TS2T( &t_start ) );
      } while ( ... );
      }

Now, here the bandwidth / performance results recorded by my
'read_perf_counters_and_store_results' method are very sensitive to the
measurement of the OUTER elapsed time.

> Also, a while back there was the proposal to extend the mmap()
> self-monitoring interface to groups, see:
>
>   https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowo...@hirez.programming.kicks-ass.net
>
> I never did get around to writing the actual code for it, but it
> shouldn't be too hard.

Great, I'm looking forward to trying it - but meanwhile, to get
NON-MULTIPLEXED measurements for the SAME CODE SEQUENCE over the SAME
TIME, I believe the group FD method is what is implemented and what
works.

>> The CPU_CLOCK software counter should give the converted TSC cycles
>> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE, ... )
>> and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the
>> difference between the event->time_running and time_enabled
>> should also measure elapsed time.
>
> While CPU_CLOCK is TSC based, there is no guarantee it has any
> correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based).
>
> (although, I think I might have fixed that recently and it might just
> work, but it's very much not guaranteed).

Yes, I believe the CPU_CLOCK is effectively the converted TSC -
it does appear to correlate well with the new CLOCK_MONOTONIC_RAW
values from the patched VDSO.

> If you want to correlate to CLOCK_MONOTONIC_RAW you have to read
> CLOCK_MONOTONIC_RAW and not some random other clock value.

Exactly! Hence the need for the patch, so that users can get
CLOCK_MONOTONIC_RAW values with low latency and correlate them with PERF
CPU_CLOCK values.

>> This gives the "inner" elapsed time, from the perspective of the
>> kernel, while the measured code section had the counters enabled.
>>
>> But unless the user-space program also has a way of measuring elapsed
>> time from the CPU's perspective, ie. without being subject to
>> operator or NTP / PTP adjustment, it has no way of correlating this
>> inner elapsed time with any "outer"
>
> You could read the time using the group_fd's mmap() page. That actually
> includes the TSC mult,shift,offset as used by perf clocks.

Yes, but as mentioned earlier, that presupposes I want to use the mmap()
sample method - I don't - I want to use the Group FD method, so that I
can be sure the measurements are for the same code sequence over the
same period of time.

>> Currently, users must parse the log file or use gdb / objdump to
>> inspect /proc/kcore to get the TSC calibration and exact
>> mult+shift values for the TSC value conversion.
>
> Which ;-) there's multiple floating around..

Yes, but why must Linux make it so difficult?

I think it has to be recognized that the vDSO or user-space program are
the only places in which clock values can be generated for use by
user-space programs with sufficiently low latencies to be useful. So why
does Linux not export the TSC calibration, which is so complex to
compute, when such calibration information is available nowhere else?

>> Intel does not publish, nor does the CPU come with in ROM or firmware,
>> the actual precise TSC frequency - this must be calibrated against the
>> other clocks, according to a complicated procedure in section 18.2 of
>> the SDM. My TSC has a "rated" / nominal TSC frequency, which one can
>> compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency"
>> is 2.8333GHz.
>
> You might
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On Tue, Mar 13, 2018 at 11:45:45PM +, Jason Vas Dias wrote:
> On 12/03/2018, Peter Zijlstra wrote:
> > On Mon, Mar 12, 2018 at 07:01:20AM +, Jason Vas Dias wrote:
> >> Sometimes, particularly when correlating elapsed time to performance
> >> counter values,
> >
> > So what actual problem are you trying to solve here? Perf can already
> > give you sample time in various clocks, including MONOTONIC_RAW.
>
> Yes, I am sampling perf counters,

You're not in fact sampling, you're just reading the counters.

> including CPU_CYCLES, INSTRUCTIONS,
> CPU_CLOCK, TASK_CLOCK, etc, in a Group FD I open with
> perf_event_open(), for the current thread on the current CPU -
> I am doing this for 4 threads, on Intel & ARM cpus.
>
> Reading performance counters does involve 2 ioctls and a read(),
> which takes time that already far exceeds the time required to read
> the TSC or CNTPCT in the VDSO.

So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and
just let them run and do:

  read(group_fd, &buf_pre, size);
  /* your code section */
  read(group_fd, &buf_post, size);

  /* compute buf_post - buf_pre */

Which is only 2 system calls, not 4.

Also, a while back there was the proposal to extend the mmap()
self-monitoring interface to groups, see:

  https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowo...@hirez.programming.kicks-ass.net

I never did get around to writing the actual code for it, but it
shouldn't be too hard.

> The CPU_CLOCK software counter should give the converted TSC cycles
> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE, ... )
> and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the
> difference between the event->time_running and time_enabled
> should also measure elapsed time.

While CPU_CLOCK is TSC based, there is no guarantee it has any
correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based).

(although, I think I might have fixed that recently and it might just
work, but it's very much not guaranteed).
If you want to correlate to CLOCK_MONOTONIC_RAW you have to read
CLOCK_MONOTONIC_RAW and not some random other clock value.

> This gives the "inner" elapsed time, from the perspective of the
> kernel, while the measured code section had the counters enabled.
>
> But unless the user-space program also has a way of measuring elapsed
> time from the CPU's perspective, ie. without being subject to
> operator or NTP / PTP adjustment, it has no way of correlating this
> inner elapsed time with any "outer"

You could read the time using the group_fd's mmap() page. That actually
includes the TSC mult,shift,offset as used by perf clocks.

> Currently, users must parse the log file or use gdb / objdump to
> inspect /proc/kcore to get the TSC calibration and exact
> mult+shift values for the TSC value conversion.

Which ;-) there's multiple floating around..

> Intel does not publish, nor does the CPU come with in ROM or firmware,
> the actual precise TSC frequency - this must be calibrated against the
> other clocks, according to a complicated procedure in section 18.2 of
> the SDM. My TSC has a "rated" / nominal TSC frequency, which one can
> compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency"
> is 2.8333GHz.

You might want to look at commit:

  b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon")

There is no such thing as a precise TSC frequency, there's a reason we
have NTP/PTP.

> Hence I think Linux should export this calibrated frequency somehow;
> its "calibration" is expressed as the raw clocksource 'mult' and
> 'shift' values, and is exported to the VDSO.
>
> I think the VDSO should read the TSC and use the calibration
> to render the raw, unadjusted time from the CPU's perspective.
>
> Hence, the patch I am preparing, which is again attached.

I have no objection to adding CLOCK_MONOTONIC_RAW support to the VDSO,
but you seem to be rather confused on how things work.
Now, if you wanted to actually have CLOCK_MONOTONIC_RAW times from perf
you'd need something like the below patch. You'd need to create your
events with:

  attr.use_clockid = 1;
  attr.clockid = CLOCK_MONOTONIC_RAW;
  attr.read_format |= PERF_FORMAT_TIME;

But whatever you do, you really have to stop mixing clocks, that's
broken, even if it magically works for now.

---
 include/uapi/linux/perf_event.h |  5 ++++-
 kernel/events/core.c            | 23 ++++++++++++++++++++---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 912b85b52344..e210c9a97f2b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -271,9 +271,11 @@ enum {
  *	{ u64 time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
  *	{ u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *	{ u64 id;           } && PERF_FORMAT_ID
+ *	{ u64 time;
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On 12/03/2018, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 07:01:20AM +, Jason Vas Dias wrote:
>> Sometimes, particularly when correlating elapsed time to performance
>> counter values,
>
> So what actual problem are you trying to solve here? Perf can already
> give you sample time in various clocks, including MONOTONIC_RAW.

Yes, I am sampling perf counters, including CPU_CYCLES, INSTRUCTIONS,
CPU_CLOCK, TASK_CLOCK, etc, in a Group FD I open with perf_event_open(),
for the current thread on the current CPU - I am doing this for 4
threads, on Intel & ARM cpus.

Reading performance counters does involve 2 ioctls and a read(), which
takes time that already far exceeds the time required to read the TSC or
CNTPCT in the VDSO.

The CPU_CLOCK software counter should give the converted TSC cycles seen
between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE, ... ) and the
ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the difference between the
event->time_running and time_enabled should also measure elapsed time.

This gives the "inner" elapsed time, from the perspective of the kernel,
while the measured code section had the counters enabled.

But unless the user-space program also has a way of measuring elapsed
time from the CPU's perspective, ie. without being subject to operator
or NTP / PTP adjustment, it has no way of correlating this inner elapsed
time with any "outer" elapsed time measurement it may have made - I also
measure the time taken by I/O operations between threads, for instance.

So that is my primary motivation - for each thread's main run loop, I
enable performance counters and count several PMU counters and the
CPU_CLOCK & TASK_CLOCK. I want to determine with maximal accuracy how
much elapsed time was used actually executing the task's instructions on
the CPU, and how long they took to execute. I want to try to exclude the
time spent gathering and making and analysing the performance
measurements from the time spent running the threads' main loop.
To do this accurately, it is best to exclude variations in time that
occur because of operator or NTP / PTP adjustments. The
CLOCK_MONOTONIC_RAW clock is the ONLY clock that is MEANT to be immune
from any adjustment.

It is meant to be a high-resolution clock with 1ns resolution that
should be subject to no adjustment, and hence one would expect it to
have the lowest latency.

But the way Linux has up to now implemented it, CLOCK_MONOTONIC_RAW has
a resolution (minimum time that can be measured) that varies from
300 - 1000ns. I can read the TSC and store a 16-byte timespec value in
about 8ns on the same CPU.

I understand that Linux must conform to the POSIX interface, which means
it cannot provide sub-nanosecond resolution timers, but it could allow
user-space programs to easily discover the timer calibration so that
user-space programs can read the timers themselves.

Currently, users must parse the log file or use gdb / objdump to inspect
/proc/kcore to get the TSC calibration and exact mult+shift values for
the TSC value conversion.

Intel does not publish, nor does the CPU come with in ROM or firmware,
the actual precise TSC frequency - this must be calibrated against the
other clocks, according to a complicated procedure in section 18.2 of
the SDM. My TSC has a "rated" / nominal TSC frequency, which one can
compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency" is
2.8333GHz.

Hence I think Linux should export this calibrated frequency somehow; its
"calibration" is expressed as the raw clocksource 'mult' and 'shift'
values, and is exported to the VDSO.

I think the VDSO should read the TSC and use the calibration to render
the raw, unadjusted time from the CPU's perspective.

Hence, the patch I am preparing, which is again attached. I will submit
it properly via email once I figure out how to obtain the
'git-send-email' tool, and how to use it to send multiple patches, which
seems to be the only way to submit acceptable patches.
Also, the attached timer program measures a latency of about 20ns with
my patched 4.15.9 kernel, where it measured a latency of 300 - 1000ns
without it.

Thanks & Regards,
Jason

vdso_clock_monotonic_raw_1.patch
Description: Binary data

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>
#include   /* remaining header names lost in archiving */
#include
#include
#include
#include

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

int main(int argc, char *const* argv, char *const* envp)
{
    clockid_t clk = CLOCK_MONOTONIC_RAW;
    bool do_dump = false;
    int argn = 1;
    for (; argn < argc; argn += 1)
        if (argv[argn] != NULL)
            if (*(argv[argn]) == '-')
                switch (*(argv[argn]+1)) {
                case 'm':
                case 'M': clk = CLOCK_MONOTONIC; break;
                case 'd':
                case 'D': do_dump = true; break;
                case '?':
                case 'h':
                case 'u':
                case 'U':
                case 'H':
                    fprintf(stderr, "Usage: timer_latency [ -m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONI
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Hi Jason,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.16-rc4]

url: https://github.com/0day-ci/linux/commits/Jason-Vas-Dias/x86-vdso-on-Intel-VDSO-should-handle-CLOCK_MONOTONIC_RAW/20180313-00
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

>> arch/x86/entry/vsyscall/vsyscall_gtod.c:19:10: fatal error: cpufeatures.h: No such file or directory
    #include <cpufeatures.h>
             ^~~
   compilation terminated.

vim +19 arch/x86/entry/vsyscall/vsyscall_gtod.c

  > 19  #include <cpufeatures.h>
    20

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On Mon, Mar 12, 2018 at 07:01:20AM +, Jason Vas Dias wrote:
> Sometimes, particularly when correlating elapsed time to performance
> counter values,

So what actual problem are you trying to solve here? Perf can already
give you sample time in various clocks, including MONOTONIC_RAW.