Re: Differences between builtins and modules
Sorry I didn't see this mail until now.

Randy Dunlap wrote:
> Would someone please answer/reply to this (related) kernel bugzilla entry:
> https://bugzilla.kernel.org/show_bug.cgi?id=118661

Yes, I raised this bug because I think modinfo should return a 0 exit status if a requested module is built-in, not just when it has been loaded, as this modified version does:

  $ modinfo snd
  modinfo: ERROR: Module snd not found.
  built-in: snd
  $ echo $?
  0

What is the query about Bug 118661 that needs to be answered? I don't see any query on the bug report - just a comment from someone who also agrees modinfo should return OK for a built-in module.

Glad to hear someone is finally considering fixing modinfo to report the status of built-in modules - even if it took two years to get a response.

Thanks & Best Regards, Jason
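In the meantime, a userspace tool can detect the built-in case itself by consulting the modules.builtin list that kmod installs alongside the kernel. A minimal sketch in Python (this is not kmod's actual implementation; the normalization of '-' to '_' in module names is the usual convention, assumed here):

```python
import posixpath

def builtin_modules(modules_builtin_text):
    """Parse the text of /lib/modules/$(uname -r)/modules.builtin
    into a set of module names. Each line is a path such as
    "kernel/sound/core/snd.ko"; dashes and underscores in module
    names are treated as interchangeable."""
    names = set()
    for line in modules_builtin_text.splitlines():
        line = line.strip()
        if not line:
            continue
        base = posixpath.basename(line)   # e.g. "snd.ko"
        if base.endswith(".ko"):
            base = base[:-3]              # strip ".ko" -> "snd"
        names.add(base.replace("-", "_"))
    return names

def is_builtin(module, modules_builtin_text):
    """True if the named module is compiled into the kernel."""
    return module.replace("-", "_") in builtin_modules(modules_builtin_text)
```

A caller would read /lib/modules/$(uname -r)/modules.builtin and, on a hit, report "built-in" and exit 0 - exactly the behaviour requested for modinfo in Bug 118661.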
Re: [PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Good day - I believe the last patch I sent, with $subject, addresses all concerns raised so far by reviewers and complies with kernel coding standards. It would be most helpful if you could let me know whether the patch is now acceptable and will be applied at some stage - or, if not, what the problem with it is. My clients are asking whether the patch is going to be in the upstream kernel, and I need to tell them something. Thanks & Best Regards, Jason
[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch implements clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls entirely in the vDSO, without calling vdso_fallback_gettime(). It has been augmented to support compilation with or without -DRETPOLINE / $(RETPOLINE_CFLAGS): when compiled with -DRETPOLINE, not all function calls can be inlined within __vdso_clock_gettime, and all functions invoked by __vdso_clock_gettime must have the 'indirect_branch("keep")' + 'function_return("keep")' attributes in order to compile, otherwise thunk relocations are generated; and the functions cannot all be declared '__always_inline__', otherwise a compile error ('not all __always_inline__ functions can be inlined', promoted by -Werror) results. Also, compared to the previous version of this patch, the do_*_coarse functions remain non-inline and are not inadvertently changed to inline. I still think it might be better to apply H.J. Lu's patch from https://bugzilla.kernel.org/show_bug.cgi?id=199129 to disable -DRETPOLINE compilation for the vDSO.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..80d65d4 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,62 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+# define _NO_THUNK_RELOCS_() (indirect_branch("keep"),\
+			      function_return("keep"))
+# define _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+# define _RETPOLINE_INLINE_ inline
+#else
+# define _RETPOLINE_FUNC_ATTR_
+# define _RETPOLINE_INLINE_ __always_inline
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -225,7 +258,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -246,7 +280,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -256,7 +313,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -266,7 +324,8 @@ notrace static void do_monotonic_coarse(struct timespec
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the vDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, ...) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall with an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running with respect to NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the vDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec values in the vsyscall_gtod_data during vsyscall_update(). The new do_monotonic_raw() function in the vDSO now has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc6 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. It affects only these files:
  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, which demonstrates the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls; after the patch, the same call on the same machine has a latency of ~20ns. Please consider applying something like this patch to a future Linux release.

This patch is being resent because it has slight improvements to the vclock_gettime static function attributes with respect to the previous version. It supersedes all previous patches I have sent with subjects matching '.*VDSO should handle.*clock_gettime.*MONOTONIC_RAW' - sorry for the resends. Please apply this patch so we stop getting emails from the Intel build bot trying to build the previous version, with subject '[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall', which fails to build only because its patch 2/2 - which removed -DRETPOLINE from the vDSO build, and is now the subject of https://bugzilla.kernel.org/show_bug.cgi?id=199129, raised by H.J. Lu - was not applied first. Sorry!

Thanks & Best Regards,
Jason Vas Dias
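The latency figures above can be reproduced from userspace without the attached timer_latency.c. A minimal Python sketch (not the test program from the bug report) that estimates the mean cost of one clock_gettime() call by timing a tight loop:

```python
import time

def mean_call_latency_ns(clock_id, iterations=100_000):
    """Estimate the mean latency of one clock_gettime() call for the
    given clock by timing a tight loop of calls and dividing the
    elapsed CLOCK_MONOTONIC time by the iteration count."""
    start = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    for _ in range(iterations):
        time.clock_gettime_ns(clock_id)
    elapsed = time.clock_gettime_ns(time.CLOCK_MONOTONIC) - start
    return elapsed / iterations

if __name__ == "__main__":
    # CLOCK_MONOTONIC is served by the vDSO; on unpatched kernels
    # CLOCK_MONOTONIC_RAW (Linux-only) falls back to a syscall, so it
    # should show the much higher latency described above.
    print("CLOCK_MONOTONIC     :", mean_call_latency_ns(time.CLOCK_MONOTONIC), "ns/call")
    if hasattr(time, "CLOCK_MONOTONIC_RAW"):
        print("CLOCK_MONOTONIC_RAW :", mean_call_latency_ns(time.CLOCK_MONOTONIC_RAW), "ns/call")
```

Note the Python figures include interpreter overhead on top of the raw vDSO/syscall cost, so only the relative difference between the two clocks is meaningful, not the absolute ~20ns number.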
Re: [PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Note there is a bug raised by H.J. Lu - Bug 199129: "Don't build vDSO with $(RETPOLINE_CFLAGS) -DRETPOLINE" (https://bugzilla.kernel.org/show_bug.cgi?id=199129). If you agree it is a bug, then use both patches from the post '[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall'; otherwise, use the single patch from $subject, which makes the calls to the static functions in vclock_gettime.c use indirect_branch("keep") / function_return("keep"), to avoid generating thunk relocations, which would not occur unless compiled with -mindirect-branch=thunk-extern -mindirect-branch-register. Thanks & Regards, Jason
[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls the same way it handles clock_gettime(CLOCK_MONOTONIC, ...) calls, reducing latency from roughly 200-1000ns to roughly 20ns. It has been resent and augmented to support compilation with -DRETPOLINE / -mindirect-branch=thunk-extern -mindirect-branch-register, avoiding the generation of relocations for thunks.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..9b89f86 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,60 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+# define _NO_THUNK_RELOCS_() (indirect_branch("keep"),\
+			      function_return("keep"))
+# define _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+#else
+# define _RETPOLINE_FUNC_ATTR_
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -225,7 +256,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -246,7 +278,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -256,7 +311,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -266,7 +322,11 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace
+#ifdef RETPOLINE
+	__attribute__((indirect_branch("keep"), function_return("keep")))
+#endif
+int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
@@ -277,6 +337,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the vDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, ...) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall with an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running with respect to NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the vDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec values in the vsyscall_gtod_data during vsyscall_update(). The new do_monotonic_raw() function in the vDSO now has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. It affects only these files:
  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, which demonstrates the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls; after the patch, the same call on the same machine has a latency of ~20ns. Please consider applying something like this patch to a future Linux release.

Thanks & Best Regards,
Jason Vas Dias
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and allow builds with compilers that support -DRETPOLINE to succeed. Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161, as is the test program, timer_latency.c, to demonstrate the problem. 
Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls; after the patch, the same call on the same machine has a latency of about 20ns. Please consider applying something like this patch to a future Linux release. Thanks & Best Regards, Jason Vas Dias
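The vsyscall_gtod_data fields the patch exports are read in the vDSO under the gtod_read_begin()/gtod_read_retry() sequence-counter protocol. Below is a simplified user-space model of that reader/writer discipline; the names, the C11 atomics, and the two fields are illustrative stand-ins, not the kernel's implementation:

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of the seqcount protocol: the writer makes the counter odd
 * while an update is in progress and even when the data is stable;
 * a reader retries if the counter changed during its read. */
static atomic_uint seq;
static unsigned long long mono_sec, mono_nsec;

static void writer_update(unsigned long long s, unsigned long long ns)
{
	atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* odd: update in progress */
	mono_sec = s;
	mono_nsec = ns;
	atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* even: stable again */
}

static unsigned read_begin(void)
{
	unsigned s;

	/* spin while a write is in progress (odd counter) */
	while ((s = atomic_load_explicit(&seq, memory_order_acquire)) & 1)
		;
	return s;
}

static int read_retry(unsigned start)
{
	return atomic_load_explicit(&seq, memory_order_acquire) != start;
}

static void reader_snapshot(unsigned long long *s, unsigned long long *ns)
{
	unsigned start;

	do {
		start = read_begin();
		*s = mono_sec;
		*ns = mono_nsec;
	} while (read_retry(start));
}
```

The retry loop is what makes a torn read detectable: if the writer bumped the counter between read_begin() and read_retry(), the snapshot is discarded and taken again.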
Re: [PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
On 18/03/2018, Jason Vas Dias <jason.vas.d...@gmail.com> wrote: (should have CC'ed to list, sorry)
> On 17/03/2018, Andi Kleen <a...@firstfloor.org> wrote:
>>
>> That's quite a mischaracterization of the issue. gcc works as intended,
>> but the kernel did not correctly supply an indirect call retpoline thunk
>> to the vdso, and it just happened to work by accident with the old
>> vdso.
>>
>>>
>>> The automated test builds should now succeed with this patch.
>>
>> How about just adding the thunk function to the vdso object instead of
>> this cheap hack?
>>
>> The other option would be to build vdso with inline thunks.
>>
>> But just disabling is completely the wrong action.
>>
>> -Andi
>>
>
> Aha! Thanks for the clarification, Andi!
>
> I will do so and resend the 2nd patch.
>
> But is everyone agreed we should accept any slowdown for the timer
> functions? I personally don't think it is a good idea, but I will
> regenerate the patch with the thunk function and without
> the Makefile change.
>
> Thanks & Best Regards,
> Jason
>

I am wondering if it is not better to avoid the thunk being generated and to remove the Makefile patch. I know that changing the switch in __vdso_clock_gettime() like this avoids the thunk:

	switch (clock) {
	case CLOCK_MONOTONIC:
		if (do_monotonic(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	default:
		switch (clock) {
		case CLOCK_REALTIME:
			if (do_realtime(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_MONOTONIC_RAW:
			if (do_monotonic_raw(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_REALTIME_COARSE:
			do_realtime_coarse(ts);
			break;
		case CLOCK_MONOTONIC_COARSE:
			do_monotonic_coarse(ts);
			break;
		default:
			goto fallback;
		}
		return 0;
	fallback:
		...
	}

So at the cost of an unnecessary extra test of the clock parameter, the thunk is avoided. I wonder if the whole switch should be changed to an if / else clause?
Or, I know this might be unorthodox, but it might work (using GCC's label-address operator '&&' and computed goto):

	#define _CAT(V1,V2) V1##V2
	#define GTOD_CLK_LABEL(CLK) _CAT(_VCG_L_,CLK)

	#define MAX_CLK 16 /* ?? */

	__vdso_clock_gettime( ... )
	{
		...
		static const void *clklbl_tab[MAX_CLK] = {
			[CLOCK_MONOTONIC]     = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC),
			[CLOCK_MONOTONIC_RAW] = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW),
			/* and similarly for all clocks handled */
			...
		};

		goto *clklbl_tab[clock & 0xf];

	GTOD_CLK_LABEL(CLOCK_MONOTONIC):
		if (do_monotonic(ts) == VCLOCK_NONE)
			goto fallback;
		return 0;

	GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW):
		if (do_monotonic_raw(ts) == VCLOCK_NONE)
			goto fallback;
		return 0;

		... /* similarly for all clocks */

	fallback:
		return vdso_fallback_gettime(clock, ts);
	}

If a restructuring like that might be acceptable (with correct tab-based formatting), and the VDSO can have such a table in its .BSS, I think it would avoid the thunk, have the advantage of precomputing the jump table at compile time, and would not require any indirect branches, I think.

Any thoughts?

Thanks & Best regards, Jason
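For reference, the sketch above can be made to compile with GNU C's &&label extension and goto *; here is a minimal, self-contained model, where the CLK_* ids and the return values are illustrative stand-ins for the real clockid_t values and do_*() handlers (not the vDSO code):

```c
#include <assert.h>
#include <stddef.h>

/* Computed-goto clock dispatch modelled on the proposal.
 * Requires GNU C (label addresses via && and goto *). */
enum { CLK_MONO = 1, CLK_MONO_RAW = 4, MAX_CLK = 16 };

static int dispatch(int clock)
{
	static const void *clklbl_tab[MAX_CLK] = {
		[CLK_MONO]     = &&l_mono,
		[CLK_MONO_RAW] = &&l_mono_raw,
		/* unlisted entries are NULL */
	};

	/* unhandled ids have no label to jump to */
	if (clock < 0 || clock >= MAX_CLK || clklbl_tab[clock] == NULL)
		goto fallback;

	goto *clklbl_tab[clock];

l_mono:
	return 100;	/* stand-in for do_monotonic(ts) */
l_mono_raw:
	return 200;	/* stand-in for do_monotonic_raw(ts) */
fallback:
	return -1;	/* stand-in for vdso_fallback_gettime(clock, ts) */
}
```

Note the NULL check for unhandled ids: a bare `goto *clklbl_tab[clock & 0xf]` would jump through a null pointer for any clock id the table does not list. Also note that a computed goto is itself an indirect jump at the machine level, which is relevant to the retpoline discussion in this thread.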
Re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
fixed typo in timer_latency.c affecting only -r printout :

$ gcc -DN_SAMPLES=1000 -o timer timer_latency.c

CLOCK_MONOTONIC (using rdtscp_ordered()):

$ ./timer -m -r 10
sum: 67615 Total time: 0.67615S - Average Latency: 0.00067S N zero deltas: 0 N inconsistent deltas: 0
sum: 51858 Total time: 0.51858S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51742 Total time: 0.51742S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51944 Total time: 0.51944S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51838 Total time: 0.51838S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 52397 Total time: 0.52397S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52428 Total time: 0.52428S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52135 Total time: 0.52135S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52145 Total time: 0.52145S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 53116 Total time: 0.53116S - Average Latency: 0.00053S N zero deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00053S

CLOCK_MONOTONIC_RAW (using rdtscp()):

$ ./timer -r 10
sum: 25755 Total time: 0.25755S - Average Latency: 0.00025S N zero deltas: 0 N inconsistent deltas: 0
sum: 21614 Total time: 0.21614S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21616 Total time: 0.21616S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21610 Total time: 0.21610S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21619 Total time: 0.21619S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21617 Total time: 0.21617S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21610 Total time: 0.21610S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 16940 Total time: 0.16940S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
sum: 16939 Total time: 0.16939S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
sum: 16943 Total time: 0.16943S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00019S

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif

#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec) * 1000000000ULL) \
		     + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const *argv, char *const *envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt = N_SAMPLES, s = 0, avg_n = 0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum = 0, zd = 0, ic = 0, d,
		t_start, avg_ns, *avgs = 0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1, repeat = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case 'd':
				case 'D':
					do_dump = true;
					break;
				case 'r':
				case 'R':
					if ((argn < argc) && (argv[argn+1] != NULL))
						repeat = atoi(argv[argn += 1]);
					break;
				case '?':
				case 'h':
				case 'u':
				case 'U':
				case 'H':
					fprintf(stderr, "Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t-r <n>\n]\tCalculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n");
					return 0;
				}

	if (repeat > 1) {
		avgs = alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
		if (((unsigned long)avgs) & 7)
			avgs = (unsigned long long *)
				(((unsigned char *)avgs) + (8 - (((unsigned long)avgs) & 7)));
	}

	do {
		cnt = N_SAMPLES;
		s = 0;
		do {
			if (0 != clock_gettime(clk, &sample[s++])) {
				fprintf(stderr, "oops, clock_gettime() failed: %d: '%s'.\n",
					errno, strerror(errno));
				return 1;
			}
		} while (--cnt);
		clock_gettime(clk, &sample[s]);
		for (s = 1; s < (N_SAMPLES+1); s += 1) {
			t1 = TS2NS(sample[s-1]);
			t2 = TS2NS(sample[s]);
			if ((t1 > t2)
			    || (sample[s-1].tv_sec > sample[s].tv_sec)
			    || ((sample[s-1].tv_sec == sample[s].tv_sec)
				&& (sample[s-1].tv_nsec > sample[s].tv_nsec))) {
				fprintf(stderr, "Inconsistency: %llu %llu %lu.%lu %lu.%lu\n",
					t1, t2,
					sample[s-1].tv_sec, sample[s-1].tv_nsec,
					sample[s].tv_sec, sample[s].tv_nsec);
				ic += 1;
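The TS2NS macro in the listing above is garbled by the archive; the conversion it intends, struct timespec to total nanoseconds, can be sketched as a plain function (a reconstruction, assuming the usual sec * 1e9 + nsec form):

```c
#include <assert.h>
#include <time.h>

/* struct timespec -> total nanoseconds: seconds scaled by 10^9
 * plus the nanosecond field. */
static unsigned long long ts2ns(const struct timespec *ts)
{
	return (unsigned long long)ts->tv_sec * 1000000000ULL
	       + (unsigned long long)ts->tv_nsec;
}
```

The latency program's per-sample delta is then just ts2ns(&sample[s]) - ts2ns(&sample[s-1]).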
re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Hi - I submitted a new version of the patch, stripped down to the bare essentials (see LKML emails with $subject), which passes all checkpatch.pl tests, addresses all concerns raised by reviewers, uses only rdtsc_ordered(), and updates in vsyscall_gtod_data only the new fields:

	u32 raw_mult, raw_shift;
	...
	gtod_long_t monotonic_time_raw_sec  /* == tk->raw_sec */ ,
	            monotonic_time_raw_nsec /* == tk->tkr_raw.nsec */ ;

(this is NOT the formatting used in vgtod.h - sorry about previous formatting issues).

I don't see how one could present the raw timespec in user-space properly without tk->tkr_raw.xtime_nsec and tk->raw_sec; monotonic has gtod->monotonic_time_sec and gtod->monotonic_time_snsec, and I am only trying to follow exactly the existing algorithm in timekeeping.c's getrawmonotonic64().

When I submitted the initial version of this stripped-down patch, I got an email back from robot<l...@intel.com> reporting a compilation error:

> arch/x86/entry/vdso/vclock_gettime.o: In function `__vdso_clock_gettime':
> vclock_gettime.c:(.text+0xf7): undefined reference to `__x86_indirect_thunk_rax'
> /usr/bin/ld: arch/x86/entry/vdso/vclock_gettime.o: relocation R_X86_64_PC32
> against undefined symbol `__x86_indirect_thunk_rax' can not be used when
> making a shared object; recompile with -fPIC
> /usr/bin/ld: final link failed: Bad value
>>> collect2: error: ld returned 1 exit status
>--
>>> arch/x86/entry/vdso/vdso32.so.dbg: undefined symbols found
>--
>>> objcopy: 'arch/x86/entry/vdso/vdso64.so.dbg': No such file
>---

I had fixed this problem with the patch to the RHEL kernel attached to bug #198161 (attachment #274751: https://bugzilla.kernel.org/attachment.cgi?id=274751), by simply reducing the number of clauses in __vdso_clock_gettime's switch(clock) from 6 to 5, but at the cost of an extra test of clock and a second switch(clock).
I reported this as GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908, because I don't think GCC should fail for a switch with 6 clauses and not for one with 5, but the response I got from H.J. Lu was:

H.J. Lu wrote @ 2018-03-16 22:13:27 UTC:
> vDSO isn't compiled with $(KBUILD_CFLAGS). Why does your kernel do it?
> Please try my kernel patch at comment 4.

So that patch to the arch/x86/vdso/Makefile only prevents it enabling the RETPOLINE_CFLAGS for building the vDSO.

I defer to H.J.'s expertise on GCC + binutils and the advisability of enabling RETPOLINE_CFLAGS in the VDSO - GCC definitely behaves strangely for the vDSO when RETPOLINE_CFLAGS are enabled.

Please provide something like the patch in a future version of Linux, and I suggest not compiling the vDSO with RETPOLINE_CFLAGS, as does H.J.

The inconsistency_check program in tools/testing/selftests/timers produces no errors for long runs, and the timer_latency.c program (attached) also produces no errors, with latencies of about 20ns for CLOCK_MONOTONIC_RAW and about 40ns for CLOCK_MONOTONIC - this is however with the additional rdtscp patches, and under 4.15.9, for use on my system; the 4.16-rc5 version submitted still uses barrier() + rdtsc, and that has a latency of about 30ns for CLOCK_MONOTONIC_RAW and about 40ns for CLOCK_MONOTONIC; but both are much, much better than the 200-1000ns for CLOCK_MONOTONIC_RAW that the unpatched kernels have (all times refer to the 'Average Latency' output produced by timer_latency.c).

I do apologize for whitespace errors, unread emails, resends and confusion of previous emails - I now understand the process and standards much better and will attempt to adhere to them more closely in future.

Thanks & Best Regards, Jason Vas Dias

[timer_latency.c attachment: same listing as shown earlier in the thread]
[PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch allows compilation to succeed with compilers that support -DRETPOLINE - it was kindly contributed by H.J. Lu in GCC Bugzilla 84908: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908

Apparently the GCC retpoline implementation has a limitation that it cannot handle switch statements with more than 5 clauses, which vclock_gettime.c's __vdso_clock_gettime function now contains.

The automated test builds should now succeed with this patch.

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 1943aeb..cb64e10 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -76,7 +76,7 @@ CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
        -fno-omit-frame-pointer -foptimize-sibling-calls \
        -DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
 
-$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS)) $(CFL)
 
 #
 # vDSO code runs in userspace and -pg doesn't help with profiling anyway.
@@ -143,6 +143,7 @@ KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
 KBUILD_CFLAGS_32 += $(call cc-option, -fno-stack-protector)
 KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)
[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, ...) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, with an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model: 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198161' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().

The new do_monotonic_raw() function in the vDSO has a latency of about 20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

This patch affects only the files:
 arch/x86/include/asm/vgtod.h
 arch/x86/entry/vdso/vclock_gettime.c
 arch/x86/entry/vsyscall/vsyscall_gtod.c
 arch/x86/entry/vdso/Makefile

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161, as is the test program, timer_latency.c, which demonstrates the problem.
Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls; after the patch, the same call on the same machine has a latency of about 20ns. Thanks & Best Regards, Jason Vas Dias
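The raw_mult/raw_shift pair the patch exports drives the standard timekeeping cycles-to-nanoseconds conversion - the same (delta * mult) >> shift arithmetic that do_monotonic_raw() performs on the TSC delta. A standalone sketch of that scaling, with illustrative mult/shift values rather than a real TSC calibration:

```c
#include <assert.h>
#include <stdint.h>

/* Timekeeping-style cycles -> nanoseconds conversion: the delta since
 * the last update is masked to the counter width, multiplied by a
 * precomputed fixed-point factor, and shifted back down. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t cycle_last,
			     uint64_t mask, uint32_t mult, uint32_t shift)
{
	uint64_t delta = (cycles - cycle_last) & mask;

	return (delta * mult) >> shift;
}
```

With mult = 1 << shift the factor is exactly 1 ns per cycle; real calibrations pick mult/shift so that mult / 2^shift approximates the nanoseconds-per-cycle ratio of the clocksource. The mask also makes the subtraction robust against counter wraparound.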
[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and allow builds with compilers that support -DRETPOLINE to succeed. Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c arch/x86/entry/vdso/Makefile Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161, as is the test program, timer_latency.c, to demonstrate the problem. 
Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW,) calls - after the patch, the same call on the same machine has a latency of @ 20ns. Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW,) calls in the same way it handles clock_gettime(CLOCK_MONOTONIC,) calls, reducing latency from @ 200-1000ns to @ 20ns.

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..843b0a6 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static __always_inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
  */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..c4d89b6 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -44,6 +44,8 @@ void update_vsyscall(struct timekeeper *tk)
 	vdata->mask		= tk->tkr_mono.mask;
 	vdata->mult		= tk->tkr_mono.mult;
 	vdata->shift		= tk->tkr_mono.shift;
+	vdata->raw_mult		= tk->tkr_raw.mult;
+	vdata->raw_shift	= tk->tkr_raw.shift;
 
 	vdata->wall_time_sec		= tk->xtime_sec;
 	vdata->wall_time_snsec		= tk->tkr_mono.xtime_nsec;
@@ -74,5 +76,8 @@ void update_vsyscall(struct timekeeper *tk)
 		vdata->monotonic_time_coarse_sec++;
 	}
 
+	vdata->monotonic_time_raw_sec	= tk->raw_sec;
+	vdata->monotonic_time_raw_nsec	= tk->tkr_raw.xtime_nsec;
+
 	gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..ec1a37c 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,7 +22,8 @@ struct vsyscall_gtod_data {
 	u64	mask;
 	u32	mult;
 	u32	shift;
-
+	u32	raw_mult;
+	u32	raw_shift;
 	/* open coded 'struct timespec' */
 	u64	wall_time_snsec;
 	gtod_long_t	wall_time_sec;
@@ -32,6 +33,8 @@ struct vsyscall_gtod_data {
 	gtod_long_t	wall_time_coarse_nsec;
 	gtod_long_t	monotonic_time_coarse_sec;
 	gtod_long_t	monotonic_time_coarse_nsec;
+	gtod_long_t	monotonic_time_raw_sec;
+	gtod_long_t	monotonic_time_raw_nsec;
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Good day - RE: On 15/03/2018, Thomas Gleixner <t...@linutronix.de> wrote: > On Thu, 15 Mar 2018, Jason Vas Dias wrote: >> On 15/03/2018, Thomas Gleixner <t...@linutronix.de> wrote: >> > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote: >> > >> >> Resent to address reviewer comments. >> > >> > I was being patient so far and tried to guide you through the patch >> > submission process, but unfortunately this turns out to be just waste of >> > my >> > time. >> > >> > You have not addressed any of the comments I made here: >> > >> > [1] >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de >> > [2] >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de >> > >> >> I'm really sorry about that - I did not see those mails , >> and have searched for them in my inbox - > > That's close to the 'my dog ate the homework' excuse. > Nevertheless, those messages are NOT in my inbox, nor can I find them on the list - a google search for 'alpine.DEB.2.21.1803141511340.2481' or 'alpine.DEB.2.21.1803141527300.2481' returns only the last two mails on the subject , where you included the links to https://lkml.kernel.org. I don't know what went wrong here, but I did not receive those mails until you informed me of them yesterday evening, when I immediately regenerated the Patch #1 incorporating fixes for your comments, and sent it with Subject: '[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle\ clock_gettime(CLOCK_MONOTONIC_RAW) without syscall ' This version re-uses the 'gtod->cycles' value, which as you point out, is the same as 'tk->tkr_raw.cycle_last' - so I removed vread_tsc_raw() . > Of course they were sent to the list and to you personally as I used > reply-all. 
From the mail server log: > > 2018-03-14 15:27:27 1ew7NH-00039q-Hv <= t...@linutronix.de > id=alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de > > 2018-03-14 15:27:30 1ew7NH-00039q-Hv => jason.vas.d...@gmail.com R=dnslookup > T=remote_smtp H=gmail-smtp-in.l.google.com [2a00:1450:4013:c01::1a] > X=TLS1.2:RSA_AES_128_CBC_SHA1:128 DN="C=US,ST=California,L=Mountain > View,O=Google Inc,CN=mx.google.com" > > 2018-03-14 15:27:31 1ew7NH-00039q-Hv => linux-kernel@vger.kernel.org > R=dnslookup T=remote_smtp H=vger.kernel.org [209.132.180.67] > > > > 2018-03-14 15:27:47 1ew7NH-00039q-Hv Completed > > If those messages would not have been delivered to > linux-kernel@vger.kernel.org they would hardly be on the mailing list > archive, right? > Yes, I cannot explain why I did not receive them. I guess I should consider gmail an unreliable delivery method and use the lkml.org web interface to check for replies - I will do this from now on. > And they both got delivered to your gmail account as well. > No, they are not in my gmail account Inbox or folders. > ERROR: Missing Signed-off-by: line(s) > total: 1 errors, 0 warnings, 71 lines checked > I do not know how to fix this error - I was hoping someone on the list might enlighten me. > > WARNING: externs should be avoided in .c files > #24: FILE: arch/x86/entry/vdso/vclock_gettime.c:31: > +extern unsigned int __vdso_tsc_calibration( > I thought that must be a script bug, since no extern is being declared by that line; it is an external function declaration, just like the unmodified line that precedes it. > WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? > #93: > new file mode 100644 > > ERROR: Missing Signed-off-by: line(s) > > total: 1 errors, 2 warnings, 143 lines checked > > It reports an error for every single patch of your latest submission. > >> And I did send the test results in a previous mail - > > In private mail which I ignore if there is no real good reason. 
And just > for the record. This private mail contains the following headers: > > In-Reply-To: <alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de> > References: <1521001222-10712-1-git-send-email-jason.vas.d...@gmail.com> > <1521001222-10712-3-git-send-email-jason.vas.d...@gmail.com> > <alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de> > From: Jason Vas Dias <jason.vas.d...@gmail.com> > Date: Wed, 14 Mar 2018 15:08:55 + > Message-ID: > <calyzvkwb667x-adq4pe8p7_oc2-gdjwqcw4ch4naadmw9zo...@mail.gmail.com> > Subject: Re: [PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle > CLOCK_MONOTONIC_RAW > > So now, if you take the message ID which is in the In-Reply-To: field and > compare it to the message ID which I used for link [2]: > > In-Reply-To: <alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de> >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de > > you might notice that these are identical. So how did you end up replying > to a mail which you never received? > > Nice try. I'm really fed up with this.
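The checkpatch "Missing Signed-off-by" error quoted in this thread has a standard fix: commit with `git commit -s` (or amend with `git commit --amend -s`) so git appends the Signed-off-by: trailer, which then appears in the patch generated by `git format-patch`. A generic illustration (the name, email, and file here are hypothetical):

```shell
# Create a throwaway repo and make a commit with a Signed-off-by: trailer.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Jane Dev"
git config user.email "jane@example.com"
echo hello > f.txt
git add f.txt
# -s appends "Signed-off-by: Jane Dev <jane@example.com>" to the message
git commit -q -s -m "example: add f.txt"
git log -1 --format=%B | grep "Signed-off-by:"
```

The trailer certifies the Developer's Certificate of Origin; checkpatch.pl stops reporting the error once every patch in the series carries it.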
[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments.

Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().

Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of:
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

This patch affects only files:
  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW,) calls - after the patch, the same call on the same machine has a latency of @ 20ns.

Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..8b9b9cf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static __always_inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
  */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..83f5c21 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -44,6 +44,9 @@ void update_vsyscall(struct timekeeper *tk)
 	vdata->mask		= tk->tkr_mono.mask;
 	vdata->mult		= tk->tkr_mono.mult;
 	vdata->shift		= tk->tkr_mono.shift;
+	vdata->raw_mask		= tk->tkr_raw.mask;
+	vdata->raw_mult		= tk->tkr_raw.mult;
+	vdata->raw_shift	= tk->tkr_raw.shift;
 
 	vdata->wall_time_sec		= tk->xtime_sec;
 	vdata->wall_time_snsec		= tk->tkr_mono.xtime_nsec;
@@ -74,5 +77,8 @@ void update_vsyscall(struct timekeeper *tk)
 		vdata->monotonic_time_coarse_sec++;
 	}
 
+	vdata->monotonic_time_raw_sec	= tk->raw_sec;
+	vdata->monotonic_time_raw_nsec	= tk->tkr_raw.xtime_nsec;
+
 	gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..941e9d6 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,7 +22,9 @@ struct vsyscall_gtod_data {
 	u64	mask;
 	u32	mult;
 	u32	shift;
-
+	u32	raw_mask;
+	u32	raw_mult;
+	u32	raw_shift;
 	/* open coded 'struct timespec' */
 	u64	wall_time_snsec;
 	gtod_long_t	wall_time_sec;
@@ -32,6 +34,8 @@ struct vsyscall_gtod_data {
 	gtod_long_t	wall_time_coarse_nsec;
 	gtod_long_t	monotonic_time_coarse_sec;
 	gtod_long_t	monotonic_time_coarse_nsec;
+	gtod_long_t	monotonic_time_raw_sec;
+	gtod_long_t	monotonic_time_raw_nsec;
 
 	int		tz_minuteswest;
 	int		tz_dsttime;
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Hi Thomas - RE: On 15/03/2018, Thomas Gleixner wrote: > Jason, > > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote: > >> Resent to address reviewer comments. > > I was being patient so far and tried to guide you through the patch > submission process, but unfortunately this turns out to be just waste of my > time. > > You have not addressed any of the comments I made here: > > [1] > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de > [2] > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de > I'm really sorry about that - I did not see those mails, and have searched for them in my inbox - are you sure they were sent to 'linux-kernel@vger.kernel.org' ? That is the only list I am subscribed to. I clicked on the links, but the 'To:' field is just 'linux-kernel'. If I had seen those messages before I re-submitted, those issues would have been fixed. checkpatch.pl did not report them - I ran it with all patches and it reported no errors. And I did send the test results in a previous mail - $ gcc -m64 -o timer timer.c (must be compiled in 64-bit mode). This is using the new rdtscp() function: $ ./timer -r 100 ...
Total time: 0.02806S - Average Latency: 0.00028S
N zero deltas: 0
N inconsistent deltas: 0
Average of 100 average latencies of 100 samples : 0.00027S

This is using the rdtsc_ordered() function: $ ./timer -m -r 100

Total time: 0.05269S - Average Latency: 0.00052S
N zero deltas: 0
N inconsistent deltas: 0
Average of 100 average latencies of 100 samples : 0.00047S

timer.c is a very short program that just reads N_SAMPLES (a compile-time option) timespecs using either CLOCK_MONOTONIC_RAW (no -m) or CLOCK_MONOTONIC (-m) as the first parameter to clock_gettime(), then computes the deltas as long long, then averages them, counting any zero deltas, or deltas where the previous timespec is somehow greater than the current timespec, which are reported as inconsistencies (note 'N inconsistent deltas: 0' and 'N zero deltas: 0' in the output).

So my initial claim that rdtscp() can be twice as fast as rdtsc_ordered() was not far-fetched - this is what I am seeing. I think this is because of the explicit barrier() call in rdtsc_ordered(). This must be slower than the internal processor pipeline "cancellation point" (barrier) used by the rdtscp instruction itself. This is the only reason for the rdtscp call - plus all modern Intel & AMD CPUs support it, and it DOES solve the ordering problem, whereby instructions in one pipeline of a task can get different rdtsc() results than instructions in another pipeline.

I will document the results better in the ChangeLog, fix all issues you identified, and resend. I did not mean to ignore your comments - those mails are nowhere in my Inbox - please confirm the actual email address they are getting sent to.

Thanks & Regards, Jason

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif

#define _STR(_S_) #_S_
#define STR(_S_)  _STR(_S_)

#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const* argv, char *const* envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt=N_SAMPLES, s=0, avg_n=0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum=0, zd=0, ic=0, d,
		t_start, avg_ns, *avgs=0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn=1, repeat=1;

	for(; argn < argc; argn+=1)
		if( argv[argn] != NULL )
			if( *(argv[argn]) == '-')
				switch( *(argv[argn]+1) )
				{
				case 'm': case 'M': clk = CLOCK_MONOTONIC; break;
				case 'd': case 'D': do_dump = true; break;
				case 'r': case 'R':
					if( (argn < argc) && (argv[argn+1] != NULL))
						repeat = atoi(argv[argn+=1]);
					break;
				case '?': case 'h': case 'u': case 'U': case 'H':
					fprintf(stderr, "Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t-r <repeat>\n]\t"
						"Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n");
					return 0;
				}

	if( repeat > 1 )
	{
		avgs = alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
		if( ((unsigned long) avgs) & 7 )
			avgs = ((unsigned long long*)(((unsigned char*)avgs)+(8-((unsigned long) avgs) & 7)));
	}

	do
	{
		cnt=N_SAMPLES; s=0;
		do
		{
			if( 0 != clock_gettime(clk, &sample[s++]) )
			{
				fprintf(stderr, "oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno));
				return 1;
			}
		} while( --cnt );
		clock_gettime(clk, &sample[s]);
		for(s=1; s < (N_SAMPLES+1); s+=1)
		{
			t1 = TS2NS(sample[s-1]);
			t2 = TS2NS(sample[s]);
			if ( (t1 >
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Hi Thomas -

RE: On 15/03/2018, Thomas Gleixner wrote:
> Jason,
>
> On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote:
>
>> Resent to address reviewer comments.
>
> I was being patient so far and tried to guide you through the patch
> submission process, but unfortunately this turns out to be just waste of my
> time.
>
> You have not addressed any of the comments I made here:
>
> [1] https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de
> [2] https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de

I'm really sorry about that - I did not see those mails, and have searched for them in my inbox. Are you sure they were sent to 'linux-kernel@vger.kernel.org'? That is the only list I am subscribed to. I clicked on the links, but the 'To:' field is just 'linux-kernel'. If I had seen those messages before I re-submitted, those issues would have been fixed. checkpatch.pl did not report them - I ran it with all patches and it reported no errors.

And I did send the test results in a previous mail:

$ gcc -m64 -o timer timer.c    # must be compiled in 64-bit mode

This is using the new rdtscp() function:

$ ./timer -r 100
...
Total time: 0.02806S - Average Latency: 0.00028S
N zero deltas: 0
N inconsistent deltas: 0
Average of 100 average latencies of 100 samples: 0.00027S

This is using the rdtsc_ordered() function:

$ ./timer -m -r 100
Total time: 0.05269S - Average Latency: 0.00052S
N zero deltas: 0
N inconsistent deltas: 0
Average of 100 average latencies of 100 samples: 0.00047S

timer.c is a very short program that just reads N_SAMPLES (a compile-time option) timespecs, using either CLOCK_MONOTONIC_RAW (without -m) or CLOCK_MONOTONIC (with -m) as the first parameter to clock_gettime(); it then computes the deltas as unsigned long long, and averages them, counting any zero deltas, or deltas where the previous timespec is somehow greater than the current timespec, which are reported as inconsistencies (note 'inconsistent deltas: 0' and 'zero deltas: 0' in the output above).

So my initial claim that rdtscp() can be twice as fast as rdtsc_ordered() was not far-fetched - this is what I am seeing. I think this is because of the explicit barrier() call in rdtsc_ordered(), which must be slower than the internal processor pipeline "cancellation point" (barrier) used by the rdtscp instruction itself. This is the only reason for the rdtscp call - plus all modern Intel & AMD CPUs support it, and it DOES solve the ordering problem, whereby instructions in one pipeline of a task can get different rdtsc() results than instructions in another pipeline.

I will document the results better in the ChangeLog, fix all issues you identified, and resend. I did not mean to ignore your comments - those mails are nowhere in my Inbox - please confirm the actual email address they are being sent to.

Thanks & Regards,
Jason

/*
 * Program to measure high-res timer latency.
 *
 * NB: the list archiver stripped the '<...>' include targets and several
 * '&' characters from this listing; they are restored below.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <errno.h>
#include <time.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec) * 1000000000ULL) + \
		     ((unsigned long long)(_TS_).tv_nsec))

int main(int argc, char *const *argv, char *const *envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt = N_SAMPLES, s = 0, avg_n = 0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum = 0, zd = 0, ic = 0, d,
		t_start, avg_ns, *avgs = 0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1, repeat = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm': case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case 'd': case 'D':
					do_dump = true;
					break;
				case 'r': case 'R':
					if ((argn < argc) && (argv[argn+1] != NULL))
						repeat = atoi(argv[argn += 1]);
					break;
				case '?': case 'h': case 'u': case 'U': case 'H':
					fprintf(stderr,
						"Usage: timer_latency [\n\t"
						"-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t"
						"-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t"
						"-r <repeat count>\n]\t"
						"Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n");
					return 0;
				}

	if (repeat > 1) {
		avgs = alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
		if (((unsigned long)avgs) & 7)	/* 8-byte align */
			avgs = (unsigned long long *)
				(((unsigned char *)avgs) + (8 - (((unsigned long)avgs) & 7)));
	}
	do {
		cnt = N_SAMPLES;
		s = 0;
		do {
			if (0 != clock_gettime(clk, &sample[s++])) {
				fprintf(stderr, "oops, clock_gettime() failed: %d: '%s'.\n",
					errno, strerror(errno));
				return 1;
			}
		} while (--cnt);
		clock_gettime(clk, &sample[s]);
		for (s = 1; s < (N_SAMPLES+1); s += 1) {
			t1 = TS2NS(sample[s-1]);
			t2 = TS2NS(sample[s]);
			if ( (t1 > t2)

[ remainder of timer.c truncated in the archive ]
[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Resent to address reviewer comments.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall and so has an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model: 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961: https://bugzilla.kernel.org/show_bug.cgi?id=198961 and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only these files:

  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vdso/vdso.lds.S
  arch/x86/entry/vdso/vdsox32.lds.S
  arch/x86/entry/vdso/vdso32/vdso32.lds.S
  arch/x86/entry/vsyscall/vsyscall_gtod.c

There are 3 patches in the series:

  Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with rdtsc_ordered().

  Patches #2 & #3 should be considered "optional":

  Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new rdtscp() function in msr.h.

  Patch #3 makes the VDSO export TSC calibration data via a new function in the vDSO:
    unsigned int __vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
  that user code can optionally call.

Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls somewhat faster than clock_gettime(CLOCK_MONOTONIC) calls. I think something like Patch #3 is necessary to export TSC calibration data to user-space TSC readers.

It is entirely up to the kernel developers whether they want to include patches #2 and #3, but I think something like Patch #1 really needs to get into a future Linux release, as an unnecessary latency of 200-1000ns for a timer that can tick 3 times per nanosecond is unacceptable.

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961.

Thanks & Best Regards,
Jason Vas Dias
[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index fbc7371..2c46675 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void)
 notrace static u64 vread_tsc_raw(void)
 {
-	u64 tsc
+	u64 tsc = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered())
 	  , last = gtod->raw_cycle_last;
-	tsc = rdtsc_ordered();
 	if (likely(tsc >= last))
 		return tsc;
 	asm volatile ("");
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index 5af7093..0327a95 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -16,6 +16,9 @@
 #include
 #include
 #include
+#include
+
+extern unsigned int tsc_khz;

 int vclocks_used __read_mostly;
@@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk)
 	vdata->raw_mask		= tk->tkr_raw.mask;
 	vdata->raw_mult		= tk->tkr_raw.mult;
 	vdata->raw_shift	= tk->tkr_raw.shift;
+	vdata->has_rdtscp	= static_cpu_has(X86_FEATURE_RDTSCP);

 	vdata->wall_time_sec	= tk->xtime_sec;
 	vdata->wall_time_snsec	= tk->tkr_mono.xtime_nsec;
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 30df295..a5ff704 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -218,6 +218,37 @@ static __always_inline unsigned long long rdtsc_ordered(void)
 	return rdtsc();
 }

+/**
+ * rdtscp() - read the current TSC and (optionally) CPU number, with built-in
+ *	      cancellation point replacing barrier - only available
+ *	      if static_cpu_has(X86_FEATURE_RDTSCP).
+ * returns: The 64-bit Time Stamp Counter (TSC) value.
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain
+ * the number (Intel CPU ID) of the CPU that the task is currently running on.
+ * As does EAX_EDX_RET, this uses the "open-coded asm" style to
+ * force the compiler + assembler to always use (eax, edx, ecx) registers,
+ * NOT whole (rax, rdx, rcx) on x86_64, because only 32-bit
+ * variables are used - exactly the same code should be generated
+ * for this instruction on 32-bit as on 64-bit when this asm stanza is used.
+ * See: SDM, Vol #2, RDTSCP instruction.
+ */
+static __always_inline u64 rdtscp(u32 *cpu_out)
+{
+	u32 tsc_lo, tsc_hi, tsc_cpu;
+
+	asm volatile
+	    ("rdtscp"
+		: "=a" (tsc_lo)
+		, "=d" (tsc_hi)
+		, "=c" (tsc_cpu)
+	    ); // : eax, edx, ecx used - NOT rax, rdx, rcx
+	if (unlikely(cpu_out != ((void *)0)))
+		*cpu_out = tsc_cpu;
+	return ((((u64)tsc_hi) << 32) |
+		(((u64)tsc_lo) & 0xffffffffULL));
+}
+
 /* Deprecated, keep it for a cycle for easier merging: */
 #define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0)
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 24e4d45..e7e4804 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -26,6 +26,7 @@ struct vsyscall_gtod_data {
 	u64	raw_mask;
 	u32	raw_mult;
 	u32	raw_shift;
+	u32	has_rdtscp;

 	/* open coded 'struct timespec' */
 	u64	wall_time_snsec;
[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..fbc7371 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void)
 	return last;
 }

+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc
+	  , last = gtod->raw_cycle_last;
+
+	tsc = rdtsc_ordered();
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode)
 	return v * gtod->mult;
 }

+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }

+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..5af7093 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk)
 	vdata->mult		= tk->tkr_mono.mult;
 	vdata->shift		= tk->tkr_mono.shift;

+	vdata->raw_cycle_last	= tk->tkr_raw.cycle_last;
+	vdata->raw_mask		= tk->tkr_raw.mask;
+	vdata->raw_mult		= tk->tkr_raw.mult;
+	vdata->raw_shift	= tk->tkr_raw.shift;
+
 	vdata->wall_time_sec	= tk->xtime_sec;
 	vdata->wall_time_snsec	= tk->tkr_mono.xtime_nsec;
@@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk)
 		vdata->monotonic_time_coarse_sec++;
 	}

+	vdata->monotonic_time_raw_sec  = tk->raw_sec;
+	vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;
+
 	gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..24e4d45 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,6 +22,10 @@ struct vsyscall_gtod_data {
 	u64	mask;
 	u32	mult;
 	u32	shift;
+	u64	raw_cycle_last;
+	u64	raw_mask;
+	u32	raw_mult;
+	u32	raw_shift;

 	/* open coded 'struct timespec' */
 	u64	wall_time_snsec;
@@ -32,6 +36,8 @@ struct vsyscall_gtod_data {
 	gtod_long_t	wall_time_coarse_nsec;
 	gtod_long_t	monotonic_time_coarse_sec;
 	gtod_long_t	monotonic_time_coarse_nsec;
+	gtod_long_t	monotonic_time_raw_sec;
+	gtod_long_t	monotonic_time_raw_nsec;

 	int		tz_minuteswest;
 	int		tz_dsttime;
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 03f3904..61d9633 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,12 +21,15 @@
 #include
 #include
 #include
+#include

 #define gtod (&VVAR(vsyscall_gtod_data))

 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
 extern time_t __vdso_time(time_t *t);
+extern unsigned int __vdso_linux_tsc_calibration(
+	struct linux_tsc_calibration_s *tsc_cal);

 #ifdef CONFIG_PARAVIRT_CLOCK
 extern u8 pvclock_page
@@ -383,3 +386,25 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t) __attribute__((weak, alias("__vdso_time")));
+
+notrace unsigned int
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal)
+{
+	unsigned long seq;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		if ((gtod->vclock_mode == VCLOCK_TSC) &&
+		    (tsc_cal != ((void *)0UL))) {
+			tsc_cal->tsc_khz = gtod->tsc_khz;
+			tsc_cal->mult	 = gtod->raw_mult;
+			tsc_cal->shift	 = gtod->raw_shift;
+			return 1;
+		}
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	return 0;
+}
+
+unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal)
+	__attribute__((weak, alias("__vdso_linux_tsc_calibration")));
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..e0b5cce 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -25,6 +25,8 @@ VERSION {
 		__vdso_getcpu;
 		time;
 		__vdso_time;
+		linux_tsc_calibration;
+		__vdso_linux_tsc_calibration;
 	local: *;
 	};
 }
diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
index 422764a..17fd07f 100644
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -26,6 +26,7 @@ VERSION
 		__vdso_clock_gettime;
 		__vdso_gettimeofday;
 		__vdso_time;
+		__vdso_linux_tsc_calibration;
 	};

 	LINUX_2.5 {
diff --git a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S
index 05cd1c5..7acac71 100644
--- a/arch/x86/entry/vdso/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdsox32.lds.S
@@ -21,6 +21,7 @@ VERSION {
 		__vdso_gettimeofday;
 		__vdso_getcpu;
 		__vdso_time;
+		__vdso_linux_tsc_calibration;
 	local: *;
 	};
 }
diff --git a/arch/x86/include/uapi/asm/vdso_tsc_calibration.h b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h
new file mode 100644
index 000..ce4b5a45
--- /dev/null
+++ b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_VDSO_TSC_CALIBRATION_H
+#define _ASM_X86_VDSO_TSC_CALIBRATION_H
+/*
+ * Programs that want to use rdtsc / rdtscp instructions
+ * from user-space can make use of the Linux kernel TSC calibration
+ * by calling :
+ *	__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *);
+ * (one has to resolve this symbol as in
+ *	tools/testing/selftests/vDSO/parse_vdso.c
+ * )
+ * which fills in a structure with the following layout :
+ */
+
+/** struct linux_tsc_calibration_s -
+ * @mult:	amount to multiply the 64-bit TSC value by
+ * @shift:	the right shift to apply to (mult*TSC), yielding nanoseconds
+ * @tsc_khz:	the calibrated TSC frequency in KHz, from which the previous
+ *		members are calculated
+ */
+struct linux_tsc_calibration_s {
+
+	unsigned int mult;
+	unsigned int shift;
+	unsigned int tsc_khz;
+
+};
+
+/* To use:
+ *
+ * static unsigned
+ * (*linux_tsc_cal)(struct linux_tsc_calibration_s *linux_tsc_cal) =
+ *	vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration");
+ * if (linux_tsc_cal == ((void *)0))
+ * {	fprintf(stderr, "the patch providing __vdso_linux_tsc_calibration"
+ *		" is not applied to the kernel.\n");
+ *	return ERROR;
+ * }
+ * static struct linux_tsc_calibration_s clock_source = {0};
+ * if ((clock_source.mult == 0) && !(*linux_tsc_cal)(&clock_source))
+ *	fprintf(stderr, "TSC is not the system clocksource.\n");
+ * unsigned int tsc_lo, tsc_hi, tsc_cpu;
+ * asm volatile
+ *	("rdtscp" : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu));
+ * unsigned long tsc = (((unsigned long)tsc_hi) << 32) | tsc_lo;
+ * unsigned long nanoseconds =
+ *	((clock_source.mult) * tsc) >> (clock_source.shift);
+ *
+ * nanoseconds is now the TSC value converted to nanoseconds,
+ * according to Linux' clocksource calibration values.
+ * Incidentally, 'tsc_cpu' is the number of the CPU the task is running on.
+ *
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 03f3904..61d9633 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -21,12 +21,15 @@ #include #include #include +#include #define gtod ((vsyscall_gtod_data)) extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts); extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz); extern time_t __vdso_time(time_t *t); +extern unsigned int __vdso_tsc_calibration( + struct linux_tsc_calibration_s *tsc_cal); #ifdef CONFIG_PARAVIRT_CLOCK extern u8 pvclock_page @@ -383,3 +386,25 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +notraceunsigned int +__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) +{ + unsigned long seq; + + do { + seq = gtod_read_begin(gtod); + if ((gtod->vclock_mode == VCLOCK_TSC) && + (tsc_cal != ((void *)0UL))) { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult= gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + } while (unlikely(gtod_read_retry(gtod, seq))); + + return 0; +} + +unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce..e0b5cce 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S index 422764a..17fd07f 100644 --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff --git 
a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S index 05cd1c5..7acac71 100644 --- a/arch/x86/entry/vdso/vdsox32.lds.S +++ b/arch/x86/entry/vdso/vdsox32.lds.S @@ -21,6 +21,7 @@ VERSION { __vdso_gettimeofday; __vdso_getcpu; __vdso_time; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/include/uapi/asm/vdso_tsc_calibration.h b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h new file mode 100644 index 000..ce4b5a45 --- /dev/null +++ b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _ASM_X86_VDSO_TSC_CALIBRATION_H +#define _ASM_X86_VDSO_TSC_CALIBRATION_H +/* + * Programs that want to use rdtsc / rdtscp instructions + * from user-space can make use of the Linux kernel TSC calibration + * by calling : + * __vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *); + * ( one has to resolve this symbol as in + * tools/testing/selftests/vDSO/parse_vdso.c + * ) + * which fills in a structure + * with the following layout : + */ + +/** struct linux_tsc_calibration_s - + * mult: amount to multiply 64-bit TSC value by + * shift: the right shift to apply to (mult*TSC) yielding nanoseconds + * tsc_khz: the calibrated TSC frequency in KHz from which the previous + * members are calculated + */ +struct linux_tsc_calibration_s { + unsigned int mult; + unsigned int shift; + unsigned int tsc_khz; +}; + +/* To use: + * + * static unsigned + * (*linux_tsc_cal)(struct linux_tsc_calibration_s *linux_tsc_cal) = + * vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"); + * if (linux_tsc_cal == ((void *)0)) + * { fprintf(stderr, "the patch providing __vdso_linux_tsc_calibration" " is not applied to the kernel.\n"); return ERROR; } + * static struct linux_tsc_calibration_s clock_source = {0}; + * if ((clock_source.mult == 0) && !(*linux_tsc_cal)(&clock_source)) + * fprintf(stderr, "TSC is not the system clocksource.\n"); + * unsigned int tsc_lo, tsc_hi, tsc_cpu; + * asm volatile ( "rdtscp" : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu) ); + * unsigned long tsc = (((unsigned long)tsc_hi) << 32) | tsc_lo; + * unsigned long nanoseconds = + * ((clock_source.mult) * tsc) >> (clock_source.shift); + * + * nanoseconds is now the TSC value converted to nanoseconds, + * according to Linux' clocksource calibration values. + * Incidentally, 'tsc_cpu' is the number of the CPU the task is running on. + * + *
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks for the helpful comments, Peter - re: On 14/03/2018, Peter Zijlstra wrote: > >> Yes, I am sampling perf counters, > > You're not in fact sampling, you're just reading the counters. Correct, using Linux-ese terminology - but "sampling" in looser English. >> Reading performance counters does involve 2 ioctls and a read() , > > So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and > just let them run and do: > > read(group_fd, &buf_pre, size); > /* your code section */ > read(group_fd, &buf_post, size); > > /* compute buf_post - buf_pre */ > > Which is only 2 system calls, not 4. But I can't, really - I am trying to restrict the performance counter measurements to only a subset of the code, and to exclude performance measurement result processing - so the timeline is like: struct timespec t_start, t_end; perf_event_open(...); thread_main_loop() { ... do { t : clock_gettime(CLOCK_MONOTONIC_RAW, &t_start); t+x : enable_perf(); total_work = do_some_work(); disable_perf(); clock_gettime(CLOCK_MONOTONIC_RAW, &t_end); t+y : read_perf_counters_and_store_results( perf_grp_fd, ..., total_work, TS2T( &t_end ) - TS2T( &t_start ) ); } while ( ... ); } Now, here the bandwidth / performance results recorded by my 'read_perf_counters_and_store_results' method are very sensitive to the measurement of the OUTER elapsed time . > > Also, a while back there was the proposal to extend the mmap() > self-monitoring interface to groups, see: > > https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowo...@hirez.programming.kicks-ass.net > > I never did get around to writing the actual code for it, but it > shouldn't be too hard. > Great, I'm looking forward to trying it - but meanwhile, to get NON-MULTIPLEXED measurements for the SAME CODE SEQUENCE over the SAME TIME I believe the group FD method is what is implemented and what works. >> The CPU_CLOCK software counter should give the converted TSC cycles >> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...) 
>> and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the >> difference between the event->time_running and time_enabled >> should also measure elapsed time . > > While CPU_CLOCK is TSC based, there is no guarantee it has any > correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based). > > (although, I think I might have fixed that recently and it might just > work, but it's very much not guaranteed). Yes, I believe the CPU_CLOCK is effectively the converted TSC - it does appear to correlate well with the new CLOCK_MONOTONIC_RAW values from the patched VDSO. > If you want to correlate to CLOCK_MONOTONIC_RAW you have to read > CLOCK_MONOTONIC_RAW and not some random other clock value. > Exactly ! Hence the need for the patch, so that users can get CLOCK_MONOTONIC_RAW values with low latency and correlate them with PERF CPU_CLOCK values. >> This gives the "inner" elapsed time, from the perspective of the kernel, >> while the measured code section had the counters enabled. >> >> But unless the user-space program also has a way of measuring elapsed >> time from the CPU's perspective , i.e. without being subject to >> operator or NTP / PTP adjustment, it has no way of correlating this >> inner elapsed time with any "outer" > > You could read the time using the group_fd's mmap() page. That actually > includes the TSC mult,shift,offset as used by perf clocks. > Yes, but as mentioned earlier, that presupposes I want to use the mmap() sample method - I don't - I want to use the Group FD method, so that I can be sure the measurements are for the same code sequence over the same period of time. >> Currently, users must parse the log file or use gdb / objdump to >> inspect /proc/kcore to get the TSC calibration and exact >> mult+shift values for the TSC value conversion. > > Which ;-) there's multiple floating around.. > Yes, but why must Linux make it so difficult ? 
I think it has to be recognized that the vDSO is the only place from which clock values can be delivered to user-space programs with sufficiently low latency to be useful. So why does it not export the TSC calibration, which is so complex to derive, when such calibration information is available nowhere else ? >> Intel does not publish, nor does the CPU provide in ROM or firmware, >> the actual precise TSC frequency - this must be calibrated against the >> other clocks , according to a complicated procedure in section 18.2 of >> the SDM . My TSC has a "rated" / nominal TSC frequency , which one >> can compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency" >> is 2.8333GHz . > > You might want to look at commit:
[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..fbc7371 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode) return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..5af7093 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk) vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_cycle_last = tk->tkr_raw.cycle_last; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; + vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..24e4d45 100644 --- 
a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -22,6 +22,10 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; + u64 raw_cycle_last; + u64 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; @@ -32,6 +36,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, ...) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during update_vsyscall() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . 
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c There are 3 patches in the series : Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with rdtsc_ordered() Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new rdtscp() function in msr.h Patch #3 makes the VDSO export TSC calibration data via a new function in the vDSO: unsigned int __vdso_linux_tsc_calibration ( struct linux_tsc_calibration_s *tsc_cal ) that user code can optionally call. Patches #2 & #3 should be considered "optional" . Patch #2 gives clock_gettime(CLOCK_MONOTONIC_RAW) calls about half the latency of clock_gettime(CLOCK_MONOTONIC) calls. I think something like Patch #3 is necessary to export TSC calibration data to user-space TSC readers. Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index fbc7371..2c46675 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc + u64 tsc = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 5af7093..0327a95 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,9 @@ #include #include #include +#include + +extern unsigned tsc_khz; int vclocks_used __read_mostly; @@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 30df295..a5ff704 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -218,6 +218,36 @@ static __always_inline unsigned long long rdtsc_ordered(void) return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. + * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. 
+ * As does EAX_EDX_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + asm volatile + ( "rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // : eax, edx, ecx used - NOT rax, rdx, rcx + if (unlikely(cpu_out != ((void*)0))) + *cpu_out = tsc_cpu; + return ( (((u64)tsc_hi) << 32) | + (((u64)tsc_lo) & 0xFFFFFFFFULL) + ); +} + /* Deprecated, keep it for a cycle for easier merging: */ #define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0) diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index 24e4d45..e7e4804 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -26,6 +26,7 @@ struct vsyscall_gtod_data { u64 raw_mask; u32 raw_mult; u32 raw_shift; + u32 has_rdtscp; /* open coded 'struct timespec' */ u64 wall_time_snsec;
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 2c46675..772988c 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -184,7 +185,7 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered()) + u64 tsc = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; if (likely(tsc >= last)) @@ -383,3 +384,21 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +unsigned int __vdso_linux_tsc_calibration( + struct linux_tsc_calibration_s *tsc_cal); + +notrace unsigned int +__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) +{ + if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult = gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) + __attribute__((weak, alias("__vdso_linux_tsc_calibration"))); diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce..e0b5cce 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S index 422764a..17fd07f 100644 --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff --git a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S 
index 05cd1c5..7acac71 100644 --- a/arch/x86/entry/vdso/vdsox32.lds.S +++ b/arch/x86/entry/vdso/vdsox32.lds.S @@ -21,6 +21,7 @@ VERSION { __vdso_gettimeofday; __vdso_getcpu; __vdso_time; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 0327a95..692562a 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -53,6 +53,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); + vdata->tsc_khz = tsc_khz; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index a5ff704..c7b2ed2 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -227,7 +227,7 @@ static __always_inline unsigned long long rdtsc_ordered(void) * the number (Intel CPU ID) of the CPU that the task is currently running on. * As does EAX_EDT_RET, this uses the "open-coded asm" style to * force the compiler + assembler to always use (eax, edx, ecx) registers, - * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit * variables are used - exactly the same code should be generated * for this instruction on 32-bit as on 64-bit when this asm stanza is used. * See: SDM , Vol #2, RDTSCP instruction. 
@@ -236,15 +236,15 @@ static __always_inline u64 rdtscp(u32 *cpu_out) { u32 tsc_lo, tsc_hi, tsc_cpu; asm volatile - ( "rdtscp" + ("rdtscp" : "=a" (tsc_lo) , "=d" (tsc_hi) , "=c" (tsc_cpu) ); // : eax, edx, ecx used - NOT rax, rdx, rcx - if (unlikely(cpu_out != ((void*)0))) + if (unlikely(cpu_out != ((void *)0))) *cpu_out = tsc_cpu; return ((((u64)tsc_hi) << 32) | - (((u64)tsc_lo) & 0x0ULL ) + (((u64)tsc_lo) & 0x0ULL) ); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index e7e4804..75078fc 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -27,6 +27,7 @@ struct vsyscall_gtod_data { u32 raw_mult; u32 raw_shift; u32 has_rdtscp; + u32 tsc_khz;
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On 12/03/2018, Peter Zijlstra <pet...@infradead.org> wrote: > On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote: >> Sometimes, particularly when correlating elapsed time to performance >> counter values, > > So what actual problem are you trying to solve here? Perf can already > give you sample time in various clocks, including MONOTONIC_RAW. > > Yes, I am sampling perf counters, including CPU_CYCLES, INSTRUCTIONS, CPU_CLOCK, TASK_CLOCK, etc., in a group FD I open with perf_event_open(), for the current thread on the current CPU - I am doing this for 4 threads, on Intel & ARM CPUs. Reading performance counters does involve 2 ioctls and a read(), which takes time that already far exceeds the time required to read the TSC or CNTPCT in the VDSO. The CPU_CLOCK software counter should give the converted TSC cycles seen between the ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, ...) and the ioctl(grp_fd, PERF_EVENT_IOC_DISABLE), and the difference between the event->time_running and time_enabled should also measure elapsed time. This gives the "inner" elapsed time, from the perspective of the kernel, while the measured code section had the counters enabled. But unless the user-space program also has a way of measuring elapsed time from the CPU's perspective, i.e. without being subject to operator or NTP / PTP adjustment, it has no way of correlating this inner elapsed time with any "outer" elapsed time measurement it may have made - I also measure the time taken by I/O operations between threads, for instance. So that is my primary motivation - for each thread's main run loop, I enable performance counters and count several PMU counters and the CPU_CLOCK & TASK_CLOCK. I want to determine with maximal accuracy how much elapsed time was used actually executing the task's instructions on the CPU, and how long they took to execute.
I want to try to exclude the time spent gathering, making, and analysing the performance measurements from the time spent running the threads' main loop. To do this accurately, it is best to exclude variations in time that occur because of operator or NTP / PTP adjustments. The CLOCK_MONOTONIC_RAW clock is the ONLY clock that is MEANT to be immune from any adjustment. It is meant to be a high-resolution clock with 1ns resolution that should be subject to no adjustment, and hence one would expect it to have the lowest latency. But the way Linux has up to now implemented it, CLOCK_MONOTONIC_RAW has a resolution (minimum time that can be measured) that varies from 300 - 1000ns. I can read the TSC and store a 16-byte timespec value in @ 8ns on the same CPU. I understand that Linux must conform to the POSIX interface, which means it cannot provide sub-nanosecond resolution timers, but it could allow user-space programs to easily discover the timer calibration so that user-space programs can read the timers themselves. Currently, users must parse the log file or use gdb / objdump to inspect /proc/kcore to get the TSC calibration and exact mult+shift values for the TSC value conversion. Intel does not publish, nor does the CPU come with, in ROM or firmware, the actual precise TSC frequency - this must be calibrated against the other clocks, according to a complicated procedure in section 18.2 of the SDM. My TSC has a "rated" / nominal TSC frequency, which one can compute from CPUID leaves, of 2.3 GHz, but the "Refined TSC frequency" is 2.8333 GHz. Hence I think Linux should export this calibrated frequency somehow; its "calibration" is expressed as the raw clocksource 'mult' and 'shift' values, and is exported to the VDSO. I think the VDSO should read the TSC and use the calibration to render the raw, unadjusted time from the CPU's perspective. Hence, the patch I am preparing, which is again attached.
I will submit it properly via email once I figure out how to obtain the 'git-send-email' tool, and how to use it to send multiple patches, which seems to be the only way to submit acceptable patches. Also the attached timer program measures a latency of @ 20ns with my patched 4.15.9 kernel, when it measured a latency of 300-1000ns without it. Thanks & Regards, Jason vdso_clock_monotonic_raw_1.patch Description: Binary data /* * Program to measure high-res timer latency. * */ #include #include #include #include #include #include #include #include #ifndef N_SAMPLES #define N_SAMPLES 100 #endif #define _STR(_S_) #_S_ #define STR(_S_) _STR(_S_) int main(int argc, char *const* argv, char *const* envp) { clockid_t clk = CLOCK_MONOTONIC_RAW; bool do_dump = false; int argn=1; for(; argn < argc; argn+=1) if( argv[argn] != NULL ) if( *(argv[argn]) == '-') switch( *(argv[argn]+1) ) { case 'm': case 'M': clk = CLOCK_MONOTONIC; break; case 'd': case 'D': do_dump = true; break; case '?': case
Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
The split patches with no checkpatch.pl failures are attached and were just sent in separate emails to the mailing list . Sorry it took a few tries to get right . This will be my last send today - I'm off to use it at work. Thanks & all the best, Jason vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch Description: Binary data vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#2.patch Description: Binary data
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9 GHz Haswell x86_64 Family_Model: 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only files: arch/x86/include/asm/msr.h arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is the second patch in the series, which adds use of rdtscp. Best Regards, Jason Vas Dias .
--- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 08:12:17.110120433 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:59:21.135475862 + @@ -187,7 +187,7 @@ notrace static u64 vread_tsc_raw(void) u64 tsc , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); + tsc = gtod->has_rdtscp ? rdtscp((void*)0UL) : rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff -up linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 2018-03-12 07:58:07.974214168 + +++ linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c 2018-03-12 08:54:07.490267640 + @@ -16,6 +16,7 @@ #include #include #include +#include int vclocks_used __read_mostly; @@ -49,6 +50,7 @@ void update_vsyscall(struct timekeeper * vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff -up linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 linux-4.16-rc5/arch/x86/include/asm/msr.h --- linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/include/asm/msr.h 2018-03-12 09:06:03.902728749 + @@ -218,6 +218,36 @@ static __always_inline unsigned long lon return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. 
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. + * As does EAX_EDT_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + asm volatile + ( "rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); + if ( unlikely(cpu_out != ((void*)0)) ) + *cpu_out = tsc_cpu; + ret
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9 GHz Haswell x86_64 Family_Model: 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only these files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c There are 2 patches in the series - this first one handles CLOCK_MONOTONIC_RAW in the VDSO using the existing rdtsc_ordered(), and the second uses the new rdtscp() function which avoids use of an explicit barrier. Best Regards, Jason Vas Dias .
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only these files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c There are 2 patches in the series - this first one handles CLOCK_MONOTONIC_RAW in VDSO using existing rdtsc_ordered() , and the second uses new rstscp() function which avoids use of an explicit barrier. Best Regards, Jason Vas Dias . 
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16
Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Good day - On 12/03/2018, Ingo Molnar <mi...@kernel.org> wrote: > > * Thomas Gleixner <t...@linutronix.de> wrote: > >> On Mon, 12 Mar 2018, Jason Vas Dias wrote: >> >> checkpatch.pl still reports: >> >>total: 15 errors, 3 warnings, 165 lines checked >> Sorry I didn't see you had responded until 40 mins ago . I finally found where checkpatch.pl is and it now reports : WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #2: --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + WARNING: struct should normally be const #55: FILE: arch/x86/entry/vdso/vclock_gettime.c:282: +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) I don't know how to fix that, since 'ts' cannot be a const pointer. ERROR: Missing Signed-off-by: line(s) I guess that disappears once someone OKs the patch. total: 1 errors, 2 warnings, 127 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ../vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. >> > +notrace static u64 vread_tsc_raw(void) >> > +{ >> > + u64 tsc, last=gtod->raw_cycle_last; >> > + if( likely( gtod->has_rdtscp ) ) >> > + tsc = rdtscp((void*)0); >> >> Plus I asked more than once to split that rdtscp() stuff into a separate >> patch. I misunderstood - I thought you meant the rdtscp implementation which was split into a separate file - but now it is in a separate patch , (attached). >> >> You surely are free to ignore my review comments, but rest assured that >> I'm >> free to ignore the crap you insist to send me as well. > I didn't mean to ignore any comments, and I'm really trying to fix this problem the right way and not produce crap. 
> In addition to Thomas's review feedback I'd strongly urge the careful > reading of > Documentation/SubmittingPatches as well: > > - When sending multiple patches please use git-send-mail > > - Please don't send several patch iterations per day! > > - Code quality of the submitted patches is atrocious, please run them > through >scripts/checkpatch.pl (and make sure they pass) to at least enable the > reading >of them. > > - ... plus dozens of other details described in > Documentation/SubmittingPatches. > > Thanks, > > Ingo > I am reading all those documents and cannot see how the code in the attached patch contravenes any guidelines / best practices - if you can, please clarify phrases like "atrocious style" - I cannot see any style guidelines contravened, and I can prove that the numeric output produced in 16-30ns is just as good as that produced before the patch was applied in 300-700ns . Aside from any style comments, any content comments ? Sorry I am new to latest kernel guidelines. I needed to get this problem solved the right way for use at work today. Thanks for your advice, Best Regards Jason vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch Description: Binary data
[PATCH v4.16-rc4 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high. I reported this as Bug #198961: https://bugzilla.kernel.org/show_bug.cgi?id=198961 and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, about the same as do_monotonic(), and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing review issues - the next patch will add the rdtscp() function. The patch passes the checkpatch.pl script. Best Regards, Jason Vas Dias.
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ li
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high. I reported this as Bug #198961: https://bugzilla.kernel.org/show_bug.cgi?id=198961 and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S and adds one new file: arch/x86/include/uapi/asm/vdso_tsc_calibration.h This is the second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure: struct linux_tsc_calibration { u32 tsc_khz, mult, shift; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks. This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers. Resent: Oops, in the previous version of this patch (#2), the comments in the new vdso_tsc_calibration were wrong, for an earlier version - sorry about that. Best Regards, Jason Vas Dias. PATCH 2/2: --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:38:53.019891195 + @@ -21,6 +21,7 @@ #include #include #include +#include <asm/vdso_tsc_calibration.h> #define gtod (&vsyscall_gtod_data) @@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult = gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, 
alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S 2018-03-12 05:19:10.765022295 + @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.l
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high. I reported this as Bug #198961: https://bugzilla.kernel.org/show_bug.cgi?id=198961 and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c This is the second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure: struct linux_tsc_calibration { u32 tsc_khz, mult, shift; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks. This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers. Oops, the previous version of this second patch mistakenly copied the changed part of vclock_gettime.c. Best Regards, Jason Vas Dias. diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:38:53.019891195 + @@ -21,6 +21,7 @@ #include #include #include +#include <asm/vdso_tsc_calibration.h> #define gtod (&vsyscall_gtod_data) @@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult = gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S 2018-03-12 05:19:10.765022295 + @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . 
This patch affects only these files:

 arch/x86/include/asm/vgtod.h
 arch/x86/entry/vdso/vclock_gettime.c
 arch/x86/entry/vdso/vdso.lds.S
 arch/x86/entry/vdso/vdsox32.lds.S
 arch/x86/entry/vdso/vdso32/vdso32.lds.S
 arch/x86/entry/vsyscall/vsyscall_gtod.c

This is the second patch in the series. It adds a record of the calibrated TSC frequency to the vDSO, and a new header, uapi/asm/vdso_tsc_calibration.h, which defines the structure:

 struct linux_tsc_calibration { u32 tsc_khz, mult, shift; };

and a getter function in the vDSO that user-space code can optionally use to implement sub-nanosecond precision clocks. This second patch is entirely optional, but I think it greatly expands the scope of user-space TSC readers.

Oops - the previous version of this second patch mistakenly duplicated the changed part of vclock_gettime.c.

Best Regards,
Jason Vas Dias

diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1	2018-03-12 04:29:27.296982872 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c	2018-03-12 05:38:53.019891195 +
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 #define gtod (&VVAR(vsyscall_gtod_data))
@@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t) __attribute__((weak, alias("__vdso_time")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+	if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) {
+		tsc_cal->tsc_khz = gtod->tsc_khz;
+		tsc_cal->mult    = gtod->raw_mult;
+		tsc_cal->shift   = gtod->raw_shift;
+		return 1;
+	}
+	return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+	__attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1	2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S	2018-03-12 05:18:36.380673342 +
@@ -25,6 +25,8 @@ VERSION {
 		__vdso_getcpu;
 		time;
 		__vdso_time;
+		linux_tsc_calibration;
+		__vdso_linux_tsc_calibration;
 	local: *;
 	};
 }
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1	2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S	2018-03-12 05:19:10.765022295 +
@@ -26,6 +26,7 @@ VERSION
 		__vdso_clock_gettime;
 		__vdso_gettimeofday;
 		__vdso_time;
+		__vdso_linux_tsc_calibration;
 	};
 
 	LINUX_2.5 {
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1	2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall and has an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().

Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only these files:

 arch/x86/include/asm/vgtod.h
 arch/x86/entry/vdso/vclock_gettime.c
 arch/x86/entry/vdso/vdso.lds.S
 arch/x86/entry/vdso/vdsox32.lds.S
 arch/x86/entry/vdso/vdso32/vdso32.lds.S
 arch/x86/entry/vsyscall/vsyscall_gtod.c

This is the second patch in the series. It adds a record of the calibrated TSC frequency to the vDSO, and a new header, uapi/asm/vdso_tsc_calibration.h, which defines the structure:

 struct linux_tsc_calibration { u32 tsc_khz, mult, shift; };

and a getter function in the vDSO that user-space code can optionally use to implement sub-nanosecond precision clocks. This second patch is entirely optional, but I think it greatly expands the scope of user-space TSC readers.

Best Regards,
Jason Vas Dias

---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1	2018-03-12 04:29:27.296982872 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c	2018-03-12 05:10:53.185158334 +
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 #define gtod (&VVAR(vsyscall_gtod_data))
@@ -385,3 +386,41 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t) __attribute__((weak, alias("__vdso_time")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+	if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) {
+		tsc_cal->tsc_khz = gtod->tsc_khz;
+		tsc_cal->mult    = gtod->raw_mult;
+		tsc_cal->shift   = gtod->raw_shift;
+		return 1;
+	}
+	return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+	__attribute((weak, alias("__vdso_linux_tsc_calibration")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+	if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) {
+		tsc_cal->tsc_khz = gtod->tsc_khz;
+		tsc_cal->mult    = gtod->raw_mult;
+		tsc_cal->shift   = gtod->raw_shift;
+		return 1;
+	}
+	return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+	__attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1	2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S	2018-03-12 05:18:36.380673342 +
@@ -25,6 +25,8 @@ VERSION {
 		__vdso_getcpu;
 		time;
 		__vdso_time;
+		linux_tsc_calibration;
+		__vdso_linux_tsc_calibration;
 	local: *;
 	};
 }
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1	2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall and has an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().

Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

The patch affects only these files:

 arch/x86/include/asm/vgtod.h
 arch/x86/include/asm/msr.h
 arch/x86/entry/vdso/vclock_gettime.c
 arch/x86/entry/vsyscall/vsyscall_gtod.c

This is a resend of the original patch fixing issues identified by tglx in the mail thread of $subject - mainly that the rdtscp() assembler wrapper function should be in msr.h - it now is.
There is a second patch following in a few minutes which adds a record of the calibrated TSC frequency to the vDSO, and a new header, uapi/asm/vdso_tsc_calibration.h, which defines the structure:

 struct linux_tsc_calibration { u32 tsc_khz, mult, shift; };

and a getter function in the vDSO that user-space code can optionally use to implement sub-nanosecond precision clocks. This second patch is entirely optional, but I think it greatly expands the scope of user-space TSC readers.

Best Regards,
Jason Vas Dias

---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5	2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c	2018-03-12 04:29:27.296982872 +
@@ -182,6 +182,19 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last = gtod->raw_cycle_last;
+	if (likely(gtod->has_rdtscp))
+		tsc = rdtscp((void *)0);
+	else
+		tsc = rdtsc_ordered();
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +216,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
  */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +280,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +332,10 @@ n
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks Thomas -

On 11/03/2018, Thomas Gleixner <t...@linutronix.de> wrote:
> On Sun, 11 Mar 2018, Jason Vas Dias wrote:
>
> This looks better now. Though running that patch through checkpatch.pl
> results in:
>
> total: 28 errors, 20 warnings, 139 lines checked

Hmm, I was unaware of that script - I'll run it and find out why - probably because whitespace is not visible in emacs with my monospace font, and it is very difficult to see whether tabs are used if somehow a '\t ' or ' \t' has slipped in. I'll run the script, fix the errors, and repost.

>> +notrace static u64 vread_tsc_raw(void)
>
> Why do you need a separate function? I asked you to use vread_tsc(). So you
> might have reasons for doing that, but please then explain WHY and not just
> throw the stuff in my direction w/o any comment.

Mainly because vread_tsc() makes its comparison against gtod->cycles_last, a copy of tk->tkr_mono.cycle_last, while vread_tsc_raw() uses gtod->raw_cycle_last, a copy of tk->tkr_raw.cycle_last. And rdtscp has a built-in "barrier", as the comments explain, making rdtsc_ordered()'s 'barrier()' unnecessary.

>> +{
>> +	u64 tsc, last = gtod->raw_cycle_last;
>> +	if (likely(gtod->has_rdtscp)) {
>> +		u32 tsc_lo, tsc_hi,
>> +		    tsc_cpu __attribute__((unused));
>> +		asm volatile
>> +		("rdtscp"
>> +		/* ^- has built-in cancellation point / pipeline stall "barrier" */
>> +		: "=a" (tsc_lo)
>> +		, "=d" (tsc_hi)
>> +		, "=c" (tsc_cpu)
>> +		); /* since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx */
>> +		tsc = ((((u64)tsc_hi) & 0xFFFFFFFFUL) << 32) | (((u64)tsc_lo) & 0xFFFFFFFFUL);
>
> This is not required to make the vdso accessor for monotonic raw work.
>
> If at all then the rdtscp support wants to be in a separate patch with a
> proper explanation.
>
> Aside of that the code for rdtscp wants to be in a proper inline helper in
> the relevant header file and written according to the coding style the
> kernel uses for asm inlines.
Sorry - I will put the function in the same header as rdtsc_ordered(), in a separate patch.

> The rest looks ok.
>
> Thanks,
>
> tglx

I'll re-generate the patches and resend.

A complete patch against 4.15.9 is attached, which I am using, including a suggested '__vdso_linux_tsc_calibration()' function and an arch/x86/include/uapi/asm/vdso_tsc_calibration.h file that does not return any pointers into the VDSO.

Presuming this was split into separate patches as you suggest, and was against the latest HEAD branch (4.16-rcX), would it be OK to include the __vdso_linux_tsc_calibration() work? It does enable user-space code to develop accurate TSC readers which are free to use different structures and picosecond resolution.

The actual user-space clock_gettime(CLOCK_MONOTONIC_RAW) replacement I am using for work just reads the TSC, with a latency of < 8ns, and uses the linux_tsc_calibration to convert, using floating point as required.

Thanks & Regards,
Jason

vdso_gettime_monotonic_raw-4.15.9.patch
Description: Binary data
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall and has an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO, by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().

Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc4 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

The patch affects only these files:

 arch/x86/include/asm/vgtod.h
 arch/x86/entry/vdso/vclock_gettime.c
 arch/x86/entry/vsyscall/vsyscall_gtod.c

This is a resend of the original patch fixing indentation issues after installing the emacs Lisp cc-mode hooks from Documentation/coding-style.rst and calling 'indent-region' and 'tabify' (whitespace-only changes) - SORRY! (And even after that, somehow 2 '\t\n's got left in vgtod.h - now removed - sorry again!)

Best Regards,
Jason Vas Dias
PATCH:
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 19:00:04.630019100 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last = gtod->raw_cycle_last;
+	if (likely(gtod->has_rdtscp)) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		("rdtscp"
+		/* ^- has built-in cancellation point / pipeline stall "barrier" */
+		: "=a" (tsc_lo)
+		, "=d" (tsc_hi)
+		, "=c" (tsc_cpu)
+		); /* since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx */
+		tsc = ((((u64)tsc_hi) & 0xFFFFFFFFUL) << 32) | (((u64)tsc_lo) & 0xFFFFFFFFUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
  */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing indentation issues after installation of emacs Lisp cc-mode hooks in Documentation/coding-style.rst and calling 'indent-region' and 'tabify' (whitespace only changes) - SORRY ! (and even after that, somehow 2 '\t\n's got left in vgtod.h - now removed - sorry again!) . Best Regards, Jason Vas Dias . 
PATCH:
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.000000000 +0000
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 19:00:04.630019100 +0000
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last = gtod->raw_cycle_last;
+
+	if (likely(gtod->has_rdtscp)) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		    ("rdtscp"
+		     /* ^- has built-in cancellation point / pipeline stall "barrier" */
+		     : "=a" (tsc_lo)
+		     , "=d" (tsc_hi)
+		     , "=c" (tsc_cpu)
+		    ); // since all variables are 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *mode)
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
 */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
Hi Thomas -

Thanks very much for your help & guidance in previous mail:

RE: On 08/03/2018, Thomas Gleixner <t...@linutronix.de> wrote:
>
> The right way to do that is to put the raw conversion values and the raw
> seconds base value into the vdso data and implement the counterpart of
> getrawmonotonic64(). And if that is done, then it can be done for _ALL_
> clocksources which support VDSO access and not just for the TSC.
>

I have done this now with a new patch, sent in mail with subject:
'[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW'
which should address all the concerns you raised.

> I already know how that works, really.

I never doubted or meant to impugn that! I am beginning to know a little about how it works also, thanks in great part to your help last week - thanks for your patience. I was impatient last week to get access to low-latency timers for a work project, and was trying to read the unadjusted clock.

> instead of making completely false claims about the correctness of the kernel
> timekeeping infrastructure.

I really didn't mean to make any such claims - I'm sorry if I did. I was just trying to say that by the time the results of clock_gettime(CLOCK_MONOTONIC_RAW, ...) were available to the caller, they were not of much use, because the latencies often dwarfed the time differences being measured.

Anyway, I hope sometime you will consider putting such a patch in the kernel.

I have developed a version for ARM also, but that depends on making the CNTPCT + CNTFRQ registers readable in user space, which is not meant to be secure and is not normally done, but does work. It is against the Texas Instruments (ti-linux) kernel, can be enabled with a new Kconfig option, and brings latencies down from > 300ns to < 20ns. Maybe I should post that also to kernel.org, or to ti.com?
I have a separate patch for the vdso_tsc_calibration export of the tsc_khz and calibration, which no longer returns pointers into the VDSO - I can post this as a patch if you like.

Thanks & Best Regards,
Jason Vas Dias <jason.vas.d...@gmail.com>

diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.000000000 +0000
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 05:08:31.137681337 +0000
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last = gtod->raw_cycle_last;
+
+	if (likely(gtod->has_rdtscp)) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		    ("rdtscp"
+		     /* ^- has built-in cancellation point / pipeline stall "barrier" */
+		     : "=a" (tsc_lo)
+		     , "=d" (tsc_hi)
+		     , "=c" (tsc_cpu)
+		    ); // since all variables are 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *mode)
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
 */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +342,10 @@ notrace int __vdso_clock_gettime(clockid
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
diff -up linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c
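The "raw conversion values" in Thomas's quoted advice are the clocksource's (mult, shift) pair: a cycle delta is converted to nanoseconds as ns = (delta * mult) >> shift, which is what vgetsns_raw() and do_monotonic_raw() compute from raw_mult / raw_shift. A minimal user-space sketch of that arithmetic follows; the 2.8GHz frequency and shift of 22 are illustrative values, not the kernel's calibration, and calc_mult() is a simplified stand-in for the kernel's clocks_calc_mult_shift():

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Pick mult for a fixed shift so that ns = (cycles * mult) >> shift.
 * (The kernel's clocks_calc_mult_shift() also chooses the shift itself,
 * balancing precision against multiply overflow; this sketch does not.) */
static uint64_t calc_mult(uint64_t freq_hz, unsigned int shift)
{
	return (NSEC_PER_SEC << shift) / freq_hz;
}

/* The per-reading conversion: scale a cycle delta to nanoseconds. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t mult, unsigned int shift)
{
	return (cycles * mult) >> shift;
}
```

For example, with an (illustrative) 2.8GHz TSC and shift = 22, one second's worth of cycles converts back to just under 10^9 ns; the small shortfall is the rounding error inherent in the fixed-point mult.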
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, ...) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall with an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec values in the vsyscall_gtod_data during vsyscall_update().

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc4 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

The patch affects only the files:
 arch/x86/include/asm/vgtod.h
 arch/x86/entry/vdso/vclock_gettime.c
 arch/x86/entry/vsyscall/vsyscall_gtod.c

Best Regards,
Jason Vas Dias
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.000000000 +0000
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 05:08:31.137681337 +0000
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last = gtod->raw_cycle_last;
+
+	if (likely(gtod->has_rdtscp)) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		    ("rdtscp"
+		     /* ^- has built-in cancellation point / pipeline stall "barrier" */
+		     : "=a" (tsc_lo)
+		     , "=d" (tsc_hi)
+		     , "=c" (tsc_cpu)
+		    ); // since all variables are 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *mode)
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
 */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +342,10 @@ notrace int __vdso_clock_gettime(clockid
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
diff -up linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4	2018-03-04 22:54
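The gtod_read_begin() / gtod_read_retry() pair used by do_monotonic_raw() implements the reader side of a seqcount: the writer makes the sequence counter odd while updating and even when done, and a reader retries whenever the counter was odd or changed during its read. A minimal single-writer user-space sketch of the protocol (the names are illustrative, and the memory barriers the real kernel code needs are omitted for brevity):

```c
/* A minimal seqcount: writers make the counter odd while updating,
 * readers retry if the counter was odd or changed during their read. */
struct seq_data {
	unsigned int seq;
	unsigned long sec;
	unsigned long nsec;
};

static unsigned int read_begin(const struct seq_data *d)
{
	unsigned int s = d->seq;
	return s & ~1u;	/* an odd count means a write is in progress */
}

static int read_retry(const struct seq_data *d, unsigned int start)
{
	return d->seq != start;	/* changed (or was odd) => retry */
}

static void write_update(struct seq_data *d, unsigned long sec, unsigned long nsec)
{
	d->seq++;	/* odd: write in progress */
	d->sec = sec;
	d->nsec = nsec;
	d->seq++;	/* even again: write complete */
}

static void read_time(const struct seq_data *d, unsigned long *sec, unsigned long *nsec)
{
	unsigned int start;

	do {
		start = read_begin(d);
		*sec = d->sec;
		*nsec = d->nsec;
	} while (read_retry(d, start));
}
```

This is why do_monotonic_raw() snapshots raw_sec / raw_nsec inside the loop and only uses them once gtod_read_retry() confirms the writer did not touch the data mid-read.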
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Oops - please disregard the 1st mail on $subject - I guess using Quoted-Printable is not a way of getting past the email line-length limit. The patch I tried to send is attached as an attachment - I will resend it inline using another method. Sorry,

Regards,
Jason

vdso_monotonic_raw-v4.16-rc4.patch
Description: Binary data
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Oops, please disregard 1st mail on $subject - I guess use of Quoted Printable is not a way of getting past the email line length. Patch I tried to send is attached as attachment - will resend inline using other method. Sorry, Regards, Jason vdso_monotonic_raw-v4.16-rc4.patch Description: Binary data
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c Best Regards, Jason Vas Dias . 
--- diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 2018-03-04 22:54:11.0 + +++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 05:08:31.137681337 + @@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ +u64 tsc, last=gtod->raw_cycle_last; +if( likely( gtod->has_rdtscp ) ) { +u32 tsc_lo, tsc_hi, +tsc_cpu __attribute__((unused)); +asm volatile +( "rdtscp" +/* ^- has built-in cancellation point / pipeline stall"barrier" */ +: "=a" (tsc_lo) +, "=d" (tsc_hi) +, "=c" (tsc_cpu) +); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx +tsc = u64)tsc_hi) & 0xUL) << 32) | (((u64)tsc_lo) & 0xUL); +} else { +tsc = rdtsc_ordered(); +} + if (likely(tsc >= last)) + return tsc; +asm volatile (""); +return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +290,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static int __always_inline do_monotonic_raw( struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, ); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +342,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 2018-03-04 22:54:11.0 + +++ linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c 2018-03
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall and has an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C machines under various versions of Linux. This patch handles clock_gettime(CLOCK_MONOTONIC_RAW, &ts) in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec values in the vsyscall_gtod_data during vsyscall_update(). The new do_monotonic_raw() function in the vDSO now has a latency of ~24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only the files: arch/x86/include/asm/vgtod.h, arch/x86/entry/vdso/vclock_gettime.c, and arch/x86/entry/vsyscall/vsyscall_gtod.c . Best Regards, Jason Vas Dias .
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 05:08:31.137681337 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last = gtod->raw_cycle_last;
+
+	if (likely(gtod->has_rdtscp)) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		("rdtscp"
+		 /* ^- has built-in cancellation point / pipeline stall "barrier" */
+		 : "=a" (tsc_lo)
+		 , "=d" (tsc_hi)
+		 , "=c" (tsc_cpu)
+		); /* since all variables are 32-bit, eax, edx, ecx are used - NOT rax, rdx, rcx */
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +342,10 @@ notrace int __vdso_clock_gettime(clockid
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
diff -up linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c	2018-03
Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
On 08/03/2018, Thomas Gleixner <t...@linutronix.de> wrote: > On Tue, 6 Mar 2018, Jason Vas Dias wrote: >> I will prepare a new patch that meets submission + coding style guidelines >> and >> does not expose pointers within the vsyscall_gtod_data region to >> user-space code - >> but I don't really understand why not, since only the gtod->mult value >> will >> change as long as the clocksource remains TSC, and updates to it by the >> kernel >> are atomic and partial values cannot be read . >> >> The code in the patch reverts to old behavior for clocks which are not >> the >> TSC and provides a way for users to determine if the clock is still the >> TSC >> by calling '__vdso_linux_tsc_calibration()', which would return NULL if >> the clock is not the TSC . >> >> I have never seen Linux on a modern intel box spontaneously decide to >> switch from the TSC clocksource after calibration succeeds and >> it has decided to use the TSC as the system / platform clock source - >> what would make it do this ? >> >> But for the highly controlled systems I am doing performance testing on, >> I can guarantee that the clocksource does not change. > > We are not writing code for a particular highly controlled system. We > expose functionality which operates under all circumstances. There are > various reasons why TSC can be disabled at runtime, crappy BIOS/SMI, > sockets getting out of sync . > >> There is no way user code can write those pointers or do anything other >> than read them, so I see no harm in exposing them to user-space ; then >> user-space programs can issue rdtscp and use the same calibration values >> as the kernel, and use some cached 'previous timespec value' to avoid >> doing the long division every time. >> >> If the shift & mult are not accurate TSC calibration values, then the >> kernel should put other more accurate calibration values in the gtod . > > The raw calibration values are as accurate as the kernel can make them. 
But > they can be rather far off from converting to real nanoseconds for various > reasons. The NTP/PTP adjusted conversion is matching real units and is > obviously more accurate. > >> > Please look at the kernel side implementation of >> > clock_gettime(CLOCK_MONOTONIC_RAW). >> > The VDSO side can be implemented in the >> > same way. >> > All what is required is to expose the relevant information in the >> > existing vsyscall_gtod_data data structure. >> >> I agree - that is my point entirely, & what I was trying to do. > > Well, you did not expose the raw conversion data in vsyscall_gtod_data. You > are using: > > + tsc *= gtod->mult; > + tsc >>= gtod->shift; > > That is the adjusted mult/shift value which can change when NTP/PTP is > enabled and you _cannot_ use it unprotected. > >> void getrawmonotonic64(struct timespec64 *ts) >> { >> struct timekeeper *tk = &tk_core.timekeeper; >> unsigned long seq; >> u64 nsecs; >> >> do { >> seq = read_seqcount_begin(&tk_core.seq); >> # ^-- I think this is the source of the locking >> # and the very long latencies ! > > This protects tk->raw_sec from changing which would result in random time > stamps. Yes, it can cause slightly larger latencies when the timekeeper is > updated on another CPU concurrently, but that's not the main reason why > this is slower in general than the VDSO functions. The syscall overhead is > there for every invocation and it's substantial. > >> So in fact, when the clock source is TSC, the value recorded in 'ts' >> by clock_gettime(CLOCK_MONOTONIC_RAW, &ts) is very similar to >> u64 tsc = rdtscp(); >> tsc *= gtod->mult; >> tsc >>= gtod->shift; >> ts.tv_sec = tsc / NSEC_PER_SEC; >> ts.tv_nsec = tsc % NSEC_PER_SEC; >> >> which is the algorithm I was using in the VDSO fast TSC reader, >> do_monotonic_raw() . > > Except that you are using the adjusted conversion values and not the raw > ones. 
So your VDSO implementation of monotonic raw access is just wrong and > not matching the syscall based implementation in any way. > >> The problem with doing anything more in the VDSO is that there >> is of course nowhere in the VDSO to store any data, as it has >> no data section or writable pages . So some kind of writable >> page would need to be added to the vdso , complicating its >> vdso/vma.c, etc., which is not desirable. > > No, you don
[PATCH v4.15.7 1/1] x86/vdso: handle clock_gettime(CLOCK_MONOTONIC_RAW, ) in VDSO
Handling clock_gettime(CLOCK_MONOTONIC_RAW, &ts) by calling vdso_fallback_gettime(), i.e. a syscall, is too slow - latencies of 300-700ns are common on Haswell (06:3C) CPUs. This patch against the 4.15.7 stable branch makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) by issuing rdtscp in userspace, IFF the clock source is the TSC, and converting the result to nanoseconds using the vsyscall_gtod_data 'mult' and 'shift' fields: volatile u32 tsc_lo, tsc_hi, tsc_cpu; asm volatile( "rdtscp" : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu) ); u64 tsc = (((u64)tsc_hi)<<32) | ((u64)tsc_lo); tsc *= gtod->mult; tsc >>= gtod->shift; /* tsc is now a number of nanoseconds */ ts->tv_sec = __iter_div_u64_rem(tsc, NSEC_PER_SEC, &ts->tv_nsec); Use of the "open coded asm" style here forces the compiler to always choose the 32-bit version of rdtscp, which sets only %eax, %edx, and %ecx and does not clear the high bits of %rax, %rdx, and %rcx, because the variables are declared 32-bit - so the same 32-bit version is used whether the code is compiled with -m32 or -m64 (tested using gcc 5.4.0 and gcc 6.4.1). The full story and test programs are in Bug #198961: https://bugzilla.kernel.org/show_bug.cgi?id=198961 . The patched VDSO now handles clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on the same machine with a latency (minimum time that can be measured) of around 100ns (compared with 300-700ns before the patch). I also think it makes sense to expose pointers to the live, kernel-updated gtod->mult and gtod->shift values somehow to userspace. Then a userspace TSC reader could re-use previous values to avoid the long division in most cases and obtain latencies of 10-20ns. Hence there is now a new method in the VDSO: __vdso_linux_tsc_calibration(), which returns a pointer to a 'struct linux_tsc_calibration' declared in a new header, arch/x86/include/uapi/asm/vdso_tsc_calibration.h . If the clock source is NOT the TSC, this function returns NULL. The pointer is only valid while the system clock source is the TSC. 
User-space TSC readers can detect when the TSC is modified with Events, and now can detect when the clock source changes from / to TSC with this function. The patch:
---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..e840600 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -246,6 +247,29 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw(struct timespec *ts)
+{
+	volatile u32 tsc_lo = 0, tsc_hi = 0, tsc_cpu = 0; /* so the same instructions are generated for 64-bit as for 32-bit builds */
+	u64 ns;
+	register u64 tsc = 0;
+
+	if (gtod->vclock_mode == VCLOCK_TSC) {
+		asm volatile
+		("rdtscp"
+		 : "=a" (tsc_lo)
+		 , "=d" (tsc_hi)
+		 , "=c" (tsc_cpu)
+		); /* eax, edx, ecx used - NOT rax, rdx, rcx */
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+		tsc *= gtod->mult;
+		tsc >>= gtod->shift;
+		ts->tv_sec = __iter_div_u64_rem(tsc, NSEC_PER_SEC, &ns);
+		ts->tv_nsec = ns;
+		return VCLOCK_TSC;
+	}
+	return VCLOCK_NONE;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +301,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
@@ -326,3 +354,18 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t) __attribute__((weak, alias("__vdso_time")));
+
+extern const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void);
+
+notrace const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void)
+{
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		return (const struct linux_tsc_calibration *)&gtod->mult;
+	return 0UL;
+}
+
+const struct linux_tsc_calibration *linux_tsc_calibration(void)
+__attribute((weak, alias("__vdso_linux_tsc_calibration")));
+
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..41a2ca5 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -24,7 +24,9 @@ VERSION {
 		getcpu;
 		__vdso_getcpu;
 		time;
-
Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
On 06/03/2018, Thomas Gleixner <t...@linutronix.de> wrote: > Jason, > > On Mon, 5 Mar 2018, Jason Vas Dias wrote: > > thanks for providing this. A few formal nits first. > > Please read Documentation/process/submitting-patches.rst > > Patches need a concise subject line and the subject line wants a prefix, in > this case 'x86/vdso'. > > Please don't put anything past the patch. Your delimiters are human > readable, but cannot be handled by tools. > > Also please follow the kernel coding style guide lines. > >> It also provides a new function in the VDSO : >> >> struct linux_timestamp_conversion >> { u32 mult; >> u32 shift; >> }; >> extern >> const struct linux_timestamp_conversion * >> __vdso_linux_tsc_calibration(void); >> >> which can be used by user-space rdtsc / rdtscp issuers >> by using code such as in >> tools/testing/selftests/vDSO/parse_vdso.c >> to call vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"), >> which returns a pointer to the function in the VDSO, which >> returns the address of the 'mult' field in the vsyscall_gtod_data. > > No, that's just wrong. The VDSO data is solely there for the VDSO accessor > functions and not to be exposed to random user space. > >> Thus user-space programs can use rdtscp and interpret its return values >> in exactly the same way the kernel would, but without entering the >> kernel. > > The VDSO clock_gettime() functions are providing exactly this mechanism. > >> As pointed out in Bug # 198961 : >> https://bugzilla.kernel.org/show_bug.cgi?id=198961 >> which contains extra test programs and the full story behind this >> change, >> using CLOCK_MONOTONIC_RAW without the patch results in >> a minimum measurable time (latency) of @ 300 - 700ns because of >> the syscall used by vdso_fallback_gtod() . >> >> With the patch, the latency falls to @ 100ns . 
>> >> The latency would be @ 16 - 32 ns if the do_monotonic_raw() >> handler could record its previous TSC value and seconds return value >> somewhere, but since the VDSO has no data region or writable page, >> of course it cannot . > > And even if it could, it's not as simple as you want it to be. Clocksources > can change during runtime and without effective protection the values are > just garbage. > >> Hence, to enable effective use of TSC by user space programs, Linux must >> provide a way for them to discover the calibration mult and shift values >> the kernel uses for the clock source ; only by doing so can user-space >> get values that are comparable to kernel generated values. > > Linux must not do anything. It can provide a vdso implementation of > CLOCK_MONOTONIC_RAW, which does not enter the kernel, but not exposure to > data which is not reliably accessible by random user space code. > >> And I'd really like to know: why does the gtod->mult value change ? >> After TSC calibration, it and the shift are calculated to render the >> best approximation of a nanoseconds value from the TSC value. >> >> The TSC is MEANT to be monotonic and to continue in sleep states >> on modern Intel CPUs . So why does the gtod->mult change ? > > You are missing the fact that gtod->mult/shift are used for CLOCK_MONOTONIC > and CLOCK_REALTIME, which are adjusted by NTP/PTP to provide network > synchronized time. That means CLOCK_MONOTONIC is providing accurate > and slope compensated nanoseconds. > > The raw TSC conversion, even if it is sane hardware, provides just some > approximation of nanoseconds which can be off by quite a margin. > >> But the mult value does change. Currently there is no way for user-space >> programs to discover that such a change has occurred, or when . With this >> very tiny simple patch, they could know instantly when such changes >> occur, and could implement TSC readers that perform the full conversion >> with latencies of 15-30ns (on my CPU). 
> > No. Accessing the mult/shift pair without protection is racy and can lead > to completely erratic results. > >> +notrace static int __always_inline do_monotonic_raw( struct timespec >> *ts) >> +{ >> + volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs >> generated for 64-bit as for 32-bit builds >> + u64 ns; >> + register u64 tsc=0; >> + if (gtod->vclock_mode == VCLOCK_TSC) >> + { asm volatile >> + ( "rdtscp" >> + : "=a" (tsc_lo) >> + , "=d" (tsc_hi) >> + , "=c" (tsc_cpu) >> + ); // : eax, edx, ecx used - NOT rax, rdx, rcx > > If you look at the
[PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
"Total time: %1.1llu.%9.9lluS - Average Latency: %1.1llu.%9.9lluS\n", t1/10, t1-((t1/10)*10), avg_ns/10, avg_ns-((avg_ns/10)*10) ); return 0; } : END EXAMPLE EXAMPLE Usage : $ gcc -std=gnu11 -o t_vdso_tsc t_vdso_tsc.c $ ./t_vdso_tsc Got TSC calibration @ 0x7ffdb9be5098: mult: 5798705 shift: 24 sum: Total time: 0.04859S - Average Latency: 0.00022S Latencies are typically @ 15 - 30 ns . That multiplication and shift really doesn't leave very many significant seconds bits! Please, can the VDSO include some similar functionality to NOT always enter the kernel for CLOCK_MONOTONIC_RAW , and to export a pointer to the LIVE (kernel updated) gtod->mult and gtod->shift values somehow . The documentation states for CLOCK_MONOTONIC_RAW that it is the same as CLOCK_MONOTONIC except it is NOT subject to NTP adjustments . This is very far from the case currently, without a patch like the one above. And the kernel should not restrict user-space programs to only being able to either measure an NTP adjusted time value, or a time value difference of greater than 1000ns with any accuracy, on a modern Intel CPU whose TSC ticks 2.8 times per nanosecond (picosecond resolution is theoretically possible). Please, include something like the above patch in future Linux versions. Thanks & Best Regards, Jason Vas Dias <jason.vas.d...@gmail.com>

Re: perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .
On 13/02/2018, Jason Vas Dias <jason.vas.d...@gmail.com> wrote: > Good day - > > I'd much appreciate some advice as to why, on my Intel x86_64 > ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10, > or Linux 3.10.0, any attempt to count all of : > PERF_COUNT_HW_BRANCH_INSTRUCTIONS > (or raw config 0xC4) , and > PERF_COUNT_HW_BRANCH_MISSES > (or raw config 0xC5), and > combined with > PERF_COUNT_HW_CACHE_REFERENCES > (or raw config 0x4F2E ), and > PERF_COUNT_HW_CACHE_MISSES > (or raw config 0x412E) , > results in ALL COUNTERS BEING 0 in a read of the Group FD or > mmap sample area. > > This is demonstrated by the example program, which will > use perf_event_open() to create a Group Leader FD for the first event, > and associate all other events with that Event Group , so that it > will read all events on the group FD . > > The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, ) > calls all return successfully , but if I combine ANY of > ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS, > PERF_COUNT_HW_BRANCH_MISSES > ) with any of > ( PERF_COUNT_HW_CACHE_REFERENCES, > PERF_COUNT_HW_CACHE_MISSES > ) in the Event Group, ALL events have '0' event->value. > > Demo : > 1. Compile program to use kernel mapped Generic Events: > $ gcc -std=gnu11 -o perf_bug perf_bug.c > Running program shows all counters have 0 values, since both > CACHE & BRANCH hits+misses are being requested: > > $ ./perf_bug > EVENT: Branch Instructions : 0 > EVENT: Branch Misses : 0 > EVENT: Instructions : 0 > EVENT: CPU Cycles : 0 > EVENT: Ref. CPU Cycles : 0 > EVENT: Bus Cycles : 0 > EVENT: Cache References : 0 > EVENT: Cache Misses : 0 > > NOT registering interest in EITHER the BRANCH counters > OR the CACHE counters fixes the problem: > > Compile without registering for BRANCH_INSTRUCTIONS > or BRANCH_MISSES: > $ gcc -std=gnu11 -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4110 > EVENT: Ref. 
CPU Cycles : 4437 > EVENT: Bus Cycles : 152 > EVENT: Cache References : 1 > EVENT: Cache Misses : 1 > > Compile without registering for CACHE_REFERENCES or CACHE_MISSES: > $ gcc -std=gnu11 -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 106 > EVENT: Branch Misses : 6 > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4132 > EVENT: Ref. CPU Cycles : 8526 > EVENT: Bus Cycles : 295 > > The same thing happens if I do not use Generic Events, but rather > "dynamic raw PMU" events, by putting the hex values from > /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr > config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr > type value : > > $ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 0 > EVENT: Branch Misses : 0 > EVENT: Instructions : 0 > EVENT: CPU Cycles : 0 > EVENT: Ref. CPU Cycles : 0 > EVENT: Bus Cycles : 0 > EVENT: Cache References : 0 > EVENT: Cache Misses : 0 > > > $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4102 > EVENT: Ref. CPU Cycles : 4959 > EVENT: Bus Cycles : 171 > EVENT: Cache References : 2 > EVENT: Cache Misses : 2 > > $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 106 > EVENT: Branch Misses : 6 > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4108 > EVENT: Ref. CPU Cycles : 10817 > EVENT: Bus Cycles : 373 > > > The perf tool itself seems to have the same issue: > > With CACHE & BRANCH counters does not work : > $ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep > 1 > > Performance counter stats for 'sleep 1': > >r0c4 >(0.00%) >r0c5 >(0.00%) >r0c0 >(0.00%) >r03c >(0.00%) >r0300 >(0.00%) >r013c >(0.00%) >r04F2E >(0.00%) > r0412E > >1.001652932 seconds time elapsed > >Some events weren't counted. 
Try disabling the NMI watchdog: > echo 0 > /proc/sys/kernel/nmi_watchdog > perf stat ... > echo 1 > /proc/sys/kernel/nmi_watchdog > > Disabling the NMI watchdog makes no difference . > > It is very strange that perf thinks 'r0412E' is not supported : >$ cat
perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .
Good day - I'd much appreciate some advice as to why, on my Intel x86_64 ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10, or Linux 3.10.0, any attempt to count all of : PERF_COUNT_HW_BRANCH_INSTRUCTIONS (or raw config 0xC4) , and PERF_COUNT_HW_BRANCH_MISSES (or raw config 0xC5), and combined with PERF_COUNT_HW_CACHE_REFERENCES (or raw config 0x4F2E ), and PERF_COUNT_HW_CACHE_MISSES (or raw config 0x412E) , results in ALL COUNTERS BEING 0 in a read of the Group FD or mmap sample area. This is demonstrated by the example program, which will use perf_event_open() to create a Group Leader FD for the first event, and associate all other events with that Event Group , so that it will read all events on the group FD . The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, ) calls all return successfully , but if I combine ANY of ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS, PERF_COUNT_HW_BRANCH_MISSES ) with any of ( PERF_COUNT_HW_CACHE_REFERENCES, PERF_COUNT_HW_CACHE_MISSES ) in the Event Group, ALL events have '0' event->value. Demo : 1. Compile program to use kernel mapped Generic Events: $ gcc -std=gnu11 -o perf_bug perf_bug.c Running program shows all counters have 0 values, since both CACHE & BRANCH hits+misses are being requested: $ ./perf_bug EVENT: Branch Instructions : 0 EVENT: Branch Misses : 0 EVENT: Instructions : 0 EVENT: CPU Cycles : 0 EVENT: Ref. CPU Cycles : 0 EVENT: Bus Cycles : 0 EVENT: Cache References : 0 EVENT: Cache Misses : 0 NOT registering interest in EITHER the BRANCH counters OR the CACHE counters fixes the problem: Compile without registering for BRANCH_INSTRUCTIONS or BRANCH_MISSES: $ gcc -std=gnu11 -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c $ ./perf_bug EVENT: Instructions : 914 EVENT: CPU Cycles : 4110 EVENT: Ref. 
CPU Cycles : 4437 EVENT: Bus Cycles : 152 EVENT: Cache References : 1 EVENT: Cache Misses : 1 Compile without registering for CACHE_REFERENCES or CACHE_MISSES: $ gcc -std=gnu11 -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c $ ./perf_bug EVENT: Branch Instructions : 106 EVENT: Branch Misses : 6 EVENT: Instructions : 914 EVENT: CPU Cycles : 4132 EVENT: Ref. CPU Cycles : 8526 EVENT: Bus Cycles : 295 The same thing happens if I do not use Generic Events, but rather "dynamic raw PMU" events, by putting the hex values from /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr type value : $ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c $ ./perf_bug EVENT: Branch Instructions : 0 EVENT: Branch Misses : 0 EVENT: Instructions : 0 EVENT: CPU Cycles : 0 EVENT: Ref. CPU Cycles : 0 EVENT: Bus Cycles : 0 EVENT: Cache References : 0 EVENT: Cache Misses : 0 $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c $ ./perf_bug EVENT: Instructions : 914 EVENT: CPU Cycles : 4102 EVENT: Ref. CPU Cycles : 4959 EVENT: Bus Cycles : 171 EVENT: Cache References : 2 EVENT: Cache Misses : 2 $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c $ ./perf_bug EVENT: Branch Instructions : 106 EVENT: Branch Misses : 6 EVENT: Instructions : 914 EVENT: CPU Cycles : 4108 EVENT: Ref. CPU Cycles : 10817 EVENT: Bus Cycles : 373 The perf tool itself seems to have the same issue: With CACHE & BRANCH counters does not work : $ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1 Performance counter stats for 'sleep 1': r0c4 (0.00%) r0c5 (0.00%) r0c0 (0.00%) r03c (0.00%) r0300 (0.00%) r013c (0.00%) r04F2E (0.00%) r0412E 1.001652932 seconds time elapsed Some events weren't counted. Try disabling the NMI watchdog: echo 0 > /proc/sys/kernel/nmi_watchdog perf stat ... echo 1 > /proc/sys/kernel/nmi_watchdog Disabling the NMI watchdog makes no difference . 
It is very strange that perf thinks 'r0412E' is not supported : $ cat /sys/bus/event_source/devices/cpu/cache_misses event=0x2e,umask=0x41 The kernel should not be advertising an unsupported event in a /sys/bus/event_source/devices/cpu/events/ file, should it ? So perf stat has the same problem - omitting either the Cache or the Branch counters makes it work fine: without cache: $ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c}:SIu' sleep 1 Performance counter stats for 'sleep 1': 37740 r0c4 3557 r0c5 188552 r0c0 311684 r03c 360963 r0300 12461 r013c 1.001508109 seconds time elapsed without branch: $ perf stat -e '{r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1 Performance counter stats for 'sleep 1': 188554 r0c0
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I have found a new source of weirdness with TSC using clock_gettime(CLOCK_MONOTONIC_RAW,) : The vsyscall_gtod_data.mult field changes somewhat between calls to clock_gettime(CLOCK_MONOTONIC_RAW,), so that sometimes an extra (2^24) nanoseconds are added or removed from the value derived from the TSC and stored in 'ts' . This is demonstrated by the output of the test program in the attached ttsc.tar file: $ ./tlgtd it worked! - GTOD: clock:1 mult:5798662 shift:24 synced - mult now: 5798661 What it is doing is finding the address of the 'vsyscall_gtod_data' structure from /proc/kallsyms, and mapping the virtual address to an ELF section offset within /proc/kcore, and reading just the 'vsyscall_gtod_data' structure into user-space memory . Really, this 'mult' value, which is used to return the seconds|nanoseconds value: ( tsc_cycles * mult ) >> shift (where shift is 24 ), should not change from the first time it is initialized . The TSC is meant to be FIXED FREQUENCY, right ? So how could / why should the conversion function from TSC ticks to nanoseconds change ? So now it is doubly difficult for user-space libraries to maintain their RDTSC derived seconds|nanoseconds values to correlate well with those returned by the kernel, because they must regularly read the updated 'mult' value used by the kernel . I really don't think the kernel should randomly be deciding to increase / decrease the TSC tick period by 2^24 nanoseconds! Is this a bug or intentional ? I am searching for all places where a '[.>]mult.*=' occurs, but this returns rather a lot of matches. Please could a future version of Linux at least export the 'mult' and 'shift' values for the current clocksource ! 
Regards, Jason On 22/02/2017, Jason Vas Dias <jason.vas.d...@gmail.com> wrote: > OK, last post on this issue today - > can anyone explain why, with standard 4.10.0 kernel & no new > 'notsc_adjust' option, and the same maths being used, these two runs > should display > such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,) > values ? : > > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.00641 ns2: 0.02850 > ts3 - ts2: 175 ns1: 0.00659 > ts3 - ts2: 18 ns1: 0.00643 > ts3 - ts2: 18 ns1: 0.00618 > ts3 - ts2: 17 ns1: 0.00620 > ts3 - ts2: 17 ns1: 0.00616 > ts3 - ts2: 18 ns1: 0.00641 > ts3 - ts2: 18 ns1: 0.00709 > ts3 - ts2: 20 ns1: 0.00763 > ts3 - ts2: 20 ns1: 0.00735 > ts3 - ts2: 20 ns1: 0.00761 > t1 - t0: 78200 - ns2: 0.80824 > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.01294 ns2: 0.05375 > ts3 - ts2: 210 ns1: 0.01418 > ts3 - ts2: 23 ns1: 0.01399 > ts3 - ts2: 22 ns1: 0.01445 > ts3 - ts2: 25 ns1: 0.01321 > ts3 - ts2: 20 ns1: 0.01428 > ts3 - ts2: 25 ns1: 0.01367 > ts3 - ts2: 23 ns1: 0.01425 > ts3 - ts2: 23 ns1: 0.01357 > ts3 - ts2: 22 ns1: 0.01487 > ts3 - ts2: 25 ns1: 0.01377 > t1 - t0: 145753 - ns2: 0.000150781 > > (complete source of test program ttsc1 attached in ttsc.tar > $ tar -xpf ttsc.tar > $ cd ttsc > $ make > ). > > On 22/02/2017, Jason Vas Dias <jason.vas.d...@gmail.com> wrote: >> I actually tried adding a 'notsc_adjust' kernel option to disable any >> setting or >> access to the TSC_ADJUST MSR, but then I see the problems - a big >> disparity >> in values depending on which CPU the thread is scheduled - and no >> improvement in clock_gettime() latency. 
So I don't think the new >> TSC_ADJUST >> code in ts_sync.c itself is the issue - but something added @ 460ns >> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . >> As I don't think fixing the clock_gettime() latency issue is my problem >> or >> even >> possible with current clock architecture approach, it is a non-issue. >> >> But please, can anyone tell me if are there any plans to move the time >> infrastructure out of the kernel and into glibc along the lines >> outlined >> in previous mail - if not, I am going to concentrate on this more radical >> overhaul approach for my own systems . >> >> At least, I think mapping the clocksource information structure itself in >> some >> kind of sharable page makes sense . Processes could map that page >> copy-on-write >> so they could start off with all the timing parameters preloaded, then >> keep >> their copy updated using the rdtscp instruction , or msync() (read-only) >>