Re: Differences between builtins and modules
Sorry I didn't see this mail until now.

RE: Randy Dunlap wrote:
> Would someone please answer/reply to this (related) kernel bugzilla entry:
> https://bugzilla.kernel.org/show_bug.cgi?id=118661

Yes, I raised this bug because I think modinfo should return a 0 exit status if a requested module is built in, not just when it has been loaded, as this modified version does:

  $ modinfo snd
  modinfo: ERROR: Module snd not found.
  built-in: snd
  $ echo $?
  0

What was the query about Bug 118661 that needs to be answered? I don't see any query on the bug report - just a comment from someone who also agrees modinfo should return OK for a built-in module.

Glad to hear someone is finally considering fixing modinfo to report the status of built-in modules - with only a 2 year response time.

Thanks & Best Regards, Jason
Re: [PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Good day - I believe the last patch I sent, with $subject, addresses all concerns raised so far by reviewers, and complies with all kernel coding standards.

Please, it would be most helpful if you could let me know whether the patch is now acceptable and will be applied at some stage or not - or, if not, what the problem with it is. My clients are asking whether the patch is going to be in the upstream kernel or not, and I need to tell them something.

Thanks & Best Regards, Jason
[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch implements clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls entirely in the vDSO, without calling vdso_fallback_gettime().

It has been augmented to support compilation with or without -DRETPOLINE / $(RETPOLINE_CFLAGS): when compiled with -DRETPOLINE, not all function calls can be inlined within __vdso_clock_gettime, and all functions invoked by __vdso_clock_gettime must have the 'indirect_branch("keep")' and 'function_return("keep")' attributes to compile, otherwise thunk relocations are generated; and the functions cannot all be declared '__always_inline__', otherwise a compiler error ('not all __always_inline__ functions can be inlined', fatal under -Werror) is generated. Also, compared to the previous version of this patch, the do_*_coarse functions are still not inline, and were not inadvertently changed to inline.

I still think it might be better to apply H.J. Lu's patch from https://bugzilla.kernel.org/show_bug.cgi?id=199129 to disable -DRETPOLINE compilation for the vDSO.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..80d65d4 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,62 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+# define _NO_THUNK_RELOCS_() (indirect_branch("keep"),\
+			      function_return("keep"))
+# define _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+# define _RETPOLINE_INLINE_ inline
+#else
+# define _RETPOLINE_FUNC_ATTR_
+# define _RETPOLINE_INLINE_ __always_inline
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -225,7 +258,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -246,7 +280,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -256,7 +313,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -266,7 +324,8 @@ notrace static void do_monotonic_coarse(struct
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc6 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only the files:

	arch/x86/include/asm/vgtod.h
	arch/x86/entry/vdso/vclock_gettime.c
	arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls - after the patch, the same call on the same machine has a latency of ~20ns. Please consider applying something like this patch to a future Linux release.

This patch is being resent because it has slight improvements to the vclock_gettime static function attributes wrt the previous version. It also supersedes all previous patches with subject matching '.*VDSO should handle.*clock_gettime.*MONOTONIC_RAW' that I have sent - sorry for the resends.

Please apply this patch so we stop getting emails from the Intel build bot trying to build the previous version, with subject '[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall', which only fails to build because its patch 2/2, which removed -DRETPOLINE from the VDSO build and is now the subject of https://bugzilla.kernel.org/show_bug.cgi?id=199129 raised by H.J. Lu, was not applied first - sorry!

Thanks & Best Regards,
Jason Vas Dias
Re: [PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Note there is a bug raised by H.J. Lu:

  Bug 199129: Don't build vDSO with $(RETPOLINE_CFLAGS) -DRETPOLINE
  (https://bugzilla.kernel.org/show_bug.cgi?id=199129)

If you agree it is a bug, then use both patches from the post '[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall'; else, use the single patch from $subject, which makes the calls to the statics in vclock_gettime.c use indirect_branch("keep") / function_return("keep"), to avoid generation of thunk relocations, which would not occur unless compiled with -mindirect-branch=thunk-extern -mindirect-branch-register.

Thanks & Regards, Jason
[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls in the same way it handles clock_gettime(CLOCK_MONOTONIC, &ts) calls, reducing latency from ~200-1000ns to ~20ns.

It has been resent and augmented to support compilation with -DRETPOLINE / -mindirect-branch=thunk-extern -mindirect-branch-register, to avoid generating relocations for thunks.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..9b89f86 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,60 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+# define _NO_THUNK_RELOCS_() (indirect_branch("keep"),\
+			      function_return("keep"))
+# define _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+#else
+# define _RETPOLINE_FUNC_ATTR_
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -225,7 +256,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -246,7 +278,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -256,7 +311,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -266,7 +322,11 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace
+#ifdef RETPOLINE
+	__attribute__((indirect_branch("keep"), function_return("keep")))
+#endif
+int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
@@ -277,6 +337,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only the files:

	arch/x86/include/asm/vgtod.h
	arch/x86/entry/vdso/vclock_gettime.c
	arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls - after the patch, the same call on the same machine has a latency of ~20ns. Please consider applying something like this patch to a future Linux release.

Thanks & Best Regards,
Jason Vas Dias
Re: [PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
On 18/03/2018, Jason Vas Dias wrote: (should have CC'ed to list, sorry)

> On 17/03/2018, Andi Kleen wrote:
>>
>> That's quite a mischaracterization of the issue. gcc works as intended,
>> but the kernel did not correctly supply an indirect call retpoline thunk
>> to the vdso, and it just happened to work by accident with the old vdso.
>>
>>> The automated test builds should now succeed with this patch.
>>
>> How about just adding the thunk function to the vdso object instead of
>> this cheap hack?
>>
>> The other option would be to build vdso with inline thunks.
>>
>> But just disabling is completely the wrong action.
>>
>> -Andi
>
> Aha! Thanks for the clarification, Andi!
>
> I will do so and resend the 2nd patch.
>
> But is everyone agreed we should accept any slowdown for the timer
> functions? I personally don't think it is a good idea, but I will
> regenerate the patch with the thunk function and without the Makefile
> change.
>
> Thanks & Best Regards,
> Jason

I am wondering if it is not better to avoid the thunk being generated and remove the Makefile patch. I know that changing the switch in __vdso_clock_gettime() like this avoids the thunk:

	switch (clock) {
	case CLOCK_MONOTONIC:
		if (do_monotonic(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	default:
		switch (clock) {
		case CLOCK_REALTIME:
			if (do_realtime(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_MONOTONIC_RAW:
			if (do_monotonic_raw(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_REALTIME_COARSE:
			do_realtime_coarse(ts);
			break;
		case CLOCK_MONOTONIC_COARSE:
			do_monotonic_coarse(ts);
			break;
		default:
			goto fallback;
		}
	}
	return 0;
fallback:
	...

So at the cost of an unnecessary extra test of the clock parameter, the thunk is avoided. I wonder if the whole switch should be changed to an if / else clause?

Or, I know this might be unorthodox, but this might work:

#define _CAT(V1,V2) V1##V2
#define GTOD_CLK_LABEL(CLK) _CAT(_VCG_L_,CLK)
#define MAX_CLK 16 /* ^^ ?? */

__vdso_clock_gettime( ... )
{
	...
	static const void *clklbl_tab[MAX_CLK] = {
		[CLOCK_MONOTONIC]     = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC),
		[CLOCK_MONOTONIC_RAW] = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW),
		/* and similarly for all clocks handled */ ...
	};

	goto *clklbl_tab[clock & 0xf];

GTOD_CLK_LABEL(CLOCK_MONOTONIC):
	if (do_monotonic(ts) == VCLOCK_NONE)
		goto fallback;

GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW):
	if (do_monotonic_raw(ts) == VCLOCK_NONE)
		goto fallback;

	... /* similarly for all clocks */

fallback:
	return vdso_fallback_gettime(clock, ts);
}

If a restructuring like that might be acceptable (with correct tab-based formatting), and the vDSO can have such a table in its .bss, I think it would avoid the thunk, would have the advantage of precomputing the jump table at compile time, and would not require any indirect branches.

Any thoughts?

Thanks & Best regards,
Jason
Re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
fixed typo in timer_latency.c affecting only -r printout:

$ gcc -DN_SAMPLES=1000 -o timer timer_latency.c

CLOCK_MONOTONIC (using rdtscp_ordered()):

$ ./timer -m -r 10
sum: 67615 Total time: 0.67615S - Average Latency: 0.00067S N zero deltas: 0 N inconsistent deltas: 0
sum: 51858 Total time: 0.51858S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51742 Total time: 0.51742S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51944 Total time: 0.51944S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51838 Total time: 0.51838S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 52397 Total time: 0.52397S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52428 Total time: 0.52428S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52135 Total time: 0.52135S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52145 Total time: 0.52145S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 53116 Total time: 0.53116S - Average Latency: 0.00053S N zero deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00053S

CLOCK_MONOTONIC_RAW (using rdtscp()):

$ ./timer -r 10
sum: 25755 Total time: 0.25755S - Average Latency: 0.00025S N zero deltas: 0 N inconsistent deltas: 0
sum: 21614 Total time: 0.21614S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21616 Total time: 0.21616S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21610 Total time: 0.21610S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21619 Total time: 0.21619S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21617 Total time: 0.21617S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21610 Total time: 0.21610S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 16940 Total time: 0.16940S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
sum: 16939 Total time: 0.16939S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
sum: 16943 Total time: 0.16943S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00019S

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) \
		     + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const *argv, char *const *envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt = N_SAMPLES, s = 0, avg_n = 0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum = 0, zd = 0,
		ic = 0, d, t_start, avg_ns, *avgs = 0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1, repeat = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case 'd':
				case 'D':
					do_dump = true;
					break;
				case 'r':
				case 'R':
					if ((argn < argc) && (argv[argn+1] != NULL))
						repeat = atoi(argv[argn += 1]);
					break;
				case '?':
				case 'h':
				case 'u':
				case 'U':
				case 'H':
					fprintf(stderr,
						"Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t-r <n>\n]\tCalculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n");
					return 0;
				}

	if (repeat > 1) {
		avgs = alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
		if (((unsigned long) avgs) & 7)
			avgs = ((unsigned long long *)
				(((unsigned char *)avgs)
				 + (8 - (((unsigned long) avgs) & 7))));
	}

	do {
		cnt = N_SAMPLES;
		s = 0;
		do {
			if (0 != clock_gettime(clk, &sample[s++])) {
				fprintf(stderr,
					"oops, clock_gettime() failed: %d: '%s'.\n",
					errno, strerror(errno));
				return 1;
			}
		} while (--cnt);
		clock_gettime(clk, &sample[s]);

		for (s = 1; s < (N_SAMPLES+1); s += 1) {
			t1 = TS2NS(sample[s-1]);
			t2 = TS2NS(sample[s]);
			if ((t1 > t2)
			    || (sample[s-1].tv_sec > sample[s].tv_sec)
			    || ((sample[s-1].tv_sec == sample[s].tv_sec)
				&& (sample[s-1].tv_nsec > sample[s].tv_nsec))) {
				fprintf(stderr,
					"Inconsistency: %llu %llu %lu.%lu %lu.%lu\n",
					t1, t2,
					sample[s-1].tv_sec, sample[s-1].tv_nsec,
					sample[s].tv_sec, sample[s].tv_nsec);
re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Hi - I submitted a new stripped-down-to-bare-essentials version of the patch (see LKML emails with $subject) which passes all checkpatch.pl tests and addresses all concerns raised by reviewers, which uses only rdtsc_ordered(), and which only updates in vsyscall_gtod_data the new fields:

	u32 raw_mult, raw_shift;
	...
	gtod_long_t monotonic_time_raw_sec  /* == tk->raw_sec */ ,
		    monotonic_time_raw_nsec /* == tk->tkr_raw.nsec */ ;

(this is NOT the formatting used in vgtod.h - sorry about previous formatting issues).

I don't see how one could present the raw timespec in user-space properly without tk->tkr_raw.xtime_nsec and tk->raw_sec; monotonic has gtod->monotonic_time_sec and gtod->monotonic_time_snsec, and I am only trying to follow exactly the existing algorithm in timekeeping.c's getrawmonotonic64().

When I submitted the initial version of this stripped-down patch, I got an email back from the build robot reporting a compilation error:

> arch/x86/entry/vdso/vclock_gettime.o: In function `__vdso_clock_gettime':
> vclock_gettime.c:(.text+0xf7): undefined reference to `__x86_indirect_thunk_rax'
> /usr/bin/ld: arch/x86/entry/vdso/vclock_gettime.o: relocation R_X86_64_PC32
> against undefined symbol `__x86_indirect_thunk_rax' can not be used when
> making a shared object; recompile with -fPIC
> /usr/bin/ld: final link failed: Bad value
> collect2: error: ld returned 1 exit status
> --
> arch/x86/entry/vdso/vdso32.so.dbg: undefined symbols found
> --
> objcopy: 'arch/x86/entry/vdso/vdso64.so.dbg': No such file

I had fixed this problem with the patch to the RHEL kernel attached to bug #198961 (attachment #274751: https://bugzilla.kernel.org/attachment.cgi?id=274751), by simply reducing the number of clauses in __vdso_clock_gettime's switch (clock) from 6 to 5, but at the cost of an extra test of clock and a second switch (clock).
I reported this as GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908 because I don't think GCC should fail to do anything for a switch with 6 clauses and not for one with 5, but the response I got from H.J. Lu was:

H.J. Lu wrote @ 2018-03-16 22:13:27 UTC:
> vDSO isn't compiled with $(KBUILD_CFLAGS). Why does your kernel do it?
> Please try my kernel patch at comment 4.

So that patch to the arch/x86/vdso/Makefile only prevents it enabling the RETPOLINE_CFLAGS for building the vDSO. I defer to H.J.'s expertise on GCC + binutils and the advisability of enabling RETPOLINE_CFLAGS in the VDSO - GCC definitely behaves strangely for the vDSO when RETPOLINE_CFLAGS are enabled.

Please provide something like the patch in a future version of Linux, and I suggest not compiling the vDSO with RETPOLINE_CFLAGS, as H.J. does.

The inconsistency_check program in tools/testing/selftests/timers produces no errors for long runs, and the timer_latency.c program (attached) also produces no errors, with latencies of ~20ns for CLOCK_MONOTONIC_RAW and ~40ns for CLOCK_MONOTONIC - this is however with the additional rdtscp patches, and under 4.15.9, for use on my system; the 4.16-rc5 version submitted still uses barrier() + rdtsc, and that has a latency of ~30ns for CLOCK_MONOTONIC_RAW and ~40ns for CLOCK_MONOTONIC; but both are much, much better than the 200-1000ns for CLOCK_MONOTONIC_RAW that the unpatched kernels have (all times refer to the 'Average Latency' output produced by timer_latency.c).

I do apologize for whitespace errors, unread emails, resends, and the confusion of previous emails - I now understand the process and standards much better and will attempt to adhere to them more closely in future.

Thanks & Best Regards,
Jason Vas Dias

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) \
		     + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const *argv, char *const *envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt = N_SAMPLES, s = 0, avg_n = 0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum = 0, zd = 0,
		ic = 0, d, t_start, avg_ns, *avgs = 0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1, repeat = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case 'd':
				case 'D':
					do_dump = true;
					break;
				case 'r':
				case 'R':
					if ((a
[PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch allows compilation to succeed with compilers that support -DRETPOLINE - it was kindly contributed by H.J. Lu in GCC Bugzilla 84908: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908

Apparently the GCC retpoline implementation has a limitation that it cannot handle switch statements with more than 5 clauses, which vclock_gettime.c's __vdso_clock_gettime function now contains.

The automated test builds should now succeed with this patch.

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 1943aeb..cb64e10 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -76,7 +76,7 @@ CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
        -fno-omit-frame-pointer -foptimize-sibling-calls \
        -DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
 
-$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS)) $(CFL)
 
 #
 # vDSO code runs in userspace and -pg doesn't help with profiling anyway.
@@ -143,6 +143,7 @@ KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
 KBUILD_CFLAGS_32 += $(call cc-option, -fno-stack-protector)
 KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)
[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only the files:

	arch/x86/include/asm/vgtod.h
	arch/x86/entry/vdso/vclock_gettime.c
	arch/x86/entry/vsyscall/vsyscall_gtod.c
	arch/x86/entry/vdso/Makefile

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls - after the patch, the same call on the same machine has a latency of ~20ns.

Thanks & Best Regards,
Jason Vas Dias
[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW,&ts) calls in the same way it handles clock_gettime(CLOCK_MONOTONIC,&ts) calls, reducing latency from @ 200-1000ns to @ 20ns. diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..843b0a6 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void) return last; } -notrace static inline u64 vgetsns(int *mode) +notrace static inline __always_inline u64 vgetcycles(int *mode) { - u64 v; - cycles_t cycles; - - if (gtod->vclock_mode == VCLOCK_TSC) - cycles = vread_tsc(); + switch (gtod->vclock_mode) { + case VCLOCK_TSC: + return vread_tsc(); #ifdef CONFIG_PARAVIRT_CLOCK - else if (gtod->vclock_mode == VCLOCK_PVCLOCK) - cycles = vread_pvclock(mode); + case VCLOCK_PVCLOCK: + return vread_pvclock(mode); #endif #ifdef CONFIG_HYPERV_TSCPAGE - else if (gtod->vclock_mode == VCLOCK_HVCLOCK) - cycles = vread_hvclock(mode); + case VCLOCK_HVCLOCK: + return vread_hvclock(mode); #endif - else + default: + break; + } + return 0; +} + +notrace static inline u64 vgetsns(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) return 0; + v = (cycles - gtod->cycle_last) & gtod->mask; return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) + return 0; + + v = (cycles - gtod->cycle_last) & gtod->mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..c4d89b6 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -44,6 +44,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->mask = tk->tkr_mono.mask; vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +76,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..ec1a37c 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ 
-22,7 +22,8 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; - + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; gtod_long_t wall_time_sec; @@ -32,6 +33,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_ns
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Good day - RE: On 15/03/2018, Thomas Gleixner wrote: > On Thu, 15 Mar 2018, Jason Vas Dias wrote: >> On 15/03/2018, Thomas Gleixner wrote: >> > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote: >> > >> >> Resent to address reviewer comments. >> > >> > I was being patient so far and tried to guide you through the patch >> > submission process, but unfortunately this turns out to be just waste of >> > my >> > time. >> > >> > You have not addressed any of the comments I made here: >> > >> > [1] >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de >> > [2] >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de >> > >> >> I'm really sorry about that - I did not see those mails , >> and have searched for them in my inbox - > > That's close to the 'my dog ate the homework' excuse. > Nevertheless, those messages are NOT in my inbox, nor can I find them on the list - a google search for 'alpine.DEB.2.21.1803141511340.2481' or 'alpine.DEB.2.21.1803141527300.2481' returns only the last two mails on the subject , where you included the links to https://lkml.kernel.org. I don't know what went wrong here, but I did not receive those mails until you informed me of them yesterday evening, when I immediately regenerated the Patch #1 incorporating fixes for your comments, and sent it with Subject: '[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle\ clock_gettime(CLOCK_MONOTONIC_RAW) without syscall ' This version re-uses the 'gtod->cycles' value, which as you point out, is the same as 'tk->tkr_raw.cycle_last' - so I removed vread_tsc_raw() . > Of course they were sent to the list and to you personally as I used > reply-all. 
From the mail server log: > > 2018-03-14 15:27:27 1ew7NH-00039q-Hv <= t...@linutronix.de > id=alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de > > 2018-03-14 15:27:30 1ew7NH-00039q-Hv => jason.vas.d...@gmail.com R=dnslookup > T=remote_smtp H=gmail-smtp-in.l.google.com [2a00:1450:4013:c01::1a] > X=TLS1.2:RSA_AES_128_CBC_SHA1:128 DN="C=US,ST=California,L=Mountain > View,O=Google Inc,CN=mx.google.com" > > 2018-03-14 15:27:31 1ew7NH-00039q-Hv => linux-kernel@vger.kernel.org > R=dnslookup T=remote_smtp H=vger.kernel.org [209.132.180.67] > > > > 2018-03-14 15:27:47 1ew7NH-00039q-Hv Completed > > If those messages would not have been delivered to > linux-kernel@vger.kernel.org they would hardly be on the mailing list > archive, right? > Yes, I cannot explain why I did not receive them. I guess I should consider gmail an unreliable delivery method and use the lkml.org web interface to check for replies - I will do this from now on. > And they both got delivered to your gmail account as well. > No, they are not in my gmail account Inbox or folders. > ERROR: Missing Signed-off-by: line(s) > total: 1 errors, 0 warnings, 71 lines checked > I do not know how to fix this error - I was hoping someone on the list might enlighten me. > > WARNING: externs should be avoided in .c files > #24: FILE: arch/x86/entry/vdso/vclock_gettime.c:31: > +extern unsigned int __vdso_tsc_calibration( > I thought that must be a script bug, since no extern is being declared by that line; it is an external function declaration, just like the unmodified line that precedes it. > WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? > #93: > new file mode 100644 > > ERROR: Missing Signed-off-by: line(s) > > total: 1 errors, 2 warnings, 143 lines checked > > It reports an error for every single patch of your latest submission. > >> And I did send the test results in a previous mail - > > In private mail which I ignore if there is no real good reason.
And just > for the record. This private mail contains the following headers: > > In-Reply-To: > References: <1521001222-10712-1-git-send-email-jason.vas.d...@gmail.com> > <1521001222-10712-3-git-send-email-jason.vas.d...@gmail.com> > > From: Jason Vas Dias > Date: Wed, 14 Mar 2018 15:08:55 + > Message-ID: > > Subject: Re: [PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle > CLOCK_MONOTONIC_RAW > > So now, if you take the message ID which is in the In-Reply-To: field and > compare it to the message ID which I used for link [2]: > > In-Reply-To: >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de > > you might notice that these are identical. So how did you end up replying > to a mail which you never recei
[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments. Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161, as is the test program, timer_latency.c, to demonstrate the problem. Before the patch a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW,&ts) calls - after the patch, the same call on the same machine has a latency of @ 20ns. Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..8b9b9cf 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void) return last; } -notrace static inline u64 vgetsns(int *mode) +notrace static inline __always_inline u64 vgetcycles(int *mode) { - u64 v; - cycles_t cycles; - - if (gtod->vclock_mode == VCLOCK_TSC) - cycles = vread_tsc(); + switch (gtod->vclock_mode) { + case VCLOCK_TSC: + return vread_tsc(); #ifdef CONFIG_PARAVIRT_CLOCK - else if (gtod->vclock_mode == VCLOCK_PVCLOCK) - cycles = vread_pvclock(mode); + case VCLOCK_PVCLOCK: + return vread_pvclock(mode); #endif #ifdef CONFIG_HYPERV_TSCPAGE - else if (gtod->vclock_mode == VCLOCK_HVCLOCK) - cycles = vread_hvclock(mode); + case VCLOCK_HVCLOCK: + return vread_hvclock(mode); #endif - else + default: + break; + } + return 0; +} + +notrace static inline u64 vgetsns(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) return 0; + v = (cycles - gtod->cycle_last) & gtod->mask; return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) + return 0; + + v = (cycles - gtod->cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..83f5c21 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -44,6 +44,9 @@ void update_vsyscall(struct timekeeper *tk) vdata->mask = tk->tkr_mono.mask; vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +77,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..941e9d6 100644 --- a/arch/x86/include/asm/vgtod.h +++ 
b/arch/x86/include/asm/vgtod.h @@ -22,7 +22,9 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; - + u32 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; gtod_long_t wall_time_sec; @@ -32,6 +34,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Hi Thomas - RE: On 15/03/2018, Thomas Gleixner wrote: > Jason, > > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote: > >> Resent to address reviewer comments. > > I was being patient so far and tried to guide you through the patch > submission process, but unfortunately this turns out to be just waste of my > time. > > You have not addressed any of the comments I made here: > > [1] > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de > [2] > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de > I'm really sorry about that - I did not see those mails , and have searched for them in my inbox - are you sure they were sent to 'linux-kernel@vger.kernel.org' ? That is the only list I am subscribed to . I clicked on the links , but the 'To:' field is just 'linux-kernel' . If I had seen those messages before I re-submitted, those issues would have been fixed. checkpatch.pl did not report them - I ran it with all patches and it reported no errors . And I did send the test results in a previous mail - $ gcc -m64 -o timer timer.c ( must be compiled in 64-bit mode). This is using the new rdtscp() function : $ ./timer -r 100 ... 
Total time: 0.02806S - Average Latency: 0.00028S N zero deltas: 0 N inconsistent deltas: 0 Average of 100 average latencies of 100 samples : 0.00027S This is using the rdtsc_ordered() function: $ ./timer -m -r 100 Total time: 0.05269S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0 Average of 100 average latencies of 100 samples : 0.00047S timer.c is a very short program that just reads N_SAMPLES (a compile-time option) timespecs using either CLOCK_MONOTONIC_RAW (no -m) or CLOCK_MONOTONIC as the first parameter to clock_gettime(), then computes the deltas as long long, then averages them, counting any zero deltas, or deltas where the previous timespec is somehow greater than the current timespec, which are reported as inconsistencies (note 'inconsistent deltas: 0' and 'zero deltas: 0' in the output). So my initial claim that rdtscp() can be twice as fast as rdtsc_ordered() was not far-fetched - this is what I am seeing. I think this is because of the explicit barrier() call in rdtsc_ordered(). This must be slower than the internal processor pipeline "cancellation point" (barrier) used by the rdtscp instruction itself. This is the only reason for the rdtscp call - plus all modern Intel & AMD CPUs support it, and it DOES solve the ordering problem, whereby instructions in one pipeline of a task can get different rdtsc() results than instructions in another pipeline. I will document the results better in the ChangeLog, fix all issues you identified, and resend. I did not mean to ignore your comments - those mails are nowhere in my Inbox - please confirm the actual email address they are getting sent to. Thanks & Regards, Jason /* * Program to measure high-res timer latency.
* */ #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <stdbool.h> #include <string.h> #include <errno.h> #include <time.h> #include <alloca.h> #ifndef N_SAMPLES #define N_SAMPLES 100 #endif #define _STR(_S_) #_S_ #define STR(_S_) _STR(_S_) #define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) + ((unsigned long long)((_TS_).tv_nsec))) int main(int argc, char *const* argv, char *const* envp) { struct timespec sample[N_SAMPLES+1]; unsigned int cnt=N_SAMPLES, s=0 , avg_n=0; unsigned long long deltas [ N_SAMPLES ] , t1, t2, sum=0, zd=0, ic=0, d , t_start, avg_ns, *avgs=0; clockid_t clk = CLOCK_MONOTONIC_RAW; bool do_dump = false; int argn=1, repeat=1; for(; argn < argc; argn+=1) if( argv[argn] != NULL ) if( *(argv[argn]) == '-') switch( *(argv[argn]+1) ) { case 'm': case 'M': clk = CLOCK_MONOTONIC; break; case 'd': case 'D': do_dump = true; break; case 'r': case 'R': if( (argn < argc) && (argv[argn+1] != NULL)) repeat = atoi(argv[argn+=1]); break; case '?': case 'h': case 'u': case 'U': case 'H': fprintf(stderr,"Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t" "-r \n]\t" "Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n" ); return 0; } if( repeat > 1 ) { avgs=alloca(sizeof(unsigned long long) * (N_SAMPLES + 1)); if( ((unsigned long) avgs) & 7 ) avgs = ((unsigned long long*)(((unsigned char*)avgs)+(8-((unsigned long) avgs) & 7))); } do { cnt=N_SAMPLES; s=0; do { if( 0 != clock_gettime(clk, &sample[s++]) ) { fprintf(stderr,"oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno)); return 1; } }while( --cnt ); clock_gettime(clk, &sample[s]); for(s=1; s < (N_SAMPLES+1); s+=1) { t1 = TS2NS(sample[s-1]); t2 = TS2NS(sample[s]); if ( (t1 > t2)
[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Resent to address reviewer comments. Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . 
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c There are 3 patches in the series : Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with rdtsc_ordered() Patches #2 & #3 should be considered "optional" : Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new rdtscp() function in msr.h Patch #3 makes the VDSO export TSC calibration data via a new function in the vDSO: unsigned int __vdso_linux_tsc_calibration ( struct linux_tsc_calibration *tsc_cal ) that user code can optionally call. Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls somewhat faster than clock_gettime(CLOCK_MONOTONIC) calls. I think something like Patch #3 is necessary to export TSC calibration data to user-space TSC readers. It is entirely up to the kernel developers whether they want to include patches #2 and #3, but I think something like Patch #1 really needs to get into a future Linux release, as an unecessary latency of 200-1000ns for a timer that can tick 3 times per nanosecond is unacceptable. Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161. Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index fbc7371..2c46675 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc + u64 tsc = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 5af7093..0327a95 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,9 @@ #include #include #include +#include + +extern unsigned int tsc_khz; int vclocks_used __read_mostly; @@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 30df295..a5ff704 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -218,6 +218,37 @@ static __always_inline unsigned long long rdtsc_ordered(void) return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. + * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. 
+ * As does EAX_EDX_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + + asm volatile + ("rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // : eax, edx, ecx used - NOT rax, rdx, rcx + if (unlikely(cpu_out != ((void *)0))) + *cpu_out = tsc_cpu; + return ((((u64)tsc_hi) << 32) | + (((u64)tsc_lo) & 0x0ffffffffULL) + ); +} + /* Deprecated, keep it for a cycle for easier merging: */ #define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0) diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index 24e4d45..e7e4804 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -26,6 +26,7 @@ struct vsyscall_gtod_data { u64 raw_mask; u32 raw_mult; u32 raw_shift; + u32 has_rdtscp; /* open coded 'struct timespec' */ u64 wall_time_snsec;
[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..fbc7371 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode) return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..5af7093 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk) vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_cycle_last = tk->tkr_raw.cycle_last; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; + vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..24e4d45 100644 --- 
a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -22,6 +22,10 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; + u64 raw_cycle_last; + u64 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; @@ -32,6 +36,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 03f3904..61d9633 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -21,12 +21,15 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts); extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz); extern time_t __vdso_time(time_t *t); +extern unsigned int __vdso_tsc_calibration( + struct linux_tsc_calibration_s *tsc_cal); #ifdef CONFIG_PARAVIRT_CLOCK extern u8 pvclock_page @@ -383,3 +386,25 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +notrace unsigned int +__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) +{ + unsigned long seq; + + do { + seq = gtod_read_begin(gtod); + if ((gtod->vclock_mode == VCLOCK_TSC) && + (tsc_cal != ((void *)0UL))) { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult= gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + } while (unlikely(gtod_read_retry(gtod, seq))); + + return 0; +} + +unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce..e0b5cce 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S index 422764a..17fd07f 100644 --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff --git
a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S index 05cd1c5..7acac71 100644 --- a/arch/x86/entry/vdso/vdsox32.lds.S +++ b/arch/x86/entry/vdso/vdsox32.lds.S @@ -21,6 +21,7 @@ VERSION { __vdso_gettimeofday; __vdso_getcpu; __vdso_time; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/include/uapi/asm/vdso_tsc_calibration.h b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h new file mode 100644 index 000..ce4b5a45 --- /dev/null +++ b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _ASM_X86_VDSO_TSC_CALIBRATION_H +#define _ASM_X86_VDSO_TSC_CALIBRATION_H +/* + * Programs that want to use rdtsc / rdtscp instructions + * from user-space can make use of the Linux kernel TSC calibration + * by calling : + *__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *); + * ( one has to resolve this symbol as in + * tools/testing/selftests/vDSO/parse_vdso.c + * ) + * which fills in a structure + * with the following layout : + */ + +/** struct linux_tsc_calibration_s - + * mult:amount to multiply 64-bit TSC value by + * shift: the right shift to apply to (mult*TSC) yielding nanoseconds + * tsc_khz: the calibrated TSC frequency in KHz from which previous + * members calculated + */ +struct linux_tsc_calibration_s { + + unsigned int mult; + unsigned int shift; + unsigned int tsc_khz; + +}; + +/* To use: + * + * static unsigned + * (*linux_tsc_cal)(struct linux_tsc_calibration_s *linux_tsc_cal) = + *vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"); + * if(linux_tsc_cal == ((void *)0)) + * { fprintf(stderr,"the patch providing __vdso_linux_tsc_calibration" + * " is not applied to the kernel.\n"); + *return ERROR; + * } + * static struct linux_tsc_calibration clock_source={0}; + * if((clock_source.mult==0) && ! 
(*linux_tsc_cal)(&clock_source) ) + *fprintf(stderr,"TSC is not the system clocksource.\n"); + * unsigned int tsc_lo, tsc_hi, tsc_cpu; + * asm volatile + * ( "rdtscp" : (=a) tsc_hi, (=d) tsc_lo, (=c) tsc_cpu ); + * unsigned long tsc = (((unsigned long)tsc_hi) << 32) | tsc_lo; + * unsigned long nanoseconds = + * (( clock_source . mult ) * tsc ) >> (clock_source . shift); + * + * nanoseconds is now TSC value converted to nanoseconds, + * according to Linux' clocksource calibration values. + * Incidentally, 'tsc_cpu' is the number of the CPU the task is running on.
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks for the helpful comments, Peter - re: On 14/03/2018, Peter Zijlstra wrote: > >> Yes, I am sampling perf counters, > > You're not in fact sampling, you're just reading the counters. Correct, using Linux-ese terminology - but "sampling" in looser English. >> Reading performance counters does involve 2 ioctls and a read() , > > So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and > just let them run and do: > > read(group_fd, &buf_pre, size); > /* your code section */ > read(group_fd, &buf_post, size); > > /* compute buf_post - buf_pre */ > > Which is only 2 system calls, not 4. But I can't, really - I am trying to restrict the performance counter measurements to only a subset of the code, and exclude performance measurement result processing - so the timeline is like: struct timespec t_start, t_end; perf_event_open(...); thread_main_loop() { ... do { t _clock_gettime(CLOCK_MONOTONIC_RAW, &t_start); t+x _ enable_perf (); total_work = do_some_work(); disable_perf (); clock_gettime(CLOCK_MONOTONIC_RAW, &t_end); t+y_ read_perf_counters_and_store_results ( perf_grp_fd, &results , total_work, TS2T( &t_end ) - TS2T( &t_start) ); } while ( ); } Now. here the bandwidth / performance results recorded by my 'read_perf_counters_and_store_results' method is very sensitive to the measurement of the OUTER elapsed time . > > Also, a while back there was the proposal to extend the mmap() > self-monitoring interface to groups, see: > > https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowo...@hirez.programming.kicks-ass.net > > I never did get around to writing the actual code for it, but it > shouldn't be too hard. > Great, I'm looking forward to trying it - but meanwhile, to get NON-MULTIPLEXED measurements for the SAME CODE SEQUENCE over the SAME TIME I believe the group FD method is what is implemented and what works. >> The CPU_CLOCK software counter should give the converted TSC cycles >> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...) 
>> and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the >> difference between the event->time_running and time_enabled >> should also measure elapsed time . > > While CPU_CLOCK is TSC based, there is no guarantee it has any > correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based). > > (although, I think I might have fixed that recently and it might just > work, but it's very much not guaranteed). Yes, I believe the CPU_CLOCK is effectively the converted TSC - it does appear to correlate well with the new CLOCK_MONOTONIC_RAW values from the patched VDSO. > If you want to correlate to CLOCK_MONOTONIC_RAW you have to read > CLOCK_MONOTONIC_RAW and not some random other clock value. > Exactly ! Hence the need for the patch so that users can get CLOCK_MONOTONIC_RAW values with low latency and correlate them with PERF CPU_CLOCK values. >> This gives the "inner" elapsed time, from the perpective of the kernel, >> while the measured code section had the counters enabled. >> >> But unless the user-space program also has a way of measuring elapsed >> time from the CPU's perspective , ie. without being subject to >> operator or NTP / PTP adjustment, it has no way of correlating this >> inner elapsed time with any "outer" > > You could read the time using the group_fd's mmap() page. That actually > includes the TSC mult,shift,offset as used by perf clocks. > Yes, but as mentioned earlier, that presupposes I want to use the mmap() sample method - I don't - I want to use the Group FD method, so that I can be sure the measurements are for the same code sequence over the same period of time. >> Currently, users must parse the log file or use gdb / objdump to >> inspect /proc/kcore to get the TSC calibration and exact >> mult+shift values for the TSC value conversion. > > Which ;-) there's multiple floating around.. > Yes, but why must Linux make it so difficult ? 
I think it has to be recognized that the vDSO or user-space program are the only places in which low-latency clock values can be generated for use by user-space programs with sufficiently low latencies to be useful. So why does it not export the TSC calibration which is so complex to calibrate when such calibration information is available nowhere else ? >> Intel does not publish, nor does the CPU come with in ROM or firmware, >> the actual precise TSC frequency - this must be calibrated against the >> other clocks , according to a complicated procedure in section 18.2 of >> the SDM . My TSC has a "rated" / nominal TSC frequency , which one >> can compute from CPUID leaves, of 2.3ghz, but the "Refined TSC frequency" >> is 2.8333ghz . > > You might
[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..fbc7371 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode) return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..5af7093 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk) vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_cycle_last = tk->tkr_raw.cycle_last; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; + vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..24e4d45 100644 --- 
a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -22,6 +22,10 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; + u64 raw_cycle_last; + u64 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; @@ -32,6 +36,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of :
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only these files:

  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vdso/vdso.lds.S
  arch/x86/entry/vdso/vdsox32.lds.S
  arch/x86/entry/vdso/vdso32/vdso32.lds.S
  arch/x86/entry/vsyscall/vsyscall_gtod.c

There are 3 patches in the series:

  Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with rdtsc_ordered()

  Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new rdtscp() function in msr.h

  Patch #3 makes the VDSO export TSC calibration data via a new function in the vDSO:
    unsigned int __vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
  that user code can optionally call.

Patches #2 & #3 should be considered "optional". Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls have about half the latency of clock_gettime(CLOCK_MONOTONIC) calls. I think something like Patch #3 is necessary to export TSC calibration data to user-space TSC readers.

Best Regards,
Jason Vas Dias
[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index fbc7371..2c46675 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc + u64 tsc = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 5af7093..0327a95 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,9 @@ #include #include #include +#include + +extern unsigned tsc_khz; int vclocks_used __read_mostly; @@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 30df295..a5ff704 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -218,6 +218,36 @@ static __always_inline unsigned long long rdtsc_ordered(void) return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. + * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. 
+ * As does EAX_EDT_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + asm volatile + ( "rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // : eax, edx, ecx used - NOT rax, rdx, rcx + if (unlikely(cpu_out != ((void*)0))) + *cpu_out = tsc_cpu; + return u64)tsc_hi) << 32) | + (((u64)tsc_lo) & 0x0ULL ) + ); +} + /* Deprecated, keep it for a cycle for easier merging: */ #define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0) diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index 24e4d45..e7e4804 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -26,6 +26,7 @@ struct vsyscall_gtod_data { u64 raw_mask; u32 raw_mult; u32 raw_shift; + u32 has_rdtscp; /* open coded 'struct timespec' */ u64 wall_time_snsec;
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 2c46675..772988c 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -184,7 +185,7 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered()) + u64 tsc = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; if (likely(tsc >= last)) @@ -383,3 +384,21 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +unsigned int __vdso_linux_tsc_calibration( + struct linux_tsc_calibration_s *tsc_cal); + +notraceunsigned int +__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) +{ + if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult= gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce..e0b5cce 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S index 422764a..17fd07f 100644 --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff --git a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S 
index 05cd1c5..7acac71 100644 --- a/arch/x86/entry/vdso/vdsox32.lds.S +++ b/arch/x86/entry/vdso/vdsox32.lds.S @@ -21,6 +21,7 @@ VERSION { __vdso_gettimeofday; __vdso_getcpu; __vdso_time; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 0327a95..692562a 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -53,6 +53,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); + vdata->tsc_khz = tsc_khz; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index a5ff704..c7b2ed2 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -227,7 +227,7 @@ static __always_inline unsigned long long rdtsc_ordered(void) * the number (Intel CPU ID) of the CPU that the task is currently running on. * As does EAX_EDT_RET, this uses the "open-coded asm" style to * force the compiler + assembler to always use (eax, edx, ecx) registers, - * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit * variables are used - exactly the same code should be generated * for this instruction on 32-bit as on 64-bit when this asm stanza is used. * See: SDM , Vol #2, RDTSCP instruction. 
@@ -236,15 +236,15 @@ static __always_inline u64 rdtscp(u32 *cpu_out) { u32 tsc_lo, tsc_hi, tsc_cpu; asm volatile - ( "rdtscp" + ("rdtscp" : "=a" (tsc_lo) , "=d" (tsc_hi) , "=c" (tsc_cpu) ); // : eax, edx, ecx used - NOT rax, rdx, rcx - if (unlikely(cpu_out != ((void*)0))) + if (unlikely(cpu_out != ((void *)0))) *cpu_out = tsc_cpu; return u64)tsc_hi) << 32) | - (((u64)tsc_lo) & 0x0ULL ) + (((u64)tsc_lo) & 0x0ULL) ); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index e7e4804..75078fc 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -27,6 +27,7 @@ struct vsyscall_gtod_data { u32 raw_mult; u32 raw_shift; u32 has_rdtscp; + u32 tsc_khz;
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On 12/03/2018, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote:
>> Sometimes, particularly when correlating elapsed time to performance
>> counter values,
>
> So what actual problem are you trying to solve here? Perf can already
> give you sample time in various clocks, including MONOTONIC_RAW.
>

Yes, I am sampling perf counters, including CPU_CYCLES, INSTRUCTIONS, CPU_CLOCK, TASK_CLOCK, etc., in a Group FD I open with perf_event_open(), for the current thread on the current CPU - I am doing this for 4 threads, on Intel & ARM CPUs.

Reading performance counters does involve 2 ioctls and a read(), which takes time that already far exceeds the time required to read the TSC or CNTPCT in the VDSO.

The CPU_CLOCK software counter should give the converted TSC cycles seen between the ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, ...) and the ioctl(grp_fd, PERF_EVENT_IOC_DISABLE), and the difference between the event->time_running and time_enabled should also measure elapsed time.

This gives the "inner" elapsed time, from the perspective of the kernel, while the measured code section had the counters enabled.

But unless the user-space program also has a way of measuring elapsed time from the CPU's perspective, i.e. without being subject to operator or NTP / PTP adjustment, it has no way of correlating this inner elapsed time with any "outer" elapsed time measurement it may have made - I also measure the time taken by I/O operations between threads, for instance.

So that is my primary motivation - for each thread's main run loop, I enable performance counters and count several PMU counters and the CPU_CLOCK & TASK_CLOCK. I want to determine with maximal accuracy how much elapsed time was actually used executing the task's instructions on the CPU, and how long they took to execute. I want to try to exclude the time spent gathering and analysing the performance measurements from the time spent running the threads' main loop.
To do this accurately, it is best to exclude variations in time that occur because of operator or NTP / PTP adjustments. The CLOCK_MONOTONIC_RAW clock is the ONLY clock that is MEANT to be immune from any adjustment. It is meant to be a high-resolution clock with 1ns resolution that should be subject to no adjustment, and hence one would expect it to have the lowest latency. But the way Linux has up to now implemented it, CLOCK_MONOTONIC_RAW has a resolution (minimum time that can be measured) that varies from 300 - 1000ns. I can read the TSC and store a 16-byte timespec value in about 8ns on the same CPU.

I understand that Linux must conform to the POSIX interface, which means it cannot provide sub-nanosecond resolution timers, but it could allow user-space programs to easily discover the timer calibration so that user-space programs can read the timers themselves.

Currently, users must parse the log file or use gdb / objdump to inspect /proc/kcore to get the TSC calibration and exact mult+shift values for the TSC value conversion.

Intel does not publish, nor does the CPU come with in ROM or firmware, the actual precise TSC frequency - this must be calibrated against the other clocks, according to a complicated procedure in section 18.2 of the SDM. My TSC has a "rated" / nominal TSC frequency, which one can compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency" is 2.8333GHz.

Hence I think Linux should export this calibrated frequency somehow; its "calibration" is expressed as the raw clocksource 'mult' and 'shift' values, and is exported to the VDSO. I think the VDSO should read the TSC and use the calibration to render the raw, unadjusted time from the CPU's perspective. Hence, the patch I am preparing, which is again attached.

I will submit it properly via email once I figure out how to obtain the 'git-send-email' tool, and how to use it to send multiple patches, which seems to be the only way to submit acceptable patches.
Also, the attached timer program measures a latency of about 20ns with my patched 4.15.9 kernel, when it measured a latency of 300-1000ns without it.

Thanks & Regards,
Jason

vdso_clock_monotonic_raw_1.patch

/*
 * Program to measure high-res timer latency.
 */
#include #include #include #include #include #include #include #include

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

int main(int argc, char *const *argv, char *const *envp)
{
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case
Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
The split patches, with no checkpatch.pl failures, are attached and were just sent in separate emails to the mailing list. Sorry it took a few tries to get right. This will be my last send today - I'm off to use it at work.

Thanks & all the best,
Jason

vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch
vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#2.patch
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of :
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

This patch affects only these files:

  arch/x86/include/asm/msr.h
  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

This is the second patch in the series, which adds use of rdtscp.

Best Regards,
Jason Vas Dias
--- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 08:12:17.110120433 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:59:21.135475862 + @@ -187,7 +187,7 @@ notrace static u64 vread_tsc_raw(void) u64 tsc , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); + tsc = gtod->has_rdtscp ? rdtscp((void*)0UL) : rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff -up linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 2018-03-12 07:58:07.974214168 + +++ linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c 2018-03-12 08:54:07.490267640 + @@ -16,6 +16,7 @@ #include #include #include +#include int vclocks_used __read_mostly; @@ -49,6 +50,7 @@ void update_vsyscall(struct timekeeper * vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff -up linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 linux-4.16-rc5/arch/x86/include/asm/msr.h --- linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/include/asm/msr.h 2018-03-12 09:06:03.902728749 + @@ -218,6 +218,36 @@ static __always_inline unsigned long lon return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. 
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. + * As does EAX_EDT_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + asm volatile + ( "rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); + if ( unlikely(cpu_out != ((voi
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of :
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

This patch affects only these files:

  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

There are 2 patches in the series - this first one handles CLOCK_MONOTONIC_RAW in the VDSO using the existing rdtsc_ordered(), and the second uses a new rdtscp() function which avoids use of an explicit barrier.

Best Regards,
Jason Vas Dias
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c ---
Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Good day - On 12/03/2018, Ingo Molnar wrote: > > * Thomas Gleixner wrote: > >> On Mon, 12 Mar 2018, Jason Vas Dias wrote: >> >> checkpatch.pl still reports: >> >>total: 15 errors, 3 warnings, 165 lines checked >> Sorry I didn't see you had responded until 40 mins ago . I finally found where checkpatch.pl is and it now reports : WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #2: --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + WARNING: struct should normally be const #55: FILE: arch/x86/entry/vdso/vclock_gettime.c:282: +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) I don't know how to fix that, since 'ts' cannot be a const pointer. ERROR: Missing Signed-off-by: line(s) I guess that disappears once someone OKs the patch. total: 1 errors, 2 warnings, 127 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ../vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. >> > +notrace static u64 vread_tsc_raw(void) >> > +{ >> > + u64 tsc, last=gtod->raw_cycle_last; >> > + if( likely( gtod->has_rdtscp ) ) >> > + tsc = rdtscp((void*)0); >> >> Plus I asked more than once to split that rdtscp() stuff into a separate >> patch. I misunderstood - I thought you meant the rdtscp implementation which was split into a separate file - but now it is in a separate patch , (attached). >> >> You surely are free to ignore my review comments, but rest assured that >> I'm >> free to ignore the crap you insist to send me as well. > I didn't mean to ignore any comments, and I'm really trying to fix this problem the right way and not produce crap. 
> In addition to Thomas's review feedback I'd strongly urge the careful > reading of > Documentation/SubmittingPatches as well: > > - When sending multiple patches please use git-send-mail > > - Please don't send several patch iterations per day! > > - Code quality of the submitted patches is atrocious, please run them > through >scripts/checkpatch.pl (and make sure they pass) to at least enable the > reading >of them. > > - ... plus dozens of other details described in > Documentation/SubmittingPatches. > > Thanks, > > Ingo > I am reading all those documents and cannot see how the code in the attached patch contravenes any guidelines / best practices - if you can, please clarify phrases like "atrocious style" - I cannot see any style guidelines contravened, and I can show that the numeric output, now produced in 16-30ns, is just as good as the output that took 300-700ns before the patch was applied. Aside from any style comments, are there any content comments ? Sorry, I am new to the latest kernel guidelines. I needed to get this problem solved the right way for use at work today. Thanks for your advice, Best Regards Jason vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch Description: Binary data
[PATCH v4.16-rc4 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, about the same as do_monotonic(), and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing review issues - the next patch will add the rdtscp() function . The patch passes the checkpatch.pl script . Best Regards, Jason Vas Dias .
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5.1/arch/x86/entry/vsys
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S and adds one new file: arch/x86/include/uapi/asm/vdso_tsc_calibration.h This is a second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Resent : Oops, in previous version of this patch (#2), the comments in the new vdso_tsc_calibration were wrong, for an earlier version - sorry about that. Best Regards, Jason Vas Dias . PATCH 2/2: --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:38:53.019891195 + @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, 
alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S 2018-03-12 05:19:10.765022295 + @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c This is a second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Oops, previous version of this second patch mistakenly copied the changed part of vclock_gettime.c. Best Regards, Jason Vas Dias . diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:38:53.019891195 + @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S 2018-03-12 05:19:10.765022295 + @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 2018-03-12 00:2
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c This is a second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Best Regards, Jason Vas Dias . --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:10:53.185158334 + @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -385,3 +386,41 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && 
(tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/include/asm/msr.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing issues identified by tglx in the mail thread of $subject - mainly that the rdtscp() assembler wrapper function should be in msr.h - it now is.
There is a second patch following in a few minutes which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Best Regards, Jason Vas Dias . --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 04:29:27.296982872 + @@ -182,6 +182,19 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc, last=gtod->raw_cycle_last; + if( likely( gtod->has_rdtscp ) ) + tsc = rdtscp((void*)0); +else + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +216,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +280,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static int __always_inline do_monotonic_raw( struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct time
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks Thomas - On 11/03/2018, Thomas Gleixner wrote: > On Sun, 11 Mar 2018, Jason Vas Dias wrote: > > This looks better now. Though running that patch through checkpatch.pl > results in: > > total: 28 errors, 20 warnings, 139 lines checked > Hmm, I was unaware of that script, I'll run and find out why - probably because whitespace is not visible in emacs with my monospace font and it is very difficult to see if tabs are used if somehow a '\t\ ' or ' \t' has slipped in . I'll run the script, fix the errors, and repost. > > >> +notrace static u64 vread_tsc_raw(void) > > Why do you need a separate function? I asked you to use vread_tsc(). So you > might have reasons for doing that, but please then explain WHY and not just > throw the stuff in my direction w/o any comment. > Mainly, because vread_tsc() makes its comparison against gtod->cycles_last , a copy of tk->tkr_mono.cycle_last, while vread_tsc_raw() uses gtod->raw_cycle_last, a copy of tk->tkr_raw.cycle_last . And rdtscp has a built-in "barrier", as the comments explain, making rdtsc_ordered()'s 'barrier()' unnecessary . >> +{ >> +u64 tsc, last=gtod->raw_cycle_last; >> +if( likely( gtod->has_rdtscp ) ) { >> +u32 tsc_lo, tsc_hi, >> +tsc_cpu __attribute__((unused)); >> +asm volatile >> +( "rdtscp" >> +/* ^- has built-in cancellation point / pipeline stall >> "barrier" */ >> +: "=a" (tsc_lo) >> +, "=d" (tsc_hi) >> +, "=c" (tsc_cpu) >> +); // since all variables 32-bit, eax, edx, ecx used - >> NOT rax, rdx, rcx >> +tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | >> (((u64)tsc_lo) & 0xffffffffUL); > > This is not required to make the vdso accessor for monotonic raw work. > > If at all then the rdtscp support wants to be in a separate patch with a > proper explanation. > > Aside of that the code for rdtscp wants to be in a proper inline helper in > the relevant header file and written according to the coding style the > kernel uses for asm inlines. 
> Sorry, I will put the function in the same header as rdtsc_ordered () , in a separate patch. > The rest looks ok. > > Thanks, > > tglx > I'll re-generate patches and resend . A complete patch , against 4.15.9, is attached , that I am using , including a suggested '__vdso_linux_tsc_calibration()' function and arch/x86/include/uapi/asm/vdso_tsc_calibration.h file that does not return any pointers into the VDSO . Presuming this was split into separate patches as you suggest, and was against the latest HEAD branch (4.16-rcX), would it be OK to include the vdso_linux_tsc_calibration() work ? It does enable user space code to develop accurate TSC readers which are free to use different structures and pico-second resolution. The actual user-space clock_gettime(CLOCK_MONOTONIC_RAW) replacement I am using for work just reads the TSC , with a latency of < 8ns, and uses the linux_tsc_calibration to convert using floating-point as required. Thanks & Regards, Jason vdso_gettime_monotonic_raw-4.15.9.patch Description: Binary data
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing indentation issues after installation of emacs Lisp cc-mode hooks in Documentation/coding-style.rst and calling 'indent-region' and 'tabify' (whitespace-only changes) - SORRY ! (and even after that, somehow 2 '\t\n's got left in vgtod.h - now removed - sorry again!) . Best Regards, Jason Vas Dias . 
PATCH: --- diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 2018-03-04 22:54:11.0 + +++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 19:00:04.630019100 + @@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc, last=gtod->raw_cycle_last; + if( likely( gtod->has_rdtscp ) ) { + u32 tsc_lo, tsc_hi, + tsc_cpu __attribute__((unused)); + asm volatile + ( "rdtscp" + /* ^- has built-in cancellation point / pipeline stall "barrier" */ + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx + tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL); + } else { + tsc = rdtsc_ordered(); + } + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +290,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static int __always_inline do_monotonic_raw( struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); +
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing indentation issues after installation of emacs Lisp cc-mode hooks in Documentation/coding-style.rst and calling 'indent-region' and 'tabify' (whitespace only changes) - SORRY ! Best Regards, Jason Vas Dias .
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 19:00:04.630019100 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last=gtod->raw_cycle_last;
+	if( likely( gtod->has_rdtscp ) ) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		( "rdtscp"
+		  /* ^- has built-in cancellation point / pipeline stall "barrier" */
+		  : "=a" (tsc_lo)
+		  , "=d" (tsc_hi)
+		  , "=c" (tsc_cpu)
+		); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
  */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+
Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
Hi Thomas - Thanks very much for your help & guidance in previous mail: RE: On 08/03/2018, Thomas Gleixner wrote: > > The right way to do that is to put the raw conversion values and the raw > seconds base value into the vdso data and implement the counterpart of > getrawmonotonic64(). And if that is done, then it can be done for _ALL_ > clocksources which support VDSO access and not just for the TSC. > I have done this now with a new patch, sent in mail with subject : '[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW' which should address all the concerns you raise. > I already know how that works, really. I never doubted or meant to impugn that ! I am beginning to know a little how that works also, thanks in great part to your help last week - thanks for your patience. I was impatient last week to get access to low latency timers for a work project, and was trying to read the unadjusted clock . > instead of making completely false claims about the correctness of the kernel > timekeeping infrastructure. I really didn't mean to make any such claims - I'm sorry if I did . I was just trying to say that by the time the results of clock_gettime(CLOCK_MONOTONIC_RAW,&ts) were available to the caller they were not of much use, because the latencies often dwarf the time differences being measured . Anyway, I hope sometime you will consider putting such a patch in the kernel. I have developed a version for ARM also, but that depends on making the CNTPCT + CNTFRQ registers readable in user-space, which is not considered secure and is not normally done , but does work - it is against the Texas Instruments (ti-linux) kernel, can be enabled with a new KConfig option, and brings latencies down from > 300ns to < 20ns . Maybe I should post that also to kernel.org, or to ti.com ? I have a separate patch for the vdso_tsc_calibration export of the tsc_khz and calibration which no longer returns pointers into the VDSO - I can post this as a patch if you like.
Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c Best Regards, Jason Vas Dias .
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Oops, please disregard 1st mail on $subject - I guess use of Quoted Printable is not a way of getting past the email line length. Patch I tried to send is attached as attachment - will resend inline using other method. Sorry, Regards, Jason [ Attachment: vdso_monotonic_raw-v4.16-rc4.patch ]
Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
On 08/03/2018, Thomas Gleixner wrote: > On Tue, 6 Mar 2018, Jason Vas Dias wrote: >> I will prepare a new patch that meets submission + coding style guidelines >> and >> does not expose pointers within the vsyscall_gtod_data region to >> user-space code - >> but I don't really understand why not, since only the gtod->mult value >> will >> change as long as the clocksource remains TSC, and updates to it by the >> kernel >> are atomic and partial values cannot be read . >> >> The code in the patch reverts to old behavior for clocks which are not >> the >> TSC and provides a way for users to determine if the clock is still the >> TSC >> by calling '__vdso_linux_tsc_calibration()', which would return NULL if >> the clock is not the TSC . >> >> I have never seen Linux on a modern intel box spontaneously decide to >> switch from the TSC clocksource after calibration succeeds and >> it has decided to use the TSC as the system / platform clock source - >> what would make it do this ? >> >> But for the highly controlled systems I am doing performance testing on, >> I can guarantee that the clocksource does not change. > > We are not writing code for a particular highly controlled system. We > expose functionality which operates under all circumstances. There are > various reasons why TSC can be disabled at runtime, crappy BIOS/SMI, > sockets getting out of sync . > >> There is no way user code can write those pointers or do anything other >> than read them, so I see no harm in exposing them to user-space ; then >> user-space programs can issue rdtscp and use the same calibration values >> as the kernel, and use some cached 'previous timespec value' to avoid >> doing the long division every time. >> >> If the shift & mult are not accurate TSC calibration values, then the >> kernel should put other more accurate calibration values in the gtod . > > The raw calibration values are as accurate as the kernel can make them. 
But > they can be rather far off from converting to real nanoseconds for various > reasons. The NTP/PTP adjusted conversion is matching real units and is > obviously more accurate. > >> > Please look at the kernel side implementation of >> > clock_gettime(CLOCK_MONOTONIC_RAW). >> > The VDSO side can be implemented in the >> > same way. >> > All what is required is to expose the relevant information in the >> > existing vsyscall_gtod_data data structure. >> >> I agree - that is my point entirely , & what I was trying to do . > > Well, you did not expose the raw conversion data in vsyscall_gtod_data. You > are using: > > + tsc*= gtod->mult; > + tsc >>= gtod->shift; > > That's is the adjusted mult/shift value which can change when NTP/PTP is > enabled and you _cannot_ use it unprotected. > >> void getrawmonotonic64(struct timespec64 *ts) >> { >> struct timekeeper *tk = &tk_core.timekeeper; >> unsigned long seq; >> u64 nsecs; >> >> do { >> seq = read_seqcount_begin(&tk_core.seq); >> # ^-- I think this is the source of the locking >> #and the very long latencies ! > > This protects tk->raw_sec from changing which would result in random time > stamps. Yes, it can cause slightly larger latencies when the timekeeper is > updated on another CPU concurrently, but that's not the main reason why > this is slower in general than the VDSO functions. The syscall overhead is > there for every invocation and it's substantial. > >> So in fact, when the clock source is TSC, the value recorded in 'ts' >> by clock_gettime(CLOCK_MONOTONIC_RAW, &ts) is very similar to >> u64 tsc = rdtscp(); >> tsc *= gtod->mult; >> tsc >>= gtod->shift; >> ts.tv_sec=tsc / NSEC_PER_SEC; >> ts.tv_nsec=tsc % NSEC_PER_SEC; >> >> which is the algorithm I was using in the VDSO fast TSC reader, >> do_monotonic_raw() . > > Except that you are using the adjusted conversion values and not the raw > ones. 
So your VDSO implementation of monotonic raw access is just wrong and > not matching the syscall based implementation in any way. > >> The problem with doing anything more in the VDSO is that there >> is of course nowhere in the VDSO to store any data, as it has >> no data section or writable pages . So some kind of writable >> page would need to be added to the vdso , complicating its >> vdso/vma.c, etc., w
[PATCH v4.15.7 1/1] x86/vdso: handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) in VDSO
Handling clock_gettime( CLOCK_MONOTONIC_RAW, &timespec ) by calling vdso_fallback_gettime(), i.e. a syscall, is too slow - latencies of 300-700ns are common on Haswell (06:3C) CPUs . This patch against the 4.15.7 stable branch makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) by issuing rdtscp in userspace, IFF the clock source is the TSC, and converting it to nanoseconds using the vsyscall_gtod_data 'mult' and 'shift' fields :

volatile u32 tsc_lo, tsc_hi, tsc_cpu;
asm volatile( "rdtscp" : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu) );
u64 tsc = (((u64)tsc_hi)<<32) | ((u64)tsc_lo);
tsc *= gtod->mult;
tsc >>= gtod->shift; /* tsc is now a number of nanoseconds */
ts->tv_sec = __iter_div_u64_rem( tsc, NSEC_PER_SEC, &ts->tv_nsec );

Use of the "open coded asm" style here actually forces the compiler to always choose the 32-bit version of rdtscp, which sets only %eax, %edx, and %ecx and does not clear the high bits of %rax, %rdx, and %rcx , because the variables are declared 32-bit - so the same 32-bit version is used whether the code is compiled with -m32 or -m64 ( tested using gcc 5.4.0, gcc 6.4.1 ) . The full story and test programs are in Bug #198961 : https://bugzilla.kernel.org/show_bug.cgi?id=198961 . The patched VDSO now handles clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on the same machine with a latency (minimum time that can be measured) of around 100ns (compared with 300-700ns before the patch). I also think it makes sense to expose pointers to the live, updated gtod->mult and gtod->shift values somehow to userspace . Then a userspace TSC reader could re-use previous values to avoid doing the long division in most cases and obtain latencies of 10-20ns . Hence there is now a new method in the VDSO: __vdso_linux_tsc_calibration(), which returns a pointer to a 'struct linux_tsc_calibration' declared in a new header arch/x86/include/uapi/asm/vdso_tsc_calibration.h . If the clock source is NOT the TSC, this function returns NULL .
The pointer is only valid when the system clock source is the TSC . User-space TSC readers can detect when TSC is modified with Events, and now can detect when clock source changes from / to TSC with this function . The patch :
---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..e840600 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include <asm/vdso_tsc_calibration.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -246,6 +247,29 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs generated for 64-bit as for 32-bit builds
+u64 ns;
+register u64 tsc=0;
+if (gtod->vclock_mode == VCLOCK_TSC)
+{
+asm volatile
+( "rdtscp"
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // : eax, edx, ecx used - NOT rax, rdx, rcx
+tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+tsc *= gtod->mult;
+tsc >>= gtod->shift;
+ts->tv_sec = __iter_div_u64_rem(tsc, NSEC_PER_SEC, &ns);
+ts->tv_nsec = ns;
+return VCLOCK_TSC;
+}
+return VCLOCK_NONE;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +301,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 	if (do_monotonic(ts) == VCLOCK_NONE)
 		goto fallback;
 	break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
@@ -326,3 +354,18 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t) __attribute__((weak, alias("__vdso_time")));
+
+extern const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void);
+
+notrace const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void)
+{
+if( gtod->vclock_mode == VCLOCK_TSC )
+return ((const struct linux_tsc_calibration*) &gtod->mult);
+return 0UL;
+}
+
+const struct linux_tsc_calibration * linux_tsc_calibration(void)
+__attribute((weak, alias("__vdso_linux_tsc_calibration")));
+
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..41a2ca5 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -24,7 +24,9 @@ VERSION {
 		getcpu;
 		__vdso_getcpu;
 		ti
Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
On 06/03/2018, Thomas Gleixner wrote: > Jason, > > On Mon, 5 Mar 2018, Jason Vas Dias wrote: > > thanks for providing this. A few formal nits first. > > Please read Documentation/process/submitting-patches.rst > > Patches need a concise subject line and the subject line wants a prefix, in > this case 'x86/vdso'. > > Please don't put anything past the patch. Your delimiters are human > readable, but cannot be handled by tools. > > Also please follow the kernel coding style guide lines. > >> It also provides a new function in the VDSO : >> >> struct linux_timestamp_conversion >> { u32 mult; >> u32 shift; >> }; >> extern >> const struct linux_timestamp_conversion * >> __vdso_linux_tsc_calibration(void); >> >> which can be used by user-space rdtsc / rdtscp issuers >> by using code such as in >> tools/testing/selftests/vDSO/parse_vdso.c >> to call vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"), >> which returns a pointer to the function in the VDSO, which >> returns the address of the 'mult' field in the vsyscall_gtod_data. > > No, that's just wrong. The VDSO data is solely there for the VDSO accessor > functions and not to be exposed to random user space. > >> Thus user-space programs can use rdtscp and interpret its return values >> in exactly the same way the kernel would, but without entering the >> kernel. > > The VDSO clock_gettime() functions are providing exactly this mechanism. > >> As pointed out in Bug # 198961 : >> https://bugzilla.kernel.org/show_bug.cgi?id=198961 >> which contains extra test programs and the full story behind this >> change, >> using CLOCK_MONOTONIC_RAW without the patch results in >> a minimum measurable time (latency) of @ 300 - 700ns because of >> the syscall used by vdso_fallback_gtod() . >> >> With the patch, the latency falls to @ 100ns . 
>> >> The latency would be @ 16 - 32 ns if the do_monotonic_raw() >> handler could record its previous TSC value and seconds return value >> somewhere, but since the VDSO has no data region or writable page, >> of course it cannot . > > And even if it could, it's not as simple as you want it to be. Clocksources > can change during runtime and without effective protection the values are > just garbage. > >> Hence, to enable effective use of TSC by user space programs, Linux must >> provide a way for them to discover the calibration mult and shift values >> the kernel uses for the clock source ; only by doing so can user-space >> get values that are comparable to kernel generated values. > > Linux must not do anything. It can provide a vdso implementation of > CLOCK_MONOTONIC_RAW, which does not enter the kernel, but not exposure to > data which is not reliably accessible by random user space code. > >> And I'd really like to know: why does the gtod->mult value change ? >> After TSC calibration, it and the shift are calculated to render the >> best approximation of a nanoseconds value from the TSC value. >> >> The TSC is MEANT to be monotonic and to continue in sleep states >> on modern Intel CPUs . So why does the gtod->mult change ? > > You are missing the fact that gtod->mult/shift are used for CLOCK_MONOTONIC > and CLOCK_REALTIME, which are adjusted by NTP/PTP to provide network > synchronized time. That means CLOCK_MONOTONIC is providing accurate > and slope compensated nanoseconds. > > The raw TSC conversion, even if it is sane hardware, provides just some > approximation of nanoseconds which can be off by quite a margin. > >> But the mult value does change. Currently there is no way for user-space >> programs to discover that such a change has occurred, or when . With this >> very tiny simple patch, they could know instantly when such changes >> occur, and could implement TSC readers that perform the full conversion >> with latencies of 15-30ns (on my CPU). 
> > No. Accessing the mult/shift pair without protection is racy and can lead > to completely erratic results. > >> +notrace static int __always_inline do_monotonic_raw( struct timespec >> *ts) >> +{ >> + volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs >> generated for 64-bit as for 32-bit builds >> + u64 ns; >> + register u64 tsc=0; >> + if (gtod->vclock_mode == VCLOCK_TSC) >> + { asm volatile >> + ( "rdtscp" >> + : "=a" (tsc_lo) >> + , "=d" (tsc_hi) >> + , "=c" (tsc_cpu) >> + ); // : eax, edx, ecx used - NOT rax, rdx, rcx > > If you look
[PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
sum += sample[s];
fprintf(stderr, "sum: %llu\n", sum);
unsigned long long avg_ns = sum / N_SAMPLES;
t1 = (t2 - t_start);
fprintf(stderr, "Total time: %1.1llu.%9.9lluS - Average Latency: %1.1llu.%9.9lluS\n",
        t1/1000000000, t1-((t1/1000000000)*1000000000),
        avg_ns/1000000000, avg_ns-((avg_ns/1000000000)*1000000000));
return 0;
}
: END EXAMPLE

EXAMPLE Usage :
$ gcc -std=gnu11 -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0x7ffdb9be5098: mult: 5798705 shift: 24
sum:
Total time: 0.04859S - Average Latency: 0.00022S

Latencies are typically @ 15 - 30 ns . That multiplication and shift really doesn't leave very many significant seconds bits! Please, can the VDSO include some similar functionality to NOT always enter the kernel for CLOCK_MONOTONIC_RAW , and to export a pointer to the LIVE (kernel updated) gtod->mult and gtod->shift values somehow . The documentation states for CLOCK_MONOTONIC_RAW that it is the same as CLOCK_MONOTONIC except it is NOT subject to NTP adjustments . This is very far from the case currently, without a patch like the one above. And the kernel should not restrict user-space programs to only being able to either measure an NTP adjusted time value, or a time value difference of greater than 1000ns with any accuracy, on a modern Intel CPU whose TSC ticks 2.8 times per nanosecond (picosecond resolution is theoretically possible). Please, include something like the above patch in future Linux versions. Thanks & Best Regards, Jason Vas Dias
Re: perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .
On 13/02/2018, Jason Vas Dias wrote: > Good day - > > I'd much appreciate some advice as to why, on my Intel x86_64 > ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10, > or Linux 3.10.0, any attempt to count all of : > PERF_COUNT_HW_BRANCH_INSTRUCTIONS > (or raw config 0xC4) , and > PERF_COUNT_HW_BRANCH_MISSES > (or raw config 0xC5), and > combined with > PERF_COUNT_HW_CACHE_REFERENCES > (or raw config 0x4F2E ), and > PERF_COUNT_HW_CACHE_MISSES > (or raw config 0x412E) , > results in ALL COUNTERS BEING 0 in a read of the Group FD or > mmap sample area. > > This is demonstrated by the example program, which will > use perf_event_open() to create a Group Leader FD for the first event, > and associate all other events with that Event Group , so that it > will read all events on the group FD . > > The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, &id) > calls all return successfully , but if I combine ANY of > ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS, > PERF_COUNT_HW_BRANCH_MISSES > ) with any of > ( PERF_COUNT_HW_CACHE_REFERENCES, > PERF_COUNT_HW_CACHE_MISSES > ) in the Event Group, ALL events have '0' event->value. > > Demo : > 1. Compile program to use kernel mapped Generic Events: > $ gcc -std=gnu11 -o perf_bug perf_bug.c > Running program shows all counters have 0 values, since both > CACHE & BRANCH hits+misses are being requested: > > $ ./perf_bug > EVENT: Branch Instructions : 0 > EVENT: Branch Misses : 0 > EVENT: Instructions : 0 > EVENT: CPU Cycles : 0 > EVENT: Ref. CPU Cycles : 0 > EVENT: Bus Cycles : 0 > EVENT: Cache References : 0 > EVENT: Cache Misses : 0 > > NOT registering interest in EITHER the BRANCH counters > OR the CACHE counters fixes the problem: > > Compile without registering for BRANCH_INSTRUCTIONS > or BRANCH_MISSES: > $ gcc -std=gnu11 -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4110 > EVENT: Ref. 
CPU Cycles : 4437 > EVENT: Bus Cycles : 152 > EVENT: Cache References : 1 > EVENT: Cache Misses : 1 > > Compile without registering for CACHE_REFERENCES or CACHE_MISSES: > $ gcc -std=gnu11 -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 106 > EVENT: Branch Misses : 6 > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4132 > EVENT: Ref. CPU Cycles : 8526 > EVENT: Bus Cycles : 295 > > The same thing happens if I do not use Generic Events, but rather > "dynamic raw PMU" events, by putting the hex values from > /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr > config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr > type value : > > $ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 0 > EVENT: Branch Misses : 0 > EVENT: Instructions : 0 > EVENT: CPU Cycles : 0 > EVENT: Ref. CPU Cycles : 0 > EVENT: Bus Cycles : 0 > EVENT: Cache References : 0 > EVENT: Cache Misses : 0 > > > $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4102 > EVENT: Ref. CPU Cycles : 4959 > EVENT: Bus Cycles : 171 > EVENT: Cache References : 2 > EVENT: Cache Misses : 2 > > $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 106 > EVENT: Branch Misses : 6 > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4108 > EVENT: Ref. CPU Cycles : 10817 > EVENT: Bus Cycles : 373 > > > The perf tool itself seems to have the same issue: > > With CACHE & BRANCH counters does not work : > $ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep > 1 > > Performance counter stats for 'sleep 1': > >r0c4 >(0.00%) >r0c5 >(0.00%) >r0c0 >(0.00%) >r03c >(0.00%) >r0300 >(0.00%) >r013c >(0.00%) >r04F2E >(0.00%) > r0412E > >1.001652932 seconds time elapsed > >Some events weren't counted. 
Try disabling the NMI watchdog: > echo 0 > /proc/sys/kernel/nmi_watchdog > perf stat ... > echo 1 > /proc/sys/kernel/nmi_watchdog > > Disabling the NMI watchdog makes no difference . > > It is very strange that perf thinks 'r0412E' is not supp
perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .
Good day - I'd much appreciate some advice as to why, on my Intel x86_64 ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10 or Linux 3.10.0, any attempt to count all of PERF_COUNT_HW_BRANCH_INSTRUCTIONS (raw config 0xC4) and PERF_COUNT_HW_BRANCH_MISSES (raw config 0xC5) combined with PERF_COUNT_HW_CACHE_REFERENCES (raw config 0x4F2E) and PERF_COUNT_HW_CACHE_MISSES (raw config 0x412E) results in ALL COUNTERS BEING 0 in a read of the Group FD or mmap sample area.

This is demonstrated by the example program, which uses perf_event_open() to create a Group Leader FD for the first event and associates all other events with that Event Group, so that it reads all events on the group FD.

The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, &id) calls all return successfully, but if I combine ANY of ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS, PERF_COUNT_HW_BRANCH_MISSES ) with any of ( PERF_COUNT_HW_CACHE_REFERENCES, PERF_COUNT_HW_CACHE_MISSES ) in the Event Group, ALL events have '0' event->value.

Demo :

1. Compile the program to use kernel-mapped Generic Events:

$ gcc -std=gnu11 -o perf_bug perf_bug.c

Running the program shows all counters have 0 values, since both CACHE & BRANCH hits+misses are being requested:

$ ./perf_bug
EVENT: Branch Instructions : 0
EVENT: Branch Misses : 0
EVENT: Instructions : 0
EVENT: CPU Cycles : 0
EVENT: Ref. CPU Cycles : 0
EVENT: Bus Cycles : 0
EVENT: Cache References : 0
EVENT: Cache Misses : 0

NOT registering interest in EITHER the BRANCH counters OR the CACHE counters fixes the problem:

Compile without registering for BRANCH_INSTRUCTIONS or BRANCH_MISSES:

$ gcc -std=gnu11 -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Instructions : 914
EVENT: CPU Cycles : 4110
EVENT: Ref. CPU Cycles : 4437
EVENT: Bus Cycles : 152
EVENT: Cache References : 1
EVENT: Cache Misses : 1

Compile without registering for CACHE_REFERENCES or CACHE_MISSES:

$ gcc -std=gnu11 -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 106
EVENT: Branch Misses : 6
EVENT: Instructions : 914
EVENT: CPU Cycles : 4132
EVENT: Ref. CPU Cycles : 8526
EVENT: Bus Cycles : 295

The same thing happens if I do not use Generic Events, but rather "dynamic raw PMU" events, by putting the hex values from /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr type value :

$ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 0
EVENT: Branch Misses : 0
EVENT: Instructions : 0
EVENT: CPU Cycles : 0
EVENT: Ref. CPU Cycles : 0
EVENT: Bus Cycles : 0
EVENT: Cache References : 0
EVENT: Cache Misses : 0

$ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Instructions : 914
EVENT: CPU Cycles : 4102
EVENT: Ref. CPU Cycles : 4959
EVENT: Bus Cycles : 171
EVENT: Cache References : 2
EVENT: Cache Misses : 2

$ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 106
EVENT: Branch Misses : 6
EVENT: Instructions : 914
EVENT: CPU Cycles : 4108
EVENT: Ref. CPU Cycles : 10817
EVENT: Bus Cycles : 373

The perf tool itself seems to have the same issue. With CACHE & BRANCH counters it does not work :

$ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

   <not counted>   r0c4     (0.00%)
   <not counted>   r0c5     (0.00%)
   <not counted>   r0c0     (0.00%)
   <not counted>   r03c     (0.00%)
   <not counted>   r0300    (0.00%)
   <not counted>   r013c    (0.00%)
   <not counted>   r04F2E   (0.00%)
 <not supported>   r0412E

       1.001652932 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
	echo 0 > /proc/sys/kernel/nmi_watchdog
	perf stat ...
	echo 1 > /proc/sys/kernel/nmi_watchdog

Disabling the NMI watchdog makes no difference .
It is very strange that perf thinks 'r0412E' is not supported :

$ cat /sys/bus/event_source/devices/cpu/cache_misses
event=0x2e,umask=0x41

The kernel should not be advertising an unsupported event in a /sys/bus/event_source/devices/cpu/events/ file, should it ?

So perf stat has the same problem - without either the Cache or the Branch counters it seems to work fine:

without cache:

$ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

   37740   r0c4
    3557   r0c5
  188552   r0c0
  311684   r03c
  360963   r0300
   12461   r013c

       1.001508109 seconds time elapsed

without branch:

$ perf stat -e '{r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

  188554   r0c0 32
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I have found a new source of weirdness with the TSC using clock_gettime(CLOCK_MONOTONIC_RAW,&ts) :

The vsyscall_gtod_data.mult field changes somewhat between calls to clock_gettime(CLOCK_MONOTONIC_RAW,&ts), so that sometimes an extra (2^24) nanoseconds are added or removed from the value derived from the TSC and stored in 'ts' .

This is demonstrated by the output of the test program in the attached ttsc.tar file:

$ ./tlgtd
it worked! - GTOD: clock:1 mult:5798662 shift:24 synced - mult now: 5798661

What it is doing is finding the address of the 'vsyscall_gtod_data' structure from /proc/kallsyms, mapping that virtual address to an ELF section offset within /proc/kcore, and reading just the 'vsyscall_gtod_data' structure into user-space memory .

Really, this 'mult' value, which is used to compute the seconds|nanoseconds value as ( tsc_cycles * mult ) >> shift (where shift is 24), should not change after it is first initialized . The TSC is meant to be FIXED FREQUENCY, right ? So how could / why should the conversion function from TSC ticks to nanoseconds change ?

So now it is doubly difficult for user-space libraries to make their RDTSC-derived seconds|nanoseconds values correlate well with those returned by the kernel, because they must regularly re-read the updated 'mult' value used by the kernel . I really don't think the kernel should randomly be deciding to increase / decrease the TSC tick period by 2^24 nanoseconds! Is this a bug or intentional ?

I am searching for all places where a '[.>]mult.*=' occurs, but this returns rather a lot of matches.

Please could a future version of linux at least export the 'mult' and 'shift' values for the current clocksource !
Regards, Jason On 22/02/2017, Jason Vas Dias wrote: > OK, last post on this issue today - > can anyone explain why, with standard 4.10.0 kernel & no new > 'notsc_adjust' option, and the same maths being used, these two runs > should display > such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts) > values ? : > > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.00641 ns2: 0.02850 > ts3 - ts2: 175 ns1: 0.00659 > ts3 - ts2: 18 ns1: 0.00643 > ts3 - ts2: 18 ns1: 0.00618 > ts3 - ts2: 17 ns1: 0.00620 > ts3 - ts2: 17 ns1: 0.00616 > ts3 - ts2: 18 ns1: 0.00641 > ts3 - ts2: 18 ns1: 0.00709 > ts3 - ts2: 20 ns1: 0.00763 > ts3 - ts2: 20 ns1: 0.00735 > ts3 - ts2: 20 ns1: 0.00761 > t1 - t0: 78200 - ns2: 0.80824 > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.01294 ns2: 0.05375 > ts3 - ts2: 210 ns1: 0.01418 > ts3 - ts2: 23 ns1: 0.01399 > ts3 - ts2: 22 ns1: 0.01445 > ts3 - ts2: 25 ns1: 0.01321 > ts3 - ts2: 20 ns1: 0.01428 > ts3 - ts2: 25 ns1: 0.01367 > ts3 - ts2: 23 ns1: 0.01425 > ts3 - ts2: 23 ns1: 0.01357 > ts3 - ts2: 22 ns1: 0.01487 > ts3 - ts2: 25 ns1: 0.01377 > t1 - t0: 145753 - ns2: 0.000150781 > > (complete source of test program ttsc1 attached in ttsc.tar > $ tar -xpf ttsc.tar > $ cd ttsc > $ make > ). > > On 22/02/2017, Jason Vas Dias wrote: >> I actually tried adding a 'notsc_adjust' kernel option to disable any >> setting or >> access to the TSC_ADJUST MSR, but then I see the problems - a big >> disparity >> in values depending on which CPU the thread is scheduled - and no >> improvement in clock_gettime() latency. 
So I don't think the new >> TSC_ADJUST >> code in ts_sync.c itself is the issue - but something added @ 460ns >> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . >> As I don't think fixing the clock_gettime() latency issue is my problem >> or >> even >> possible with current clock architecture approach, it is a non-issue. >> >> But please, can anyone tell me if are there any plans to move the time >> infrastructure out of the kernel and into glibc along the lines >> outlined >> in previous mail - if not, I am going to concentrate on this more radical >> overhaul approach for my own systems . >> >> At least, I think mapping the clocksource information structure itself in >> some >> kind of sharable page makes sense . Processes could map that page >> copy-on-write >> so they could start off with all the timing parameters preloaded, then >> keep >> their copy up
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
OK, last post on this issue today - can anyone explain why, with standard 4.10.0 kernel & no new 'notsc_adjust' option, and the same maths being used, these two runs should display such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts) values ? : $ J/pub/ttsc/ttsc1 max_extended_leaf: 8008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.00641 ns2: 0.02850 ts3 - ts2: 175 ns1: 0.00659 ts3 - ts2: 18 ns1: 0.00643 ts3 - ts2: 18 ns1: 0.00618 ts3 - ts2: 17 ns1: 0.00620 ts3 - ts2: 17 ns1: 0.00616 ts3 - ts2: 18 ns1: 0.00641 ts3 - ts2: 18 ns1: 0.00709 ts3 - ts2: 20 ns1: 0.00763 ts3 - ts2: 20 ns1: 0.00735 ts3 - ts2: 20 ns1: 0.00761 t1 - t0: 78200 - ns2: 0.80824 $ J/pub/ttsc/ttsc1 max_extended_leaf: 8008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.01294 ns2: 0.05375 ts3 - ts2: 210 ns1: 0.01418 ts3 - ts2: 23 ns1: 0.01399 ts3 - ts2: 22 ns1: 0.01445 ts3 - ts2: 25 ns1: 0.01321 ts3 - ts2: 20 ns1: 0.01428 ts3 - ts2: 25 ns1: 0.01367 ts3 - ts2: 23 ns1: 0.01425 ts3 - ts2: 23 ns1: 0.01357 ts3 - ts2: 22 ns1: 0.01487 ts3 - ts2: 25 ns1: 0.01377 t1 - t0: 145753 - ns2: 0.000150781 (complete source of test program ttsc1 attached in ttsc.tar $ tar -xpf ttsc.tar $ cd ttsc $ make ). On 22/02/2017, Jason Vas Dias wrote: > I actually tried adding a 'notsc_adjust' kernel option to disable any > setting or > access to the TSC_ADJUST MSR, but then I see the problems - a big > disparity > in values depending on which CPU the thread is scheduled - and no > improvement in clock_gettime() latency. So I don't think the new > TSC_ADJUST > code in ts_sync.c itself is the issue - but something added @ 460ns > onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . 
> As I don't think fixing the clock_gettime() latency issue is my problem or > even > possible with current clock architecture approach, it is a non-issue. > > But please, can anyone tell me if are there any plans to move the time > infrastructure out of the kernel and into glibc along the lines > outlined > in previous mail - if not, I am going to concentrate on this more radical > overhaul approach for my own systems . > > At least, I think mapping the clocksource information structure itself in > some > kind of sharable page makes sense . Processes could map that page > copy-on-write > so they could start off with all the timing parameters preloaded, then > keep > their copy updated using the rdtscp instruction , or msync() (read-only) > with the kernel's single copy to get the latest time any process has > requested. > All real-time parameters & adjustments could be stored in that page , > & eventually a single copy of the tzdata could be used by both kernel > & user-space. > That is what I am working towards. Any plans to make linux real-time tsc > clock user-friendly ? > > > > On 22/02/2017, Jason Vas Dias wrote: >> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is >> read or written . It is probably because it genuinuely does not >> support any cpuid > 13 , >> or the modern TSC_ADJUST interface . This is probably why my >> clock_gettime() >> latencies are so bad. Now I have to develop a patch to disable all access >> to >> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . >> I really have an unlucky CPU :-) . 
>> >> But really, I think this issue goes deeper into the fundamental limits of >> time measurement on Linux : it is never going to be possible to measure >> minimum times with clock_gettime() comparable with those returned by >> rdtscp instruction - the time taken to enter the kernel through the VDSO, >> queue an access to vsyscall_gtod_data via a workqueue, access it & do >> computations & copy value to user-space is NEVER going to be up to the >> job of measuring small real-time durations of the order of 10-20 TSC >> ticks >> . >> >> I think the best way to solve this problem going forward would be to >> store >> the entire vsyscall_gtod_data data structure representing the current >> clocksource >> in a shared page which is memory-mappable (read-only) by user-space . >> I think sser-space programs should be able to do something like : >> int fd = >> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); >> size_t psz = getpagesize(); >> void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); >> msync(gtod,psz,MS_SYNC); >>
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I actually tried adding a 'notsc_adjust' kernel option to disable any setting or access to the TSC_ADJUST MSR, but then I see the problems - a big disparity in values depending on which CPU the thread is scheduled on - and no improvement in clock_gettime() latency. So I don't think the new TSC_ADJUST code in tsc_sync.c itself is the issue - but something added @ 460ns onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . As I don't think fixing the clock_gettime() latency issue is my problem, or even possible with the current clock architecture approach, it is a non-issue.

But please, can anyone tell me if there are any plans to move the time infrastructure out of the kernel and into glibc along the lines outlined in previous mail - if not, I am going to concentrate on this more radical overhaul approach for my own systems .

At least, I think mapping the clocksource information structure itself in some kind of sharable page makes sense . Processes could map that page copy-on-write so they could start off with all the timing parameters preloaded, then keep their copy updated using the rdtscp instruction , or msync() (read-only) with the kernel's single copy to get the latest time any process has requested. All real-time parameters & adjustments could be stored in that page , & eventually a single copy of the tzdata could be used by both kernel & user-space. That is what I am working towards. Any plans to make the linux real-time tsc clock user-friendly ?

On 22/02/2017, Jason Vas Dias wrote:
> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
> read or written . It is probably because it genuinely does not
> support any cpuid > 13 , or the modern TSC_ADJUST interface .
> This is probably why my clock_gettime() latencies are so bad.
> Now I have to develop a patch to disable all access to
> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
> I really have an unlucky CPU :-) .
> > But really, I think this issue goes deeper into the fundamental limits of > time measurement on Linux : it is never going to be possible to measure > minimum times with clock_gettime() comparable with those returned by > rdtscp instruction - the time taken to enter the kernel through the VDSO, > queue an access to vsyscall_gtod_data via a workqueue, access it & do > computations & copy value to user-space is NEVER going to be up to the > job of measuring small real-time durations of the order of 10-20 TSC ticks > . > > I think the best way to solve this problem going forward would be to store > the entire vsyscall_gtod_data data structure representing the current > clocksource > in a shared page which is memory-mappable (read-only) by user-space . > I think sser-space programs should be able to do something like : > int fd = > open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); > size_t psz = getpagesize(); > void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); > msync(gtod,psz,MS_SYNC); > > Then they could all read the real-time clock values as they are updated > in real-time by the kernel, and know exactly how to interpret them . > > I also think that all mktime() / gmtime() / localtime() timezone handling > functionality should be > moved to user-space, and that the kernel should actually load and link in > some > /lib/libtzdata.so > library, provided by glibc / libc implementations, that is exactly the > same library > used by glibc() code to parse tzdata ; tzdata should be loaded at boot time > by the kernel from the same places glibc loads it, and both the kernel and > glibc should use identical mktime(), gmtime(), etc. functions to access it, > and > glibc using code would not need to enter the kernel at all for any > time-handling > code. 
This tzdata-library code be automatically loaded into process images > the > same way the vdso region is , and the whole system could access only one > copy of it and the 'gtod.page' in memory. > > That's just my two-cents worth, and how I'd like to eventually get > things working > on my system. > > All the best, Regards, > Jason > > > > > > > > > > > > > > On 22/02/2017, Jason Vas Dias wrote: >> On 22/02/2017, Jason Vas Dias wrote: >>> RE: >>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR. >>> >>> I just built an unpatched linux v4.10 with tglx's TSC improvements - >>> much else improved in this kernel (like iwlwifi) - thanks! >>> >>> I have attached an updated version of the test program which >>> doesn't print the bogus "Nominal TSC Frequency" (the previous >>> version printed it, but equa
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is read or written . It is probably because it genuinely does not support any cpuid > 13 , or the modern TSC_ADJUST interface . This is probably why my clock_gettime() latencies are so bad. Now I have to develop a patch to disable all access to the TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . I really have an unlucky CPU :-) .

But really, I think this issue goes deeper, into the fundamental limits of time measurement on Linux : it is never going to be possible to measure minimum times with clock_gettime() comparable with those returned by the rdtscp instruction - the time taken to enter the kernel through the VDSO, queue an access to vsyscall_gtod_data via a workqueue, access it, do computations & copy the value to user-space is NEVER going to be up to the job of measuring small real-time durations of the order of 10-20 TSC ticks .

I think the best way to solve this problem going forward would be to store the entire vsyscall_gtod_data data structure representing the current clocksource in a shared page which is memory-mappable (read-only) by user-space . I think user-space programs should be able to do something like :

  int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page", O_RDONLY);
  size_t psz = getpagesize();
  void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
  msync(gtod, psz, MS_SYNC);

Then they could all read the real-time clock values as they are updated in real-time by the kernel, and know exactly how to interpret them .
I also think that all mktime() / gmtime() / localtime() timezone-handling functionality should be moved to user-space, and that the kernel should actually load and link in some /lib/libtzdata.so library, provided by glibc / libc implementations, that is exactly the same library used by glibc code to parse tzdata ; tzdata should be loaded at boot time by the kernel from the same places glibc loads it, and both the kernel and glibc should use identical mktime(), gmtime(), etc. functions to access it, and glibc-using code would not need to enter the kernel at all for any time-handling code. This tzdata-library code could be automatically loaded into process images the same way the vdso region is , and the whole system could access only one copy of it and the 'gtod.page' in memory.

That's just my two-cents worth, and how I'd like to eventually get things working on my system.

All the best, Regards,
Jason

On 22/02/2017, Jason Vas Dias wrote:
> On 22/02/2017, Jason Vas Dias wrote:
>> RE:
>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>
>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>> much else improved in this kernel (like iwlwifi) - thanks!
>>
>> I have attached an updated version of the test program which
>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>> version printed it, but equally ignored it).
>>
>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>>
>> $ uname -r
>> 4.10.0
>> $ ./ttsc1
>> max_extended_leaf: 8008
>> has tsc: 1 constant: 1
>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.00588 ns2: 0.02599 >> ts3 - ts2: 178 ns1: 0.00592 >> ts3 - ts2: 14 ns1: 0.00577 >> ts3 - ts2: 14 ns1: 0.00651 >> ts3 - ts2: 17 ns1: 0.00625 >> ts3 - ts2: 17 ns1: 0.00677 >> ts3 - ts2: 17 ns1: 0.00626 >> ts3 - ts2: 17 ns1: 0.00627 >> ts3 - ts2: 17 ns1: 0.00627 >> ts3 - ts2: 18 ns1: 0.00655 >> ts3 - ts2: 17 ns1: 0.00631 >> t1 - t0: 89067 - ns2: 0.91411 >> > > > Oops, going blind in my old age. These latencies are actually 3 times > greater than under 4.8 !! > > Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as > shown > in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: > > ts3 - ts2: 24 ns1: 0.00162 > ts3 - ts2: 17 ns1: 0.00143 > ts3 - ts2: 17 ns1: 0.00146 > ts3 - ts2: 17 ns1: 0.00149 > ts3 - ts2: 17 ns1: 0.00141 > ts3 - ts2: 16 ns1: 0.00142 > > now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ > 600ns, @ 4 times more than under 4.8 . > But I'm glad the TSC_ADJUST problems are fixed. > > Will programs reading : > $ cat /sys/devices/msr/events/tsc > event=0x00 > read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the > TSC ? > >> I think this is because under Linux 4.8, the CPU got a fault every >> time it read the TSC_ADJUST MSR. > > maybe it still is! > > >> But
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
On 22/02/2017, Jason Vas Dias wrote: > RE: >>> 4.10 has new code which utilizes the TSC_ADJUST MSR. > > I just built an unpatched linux v4.10 with tglx's TSC improvements - > much else improved in this kernel (like iwlwifi) - thanks! > > I have attached an updated version of the test program which > doesn't print the bogus "Nominal TSC Frequency" (the previous > version printed it, but equally ignored it). > > The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by > a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : > > $ uname -r > 4.10.0 > $ ./ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. > ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.00588 ns2: 0.02599 > ts3 - ts2: 178 ns1: 0.00592 > ts3 - ts2: 14 ns1: 0.00577 > ts3 - ts2: 14 ns1: 0.00651 > ts3 - ts2: 17 ns1: 0.00625 > ts3 - ts2: 17 ns1: 0.00677 > ts3 - ts2: 17 ns1: 0.00626 > ts3 - ts2: 17 ns1: 0.00627 > ts3 - ts2: 17 ns1: 0.00627 > ts3 - ts2: 18 ns1: 0.00655 > ts3 - ts2: 17 ns1: 0.00631 > t1 - t0: 89067 - ns2: 0.91411 > Oops, going blind in my old age. These latencies are actually 3 times greater than under 4.8 !! Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: ts3 - ts2: 24 ns1: 0.00162 ts3 - ts2: 17 ns1: 0.00143 ts3 - ts2: 17 ns1: 0.00146 ts3 - ts2: 17 ns1: 0.00149 ts3 - ts2: 17 ns1: 0.00141 ts3 - ts2: 16 ns1: 0.00142 now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns, @ 4 times more than under 4.8 . But I'm glad the TSC_ADJUST problems are fixed. Will programs reading : $ cat /sys/devices/msr/events/tsc event=0x00 read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the TSC ? > I think this is because under Linux 4.8, the CPU got a fault every > time it read the TSC_ADJUST MSR. maybe it still is! 
> But user programs wanting to use the TSC and correlate its value to > clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above > program still have to dig the TSC frequency value out of the kernel > with objdump - this was really the point of the bug #194609. > > I would still like to investigate exporting 'tsc_khz' & 'mult' + > 'shift' values via sysfs. > > Regards, > Jason. > > > > > > On 21/02/2017, Jason Vas Dias wrote: >> Thank You for enlightening me - >> >> I was just having a hard time believing that Intel would ship a chip >> that features a monotonic, fixed frequency timestamp counter >> without specifying in either documentation or on-chip or in ACPI what >> precisely that hard-wired frequency is, but I now know that to >> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU >> assert CPUID:8007[8] ( InvariantTSC ) which it does, which is >> difficult to reconcile with the statement in the SDM : >> 17.16.4 Invariant Time-Keeping >> The invariant TSC is based on the invariant timekeeping hardware >> (called Always Running Timer or ART), that runs at the core crystal >> clock >> frequency. The ratio defined by CPUID leaf 15H expresses the >> frequency >> relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] >> != >> 0 >> and CPUID.8007H:EDX[InvariantTSC] = 1, the following linearity >> relationship holds between TSC and the ART hardware: >> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) >> / CPUID.15H:EAX[31:0] + K >> Where 'K' is an offset that can be adjusted by a privileged agent*2. >> When ART hardware is reset, both invariant TSC and K are also reset. >> >> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and >> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) >> that >> the "Nominal TSC Frequency" formulae in the manul must apply to all >> CPUs with InvariantTSC . 
>> >> Do I understand correctly , that since I do have InvariantTSC , the >> TSC_Value is in fact calculated according to the above formula, but with >> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to >> TSC frequency ? >> It was obvious this nominal TSC Frequency had nothing to do with the >> actual TSC frequency used by Linux, which is 'tsc_khz' . >> I guess wishful thinking led me to believe CPUID:15h was actually >> supported somehow , because I thought InvariantTSC meant it had ART >
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
RE: >> 4.10 has new code which utilizes the TSC_ADJUST MSR. I just built an unpatched linux v4.10 with tglx's TSC improvements - much else improved in this kernel (like iwlwifi) - thanks! I have attached an updated version of the test program which doesn't print the bogus "Nominal TSC Frequency" (the previous version printed it, but equally ignored it). The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : $ uname -r 4.10.0 $ ./ttsc1 max_extended_leaf: 8008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.00588 ns2: 0.02599 ts3 - ts2: 178 ns1: 0.00592 ts3 - ts2: 14 ns1: 0.00577 ts3 - ts2: 14 ns1: 0.00651 ts3 - ts2: 17 ns1: 0.00625 ts3 - ts2: 17 ns1: 0.00677 ts3 - ts2: 17 ns1: 0.00626 ts3 - ts2: 17 ns1: 0.00627 ts3 - ts2: 17 ns1: 0.00627 ts3 - ts2: 18 ns1: 0.00655 ts3 - ts2: 17 ns1: 0.00631 t1 - t0: 89067 - ns2: 0.91411 I think this is because under Linux 4.8, the CPU got a fault every time it read the TSC_ADJUST MSR. But user programs wanting to use the TSC and correlate its value to clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above program still have to dig the TSC frequency value out of the kernel with objdump - this was really the point of the bug #194609. I would still like to investigate exporting 'tsc_khz' & 'mult' + 'shift' values via sysfs. Regards, Jason. 
On 21/02/2017, Jason Vas Dias wrote:
> Thank You for enlightening me -
>
> I was just having a hard time believing that Intel would ship a chip
> that features a monotonic, fixed-frequency timestamp counter
> without specifying in either documentation or on-chip or in ACPI what
> precisely that hard-wired frequency is, but I now know that to
> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
> assert CPUID:80000007H[8] ( InvariantTSC ), which it does, when that is
> difficult to reconcile with the statement in the SDM :
>
>   17.16.4 Invariant Time-Keeping
>   The invariant TSC is based on the invariant timekeeping hardware
>   (called Always Running Timer or ART), that runs at the core crystal
>   clock frequency. The ratio defined by CPUID leaf 15H expresses the
>   frequency relationship between the ART hardware and TSC.
>   If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1,
>   the following linearity relationship holds between TSC and the ART
>   hardware:
>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) / CPUID.15H:EAX[31:0] + K
>   Where 'K' is an offset that can be adjusted by a privileged agent*2.
>   When ART hardware is reset, both invariant TSC and K are also reset.
>
> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that
> the "Nominal TSC Frequency" formulae in the manual must apply to all
> CPUs with InvariantTSC .
>
> I do strongly suggest that Linux exports its calibrated TSC KHz
> somewhere to user space .
>
> I think the best long-term solution would be to allow programs to
> somehow read the TSC without invoking
> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) & having to enter the kernel,
> which incurs an overhead of >120ns on my system .
>
> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
> 'clocksource->shift' values to /sysfs somehow ?
>
> For instance, only if the 'current_clocksource' is 'tsc', then these
> values could be exported as:
>   /sys/devices/system/clocksource/clocksource0/shift
>   /sys/devices/system/clocksource/clocksource0/mult
>   /sys/devices/system/clocksource/clocksource0/freq
>
> So user-space programs could know that the value returned by
> clock_gettime(CLOCK_MONOTONIC_RAW) would be
>   { .tv_sec  = ( ( rdtsc() * mult ) >> shift ) >> 32,
>     .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U }
> and that represents ticks of period (1.0 / ( freq
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
":%d:(%s): must be called with invariant TSC enabled.\n");
    return 0;
  }
  U32_t tsc_hi, tsc_lo;
  register UL_t tsc;
  asm volatile
  ( "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "mov %%ecx, %2\n\t"
    : "=m" (tsc_hi), "=m" (tsc_lo), "=m" (_ia64_tsc_user_cpu)
    : : "%eax", "%ecx", "%edx"
  );
  tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
  return tsc;
}

__thread U64_t _ia64_first_tsc = 0xffffffffffffffffUL;

static inline __attribute__((always_inline)) U64_t IA64_tsc_ticks_since_start()
{
  if (_ia64_first_tsc == 0xffffffffffffffffUL)
  {
    _ia64_first_tsc = IA64_tsc_now();
    return 0;
  }
  return IA64_tsc_now() - _ia64_first_tsc;
}

static inline __attribute__((always_inline)) void
ia64_tsc_calc_mult_shift(register U32_t *mult, register U32_t *shift)
{ /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
   * calculates second + nanosecond mult + shift the same way linux does.
   * we want to be compatible with what linux returns in struct timespec ts
   * after a call to clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
   */
  const U32_t scale = 1000U;
  register U32_t from = IA64_tsc_khz();
  register U32_t to = NSEC_PER_SEC / scale;
  register U64_t sec = (~0UL / from) / scale;
  sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
  register U64_t maxsec = sec * scale;
  UL_t tmp;
  U32_t sft, sftacc = 32;
  /*
   * Calculate the shift factor which is limiting the conversion range:
   */
  tmp = (maxsec * from) >> 32;
  while (tmp)
  {
    tmp >>= 1;
    sftacc--;
  }
  /*
   * Find the conversion shift/mult pair which has the best
   * accuracy and fits the maxsec conversion range:
   */
  for (sft = 32; sft > 0; sft--)
  {
    tmp = ((UL_t)to) << sft;
    tmp += from / 2;
    tmp = tmp / from;
    if ((tmp >> sftacc) == 0)
      break;
  }
  *mult = tmp;
  *shift = sft;
}

__thread U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift = ~0U;

static inline __attribute__((always_inline)) U64_t IA64_s_ns_since_start()
{
  if ((_ia64_tsc_mult == ~0U) || (_ia64_tsc_shift == ~0U))
    ia64_tsc_calc_mult_shift(&_ia64_tsc_mult, &_ia64_tsc_shift);
  register U64_t cycles = IA64_tsc_ticks_since_start();
  register U64_t ns = (cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift;
  return (((ns / NSEC_PER_SEC) & 0xffffffffUL) << 32)
       | ((ns % NSEC_PER_SEC) & 0x3fffffffUL);
  /* Yes, we are purposefully ignoring durations of more than 4.2 billion seconds here! */
}

I think Linux should export the 'tsc_khz', 'mult' and 'shift' values somehow; then user-space libraries could have more confidence in using 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.

Regards, Jason

On 20/02/2017, Thomas Gleixner wrote:
> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>
>> CPUID:15H is available in user-space, returning the integers ( 7,
>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>> in detect_art() in tsc.c,
>
> By some definition of available. You can feed CPUID random leaf numbers and
> it will return something, usually the value of the last valid CPUID leaf,
> which is 13 on your CPU. A similar CPU model has
>
> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
>
> i.e. 7, 832, 832, 0
>
> Looks familiar, right?
>
> You can verify that with 'cpuid -1 -r' on your machine.
>
>> Linux does not think ART is enabled, and does not set the synthesized
>> CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would
>> not see this bit set .
>
> Rightfully so. This is a Haswell Core model.
>
>> if an e1000 NIC card had been installed, PTP would not be available.
>
> PTP is independent of the ART kernel feature. ART just provides enhanced
> PTP features. You are confusing things here.
>
> The ART feature as the kernel sees it is a hardware extension which feeds
> the ART clock to peripherals for timestamping and time correlation
> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so
> the kernel can make use of that correlation, e.g. for enhanced PTP
> accuracy.
>
> It's correct, that the NONSTOP_TSC feature depends on the availability of
> ART, but that has nothing to do with the feature bit, which solely
> describes the ratio between TSC and the ART frequency which is exposed to
> peripherals. That frequency is not necessarily the real ART frequency.
>
>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0
>> because the CPU will always get a fault reading the MSR since it has
>>
[PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
Patch to make tsc.c set X86_FEATURE_ART and set up the TSC_ADJUST MSR correctly on my "i7-4910MQ" CPU, which reports ( boot_cpu_data.cpuid_level==0x13 && boot_cpu_data.extended_cpuid_level==0x80000008 ), so the code didn't think it supported CPUID:15h, but it does .

Patch:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */
 
 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)
 
@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF) {
+		if (boot_cpu_data.extended_cpuid_level >=
+		    MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART) {
+			pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+				boot_cpu_data.cpuid_level, ART_CPUID_LEAF,
+				boot_cpu_data.extended_cpuid_level);
+		} else {
+			return;
+		}
+	}
+
+	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+	      &art_to_tsc_numerator, v, v+1);
+
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	    art_to_tsc_denominator < ART_MIN_DENOMINATOR) {
+		pr_info("Not using Intel ART for TSC - HYPERVISOR:%d NO NONSTOP_TSC:%d bad TSC/Crystal ratio denominator: %d.",
+			boot_cpu_has(X86_FEATURE_HYPERVISOR),
+			!boot_cpu_has(X86_FEATURE_NONSTOP_TSC),
+			art_to_tsc_denominator);
+		return;
+	}
+
+	/* will get fault on first read if nothing written yet */
+	if ((v[0] = rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset)) != 0) {
+		if ((v[1] = wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0)) != 0) {
+			pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n",
+				v[0], v[1]);
+			return;
+		} else {
+			/* perhaps initialize to -1 * current rdtsc value ? */
+			art_to_tsc_offset = 0;
+			pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",
+				art_to_tsc_offset);
+		}
+	}
+	/* Make this sticky over multiple CPU init calls */
+	pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n",
+		X86_FEATURE_ART, art_to_tsc_numerator, art_to_tsc_denominator,
+		art_to_tsc_offset);
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }

I originally reported this issue on bugzilla.kernel.org : bug # 194609 : https://bugzilla.kernel.org/show_bug.cgi?id=194609 , but it was not posted to the list ; then I posted it to the list, but Julia Lawall kindly suggested I should re-post with the patch inline, and include extra recipients, including the last person to modify tsc.c (Prarit), so I am doing so.
My CPU reports 'model name' as "Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz" , has 4 physical & 8 hyperthreading cores with a frequency scalable from 800000 to 3900000 kHz (/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq) , and flags :

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user-space, returning the integers ( 7, 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so in detect_art() in tsc.c, Linux does not think ART is enabled, and does not set the synthesized CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not see this bit set . If an e1000 NIC card had been installed, PTP would not be available.
[PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I originally reported this issue on bugzilla.kernel.org : bug # 194609 : https://bugzilla.kernel.org/show_bug.cgi?id=194609 , but it was not posted to the list .

My CPU reports 'model name' as "Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz" , has 4 physical & 8 hyperthreading cores with a frequency scalable from 800000 to 3900000 kHz (/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq) , and flags :

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user-space, returning the integers ( 7, 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so in detect_art() in tsc.c, Linux does not think ART is enabled, and does not set the synthesized CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not see this bit set . If an e1000 NIC card had been installed, PTP would not be available. Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be nowhere else in Linux, the code will always think X86_FEATURE_ART is 0, because the CPU will always get a fault reading the MSR since it has never been written.

So the attached patch makes tsc.c set X86_FEATURE_ART correctly, and set the TSC_ADJUST MSR to 0 if the rdmsr gets an error . Please consider applying it to a future linux version.
It would be nice if user-space programs that want to use the TSC with rdtsc / rdtscp instructions, such as the demo program attached to the bug report, could have confidence that Linux is actually generating the results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a predictable way from the TSC, by looking at the /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space use of TSC values, so that they can correlate TSC values with linux clock_gettime() values.

The patch applies to the linux kernel v4.8 & v4.9.10 GIT tags; the kernels build and run, and the demo program produces results like :

$ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 261 ts3 - ts2: 211 ns1: 0.00146 ns2: 0.01629
ts3 - ts2: 27 ns1: 0.00168
ts3 - ts2: 20 ns1: 0.00147
ts3 - ts2: 14 ns1: 0.00152
ts3 - ts2: 15 ns1: 0.00151
ts3 - ts2: 15 ns1: 0.00153
ts3 - ts2: 15 ns1: 0.00150
ts3 - ts2: 20 ns1: 0.00148
ts3 - ts2: 19 ns1: 0.00164
ts3 - ts2: 19 ns1: 0.00164
ts3 - ts2: 19 ns1: 0.00160
t1 - t0: 52901 - ns2: 0.53951

The value 'ts3 - ts2' is the number of nanoseconds measured by successive calls to 'rdtscp'; the 'ns1' value is the number of nanoseconds (shown as decimal seconds) measured by clock_gettime(CLOCK_MONOTONIC_RAW, &ts2) - clock_gettime(CLOCK_MONOTONIC_RAW, &ts1), casting each {ts.tv_sec, ts.tv_nsec} to a 128-bit long long integer . It shows a user-space program can read the TSC with a latency of @20ns, but can only measure times >= @140ns using Linux clock_gettime() on this CPU.

Attachment: x86_kernel_tsc-bz194609.patch
Re: please, where has xconfig KConf option documentation gone with linux 4.8's Qt5 / Qt4 xconfig ?
Aha, thanks! I never would have known this without being told - there is no visible indication that the symbol info pane exists at all until one tries to drag the lower right corner of the window north-eastwards - is this meant to be somehow an intuitive thing to do these days to view more info ? I did manage to view the option documentation with nconfig / using emacs to view the KConf files (preferable).

Really, it would be nice if xconfig had some 'View' menu and one could select View -> Option Documentation, or press a key over an option to view the documentation for it, and if the geometry of the different panes was correct at startup - the whole Option value pane initially appears on the far right-hand side, about 10 pixels wide, until resized ; and there really is no sign of the documentation pane at all until the lower right-hand corner is dragged.

Also, in the Help -> Introduction panel, it says :
"Toggling Show Debug Info under the Options menu will show the dependencies..."
but there is no "Show Debug Info" option on the Options menu - sounds like it might be a useful feature - should I be seeing a "Show Debug Info" option ? Why don't I see one ? Maybe the Options menu might be a good place to put an "Expand Option Documentation Pane" option ?

Thanks anyway for the info.
Regards, Jason

On 11/10/2016, Randy Dunlap wrote:
> [changed linux-config to linux-kbuild list]
>
> On 10/09/16 13:46, Jason Vas Dias wrote:
>> Hi -
>> I've been doing 'make xconfig' to configure the kernel for many years
>> now, and always there used to be some option documentation pane
>> populated with summary documentation for the specific option selected .
>> But now, when built for Qt 5.7.0, (also tried Qt 4.8 and GTK) there
>> is no option documentation pane - this is a real pain ! The option
>> documentation also is not displayed with any other gui, eg.
>> 'make menuconfig' / 'make gtkconfig' -
>> I'm sure it used to be . This is a regression IMHO .
>> How can I restore display of documentation for each selected option ?
>> Will older xconfig work for Linux 4.8 ? it appears not ...
>> Thanks in advance for any replies,
>> Jason
>
> That's odd. I see the help info in all of xconfig, gconfig, menuconfig,
> & nconfig.
>
> In xconfig, if the right hand side of the config window only lists some
> kernel config options and no symbol help/info, the symbol info portion
> may be hidden. Try pointing to the bottom of the right side of the
> window and hold down the left mouse button and then drag the mouse
> pointer upward to open the symbol info pane.
> At least that is what works for me.
>
> --
> ~Randy
>
please, where has xconfig KConf option documentation gone with linux 4.8's Qt5 / Qt4 xconfig ?
Hi -

I've been doing 'make xconfig' to configure the kernel for many years now, and always there used to be some option documentation pane populated with summary documentation for the specific option selected . But now, when built for Qt 5.7.0 (also tried Qt 4.8 and GTK), there is no option documentation pane - this is a real pain ! The option documentation also is not displayed with any other gui, eg. 'make menuconfig' / 'make gtkconfig' - I'm sure it used to be . This is a regression IMHO .

How can I restore display of documentation for each selected option ? Will older xconfig work for Linux 4.8 ? It appears not ...

Thanks in advance for any replies,
Jason
4.5.x drm/i915/ + drm/drm_irq + drm/radeon & ACPI problems doing vga_switcheroo switching & getting EDID modes for laptop hybrid graphics with Intel IGC & Radeon Neptune 8970M
I have not so far been able to get my Radeon 8970M discrete graphics card to go into graphics mode under Linux 4.4.0+ (tried 4.4.0, 4.5.0, 4.5.1, ...) on my Clevo KAPOK laptop x86_64 LFS system, which has :

CPU : Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
RAM : 16GB ; Disk : 1TB SATA + 256GB SSD .

$ lspci -nn | grep VGA
00:02.0 VGA compatible controller [0300]: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller [8086:0416] (rev 06)
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Neptune XT [Radeon HD 8970M] [1002:6801]

So far, the Neptune card will only go into graphics mode when driven by the closed-source FGLRX driver under a Linux 3.10 / RHEL-7 clone - I'm trying to get it working under Linux 4.4.0+, whose 'drivers/drm/radeon' driver claims to support the card . Persistently, the Xorg server v1.18.3 with the Xorg Radeon driver v7.7.0 (latest stable GIT versions) reports "No modes" and is unable to discover any probed EDID display modes for the card, as shown by this Xorg.0.log excerpt :

[ 1503.772] (II) Loading /usr/lib64/xorg/modules/drivers/radeon_drv.so
[ 1503.773] (II) Module radeon: vendor="X.Org Foundation"
[ 1503.773]	compiled for 1.18.3, module version = 7.7.0
[ 1503.775]	Module class: X.Org Video Driver
[ 1503.775]	ABI class: X.Org Video Driver, version 20.0
[ 1503.775] (II) LoadModule: "intel"
[ 1503.777] (II) Loading /usr/lib64/xorg/modules/drivers/intel_drv.so
[ 1503.778] (II) Module intel: vendor="X.Org Foundation"
[ 1503.778]	compiled for 1.18.3, module version = 2.99.917
[ 1503.779]	Module class: X.Org Video Driver
[ 1503.780]	ABI class: X.Org Video Driver, version 20.0
...
[ 1503.788] (II) RADEON: Driver for ATI Radeon chipsets: ...
[ 1503.957] (II) [KMS] Kernel modesetting enabled.
[ 1503.957] (II) intel(1): Using Kernel Mode Setting driver: i915, version 1.6.0 20151218
[ 1503.957] (EE) Screen 1 deleted because of no matching config section.
[ 1503.957] (II) UnloadModule: "intel"
[ 1503.957] (II) RADEON(0): RADEONPreInit_KMS
[ 1503.957] (==) RADEON(0): Depth 24, (--) framebuffer bpp 32
[ 1503.957] (II) RADEON(0): Pixel depth = 24 bits stored in 4 bytes (32 bpp pixmaps)
[ 1503.957] (==) RADEON(0): Default visual is TrueColor
[ 1503.957] (**) RADEON(0): Option "DRI" "3"
[ 1503.957] (==) RADEON(0): RGB weight 888
[ 1503.957] (II) RADEON(0): Using 8 bits per RGB (8 bit DAC)
[ 1503.957] (--) RADEON(0): Chipset: "PITCAIRN" (ChipID = 0x6801)
[ 1503.957] (II) Loading sub module "fb"
[ 1503.957] (II) LoadModule: "fb"
[ 1503.957] (II) Loading /usr/lib64/xorg/modules/libfb.so
[ 1503.958] (II) Module fb: vendor="X.Org Foundation"
[ 1503.958]	compiled for 1.18.3, module version = 1.0.0
[ 1503.958]	ABI class: X.Org ANSI C Emulation, version 0.4
[ 1503.958] (II) Loading sub module "dri2"
[ 1503.958] (II) LoadModule: "dri2"
[ 1503.958] (II) Module "dri2" already built-in
[ 1503.958] (II) Loading sub module "glamoregl"
[ 1503.958] (II) LoadModule: "glamoregl"
[ 1503.958] (II) Loading /usr/lib64/xorg/modules/libglamoregl.so
[ 1503.958] (II) Module glamoregl: vendor="X.Org Foundation"
[ 1503.958]	compiled for 1.18.3, module version = 0.6.0
[ 1503.958]	ABI class: X.Org ANSI C Emulation, version 0.4
[ 1503.958] (II) glamor: OpenGL accelerated X.org driver based.
[ 1504.023] (II) glamor: EGL version 1.4 (DRI2):
[ 1504.023] (II) RADEON(0): glamor detected, initialising EGL layer.
[ 1504.023] (II) RADEON(0): KMS Color Tiling: enabled
[ 1504.023] (II) RADEON(0): KMS Color Tiling 2D: enabled
[ 1504.024] (II) RADEON(0): KMS Pageflipping: enabled
[ 1504.024] (II) RADEON(0): SwapBuffers wait for vsync: enabled
[ 1504.024] (II) RADEON(0): Initializing outputs ...
[ 1504.024] (II) RADEON(0): 0 crtcs needed for screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 0 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 1 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 2 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 3 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 4 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 5 to this screen.
[ 1504.024] (WW) RADEON(0): No outputs definitely connected, trying again...
[ 1504.024] (WW) RADEON(0): Unable to find connected outputs - setting 1024x768 initial framebuffer
[ 1504.024] (II) RADEON(0): Using default gamma of (1.0, 1.0, 1.0) unless otherwise stated.
[ 1504.024] (II) RADEON(0): mem size init: gart size :7fbcc000 vram size: s:1 visible:ff916000
[ 1504.024] (==) RADEON(0): DPI set to (96, 96)
[ 1504.024] (II) Loading sub module "ramdac"
[ 1504.024] (II) LoadModule: "ramdac"
[ 1504.024] (II) Module "ramdac" already built-in
[ 1504.024] (EE) RADEON(0): No modes.
[ 1504.024] (II) RADEON(0): RADEONFreeScreen
[ 1504.024] (II) UnloadModule: "radeon"
[ 1504.024] (II) UnloadSubModule: "glamoregl"
[ 1504.024] (II) Unloading glamoregl
[ 1504.02
how to unmount an rbind mount ?
Good day -

Please could anyone advise - once one has mounted an alias mount with the 'rbind' option, so that mounts underneath it are also mounted under the new path, how can one unmount that filesystem safely without un-mounting the original mountpoints ?

For example, I do this for chroots :
$ for d in /dev /proc /sys; do mount -o rbind $d $chroot/$d; done

Now, if I want to unmount the chroot device, I cannot just do eg. :
$ umount ${chroot}/dev
because this will fail since /dev/pts, /dev/mqueue etc. are still mounted ; if I do:
$ umount -R ${chroot}/dev
or
$ umount ${chroot}/dev/pts
then /dev/pts will be unmounted from the root device filesystem - and the situation is much more horrid when trying to unmount ${chroot}/sys or ${chroot}/run .

Personally, I think this is rather buggy behaviour by Linux, since I told the kernel I only want to BIND the path ${chroot}/dev to /dev - and recursively bind names beneath ${chroot}/dev/* to /dev/*, with the 'rbind' option, ie. to make an alias of ${chroot}/dev/* for /dev/* - NOT to actually re-mount the devices there . So I think umount should be clever enough to 'un-bind' sub-mounts of mounts with the 'rbind' option, rather than unmount the devices from the root filesystem, which is what currently happens. It does make chroot filesystems very difficult to unmount safely ! Linux badly needs a better umount, IMHO .

Are there any plans to improve umount behaviour wrt rbind mounts ?
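For what it's worth, the behaviour described above can be tamed with mount propagation flags rather than a smarter umount: marking the rbind mounts as recursive slaves stops umounts under the chroot from propagating back to the real /dev, /proc and /sys, and a lazy umount then detaches the whole subtree. A sketch (the $chroot path is a placeholder; newer util-linux also accepts `umount -R` for a recursive unmount):

```shell
chroot=/mnt/chroot   # placeholder path

# set up: rbind, then cut propagation back to the host
for d in /dev /proc /sys; do
    mount --rbind "$d" "$chroot$d"
    mount --make-rslave "$chroot$d"   # umounts here no longer hit the host
done

# tear down: lazily detach each subtree; the originals stay mounted
for d in /sys /proc /dev; do
    umount -l "$chroot$d"
done
```

With `--make-rslave` in place, even `umount ${chroot}/dev/pts` only removes the alias, not the original /dev/pts.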
Re: how to build 2.6.x based kernel with perf ?
Here's a patch that fixes the issue for me . Also attached to Red Hat bugzilla : https://bugzilla.redhat.com/show_bug.cgi?id=1173649 On 12/12/14, Jason Vas Dias wrote: > Good day - > I am trying to build the latest RHEL kernel from the source RPM, > but this fails because the "perf" component cannot build . > The build gets as far as building the modules and debug flavour > of the kernel, but fails for the 'perf' target with : > > > + make -j4 -C tools/perf -s V=1 prefix=/usr all > CHK -fstack-protector-all > CHK -Wstack-protector > CHK -Wvolatile-register-var > CHK -D_FORTIFY_SOURCE=2 > CHK bionic > :1:31: error: android/api-level.h: No such file or directory > : In function 'main': > :5: error: '__ANDROID_API__' undeclared (first use in this function) > :5: error: (Each undeclared identifier is reported only once > :5: error: for each function it appears in.) > CHK libelf > CHK libdw > CHK -DLIBELF_MMAP > CHK -DHAVE_ELF_GETPHDRNUM > CHK -DLIBELF_MMAP > CHK libunwind > CHK libaudit > cc1: warnings being treated as errors > : In function 'main': > :5: error: implicit declaration of function 'printf' > :5: error: incompatible implicit declaration of built-in > function 'printf' > config/Makefile:240: No libaudit.h found, disables 'trace' tool, > please install audit-libs-devel or libaudit-dev > CHK libslang > CHK gtk2 > CHK -DHAVE_GTK_INFO_BAR > CHK perl > CHK python > CHK python version > CHK libbfd > CHK -DHAVE_STRLCPY > /tmp/ccOCUfYU.o: In function `main': > :(.text+0x14): undefined reference to `strlcpy' > collect2: ld returned 1 exit status > CHK -DHAVE_ON_EXIT > CHK -DBACKTRACE_SUPPORT > CHK libnuma > :1:18: error: numa.h: No such file or directory > :2:20: error: numaif.h: No such file or directory > cc1: warnings being treated as errors > : In function 'main': > :6: error: implicit declaration of function 'numa_available' > :6: error: nested extern declaration of 'numa_available' > config/Makefile:422: No numa.h found, disables 'perf bench numa mem' > 
benchmark, please install numa-libs-devel or libnuma-dev > * new build flags or prefix > PERF_VERSION = 2.6.32-504.1.3.el6.x86_64.debug > * new build flags or cross compiler > cc1: warnings being treated as errors > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:113: > error: no previous prototype for 'breakpoint' > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:119: > error: no previous prototype for 'alloc_arg' > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'find_cmdline': > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:183: > error: return discards qualifiers from pointer target type > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:186: > error: return discards qualifiers from pointer target type > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:195: > error: return discards qualifiers from pointer target type > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'type_size': > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: > error: missing initializer > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: > error: (near initialization for 'table[9].type') > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'event_read_fields': > 
/home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1519: > error: signed and unsigned type in conditional expression > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'arg_num_eval': > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: > err
how to build 2.6.x based kernel with perf ?
Good day -

I am trying to build the latest RHEL kernel from the source RPM, but this fails because the "perf" component cannot build . The build gets as far as building the modules and debug flavour of the kernel, but fails for the 'perf' target with :

+ make -j4 -C tools/perf -s V=1 prefix=/usr all
CHK -fstack-protector-all
CHK -Wstack-protector
CHK -Wvolatile-register-var
CHK -D_FORTIFY_SOURCE=2
CHK bionic
:1:31: error: android/api-level.h: No such file or directory
: In function 'main':
:5: error: '__ANDROID_API__' undeclared (first use in this function)
:5: error: (Each undeclared identifier is reported only once
:5: error: for each function it appears in.)
CHK libelf
CHK libdw
CHK -DLIBELF_MMAP
CHK -DHAVE_ELF_GETPHDRNUM
CHK -DLIBELF_MMAP
CHK libunwind
CHK libaudit
cc1: warnings being treated as errors
: In function 'main':
:5: error: implicit declaration of function 'printf'
:5: error: incompatible implicit declaration of built-in function 'printf'
config/Makefile:240: No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev
CHK libslang
CHK gtk2
CHK -DHAVE_GTK_INFO_BAR
CHK perl
CHK python
CHK python version
CHK libbfd
CHK -DHAVE_STRLCPY
/tmp/ccOCUfYU.o: In function `main':
:(.text+0x14): undefined reference to `strlcpy'
collect2: ld returned 1 exit status
CHK -DHAVE_ON_EXIT
CHK -DBACKTRACE_SUPPORT
CHK libnuma
:1:18: error: numa.h: No such file or directory
:2:20: error: numaif.h: No such file or directory
cc1: warnings being treated as errors
: In function 'main':
:6: error: implicit declaration of function 'numa_available'
:6: error: nested extern declaration of 'numa_available'
config/Makefile:422: No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev
* new build flags or prefix
PERF_VERSION = 2.6.32-504.1.3.el6.x86_64.debug
* new build flags or cross compiler
cc1: warnings being treated as errors
/home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:113: error: no previous prototype for 'breakpoint' /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:119: error: no previous prototype for 'alloc_arg' /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'find_cmdline': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:183: error: return discards qualifiers from pointer target type /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:186: error: return discards qualifiers from pointer target type /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:195: error: return discards qualifiers from pointer target type /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'type_size': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: error: missing initializer /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: error: (near initialization for 'table[9].type') /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'event_read_fields': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1519: error: signed and unsigned type in conditional expression 
/home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'arg_num_eval': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: error: enumeration value 'PRINT_HEX' not handled in switch /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: error: enumeration value 'PRINT_DYNAMIC_ARRAY' not handled in switc /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: error: enumeration value 'PRINT_FUNC' not handled in switch /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'arg_eval': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2235: error: enumeration value 'PRINT_HEX' not handled in switch /home/jvasdias/rpmbuild/B
Re: mount BTRFS filesystems created with 3.8+ under 2.6.32 kernels ?
Of course the solution was to have created the filesystem in the first place with 'mkfs.btrfs -O ^extref'. Found this after some more googling... Shouldn't that be the default?
Regards, Jason

On 9/22/14, Jason Vas Dias wrote:
> Good day -
>
> I wonder if there is a GIT repository somewhere with a backport of the BTRFS
> kernel modules that would allow BTRFS filesystems created with a 3.8 kernel
> to be mounted on a 2.6.32 kernel.
>
> When I try this, the 2.6.32 kernel crashes with the message:
> 'BTRFS: couldn't mount because of unsupported optional features (40)'
> (kernel-2.6.32-431.29.2.el6.x86_64 from RHEL 6.4+).
>
> The same filesystem mounts fine under Oracle EL6, which now ships
> kernel-uek 3.8+. Has anyone tried to backport the 3.8 BTRFS capabilities
> to 2.6.32, or is there any way I can remove "Option 40" to get it to
> mount without crashing?
> It is a very small BTRFS filesystem with a root filesystem and a few
> snapshots. I did not specify any BTRFS options in:
> $ mkfs.btrfs /dev/sda9
> ... # mount on /mnt/btr & create some files
> $ btrfs subvolume snapshot -r /mnt/btr /mnt/btr/root-0
> $ btrfs subvolume snapshot /mnt/btr /mnt/btr/root-w-0
> Now I can mount /dev/sda9 under any 3.8+ kernel, but not under 2.6.32.
>
> Thanks in advance for any replies,
> Best Regards, Jason Vas Dias
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
mount BTRFS filesystems created with 3.8+ under 2.6.32 kernels ?
Good day -

I wonder if there is a GIT repository somewhere with a backport of the BTRFS kernel modules that would allow BTRFS filesystems created with a 3.8 kernel to be mounted on a 2.6.32 kernel.

When I try this, the 2.6.32 kernel crashes with the message:
'BTRFS: couldn't mount because of unsupported optional features (40)'
(kernel-2.6.32-431.29.2.el6.x86_64 from RHEL 6.4+).

The same filesystem mounts fine under Oracle EL6, which now ships kernel-uek 3.8+. Has anyone tried to backport the 3.8 BTRFS capabilities to 2.6.32, or is there any way I can remove "Option 40" to get it to mount without crashing?

It is a very small BTRFS filesystem with a root filesystem and a few snapshots. I did not specify any BTRFS options in:
$ mkfs.btrfs /dev/sda9
... # mount on /mnt/btr & create some files
$ btrfs subvolume snapshot -r /mnt/btr /mnt/btr/root-0
$ btrfs subvolume snapshot /mnt/btr /mnt/btr/root-w-0
Now I can mount /dev/sda9 under any 3.8+ kernel, but not under 2.6.32.

Thanks in advance for any replies,
Best Regards, Jason Vas Dias
how to build kernel-firmware and kernel-doc RPMs from Red Hat EL6 kernel.spec files ?
Sorry for this newbie question, but it's been a while since I built the kernel from the Red Hat source RPMs, and there appears to be no way to build the kernel-firmware-*.noarch.rpm package without modifying the spec file, which contains:

# we don't want a .config file when building firmware: it just confuses the build system
%define build_firmware \
mv .config .config.firmware_save \
make INSTALL_FW_PATH=$RPM_BUILD_ROOT/lib/firmware firmware_install \
mv .config.firmware_save .config

When intending to build the kernel-doc and kernel-firmware noarch RPMs, after the x86_64 RPMs have built successfully, with:

$ rpmbuild --target=noarch --rebuild $path_to_kernel_srpm --define '_with_docs 1' --define '_with_firmware 1' --define '_without_perf 1'

... this fails:

+ cd ${BUILDROOT}/lib/modules/
+ ln -s kabi-rhel65 kabi-current
+ mv .config .config.firmware_save
+ make INSTALL_FW_PATH=${BUILDROOT}/lib/firmware firmware_install
scripts/kconfig/conf -s arch/x86/Kconfig
***
*** You have not yet configured your kernel!
*** (missing kernel config file ".config")
***
*** Please run some configurator (e.g. "make oldconfig" or
*** "make menuconfig" or "make xconfig").
***
${BUILD}/scripts/kconfig/Makefile:30: recipe for target 'silentoldconfig' failed
make[2]: *** [silentoldconfig] Error 1
${BUILD}/Makefile:484: recipe for target 'silentoldconfig' failed
make[1]: *** [silentoldconfig] Error 2
IHEX firmware/iwlwifi-105-6.ucode
make[1]: *** No rule to make target '${BUILDROOT}/lib/firmware/./', needed by '${BUILDROOT}/lib/firmware/iwlwifi-105-6.ucode'. Stop.
Makefile:1112: recipe for target 'firmware_install' failed
make: *** [firmware_install] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.3a7tvf (%install)

(with the $BUILD and $BUILDROOT strings representing actual paths).
So I have to edit that part of the spec file to read:

%define build_firmware \
make INSTALL_FW_PATH=$RPM_BUILD_ROOT/lib/firmware firmware_install

and then I can 'rpmbuild --target=noarch -ba $path_to_modified_spec_file' OK, and the firmware and documentation RPMs are produced.

Has anyone found a way of avoiding having to edit the Red Hat spec file in this manner to build the noarch kernel-doc and kernel-firmware RPMs? This happens with every RHEL 6.4 kernel I've built so far:
kernel-2.6.32-431.23.3.el6.src.rpm
kernel-2.6.32-431.29.2.el6.src.rpm

Incidentally, has anyone found a way to build the Red Hat RPMs without "--define '_without_perf 1'"? It seems that enabling perf makes the build look for libunwind and the Android SDK headers, which are not part of the kernel's BuildRequires.

Thanks in advance for any helpful replies,
Best Regards, Jason Vas Dias
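One way to avoid hand-editing the spec each time is to patch it non-interactively before calling rpmbuild. A sketch, assuming the %define layout quoted above (the function name and file paths are illustrative):

```shell
# patch_spec: emit a copy of a kernel spec file with the two
# ".config" save/restore lines removed from the build_firmware macro,
# leaving only the firmware_install invocation. Deleting whole lines is
# safe here because each macro line ends in a "\" continuation, so the
# remaining lines still form one valid multi-line %define.
patch_spec() {
  sed -e '/mv \.config \.config\.firmware_save/d' \
      -e '/mv \.config\.firmware_save \.config/d' "$1"
}
```

Usage (paths are examples): patch_spec kernel.spec > kernel-noarch.spec && rpmbuild --target=noarch -ba kernel-noarch.spec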
HP 6715b laptop's wireless radio "on" LED went off after first boot of 3.9.6 (upgraded from 3.4.4) - please help / any ideas?
After building and installing the 3.9.6 kernel & modules on my 2.2GHz HP 6715b x86-64 Turion dual-core laptop, which had run Linux with no b43 wireless problems since 2007, I now have no access to its onboard Broadcom 4311 wireless radio.

I had always used the b43 driver with the correct firmware installed under /lib/firmware/b43 with b43-fwcutter, as per the instructions at http://wireless.kernel.org/en/users/Drivers/b43 , which I've just now redone. But since booting 3.9.6, which I believe resulted in a firmware download via udev at first boot, the wireless radio "on" blue LED indicator goes off after BIOS POST.

It has always been the case that if the blue LED indicator is off after BIOS POST, the kernel does not see the device, and I have no wireless until a hard poweroff and pressing the touch-sensitive wireless-on button during BIOS POST. But now the wireless LED goes on for about 1 second during BIOS POST and never comes on again, and there is no response to touching the wireless-on button after reboot, though there is to the other buttons next to it.

In short, I've lost wireless access (and home internet access for my PC - I'm sending this from my mobile). Can anyone help? How can I force the card to download the re-installed b43-fwcutter firmware if the device no longer appears in lspci output? Is there any way to force the kernel to ignore the wireless button (it could be that the kernel & BIOS think this button is in the off state - any way to force its state to "on")?

Any ideas / suggestions would be much appreciated.
Thanks & Regards, Jason
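A first diagnostic step in situations like this is to distinguish "device not enumerated on PCI" (disabled below the OS, so no driver or rfkill setting can help) from "device visible but soft/hard blocked". A hedged sketch - the PCI ID 14e4:4311 is Broadcom's BCM4311, and the function name is made up for illustration:

```shell
# bcm4311_state: reads `lspci -n` output on stdin and reports whether
# the BCM4311 (vendor:device 14e4:4311) is visible on the PCI bus.
# If it is absent, the radio was turned off at BIOS/EC level and the
# kernel never sees it; if present, check rfkill and the b43 firmware:
#   for k in /sys/class/rfkill/rfkill*; do cat "$k"/name "$k"/soft "$k"/hard; done
bcm4311_state() {
  if grep -q '14e4:4311'; then
    echo "visible"        # driver/firmware problem - check dmesg for b43
  else
    echo "not-enumerated" # hardware/BIOS switch - rfkill will not list it
  fi
}
```

Usage on a live system: lspci -n | bcm4311_state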
[PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
This patch adds a new acpi.thermal.temp_b4_trip=1 setting, which causes the temperature to be read before evaluation of thermal trip points (the old default). This mode should be selected automatically by DMI match if the system identifies as "HP Compaq 6715b".

Please consider applying a patch like the one attached to fix the issue reported recently in the lkml thread "Re: PROBLEM: Performance drop", whereby it was found that HP 6715b laptops (which have 2.2GHz dual-core AMD x86_64 K8 CPUs) get stuck running the CPU at 800MHz and cannot switch frequency. I have verified that this is still the case with the v3.4.4 tagged "stable" kernel, and with v3.5-rc6, which this is a patch against (i.e. against commit bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a : "Linux 3.5-rc6"):

diff --git a/Makefile b/Makefile
index 81ea154..bf02707 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 5
 SUBLEVEL = 0
-EXTRAVERSION = -rc5
+EXTRAVERSION = -rc6
 NAME = Saber-toothed Squirrel

 # *DOCUMENTATION*
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..13d3b22 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static bool temp_b4_trip;
+module_param(temp_b4_trip, bool, 0644);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
Re: PROBLEM: Performance drop
Hi - any progress on this, or on the patch I submitted for it? Please see the enclosed - apologies for my being forced to use gmail, which has mandatory line wrap. Please do something about restoring correct thermal operation on x86_64 K8s with HP BIOS!
Thanks & Regards, Jason

Re: [PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
From: Jason Vas Dias, Jul 9, to: Rusty, linux-kernel, Andreas, Matthew, Len, Comrade

Thanks Rusty - sorry I didn't see your email until now - revised patch addressing your comments attached. BTW, sorry about the word wrap on the initial posting - should I attach a '.patch' file or inline? Trying both.

The revised patch (against commit bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a, Author: Linus Torvalds, Date: Sat Jul 7 17:23:56 2012 -0700, "Linux 3.5-rc6"):

$ git diff bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a > /tmp/acpi_thermal_temp_b4_trip.patch
$ cat /tmp/acpi_thermal_temp_b4_trip.patch
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..13d3b22 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static bool temp_b4_trip;
+module_param(temp_b4_trip, bool, 0644);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
+			return result;
+
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
+			return result;
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
 	if (!result)
 		tz->flags.cooling_mode = 1;
-
+
 	/* Get default polling frequency [_TZP] (optional) */
 	if (tzp)
 		tz->polling_frequency = tzp;
 	else
 		acpi_thermal_get_polling_frequency(tz);
-
+
 	return 0;
 }

@@ -1110,6 +1128,14 @@ static int thermal_psv(const struct dmi_system_id *d) {
 	return 0;
 }

+static int thermal_temp_b4_trip(const struct dmi_system_id *d) {
+
+	printk(KERN_NOTICE "ACPI: %s detected: : "
+	       "getting temperature before trip point initialisation\n", d->ident);
+	temp_b4_trip = TRUE;
+	return 0;
+}
+
 static struct dmi_system_id thermal_dmi_table[] __initdata = {
 	/*
 	 * Award BIOS on this AOpen makes thermal control almost worthless.
@@ -1147,6 +1173,14 @@ static struct dmi_system_id thermal_dmi_table[] __initdata = {
 		DMI_MATCH(DMI_BOARD_NAME, "7ZX"),
 		},
 	},
+	{
+	 .callback = thermal_temp_b4_trip,
+	 .ident = "HP 6715b laptop",
+	 .matches = {
+		DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
+	 },
+	},
 	{}
 };

Incidentally, there are still plenty of cpufreq- and temperature-related issues on this platform: with the "ondemand" or "performance" governors, placing a large load on the system (e.g. building gcc-4.7.1) makes the CPU switch to its highest frequency, but not switch down after the 65 degree trip point has been toggled once. And once the trip point has been reached once and the temperature falls below 65, returning the CPU freq to 2GHz, the reported temperature seems to be stuck at 62 degrees even though the base of the laptop nearly burns my hand.
Re: [PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
Thanks Rusty - sorry I didn't see your email until now - revised patch addressing your comments attached. BTW, sorry about the word wrap on the initial posting - should I attach a '.patch' file or inline? Trying both.

The revised patch (against commit bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a, Author: Linus Torvalds, Date: Sat Jul 7 17:23:56 2012 -0700, "Linux 3.5-rc6"):

$ git diff bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a > /tmp/acpi_thermal_temp_b4_trip.patch
$ cat /tmp/acpi_thermal_temp_b4_trip.patch
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..13d3b22 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static bool temp_b4_trip;
+module_param(temp_b4_trip, bool, 0644);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
+			return result;
+
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
+			return result;
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
 	if (!result)
 		tz->flags.cooling_mode = 1;
-
+
 	/* Get default polling frequency [_TZP] (optional) */
 	if (tzp)
 		tz->polling_frequency = tzp;
 	else
 		acpi_thermal_get_polling_frequency(tz);
-
+
 	return 0;
 }

@@ -1110,6 +1128,14 @@ static int thermal_psv(const struct dmi_system_id *d) {
 	return 0;
 }

+static int thermal_temp_b4_trip(const struct dmi_system_id *d) {
+
+	printk(KERN_NOTICE "ACPI: %s detected: : "
+	       "getting temperature before trip point initialisation\n", d->ident);
+	temp_b4_trip = TRUE;
+	return 0;
+}
+
 static struct dmi_system_id thermal_dmi_table[] __initdata = {
 	/*
 	 * Award BIOS on this AOpen makes thermal control almost worthless.
@@ -1147,6 +1173,14 @@ static struct dmi_system_id thermal_dmi_table[] __initdata = {
 		DMI_MATCH(DMI_BOARD_NAME, "7ZX"),
 		},
 	},
+	{
+	 .callback = thermal_temp_b4_trip,
+	 .ident = "HP 6715b laptop",
+	 .matches = {
+		DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
+	 },
+	},
 	{}
 };

Incidentally, there are still plenty of cpufreq- and temperature-related issues on this platform: with the "ondemand" or "performance" governors, placing a large load on the system (e.g. building gcc-4.7.1) makes the CPU switch to its highest frequency, but not switch down after the 65 degree trip point has been toggled once. And once the trip point has been reached once and the temperature falls below 65, returning the CPU freq to 2GHz, the reported temperature seems to be stuck at 62 degrees even though the base of the laptop nearly burns my hand.

So I get emergency overheating reboots unless I manually run my cpufreq & temperature monitoring scripts - which, if the CPU freq is 2GHz, now have to lower the frequency to 800MHz for 2 seconds every 8 seconds regardless of what temperature is reported.

On Mon, Jul 9, 2012 at 1:30 AM, Rusty Russell wrote:
> On Sun, 8 Jul 2012 19:50:54 +0100, Jason Vas Dias wrote:
>> This patch adds a new
[PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
This patch adds a new acpi.thermal.temp_b4_trip=1 setting, which causes the temperature to be read before evaluation of thermal trip points (the old default); this mode should be selected automatically by DMI match if the system identifies as "HP Compaq 6715b".

Please consider applying a patch like the one attached to fix the issue reported recently in the lkml thread "Re: PROBLEM: Performance drop", whereby it was found that HP 6715b laptops (which have 2.2GHz dual-core AMD x86_64 K8 CPUs) get stuck running the CPU at 800MHz and cannot switch frequency. I have verified that this is still the case with the v3.4.4 tagged "stable" kernel.

diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..de2b164 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static int temp_b4_trip;
+module_param(temp_b4_trip, int, 0);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
+			return result;
+
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
+			return result;
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
 	if (!result)
 		tz->flags.cooling_mode = 1;
-
+
 	/* Get default polling frequency [_TZP] (optional) */
 	if (tzp)
 		tz->polling_frequency = tzp;
 	else
 		acpi_thermal_get_polling_frequency(tz);
-
+
 	return 0;
 }

@@ -1110,6 +1128,14 @@ static int thermal_psv(const struct dmi_system_id *d) {
 	return 0;
 }

+static int thermal_temp_b4_trip(const struct dmi_system_id *d) {
+
+	printk(KERN_NOTICE "ACPI: %s detected: : "
+	       "getting temperature before trip point initialisation\n", d->ident);
+	temp_b4_trip = 1;
+	return 0;
+}
+
 static struct dmi_system_id thermal_dmi_table[] __initdata = {
 	/*
 	 * Award BIOS on this AOpen makes thermal control almost worthless.
@@ -1147,6 +1173,14 @@ static struct dmi_system_id thermal_dmi_table[] __initdata = {
 		DMI_MATCH(DMI_BOARD_NAME, "7ZX"),
 		},
 	},
+	{
+	 .callback = thermal_temp_b4_trip,
+	 .ident = "HP 6715b laptop",
+	 .matches = {
+		DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
+	 },
+	},
 	{}
 };

(attachment: acpi_thermal_HP6715b.patch - binary data)
Re: PROBLEM: Performance drop
Sorry, of course the commit I backed out was: 9bcb8118965ab4631a65ee0726e6518f75cda6c5.

On Sat, Jul 7, 2012 at 3:40 PM, Jason Vas Dias wrote:
> I can confirm that the AMD Turion X2 2.2GHz HP Compaq 6715b "business" x86_64 K8
> dual-core laptops circa 2007 DO get stuck in 800MHz mode and cannot switch out of
> it after booting the "stable" "v3.4.4" tagged kernel.
>
> I followed the containing post and reverted commit ff74ae50f01ee67764564815c023c362c87ce18b :
>
> Commit d51cdad33bb5bb370c05129f7c7f3a16a55eff40
> Author: root
> Date: Fri Jul 6 18:57:03 2012 +
>
>     Revert "ACPI: Evaluate thermal trip points before reading temperature"
>
>     This reverts commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5.
>
> commit ff74ae50f01ee67764564815c023c362c87ce18b
> Author: Greg Kroah-Hartman
> Date: Fri Jun 22 11:37:50 2012 -0700
>
> And wow! What a difference - back to a circa 2007 machine versus a circa 1987 machine.
>
> Not too many of us left around trying to run the latest version of Linux on nearly
> 5-year-old hardware, I guess, but still - please can you restore correct Linux
> cpufreq & thermal operation on old-style AMD K8 CPUs? They do seem to depend on
> the temperature being set BEFORE first entry.
>
> Thanks & Regards,
> Jason Vas Dias (a Software Engineer)
>
> On Wed, May 30, 2012 at 1:43 PM, Andreas Herrmann wrote:
>> On Wed, May 30, 2012 at 03:20:27AM +0700, Comrade DOS wrote:
>>> > Unfortunately you have used acpi=debug instead of apic=debug. So I
>>> > can't compare I/O APIC configurations between the different test
>>> > scenarios.
>>>
>>> Sorry for this mistake.
>>
>> No problem.
>>
>> The logs show no difference in IO-APIC pin usage.
>> So it's not the old problem ...
>>
>> Comparing both logs I found the following differences
>> (most other stuff seems just to be changed formatting):
>>
>> -ACPI: Thermal Zone [TZ1] (67 C)
>> +ACPI: Thermal Zone [TZ1] (62 C)
>>
>> I think what's shown is the temperature value, which just differed
>> between the boots. But that made me look at acpi/thermal.c, where the
>> messages came from. The only change between 3.3 and 3.4 is this commit:
>>
>> commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5
>> Author: Matthew Garrett
>> Date: Wed Feb 1 10:26:54 2012 -0500
>>
>>     ACPI: Evaluate thermal trip points before reading temperature
>>
>> I'd suggest to do a test with this patch reverted. Maybe this change
>> to fix issues with one HP laptop (re-)introduced the trouble with your
>> system.
>>
>> If reverting the patch helps we have to take a closer look at your
>> ACPI tables. So can you please do a
>>
>> # git revert 9bcb8118965ab4631a65ee0726e6518f75cda6c5
>>
>> on top of v3.4, rebuild your kernel, and rerun your test (with
>> apic=debug; this allows easier diff to dmesg output of your previous
>> test runs).
>>
>> In any case it also would be good to have the ACPI tables from your
>> system. So please also use
>> # acpidump > acpidump.3.3   (using the 3.3.x kernel)
>> # acpidump > acpidump.3.4   (using the unmodified 3.4 version)
>>
>> and send all files as attachments to your mail.
>>
>> This will allow me to look at your thermal zone definitions in the
>> working and non-working case.
>>
>> Thanks,
>> Andreas
Re: PROBLEM: Performance drop
I can confirm that the AMD Turion X2 2.2GHz HP Compaq 6715b "business" x86_64 K8 dual-core laptops circa 2007 DO get stuck in 800MHz mode and cannot switch out of it after booting the "stable" "v3.4.4" tagged kernel.

I followed the containing post and reverted commit ff74ae50f01ee67764564815c023c362c87ce18b :

Commit d51cdad33bb5bb370c05129f7c7f3a16a55eff40
Author: root
Date: Fri Jul 6 18:57:03 2012 +

    Revert "ACPI: Evaluate thermal trip points before reading temperature"

    This reverts commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5.

commit ff74ae50f01ee67764564815c023c362c87ce18b
Author: Greg Kroah-Hartman
Date: Fri Jun 22 11:37:50 2012 -0700

And wow! What a difference - back to a circa 2007 machine versus a circa 1987 machine.

Not too many of us left around trying to run the latest version of Linux on nearly 5-year-old hardware, I guess, but still - please can you restore correct Linux cpufreq & thermal operation on old-style AMD K8 CPUs? They do seem to depend on the temperature being set BEFORE first entry.

Thanks & Regards,
Jason Vas Dias (a Software Engineer)

On Wed, May 30, 2012 at 1:43 PM, Andreas Herrmann wrote:
> On Wed, May 30, 2012 at 03:20:27AM +0700, Comrade DOS wrote:
>> > Unfortunately you have used acpi=debug instead of apic=debug. So I
>> > can't compare I/O APIC configurations between the different test
>> > scenarios.
>>
>> Sorry for this mistake.
>
> No problem.
>
> The logs show no difference in IO-APIC pin usage.
> So it's not the old problem ...
>
> Comparing both logs I found the following differences
> (most other stuff seems just to be changed formatting):
>
> -ACPI: Thermal Zone [TZ1] (67 C)
> +ACPI: Thermal Zone [TZ1] (62 C)
>
> I think what's shown is the temperature value, which just differed
> between the boots. But that made me look at acpi/thermal.c, where the
> messages came from. The only change between 3.3 and 3.4 is this commit:
>
> commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5
> Author: Matthew Garrett
> Date: Wed Feb 1 10:26:54 2012 -0500
>
>     ACPI: Evaluate thermal trip points before reading temperature
>
> I'd suggest to do a test with this patch reverted. Maybe this change
> to fix issues with one HP laptop (re-)introduced the trouble with your
> system.
>
> If reverting the patch helps we have to take a closer look at your
> ACPI tables. So can you please do a
>
> # git revert 9bcb8118965ab4631a65ee0726e6518f75cda6c5
>
> on top of v3.4, rebuild your kernel, and rerun your test (with
> apic=debug; this allows easier diff to dmesg output of your previous
> test runs).
>
> In any case it also would be good to have the ACPI tables from your
> system. So please also use
> # acpidump > acpidump.3.3   (using the 3.3.x kernel)
> # acpidump > acpidump.3.4   (using the unmodified 3.4 version)
>
> and send all files as attachments to your mail.
>
> This will allow me to look at your thermal zone definitions in the
> working and non-working case.
>
> Thanks,
> Andreas