Re: Differences between builtins and modules
Sorry I didn't see this mail until now.

RE: Randy Dunlap wrote:
> Would someone please answer/reply to this (related) kernel bugzilla entry:
> https://bugzilla.kernel.org/show_bug.cgi?id=118661

Yes, I raised this bug because I think modinfo should return a 0 exit status if a requested module is built in, not just when it has been loaded, as this modified version does:

  $ modinfo snd
  modinfo: ERROR: Module snd not found.
  built-in: snd
  $ echo $?
  0

What was the query about Bug 118661 that needs to be answered? I don't see any query on the bug report - just a comment from someone who also agrees modinfo should return OK for a built-in module.

Glad to hear someone is finally considering fixing modinfo to report the status of built-in modules - with only a 2 year response time.

Thanks & Best Regards, Jason
Re: [PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Good day - I believe the last patch I sent, with $subject, addresses all concerns raised so far by reviewers, and complies with all kernel coding standards.

Please, it would be most helpful if you could let me know whether the patch is now acceptable and will be applied at some stage or not - or, if not, what the problem with it is. My clients are asking whether the patch is going to be in the upstream kernel or not, and I need to tell them something.

Thanks & Best Regards, Jason
[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch implements clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls entirely in the vDSO, without calling vdso_fallback_gettime().

It has been augmented to support compilation with or without -DRETPOLINE / $(RETPOLINE_CFLAGS): when compiled with -DRETPOLINE, not all function calls can be inlined within __vdso_clock_gettime, and all functions invoked by __vdso_clock_gettime must have the 'indirect_branch("keep")' and 'function_return("keep")' attributes to compile, otherwise thunk relocations are generated; and the functions cannot all be declared '__always_inline__', otherwise a compiler error ('not all __always_inline__ functions can be inlined', fatal under -Werror) is generated. Also, compared to the previous version of this patch, the do_*_coarse functions are still not inline, and were not inadvertently changed to inline.

I still think it might be better to apply H.J. Lu's patch from https://bugzilla.kernel.org/show_bug.cgi?id=199129 to disable -DRETPOLINE compilation for the vDSO.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..80d65d4 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,62 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+# define _NO_THUNK_RELOCS_() (indirect_branch("keep"),\
+			      function_return("keep"))
+# define _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+# define _RETPOLINE_INLINE_ inline
+#else
+# define _RETPOLINE_FUNC_ATTR_
+# define _RETPOLINE_INLINE_ __always_inline
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -225,7 +258,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -246,7 +280,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -256,7 +313,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -266,7 +324,8 @@ notrace static void do_monotonic_coarse(struct
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc6 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only the files:

	arch/x86/include/asm/vgtod.h
	arch/x86/entry/vdso/vclock_gettime.c
	arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls - after the patch, the same call on the same machine has a latency of ~20ns. Please consider applying something like this patch to a future Linux release.

This patch is being resent because it has slight improvements to the vclock_gettime static function attributes wrt the previous version. It also supersedes all previous patches with subject matching '.*VDSO should handle.*clock_gettime.*MONOTONIC_RAW' that I have sent - sorry for the resends.

Please apply this patch so we stop getting emails from the Intel build bot trying to build the previous version, with subject '[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall', which only fails to build because its patch 2/2, which removed -DRETPOLINE from the VDSO build and is now the subject of https://bugzilla.kernel.org/show_bug.cgi?id=199129 raised by H.J. Lu, was not applied first - sorry!

Thanks & Best Regards,
Jason Vas Dias
Re: [PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Note there is a bug raised by H.J. Lu:

  Bug 199129: Don't build vDSO with $(RETPOLINE_CFLAGS) -DRETPOLINE
  (https://bugzilla.kernel.org/show_bug.cgi?id=199129)

If you agree it is a bug, then use both patches from the post '[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall'; else, use the single patch from $subject, which makes the calls to the statics in vclock_gettime.c use indirect_branch("keep") / function_return("keep"), to avoid generation of thunk relocations, which would not occur unless compiled with -mindirect-branch=thunk-extern -mindirect-branch-register.

Thanks & Regards, Jason
[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls in the same way it handles clock_gettime(CLOCK_MONOTONIC, &ts) calls, reducing latency from ~200-1000ns to ~20ns.

It has been resent and augmented to support compilation with -DRETPOLINE / -mindirect-branch=thunk-extern -mindirect-branch-register, to avoid generating relocations for thunks.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..9b89f86 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,60 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	switch (gtod->vclock_mode) {
+	case VCLOCK_TSC:
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	case VCLOCK_PVCLOCK:
+		return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	case VCLOCK_HVCLOCK:
+		return vread_hvclock(mode);
 #endif
-	else
+	default:
+		break;
+	}
+	return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
 		return 0;
+
 	v = (cycles - gtod->cycle_last) & gtod->mask;
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles = vgetcycles(mode);
+
+	if (cycles == 0)
+		return 0;
+
+	v = (cycles - gtod->cycle_last) & gtod->mask;
+	return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+# define _NO_THUNK_RELOCS_() (indirect_branch("keep"),\
+			      function_return("keep"))
+# define _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+#else
+# define _RETPOLINE_FUNC_ATTR_
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -225,7 +256,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
 	u64 ns;
@@ -246,7 +278,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -256,7 +311,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -266,7 +322,11 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace
+#ifdef RETPOLINE
+	__attribute__((indirect_branch("keep"), function_return("keep")))
+#endif
+int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
@@ -277,6 +337,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 		if (do_monotonic(ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+
[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only the files:

	arch/x86/include/asm/vgtod.h
	arch/x86/entry/vdso/vclock_gettime.c
	arch/x86/entry/vsyscall/vsyscall_gtod.c

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls - after the patch, the same call on the same machine has a latency of ~20ns. Please consider applying something like this patch to a future Linux release.

Thanks & Best Regards,
Jason Vas Dias
Re: [PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
On 18/03/2018, Jason Vas Dias wrote: (should have CC'ed to list, sorry)

> On 17/03/2018, Andi Kleen wrote:
>>
>> That's quite a mischaracterization of the issue. gcc works as intended,
>> but the kernel did not correctly supply an indirect call retpoline thunk
>> to the vdso, and it just happened to work by accident with the old vdso.
>>
>>> The automated test builds should now succeed with this patch.
>>
>> How about just adding the thunk function to the vdso object instead of
>> this cheap hack?
>>
>> The other option would be to build vdso with inline thunks.
>>
>> But just disabling is completely the wrong action.
>>
>> -Andi
>
> Aha! Thanks for the clarification, Andi!
>
> I will do so and resend the 2nd patch.
>
> But is everyone agreed we should accept any slowdown for the timer
> functions? I personally don't think it is a good idea, but I will
> regenerate the patch with the thunk function and without the Makefile
> change.
>
> Thanks & Best Regards,
> Jason

I am wondering if it is not better to avoid the thunk being generated and remove the Makefile patch. I know that changing the switch in __vdso_clock_gettime() like this avoids the thunk:

	switch (clock) {
	case CLOCK_MONOTONIC:
		if (do_monotonic(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	default:
		switch (clock) {
		case CLOCK_REALTIME:
			if (do_realtime(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_MONOTONIC_RAW:
			if (do_monotonic_raw(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_REALTIME_COARSE:
			do_realtime_coarse(ts);
			break;
		case CLOCK_MONOTONIC_COARSE:
			do_monotonic_coarse(ts);
			break;
		default:
			goto fallback;
		}
	}
	return 0;
fallback:
	...

So at the cost of an unnecessary extra test of the clock parameter, the thunk is avoided. I wonder if the whole switch should be changed to an if / else clause?

Or, I know this might be unorthodox, but this might work:

#define _CAT(V1,V2) V1##V2
#define GTOD_CLK_LABEL(CLK) _CAT(_VCG_L_,CLK)
#define MAX_CLK 16 /* ^^ ?? */

__vdso_clock_gettime( ... )
{
	...
	static const void *clklbl_tab[MAX_CLK] = {
		[CLOCK_MONOTONIC]     = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC),
		[CLOCK_MONOTONIC_RAW] = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW),
		/* and similarly for all clocks handled */ ...
	};

	goto *clklbl_tab[clock & 0xf];

GTOD_CLK_LABEL(CLOCK_MONOTONIC):
	if (do_monotonic(ts) == VCLOCK_NONE)
		goto fallback;

GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW):
	if (do_monotonic_raw(ts) == VCLOCK_NONE)
		goto fallback;

	... /* similarly for all clocks */

fallback:
	return vdso_fallback_gettime(clock, ts);
}

If a restructuring like that might be acceptable (with correct tab-based formatting), and the vDSO can have such a table in its .bss, I think it would avoid the thunk, would have the advantage of precomputing the jump table at compile time, and would not require any indirect branches.

Any thoughts?

Thanks & Best regards,
Jason
Re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
fixed typo in timer_latency.c affecting only -r printout:

$ gcc -DN_SAMPLES=1000 -o timer timer_latency.c

CLOCK_MONOTONIC (using rdtscp_ordered()):

$ ./timer -m -r 10
sum: 67615 Total time: 0.67615S - Average Latency: 0.00067S N zero deltas: 0 N inconsistent deltas: 0
sum: 51858 Total time: 0.51858S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51742 Total time: 0.51742S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51944 Total time: 0.51944S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 51838 Total time: 0.51838S - Average Latency: 0.00051S N zero deltas: 0 N inconsistent deltas: 0
sum: 52397 Total time: 0.52397S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52428 Total time: 0.52428S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52135 Total time: 0.52135S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 52145 Total time: 0.52145S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0
sum: 53116 Total time: 0.53116S - Average Latency: 0.00053S N zero deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00053S

CLOCK_MONOTONIC_RAW (using rdtscp()):

$ ./timer -r 10
sum: 25755 Total time: 0.25755S - Average Latency: 0.00025S N zero deltas: 0 N inconsistent deltas: 0
sum: 21614 Total time: 0.21614S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21616 Total time: 0.21616S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21610 Total time: 0.21610S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21619 Total time: 0.21619S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21617 Total time: 0.21617S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 21610 Total time: 0.21610S - Average Latency: 0.00021S N zero deltas: 0 N inconsistent deltas: 0
sum: 16940 Total time: 0.16940S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
sum: 16939 Total time: 0.16939S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
sum: 16943 Total time: 0.16943S - Average Latency: 0.00016S N zero deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00019S

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) \
		     + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const *argv, char *const *envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt = N_SAMPLES, s = 0, avg_n = 0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum = 0, zd = 0,
		ic = 0, d, t_start, avg_ns, *avgs = 0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1, repeat = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case 'd':
				case 'D':
					do_dump = true;
					break;
				case 'r':
				case 'R':
					if ((argn < argc) && (argv[argn+1] != NULL))
						repeat = atoi(argv[argn += 1]);
					break;
				case '?':
				case 'h':
				case 'u':
				case 'U':
				case 'H':
					fprintf(stderr,
						"Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t-r <n>\n]\tCalculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n");
					return 0;
				}

	if (repeat > 1) {
		avgs = alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
		if (((unsigned long) avgs) & 7)
			avgs = ((unsigned long long *)
				(((unsigned char *)avgs)
				 + (8 - (((unsigned long) avgs) & 7))));
	}

	do {
		cnt = N_SAMPLES;
		s = 0;
		do {
			if (0 != clock_gettime(clk, &sample[s++])) {
				fprintf(stderr,
					"oops, clock_gettime() failed: %d: '%s'.\n",
					errno, strerror(errno));
				return 1;
			}
		} while (--cnt);
		clock_gettime(clk, &sample[s]);

		for (s = 1; s < (N_SAMPLES+1); s += 1) {
			t1 = TS2NS(sample[s-1]);
			t2 = TS2NS(sample[s]);
			if ((t1 > t2)
			    || (sample[s-1].tv_sec > sample[s].tv_sec)
			    || ((sample[s-1].tv_sec == sample[s].tv_sec)
				&& (sample[s-1].tv_nsec > sample[s].tv_nsec))) {
				fprintf(stderr,
					"Inconsistency: %llu %llu %lu.%lu %lu.%lu\n",
					t1, t2,
					sample[s-1].tv_sec, sample[s-1].tv_nsec,
					sample[s].tv_sec, sample[s].tv_nsec);
re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Hi - I submitted a new stripped-down-to-bare-essentials version of the patch (see LKML emails with $subject) which passes all checkpatch.pl tests and addresses all concerns raised by reviewers, which uses only rdtsc_ordered(), and which only updates in vsyscall_gtod_data the new fields:

	u32 raw_mult, raw_shift;
	...
	gtod_long_t monotonic_time_raw_sec  /* == tk->raw_sec */ ,
		    monotonic_time_raw_nsec /* == tk->tkr_raw.nsec */ ;

(this is NOT the formatting used in vgtod.h - sorry about previous formatting issues).

I don't see how one could present the raw timespec in user-space properly without tk->tkr_raw.xtime_nsec and tk->raw_sec; monotonic has gtod->monotonic_time_sec and gtod->monotonic_time_snsec, and I am only trying to follow exactly the existing algorithm in timekeeping.c's getrawmonotonic64().

When I submitted the initial version of this stripped-down patch, I got an email back from the build robot reporting a compilation error:

> arch/x86/entry/vdso/vclock_gettime.o: In function `__vdso_clock_gettime':
> vclock_gettime.c:(.text+0xf7): undefined reference to `__x86_indirect_thunk_rax'
> /usr/bin/ld: arch/x86/entry/vdso/vclock_gettime.o: relocation R_X86_64_PC32
> against undefined symbol `__x86_indirect_thunk_rax' can not be used when
> making a shared object; recompile with -fPIC
> /usr/bin/ld: final link failed: Bad value
> collect2: error: ld returned 1 exit status
> --
> arch/x86/entry/vdso/vdso32.so.dbg: undefined symbols found
> --
> objcopy: 'arch/x86/entry/vdso/vdso64.so.dbg': No such file

I had fixed this problem with the patch to the RHEL kernel attached to bug #198961 (attachment #274751: https://bugzilla.kernel.org/attachment.cgi?id=274751), by simply reducing the number of clauses in __vdso_clock_gettime's switch (clock) from 6 to 5, but at the cost of an extra test of clock and a second switch (clock).
I reported this as GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908 because I don't think GCC should fail to do anything for a switch with 6 clauses and not for one with 5, but the response I got from H.J. Lu was:

H.J. Lu wrote @ 2018-03-16 22:13:27 UTC:
> vDSO isn't compiled with $(KBUILD_CFLAGS). Why does your kernel do it?
> Please try my kernel patch at comment 4.

So that patch to the arch/x86/vdso/Makefile only prevents it enabling the RETPOLINE_CFLAGS for building the vDSO. I defer to H.J.'s expertise on GCC + binutils and the advisability of enabling RETPOLINE_CFLAGS in the VDSO - GCC definitely behaves strangely for the vDSO when RETPOLINE_CFLAGS are enabled.

Please provide something like the patch in a future version of Linux, and I suggest not compiling the vDSO with RETPOLINE_CFLAGS, as H.J. does.

The inconsistency_check program in tools/testing/selftests/timers produces no errors for long runs, and the timer_latency.c program (attached) also produces no errors, with latencies of ~20ns for CLOCK_MONOTONIC_RAW and ~40ns for CLOCK_MONOTONIC - this is however with the additional rdtscp patches, and under 4.15.9, for use on my system; the 4.16-rc5 version submitted still uses barrier() + rdtsc, and that has a latency of ~30ns for CLOCK_MONOTONIC_RAW and ~40ns for CLOCK_MONOTONIC; but both are much, much better than the 200-1000ns for CLOCK_MONOTONIC_RAW that the unpatched kernels have (all times refer to the 'Average Latency' output produced by timer_latency.c).

I do apologize for whitespace errors, unread emails, resends, and the confusion of previous emails - I now understand the process and standards much better and will attempt to adhere to them more closely in future.

Thanks & Best Regards,
Jason Vas Dias

/*
 * Program to measure high-res timer latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) \
		     + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const *argv, char *const *envp)
{
	struct timespec sample[N_SAMPLES+1];
	unsigned int cnt = N_SAMPLES, s = 0, avg_n = 0;
	unsigned long long deltas[N_SAMPLES], t1, t2, sum = 0, zd = 0,
		ic = 0, d, t_start, avg_ns, *avgs = 0;
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1, repeat = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case 'd':
				case 'D':
					do_dump = true;
					break;
				case 'r':
				case 'R':
					if ((a
[PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch allows compilation to succeed with compilers that support -DRETPOLINE - it was kindly contributed by H.J. Lu in GCC Bugzilla 84908: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908

Apparently the GCC retpoline implementation has a limitation that it cannot handle switch statements with more than 5 clauses, which vclock_gettime.c's __vdso_clock_gettime function now contains.

The automated test builds should now succeed with this patch.

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 1943aeb..cb64e10 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -76,7 +76,7 @@ CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
        -fno-omit-frame-pointer -foptimize-sibling-calls \
        -DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
 
-$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS)) $(CFL)
 
 #
 # vDSO code runs in userspace and -pg doesn't help with profiling anyway.
@@ -143,6 +143,7 @@ KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
 KBUILD_CFLAGS_32 += $(call cc-option, -fno-stack-protector)
 KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)
[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments, and to allow builds with compilers that support -DRETPOLINE to succeed.

Currently, the VDSO does not handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 (Family_Model 06_3C) machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" (fast) or "cold" (slow) it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 (https://bugzilla.kernel.org/show_bug.cgi?id=198961) and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW'.

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update(). Now the new do_monotonic_raw() function in the vDSO has a latency of ~20ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git.

This patch affects only the files:

	arch/x86/include/asm/vgtod.h
	arch/x86/entry/vdso/vclock_gettime.c
	arch/x86/entry/vsyscall/vsyscall_gtod.c
	arch/x86/entry/vdso/Makefile

Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198961, as is the test program, timer_latency.c, to demonstrate the problem.

Before the patch, a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls - after the patch, the same call on the same machine has a latency of ~20ns.

Thanks & Best Regards,
Jason Vas Dias
[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW,&ts) calls in the same way it handles clock_gettime(CLOCK_MONOTONIC,&ts) calls, reducing latency from @ 200-1000ns to @ 20ns. diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..843b0a6 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void) return last; } -notrace static inline u64 vgetsns(int *mode) +notrace static inline __always_inline u64 vgetcycles(int *mode) { - u64 v; - cycles_t cycles; - - if (gtod->vclock_mode == VCLOCK_TSC) - cycles = vread_tsc(); + switch (gtod->vclock_mode) { + case VCLOCK_TSC: + return vread_tsc(); #ifdef CONFIG_PARAVIRT_CLOCK - else if (gtod->vclock_mode == VCLOCK_PVCLOCK) - cycles = vread_pvclock(mode); + case VCLOCK_PVCLOCK: + return vread_pvclock(mode); #endif #ifdef CONFIG_HYPERV_TSCPAGE - else if (gtod->vclock_mode == VCLOCK_HVCLOCK) - cycles = vread_hvclock(mode); + case VCLOCK_HVCLOCK: + return vread_hvclock(mode); #endif - else + default: + break; + } + return 0; +} + +notrace static inline u64 vgetsns(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) return 0; + v = (cycles - gtod->cycle_last) & gtod->mask; return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) + return 0; + + v = (cycles - gtod->cycle_last) & gtod->mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..c4d89b6 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -44,6 +44,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->mask = tk->tkr_mono.mask; vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +76,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..ec1a37c 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ 
-22,7 +22,8 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; - + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; gtod_long_t wall_time_sec; @@ -32,6 +33,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_ns
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Good day - RE: On 15/03/2018, Thomas Gleixner wrote: > On Thu, 15 Mar 2018, Jason Vas Dias wrote: >> On 15/03/2018, Thomas Gleixner wrote: >> > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote: >> > >> >> Resent to address reviewer comments. >> > >> > I was being patient so far and tried to guide you through the patch >> > submission process, but unfortunately this turns out to be just waste of >> > my >> > time. >> > >> > You have not addressed any of the comments I made here: >> > >> > [1] >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de >> > [2] >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de >> > >> >> I'm really sorry about that - I did not see those mails , >> and have searched for them in my inbox - > > That's close to the 'my dog ate the homework' excuse. > Nevertheless, those messages are NOT in my inbox, nor can I find them on the list - a google search for 'alpine.DEB.2.21.1803141511340.2481' or 'alpine.DEB.2.21.1803141527300.2481' returns only the last two mails on the subject , where you included the links to https://lkml.kernel.org. I don't know what went wrong here, but I did not receive those mails until you informed me of them yesterday evening, when I immediately regenerated the Patch #1 incorporating fixes for your comments, and sent it with Subject: '[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle\ clock_gettime(CLOCK_MONOTONIC_RAW) without syscall ' This version re-uses the 'gtod->cycles' value, which as you point out, is the same as 'tk->tkr_raw.cycle_last' - so I removed vread_tsc_raw() . > Of course they were sent to the list and to you personally as I used > reply-all. 
From the mail server log: > > 2018-03-14 15:27:27 1ew7NH-00039q-Hv <= t...@linutronix.de > id=alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de > > 2018-03-14 15:27:30 1ew7NH-00039q-Hv => jason.vas.d...@gmail.com R=dnslookup > T=remote_smtp H=gmail-smtp-in.l.google.com [2a00:1450:4013:c01::1a] > X=TLS1.2:RSA_AES_128_CBC_SHA1:128 DN="C=US,ST=California,L=Mountain > View,O=Google Inc,CN=mx.google.com" > > 2018-03-14 15:27:31 1ew7NH-00039q-Hv => linux-kernel@vger.kernel.org > R=dnslookup T=remote_smtp H=vger.kernel.org [209.132.180.67] > > > > 2018-03-14 15:27:47 1ew7NH-00039q-Hv Completed > > If those messages would not have been delivered to > linux-kernel@vger.kernel.org they would hardly be on the mailing list > archive, right? > Yes, I cannot explain why I did not receive them. I guess I should consider gmail an unreliable delivery method and use the lkml.org web interface to check for replies - I will do this from now on. > And they both got delivered to your gmail account as well. > No, they are not in my gmail account Inbox or folders. > ERROR: Missing Signed-off-by: line(s) > total: 1 errors, 0 warnings, 71 lines checked > I do not know how to fix this error - I was hoping someone on the list might enlighten me. > > WARNING: externs should be avoided in .c files > #24: FILE: arch/x86/entry/vdso/vclock_gettime.c:31: > +extern unsigned int __vdso_tsc_calibration( > I thought that must be a script bug, since no extern is being declared by that line; it is an external function declaration, just like the unmodified line that precedes it. > WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? > #93: > new file mode 100644 > > ERROR: Missing Signed-off-by: line(s) > > total: 1 errors, 2 warnings, 143 lines checked > > It reports an error for every single patch of your latest submission. > >> And I did send the test results in a previous mail - > > In private mail which I ignore if there is no real good reason.
And just > for the record. This private mail contains the following headers: > > In-Reply-To: > References: <1521001222-10712-1-git-send-email-jason.vas.d...@gmail.com> > <1521001222-10712-3-git-send-email-jason.vas.d...@gmail.com> > > From: Jason Vas Dias > Date: Wed, 14 Mar 2018 15:08:55 + > Message-ID: > > Subject: Re: [PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle > CLOCK_MONOTONIC_RAW > > So now, if you take the message ID which is in the In-Reply-To: field and > compare it to the message ID which I used for link [2]: > > In-Reply-To: >> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de > > you might notice that these are identical. So how did you end up replying > to a mail which you never recei
[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
Resent to address reviewer comments. Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161, as is the test program, timer_latency.c, to demonstrate the problem. Before the patch a latency of 200-1000ns was measured for clock_gettime(CLOCK_MONOTONIC_RAW,&ts) calls - after the patch, the same call on the same machine has a latency of @ 20ns. Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..8b9b9cf 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void) return last; } -notrace static inline u64 vgetsns(int *mode) +notrace static inline __always_inline u64 vgetcycles(int *mode) { - u64 v; - cycles_t cycles; - - if (gtod->vclock_mode == VCLOCK_TSC) - cycles = vread_tsc(); + switch (gtod->vclock_mode) { + case VCLOCK_TSC: + return vread_tsc(); #ifdef CONFIG_PARAVIRT_CLOCK - else if (gtod->vclock_mode == VCLOCK_PVCLOCK) - cycles = vread_pvclock(mode); + case VCLOCK_PVCLOCK: + return vread_pvclock(mode); #endif #ifdef CONFIG_HYPERV_TSCPAGE - else if (gtod->vclock_mode == VCLOCK_HVCLOCK) - cycles = vread_hvclock(mode); + case VCLOCK_HVCLOCK: + return vread_hvclock(mode); #endif - else + default: + break; + } + return 0; +} + +notrace static inline u64 vgetsns(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) return 0; + v = (cycles - gtod->cycle_last) & gtod->mask; return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles = vgetcycles(mode); + + if (cycles == 0) + return 0; + + v = (cycles - gtod->cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..83f5c21 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -44,6 +44,9 @@ void update_vsyscall(struct timekeeper *tk) vdata->mask = tk->tkr_mono.mask; vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +77,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..941e9d6 100644 --- a/arch/x86/include/asm/vgtod.h +++ 
b/arch/x86/include/asm/vgtod.h @@ -22,7 +22,9 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; - + u32 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; gtod_long_t wall_time_sec; @@ -32,6 +34,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Hi Thomas - RE: On 15/03/2018, Thomas Gleixner wrote: > Jason, > > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote: > >> Resent to address reviewer comments. > > I was being patient so far and tried to guide you through the patch > submission process, but unfortunately this turns out to be just waste of my > time. > > You have not addressed any of the comments I made here: > > [1] > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de > [2] > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de > I'm really sorry about that - I did not see those mails , and have searched for them in my inbox - are you sure they were sent to 'linux-kernel@vger.kernel.org' ? That is the only list I am subscribed to . I clicked on the links , but the 'To:' field is just 'linux-kernel' . If I had seen those messages before I re-submitted, those issues would have been fixed. checkpatch.pl did not report them - I ran it with all patches and it reported no errors . And I did send the test results in a previous mail - $ gcc -m64 -o timer timer.c ( must be compiled in 64-bit mode). This is using the new rdtscp() function : $ ./timer -r 100 ... 
Total time: 0.02806S - Average Latency: 0.00028S N zero deltas: 0 N inconsistent deltas: 0 Average of 100 average latencies of 100 samples : 0.00027S This is using the rdtsc_ordered() function: $ ./timer -m -r 100 Total time: 0.05269S - Average Latency: 0.00052S N zero deltas: 0 N inconsistent deltas: 0 Average of 100 average latencies of 100 samples : 0.00047S timer.c is a very short program that just reads N_SAMPLES (a compile-time option) timespecs using either CLOCK_MONOTONIC_RAW (no -m) or CLOCK_MONOTONIC as the first parameter to clock_gettime(), then computes the deltas as long long, then averages them, counting any zero deltas, or deltas where the previous timespec is somehow greater than the current timespec, which are reported as inconsistencies (note 'inconsistent deltas: 0' and 'zero deltas: 0' in the output). So my initial claim that rdtscp() can be twice as fast as rdtsc_ordered() was not far-fetched - this is what I am seeing. I think this is because of the explicit barrier() call in rdtsc_ordered(). This must be slower than the internal processor pipeline "cancellation point" (barrier) used by the rdtscp instruction itself. This is the only reason for the rdtscp call - plus all modern Intel & AMD CPUs support it, and it DOES solve the ordering problem, whereby instructions in one pipeline of a task can get different rdtsc() results than instructions in another pipeline. I will document the results better in the ChangeLog, fix all issues you identified, and resend. I did not mean to ignore your comments - those mails are nowhere in my Inbox - please confirm the actual email address they are getting sent to. Thanks & Regards, Jason /* * Program to measure high-res timer latency.
* */ #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <stdbool.h> #include <string.h> #include <errno.h> #include <time.h> #include <alloca.h> #ifndef N_SAMPLES #define N_SAMPLES 100 #endif #define _STR(_S_) #_S_ #define STR(_S_) _STR(_S_) #define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) + ((unsigned long long)((_TS_).tv_nsec))) int main(int argc, char *const* argv, char *const* envp) { struct timespec sample[N_SAMPLES+1]; unsigned int cnt=N_SAMPLES, s=0 , avg_n=0; unsigned long long deltas [ N_SAMPLES ] , t1, t2, sum=0, zd=0, ic=0, d , t_start, avg_ns, *avgs=0; clockid_t clk = CLOCK_MONOTONIC_RAW; bool do_dump = false; int argn=1, repeat=1; for(; argn < argc; argn+=1) if( argv[argn] != NULL ) if( *(argv[argn]) == '-') switch( *(argv[argn]+1) ) { case 'm': case 'M': clk = CLOCK_MONOTONIC; break; case 'd': case 'D': do_dump = true; break; case 'r': case 'R': if( (argn < argc) && (argv[argn+1] != NULL)) repeat = atoi(argv[argn+=1]); break; case '?': case 'h': case 'u': case 'U': case 'H': fprintf(stderr,"Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t" "-r \n]\t" "Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n" ); return 0; } if( repeat > 1 ) { avgs=alloca(sizeof(unsigned long long) * (N_SAMPLES + 1)); if( ((unsigned long) avgs) & 7 ) avgs = ((unsigned long long*)(((unsigned char*)avgs)+(8-((unsigned long) avgs) & 7))); } do { cnt=N_SAMPLES; s=0; do { if( 0 != clock_gettime(clk, &sample[s++]) ) { fprintf(stderr,"oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno)); return 1; } }while( --cnt ); clock_gettime(clk, &sample[s]); for(s=1; s < (N_SAMPLES+1); s+=1) { t1 = TS2NS(sample[s-1]); t2 = TS2NS(sample[s]); if ( (t1 > t2)
[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Resent to address reviewer comments. Currently, the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high. I reported this as Bug #198161 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftest/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . 
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c There are 3 patches in the series : Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with rdtsc_ordered() Patches #2 & #3 should be considered "optional" : Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new rdtscp() function in msr.h Patch #3 makes the VDSO export TSC calibration data via a new function in the vDSO: unsigned int __vdso_linux_tsc_calibration ( struct linux_tsc_calibration *tsc_cal ) that user code can optionally call. Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls somewhat faster than clock_gettime(CLOCK_MONOTONIC) calls. I think something like Patch #3 is necessary to export TSC calibration data to user-space TSC readers. It is entirely up to the kernel developers whether they want to include patches #2 and #3, but I think something like Patch #1 really needs to get into a future Linux release, as an unecessary latency of 200-1000ns for a timer that can tick 3 times per nanosecond is unacceptable. Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug #198161. Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index fbc7371..2c46675 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc + u64 tsc = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 5af7093..0327a95 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,9 @@ #include #include #include +#include + +extern unsigned int tsc_khz; int vclocks_used __read_mostly; @@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 30df295..a5ff704 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -218,6 +218,37 @@ static __always_inline unsigned long long rdtsc_ordered(void) return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. + * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. 
+ * As does EAX_EDX_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + + asm volatile + ("rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // : eax, edx, ecx used - NOT rax, rdx, rcx + if (unlikely(cpu_out != ((void *)0))) + *cpu_out = tsc_cpu; + return ((((u64)tsc_hi) << 32) | + (((u64)tsc_lo) & 0x0ffffffffULL) + ); +} + /* Deprecated, keep it for a cycle for easier merging: */ #define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0) diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index 24e4d45..e7e4804 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -26,6 +26,7 @@ struct vsyscall_gtod_data { u64 raw_mask; u32 raw_mult; u32 raw_shift; + u32 has_rdtscp; /* open coded 'struct timespec' */ u64 wall_time_snsec;
[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..fbc7371 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode) return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..5af7093 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk) vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_cycle_last = tk->tkr_raw.cycle_last; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; + vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..24e4d45 100644 --- 
a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -22,6 +22,10 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; + u64 raw_cycle_last; + u64 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; @@ -32,6 +36,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 03f3904..61d9633 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -21,12 +21,15 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts); extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz); extern time_t __vdso_time(time_t *t); +extern unsigned int __vdso_tsc_calibration( + struct linux_tsc_calibration_s *tsc_cal); #ifdef CONFIG_PARAVIRT_CLOCK extern u8 pvclock_page @@ -383,3 +386,25 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +notrace unsigned int +__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) +{ + unsigned long seq; + + do { + seq = gtod_read_begin(gtod); + if ((gtod->vclock_mode == VCLOCK_TSC) && + (tsc_cal != ((void *)0UL))) { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult= gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + } while (unlikely(gtod_read_retry(gtod, seq))); + + return 0; +} + +unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce..e0b5cce 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S index 422764a..17fd07f 100644 --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff --git
a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S index 05cd1c5..7acac71 100644 --- a/arch/x86/entry/vdso/vdsox32.lds.S +++ b/arch/x86/entry/vdso/vdsox32.lds.S @@ -21,6 +21,7 @@ VERSION { __vdso_gettimeofday; __vdso_getcpu; __vdso_time; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/include/uapi/asm/vdso_tsc_calibration.h b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h new file mode 100644 index 000..ce4b5a45 --- /dev/null +++ b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _ASM_X86_VDSO_TSC_CALIBRATION_H +#define _ASM_X86_VDSO_TSC_CALIBRATION_H +/* + * Programs that want to use rdtsc / rdtscp instructions + * from user-space can make use of the Linux kernel TSC calibration + * by calling : + *__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *); + * ( one has to resolve this symbol as in + * tools/testing/selftests/vDSO/parse_vdso.c + * ) + * which fills in a structure + * with the following layout : + */ + +/** struct linux_tsc_calibration_s - + * mult:amount to multiply 64-bit TSC value by + * shift: the right shift to apply to (mult*TSC) yielding nanoseconds + * tsc_khz: the calibrated TSC frequency in KHz from which previous + * members calculated + */ +struct linux_tsc_calibration_s { + + unsigned int mult; + unsigned int shift; + unsigned int tsc_khz; + +}; + +/* To use: + * + * static unsigned + * (*linux_tsc_cal)(struct linux_tsc_calibration_s *linux_tsc_cal) = + *vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"); + * if(linux_tsc_cal == ((void *)0)) + * { fprintf(stderr,"the patch providing __vdso_linux_tsc_calibration" + * " is not applied to the kernel.\n"); + *return ERROR; + * } + * static struct linux_tsc_calibration clock_source={0}; + * if((clock_source.mult==0) && ! 
(*linux_tsc_cal)(&clock_source) ) + *fprintf(stderr,"TSC is not the system clocksource.\n"); + * unsigned int tsc_lo, tsc_hi, tsc_cpu; + * asm volatile + * ( "rdtscp" : (=a) tsc_hi, (=d) tsc_lo, (=c) tsc_cpu ); + * unsigned long tsc = (((unsigned long)tsc_hi) << 32) | tsc_lo; + * unsigned long nanoseconds = + * (( clock_source . mult ) * tsc ) >> (clock_source . shift); + * + * nanoseconds is now TSC value converted to nanoseconds, + * according to Linux' clocksource calibration values. + * Incidentally, 'tsc_cpu' is the number of the CPU the task is running on.
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks for the helpful comments, Peter - re: On 14/03/2018, Peter Zijlstra wrote: > >> Yes, I am sampling perf counters, > > You're not in fact sampling, you're just reading the counters. Correct, using Linux-ese terminology - but "sampling" in looser English. >> Reading performance counters does involve 2 ioctls and a read() , > > So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and > just let them run and do: > > read(group_fd, &buf_pre, size); > /* your code section */ > read(group_fd, &buf_post, size); > > /* compute buf_post - buf_pre */ > > Which is only 2 system calls, not 4. But I can't, really - I am trying to restrict the performance counter measurements to only a subset of the code, and exclude performance measurement result processing - so the timeline is like: struct timespec t_start, t_end; perf_event_open(...); thread_main_loop() { ... do { t _clock_gettime(CLOCK_MONOTONIC_RAW, &t_start); t+x _ enable_perf (); total_work = do_some_work(); disable_perf (); clock_gettime(CLOCK_MONOTONIC_RAW, &t_end); t+y_ read_perf_counters_and_store_results ( perf_grp_fd, &results , total_work, TS2T( &t_end ) - TS2T( &t_start) ); } while ( ); } Now. here the bandwidth / performance results recorded by my 'read_perf_counters_and_store_results' method is very sensitive to the measurement of the OUTER elapsed time . > > Also, a while back there was the proposal to extend the mmap() > self-monitoring interface to groups, see: > > https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowo...@hirez.programming.kicks-ass.net > > I never did get around to writing the actual code for it, but it > shouldn't be too hard. > Great, I'm looking forward to trying it - but meanwhile, to get NON-MULTIPLEXED measurements for the SAME CODE SEQUENCE over the SAME TIME I believe the group FD method is what is implemented and what works. >> The CPU_CLOCK software counter should give the converted TSC cycles >> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...) 
>> and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the >> difference between the event->time_running and time_enabled >> should also measure elapsed time . > > While CPU_CLOCK is TSC based, there is no guarantee it has any > correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based). > > (although, I think I might have fixed that recently and it might just > work, but it's very much not guaranteed). Yes, I believe the CPU_CLOCK is effectively the converted TSC - it does appear to correlate well with the new CLOCK_MONOTONIC_RAW values from the patched VDSO. > If you want to correlate to CLOCK_MONOTONIC_RAW you have to read > CLOCK_MONOTONIC_RAW and not some random other clock value. > Exactly ! Hence the need for the patch so that users can get CLOCK_MONOTONIC_RAW values with low latency and correlate them with PERF CPU_CLOCK values. >> This gives the "inner" elapsed time, from the perpective of the kernel, >> while the measured code section had the counters enabled. >> >> But unless the user-space program also has a way of measuring elapsed >> time from the CPU's perspective , ie. without being subject to >> operator or NTP / PTP adjustment, it has no way of correlating this >> inner elapsed time with any "outer" > > You could read the time using the group_fd's mmap() page. That actually > includes the TSC mult,shift,offset as used by perf clocks. > Yes, but as mentioned earlier, that presupposes I want to use the mmap() sample method - I don't - I want to use the Group FD method, so that I can be sure the measurements are for the same code sequence over the same period of time. >> Currently, users must parse the log file or use gdb / objdump to >> inspect /proc/kcore to get the TSC calibration and exact >> mult+shift values for the TSC value conversion. > > Which ;-) there's multiple floating around.. > Yes, but why must Linux make it so difficult ? 
I think it has to be recognized that the vDSO or user-space program are the only places in which low-latency clock values can be generated for use by user-space programs with sufficiently low latencies to be useful. So why does it not export the TSC calibration which is so complex to calibrate when such calibration information is available nowhere else ? >> Intel does not publish, nor does the CPU come with in ROM or firmware, >> the actual precise TSC frequency - this must be calibrated against the >> other clocks , according to a complicated procedure in section 18.2 of >> the SDM . My TSC has a "rated" / nominal TSC frequency , which one >> can compute from CPUID leaves, of 2.3ghz, but the "Refined TSC frequency" >> is 2.8333ghz . > > You might
[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d..fbc7371 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode) return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd..5af7093 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk) vdata->mult = tk->tkr_mono.mult; vdata->shift= tk->tkr_mono.shift; + vdata->raw_cycle_last = tk->tkr_raw.cycle_last; + vdata->raw_mask = tk->tkr_raw.mask; + vdata->raw_mult = tk->tkr_raw.mult; + vdata->raw_shift= tk->tkr_raw.shift; + vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; @@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_coarse_sec++; } + vdata->monotonic_time_raw_sec = tk->raw_sec; + vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec; + gtod_write_end(vdata); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9..24e4d45 100644 --- 
a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -22,6 +22,10 @@ struct vsyscall_gtod_data { u64 mask; u32 mult; u32 shift; + u64 raw_cycle_last; + u64 raw_mask; + u32 raw_mult; + u32 raw_shift; /* open coded 'struct timespec' */ u64 wall_time_snsec; @@ -32,6 +36,8 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t monotonic_time_raw_sec; + gtod_long_t monotonic_time_raw_nsec; int tz_minuteswest; int tz_dsttime;
[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, user-space code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP "real" time; when code needs this, the latencies associated with a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of :
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only these files:

  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vdso/vdso.lds.S
  arch/x86/entry/vdso/vdsox32.lds.S
  arch/x86/entry/vdso/vdso32/vdso32.lds.S
  arch/x86/entry/vsyscall/vsyscall_gtod.c

There are 3 patches in the series:

  Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with rdtsc_ordered()

  Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new rdtscp() function in msr.h

  Patch #3 makes the VDSO export TSC calibration data via a new function in the vDSO:
    unsigned int __vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
  that user code can optionally call.

Patches #2 & #3 should be considered "optional". Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls have about half the latency of clock_gettime(CLOCK_MONOTONIC) calls. I think something like Patch #3 is necessary to export TSC calibration data to user-space TSC readers.

Best Regards,
Jason Vas Dias
[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index fbc7371..2c46675 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc + u64 tsc = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 5af7093..0327a95 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -16,6 +16,9 @@ #include #include #include +#include + +extern unsigned tsc_khz; int vclocks_used __read_mostly; @@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 30df295..a5ff704 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -218,6 +218,36 @@ static __always_inline unsigned long long rdtsc_ordered(void) return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. + * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. 
+ * As does EAX_EDT_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + asm volatile + ( "rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // : eax, edx, ecx used - NOT rax, rdx, rcx + if (unlikely(cpu_out != ((void*)0))) + *cpu_out = tsc_cpu; + return u64)tsc_hi) << 32) | + (((u64)tsc_lo) & 0x0ULL ) + ); +} + /* Deprecated, keep it for a cycle for easier merging: */ #define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0) diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index 24e4d45..e7e4804 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -26,6 +26,7 @@ struct vsyscall_gtod_data { u64 raw_mask; u32 raw_mult; u32 raw_shift; + u32 has_rdtscp; /* open coded 'struct timespec' */ u64 wall_time_snsec;
[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 2c46675..772988c 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -184,7 +185,7 @@ notrace static u64 vread_tsc(void) notrace static u64 vread_tsc_raw(void) { - u64 tsc = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered()) + u64 tsc = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered()) , last = gtod->raw_cycle_last; if (likely(tsc >= last)) @@ -383,3 +384,21 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +unsigned int __vdso_linux_tsc_calibration( + struct linux_tsc_calibration_s *tsc_cal); + +notraceunsigned int +__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) +{ + if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) { + tsc_cal->tsc_khz = gtod->tsc_khz; + tsc_cal->mult= gtod->raw_mult; + tsc_cal->shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce..e0b5cce 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S index 422764a..17fd07f 100644 --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff --git a/arch/x86/entry/vdso/vdsox32.lds.S b/arch/x86/entry/vdso/vdsox32.lds.S 
index 05cd1c5..7acac71 100644 --- a/arch/x86/entry/vdso/vdsox32.lds.S +++ b/arch/x86/entry/vdso/vdsox32.lds.S @@ -21,6 +21,7 @@ VERSION { __vdso_gettimeofday; __vdso_getcpu; __vdso_time; + __vdso_linux_tsc_calibration; local: *; }; } diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index 0327a95..692562a 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -53,6 +53,7 @@ void update_vsyscall(struct timekeeper *tk) vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); + vdata->tsc_khz = tsc_khz; vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index a5ff704..c7b2ed2 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -227,7 +227,7 @@ static __always_inline unsigned long long rdtsc_ordered(void) * the number (Intel CPU ID) of the CPU that the task is currently running on. * As does EAX_EDT_RET, this uses the "open-coded asm" style to * force the compiler + assembler to always use (eax, edx, ecx) registers, - * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit * variables are used - exactly the same code should be generated * for this instruction on 32-bit as on 64-bit when this asm stanza is used. * See: SDM , Vol #2, RDTSCP instruction. 
@@ -236,15 +236,15 @@ static __always_inline u64 rdtscp(u32 *cpu_out) { u32 tsc_lo, tsc_hi, tsc_cpu; asm volatile - ( "rdtscp" + ("rdtscp" : "=a" (tsc_lo) , "=d" (tsc_hi) , "=c" (tsc_cpu) ); // : eax, edx, ecx used - NOT rax, rdx, rcx - if (unlikely(cpu_out != ((void*)0))) + if (unlikely(cpu_out != ((void *)0))) *cpu_out = tsc_cpu; return u64)tsc_hi) << 32) | - (((u64)tsc_lo) & 0x0ULL ) + (((u64)tsc_lo) & 0x0ULL) ); } diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index e7e4804..75078fc 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -27,6 +27,7 @@ struct vsyscall_gtod_data { u32 raw_mult; u32 raw_shift; u32 has_rdtscp; + u32 tsc_khz;
Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
On 12/03/2018, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote:
>> Sometimes, particularly when correlating elapsed time to performance
>> counter values,
>
> So what actual problem are you trying to solve here? Perf can already
> give you sample time in various clocks, including MONOTONIC_RAW.
>

Yes, I am sampling perf counters, including CPU_CYCLES, INSTRUCTIONS, CPU_CLOCK, TASK_CLOCK, etc., in a Group FD I open with perf_event_open(), for the current thread on the current CPU - I am doing this for 4 threads, on Intel & ARM CPUs.

Reading performance counters does involve 2 ioctls and a read(), which takes time that already far exceeds the time required to read the TSC or CNTPCT in the VDSO.

The CPU_CLOCK software counter should give the converted TSC cycles seen between the ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, ...) and the ioctl(grp_fd, PERF_EVENT_IOC_DISABLE), and the difference between the event->time_running and time_enabled should also measure elapsed time.

This gives the "inner" elapsed time, from the perspective of the kernel, while the measured code section had the counters enabled.

But unless the user-space program also has a way of measuring elapsed time from the CPU's perspective, i.e. without being subject to operator or NTP / PTP adjustment, it has no way of correlating this inner elapsed time with any "outer" elapsed time measurement it may have made - I also measure the time taken by I/O operations between threads, for instance.

So that is my primary motivation - for each thread's main run loop, I enable performance counters and count several PMU counters and the CPU_CLOCK & TASK_CLOCK. I want to determine with maximal accuracy how much elapsed time was actually used executing the task's instructions on the CPU, and how long they took to execute. I want to try to exclude the time spent gathering and analysing the performance measurements from the time spent running the threads' main loop.
To do this accurately, it is best to exclude variations in time that occur because of operator or NTP / PTP adjustments. The CLOCK_MONOTONIC_RAW clock is the ONLY clock that is MEANT to be immune from any adjustment. It is meant to be a high-resolution clock with 1ns resolution that should be subject to no adjustment, and hence one would expect it to have the lowest latency. But the way Linux has up to now implemented it, CLOCK_MONOTONIC_RAW has a resolution (minimum time that can be measured) that varies from 300 - 1000ns. I can read the TSC and store a 16-byte timespec value in about 8ns on the same CPU.

I understand that Linux must conform to the POSIX interface, which means it cannot provide sub-nanosecond resolution timers, but it could allow user-space programs to easily discover the timer calibration so that user-space programs can read the timers themselves.

Currently, users must parse the log file or use gdb / objdump to inspect /proc/kcore to get the TSC calibration and exact mult+shift values for the TSC value conversion.

Intel does not publish, nor does the CPU come with in ROM or firmware, the actual precise TSC frequency - this must be calibrated against the other clocks, according to a complicated procedure in section 18.2 of the SDM. My TSC has a "rated" / nominal TSC frequency, which one can compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency" is 2.8333GHz.

Hence I think Linux should export this calibrated frequency somehow; its "calibration" is expressed as the raw clocksource 'mult' and 'shift' values, and is exported to the VDSO. I think the VDSO should read the TSC and use the calibration to render the raw, unadjusted time from the CPU's perspective. Hence, the patch I am preparing, which is again attached.

I will submit it properly via email once I figure out how to obtain the 'git-send-email' tool, and how to use it to send multiple patches, which seems to be the only way to submit acceptable patches.
Also, the attached timer program measures a latency of about 20ns with my patched 4.15.9 kernel, when it measured a latency of 300-1000ns without it.

Thanks & Regards,
Jason

vdso_clock_monotonic_raw_1.patch

/*
 * Program to measure high-res timer latency.
 */
#include #include #include #include #include #include #include #include

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

int main(int argc, char *const *argv, char *const *envp)
{
	clockid_t clk = CLOCK_MONOTONIC_RAW;
	bool do_dump = false;
	int argn = 1;

	for (; argn < argc; argn += 1)
		if (argv[argn] != NULL)
			if (*(argv[argn]) == '-')
				switch (*(argv[argn]+1)) {
				case 'm':
				case 'M':
					clk = CLOCK_MONOTONIC;
					break;
				case
Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
The split patches, with no checkpatch.pl failures, are attached and were just sent in separate emails to the mailing list. Sorry it took a few tries to get right. This will be my last send today - I'm off to use it at work.

Thanks & all the best,
Jason

vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch
vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#2.patch
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of :
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

This patch affects only these files:

  arch/x86/include/asm/msr.h
  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

This is the second patch in the series, which adds use of rdtscp.

Best Regards,
Jason Vas Dias
--- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 08:12:17.110120433 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:59:21.135475862 + @@ -187,7 +187,7 @@ notrace static u64 vread_tsc_raw(void) u64 tsc , last = gtod->raw_cycle_last; - tsc = rdtsc_ordered(); + tsc = gtod->has_rdtscp ? rdtscp((void*)0UL) : rdtsc_ordered(); if (likely(tsc >= last)) return tsc; asm volatile (""); diff -up linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 2018-03-12 07:58:07.974214168 + +++ linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c 2018-03-12 08:54:07.490267640 + @@ -16,6 +16,7 @@ #include #include #include +#include int vclocks_used __read_mostly; @@ -49,6 +50,7 @@ void update_vsyscall(struct timekeeper * vdata->raw_mask = tk->tkr_raw.mask; vdata->raw_mult = tk->tkr_raw.mult; vdata->raw_shift= tk->tkr_raw.shift; + vdata->has_rdtscp = static_cpu_has(X86_FEATURE_RDTSCP); vdata->wall_time_sec= tk->xtime_sec; vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; diff -up linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 linux-4.16-rc5/arch/x86/include/asm/msr.h --- linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/include/asm/msr.h 2018-03-12 09:06:03.902728749 + @@ -218,6 +218,36 @@ static __always_inline unsigned long lon return rdtsc(); } +/** + * rdtscp() - read the current TSC and (optionally) CPU number, with built-in + *cancellation point replacing barrier - only available + *if static_cpu_has(X86_FEATURE_RDTSCP) . + * returns: The 64-bit Time Stamp Counter (TSC) value. 
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain + * the number (Intel CPU ID) of the CPU that the task is currently running on. + * As does EAX_EDT_RET, this uses the "open-coded asm" style to + * force the compiler + assembler to always use (eax, edx, ecx) registers, + * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit + * variables are used - exactly the same code should be generated + * for this instruction on 32-bit as on 64-bit when this asm stanza is used. + * See: SDM , Vol #2, RDTSCP instruction. + */ +static __always_inline u64 rdtscp(u32 *cpu_out) +{ + u32 tsc_lo, tsc_hi, tsc_cpu; + asm volatile + ( "rdtscp" + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); + if ( unlikely(cpu_out != ((voi
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time, or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux.

Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP; when code needs this, the latencies of a syscall are often unacceptably high.

I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

Now the new do_monotonic_raw() function in the vDSO has a latency of about 24ns on average, and the test program tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments '-c 4 -t 120' or any arbitrary -t value.

The patch is against Linus' latest 4.16-rc5 tree, current HEAD of :
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .

This patch affects only these files:

  arch/x86/include/asm/vgtod.h
  arch/x86/entry/vdso/vclock_gettime.c
  arch/x86/entry/vsyscall/vsyscall_gtod.c

There are 2 patches in the series - this first one handles CLOCK_MONOTONIC_RAW in the VDSO using the existing rdtsc_ordered(), and the second uses a new rdtscp() function which avoids use of an explicit barrier.

Best Regards,
Jason Vas Dias
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c ---
Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Good day - On 12/03/2018, Ingo Molnar wrote: > > * Thomas Gleixner wrote: > >> On Mon, 12 Mar 2018, Jason Vas Dias wrote: >> >> checkpatch.pl still reports: >> >>total: 15 errors, 3 warnings, 165 lines checked >> Sorry I didn't see you had responded until 40 mins ago . I finally found where checkpatch.pl is and it now reports : WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #2: --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + WARNING: struct should normally be const #55: FILE: arch/x86/entry/vdso/vclock_gettime.c:282: +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) I don't know how to fix that, since 'ts' cannot be a const pointer. ERROR: Missing Signed-off-by: line(s) I guess that disappears once someone OKs the patch. total: 1 errors, 2 warnings, 127 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ../vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. >> > +notrace static u64 vread_tsc_raw(void) >> > +{ >> > + u64 tsc, last=gtod->raw_cycle_last; >> > + if( likely( gtod->has_rdtscp ) ) >> > + tsc = rdtscp((void*)0); >> >> Plus I asked more than once to split that rdtscp() stuff into a separate >> patch. I misunderstood - I thought you meant the rdtscp implementation which was split into a separate file - but now it is in a separate patch , (attached). >> >> You surely are free to ignore my review comments, but rest assured that >> I'm >> free to ignore the crap you insist to send me as well. > I didn't mean to ignore any comments, and I'm really trying to fix this problem the right way and not produce crap. 
> In addition to Thomas's review feedback I'd strongly urge the careful > reading of > Documentation/SubmittingPatches as well: > > - When sending multiple patches please use git-send-mail > > - Please don't send several patch iterations per day! > > - Code quality of the submitted patches is atrocious, please run them > through >scripts/checkpatch.pl (and make sure they pass) to at least enable the > reading >of them. > > - ... plus dozens of other details described in > Documentation/SubmittingPatches. > > Thanks, > > Ingo > I am reading all those documents and cannot see how the code in the attached patch contravenes any guidelines / best practices - if you can, please clarify phrases like "atrocious style" - I cannot see any style guidelines contravened, and I can show that the numeric output, now produced in 16-30ns, is just as good as the output that took 300-700ns before the patch was applied. Aside from any style comments, are there any content comments ? Sorry, I am new to the latest kernel guidelines. I needed to get this problem solved the right way for use at work today. Thanks for your advice, Best Regards Jason vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch Description: Binary data
[PATCH v4.16-rc4 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, about the same as do_monotonic(), and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing review issues - the next patch will add the rdtscp() function . The patch passes the checkpatch.pl script . Best Regards, Jason Vas Dias .
--- diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 08:12:17.110120433 + @@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc + , last = gtod->raw_cycle_last; + + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +279,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static __always_inline int do_monotonic_raw(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_MONOTONIC_RAW: + if (do_monotonic_raw(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c --- linux-4.16-rc5.1/arch/x86/entry/vsys
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S and adds one new file: arch/x86/include/uapi/asm/vdso_tsc_calibration.h This is a second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Resent : Oops, in previous version of this patch (#2), the comments in the new vdso_tsc_calibration were wrong, for an earlier version - sorry about that. Best Regards, Jason Vas Dias . PATCH 2/2: --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:38:53.019891195 + @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, 
alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S 2018-03-12 05:19:10.765022295 + @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c This is a second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Oops, previous version of this second patch mistakenly copied the changed part of vclock_gettime.c. Best Regards, Jason Vas Dias . diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:38:53.019891195 + @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S 2018-03-12 05:19:10.765022295 + @@ -26,6 +26,7 @@ VERSION __vdso_clock_gettime; __vdso_gettimeofday; __vdso_time; + __vdso_linux_tsc_calibration; }; LINUX_2.5 { diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 2018-03-12 00:2
[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git .
This patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vdso/vdso.lds.S arch/x86/entry/vdso/vdsox32.lds.S arch/x86/entry/vdso/vdso32/vdso32.lds.S arch/x86/entry/vsyscall/vsyscall_gtod.c This is a second patch in the series, which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Best Regards, Jason Vas Dias . --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 2018-03-12 04:29:27.296982872 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 05:10:53.185158334 + @@ -21,6 +21,7 @@ #include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -385,3 +386,41 @@ notrace time_t __vdso_time(time_t *t) } time_t time(time_t *t) __attribute__((weak, alias("__vdso_time"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); + +extern unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *); + +notrace unsigned +__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal) +{ + if ( (gtod->vclock_mode == VCLOCK_TSC) && 
(tsc_cal != ((void*)0UL)) ) + { + tsc_cal -> tsc_khz = gtod->tsc_khz; + tsc_cal -> mult= gtod->raw_mult; + tsc_cal -> shift = gtod->raw_shift; + return 1; + } + return 0; +} + +unsigned linux_tsc_calibration(void) + __attribute((weak, alias("__vdso_linux_tsc_calibration"))); diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S 2018-03-12 05:18:36.380673342 + @@ -25,6 +25,8 @@ VERSION { __vdso_getcpu; time; __vdso_time; + linux_tsc_calibration; + __vdso_linux_tsc_calibration; local: *; }; } diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S --- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 2
[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc5 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/include/asm/msr.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing issues identified by tglx in the mail thread of $subject - mainly that the rdtscp() assembler wrapper function should be in msr.h - it now is.
There is a second patch following in a few minutes which adds a record of the calibrated tsc frequency to the VDSO, and a new header: uapi/asm/vdso_tsc_calibration.h which defines a structure : struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; }; and a getter function in the VDSO that can optionally be used by user-space code to implement sub-nanosecond precision clocks . This second patch is entirely optional but I think greatly expands the scope of user-space TSC readers . Best Regards, Jason Vas Dias . --- diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 2018-03-12 00:25:09.0 + +++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 04:29:27.296982872 + @@ -182,6 +182,19 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc, last=gtod->raw_cycle_last; + if( likely( gtod->has_rdtscp ) ) + tsc = rdtscp((void*)0); +else + tsc = rdtsc_ordered(); + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +216,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +280,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static int __always_inline do_monotonic_raw( struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); + ns >>= gtod->raw_shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct time
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Thanks Thomas - On 11/03/2018, Thomas Gleixner wrote: > On Sun, 11 Mar 2018, Jason Vas Dias wrote: > > This looks better now. Though running that patch through checkpatch.pl > results in: > > total: 28 errors, 20 warnings, 139 lines checked > Hmm, I was unaware of that script, I'll run and find out why - probably because whitespace is not visible in emacs with my monospace font and it is very difficult to see if tabs are used if somehow a '\t\ ' or ' \t' has slipped in . I'll run the script, fix the errors, and repost. > > >> +notrace static u64 vread_tsc_raw(void) > > Why do you need a separate function? I asked you to use vread_tsc(). So you > might have reasons for doing that, but please then explain WHY and not just > throw the stuff in my direction w/o any comment. > Mainly, because vread_tsc() makes its comparison against gtod->cycles_last , a copy of tk->tkr_mono.cycle_last, while vread_tsc_raw() uses gtod->raw_cycle_last, a copy of tk->tkr_raw.cycle_last . And rdtscp has a built-in "barrier", as the comments explain, making rdtsc_ordered()'s 'barrier()' unnecessary . >> +{ >> +u64 tsc, last=gtod->raw_cycle_last; >> +if( likely( gtod->has_rdtscp ) ) { >> +u32 tsc_lo, tsc_hi, >> +tsc_cpu __attribute__((unused)); >> +asm volatile >> +( "rdtscp" >> +/* ^- has built-in cancellation point / pipeline stall >> "barrier" */ >> +: "=a" (tsc_lo) >> +, "=d" (tsc_hi) >> +, "=c" (tsc_cpu) >> +); // since all variables 32-bit, eax, edx, ecx used - >> NOT rax, rdx, rcx >> +tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | >> (((u64)tsc_lo) & 0xffffffffUL); > > This is not required to make the vdso accessor for monotonic raw work. > > If at all then the rdtscp support wants to be in a separate patch with a > proper explanation. > > Aside of that the code for rdtscp wants to be in a proper inline helper in > the relevant header file and written according to the coding style the > kernel uses for asm inlines. 
> Sorry, I will put the function in the same header as rdtsc_ordered () , in a separate patch. > The rest looks ok. > > Thanks, > > tglx > I'll re-generate patches and resend . A complete patch , against 4.15.9, is attached , that I am using , including a suggested '__vdso_linux_tsc_calibration()' function and arch/x86/include/uapi/asm/vdso_tsc_calibration.h file that does not return any pointers into the VDSO . Presuming this was split into separate patches as you suggest, and was against the latest HEAD branch (4.16-rcX), would it be OK to include the vdso_linux_tsc_calibration() work ? It does enable user space code to develop accurate TSC readers which are free to use different structures and pico-second resolution. The actual user-space clock_gettime(CLOCK_MONOTONIC_RAW) replacement I am using for work just reads the TSC , with a latency of < 8ns, and uses the linux_tsc_calibration to convert using floating-point as required. Thanks & Regards, Jason vdso_gettime_monotonic_raw-4.15.9.patch Description: Binary data
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of ~24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing indentation issues after installation of emacs Lisp cc-mode hooks in Documentation/coding-style.rst and calling 'indent-region' and 'tabify' (whitespace-only changes) - SORRY ! (and even after that, somehow 2 '\t\n's got left in vgtod.h - now removed - sorry again!) . Best Regards, Jason Vas Dias . 
PATCH: --- diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c --- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 2018-03-04 22:54:11.0 + +++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 19:00:04.630019100 + @@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void) return last; } +notrace static u64 vread_tsc_raw(void) +{ + u64 tsc, last=gtod->raw_cycle_last; + if( likely( gtod->has_rdtscp ) ) { + u32 tsc_lo, tsc_hi, + tsc_cpu __attribute__((unused)); + asm volatile + ( "rdtscp" + /* ^- has built-in cancellation point / pipeline stall "barrier" */ + : "=a" (tsc_lo) + , "=d" (tsc_hi) + , "=c" (tsc_cpu) + ); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx + tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL); + } else { + tsc = rdtsc_ordered(); + } + if (likely(tsc >= last)) + return tsc; + asm volatile (""); + return last; +} + notrace static inline u64 vgetsns(int *mode) { u64 v; @@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } +notrace static inline u64 vgetsns_raw(int *mode) +{ + u64 v; + cycles_t cycles; + + if (gtod->vclock_mode == VCLOCK_TSC) + cycles = vread_tsc_raw(); +#ifdef CONFIG_PARAVIRT_CLOCK + else if (gtod->vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); +#endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif + else + return 0; + v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask; + return v * gtod->raw_mult; +} + /* Code size doesn't matter (vdso is 4k anyway) and this is faster. 
*/ notrace static int __always_inline do_realtime(struct timespec *ts) { @@ -246,6 +290,27 @@ notrace static int __always_inline do_mo return mode; } +notrace static int __always_inline do_monotonic_raw( struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->monotonic_time_raw_sec; + ns = gtod->monotonic_time_raw_nsec; + ns += vgetsns_raw(&mode); +
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. Sometimes, particularly when correlating elapsed time to performance counter values, code needs to know elapsed time from the perspective of the CPU no matter how "hot" / fast or "cold" / slow it might be running wrt NTP / PTP ; when code needs this, the latencies with a syscall are often unacceptably high. I reported this as Bug #198961 : 'https://bugzilla.kernel.org/show_bug.cgi?id=198961' and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' . This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c This is a resend of the original patch fixing indentation issues after installation of emacs Lisp cc-mode hooks in Documentation/coding-style.rst and calling 'indent-region' and 'tabify' (whitespace only changes) - SORRY ! Best Regards, Jason Vas Dias .
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 19:00:04.630019100 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+	u64 tsc, last=gtod->raw_cycle_last;
+	if( likely( gtod->has_rdtscp ) ) {
+		u32 tsc_lo, tsc_hi,
+		    tsc_cpu __attribute__((unused));
+		asm volatile
+		( "rdtscp"
+		  /* ^- has built-in cancellation point / pipeline stall "barrier" */
+		  : "=a" (tsc_lo)
+		  , "=d" (tsc_hi)
+		  , "=c" (tsc_cpu)
+		); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx
+		tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+	} else {
+		tsc = rdtsc_ordered();
+	}
+	if (likely(tsc >= last))
+		return tsc;
+	asm volatile ("");
+	return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster.
  */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw(&mode);
+
Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
Hi Thomas - Thanks very much for your help & guidance in previous mail: RE: On 08/03/2018, Thomas Gleixner wrote: > > The right way to do that is to put the raw conversion values and the raw > seconds base value into the vdso data and implement the counterpart of > getrawmonotonic64(). And if that is done, then it can be done for _ALL_ > clocksources which support VDSO access and not just for the TSC. > I have done this now with a new patch, sent in mail with subject : '[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW' which should address all the concerns you raise. > I already know how that works, really. I never doubted or meant to impugn that ! I am beginning to know a little how that works also, thanks in great part to your help last week - thanks for your patience. I was impatient last week to get access to low latency timers for a work project, and was trying to read the unadjusted clock . > instead of making completely false claims about the correctness of the kernel > timekeeping infrastructure. I really didn't mean to make any such claims - I'm sorry if I did . I was just trying to say that by the time the results of clock_gettime(CLOCK_MONOTONIC_RAW,&ts) were available to the caller they were not of much use, because the latencies often dwarf the time differences being measured . Anyway, I hope sometime you will consider putting such a patch in the kernel. I have developed a version for ARM also, but that depends on making the CNTPCT + CNTFRQ registers readable in user-space, which is not considered secure and is not normally done , but does work - it is against the Texas Instruments (ti-linux) kernel, can be enabled with a new KConfig option, and brings latencies down from > 300ns to < 20ns . Maybe I should post that also to kernel.org, or to ti.com ? I have a separate patch for the vdso_tsc_calibration export of the tsc_khz and calibration which no longer returns pointers into the VDSO - I can post this as a patch if you like.
Thanks & Best Regards, Jason Vas Dias
[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Currently the VDSO does not handle clock_gettime( CLOCK_MONOTONIC_RAW, &ts ) on Intel / AMD - it calls vdso_fallback_gettime() for this clock, which issues a syscall, having an unacceptably high latency (minimum measurable time or time between measurements) of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family'_'Model : 06_3C machines under various versions of Linux. This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO , by exporting the raw clock calibration, last cycles, last xtime_nsec, and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() . Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns on average, and the test program: tools/testing/selftests/timers/inconsistency-check.c succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value. The patch is against Linus' latest 4.16-rc4 tree, current HEAD of : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git . The patch affects only files: arch/x86/include/asm/vgtod.h arch/x86/entry/vdso/vclock_gettime.c arch/x86/entry/vsyscall/vsyscall_gtod.c Best Regards, Jason Vas Dias .
Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Oops, please disregard 1st mail on $subject - I guess use of Quoted Printable is not a way of getting past the email line length. Patch I tried to send is attached as attachment - will resend inline using other method. Sorry, Regards, Jason [ Attachment: vdso_monotonic_raw-v4.16-rc4.patch ]
Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
On 08/03/2018, Thomas Gleixner wrote: > On Tue, 6 Mar 2018, Jason Vas Dias wrote: >> I will prepare a new patch that meets submission + coding style guidelines >> and >> does not expose pointers within the vsyscall_gtod_data region to >> user-space code - >> but I don't really understand why not, since only the gtod->mult value >> will >> change as long as the clocksource remains TSC, and updates to it by the >> kernel >> are atomic and partial values cannot be read . >> >> The code in the patch reverts to old behavior for clocks which are not >> the >> TSC and provides a way for users to determine if the clock is still the >> TSC >> by calling '__vdso_linux_tsc_calibration()', which would return NULL if >> the clock is not the TSC . >> >> I have never seen Linux on a modern intel box spontaneously decide to >> switch from the TSC clocksource after calibration succeeds and >> it has decided to use the TSC as the system / platform clock source - >> what would make it do this ? >> >> But for the highly controlled systems I am doing performance testing on, >> I can guarantee that the clocksource does not change. > > We are not writing code for a particular highly controlled system. We > expose functionality which operates under all circumstances. There are > various reasons why TSC can be disabled at runtime, crappy BIOS/SMI, > sockets getting out of sync . > >> There is no way user code can write those pointers or do anything other >> than read them, so I see no harm in exposing them to user-space ; then >> user-space programs can issue rdtscp and use the same calibration values >> as the kernel, and use some cached 'previous timespec value' to avoid >> doing the long division every time. >> >> If the shift & mult are not accurate TSC calibration values, then the >> kernel should put other more accurate calibration values in the gtod . > > The raw calibration values are as accurate as the kernel can make them. 
But > they can be rather far off from converting to real nanoseconds for various > reasons. The NTP/PTP adjusted conversion is matching real units and is > obviously more accurate. > >> > Please look at the kernel side implementation of >> > clock_gettime(CLOCK_MONOTONIC_RAW). >> > The VDSO side can be implemented in the >> > same way. >> > All what is required is to expose the relevant information in the >> > existing vsyscall_gtod_data data structure. >> >> I agree - that is my point entirely , & what I was trying to do . > > Well, you did not expose the raw conversion data in vsyscall_gtod_data. You > are using: > > + tsc*= gtod->mult; > + tsc >>= gtod->shift; > > That's is the adjusted mult/shift value which can change when NTP/PTP is > enabled and you _cannot_ use it unprotected. > >> void getrawmonotonic64(struct timespec64 *ts) >> { >> struct timekeeper *tk = &tk_core.timekeeper; >> unsigned long seq; >> u64 nsecs; >> >> do { >> seq = read_seqcount_begin(&tk_core.seq); >> # ^-- I think this is the source of the locking >> #and the very long latencies ! > > This protects tk->raw_sec from changing which would result in random time > stamps. Yes, it can cause slightly larger latencies when the timekeeper is > updated on another CPU concurrently, but that's not the main reason why > this is slower in general than the VDSO functions. The syscall overhead is > there for every invocation and it's substantial. > >> So in fact, when the clock source is TSC, the value recorded in 'ts' >> by clock_gettime(CLOCK_MONOTONIC_RAW, &ts) is very similar to >> u64 tsc = rdtscp(); >> tsc *= gtod->mult; >> tsc >>= gtod->shift; >> ts.tv_sec=tsc / NSEC_PER_SEC; >> ts.tv_nsec=tsc % NSEC_PER_SEC; >> >> which is the algorithm I was using in the VDSO fast TSC reader, >> do_monotonic_raw() . > > Except that you are using the adjusted conversion values and not the raw > ones. 
So your VDSO implementation of monotonic raw access is just wrong and > not matching the syscall based implementation in any way. > >> The problem with doing anything more in the VDSO is that there >> is of course nowhere in the VDSO to store any data, as it has >> no data section or writable pages . So some kind of writable >> page would need to be added to the vdso , complicating its >> vdso/vma.c, etc., w
[PATCH v4.15.7 1/1] x86/vdso: handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) in VDSO
Handling clock_gettime( CLOCK_MONOTONIC_RAW, &timespec ) by calling vdso_fallback_gettime(), i.e. a syscall, is too slow - latencies of 300-700ns are common on Haswell (06:3C) CPUs . This patch against the 4.15.7 stable branch makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW, &ts) by issuing rdtscp in userspace, IFF the clock source is the TSC, and converting it to nanoseconds using the vsyscall_gtod_data 'mult' and 'shift' fields :

volatile u32 tsc_lo, tsc_hi, tsc_cpu;
asm volatile( "rdtscp" : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu) );
u64 tsc = (((u64)tsc_hi)<<32) | ((u64)tsc_lo);
tsc *= gtod->mult;
tsc >>= gtod->shift; /* tsc is now a number of nanoseconds */
ts->tv_sec = __iter_div_u64_rem( tsc, NSEC_PER_SEC, &ts->tv_nsec );

Use of the "open coded asm" style here actually forces the compiler to always choose the 32-bit version of rdtscp, which sets only %eax, %edx, and %ecx and does not clear the high bits of %rax, %rdx, and %rcx , because the variables are declared 32-bit - so the same 32-bit version is used whether the code is compiled with -m32 or -m64 ( tested using gcc 5.4.0, gcc 6.4.1 ) . The full story and test programs are in Bug #198961 : https://bugzilla.kernel.org/show_bug.cgi?id=198961 . The patched VDSO now handles clock_gettime(CLOCK_MONOTONIC_RAW, &ts) on the same machine with a latency (minimum time that can be measured) of around 100ns (compared with 300-700ns before the patch). I also think it makes sense to expose pointers to the live, updated gtod->mult and gtod->shift values somehow to userspace . Then a userspace TSC reader could re-use previous values to avoid doing the long division in most cases and obtain latencies of 10-20ns . Hence there is now a new method in the VDSO: __vdso_linux_tsc_calibration(), which returns a pointer to a 'struct linux_tsc_calibration' declared in a new header arch/x86/include/uapi/asm/vdso_tsc_calibration.h . If the clock source is NOT the TSC, this function returns NULL .
The pointer is only valid when the system clock source is the TSC . User-space TSC readers can detect when TSC is modified with Events, and now can detect when clock source changes from / to TSC with this function . The patch :
---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..e840600 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include <asm/vdso_tsc_calibration.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -246,6 +247,29 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs generated for 64-bit as for 32-bit builds
+u64 ns;
+register u64 tsc=0;
+if (gtod->vclock_mode == VCLOCK_TSC)
+{
+asm volatile
+( "rdtscp"
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // : eax, edx, ecx used - NOT rax, rdx, rcx
+tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+tsc *= gtod->mult;
+tsc >>= gtod->shift;
+ts->tv_sec = __iter_div_u64_rem(tsc, NSEC_PER_SEC, &ns);
+ts->tv_nsec = ns;
+return VCLOCK_TSC;
+}
+return VCLOCK_NONE;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +301,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 	if (do_monotonic(ts) == VCLOCK_NONE)
 		goto fallback;
 	break;
+	case CLOCK_MONOTONIC_RAW:
+		if (do_monotonic_raw(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
 	case CLOCK_REALTIME_COARSE:
 		do_realtime_coarse(ts);
 		break;
@@ -326,3 +354,18 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t) __attribute__((weak, alias("__vdso_time")));
+
+extern const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void);
+
+notrace const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void)
+{
+if( gtod->vclock_mode == VCLOCK_TSC )
+return ((const struct linux_tsc_calibration*) &gtod->mult);
+return 0UL;
+}
+
+const struct linux_tsc_calibration * linux_tsc_calibration(void)
+__attribute((weak, alias("__vdso_linux_tsc_calibration")));
+
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..41a2ca5 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -24,7 +24,9 @@ VERSION {
 		getcpu;
 		__vdso_getcpu;
 		ti
Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
On 06/03/2018, Thomas Gleixner wrote: > Jason, > > On Mon, 5 Mar 2018, Jason Vas Dias wrote: > > thanks for providing this. A few formal nits first. > > Please read Documentation/process/submitting-patches.rst > > Patches need a concise subject line and the subject line wants a prefix, in > this case 'x86/vdso'. > > Please don't put anything past the patch. Your delimiters are human > readable, but cannot be handled by tools. > > Also please follow the kernel coding style guide lines. > >> It also provides a new function in the VDSO : >> >> struct linux_timestamp_conversion >> { u32 mult; >> u32 shift; >> }; >> extern >> const struct linux_timestamp_conversion * >> __vdso_linux_tsc_calibration(void); >> >> which can be used by user-space rdtsc / rdtscp issuers >> by using code such as in >> tools/testing/selftests/vDSO/parse_vdso.c >> to call vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"), >> which returns a pointer to the function in the VDSO, which >> returns the address of the 'mult' field in the vsyscall_gtod_data. > > No, that's just wrong. The VDSO data is solely there for the VDSO accessor > functions and not to be exposed to random user space. > >> Thus user-space programs can use rdtscp and interpret its return values >> in exactly the same way the kernel would, but without entering the >> kernel. > > The VDSO clock_gettime() functions are providing exactly this mechanism. > >> As pointed out in Bug # 198961 : >> https://bugzilla.kernel.org/show_bug.cgi?id=198961 >> which contains extra test programs and the full story behind this >> change, >> using CLOCK_MONOTONIC_RAW without the patch results in >> a minimum measurable time (latency) of @ 300 - 700ns because of >> the syscall used by vdso_fallback_gtod() . >> >> With the patch, the latency falls to @ 100ns . 
>> >> The latency would be @ 16 - 32 ns if the do_monotonic_raw() >> handler could record its previous TSC value and seconds return value >> somewhere, but since the VDSO has no data region or writable page, >> of course it cannot . > > And even if it could, it's not as simple as you want it to be. Clocksources > can change during runtime and without effective protection the values are > just garbage. > >> Hence, to enable effective use of TSC by user space programs, Linux must >> provide a way for them to discover the calibration mult and shift values >> the kernel uses for the clock source ; only by doing so can user-space >> get values that are comparable to kernel generated values. > > Linux must not do anything. It can provide a vdso implementation of > CLOCK_MONOTONIC_RAW, which does not enter the kernel, but not exposure to > data which is not reliably accessible by random user space code. > >> And I'd really like to know: why does the gtod->mult value change ? >> After TSC calibration, it and the shift are calculated to render the >> best approximation of a nanoseconds value from the TSC value. >> >> The TSC is MEANT to be monotonic and to continue in sleep states >> on modern Intel CPUs . So why does the gtod->mult change ? > > You are missing the fact that gtod->mult/shift are used for CLOCK_MONOTONIC > and CLOCK_REALTIME, which are adjusted by NTP/PTP to provide network > synchronized time. That means CLOCK_MONOTONIC is providing accurate > and slope compensated nanoseconds. > > The raw TSC conversion, even if it is sane hardware, provides just some > approximation of nanoseconds which can be off by quite a margin. > >> But the mult value does change. Currently there is no way for user-space >> programs to discover that such a change has occurred, or when . With this >> very tiny simple patch, they could know instantly when such changes >> occur, and could implement TSC readers that perform the full conversion >> with latencies of 15-30ns (on my CPU). 
> > No. Accessing the mult/shift pair without protection is racy and can lead > to completely erratic results. > >> +notrace static int __always_inline do_monotonic_raw( struct timespec >> *ts) >> +{ >> + volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs >> generated for 64-bit as for 32-bit builds >> + u64 ns; >> + register u64 tsc=0; >> + if (gtod->vclock_mode == VCLOCK_TSC) >> + { asm volatile >> + ( "rdtscp" >> + : "=a" (tsc_lo) >> + , "=d" (tsc_hi) >> + , "=c" (tsc_cpu) >> + ); // : eax, edx, ecx used - NOT rax, rdx, rcx > > If you look
[PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer
sum += sample[s];
fprintf(stderr, "sum: %llu\n", sum);
unsigned long long avg_ns = sum / N_SAMPLES;
t1 = (t2 - t_start);
fprintf(stderr, "Total time: %1.1llu.%9.9lluS - Average Latency: %1.1llu.%9.9lluS\n",
        t1/1000000000, t1-((t1/1000000000)*1000000000),
        avg_ns/1000000000, avg_ns-((avg_ns/1000000000)*1000000000));
return 0;
}
: END EXAMPLE

EXAMPLE Usage :
$ gcc -std=gnu11 -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0x7ffdb9be5098: mult: 5798705 shift: 24
sum:
Total time: 0.04859S - Average Latency: 0.00022S

Latencies are typically @ 15 - 30 ns . That multiplication and shift really doesn't leave very many significant seconds bits! Please, can the VDSO include some similar functionality to NOT always enter the kernel for CLOCK_MONOTONIC_RAW , and to export a pointer to the LIVE (kernel updated) gtod->mult and gtod->shift values somehow . The documentation states for CLOCK_MONOTONIC_RAW that it is the same as CLOCK_MONOTONIC except it is NOT subject to NTP adjustments . This is very far from the case currently, without a patch like the one above. And the kernel should not restrict user-space programs to only being able to either measure an NTP adjusted time value, or a time value difference of greater than 1000ns with any accuracy, on a modern Intel CPU whose TSC ticks 2.8 times per nanosecond (picosecond resolution is theoretically possible). Please, include something like the above patch in future Linux versions. Thanks & Best Regards, Jason Vas Dias
Re: perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .
On 13/02/2018, Jason Vas Dias wrote: > Good day - > > I'd much appreciate some advice as to why, on my Intel x86_64 > ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10, > or Linux 3.10.0, any attempt to count all of : > PERF_COUNT_HW_BRANCH_INSTRUCTIONS > (or raw config 0xC4) , and > PERF_COUNT_HW_BRANCH_MISSES > (or raw config 0xC5), and > combined with > PERF_COUNT_HW_CACHE_REFERENCES > (or raw config 0x4F2E ), and > PERF_COUNT_HW_CACHE_MISSES > (or raw config 0x412E) , > results in ALL COUNTERS BEING 0 in a read of the Group FD or > mmap sample area. > > This is demonstrated by the example program, which will > use perf_event_open() to create a Group Leader FD for the first event, > and associate all other events with that Event Group , so that it > will read all events on the group FD . > > The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, &id) > calls all return successfully , but if I combine ANY of > ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS, > PERF_COUNT_HW_BRANCH_MISSES > ) with any of > ( PERF_COUNT_HW_CACHE_REFERENCES, > PERF_COUNT_HW_CACHE_MISSES > ) in the Event Group, ALL events have '0' event->value. > > Demo : > 1. Compile program to use kernel mapped Generic Events: > $ gcc -std=gnu11 -o perf_bug perf_bug.c > Running program shows all counters have 0 values, since both > CACHE & BRANCH hits+misses are being requested: > > $ ./perf_bug > EVENT: Branch Instructions : 0 > EVENT: Branch Misses : 0 > EVENT: Instructions : 0 > EVENT: CPU Cycles : 0 > EVENT: Ref. CPU Cycles : 0 > EVENT: Bus Cycles : 0 > EVENT: Cache References : 0 > EVENT: Cache Misses : 0 > > NOT registering interest in EITHER the BRANCH counters > OR the CACHE counters fixes the problem: > > Compile without registering for BRANCH_INSTRUCTIONS > or BRANCH_MISSES: > $ gcc -std=gnu11 -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4110 > EVENT: Ref. 
CPU Cycles : 4437 > EVENT: Bus Cycles : 152 > EVENT: Cache References : 1 > EVENT: Cache Misses : 1 > > Compile without registering for CACHE_REFERENCES or CACHE_MISSES: > $ gcc -std=gnu11 -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 106 > EVENT: Branch Misses : 6 > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4132 > EVENT: Ref. CPU Cycles : 8526 > EVENT: Bus Cycles : 295 > > The same thing happens if I do not use Generic Events, but rather > "dynamic raw PMU" events, by putting the hex values from > /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr > config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr > type value : > > $ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 0 > EVENT: Branch Misses : 0 > EVENT: Instructions : 0 > EVENT: CPU Cycles : 0 > EVENT: Ref. CPU Cycles : 0 > EVENT: Bus Cycles : 0 > EVENT: Cache References : 0 > EVENT: Cache Misses : 0 > > > $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4102 > EVENT: Ref. CPU Cycles : 4959 > EVENT: Bus Cycles : 171 > EVENT: Cache References : 2 > EVENT: Cache Misses : 2 > > $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c > $ ./perf_bug > EVENT: Branch Instructions : 106 > EVENT: Branch Misses : 6 > EVENT: Instructions : 914 > EVENT: CPU Cycles : 4108 > EVENT: Ref. CPU Cycles : 10817 > EVENT: Bus Cycles : 373 > > > The perf tool itself seems to have the same issue: > > With CACHE & BRANCH counters does not work : > $ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep > 1 > > Performance counter stats for 'sleep 1': > >r0c4 >(0.00%) >r0c5 >(0.00%) >r0c0 >(0.00%) >r03c >(0.00%) >r0300 >(0.00%) >r013c >(0.00%) >r04F2E >(0.00%) > r0412E > >1.001652932 seconds time elapsed > >Some events weren't counted. 
Try disabling the NMI watchdog: > echo 0 > /proc/sys/kernel/nmi_watchdog > perf stat ... > echo 1 > /proc/sys/kernel/nmi_watchdog > > Disabling the NMI watchdog makes no difference . > > It is very strange that perf thinks 'r0412E' is not supp
perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .
Good day - I'd much appreciate some advice as to why, on my Intel x86_64 ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10 or Linux 3.10.0, any attempt to count all of PERF_COUNT_HW_BRANCH_INSTRUCTIONS (raw config 0xC4) and PERF_COUNT_HW_BRANCH_MISSES (raw config 0xC5) combined with PERF_COUNT_HW_CACHE_REFERENCES (raw config 0x4F2E) and PERF_COUNT_HW_CACHE_MISSES (raw config 0x412E) results in ALL COUNTERS BEING 0 in a read of the Group FD or mmap sample area.

This is demonstrated by the example program, which uses perf_event_open() to create a Group Leader FD for the first event and associates all other events with that Event Group, so that it reads all events on the group FD.

The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, &id) calls all return successfully, but if I combine ANY of ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS, PERF_COUNT_HW_BRANCH_MISSES ) with any of ( PERF_COUNT_HW_CACHE_REFERENCES, PERF_COUNT_HW_CACHE_MISSES ) in the Event Group, ALL events have '0' event->value.

Demo :

1. Compile the program to use kernel-mapped Generic Events:

$ gcc -std=gnu11 -o perf_bug perf_bug.c

Running the program shows all counters have 0 values, since both CACHE & BRANCH hits+misses are being requested:

$ ./perf_bug
EVENT: Branch Instructions : 0
EVENT: Branch Misses : 0
EVENT: Instructions : 0
EVENT: CPU Cycles : 0
EVENT: Ref. CPU Cycles : 0
EVENT: Bus Cycles : 0
EVENT: Cache References : 0
EVENT: Cache Misses : 0

NOT registering interest in EITHER the BRANCH counters OR the CACHE counters fixes the problem:

Compile without registering for BRANCH_INSTRUCTIONS or BRANCH_MISSES:

$ gcc -std=gnu11 -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Instructions : 914
EVENT: CPU Cycles : 4110
EVENT: Ref. CPU Cycles : 4437
EVENT: Bus Cycles : 152
EVENT: Cache References : 1
EVENT: Cache Misses : 1

Compile without registering for CACHE_REFERENCES or CACHE_MISSES:

$ gcc -std=gnu11 -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 106
EVENT: Branch Misses : 6
EVENT: Instructions : 914
EVENT: CPU Cycles : 4132
EVENT: Ref. CPU Cycles : 8526
EVENT: Bus Cycles : 295

The same thing happens if I do not use Generic Events, but rather "dynamic raw PMU" events, by putting the hex values from /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr type value :

$ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 0
EVENT: Branch Misses : 0
EVENT: Instructions : 0
EVENT: CPU Cycles : 0
EVENT: Ref. CPU Cycles : 0
EVENT: Bus Cycles : 0
EVENT: Cache References : 0
EVENT: Cache Misses : 0

$ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Instructions : 914
EVENT: CPU Cycles : 4102
EVENT: Ref. CPU Cycles : 4959
EVENT: Bus Cycles : 171
EVENT: Cache References : 2
EVENT: Cache Misses : 2

$ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 106
EVENT: Branch Misses : 6
EVENT: Instructions : 914
EVENT: CPU Cycles : 4108
EVENT: Ref. CPU Cycles : 10817
EVENT: Bus Cycles : 373

The perf tool itself seems to have the same issue. With CACHE & BRANCH counters it does not work :

$ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

   <not counted>   r0c4     (0.00%)
   <not counted>   r0c5     (0.00%)
   <not counted>   r0c0     (0.00%)
   <not counted>   r03c     (0.00%)
   <not counted>   r0300    (0.00%)
   <not counted>   r013c    (0.00%)
   <not counted>   r04F2E   (0.00%)
 <not supported>   r0412E

       1.001652932 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
	echo 0 > /proc/sys/kernel/nmi_watchdog
	perf stat ...
	echo 1 > /proc/sys/kernel/nmi_watchdog

Disabling the NMI watchdog makes no difference .
It is very strange that perf thinks 'r0412E' is not supported :

$ cat /sys/bus/event_source/devices/cpu/cache_misses
event=0x2e,umask=0x41

The kernel should not be advertising an unsupported event in a /sys/bus/event_source/devices/cpu/events/ file, should it ?

So perf stat has the same problem - without either the Cache or the Branch counters it seems to work fine:

without cache:

$ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

   37740   r0c4
    3557   r0c5
  188552   r0c0
  311684   r03c
  360963   r0300
   12461   r013c

       1.001508109 seconds time elapsed

without branch:

$ perf stat -e '{r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

  188554   r0c0 32
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I have found a new source of weirdness with the TSC using clock_gettime(CLOCK_MONOTONIC_RAW,&ts) :

The vsyscall_gtod_data.mult field changes somewhat between calls to clock_gettime(CLOCK_MONOTONIC_RAW,&ts), so that sometimes an extra (2^24) nanoseconds are added or removed from the value derived from the TSC and stored in 'ts' .

This is demonstrated by the output of the test program in the attached ttsc.tar file:

$ ./tlgtd
it worked! - GTOD: clock:1 mult:5798662 shift:24 synced - mult now: 5798661

What it is doing is finding the address of the 'vsyscall_gtod_data' structure from /proc/kallsyms, mapping that virtual address to an ELF section offset within /proc/kcore, and reading just the 'vsyscall_gtod_data' structure into user-space memory .

Really, this 'mult' value, which is used to compute the seconds|nanoseconds value as ( tsc_cycles * mult ) >> shift (where shift is 24), should not change after it is first initialized . The TSC is meant to be FIXED FREQUENCY, right ? So how could / why should the conversion function from TSC ticks to nanoseconds change ?

So now it is doubly difficult for user-space libraries to make their RDTSC-derived seconds|nanoseconds values correlate well with those returned by the kernel, because they must regularly re-read the updated 'mult' value used by the kernel . I really don't think the kernel should randomly be deciding to increase / decrease the TSC tick period by 2^24 nanoseconds! Is this a bug or intentional ?

I am searching for all places where a '[.>]mult.*=' occurs, but this returns rather a lot of matches.

Please could a future version of linux at least export the 'mult' and 'shift' values for the current clocksource !
Regards, Jason On 22/02/2017, Jason Vas Dias wrote: > OK, last post on this issue today - > can anyone explain why, with standard 4.10.0 kernel & no new > 'notsc_adjust' option, and the same maths being used, these two runs > should display > such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts) > values ? : > > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.00641 ns2: 0.02850 > ts3 - ts2: 175 ns1: 0.00659 > ts3 - ts2: 18 ns1: 0.00643 > ts3 - ts2: 18 ns1: 0.00618 > ts3 - ts2: 17 ns1: 0.00620 > ts3 - ts2: 17 ns1: 0.00616 > ts3 - ts2: 18 ns1: 0.00641 > ts3 - ts2: 18 ns1: 0.00709 > ts3 - ts2: 20 ns1: 0.00763 > ts3 - ts2: 20 ns1: 0.00735 > ts3 - ts2: 20 ns1: 0.00761 > t1 - t0: 78200 - ns2: 0.80824 > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.01294 ns2: 0.05375 > ts3 - ts2: 210 ns1: 0.01418 > ts3 - ts2: 23 ns1: 0.01399 > ts3 - ts2: 22 ns1: 0.01445 > ts3 - ts2: 25 ns1: 0.01321 > ts3 - ts2: 20 ns1: 0.01428 > ts3 - ts2: 25 ns1: 0.01367 > ts3 - ts2: 23 ns1: 0.01425 > ts3 - ts2: 23 ns1: 0.01357 > ts3 - ts2: 22 ns1: 0.01487 > ts3 - ts2: 25 ns1: 0.01377 > t1 - t0: 145753 - ns2: 0.000150781 > > (complete source of test program ttsc1 attached in ttsc.tar > $ tar -xpf ttsc.tar > $ cd ttsc > $ make > ). > > On 22/02/2017, Jason Vas Dias wrote: >> I actually tried adding a 'notsc_adjust' kernel option to disable any >> setting or >> access to the TSC_ADJUST MSR, but then I see the problems - a big >> disparity >> in values depending on which CPU the thread is scheduled - and no >> improvement in clock_gettime() latency. 
So I don't think the new >> TSC_ADJUST >> code in ts_sync.c itself is the issue - but something added @ 460ns >> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . >> As I don't think fixing the clock_gettime() latency issue is my problem >> or >> even >> possible with current clock architecture approach, it is a non-issue. >> >> But please, can anyone tell me if are there any plans to move the time >> infrastructure out of the kernel and into glibc along the lines >> outlined >> in previous mail - if not, I am going to concentrate on this more radical >> overhaul approach for my own systems . >> >> At least, I think mapping the clocksource information structure itself in >> some >> kind of sharable page makes sense . Processes could map that page >> copy-on-write >> so they could start off with all the timing parameters preloaded, then >> keep >> their copy up
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
OK, last post on this issue today - can anyone explain why, with standard 4.10.0 kernel & no new 'notsc_adjust' option, and the same maths being used, these two runs should display such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts) values ? : $ J/pub/ttsc/ttsc1 max_extended_leaf: 8008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.00641 ns2: 0.02850 ts3 - ts2: 175 ns1: 0.00659 ts3 - ts2: 18 ns1: 0.00643 ts3 - ts2: 18 ns1: 0.00618 ts3 - ts2: 17 ns1: 0.00620 ts3 - ts2: 17 ns1: 0.00616 ts3 - ts2: 18 ns1: 0.00641 ts3 - ts2: 18 ns1: 0.00709 ts3 - ts2: 20 ns1: 0.00763 ts3 - ts2: 20 ns1: 0.00735 ts3 - ts2: 20 ns1: 0.00761 t1 - t0: 78200 - ns2: 0.80824 $ J/pub/ttsc/ttsc1 max_extended_leaf: 8008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.01294 ns2: 0.05375 ts3 - ts2: 210 ns1: 0.01418 ts3 - ts2: 23 ns1: 0.01399 ts3 - ts2: 22 ns1: 0.01445 ts3 - ts2: 25 ns1: 0.01321 ts3 - ts2: 20 ns1: 0.01428 ts3 - ts2: 25 ns1: 0.01367 ts3 - ts2: 23 ns1: 0.01425 ts3 - ts2: 23 ns1: 0.01357 ts3 - ts2: 22 ns1: 0.01487 ts3 - ts2: 25 ns1: 0.01377 t1 - t0: 145753 - ns2: 0.000150781 (complete source of test program ttsc1 attached in ttsc.tar $ tar -xpf ttsc.tar $ cd ttsc $ make ). On 22/02/2017, Jason Vas Dias wrote: > I actually tried adding a 'notsc_adjust' kernel option to disable any > setting or > access to the TSC_ADJUST MSR, but then I see the problems - a big > disparity > in values depending on which CPU the thread is scheduled - and no > improvement in clock_gettime() latency. So I don't think the new > TSC_ADJUST > code in ts_sync.c itself is the issue - but something added @ 460ns > onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . 
> As I don't think fixing the clock_gettime() latency issue is my problem or > even > possible with current clock architecture approach, it is a non-issue. > > But please, can anyone tell me if are there any plans to move the time > infrastructure out of the kernel and into glibc along the lines > outlined > in previous mail - if not, I am going to concentrate on this more radical > overhaul approach for my own systems . > > At least, I think mapping the clocksource information structure itself in > some > kind of sharable page makes sense . Processes could map that page > copy-on-write > so they could start off with all the timing parameters preloaded, then > keep > their copy updated using the rdtscp instruction , or msync() (read-only) > with the kernel's single copy to get the latest time any process has > requested. > All real-time parameters & adjustments could be stored in that page , > & eventually a single copy of the tzdata could be used by both kernel > & user-space. > That is what I am working towards. Any plans to make linux real-time tsc > clock user-friendly ? > > > > On 22/02/2017, Jason Vas Dias wrote: >> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is >> read or written . It is probably because it genuinuely does not >> support any cpuid > 13 , >> or the modern TSC_ADJUST interface . This is probably why my >> clock_gettime() >> latencies are so bad. Now I have to develop a patch to disable all access >> to >> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . >> I really have an unlucky CPU :-) . 
>> >> But really, I think this issue goes deeper into the fundamental limits of >> time measurement on Linux : it is never going to be possible to measure >> minimum times with clock_gettime() comparable with those returned by >> rdtscp instruction - the time taken to enter the kernel through the VDSO, >> queue an access to vsyscall_gtod_data via a workqueue, access it & do >> computations & copy value to user-space is NEVER going to be up to the >> job of measuring small real-time durations of the order of 10-20 TSC >> ticks >> . >> >> I think the best way to solve this problem going forward would be to >> store >> the entire vsyscall_gtod_data data structure representing the current >> clocksource >> in a shared page which is memory-mappable (read-only) by user-space . >> I think sser-space programs should be able to do something like : >> int fd = >> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); >> size_t psz = getpagesize(); >> void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); >> msync(gtod,psz,MS_SYNC); >>
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I actually tried adding a 'notsc_adjust' kernel option to disable any setting or access to the TSC_ADJUST MSR, but then I see the problems - a big disparity in values depending on which CPU the thread is scheduled on - and no improvement in clock_gettime() latency. So I don't think the new TSC_ADJUST code in tsc_sync.c itself is the issue - but something added @ 460ns onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . As I don't think fixing the clock_gettime() latency issue is my problem, or even possible with the current clock architecture approach, it is a non-issue.

But please, can anyone tell me if there are any plans to move the time infrastructure out of the kernel and into glibc along the lines outlined in previous mail - if not, I am going to concentrate on this more radical overhaul approach for my own systems .

At least, I think mapping the clocksource information structure itself in some kind of sharable page makes sense . Processes could map that page copy-on-write so they could start off with all the timing parameters preloaded, then keep their copy updated using the rdtscp instruction , or msync() (read-only) with the kernel's single copy to get the latest time any process has requested. All real-time parameters & adjustments could be stored in that page , & eventually a single copy of the tzdata could be used by both kernel & user-space. That is what I am working towards. Any plans to make the linux real-time tsc clock user-friendly ?

On 22/02/2017, Jason Vas Dias wrote:
> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
> read or written . It is probably because it genuinely does not
> support any cpuid > 13 , or the modern TSC_ADJUST interface .
> This is probably why my clock_gettime() latencies are so bad.
> Now I have to develop a patch to disable all access to
> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
> I really have an unlucky CPU :-) .
> > But really, I think this issue goes deeper into the fundamental limits of > time measurement on Linux : it is never going to be possible to measure > minimum times with clock_gettime() comparable with those returned by > rdtscp instruction - the time taken to enter the kernel through the VDSO, > queue an access to vsyscall_gtod_data via a workqueue, access it & do > computations & copy value to user-space is NEVER going to be up to the > job of measuring small real-time durations of the order of 10-20 TSC ticks > . > > I think the best way to solve this problem going forward would be to store > the entire vsyscall_gtod_data data structure representing the current > clocksource > in a shared page which is memory-mappable (read-only) by user-space . > I think sser-space programs should be able to do something like : > int fd = > open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); > size_t psz = getpagesize(); > void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); > msync(gtod,psz,MS_SYNC); > > Then they could all read the real-time clock values as they are updated > in real-time by the kernel, and know exactly how to interpret them . > > I also think that all mktime() / gmtime() / localtime() timezone handling > functionality should be > moved to user-space, and that the kernel should actually load and link in > some > /lib/libtzdata.so > library, provided by glibc / libc implementations, that is exactly the > same library > used by glibc() code to parse tzdata ; tzdata should be loaded at boot time > by the kernel from the same places glibc loads it, and both the kernel and > glibc should use identical mktime(), gmtime(), etc. functions to access it, > and > glibc using code would not need to enter the kernel at all for any > time-handling > code. 
This tzdata-library code be automatically loaded into process images > the > same way the vdso region is , and the whole system could access only one > copy of it and the 'gtod.page' in memory. > > That's just my two-cents worth, and how I'd like to eventually get > things working > on my system. > > All the best, Regards, > Jason > > > > > > > > > > > > > > On 22/02/2017, Jason Vas Dias wrote: >> On 22/02/2017, Jason Vas Dias wrote: >>> RE: >>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR. >>> >>> I just built an unpatched linux v4.10 with tglx's TSC improvements - >>> much else improved in this kernel (like iwlwifi) - thanks! >>> >>> I have attached an updated version of the test program which >>> doesn't print the bogus "Nominal TSC Frequency" (the previous >>> version printed it, but equa
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is read or written . It is probably because it genuinely does not support any cpuid > 13 , or the modern TSC_ADJUST interface . This is probably why my clock_gettime() latencies are so bad. Now I have to develop a patch to disable all access to the TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . I really have an unlucky CPU :-) .

But really, I think this issue goes deeper, into the fundamental limits of time measurement on Linux : it is never going to be possible to measure minimum times with clock_gettime() comparable with those returned by the rdtscp instruction - the time taken to enter the kernel through the VDSO, queue an access to vsyscall_gtod_data via a workqueue, access it, do computations & copy the value to user-space is NEVER going to be up to the job of measuring small real-time durations of the order of 10-20 TSC ticks .

I think the best way to solve this problem going forward would be to store the entire vsyscall_gtod_data data structure representing the current clocksource in a shared page which is memory-mappable (read-only) by user-space . I think user-space programs should be able to do something like :

  int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page", O_RDONLY);
  size_t psz = getpagesize();
  void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
  msync(gtod, psz, MS_SYNC);

Then they could all read the real-time clock values as they are updated in real-time by the kernel, and know exactly how to interpret them .
I also think that all mktime() / gmtime() / localtime() timezone-handling functionality should be moved to user-space, and that the kernel should actually load and link in some /lib/libtzdata.so library, provided by glibc / libc implementations, that is exactly the same library used by glibc code to parse tzdata ; tzdata should be loaded at boot time by the kernel from the same places glibc loads it, and both the kernel and glibc should use identical mktime(), gmtime(), etc. functions to access it, and glibc-using code would not need to enter the kernel at all for any time-handling code. This tzdata-library code could be automatically loaded into process images the same way the vdso region is , and the whole system could access only one copy of it and the 'gtod.page' in memory.

That's just my two-cents worth, and how I'd like to eventually get things working on my system.

All the best, Regards,
Jason

On 22/02/2017, Jason Vas Dias wrote:
> On 22/02/2017, Jason Vas Dias wrote:
>> RE:
>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>
>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>> much else improved in this kernel (like iwlwifi) - thanks!
>>
>> I have attached an updated version of the test program which
>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>> version printed it, but equally ignored it).
>>
>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>>
>> $ uname -r
>> 4.10.0
>> $ ./ttsc1
>> max_extended_leaf: 8008
>> has tsc: 1 constant: 1
>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.00588 ns2: 0.02599 >> ts3 - ts2: 178 ns1: 0.00592 >> ts3 - ts2: 14 ns1: 0.00577 >> ts3 - ts2: 14 ns1: 0.00651 >> ts3 - ts2: 17 ns1: 0.00625 >> ts3 - ts2: 17 ns1: 0.00677 >> ts3 - ts2: 17 ns1: 0.00626 >> ts3 - ts2: 17 ns1: 0.00627 >> ts3 - ts2: 17 ns1: 0.00627 >> ts3 - ts2: 18 ns1: 0.00655 >> ts3 - ts2: 17 ns1: 0.00631 >> t1 - t0: 89067 - ns2: 0.91411 >> > > > Oops, going blind in my old age. These latencies are actually 3 times > greater than under 4.8 !! > > Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as > shown > in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: > > ts3 - ts2: 24 ns1: 0.00162 > ts3 - ts2: 17 ns1: 0.00143 > ts3 - ts2: 17 ns1: 0.00146 > ts3 - ts2: 17 ns1: 0.00149 > ts3 - ts2: 17 ns1: 0.00141 > ts3 - ts2: 16 ns1: 0.00142 > > now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ > 600ns, @ 4 times more than under 4.8 . > But I'm glad the TSC_ADJUST problems are fixed. > > Will programs reading : > $ cat /sys/devices/msr/events/tsc > event=0x00 > read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the > TSC ? > >> I think this is because under Linux 4.8, the CPU got a fault every >> time it read the TSC_ADJUST MSR. > > maybe it still is! > > >> But
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
On 22/02/2017, Jason Vas Dias wrote: > RE: >>> 4.10 has new code which utilizes the TSC_ADJUST MSR. > > I just built an unpatched linux v4.10 with tglx's TSC improvements - > much else improved in this kernel (like iwlwifi) - thanks! > > I have attached an updated version of the test program which > doesn't print the bogus "Nominal TSC Frequency" (the previous > version printed it, but equally ignored it). > > The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by > a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : > > $ uname -r > 4.10.0 > $ ./ttsc1 > max_extended_leaf: 8008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. > ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.00588 ns2: 0.02599 > ts3 - ts2: 178 ns1: 0.00592 > ts3 - ts2: 14 ns1: 0.00577 > ts3 - ts2: 14 ns1: 0.00651 > ts3 - ts2: 17 ns1: 0.00625 > ts3 - ts2: 17 ns1: 0.00677 > ts3 - ts2: 17 ns1: 0.00626 > ts3 - ts2: 17 ns1: 0.00627 > ts3 - ts2: 17 ns1: 0.00627 > ts3 - ts2: 18 ns1: 0.00655 > ts3 - ts2: 17 ns1: 0.00631 > t1 - t0: 89067 - ns2: 0.91411 > Oops, going blind in my old age. These latencies are actually 3 times greater than under 4.8 !! Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: ts3 - ts2: 24 ns1: 0.00162 ts3 - ts2: 17 ns1: 0.00143 ts3 - ts2: 17 ns1: 0.00146 ts3 - ts2: 17 ns1: 0.00149 ts3 - ts2: 17 ns1: 0.00141 ts3 - ts2: 16 ns1: 0.00142 now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns, @ 4 times more than under 4.8 . But I'm glad the TSC_ADJUST problems are fixed. Will programs reading : $ cat /sys/devices/msr/events/tsc event=0x00 read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the TSC ? > I think this is because under Linux 4.8, the CPU got a fault every > time it read the TSC_ADJUST MSR. maybe it still is! 
> But user programs wanting to use the TSC and correlate its value to > clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above > program still have to dig the TSC frequency value out of the kernel > with objdump - this was really the point of the bug #194609. > > I would still like to investigate exporting 'tsc_khz' & 'mult' + > 'shift' values via sysfs. > > Regards, > Jason. > > > > > > On 21/02/2017, Jason Vas Dias wrote: >> Thank You for enlightening me - >> >> I was just having a hard time believing that Intel would ship a chip >> that features a monotonic, fixed frequency timestamp counter >> without specifying in either documentation or on-chip or in ACPI what >> precisely that hard-wired frequency is, but I now know that to >> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU >> assert CPUID:8007[8] ( InvariantTSC ) which it does, which is >> difficult to reconcile with the statement in the SDM : >> 17.16.4 Invariant Time-Keeping >> The invariant TSC is based on the invariant timekeeping hardware >> (called Always Running Timer or ART), that runs at the core crystal >> clock >> frequency. The ratio defined by CPUID leaf 15H expresses the >> frequency >> relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] >> != >> 0 >> and CPUID.8007H:EDX[InvariantTSC] = 1, the following linearity >> relationship holds between TSC and the ART hardware: >> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) >> / CPUID.15H:EAX[31:0] + K >> Where 'K' is an offset that can be adjusted by a privileged agent*2. >> When ART hardware is reset, both invariant TSC and K are also reset. >> >> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and >> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) >> that >> the "Nominal TSC Frequency" formulae in the manul must apply to all >> CPUs with InvariantTSC . 
>> >> Do I understand correctly , that since I do have InvariantTSC , the >> TSC_Value is in fact calculated according to the above formula, but with >> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to >> TSC frequency ? >> It was obvious this nominal TSC Frequency had nothing to do with the >> actual TSC frequency used by Linux, which is 'tsc_khz' . >> I guess wishful thinking led me to believe CPUID:15h was actually >> supported somehow , because I thought InvariantTSC meant it had ART >
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
RE: >> 4.10 has new code which utilizes the TSC_ADJUST MSR. I just built an unpatched linux v4.10 with tglx's TSC improvements - much else improved in this kernel (like iwlwifi) - thanks! I have attached an updated version of the test program which doesn't print the bogus "Nominal TSC Frequency" (the previous version printed it, but equally ignored it). The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : $ uname -r 4.10.0 $ ./ttsc1 max_extended_leaf: 8008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.00588 ns2: 0.02599 ts3 - ts2: 178 ns1: 0.00592 ts3 - ts2: 14 ns1: 0.00577 ts3 - ts2: 14 ns1: 0.00651 ts3 - ts2: 17 ns1: 0.00625 ts3 - ts2: 17 ns1: 0.00677 ts3 - ts2: 17 ns1: 0.00626 ts3 - ts2: 17 ns1: 0.00627 ts3 - ts2: 17 ns1: 0.00627 ts3 - ts2: 18 ns1: 0.00655 ts3 - ts2: 17 ns1: 0.00631 t1 - t0: 89067 - ns2: 0.91411 I think this is because under Linux 4.8, the CPU got a fault every time it read the TSC_ADJUST MSR. But user programs wanting to use the TSC and correlate its value to clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above program still have to dig the TSC frequency value out of the kernel with objdump - this was really the point of the bug #194609. I would still like to investigate exporting 'tsc_khz' & 'mult' + 'shift' values via sysfs. Regards, Jason. 
On 21/02/2017, Jason Vas Dias wrote:
> Thank You for enlightening me -
>
> I was just having a hard time believing that Intel would ship a chip
> that features a monotonic, fixed-frequency timestamp counter
> without specifying in either documentation or on-chip or in ACPI what
> precisely that hard-wired frequency is, but I now know that to
> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
> assert CPUID:80000007H[8] ( InvariantTSC ), which it does, when that is
> difficult to reconcile with the statement in the SDM :
>
>   17.16.4 Invariant Time-Keeping
>   The invariant TSC is based on the invariant timekeeping hardware
>   (called Always Running Timer or ART), that runs at the core crystal
>   clock frequency. The ratio defined by CPUID leaf 15H expresses the
>   frequency relationship between the ART hardware and TSC.
>   If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1,
>   the following linearity relationship holds between TSC and the ART
>   hardware:
>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) / CPUID.15H:EAX[31:0] + K
>   Where 'K' is an offset that can be adjusted by a privileged agent*2.
>   When ART hardware is reset, both invariant TSC and K are also reset.
>
> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that
> the "Nominal TSC Frequency" formulae in the manual must apply to all
> CPUs with InvariantTSC .
>
> I do strongly suggest that Linux exports its calibrated TSC KHz
> somewhere to user space .
>
> I think the best long-term solution would be to allow programs to
> somehow read the TSC without invoking
> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) & having to enter the kernel,
> which incurs an overhead of >120ns on my system .
>
> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
> 'clocksource->shift' values to /sysfs somehow ?
>
> For instance, only if the 'current_clocksource' is 'tsc', then these
> values could be exported as:
>   /sys/devices/system/clocksource/clocksource0/shift
>   /sys/devices/system/clocksource/clocksource0/mult
>   /sys/devices/system/clocksource/clocksource0/freq
>
> So user-space programs could know that the value returned by
> clock_gettime(CLOCK_MONOTONIC_RAW) would be
>   { .tv_sec  = ( ( rdtsc() * mult ) >> shift ) >> 32,
>     .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U }
> and that represents ticks of period (1.0 / ( freq
Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
":%d:(%s): must be called with invariant TSC enabled.\n");
    return 0;
  }
  U32_t tsc_hi, tsc_lo;
  register UL_t tsc;
  asm volatile
  ( "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "mov %%ecx, %2\n\t"
    : "=m" (tsc_hi), "=m" (tsc_lo), "=m" (_ia64_tsc_user_cpu)
    : : "%eax", "%ecx", "%edx"
  );
  tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
  return tsc;
}

__thread U64_t _ia64_first_tsc = 0xffffffffffffffffUL;

static inline __attribute__((always_inline)) U64_t IA64_tsc_ticks_since_start()
{
  if (_ia64_first_tsc == 0xffffffffffffffffUL)
  {
    _ia64_first_tsc = IA64_tsc_now();
    return 0;
  }
  return IA64_tsc_now() - _ia64_first_tsc;
}

static inline __attribute__((always_inline)) void
ia64_tsc_calc_mult_shift(register U32_t *mult, register U32_t *shift)
{ /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
   * calculates second + nanosecond mult + shift the same way linux does.
   * we want to be compatible with what linux returns in struct timespec ts
   * after a call to clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
   */
  const U32_t scale = 1000U;
  register U32_t from = IA64_tsc_khz();
  register U32_t to = NSEC_PER_SEC / scale;
  register U64_t sec = (~0UL / from) / scale;
  sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
  register U64_t maxsec = sec * scale;
  UL_t tmp;
  U32_t sft, sftacc = 32;
  /*
   * Calculate the shift factor which is limiting the conversion range:
   */
  tmp = (maxsec * from) >> 32;
  while (tmp)
  {
    tmp >>= 1;
    sftacc--;
  }
  /*
   * Find the conversion shift/mult pair which has the best
   * accuracy and fits the maxsec conversion range:
   */
  for (sft = 32; sft > 0; sft--)
  {
    tmp = ((UL_t)to) << sft;
    tmp += from / 2;
    tmp = tmp / from;
    if ((tmp >> sftacc) == 0)
      break;
  }
  *mult = tmp;
  *shift = sft;
}

__thread U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift = ~0U;

static inline __attribute__((always_inline)) U64_t IA64_s_ns_since_start()
{
  if ((_ia64_tsc_mult == ~0U) || (_ia64_tsc_shift == ~0U))
    ia64_tsc_calc_mult_shift(&_ia64_tsc_mult, &_ia64_tsc_shift);
  register U64_t cycles = IA64_tsc_ticks_since_start();
  register U64_t ns = (cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift;
  return (((ns / NSEC_PER_SEC) & 0xffffffffUL) << 32)
       | ((ns % NSEC_PER_SEC) & 0x3fffffffUL);
  /* Yes, we are purposefully ignoring durations of more than 4.2 billion seconds here! */
}

I think Linux should export the 'tsc_khz', 'mult' and 'shift' values somehow; then user-space libraries could have more confidence in using 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.

Regards, Jason

On 20/02/2017, Thomas Gleixner wrote:
> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>
>> CPUID:15H is available in user-space, returning the integers ( 7,
>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>> in detect_art() in tsc.c,
>
> By some definition of available. You can feed CPUID random leaf numbers and
> it will return something, usually the value of the last valid CPUID leaf,
> which is 13 on your CPU. A similar CPU model has
>
> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
>
> i.e. 7, 832, 832, 0
>
> Looks familiar, right?
>
> You can verify that with 'cpuid -1 -r' on your machine.
>
>> Linux does not think ART is enabled, and does not set the synthesized
>> CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would
>> not see this bit set .
>
> Rightfully so. This is a Haswell Core model.
>
>> if an e1000 NIC card had been installed, PTP would not be available.
>
> PTP is independent of the ART kernel feature. ART just provides enhanced
> PTP features. You are confusing things here.
>
> The ART feature as the kernel sees it is a hardware extension which feeds
> the ART clock to peripherals for timestamping and time correlation
> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so
> the kernel can make use of that correlation, e.g. for enhanced PTP
> accuracy.
>
> It's correct, that the NONSTOP_TSC feature depends on the availability of
> ART, but that has nothing to do with the feature bit, which solely
> describes the ratio between TSC and the ART frequency which is exposed to
> peripherals. That frequency is not necessarily the real ART frequency.
>
>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0
>> because the CPU will always get a fault reading the MSR since it has
>>
[PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
Patch to make tsc.c set X86_FEATURE_ART and set up the TSC_ADJUST MSR correctly on my "i7-4910MQ" CPU, which reports ( boot_cpu_data.cpuid_level==0x13 && boot_cpu_data.extended_cpuid_level==0x80000008 ), so the code didn't think it supported CPUID:15h, but it does .

Patch:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */
 
 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)
 
@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF) {
+		if (boot_cpu_data.extended_cpuid_level >=
+		    MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART) {
+			pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+				boot_cpu_data.cpuid_level, ART_CPUID_LEAF,
+				boot_cpu_data.extended_cpuid_level);
+		} else {
+			return;
+		}
+	}
+
+	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+	      &art_to_tsc_numerator, v, v+1);
+
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	    art_to_tsc_denominator < ART_MIN_DENOMINATOR) {
+		pr_info("Not using Intel ART for TSC - HYPERVISOR:%d NO NONSTOP_TSC:%d bad TSC/Crystal ratio denominator: %d.",
+			boot_cpu_has(X86_FEATURE_HYPERVISOR),
+			!boot_cpu_has(X86_FEATURE_NONSTOP_TSC),
+			art_to_tsc_denominator);
+		return;
+	}
+
+	/* will get fault on first read if nothing written yet */
+	if ((v[0] = rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset)) != 0) {
+		if ((v[1] = wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0)) != 0) {
+			pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n",
+				v[0], v[1]);
+			return;
+		} else {
+			/* perhaps initialize to -1 * current rdtsc value ? */
+			art_to_tsc_offset = 0;
+			pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",
+				art_to_tsc_offset);
+		}
+	}
+	/* Make this sticky over multiple CPU init calls */
+	pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n",
+		X86_FEATURE_ART, art_to_tsc_numerator, art_to_tsc_denominator,
+		art_to_tsc_offset);
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }

I originally reported this issue on bugzilla.kernel.org : bug # 194609 : https://bugzilla.kernel.org/show_bug.cgi?id=194609 , but it was not posted to the list ; then I posted it to the list, but Julia Lawall kindly suggested I should re-post with the patch inline, and include extra recipients, including the last person to modify tsc.c (Prarit), so I am doing so.
My CPU reports 'model name' as "Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz" , has 4 physical & 8 hyperthreading cores with a frequency scalable from 800000 to 3900000 kHz (/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq) , and flags :

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user-space, returning the integers ( 7, 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so in detect_art() in tsc.c, Linux does not think ART is enabled, and does not set the synthesized CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not see this bit set . If an e1000 NIC card had been installed, PTP would not be available.
[PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
I originally reported this issue on bugzilla.kernel.org : bug # 194609 : https://bugzilla.kernel.org/show_bug.cgi?id=194609 , but it was not posted to the list .

My CPU reports 'model name' as "Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz" , has 4 physical & 8 hyperthreading cores with a frequency scalable from 800000 to 3900000 kHz (/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq) , and flags :

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user-space, returning the integers ( 7, 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so in detect_art() in tsc.c, Linux does not think ART is enabled, and does not set the synthesized CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not see this bit set . If an e1000 NIC card had been installed, PTP would not be available. Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be nowhere else in Linux, the code will always think X86_FEATURE_ART is 0, because the CPU will always get a fault reading the MSR since it has never been written.

So the attached patch makes tsc.c set X86_FEATURE_ART correctly, and set the TSC_ADJUST MSR to 0 if the rdmsr gets an error . Please consider applying it to a future linux version.
It would be nice if user-space programs that want to use the TSC with rdtsc / rdtscp instructions, such as the demo program attached to the bug report, could have confidence that Linux is actually generating the results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a predictable way from the TSC, by looking at the /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space use of TSC values, so that they can correlate TSC values with linux clock_gettime() values.

The patch applies to the linux kernel v4.8 & v4.9.10 GIT tags; the kernels build and run, and the demo program produces results like :

$ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 261 ts3 - ts2: 211 ns1: 0.00146 ns2: 0.01629
ts3 - ts2: 27 ns1: 0.00168
ts3 - ts2: 20 ns1: 0.00147
ts3 - ts2: 14 ns1: 0.00152
ts3 - ts2: 15 ns1: 0.00151
ts3 - ts2: 15 ns1: 0.00153
ts3 - ts2: 15 ns1: 0.00150
ts3 - ts2: 20 ns1: 0.00148
ts3 - ts2: 19 ns1: 0.00164
ts3 - ts2: 19 ns1: 0.00164
ts3 - ts2: 19 ns1: 0.00160
t1 - t0: 52901 - ns2: 0.53951

The value 'ts3 - ts2' is the number of nanoseconds measured by successive calls to 'rdtscp'; the 'ns1' value is the number of nanoseconds (shown as decimal seconds) measured by clock_gettime(CLOCK_MONOTONIC_RAW, &ts2) - clock_gettime(CLOCK_MONOTONIC_RAW, &ts1), casting each {ts.tv_sec, ts.tv_nsec} to a 128-bit long long integer . It shows a user-space program can read the TSC with a latency of @20ns, but can only measure times >= @140ns using Linux clock_gettime() on this CPU.

Attachment: x86_kernel_tsc-bz194609.patch
Re: please, where has xconfig KConf option documentation gone with linux 4.8's Qt5 / Qt4 xconfig ?
Aha, thanks! I never would have known this without being told - there is no visible indication that the symbol info pane exists at all until one tries to drag the lower right corner of the window north-eastwards - is this meant to be somehow an intuitive thing to do these days to view more info ? I did manage to view the option documentation with nconfig / using emacs to view the KConf files (preferable).

Really, it would be nice if xconfig had some 'View' menu and one could select View -> Option Documentation, or press a key over an option to view the documentation for it, and if the geometry of the different panes was correct at startup - the whole Option value pane initially appears on the far right-hand side, about 10 pixels wide, until resized ; and there really is no sign of the documentation pane at all until the lower right-hand corner is dragged.

Also, in the Help -> Introduction panel, it says :
"Toggling Show Debug Info under the Options menu will show the dependencies..."
but there is no "Show Debug Info" option on the Options menu - sounds like it might be a useful feature - should I be seeing a "Show Debug Info" option ? Why don't I see one ? Maybe the Options menu might be a good place to put an "Expand Option Documentation Pane" option ?

Thanks anyway for the info.
Regards, Jason

On 11/10/2016, Randy Dunlap wrote:
> [changed linux-config to linux-kbuild list]
>
> On 10/09/16 13:46, Jason Vas Dias wrote:
>> Hi -
>> I've been doing 'make xconfig' to configure the kernel for many years
>> now, and always there used to be some option documentation pane
>> populated with summary documentation for the specific option selected .
>> But now, when built for Qt 5.7.0, (also tried Qt 4.8 and GTK) there
>> is no option documentation pane - this is a real pain ! The option
>> documentation also is not displayed with any other gui, eg.
>> 'make menuconfig' / 'make gtkconfig' -
>> I'm sure it used to be . This is a regression IMHO .
>> How can I restore display of documentation for each selected option ?
>> Will older xconfig work for Linux 4.8 ? it appears not ...
>> Thanks in advance for any replies,
>> Jason
>
> That's odd. I see the help info in all of xconfig, gconfig, menuconfig,
> & nconfig.
>
> In xconfig, if the right hand side of the config window only lists some
> kernel config options and no symbol help/info, the symbol info portion
> may be hidden. Try pointing to the bottom of the right side of the
> window and hold down the left mouse button and then drag the mouse
> pointer upward to open the symbol info pane.
> At least that is what works for me.
>
> --
> ~Randy
>
please, where has xconfig KConf option documentation gone with linux 4.8's Qt5 / Qt4 xconfig ?
Hi -

I've been doing 'make xconfig' to configure the kernel for many years now, and always there used to be some option documentation pane populated with summary documentation for the specific option selected . But now, when built for Qt 5.7.0 (also tried Qt 4.8 and GTK), there is no option documentation pane - this is a real pain ! The option documentation also is not displayed with any other gui, eg. 'make menuconfig' / 'make gtkconfig' - I'm sure it used to be . This is a regression IMHO .

How can I restore display of documentation for each selected option ? Will older xconfig work for Linux 4.8 ? It appears not ...

Thanks in advance for any replies,
Jason
4.5.x drm/i915/ + drm/drm_irq + drm/radeon & ACPI problems doing vga_switcheroo switching & getting EDID modes for laptop hybrid graphics with Intel IGC & Radeon Neptune 8970M
I have not so far been able to get my Radeon 8970M discrete graphics card to go into graphics mode under Linux 4.4.0+ (tried 4.4.0, 4.5.0, 4.5.1, ...) on my Clevo KAPOK laptop x86_64 LFS system, which has :

CPU : Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
RAM : 16GB ; Disk : 1TB SATA + 256GB SSD .

$ lspci -nn | grep VGA
00:02.0 VGA compatible controller [0300]: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller [8086:0416] (rev 06)
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Neptune XT [Radeon HD 8970M] [1002:6801]

So far, the Neptune card will only go into graphics mode when driven by the closed-source FGLRX driver under a Linux 3.10 / RHEL-7 clone - I'm trying to get it working under Linux 4.4.0+, whose 'drivers/drm/radeon' driver claims to support the card . Persistently, the Xorg server v1.18.3 with the Xorg Radeon driver v7.7.0 (latest stable GIT versions) reports "No modes" and is unable to discover any probed EDID display modes for the card, as shown by this Xorg.0.log excerpt :

[ 1503.772] (II) Loading /usr/lib64/xorg/modules/drivers/radeon_drv.so
[ 1503.773] (II) Module radeon: vendor="X.Org Foundation"
[ 1503.773]	compiled for 1.18.3, module version = 7.7.0
[ 1503.775]	Module class: X.Org Video Driver
[ 1503.775]	ABI class: X.Org Video Driver, version 20.0
[ 1503.775] (II) LoadModule: "intel"
[ 1503.777] (II) Loading /usr/lib64/xorg/modules/drivers/intel_drv.so
[ 1503.778] (II) Module intel: vendor="X.Org Foundation"
[ 1503.778]	compiled for 1.18.3, module version = 2.99.917
[ 1503.779]	Module class: X.Org Video Driver
[ 1503.780]	ABI class: X.Org Video Driver, version 20.0
...
[ 1503.788] (II) RADEON: Driver for ATI Radeon chipsets: ...
[ 1503.957] (II) [KMS] Kernel modesetting enabled.
[ 1503.957] (II) intel(1): Using Kernel Mode Setting driver: i915, version 1.6.0 20151218
[ 1503.957] (EE) Screen 1 deleted because of no matching config section.
[ 1503.957] (II) UnloadModule: "intel"
[ 1503.957] (II) RADEON(0): RADEONPreInit_KMS
[ 1503.957] (==) RADEON(0): Depth 24, (--) framebuffer bpp 32
[ 1503.957] (II) RADEON(0): Pixel depth = 24 bits stored in 4 bytes (32 bpp pixmaps)
[ 1503.957] (==) RADEON(0): Default visual is TrueColor
[ 1503.957] (**) RADEON(0): Option "DRI" "3"
[ 1503.957] (==) RADEON(0): RGB weight 888
[ 1503.957] (II) RADEON(0): Using 8 bits per RGB (8 bit DAC)
[ 1503.957] (--) RADEON(0): Chipset: "PITCAIRN" (ChipID = 0x6801)
[ 1503.957] (II) Loading sub module "fb"
[ 1503.957] (II) LoadModule: "fb"
[ 1503.957] (II) Loading /usr/lib64/xorg/modules/libfb.so
[ 1503.958] (II) Module fb: vendor="X.Org Foundation"
[ 1503.958]	compiled for 1.18.3, module version = 1.0.0
[ 1503.958]	ABI class: X.Org ANSI C Emulation, version 0.4
[ 1503.958] (II) Loading sub module "dri2"
[ 1503.958] (II) LoadModule: "dri2"
[ 1503.958] (II) Module "dri2" already built-in
[ 1503.958] (II) Loading sub module "glamoregl"
[ 1503.958] (II) LoadModule: "glamoregl"
[ 1503.958] (II) Loading /usr/lib64/xorg/modules/libglamoregl.so
[ 1503.958] (II) Module glamoregl: vendor="X.Org Foundation"
[ 1503.958]	compiled for 1.18.3, module version = 0.6.0
[ 1503.958]	ABI class: X.Org ANSI C Emulation, version 0.4
[ 1503.958] (II) glamor: OpenGL accelerated X.org driver based.
[ 1504.023] (II) glamor: EGL version 1.4 (DRI2):
[ 1504.023] (II) RADEON(0): glamor detected, initialising EGL layer.
[ 1504.023] (II) RADEON(0): KMS Color Tiling: enabled
[ 1504.023] (II) RADEON(0): KMS Color Tiling 2D: enabled
[ 1504.024] (II) RADEON(0): KMS Pageflipping: enabled
[ 1504.024] (II) RADEON(0): SwapBuffers wait for vsync: enabled
[ 1504.024] (II) RADEON(0): Initializing outputs ...
[ 1504.024] (II) RADEON(0): 0 crtcs needed for screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 0 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 1 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 2 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 3 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 4 to this screen.
[ 1504.024] (II) RADEON(0): Allocated crtc nr. 5 to this screen.
[ 1504.024] (WW) RADEON(0): No outputs definitely connected, trying again...
[ 1504.024] (WW) RADEON(0): Unable to find connected outputs - setting 1024x768 initial framebuffer
[ 1504.024] (II) RADEON(0): Using default gamma of (1.0, 1.0, 1.0) unless otherwise stated.
[ 1504.024] (II) RADEON(0): mem size init: gart size :7fbcc000 vram size: s:1 visible:ff916000
[ 1504.024] (==) RADEON(0): DPI set to (96, 96)
[ 1504.024] (II) Loading sub module "ramdac"
[ 1504.024] (II) LoadModule: "ramdac"
[ 1504.024] (II) Module "ramdac" already built-in
[ 1504.024] (EE) RADEON(0): No modes.
[ 1504.024] (II) RADEON(0): RADEONFreeScreen
[ 1504.024] (II) UnloadModule: "radeon"
[ 1504.024] (II) UnloadSubModule: "glamoregl"
[ 1504.024] (II) Unloading glamoregl
[ 1504.02
how to unmount an rbind mount ?
Good day -

Please could anyone advise - once one has mounted an alias mount with the 'rbind' option, so that mounts underneath it are also mounted under the new path, how can one unmount that filesystem safely without un-mounting the original mountpoints ?

For example, I do this for chroots :
$ for d in /dev /proc /sys; do mount -o rbind $d $chroot/$d; done

Now, if I want to unmount the chroot device, I cannot just do eg. :
$ umount ${chroot}/dev
because this will fail since /dev/pts, /dev/mqueue etc. are still mounted ; if I do:
$ umount -R ${chroot}/dev
or
$ umount ${chroot}/dev/pts
then /dev/pts will be unmounted from the root device filesystem - and the situation is much more horrid when trying to unmount ${chroot}/sys or ${chroot}/run .

Personally, I think this is rather buggy behaviour by Linux, since I told the kernel I only want to BIND the path ${chroot}/dev to /dev - and recursively bind names beneath ${chroot}/dev/* to /dev/*, with the 'rbind' option, ie. to make an alias of ${chroot}/dev/* for /dev/* - NOT to actually re-mount the devices there . So I think umount should be clever enough to 'un-bind' sub-mounts of mounts with the 'rbind' option, rather than unmount the devices from the root filesystem, which is what currently happens. It does make chroot filesystems very difficult to unmount safely ! Linux badly needs a better umount, IMHO .

Are there any plans to improve umount behaviour wrt rbind mounts ?
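For what it's worth, the behaviour described above can be tamed with mount propagation flags rather than a smarter umount: marking the rbind mounts as recursive slaves stops umounts under the chroot from propagating back to the real /dev, /proc and /sys, and a lazy umount then detaches the whole subtree. A sketch (the $chroot path is a placeholder; newer util-linux also accepts `umount -R` for a recursive unmount):

```shell
chroot=/mnt/chroot   # placeholder path

# set up: rbind, then cut propagation back to the host
for d in /dev /proc /sys; do
    mount --rbind "$d" "$chroot$d"
    mount --make-rslave "$chroot$d"   # umounts here no longer hit the host
done

# tear down: lazily detach each subtree; the originals stay mounted
for d in /sys /proc /dev; do
    umount -l "$chroot$d"
done
```

With `--make-rslave` in place, even `umount ${chroot}/dev/pts` only removes the alias, not the original /dev/pts.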
Re: how to build 2.6.x based kernel with perf ?
Here's a patch that fixes the issue for me . Also attached to Red Hat bugzilla : https://bugzilla.redhat.com/show_bug.cgi?id=1173649 On 12/12/14, Jason Vas Dias wrote: > Good day - > I am trying to build the latest RHEL kernel from the source RPM, > but this fails because the "perf" component cannot build . > The build gets as far as building the modules and debug flavour > of the kernel, but fails for the 'perf' target with : > > > + make -j4 -C tools/perf -s V=1 prefix=/usr all > CHK -fstack-protector-all > CHK -Wstack-protector > CHK -Wvolatile-register-var > CHK -D_FORTIFY_SOURCE=2 > CHK bionic > :1:31: error: android/api-level.h: No such file or directory > : In function 'main': > :5: error: '__ANDROID_API__' undeclared (first use in this function) > :5: error: (Each undeclared identifier is reported only once > :5: error: for each function it appears in.) > CHK libelf > CHK libdw > CHK -DLIBELF_MMAP > CHK -DHAVE_ELF_GETPHDRNUM > CHK -DLIBELF_MMAP > CHK libunwind > CHK libaudit > cc1: warnings being treated as errors > : In function 'main': > :5: error: implicit declaration of function 'printf' > :5: error: incompatible implicit declaration of built-in > function 'printf' > config/Makefile:240: No libaudit.h found, disables 'trace' tool, > please install audit-libs-devel or libaudit-dev > CHK libslang > CHK gtk2 > CHK -DHAVE_GTK_INFO_BAR > CHK perl > CHK python > CHK python version > CHK libbfd > CHK -DHAVE_STRLCPY > /tmp/ccOCUfYU.o: In function `main': > :(.text+0x14): undefined reference to `strlcpy' > collect2: ld returned 1 exit status > CHK -DHAVE_ON_EXIT > CHK -DBACKTRACE_SUPPORT > CHK libnuma > :1:18: error: numa.h: No such file or directory > :2:20: error: numaif.h: No such file or directory > cc1: warnings being treated as errors > : In function 'main': > :6: error: implicit declaration of function 'numa_available' > :6: error: nested extern declaration of 'numa_available' > config/Makefile:422: No numa.h found, disables 'perf bench numa mem' > 
benchmark, please install numa-libs-devel or libnuma-dev > * new build flags or prefix > PERF_VERSION = 2.6.32-504.1.3.el6.x86_64.debug > * new build flags or cross compiler > cc1: warnings being treated as errors > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:113: > error: no previous prototype for 'breakpoint' > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:119: > error: no previous prototype for 'alloc_arg' > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'find_cmdline': > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:183: > error: return discards qualifiers from pointer target type > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:186: > error: return discards qualifiers from pointer target type > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:195: > error: return discards qualifiers from pointer target type > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'type_size': > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: > error: missing initializer > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: > error: (near initialization for 'table[9].type') > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'event_read_fields': > 
/home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1519: > error: signed and unsigned type in conditional expression > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: > In function 'arg_num_eval': > /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: > err
how to build 2.6.x based kernel with perf ?
Good day -

I am trying to build the latest RHEL kernel from the source RPM, but this fails because the "perf" component cannot build . The build gets as far as building the modules and debug flavour of the kernel, but fails for the 'perf' target with :

+ make -j4 -C tools/perf -s V=1 prefix=/usr all
CHK -fstack-protector-all
CHK -Wstack-protector
CHK -Wvolatile-register-var
CHK -D_FORTIFY_SOURCE=2
CHK bionic
:1:31: error: android/api-level.h: No such file or directory
: In function 'main':
:5: error: '__ANDROID_API__' undeclared (first use in this function)
:5: error: (Each undeclared identifier is reported only once
:5: error: for each function it appears in.)
CHK libelf
CHK libdw
CHK -DLIBELF_MMAP
CHK -DHAVE_ELF_GETPHDRNUM
CHK -DLIBELF_MMAP
CHK libunwind
CHK libaudit
cc1: warnings being treated as errors
: In function 'main':
:5: error: implicit declaration of function 'printf'
:5: error: incompatible implicit declaration of built-in function 'printf'
config/Makefile:240: No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev
CHK libslang
CHK gtk2
CHK -DHAVE_GTK_INFO_BAR
CHK perl
CHK python
CHK python version
CHK libbfd
CHK -DHAVE_STRLCPY
/tmp/ccOCUfYU.o: In function `main':
:(.text+0x14): undefined reference to `strlcpy'
collect2: ld returned 1 exit status
CHK -DHAVE_ON_EXIT
CHK -DBACKTRACE_SUPPORT
CHK libnuma
:1:18: error: numa.h: No such file or directory
:2:20: error: numaif.h: No such file or directory
cc1: warnings being treated as errors
: In function 'main':
:6: error: implicit declaration of function 'numa_available'
:6: error: nested extern declaration of 'numa_available'
config/Makefile:422: No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev
* new build flags or prefix
PERF_VERSION = 2.6.32-504.1.3.el6.x86_64.debug
* new build flags or cross compiler
cc1: warnings being treated as errors
/home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:113: error: no previous prototype for 'breakpoint' /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:119: error: no previous prototype for 'alloc_arg' /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'find_cmdline': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:183: error: return discards qualifiers from pointer target type /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:186: error: return discards qualifiers from pointer target type /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:195: error: return discards qualifiers from pointer target type /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'type_size': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: error: missing initializer /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1243: error: (near initialization for 'table[9].type') /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'event_read_fields': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:1519: error: signed and unsigned type in conditional expression 
/home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'arg_num_eval': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: error: enumeration value 'PRINT_HEX' not handled in switch /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: error: enumeration value 'PRINT_DYNAMIC_ARRAY' not handled in switc /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2076: error: enumeration value 'PRINT_FUNC' not handled in switch /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c: In function 'arg_eval': /home/jvasdias/rpmbuild/BUILD/kernel-2.6.32-504.1.3.el6/linux-2.6.32-504.1.3.el6.x86_64/tools/lib/traceevent/event-parse.c:2235: error: enumeration value 'PRINT_HEX' not handled in switch /home/jvasdias/rpmbuild/B
Re: mount BTRFS filesystems created with 3.8+ under 2.6.32 kernels ?
Of course the solution was to have created the filesystem in the first place with 'mkfs.btrfs -O ^extref'. Found this after some more googling... Shouldn't that be the default?
Regards, Jason

On 9/22/14, Jason Vas Dias wrote:
> Good day -
>
> I wonder if there is a GIT repository somewhere with a backport of the BTRFS
> kernel modules that would allow BTRFS filesystems created with a 3.8 kernel
> to be mounted on a 2.6.32 kernel.
>
> When I try this, the 2.6.32 kernel crashes with the message:
> 'BTRFS: couldn't mount because of unsupported optional features (40)'
> (kernel-2.6.32-431.29.2.el6.x86_64 from RHEL 6.4+).
>
> The same filesystem mounts fine under Oracle EL6, which now ships
> kernel-uek 3.8+. Has anyone tried to backport the 3.8 BTRFS capabilities
> to 2.6.32, or is there any way I can remove "Option 40" to get it to
> mount without crashing?
> It is a very small BTRFS filesystem with a root filesystem and a few
> snapshots. I did not specify any BTRFS options in:
> $ mkfs.btrfs /dev/sda9
> ... # mount on /mnt/btr & create some files
> $ btrfs subvolume snapshot -r /mnt/btr /mnt/btr/root-0
> $ btrfs subvolume snapshot /mnt/btr /mnt/btr/root-w-0
> Now I can mount /dev/sda9 under any 3.8+ kernel, but not under 2.6.32.
>
> Thanks in advance for any replies,
> Best Regards, Jason Vas Dias
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
mount BTRFS filesystems created with 3.8+ under 2.6.32 kernels ?
Good day -

I wonder if there is a GIT repository somewhere with a backport of the BTRFS kernel modules that would allow BTRFS filesystems created with a 3.8 kernel to be mounted on a 2.6.32 kernel.

When I try this, the 2.6.32 kernel crashes with the message:
'BTRFS: couldn't mount because of unsupported optional features (40)'
(kernel-2.6.32-431.29.2.el6.x86_64 from RHEL 6.4+).

The same filesystem mounts fine under Oracle EL6, which now ships kernel-uek 3.8+. Has anyone tried to backport the 3.8 BTRFS capabilities to 2.6.32, or is there any way I can remove "Option 40" to get it to mount without crashing?

It is a very small BTRFS filesystem with a root filesystem and a few snapshots. I did not specify any BTRFS options in:
$ mkfs.btrfs /dev/sda9
... # mount on /mnt/btr & create some files
$ btrfs subvolume snapshot -r /mnt/btr /mnt/btr/root-0
$ btrfs subvolume snapshot /mnt/btr /mnt/btr/root-w-0
Now I can mount /dev/sda9 under any 3.8+ kernel, but not under 2.6.32.

Thanks in advance for any replies,
Best Regards, Jason Vas Dias
how to build kernel-firmware and kernel-doc RPMs from Red Hat EL6 kernel.spec files ?
Sorry for this newbie question, but it's been a while since I built the kernel from the Red Hat source RPMs, and there appears to be no way to build the kernel-firmware-*.noarch.rpm package without modifying the spec file, which contains:

# we don't want a .config file when building firmware: it just confuses the build system
%define build_firmware \
mv .config .config.firmware_save \
make INSTALL_FW_PATH=$RPM_BUILD_ROOT/lib/firmware firmware_install \
mv .config.firmware_save .config

When intending to build the kernel-doc and kernel-firmware noarch RPMs, after the x86_64 RPMs have built successfully, with:

$ rpmbuild --target=noarch --rebuild $path_to_kernel_srpm --define '_with_docs 1' --define '_with_firmware 1' --define '_without_perf 1'

... this fails:

+ cd ${BUILDROOT}/lib/modules/
+ ln -s kabi-rhel65 kabi-current
+ mv .config .config.firmware_save
+ make INSTALL_FW_PATH=${BUILDROOT}/lib/firmware firmware_install
scripts/kconfig/conf -s arch/x86/Kconfig
***
*** You have not yet configured your kernel!
*** (missing kernel config file ".config")
***
*** Please run some configurator (e.g. "make oldconfig" or
*** "make menuconfig" or "make xconfig").
***
${BUILD}/scripts/kconfig/Makefile:30: recipe for target 'silentoldconfig' failed
make[2]: *** [silentoldconfig] Error 1
${BUILD}/Makefile:484: recipe for target 'silentoldconfig' failed
make[1]: *** [silentoldconfig] Error 2
IHEX firmware/iwlwifi-105-6.ucode
make[1]: *** No rule to make target '${BUILDROOT}/lib/firmware/./', needed by '${BUILDROOT}/lib/firmware/iwlwifi-105-6.ucode'. Stop.
Makefile:1112: recipe for target 'firmware_install' failed
make: *** [firmware_install] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.3a7tvf (%install)

(with the $BUILD and $BUILDROOT strings representing actual paths).
So I have to edit that part of the spec file to read:

%define build_firmware \
make INSTALL_FW_PATH=$RPM_BUILD_ROOT/lib/firmware firmware_install

and then I can 'rpmbuild --target=noarch -ba $path_to_modified_spec_file' OK, and the firmware and documentation RPMs are produced.

Has anyone found a way of avoiding having to edit the Red Hat spec file in this manner to build the noarch kernel-doc and kernel-firmware RPMs? This happens with every RHEL 6.4 kernel I've built so far:
kernel-2.6.32-431.23.3.el6.src.rpm
kernel-2.6.32-431.29.2.el6.src.rpm

Incidentally, has anyone found a way to build the Red Hat RPMs without "--define '_without_perf 1'"? It seems that enabling perf makes the build look for libunwind and the Android SDK headers, which are not part of the kernel's BuildRequires.

Thanks in advance for any helpful replies,
Best Regards, Jason Vas Dias
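One way to avoid hand-editing the spec each time is to patch it non-interactively before calling rpmbuild. A sketch, assuming the %define layout quoted above (the function name and file paths are illustrative):

```shell
# patch_spec: emit a copy of a kernel spec file with the two
# ".config" save/restore lines removed from the build_firmware macro,
# leaving only the firmware_install invocation. Deleting whole lines is
# safe here because each macro line ends in a "\" continuation, so the
# remaining lines still form one valid multi-line %define.
patch_spec() {
  sed -e '/mv \.config \.config\.firmware_save/d' \
      -e '/mv \.config\.firmware_save \.config/d' "$1"
}
```

Usage (paths are examples): patch_spec kernel.spec > kernel-noarch.spec && rpmbuild --target=noarch -ba kernel-noarch.spec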
HP 6715b laptop's wireless radio "on" LED went off after first boot of 3.9.6 (upgraded from 3.4.4) - please help / any ideas?
After building and installing the 3.9.6 kernel & modules on my 2.2GHz HP 6715b x86-64 Turion dual-core laptop, which had run Linux with no b43 wireless problems since 2007, I now have no access to its onboard Broadcom 4311 wireless radio.

I had always used the b43 driver with the correct firmware installed under /lib/firmware/b43 with b43-fwcutter, as per the instructions at http://wireless.kernel.org/en/users/Drivers/b43 , which I've just now redone. But since booting 3.9.6, which I believe resulted in a firmware download via udev at first boot, the wireless radio "on" blue LED indicator goes off after BIOS POST.

It has always been the case that if the blue LED indicator is off after BIOS POST, the kernel does not see the device, and I have no wireless until a hard poweroff and pressing the touch-sensitive wireless-on button during BIOS POST. But now the wireless LED goes on for about 1 second during BIOS POST and never comes on again, and there is no response to touching the wireless-on button after reboot, though there is to the other buttons next to it.

In short, I've lost wireless access (and home internet access for my PC - I'm sending this from my mobile). Can anyone help? How can I force the card to download the re-installed b43-fwcutter firmware if the device no longer appears in lspci output? Is there any way to force the kernel to ignore the wireless button (it could be that the kernel & BIOS think this button is in the off state - any way to force its state to "on")?

Any ideas / suggestions would be much appreciated.
Thanks & Regards, Jason
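A first diagnostic step in situations like this is to distinguish "device not enumerated on PCI" (disabled below the OS, so no driver or rfkill setting can help) from "device visible but soft/hard blocked". A hedged sketch - the PCI ID 14e4:4311 is Broadcom's BCM4311, and the function name is made up for illustration:

```shell
# bcm4311_state: reads `lspci -n` output on stdin and reports whether
# the BCM4311 (vendor:device 14e4:4311) is visible on the PCI bus.
# If it is absent, the radio was turned off at BIOS/EC level and the
# kernel never sees it; if present, check rfkill and the b43 firmware:
#   for k in /sys/class/rfkill/rfkill*; do cat "$k"/name "$k"/soft "$k"/hard; done
bcm4311_state() {
  if grep -q '14e4:4311'; then
    echo "visible"        # driver/firmware problem - check dmesg for b43
  else
    echo "not-enumerated" # hardware/BIOS switch - rfkill will not list it
  fi
}
```

Usage on a live system: lspci -n | bcm4311_state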
[PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
This patch adds a new acpi.thermal.temp_b4_trip=1 setting, which causes the temperature to be read before evaluation of thermal trip points (the old default). This mode should be selected automatically by DMI match if the system identifies as "HP Compaq 6715b".

Please consider applying a patch like the one attached to fix the issue reported recently in the lkml thread "Re: PROBLEM: Performance drop", whereby it was found that HP 6715b laptops (which have 2.2GHz dual-core AMD x86_64 K8 CPUs) get stuck running the CPU at 800MHz and cannot switch frequency. I have verified that this is still the case with the v3.4.4 tagged "stable" kernel, and with v3.5-rc6, which this is a patch against (i.e. against commit bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a : "Linux 3.5-rc6"):

diff --git a/Makefile b/Makefile
index 81ea154..bf02707 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 5
 SUBLEVEL = 0
-EXTRAVERSION = -rc5
+EXTRAVERSION = -rc6
 NAME = Saber-toothed Squirrel

 # *DOCUMENTATION*
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..13d3b22 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static bool temp_b4_trip;
+module_param(temp_b4_trip, bool, 0644);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
Re: PROBLEM: Performance drop
Hi - any progress on this, or on the patch I submitted for it? Please see the enclosed - apologies for my being forced to use gmail, which has mandatory line wrap. Please do something about restoring correct thermal operation on x86_64 K8s with HP BIOS!
Thanks & Regards, Jason

Re: [PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
From: Jason Vas Dias, Jul 9, to: Rusty, linux-kernel, Andreas, Matthew, Len, Comrade

Thanks Rusty - sorry I didn't see your email until now - revised patch addressing your comments attached. BTW, sorry about the word wrap on the initial posting - should I attach a '.patch' file or inline? Trying both.

The revised patch (against commit bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a, Author: Linus Torvalds, Date: Sat Jul 7 17:23:56 2012 -0700, "Linux 3.5-rc6"):

$ git diff bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a > /tmp/acpi_thermal_temp_b4_trip.patch
$ cat /tmp/acpi_thermal_temp_b4_trip.patch
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..13d3b22 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static bool temp_b4_trip;
+module_param(temp_b4_trip, bool, 0644);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
+			return result;
+
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
+			return result;
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
 	if (!result)
 		tz->flags.cooling_mode = 1;
-
+
 	/* Get default polling frequency [_TZP] (optional) */
 	if (tzp)
 		tz->polling_frequency = tzp;
 	else
 		acpi_thermal_get_polling_frequency(tz);
-
+
 	return 0;
 }

@@ -1110,6 +1128,14 @@ static int thermal_psv(const struct dmi_system_id *d) {
 	return 0;
 }

+static int thermal_temp_b4_trip(const struct dmi_system_id *d) {
+
+	printk(KERN_NOTICE "ACPI: %s detected: : "
+	       "getting temperature before trip point initialisation\n", d->ident);
+	temp_b4_trip = TRUE;
+	return 0;
+}
+
 static struct dmi_system_id thermal_dmi_table[] __initdata = {
 	/*
 	 * Award BIOS on this AOpen makes thermal control almost worthless.
@@ -1147,6 +1173,14 @@ static struct dmi_system_id thermal_dmi_table[] __initdata = {
 		DMI_MATCH(DMI_BOARD_NAME, "7ZX"),
 		},
 	},
+	{
+	 .callback = thermal_temp_b4_trip,
+	 .ident = "HP 6715b laptop",
+	 .matches = {
+		DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
+	 },
+	},
 	{}
 };

Incidentally, there are still plenty of cpufreq- and temperature-related issues on this platform: with the "ondemand" or "performance" governors, placing a large load on the system (e.g. building gcc-4.7.1) makes the CPU switch to its highest frequency, but not switch down after the 65 degree trip point has been toggled once. And once the trip point has been reached once and the temperature falls below 65, returning the CPU freq to 2GHz, the reported temperature seems to be stuck at 62 degrees even though the base of the laptop nearly burns my hand.
Re: [PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
Thanks Rusty - sorry I didn't see your email until now - revised patch addressing your comments attached. BTW, sorry about the word wrap on the initial posting - should I attach a '.patch' file or inline? Trying both.

The revised patch (against commit bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a, Author: Linus Torvalds, Date: Sat Jul 7 17:23:56 2012 -0700, "Linux 3.5-rc6"):

$ git diff bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a > /tmp/acpi_thermal_temp_b4_trip.patch
$ cat /tmp/acpi_thermal_temp_b4_trip.patch
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..13d3b22 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static bool temp_b4_trip;
+module_param(temp_b4_trip, bool, 0644);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
+			return result;
+
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
+			return result;
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
 	if (!result)
 		tz->flags.cooling_mode = 1;
-
+
 	/* Get default polling frequency [_TZP] (optional) */
 	if (tzp)
 		tz->polling_frequency = tzp;
 	else
 		acpi_thermal_get_polling_frequency(tz);
-
+
 	return 0;
 }

@@ -1110,6 +1128,14 @@ static int thermal_psv(const struct dmi_system_id *d) {
 	return 0;
 }

+static int thermal_temp_b4_trip(const struct dmi_system_id *d) {
+
+	printk(KERN_NOTICE "ACPI: %s detected: : "
+	       "getting temperature before trip point initialisation\n", d->ident);
+	temp_b4_trip = TRUE;
+	return 0;
+}
+
 static struct dmi_system_id thermal_dmi_table[] __initdata = {
 	/*
 	 * Award BIOS on this AOpen makes thermal control almost worthless.
@@ -1147,6 +1173,14 @@ static struct dmi_system_id thermal_dmi_table[] __initdata = {
 		DMI_MATCH(DMI_BOARD_NAME, "7ZX"),
 		},
 	},
+	{
+	 .callback = thermal_temp_b4_trip,
+	 .ident = "HP 6715b laptop",
+	 .matches = {
+		DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
+	 },
+	},
 	{}
 };

Incidentally, there are still plenty of cpufreq- and temperature-related issues on this platform: with the "ondemand" or "performance" governors, placing a large load on the system (e.g. building gcc-4.7.1) makes the CPU switch to its highest frequency, but not switch down after the 65 degree trip point has been toggled once. And once the trip point has been reached once and the temperature falls below 65, returning the CPU freq to 2GHz, the reported temperature seems to be stuck at 62 degrees even though the base of the laptop nearly burns my hand.

So I get emergency overheating reboots unless I manually run my cpufreq & temperature monitoring scripts - which, if the CPU freq is 2GHz, now have to lower the frequency to 800MHz for 2 seconds every 8 seconds regardless of what temperature is reported.

On Mon, Jul 9, 2012 at 1:30 AM, Rusty Russell wrote:
> On Sun, 8 Jul 2012 19:50:54 +0100, Jason Vas Dias wrote:
>> This patch adds a new
[PATCH: 1/1] ACPI: make evaluation of thermal trip points before temperature, or vice versa, dependent on new "temp_b4_trip" module parameter, to support older AMD x86_64s
This patch adds a new acpi.thermal.temp_b4_trip=1 setting, which causes the temperature to be read before evaluation of thermal trip points (the old default); this mode should be selected automatically by DMI match if the system identifies as "HP Compaq 6715b".

Please consider applying a patch like the one attached to fix the issue reported recently in the lkml thread "Re: PROBLEM: Performance drop", whereby it was found that HP 6715b laptops (which have 2.2GHz dual-core AMD x86_64 K8 CPUs) get stuck running the CPU at 800MHz and cannot switch frequency. I have verified that this is still the case with the v3.4.4 tagged "stable" kernel.

diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 7dbebea..de2b164 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -96,6 +96,10 @@ static int psv;
 module_param(psv, int, 0644);
 MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");

+static int temp_b4_trip;
+module_param(temp_b4_trip, int, 0);
+MODULE_PARM_DESC(temp_b4_trip, "Get the temperature before initializing trip points.");
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_resume(struct acpi_device *device);
@@ -941,27 +945,41 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
 	if (!tz)
 		return -EINVAL;

-	/* Get trip points [_CRT, _PSV, etc.] (required) */
-	result = acpi_thermal_get_trip_points(tz);
-	if (result)
+	if( temp_b4_trip )
+	{ /* some CPUs, eg AMD K8, need temperature before trip points can be obtained */
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
 			return result;
-
-	/* Get temperature [_TMP] (required) */
-	result = acpi_thermal_get_temperature(tz);
-	if (result)
+
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
 			return result;
-
+	}else
+	{ /* newer x86_64s need trip points set before temperature obtained */
+		/* Get trip points [_CRT, _PSV, etc.] (required) */
+		result = acpi_thermal_get_trip_points(tz);
+		if (result)
+			return result;
+
+		/* Get temperature [_TMP] (required) */
+		result = acpi_thermal_get_temperature(tz);
+		if (result)
+			return result;
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
 	if (!result)
 		tz->flags.cooling_mode = 1;
-
+
 	/* Get default polling frequency [_TZP] (optional) */
 	if (tzp)
 		tz->polling_frequency = tzp;
 	else
 		acpi_thermal_get_polling_frequency(tz);
-
+
 	return 0;
 }

@@ -1110,6 +1128,14 @@ static int thermal_psv(const struct dmi_system_id *d) {
 	return 0;
 }

+static int thermal_temp_b4_trip(const struct dmi_system_id *d) {
+
+	printk(KERN_NOTICE "ACPI: %s detected: : "
+	       "getting temperature before trip point initialisation\n", d->ident);
+	temp_b4_trip = 1;
+	return 0;
+}
+
 static struct dmi_system_id thermal_dmi_table[] __initdata = {
 	/*
 	 * Award BIOS on this AOpen makes thermal control almost worthless.
@@ -1147,6 +1173,14 @@ static struct dmi_system_id thermal_dmi_table[] __initdata = {
 		DMI_MATCH(DMI_BOARD_NAME, "7ZX"),
 		},
 	},
+	{
+	 .callback = thermal_temp_b4_trip,
+	 .ident = "HP 6715b laptop",
+	 .matches = {
+		DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
+	 },
+	},
 	{}
 };

(attachment: acpi_thermal_HP6715b.patch - binary data)
Re: PROBLEM: Performance drop
Sorry, of course the commit I backed out was: 9bcb8118965ab4631a65ee0726e6518f75cda6c5.

On Sat, Jul 7, 2012 at 3:40 PM, Jason Vas Dias wrote:
> I can confirm that the AMD Turion X2 2.2GHz HP Compaq 6715b "business" x86_64 K8
> dual-core laptops circa 2007 DO get stuck in 800MHz mode and cannot switch out of
> it after booting the "stable" "v3.4.4" tagged kernel.
>
> I followed the containing post and reverted commit ff74ae50f01ee67764564815c023c362c87ce18b :
>
> Commit d51cdad33bb5bb370c05129f7c7f3a16a55eff40
> Author: root
> Date: Fri Jul 6 18:57:03 2012 +
>
>     Revert "ACPI: Evaluate thermal trip points before reading temperature"
>
>     This reverts commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5.
>
> commit ff74ae50f01ee67764564815c023c362c87ce18b
> Author: Greg Kroah-Hartman
> Date: Fri Jun 22 11:37:50 2012 -0700
>
> And wow! What a difference - back to a circa 2007 machine versus a circa 1987 machine.
>
> Not too many of us left around trying to run the latest version of Linux on nearly
> 5-year-old hardware, I guess, but still - please can you restore correct Linux
> cpufreq & thermal operation on old-style AMD K8 CPUs? They do seem to depend on
> the temperature being set BEFORE first entry.
>
> Thanks & Regards,
> Jason Vas Dias (a Software Engineer)
>
> On Wed, May 30, 2012 at 1:43 PM, Andreas Herrmann wrote:
>> On Wed, May 30, 2012 at 03:20:27AM +0700, Comrade DOS wrote:
>>> > Unfortunately you have used acpi=debug instead of apic=debug. So I
>>> > can't compare I/O APIC configurations between the different test
>>> > scenarios.
>>>
>>> Sorry for this mistake.
>>
>> No problem.
>>
>> The logs show no difference in IO-APIC pin usage.
>> So it's not the old problem ...
>>
>> Comparing both logs I found the following differences
>> (most other stuff seems just to be changed formatting):
>>
>> -ACPI: Thermal Zone [TZ1] (67 C)
>> +ACPI: Thermal Zone [TZ1] (62 C)
>>
>> I think what's shown is the temperature value, which just differed
>> between the boots. But that made me look at acpi/thermal.c, where the
>> messages came from. The only change between 3.3 and 3.4 is this commit:
>>
>> commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5
>> Author: Matthew Garrett
>> Date: Wed Feb 1 10:26:54 2012 -0500
>>
>>     ACPI: Evaluate thermal trip points before reading temperature
>>
>> I'd suggest to do a test with this patch reverted. Maybe this change
>> to fix issues with one HP laptop (re-)introduced the trouble with your
>> system.
>>
>> If reverting the patch helps we have to take a closer look at your
>> ACPI tables. So can you please do a
>>
>> # git revert 9bcb8118965ab4631a65ee0726e6518f75cda6c5
>>
>> on top of v3.4, rebuild your kernel, and rerun your test (with
>> apic=debug; this allows easier diff to dmesg output of your previous
>> test runs).
>>
>> In any case it also would be good to have the ACPI tables from your
>> system. So please also use
>> # acpidump > acpidump.3.3   (using the 3.3.x kernel)
>> # acpidump > acpidump.3.4   (using the unmodified 3.4 version)
>>
>> and send all files as attachments to your mail.
>>
>> This will allow me to look at your thermal zone definitions in the
>> working and non-working case.
>>
>> Thanks,
>> Andreas
Re: PROBLEM: Performance drop
I can confirm that the AMD Turion X2 2.2GHz HP Compaq 6715b "business" x86_64 K8 dual-core laptops circa 2007 DO get stuck in 800MHz mode and cannot switch out of it after booting the "stable" "v3.4.4" tagged kernel.

I followed the containing post and reverted commit ff74ae50f01ee67764564815c023c362c87ce18b :

Commit d51cdad33bb5bb370c05129f7c7f3a16a55eff40
Author: root
Date: Fri Jul 6 18:57:03 2012 +

    Revert "ACPI: Evaluate thermal trip points before reading temperature"

    This reverts commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5.

commit ff74ae50f01ee67764564815c023c362c87ce18b
Author: Greg Kroah-Hartman
Date: Fri Jun 22 11:37:50 2012 -0700

And wow! What a difference - back to a circa 2007 machine versus a circa 1987 machine.

Not too many of us left around trying to run the latest version of Linux on nearly 5-year-old hardware, I guess, but still - please can you restore correct Linux cpufreq & thermal operation on old-style AMD K8 CPUs? They do seem to depend on the temperature being set BEFORE first entry.

Thanks & Regards,
Jason Vas Dias (a Software Engineer)

On Wed, May 30, 2012 at 1:43 PM, Andreas Herrmann wrote:
> On Wed, May 30, 2012 at 03:20:27AM +0700, Comrade DOS wrote:
>> > Unfortunately you have used acpi=debug instead of apic=debug. So I
>> > can't compare I/O APIC configurations between the different test
>> > scenarios.
>>
>> Sorry for this mistake.
>
> No problem.
>
> The logs show no difference in IO-APIC pin usage.
> So it's not the old problem ...
>
> Comparing both logs I found the following differences
> (most other stuff seems just to be changed formatting):
>
> -ACPI: Thermal Zone [TZ1] (67 C)
> +ACPI: Thermal Zone [TZ1] (62 C)
>
> I think what's shown is the temperature value, which just differed
> between the boots. But that made me look at acpi/thermal.c, where the
> messages came from. The only change between 3.3 and 3.4 is this commit:
>
> commit 9bcb8118965ab4631a65ee0726e6518f75cda6c5
> Author: Matthew Garrett
> Date: Wed Feb 1 10:26:54 2012 -0500
>
>     ACPI: Evaluate thermal trip points before reading temperature
>
> I'd suggest to do a test with this patch reverted. Maybe this change
> to fix issues with one HP laptop (re-)introduced the trouble with your
> system.
>
> If reverting the patch helps we have to take a closer look at your
> ACPI tables. So can you please do a
>
> # git revert 9bcb8118965ab4631a65ee0726e6518f75cda6c5
>
> on top of v3.4, rebuild your kernel, and rerun your test (with
> apic=debug; this allows easier diff to dmesg output of your previous
> test runs).
>
> In any case it also would be good to have the ACPI tables from your
> system. So please also use
> # acpidump > acpidump.3.3   (using the 3.3.x kernel)
> # acpidump > acpidump.3.4   (using the unmodified 3.4 version)
>
> and send all files as attachments to your mail.
>
> This will allow me to look at your thermal zone definitions in the
> working and non-working case.
>
> Thanks,
> Andreas