Re: Differences between builtins and modules

2018-05-10 Thread Jason Vas Dias
Sorry I didn't see this mail until now - RE:

Randy Dunlap  wrote:
> Would someone please answer/reply to this (related) kernel bugzilla entry:
> https://bugzilla.kernel.org/show_bug.cgi?id=118661

Yes, I raised this bug because I think modinfo should return 0 exit status
if a requested module is built-in, not just when it has been loaded, like
this modified version does:
$ modinfo snd
modinfo: ERROR: Module snd not found.
built-in: snd
$ echo $?
0
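
For illustration, a minimal sketch of the kind of check a fixed modinfo
could perform - this assumes the built-in module list that depmod installs
at /lib/modules/$(uname -r)/modules.builtin; the helper below is
hypothetical (not the actual kmod code), and a real version would also map
'-' to '_' in module names:

#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Return 1 if `name` is listed in the running kernel's modules.builtin. */
static int module_is_builtin(const char *name)
{
	struct utsname u;
	char path[512], line[512];
	FILE *f;
	int found = 0;

	if (uname(&u) != 0)
		return 0;
	snprintf(path, sizeof(path), "/lib/modules/%s/modules.builtin", u.release);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* each line is a path such as kernel/sound/core/snd.ko */
		char *base = strrchr(line, '/');
		char *dot;
		base = base ? base + 1 : line;
		dot = strchr(base, '.');
		if (dot)
			*dot = '\0';
		if (strcmp(base, name) == 0) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}

int main(int argc, char **argv)
{
	if (argc > 1 && module_is_builtin(argv[1])) {
		printf("built-in: %s\n", argv[1]);
		return 0;	/* exit status 0 for a built-in module, as requested */
	}
	fprintf(stderr, "ERROR: Module %s not found.\n", argc > 1 ? argv[1] : "?");
	return 1;
}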

What was the query about Bug 118661 that needs to be answered?
I don't see any query on the bug report - just a comment from someone
who also agrees that modinfo should return OK for a built-in module.

Glad to hear someone is finally considering fixing modinfo to report
the status of built-in modules - even if it took two years to respond.

Thanks & Best Regards,
Jason


Re: [PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-23 Thread Jason Vas Dias
Good day -

I believe the last patch I sent, with $subject,
addresses all concerns raised so far by reviewers,
and complies with all kernel coding standards.

Please, it would be most helpful if you could let
me know whether the patch is now acceptable
and will be applied at some stage or not - or if not,
what the problem with it is.

My clients are asking whether the patch is going
to be in the upstream kernel or not, and I need
to tell them something.

Thanks & Best Regards,
Jason


[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-21 Thread jason . vas . dias

This patch implements clock_gettime(CLOCK_MONOTONIC_RAW,) calls entirely in
the vDSO, without calling vdso_fallback_gettime().

It has been augmented to support compilation with or without -DRETPOLINE /
$(RETPOLINE_CFLAGS); when compiled with -DRETPOLINE, not all function calls
can be inlined within __vdso_clock_gettime, and all functions invoked by
__vdso_clock_gettime must have the 'indirect_branch("keep")' +
'function_return("keep")' attributes to compile, otherwise thunk relocations
will be generated; and the functions cannot all be declared
'__always_inline__', otherwise the compiler emits a -Werror diagnostic
('not all __always_inline__ functions can be inlined').
Also, compared to the previous version of the same patch, the do_*_coarse
functions are still not inlines, and are not inadvertently changed to inline.

I still think it might be better to apply H.J. Lu's patch from
https://bugzilla.kernel.org/show_bug.cgi?id=199129 to disable
-DRETPOLINE compilation for the vDSO.

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..80d65d4 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,62 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   switch (gtod->vclock_mode) {
+   case VCLOCK_TSC:
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   case VCLOCK_PVCLOCK:
+   return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   case VCLOCK_HVCLOCK:
+   return vread_hvclock(mode);
 #endif
-   else
+   default:
+   break;
+   }
+   return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
return 0;
+
v = (cycles - gtod->cycle_last) & gtod->mask;
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
+   return 0;
+
+   v = (cycles - gtod->cycle_last) & gtod->mask;
+   return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+#  define  _NO_THUNK_RELOCS_()(indirect_branch("keep"),\
+   function_return("keep"))
+#  define  _RETPOLINE_FUNC_ATTR_   __attribute__(_NO_THUNK_RELOCS_())
+#  define  _RETPOLINE_INLINE_  inline
+#else
+#  define  _RETPOLINE_FUNC_ATTR_
+#  define  _RETPOLINE_INLINE_  __always_inline
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
unsigned long seq;
u64 ns;
@@ -225,7 +258,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
unsigned long seq;
u64 ns;
@@ -246,7 +280,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static _RETPOLINE_INLINE_ _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
do {
@@ -256,7 +313,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
unsigned long seq;
do {
@@ -266,7 +324,8 @@ notrace static void do_monotonic_coarse(struct timespec *ts)

[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-21 Thread jason . vas . dias

  Resent to address reviewer comments, and allow builds with compilers
  that support -DRETPOLINE to succeed.
  
  Currently, the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C
  machines under various versions of Linux.
  
  Sometimes, particularly when correlating elapsed time to performance
  counter values, user-space  code needs to know elapsed time from the
  perspective of the CPU no matter how "hot" / fast or "cold" / slow it
  might be running wrt NTP / PTP "real" time; when code needs this,
  the latencies associated with a syscall are often unacceptably high.
  
  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
  
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the vDSO,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().
  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.
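
  In outline, the export amounts to something like the sketch below - the
  field names follow the description later in this thread, 'tk' and 'vdata'
  are assumed to be the usual update_vsyscall() timekeeper and
  vsyscall_gtod_data pointers, and this is an illustration rather than the
  exact hunk:

   /* arch/x86/include/asm/vgtod.h: struct vsyscall_gtod_data gains */
   u32          raw_mult;
   u32          raw_shift;
   gtod_long_t  monotonic_time_raw_sec;   /* from tk->raw_sec */
   gtod_long_t  monotonic_time_raw_nsec;  /* from tk->tkr_raw.xtime_nsec */

   /* arch/x86/entry/vsyscall/vsyscall_gtod.c: update_vsyscall() copies */
   vdata->raw_mult                = tk->tkr_raw.mult;
   vdata->raw_shift               = tk->tkr_raw.shift;
   vdata->monotonic_time_raw_sec  = tk->raw_sec;
   vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;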
  
  The patch is against Linus' latest 4.16-rc6 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .
  
  This patch affects only files:
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c
   
  Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug
  #198161,
  as is the test program, timer_latency.c, to demonstrate the problem.
  
  Before the patch a latency of 200-1000ns was measured for
clock_gettime(CLOCK_MONOTONIC_RAW,)
  calls - after the patch, the same call on the same machine
  has a latency of @ 20ns.
  
  Please consider applying something like this patch to a future Linux release.

  This patch is being resent because it has slight improvements to the
  vclock_gettime static function attributes wrt. the previous version.

  It also supersedes all previous patches with subject matching
 '.*VDSO should handle.*clock_gettime.*MONOTONIC_RAW'
  that I have sent previously - sorry for the resends.

  Please apply this patch so we stop getting emails from the Intel
  build bot trying to build the previous version, with subject:
'[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle \
 clock_gettime(CLOCK_MONOTONIC_RAW) without syscall'
  , which only fails to build because its patch 2/2, which removed
  -DRETPOLINE from the VDSO build and is now the subject of
  https://bugzilla.kernel.org/show_bug.cgi?id=199129, raised by
  H.J. Lu, was not applied first - sorry!

Thanks & Best Regards,
Jason Vas Dias


Re: [PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-19 Thread Jason Vas Dias
Note there is a bug raised by H.J. Lu:
 Bug 199129: Don't build vDSO with $(RETPOLINE_CFLAGS) -DRETPOLINE
(https://bugzilla.kernel.org/show_bug.cgi?id=199129)

If you agree it is a bug, then use both patches from post :
'[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle \
 clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
'
else, use the single patch from $subject, which makes the
calls to the statics in vclock_gettime.c use
   indirect_branch("keep") / function_return("keep"),
to avoid generating thunk relocations, which would not
occur unless compiled with
   -mindirect-branch=thunk-extern -mindirect-branch-register .

Thanks & Regards,
Jason


[PATCH v4.16-rc6 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-19 Thread jason . vas . dias

 This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW,)
 calls in the same way it handles clock_gettime(CLOCK_MONOTONIC,) calls,
 reducing latency from @ 200-1000ns to @ 20ns.

 It has been resent and augmented to support compilation with -DRETPOLINE /
 -mindirect-branch=thunk-extern -mindirect-branch-register, to avoid
 generating relocations for thunks.
  
---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..9b89f86 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,29 +182,60 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcycles(int *mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   switch (gtod->vclock_mode) {
+   case VCLOCK_TSC:
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   case VCLOCK_PVCLOCK:
+   return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   case VCLOCK_HVCLOCK:
+   return vread_hvclock(mode);
 #endif
-   else
+   default:
+   break;
+   }
+   return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
return 0;
+
v = (cycles - gtod->cycle_last) & gtod->mask;
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
+   return 0;
+
+   v = (cycles - gtod->cycle_last) & gtod->mask;
+   return v * gtod->raw_mult;
+}
+
+#ifdef RETPOLINE
+#  define  _NO_THUNK_RELOCS_()(indirect_branch("keep"),\
+   function_return("keep"))
+#  define  _RETPOLINE_FUNC_ATTR_ __attribute__(_NO_THUNK_RELOCS_())
+#else
+#  define  _RETPOLINE_FUNC_ATTR_
+#endif
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_realtime(struct timespec *ts)
 {
unsigned long seq;
u64 ns;
@@ -225,7 +256,8 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
return mode;
 }
 
-notrace static int __always_inline do_monotonic(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic(struct timespec *ts)
 {
unsigned long seq;
u64 ns;
@@ -246,7 +278,30 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
do {
@@ -256,7 +311,8 @@ notrace static void do_realtime_coarse(struct timespec *ts)
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace static void do_monotonic_coarse(struct timespec *ts)
+notrace static inline _RETPOLINE_FUNC_ATTR_
+void do_monotonic_coarse(struct timespec *ts)
 {
unsigned long seq;
do {
@@ -266,7 +322,11 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
-notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace
+#ifdef RETPOLINE
+   __attribute__((indirect_branch("keep"), function_return("keep")))
+#endif
+int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
switch (clock) {
case CLOCK_REALTIME:
@@ -277,6 +337,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;

[PATCH v4.16-rc6 (1)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-19 Thread jason . vas . dias


Resent to address reviewer comments, and allow builds with compilers
  that support -DRETPOLINE to succeed.

  Currently, the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values, user-space  code needs to know elapsed time from the
  perspective of the CPU no matter how "hot" / fast or "cold" / slow it
  might be running wrt NTP / PTP "real" time; when code needs this,
  the latencies associated with a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the vDSO,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update().

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c


  Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug 
#198161,
  as is the test program, timer_latency.c, to demonstrate the problem.

  Before the patch a latency of 200-1000ns was measured for
clock_gettime(CLOCK_MONOTONIC_RAW,)
  calls - after the patch, the same call on the same machine
  has a latency of @ 20ns.

  Please consider applying something like this patch to a future Linux release.

Thanks & Best Regards,
Jason Vas Dias


Re: [PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-17 Thread Jason Vas Dias
On 18/03/2018, Jason Vas Dias <jason.vas.d...@gmail.com> wrote:
(should have CC'ed to list, sorry)
> On 17/03/2018, Andi Kleen <a...@firstfloor.org> wrote:
>>
>> That's quite a mischaracterization of the issue. gcc works as intended,
>> but the kernel did not correctly supply a indirect call retpoline thunk
>> to the vdso, and it just happened to work by accident with the old
>> vdso.
>>
>>>
>>>  The automated test builds should now succeed with this patch.
>>
>> How about just adding the thunk function to the vdso object instead of
>> this cheap hack?
>>
>> The other option would be to build vdso with inline thunks.
>>
>> But just disabling is completely the wrong action.
>>
>> -Andi
>>
>
> Aha! Thanks for the clarification , Andi!
>
> I will do so and resend the 2nd patch.
>
> But is everyone agreed we should accept any slowdown for the timer
> functions ? I personally don't think it is a good idea, but I will
> regenerate the patch with the thunk function and without
> the Makefile change.
>
> Thanks & Best Regards,
> Jason
>


I am wondering whether it would be better to avoid generating the thunk
at all and drop the Makefile patch.

I know that changing the switch in __vdso_clock_gettime() like
this avoids the thunk:

	switch (clock) {
	case CLOCK_MONOTONIC:
		if (do_monotonic(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	default:
		switch (clock) {
		case CLOCK_REALTIME:
			if (do_realtime(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_MONOTONIC_RAW:
			if (do_monotonic_raw(ts) == VCLOCK_NONE)
				goto fallback;
			break;
		case CLOCK_REALTIME_COARSE:
			do_realtime_coarse(ts);
			break;
		case CLOCK_MONOTONIC_COARSE:
			do_monotonic_coarse(ts);
			break;
		default:
			goto fallback;
		}
	}
	return 0;
fallback: ...
}


So at the cost of an unnecessary extra test of the clock parameter,
the thunk is avoided .

I wonder if the whole switch should be changed to an if / else clause ?

Or, I know this might be unorthodox, but it might work:

#define _CAT(V1,V2) V1##V2
#define GTOD_CLK_LABEL(CLK) _CAT(_VCG_L_,CLK)
#define MAX_CLK 16	/* ^^ ?? */

__vdso_clock_gettime( ... ) { ...
	static const void *clklbl_tab[MAX_CLK] = {
		[CLOCK_MONOTONIC]     = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC),
		[CLOCK_MONOTONIC_RAW] = &&GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW),
		/* and similarly for all clocks handled ... */
	};

	goto *clklbl_tab[clock & 0xf];

GTOD_CLK_LABEL(CLOCK_MONOTONIC):
	if (do_monotonic(ts) == VCLOCK_NONE)
		goto fallback;
	return 0;

GTOD_CLK_LABEL(CLOCK_MONOTONIC_RAW):
	if (do_monotonic_raw(ts) == VCLOCK_NONE)
		goto fallback;
	return 0;

	/* ... similarly for all clocks */

fallback:
	return vdso_fallback_gettime(clock, ts);
}


If a restructuring like that might be acceptable (with correct tab-based
formatting), and the vDSO can have such a table in its .BSS, I think it
would avoid the thunk, and would have the advantage of precomputing the
jump table at compile-time, without, I think, requiring any indirect
branches.

Any thoughts ?

Thanks & Best regards,
Jason




Re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-17 Thread Jason Vas Dias
Fixed a typo in timer_latency.c affecting only the -r printout:

$ gcc -DN_SAMPLES=1000 -o timer timer_latency.c
CLOCK_MONOTONIC ( using rdtscp_ordered() ) :

$ ./timer -m -r 10
sum: 67615
Total time: 0.67615S - Average Latency: 0.00067S N zero
deltas: 0 N inconsistent deltas: 0
sum: 51858
Total time: 0.51858S - Average Latency: 0.00051S N zero
deltas: 0 N inconsistent deltas: 0
sum: 51742
Total time: 0.51742S - Average Latency: 0.00051S N zero
deltas: 0 N inconsistent deltas: 0
sum: 51944
Total time: 0.51944S - Average Latency: 0.00051S N zero
deltas: 0 N inconsistent deltas: 0
sum: 51838
Total time: 0.51838S - Average Latency: 0.00051S N zero
deltas: 0 N inconsistent deltas: 0
sum: 52397
Total time: 0.52397S - Average Latency: 0.00052S N zero
deltas: 0 N inconsistent deltas: 0
sum: 52428
Total time: 0.52428S - Average Latency: 0.00052S N zero
deltas: 0 N inconsistent deltas: 0
sum: 52135
Total time: 0.52135S - Average Latency: 0.00052S N zero
deltas: 0 N inconsistent deltas: 0
sum: 52145
Total time: 0.52145S - Average Latency: 0.00052S N zero
deltas: 0 N inconsistent deltas: 0
sum: 53116
Total time: 0.53116S - Average Latency: 0.00053S N zero
deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00053S


CLOCK_MONOTONIC_RAW ( using rdtscp() ) :

$ ./timer  -r 10
sum: 25755
Total time: 0.25755S - Average Latency: 0.00025S N zero
deltas: 0 N inconsistent deltas: 0
sum: 21614
Total time: 0.21614S - Average Latency: 0.00021S N zero
deltas: 0 N inconsistent deltas: 0
sum: 21616
Total time: 0.21616S - Average Latency: 0.00021S N zero
deltas: 0 N inconsistent deltas: 0
sum: 21610
Total time: 0.21610S - Average Latency: 0.00021S N zero
deltas: 0 N inconsistent deltas: 0
sum: 21619
Total time: 0.21619S - Average Latency: 0.00021S N zero
deltas: 0 N inconsistent deltas: 0
sum: 21617
Total time: 0.21617S - Average Latency: 0.00021S N zero
deltas: 0 N inconsistent deltas: 0
sum: 21610
Total time: 0.21610S - Average Latency: 0.00021S N zero
deltas: 0 N inconsistent deltas: 0
sum: 16940
Total time: 0.16940S - Average Latency: 0.00016S N zero
deltas: 0 N inconsistent deltas: 0
sum: 16939
Total time: 0.16939S - Average Latency: 0.00016S N zero
deltas: 0 N inconsistent deltas: 0
sum: 16943
Total time: 0.16943S - Average Latency: 0.00016S N zero
deltas: 0 N inconsistent deltas: 0
Average of 10 average latencies of 1000 samples : 0.00019S
/*
 * Program to measure high-res timer latency.
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const* argv, char *const* envp)
{ struct timespec sample[N_SAMPLES+1];
  unsigned int cnt=N_SAMPLES, s=0 , avg_n=0;
  unsigned long long
deltas [ N_SAMPLES ]
, t1, t2, sum=0, zd=0, ic=0, d
, t_start, avg_ns, *avgs=0;
  clockid_t clk = CLOCK_MONOTONIC_RAW;
  bool do_dump = false;
  int argn=1, repeat=1;
  for(; argn < argc; argn+=1)
if( argv[argn] != NULL )
  if( *(argv[argn]) == '-')
	switch( *(argv[argn]+1) )
	{ case 'm':
	  case 'M':
	clk = CLOCK_MONOTONIC;
	break;
	  case 'd':
	  case 'D':
	do_dump = true;
	break;
 	  case 'r':
  	  case 'R':
	if( (argn < argc) && (argv[argn+1] != NULL))
	  repeat = atoi(argv[argn+=1]);
	break;
	  case '?':
	  case 'h':
	  case 'u':
	  case 'U':
	  case 'H':
	fprintf(stderr,"Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t"
	"-r \n]\t"
	"Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n"
	   );
	return 0;
	}
  if( repeat > 1 )
  { avgs=alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
if( ((unsigned long) avgs) & 7 )
  avgs = ((unsigned long long*)(((unsigned char*)avgs)+(8-((unsigned long) avgs) & 7)));
  }
  do {
cnt=N_SAMPLES;
s=0;
  do
  { if( 0 != clock_gettime(clk, &sample[s++]) )
{ fprintf(stderr,"oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno));
  return 1;
}
  }while( --cnt );
  clock_gettime(clk, &sample[s]);
  for(s=1; s < (N_SAMPLES+1); s+=1)
  { t1 = TS2NS(sample[s-1]);
t2 = TS2NS(sample[s]);
if ( (t1 > t2)
   ||(sample[s-1].tv_sec > sample[s].tv_sec)
   ||((sample[s-1].tv_sec == sample[s].tv_sec)
&&(sample[s-1].tv_nsec > sample[s].tv_nsec)
 )
   )
{ fprintf(stderr,"Inconsistency: %llu %llu %lu.%lu %lu.%lu\n", t1 , t2
, sample[s-1].tv_sec, sample[s-1].tv_nsec
, sample[s].tv_sec,   sample[s].tv_nsec
  );
  ic+=1;
  

re: [PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-17 Thread Jason Vas Dias
Hi -

I submitted a new, stripped-down-to-bare-essentials version of
the patch (see LKML emails with $subject), which passes all
checkpatch.pl tests and addresses all concerns raised by reviewers,
which uses only rdtsc_ordered(), and which only updates in
vsyscall_gtod_data the new fields:
  u32 raw_mult, raw_shift; ...
  gtod_long_t monotonic_time_raw_sec   /* == tk->raw_sec */,
              monotonic_time_raw_nsec  /* == tk->tkr_raw.xtime_nsec */;
(this is NOT the formatting used in vgtod.h - sorry about the previous
 formatting issues).

I don't see how one could present the raw timespec in user-space
properly without tk->tkr_raw.xtime_nsec and tk->raw_sec;
monotonic has gtod->monotonic_time_sec and gtod->monotonic_time_snsec,
and I am only trying to follow exactly the existing algorithm in
timekeeping.c's getrawmonotonic64().
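
For reference, the consumer of these fields in the vDSO is the
do_monotonic_raw() shown in the patches earlier in this thread; its core
loop mirrors getrawmonotonic64() (sketch, not the exact hunk):

	do {
		seq = gtod_read_begin(gtod);
		mode = gtod->vclock_mode;
		ts->tv_sec = gtod->monotonic_time_raw_sec;
		ns = gtod->monotonic_time_raw_nsec;
		ns += vgetsns_raw(&mode);  /* ((cycles - cycle_last) & mask) * raw_mult */
		ns >>= gtod->raw_shift;
	} while (unlikely(gtod_read_retry(gtod, seq)));

	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
	ts->tv_nsec = ns;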

When I submitted the initial version of this stripped down patch,
I got an email back from robot<l...@intel.com> reporting a compilation
error saying :

>
>   arch/x86/entry/vdso/vclock_gettime.o: In function `__vdso_clock_gettime':
>   vclock_gettime.c:(.text+0xf7): undefined reference to 
> >`__x86_indirect_thunk_rax'
>   /usr/bin/ld: arch/x86/entry/vdso/vclock_gettime.o: relocation R_X86_64_PC32 
> >against undefined symbol `__x86_indirect_thunk_rax' can not be used when 
> making >a shared object; recompile with -fPIC
>   /usr/bin/ld: final link failed: Bad value
>>> collect2: error: ld returned 1 exit status
>--
>>> arch/x86/entry/vdso/vdso32.so.dbg: undefined symbols found
>--
>>> objcopy: 'arch/x86/entry/vdso/vdso64.so.dbg': No such file
>---


I had fixed this problem with the patch to the RHEL kernel attached to
bug #198161 (attachment #274751:
https://bugzilla.kernel.org/attachment.cgi?id=274751) ,
 by simply reducing the number of clauses in __vdso_clock_gettime's
switch(clock) from 6 to 5, but at the cost of an extra test of clock
and a second switch(clock).

I reported this as GCC bug:
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908
because I don't think GCC should fail for a switch with 6 clauses
and not for one with 5, but the response I got from H.J. Lu was:

H.J. Lu wrote @ 2018-03-16 22:13:27 UTC:
>
> vDSO isn't compiled with $(KBUILD_CFLAGS).  Why does your kernel do it?
>
> Please try my kernel patch at comment 4..
>

So that patch to the arch/x86/entry/vdso/Makefile only prevents enabling
RETPOLINE_CFLAGS for building the vDSO.

I defer to H.J.'s expertise on GCC + binutils and on the advisability of
enabling RETPOLINE_CFLAGS in the vDSO - GCC definitely behaves strangely
for the vDSO when RETPOLINE_CFLAGS are enabled.

Please provide something like the patch in a future version of Linux,
and, like H.J., I suggest not compiling the vDSO with RETPOLINE_CFLAGS.

The inconsistency-check program in tools/testing/selftests/timers produces
no errors for long runs, and the timer_latency.c program (attached) also
produces no errors, with latencies of @ 20ns for CLOCK_MONOTONIC_RAW
and @ 40ns for CLOCK_MONOTONIC - this is however with the additional
rdtscp patches, and under 4.15.9, for use on my system; the 4.16-rc5
version submitted still uses barrier() + rdtsc, and that has a latency
of @ 30ns for CLOCK_MONOTONIC_RAW and @ 40ns for CLOCK_MONOTONIC; but
both are much, much better than the 200-1000ns for CLOCK_MONOTONIC_RAW
that the unpatched kernels have (all times refer to the 'Average Latency'
output produced by 'timer_latency.c').

I do apologize for whitespace errors, unread emails and resends and confusion
of previous emails - I now understand the process and standards much better
and will attempt to adhere to them more closely in future.

Thanks & Best Regards,
Jason Vas Dias
/* 
 * Program to measure high-res timer latency.
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <alloca.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) + ((unsigned long long)((_TS_).tv_nsec)))

int main(int argc, char *const* argv, char *const* envp)
{ struct timespec sample[N_SAMPLES+1];
  unsigned int cnt=N_SAMPLES, s=0 , avg_n=0;
  unsigned long long
deltas [ N_SAMPLES ]
, t1, t2, sum=0, zd=0, ic=0, d
, t_start, avg_ns, *avgs=0;
  clockid_t clk = CLOCK_MONOTONIC_RAW;
  bool do_dump = false;
  int argn=1, repeat=1;
  for(; argn < argc; argn+=1)
if( argv[argn] != NULL )
  if( *(argv[argn]) == '-')
	switch( *(argv[argn]+1) )
	{ case 'm':
	  case 'M':
	clk = CLOCK_MONOTONIC;
	break;
	  case 'd':
	  case 'D':
	do_dump = true;
	break;
 	  case 'r':
  	  case 'R':
	if( (argn < argc) && (argv[argn+1] != NULL))
	  repeat = atoi(argv[argn+=1]);
	break;
	  case 

[PATCH v4.16-rc5 2/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-17 Thread jason . vas . dias
 This patch allows compilation to succeed with compilers that support 
-DRETPOLINE -
 it was kindly contributed by H.J. Lu in GCC Bugzilla 84908:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84908

 Apparently the GCC retpoline implementation has a limitation that it 
cannot
 handle switch statements with more than 5 clauses, which 
vclock_gettime.c's
 __vdso_clock_gettime function now contains.

 The automated test builds should now succeed with this patch.


diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 1943aeb..cb64e10 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -76,7 +76,7 @@ CFL := $(PROFILING) -mcmodel=small -fPIC -O2 
-fasynchronous-unwind-tables -m64 \
-fno-omit-frame-pointer -foptimize-sibling-calls \
-DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
 
-$(vobjs): KBUILD_CFLAGS := $(filter-out 
$(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS) 
$(RETPOLINE_CFLAGS) -DRETPOLINE,$(KBUILD_CFLAGS)) $(CFL)
 
 #
 # vDSO code runs in userspace and -pg doesn't help with profiling anyway.
@@ -143,6 +143,7 @@ KBUILD_CFLAGS_32 := $(filter-out 
-mcmodel=kernel,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS) 
-DRETPOLINE,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
 KBUILD_CFLAGS_32 += $(call cc-option, -fno-stack-protector)
 KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)


[PATCH v4.16-rc5 (2)] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-17 Thread jason . vas . dias


  Resent to address reviewer comments, and allow builds with compilers
  that support -DRETPOLINE to succeed.

  Currently, the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values, user-space  code needs to know elapsed time from the
  perspective of the CPU no matter how "hot" / fast or "cold" / slow it
  might be running wrt NTP / PTP "real" time; when code needs this,
  the latencies associated with a syscall are often unacceptably high.
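
  As an illustration of that use case, a minimal user-space sketch
  correlating TSC cycle counts with CLOCK_MONOTONIC_RAW elapsed time
  (assuming a GCC-style compiler that provides __rdtsc() in <x86intrin.h>;
  this is illustrative only and not part of the patch):

/* sketch: correlate TSC cycles with CLOCK_MONOTONIC_RAW elapsed time */
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

int main(void)
{
	struct timespec a, b;
	unsigned long long c0, c1;
	long long ns;

	c0 = __rdtsc();
	clock_gettime(CLOCK_MONOTONIC_RAW, &a);
	/* ... workload being measured ... */
	clock_gettime(CLOCK_MONOTONIC_RAW, &b);
	c1 = __rdtsc();

	ns = (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
	   + (b.tv_nsec - a.tv_nsec);
	printf("%llu cycles over %lld ns (~%.3f cycles/ns)\n",
	       c1 - c0, ns, ns ? (double)(c1 - c0) / (double)ns : 0.0);
	return 0;
}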

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns
  on average, and the test program:
   tools/testing/selftest/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c
   arch/x86/entry/vdso/Makefile   

  Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug 
#198161,
  as is the test program, timer_latency.c, to demonstrate the problem.

  Before the patch a latency of 200-1000ns was measured for
clock_gettime(CLOCK_MONOTONIC_RAW,)
  calls - after the patch, the same call on the same machine
  has a latency of @ 20ns.


Thanks & Best Regards,
Jason Vas Dias


[PATCH v4.16-rc5 1/2] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-17 Thread jason . vas . dias

 This patch makes the vDSO handle clock_gettime(CLOCK_MONOTONIC_RAW,)
 calls in the same way it handles clock_gettime(CLOCK_MONOTONIC,) 
calls,
 reducing latency from @ 200-1000ns to @ 20ns.


diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..843b0a6 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline __always_inline u64 vgetcycles(int *mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   switch (gtod->vclock_mode) {
+   case VCLOCK_TSC:
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   case VCLOCK_PVCLOCK:
+   return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   case VCLOCK_HVCLOCK:
+   return vread_hvclock(mode);
 #endif
-   else
+   default:
+   break;
+   }
+   return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
return 0;
+
v = (cycles - gtod->cycle_last) & gtod->mask;
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
+   return 0;
+
+   v = (cycles - gtod->cycle_last) & gtod->mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct 
timespec *ts)
return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct 
timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..c4d89b6 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -44,6 +44,8 @@ void update_vsyscall(struct timekeeper *tk)
vdata->mask = tk->tkr_mono.mask;
vdata->mult = tk->tkr_mono.mult;
vdata->shift= tk->tkr_mono.shift;
+   vdata->raw_mult = tk->tkr_raw.mult;
+   vdata->raw_shift= tk->tkr_raw.shift;
 
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
@@ -74,5 +76,8 @@ void update_vsyscall(struct timekeeper *tk)
vdata->monotonic_time_coarse_sec++;
}
 
+   vdata->monotonic_time_raw_sec  = tk->raw_sec;
+   vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;
+
gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..ec1a37c 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,7 +22,8 @@ struct vsyscall_gtod_data {
u64 mask;
u32 mult;
u32 shift;
-
+   u32 raw_mult;
+   u32 raw_shift;
/* open coded 'struct timespec' */
u64 wall_time_snsec;
gtod_long_t wall_time_sec;
@@ -32,6 +33,8 @@ struct vsyscall_gtod_data {
gtod_long_t wall_time_coarse_nsec;
gtod_long_t monotonic_time_coarse_sec;
gtod_long_t monotonic_time_coarse_nsec;
+   gtod_long_t monotonic_time_raw_sec;
+   gtod_long_t monotonic_time_raw_nsec;
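
For reference, the arithmetic that the do_monotonic_raw() added above
performs can be sketched in plain user-space C; the variable names mirror
the vsyscall_gtod_data fields introduced by this patch, and all numbers
are hypothetical, chosen only to make the cycles-to-nanoseconds step easy
to follow:

/* illustrative sketch of the raw-clock arithmetic; values are made up */
#include <stdio.h>

int main(void)
{
	unsigned long long cycle_last = 1000000ULL;    /* gtod->cycle_last */
	unsigned long long raw_nsec = 123456ULL << 24; /* monotonic_time_raw_nsec (shifted ns) */
	unsigned long long raw_sec = 5000ULL;          /* monotonic_time_raw_sec */
	unsigned int raw_mult = 5242880;               /* ~ (1e9 << 24) / 3.2e9 for a 3.2 GHz TSC */
	unsigned int raw_shift = 24;
	unsigned long long tsc = 1003200ULL;           /* current TSC reading */

	unsigned long long ns = raw_nsec + (tsc - cycle_last) * raw_mult;

	ns >>= raw_shift;
	/* prints: raw time = 5000 s + 124456 ns */
	printf("raw time = %llu s + %llu ns\n",
	       raw_sec + ns / 1000000000ULL, ns % 1000000000ULL);
	return 0;
}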
 

Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-16 Thread Jason Vas Dias
Good day -

RE:
On 15/03/2018, Thomas Gleixner  wrote:
> On Thu, 15 Mar 2018, Jason Vas Dias wrote:
>> On 15/03/2018, Thomas Gleixner  wrote:
>> > On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote:
>> >
>> >>   Resent to address reviewer comments.
>> >
>> > I was being patient so far and tried to guide you through the patch
>> > submission process, but unfortunately this turns out to be just waste of
>> > my
>> > time.
>> >
>> > You have not addressed any of the comments I made here:
>> >
>> > [1]
>> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de
>> > [2]
>> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de
>> >
>>
>> I'm really sorry about that - I did not see those mails ,
>> and have searched for them in my inbox -
>
> That's close to the 'my dog ate the homework' excuse.
>


Nevertheless, those messages are NOT in my inbox, nor
can I find them on the list - a google search for
'alpine.DEB.2.21.1803141511340.2481' or
'alpine.DEB.2.21.1803141527300.2481' returns
only the last two mails on the subject , where
you included the links to https://lkml.kernel.org.

I don't know what went wrong here, but I did not
receive those mails until you informed me of them
yesterday evening, when I immediately regenerated
the Patch #1 incorporating fixes for your comments,
and sent it with Subject:
  '[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle\
   clock_gettime(CLOCK_MONOTONIC_RAW) without syscall
  '
This version re-uses the 'gtod->cycles' value, which as you point
out, is the same as 'tk->tkr_raw.cycle_last'  -
so I removed vread_tsc_raw() .


> Of course they were sent to the list and to you personally as I used
> reply-all. From the mail server log:
>
> 2018-03-14 15:27:27 1ew7NH-00039q-Hv <= t...@linutronix.de
> id=alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de
>
> 2018-03-14 15:27:30 1ew7NH-00039q-Hv => jason.vas.d...@gmail.com R=dnslookup
> T=remote_smtp H=gmail-smtp-in.l.google.com [2a00:1450:4013:c01::1a]
> X=TLS1.2:RSA_AES_128_CBC_SHA1:128 DN="C=US,ST=California,L=Mountain
> View,O=Google Inc,CN=mx.google.com"
>
> 2018-03-14 15:27:31 1ew7NH-00039q-Hv => linux-kernel@vger.kernel.org
> R=dnslookup T=remote_smtp H=vger.kernel.org [209.132.180.67]
>
> 
>
> 2018-03-14 15:27:47 1ew7NH-00039q-Hv Completed
>
> If those messages would not have been delivered to
> linux-kernel@vger.kernel.org they would hardly be on the mailing list
> archive, right?
>

Yes, I cannot explain why I did not receive them .

I guess I should consider gmail an unreliable delivery
method and use the lkml.org web interface to check
for replies - I will do this from now on.

> And they both got delivered to your gmail account as well.
>

No, they are not in my gmail account Inbox or folders.


> ERROR: Missing Signed-off-by: line(s)
> total: 1 errors, 0 warnings, 71 lines checked
>

I do not know how to fix this error - I was hoping
someone on the list might enlighten me.
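
For reference, checkpatch is asking for a Developer's Certificate of
Origin trailer at the end of the patch changelog, which 'git commit -s'
or 'git format-patch -s' adds automatically, e.g. (placeholder name and
address):

    Signed-off-by: Your Name <your.email@example.com>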

>
> WARNING: externs should be avoided in .c files
> #24: FILE: arch/x86/entry/vdso/vclock_gettime.c:31:
> +extern unsigned int __vdso_tsc_calibration(
>

I thought that must be a script bug, since no extern
is being declared by that line; it is an external function
declaration, just like the unmodified line that precedes it.


> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> #93:
> new file mode 100644
>
> ERROR: Missing Signed-off-by: line(s)
>
> total: 1 errors, 2 warnings, 143 lines checked
>
> It reports an error for every single patch of your latest submission.
>
>> And I did send the test results in a previous mail -
>
> In private mail which I ignore if there is no real good reason. And just
> for the record. This private mail contains the following headers:
>
> In-Reply-To: 
> References: <1521001222-10712-1-git-send-email-jason.vas.d...@gmail.com>
>  <1521001222-10712-3-git-send-email-jason.vas.d...@gmail.com>
> 
> From: Jason Vas Dias 
> Date: Wed, 14 Mar 2018 15:08:55 +
> Message-ID:
> 
> Subject: Re: [PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle
> CLOCK_MONOTONIC_RAW
>
> So now, if you take the message ID which is in the In-Reply-To: field and
> compare it to the message ID which I used for link [2]:
>
> In-Reply-To: 
>> > https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de
>
> you might notice that these are identical. So how did you end up replying
> to a mail which you never received?
>
> Nice try. I'm really fed up with this.
>

The o

[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-15 Thread jason . vas . dias

Resent to address reviewer comments.

  Currently, the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values, user-space  code needs to know elapsed time from the
  perspective of the CPU no matter how "hot" / fast or "cold" / slow it
  might be running wrt NTP / PTP "real" time; when code needs this,
  the latencies associated with a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 20ns
  on average, and the test program:
   tools/testing/selftest/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c

  Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug 
#198161,
  as is the test program, timer_latency.c, to demonstrate the problem.

  Before the patch a latency of 200-1000ns was measured for
clock_gettime(CLOCK_MONOTONIC_RAW,)
  calls - after the patch, the same call on the same machine
  has a latency of @ 20ns.


Thanks & Best Regards,
Jason Vas Dias


[PATCH v4.16-rc5 1/1] x86/vdso: VDSO should handle clock_gettime(CLOCK_MONOTONIC_RAW) without syscall

2018-03-15 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..8b9b9cf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,27 +182,49 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline __always_inline u64 vgetcycles(int *mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   switch (gtod->vclock_mode) {
+   case VCLOCK_TSC:
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   case VCLOCK_PVCLOCK:
+   return vread_pvclock(mode);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   case VCLOCK_HVCLOCK:
+   return vread_hvclock(mode);
 #endif
-   else
+   default:
+   break;
+   }
+   return 0;
+}
+
+notrace static inline u64 vgetsns(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
return 0;
+
v = (cycles - gtod->cycle_last) & gtod->mask;
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles = vgetcycles(mode);
+
+   if (cycles == 0)
+   return 0;
+
+   v = (cycles - gtod->cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +268,27 @@ notrace static int __always_inline do_monotonic(struct 
timespec *ts)
return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +320,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct 
timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..83f5c21 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -44,6 +44,9 @@ void update_vsyscall(struct timekeeper *tk)
vdata->mask = tk->tkr_mono.mask;
vdata->mult = tk->tkr_mono.mult;
vdata->shift= tk->tkr_mono.shift;
+   vdata->raw_mask = tk->tkr_raw.mask;
+   vdata->raw_mult = tk->tkr_raw.mult;
+   vdata->raw_shift= tk->tkr_raw.shift;
 
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
@@ -74,5 +77,8 @@ void update_vsyscall(struct timekeeper *tk)
vdata->monotonic_time_coarse_sec++;
}
 
+   vdata->monotonic_time_raw_sec  = tk->raw_sec;
+   vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;
+
gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..941e9d6 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,7 +22,9 @@ struct vsyscall_gtod_data {
u64 mask;
u32 mult;
u32 shift;
-
+   u32 raw_mask;
+   u32 raw_mult;
+   u32 raw_shift;
/* open coded 'struct timespec' */
u64 wall_time_snsec;
gtod_long_t wall_time_sec;
@@ -32,6 +34,8 @@ struct vsyscall_gtod_data {
gtod_long_t wall_time_coarse_nsec;
gtod_long_t monotonic_time_coarse_sec;
gtod_long_t monotonic_time_coarse_nsec;
+   gtod_long_t monotonic_time_raw_sec;
+   gtod_long_t monotonic_time_raw_nsec;
 
int tz_minuteswest;
int tz_dsttime;
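
As background on the raw_mult / raw_shift pair exported above: they follow
the usual clocksource convention ns = (cycles * mult) >> shift, so for a
TSC of frequency tsc_khz one expects roughly mult = (10^6 << shift) / tsc_khz.
A small sketch with a hypothetical 3.2 GHz TSC (illustrative only, not
kernel code):

/* illustrative sketch of the cycles-to-ns mult/shift convention */
#include <stdio.h>

int main(void)
{
	unsigned long long tsc_khz = 3200000ULL;  /* hypothetical 3.2 GHz TSC */
	unsigned int shift = 24;
	unsigned long long mult = ((1000000ULL << shift) + tsc_khz / 2) / tsc_khz;
	unsigned long long cycles = 3200;         /* 3200 cycles at 3.2 GHz ~ 1000 ns */

	printf("mult=%llu ns=%llu\n", mult, (cycles * mult) >> shift);
	return 0;
}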


Re: [PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-15 Thread Jason Vas Dias
Hi Thomas -
RE:
On 15/03/2018, Thomas Gleixner  wrote:
> Jason,
>
> On Thu, 15 Mar 2018, jason.vas.d...@gmail.com wrote:
>
>>   Resent to address reviewer comments.
>
> I was being patient so far and tried to guide you through the patch
> submission process, but unfortunately this turns out to be just waste of my
> time.
>
> You have not addressed any of the comments I made here:
>
> [1]
> https://lkml.kernel.org/r/alpine.deb.2.21.1803141511340.2...@nanos.tec.linutronix.de
> [2]
> https://lkml.kernel.org/r/alpine.deb.2.21.1803141527300.2...@nanos.tec.linutronix.de
>

I'm really sorry about that - I did not see those mails ,
and have searched for them in my inbox -
are you sure they were sent to 'linux-kernel@vger.kernel.org' ?
That is the only list I am subscribed to .
I clicked on the links , but the 'To:' field is just
'linux-kernel' .

If I had seen those messages before I re-submitted,
those issues would have been fixed.

checkpatch.pl did not report them -
I ran it with all patches and it reported
no errors .

And I did send the test results in a previous mail -

$ gcc -m64 -o timer timer.c

( must be compiled in 64-bit mode).

This is using the new rdtscp() function :
$ ./timer -r 100
...
Total time: 0.02806S - Average Latency: 0.00028S N zero
deltas: 0 N inconsistent deltas: 0
Average of 100 average latencies of 100 samples : 0.00027S

This is using the rdtsc_ordered() function:

$ ./timer -m -r 100
Total time: 0.05269S - Average Latency: 0.00052S N zero
deltas: 0 N inconsistent deltas: 0
Average of 100 average latencies of 100 samples : 0.00047S

timer.c is a very short program that just reads N_SAMPLES (a
compile-time option) timespecs using either CLOCK_MONOTONIC_RAW (no -m)
or CLOCK_MONOTONIC as the first parameter to clock_gettime(), then
computes the deltas as long long and averages them, counting any
zero deltas, or deltas where the previous timespec is somehow
greater than the current timespec, which are reported as
inconsistencies (note 'inconsistent deltas: 0' and 'zero deltas: 0' in output).

So my initial claim that rdtscp() can be twice as fast as rdtsc_ordered()
was not far-fetched - this is what I am seeing .

I think this is because of the explicit barrier() call in rdtsc_ordered().
This must be slower than the internal processor pipeline
"cancellation point" (barrier) used by the rdtscp instruction itself.
This is the only reason for the rdtscp call  -  plus all modern Intel
& AMD CPUs support it, and it DOES solve the ordering problem,
whereby instructions in one pipeline of a task can get different
rdtsc() results than instructions in another pipeline.
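
A rough way to observe this from user space is to time back-to-back reads
with the __rdtsc() / __rdtscp() compiler intrinsics from <x86intrin.h>
(a sketch only; these are the compiler intrinsics, not the kernel's
rdtsc_ordered() / rdtscp() helpers, so the absolute numbers will differ):

#include <stdio.h>
#include <x86intrin.h>

#define N 1000000

int main(void)
{
	unsigned int aux;
	unsigned long long t0, t1, sum_plain = 0, sum_p = 0;
	int i;

	t0 = __rdtsc();
	for (i = 0; i < N; i++)
		sum_plain += __rdtsc();
	t1 = __rdtsc();
	printf("rdtsc  : %.1f cycles/read\n", (double)(t1 - t0) / N);

	t0 = __rdtsc();
	for (i = 0; i < N; i++)
		sum_p += __rdtscp(&aux);
	t1 = __rdtsc();
	printf("rdtscp : %.1f cycles/read\n", (double)(t1 - t0) / N);

	return (int)((sum_plain ^ sum_p) & 1); /* keep the sums live */
}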

I will document the results better in the ChangeLog , fix all issues
you identified, and resend .

I did not mean to ignore your comments -
those mails are nowhere in my Inbox -
please ,  confirm the actual email address
they are getting sent to.

Thanks & Regards,
Jason
/* 
 * Program to measure high-res timer latency.
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <alloca.h>
#include <sys/types.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec) * 1000000000ULL) + ((unsigned long long)(_TS_).tv_nsec))

int main(int argc, char *const* argv, char *const* envp)
{ struct timespec sample[N_SAMPLES+1];
  unsigned int cnt=N_SAMPLES, s=0 , avg_n=0;
  unsigned long long
deltas [ N_SAMPLES ]
, t1, t2, sum=0, zd=0, ic=0, d
, t_start, avg_ns, *avgs=0;
  clockid_t clk = CLOCK_MONOTONIC_RAW;
  bool do_dump = false;
  int argn=1, repeat=1;
  for(; argn < argc; argn+=1)
if( argv[argn] != NULL )
  if( *(argv[argn]) == '-')
	switch( *(argv[argn]+1) )
	{ case 'm':
	  case 'M':
	clk = CLOCK_MONOTONIC;
	break;
	  case 'd':
	  case 'D':
	do_dump = true;
	break;
 	  case 'r':
  	  case 'R':
	if( (argn < argc) && (argv[argn+1] != NULL))
	  repeat = atoi(argv[argn+=1]);
	break;
	  case '?':
	  case 'h':
	  case 'u':
	  case 'U':
	  case 'H':
	fprintf(stderr,"Usage: timer_latency [\n\t-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)\n\t-d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t"
	"-r \n]\t" 
	"Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n"
	   );
	return 0;
	}
  if( repeat > 1 )
  { avgs=alloca(sizeof(unsigned long long) * (N_SAMPLES + 1));
if( ((unsigned long) avgs) & 7 )
  avgs = ((unsigned long long*)(((unsigned char*)avgs)+(8-((unsigned long) avgs) & 7)));
  }
  do {
cnt=N_SAMPLES;
s=0;
  do
  { if( 0 != clock_gettime(clk, &sample[s++]) )
{ fprintf(stderr,"oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno));
  return 1;
}
  }while( --cnt );
  clock_gettime(clk, &sample[s]);
  for(s=1; s < (N_SAMPLES+1); s+=1)
  { t1 = TS2NS(sample[s-1]);
t2 = TS2NS(sample[s]);
if ( (t1 > t2)
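
(The attached program is truncated at this point in the archive.) A minimal
self-contained sketch of the measurement it describes - averaging the deltas
of consecutive clock_gettime() readings and counting zero or backwards
deltas - might look like:

#include <stdio.h>
#include <time.h>

#define N_SAMPLES 100

int main(void)
{
	struct timespec s[N_SAMPLES + 1];
	unsigned long long t1, t2, sum = 0, zero = 0, backwards = 0, good;
	int i;

	for (i = 0; i <= N_SAMPLES; i++)
		clock_gettime(CLOCK_MONOTONIC_RAW, &s[i]);

	for (i = 1; i <= N_SAMPLES; i++) {
		t1 = (unsigned long long)s[i-1].tv_sec * 1000000000ULL + s[i-1].tv_nsec;
		t2 = (unsigned long long)s[i].tv_sec * 1000000000ULL + s[i].tv_nsec;
		if (t2 == t1)
			zero++;
		else if (t2 < t1)
			backwards++;
		else
			sum += t2 - t1;
	}
	good = N_SAMPLES - zero - backwards;
	printf("average latency: %llu ns, zero deltas: %llu, inconsistent deltas: %llu\n",
	       good ? sum / good : 0, zero, backwards);
	return 0;
}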
   

[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-15 Thread jason . vas . dias

  Resent to address reviewer comments.
   
  Currently, the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values, user-space  code needs to know elapsed time from the
  perspective of the CPU no matter how "hot" / fast or "cold" / slow it
  might be running wrt NTP / PTP "real" time; when code needs this,
  the latencies associated with a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .

  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftest/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vdso/vdso.lds.S
   arch/x86/entry/vdso/vdsox32.lds.S
   arch/x86/entry/vdso/vdso32/vdso32.lds.S
   arch/x86/entry/vsyscall/vsyscall_gtod.c

  There are 3 patches in the series :

   Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with 
rdtsc_ordered()

  Patches #2 & #3 should be considered "optional" :

   Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new 
rdtscp() function in msr.h

   Patch #3 makes the VDSO export TSC calibration data via a new function in 
the vDSO:
   unsigned int __vdso_linux_tsc_calibration ( struct 
linux_tsc_calibration *tsc_cal )
that user code can optionally call.


   Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls somewhat faster
   than clock_gettime(CLOCK_MONOTONIC) calls.

   I think something like Patch #3 is necessary to export TSC calibration data 
to user-space TSC readers.

   It is entirely up to the kernel developers whether they want to include 
patches
   #2 and #3, but I think something like Patch #1 really needs to get into a 
future
   Linux release, as an unecessary latency of 200-1000ns for a timer that can 
tick
   3 times per nanosecond is unacceptable.

   Patches for kernels 3.10.0-21 and 4.9.65-rt23 (ARM) are attached to bug 
#198161. 
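
   As an illustration of calling vDSO entry points such as
   __vdso_clock_gettime() directly from user code, a sketch assuming the
   parse_vdso.c helper from the kernel's vDSO selftests
   (tools/testing/selftests/vDSO/) is compiled in, and the x86_64 symbol
   version "LINUX_2.6":

#include <stdio.h>
#include <time.h>
#include <sys/auxv.h>

/* provided by parse_vdso.c from the kernel's vDSO selftests */
extern void vdso_init_from_sysinfo_ehdr(unsigned long base);
extern void *vdso_sym(const char *version, const char *name);

int main(void)
{
	int (*vclock_gettime)(clockid_t, struct timespec *);
	struct timespec ts;

	vdso_init_from_sysinfo_ehdr(getauxval(AT_SYSINFO_EHDR));
	vclock_gettime = (int (*)(clockid_t, struct timespec *))
			 vdso_sym("LINUX_2.6", "__vdso_clock_gettime");
	if (!vclock_gettime) {
		fprintf(stderr, "__vdso_clock_gettime not found\n");
		return 1;
	}
	vclock_gettime(CLOCK_MONOTONIC_RAW, &ts);
	printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
	return 0;
}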


Thanks & Best Regards,
Jason Vas Dias


[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-15 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index fbc7371..2c46675 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void)
 
 notrace static u64 vread_tsc_raw(void)
 {
-   u64 tsc
+   u64 tsc  = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered())
  , last = gtod->raw_cycle_last;
 
-   tsc   = rdtsc_ordered();
if (likely(tsc >= last))
return tsc;
asm volatile ("");
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index 5af7093..0327a95 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 #include 
+#include 
+
+extern unsigned int tsc_khz;
 
 int vclocks_used __read_mostly;
 
@@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk)
vdata->raw_mask = tk->tkr_raw.mask;
vdata->raw_mult = tk->tkr_raw.mult;
vdata->raw_shift= tk->tkr_raw.shift;
+   vdata->has_rdtscp   = static_cpu_has(X86_FEATURE_RDTSCP);
 
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 30df295..a5ff704 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -218,6 +218,37 @@ static __always_inline unsigned long long 
rdtsc_ordered(void)
return rdtsc();
 }
 
+/**
+ * rdtscp() - read the current TSC and (optionally) CPU number, with built-in
+ *cancellation point replacing barrier - only available
+ *if static_cpu_has(X86_FEATURE_RDTSCP) .
+ * returns:   The 64-bit Time Stamp Counter (TSC) value.
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain
+ * the number (Intel CPU ID) of the CPU that the task is currently running on.
+ * As does EAX_EDT_RET, this uses the "open-coded asm" style to
+ * force the compiler + assembler to always use (eax, edx, ecx) registers,
+ * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit
+ * variables are used - exactly the same code should be generated
+ * for this instruction on 32-bit as on 64-bit when this asm stanza is used.
+ * See: SDM , Vol #2, RDTSCP instruction.
+ */
+static __always_inline u64 rdtscp(u32 *cpu_out)
+{
+   u32 tsc_lo, tsc_hi, tsc_cpu;
+
+   asm volatile
+   ("rdtscp"
+   :   "=a" (tsc_lo)
+ , "=d" (tsc_hi)
+ , "=c" (tsc_cpu)
+   ); // : eax, edx, ecx used - NOT rax, rdx, rcx
+   if (unlikely(cpu_out != ((void *)0)))
+   *cpu_out = tsc_cpu;
+   return ((((u64)tsc_hi) << 32) |
+   (((u64)tsc_lo) & 0x0ffffffffULL)
+  );
+}
+
 /* Deprecated, keep it for a cycle for easier merging: */
 #define rdtscll(now)   do { (now) = rdtsc_ordered(); } while (0)
 
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 24e4d45..e7e4804 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -26,6 +26,7 @@ struct vsyscall_gtod_data {
u64 raw_mask;
u32 raw_mult;
u32 raw_shift;
+   u32 has_rdtscp;
 
/* open coded 'struct timespec' */
u64 wall_time_snsec;


[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-15 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..fbc7371 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc
+ , last = gtod->raw_cycle_last;
+
+   tsc   = rdtsc_ordered();
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode)
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct 
timespec *ts)
return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct 
timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..5af7093 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk)
vdata->mult = tk->tkr_mono.mult;
vdata->shift= tk->tkr_mono.shift;
 
+   vdata->raw_cycle_last   = tk->tkr_raw.cycle_last;
+   vdata->raw_mask = tk->tkr_raw.mask;
+   vdata->raw_mult = tk->tkr_raw.mult;
+   vdata->raw_shift= tk->tkr_raw.shift;
+
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
 
@@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk)
vdata->monotonic_time_coarse_sec++;
}
 
+   vdata->monotonic_time_raw_sec  = tk->raw_sec;
+   vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;
+
gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..24e4d45 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,6 +22,10 @@ struct vsyscall_gtod_data {
u64 mask;
u32 mult;
u32 shift;
+   u64 raw_cycle_last;
+   u64 raw_mask;
+   u32 raw_mult;
+   u32 raw_shift;
 
/* open coded 'struct timespec' */
u64 wall_time_snsec;
@@ -32,6 +36,8 @@ struct vsyscall_gtod_data {
gtod_long_t wall_time_coarse_nsec;
gtod_long_t monotonic_time_coarse_sec;
gtod_long_t monotonic_time_coarse_nsec;
+   gtod_long_t monotonic_time_raw_sec;
+   gtod_long_t monotonic_time_raw_nsec;
 
int tz_minuteswest;
int tz_dsttime;


[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-15 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 03f3904..61d9633 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,12 +21,15 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
 extern time_t __vdso_time(time_t *t);
+extern unsigned int __vdso_linux_tsc_calibration(
+   struct linux_tsc_calibration_s *tsc_cal);
 
 #ifdef CONFIG_PARAVIRT_CLOCK
 extern u8 pvclock_page
@@ -383,3 +386,25 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+notrace unsigned int
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal)
+{
+   unsigned long seq;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   if ((gtod->vclock_mode == VCLOCK_TSC) &&
+   (tsc_cal != ((void *)0UL))) {
+   tsc_cal->tsc_khz = gtod->tsc_khz;
+   tsc_cal->mult= gtod->raw_mult;
+   tsc_cal->shift   = gtod->raw_shift;
+   return 1;
+   }
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   return 0;
+}
+
+unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..e0b5cce 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -25,6 +25,8 @@ VERSION {
__vdso_getcpu;
time;
__vdso_time;
+   linux_tsc_calibration;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S 
b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
index 422764a..17fd07f 100644
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -26,6 +26,7 @@ VERSION
__vdso_clock_gettime;
__vdso_gettimeofday;
__vdso_time;
+   __vdso_linux_tsc_calibration;
};
 
LINUX_2.5 {
diff --git a/arch/x86/entry/vdso/vdsox32.lds.S 
b/arch/x86/entry/vdso/vdsox32.lds.S
index 05cd1c5..7acac71 100644
--- a/arch/x86/entry/vdso/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdsox32.lds.S
@@ -21,6 +21,7 @@ VERSION {
__vdso_gettimeofday;
__vdso_getcpu;
__vdso_time;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff --git a/arch/x86/include/uapi/asm/vdso_tsc_calibration.h 
b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h
new file mode 100644
index 000..ce4b5a45
--- /dev/null
+++ b/arch/x86/include/uapi/asm/vdso_tsc_calibration.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_VDSO_TSC_CALIBRATION_H
+#define _ASM_X86_VDSO_TSC_CALIBRATION_H
+/*
+ * Programs that want to use rdtsc / rdtscp instructions
+ * from user-space can make use of the Linux kernel TSC calibration
+ * by calling :
+ *__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *);
+ * ( one has to resolve this symbol as in
+ *   tools/testing/selftests/vDSO/parse_vdso.c
+ * )
+ * which fills in a structure
+ * with the following layout :
+ */
+
+/** struct linux_tsc_calibration_s -
+ * mult:amount to multiply 64-bit TSC value by
+ * shift:   the right shift to apply to (mult*TSC) yielding nanoseconds
+ * tsc_khz: the calibrated TSC frequency in KHz from which previous
+ *  members calculated
+ */
+struct linux_tsc_calibration_s {
+
+   unsigned int mult;
+   unsigned int shift;
+   unsigned int tsc_khz;
+
+};
+
+/* To use:
+ *
+ *  static unsigned
+ *  (*linux_tsc_cal)(struct linux_tsc_calibration_s *linux_tsc_cal) =
+ *vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration");
+ *  if(linux_tsc_cal == ((void *)0))
+ *  { fprintf(stderr,"the patch providing __vdso_linux_tsc_calibration"
+ *   " is not applied to the kernel.\n");
+ *return ERROR;
+ *  }
+ *  static struct linux_tsc_calibration_s clock_source={0};
+ *  if((clock_source.mult==0) && ! (*linux_tsc_cal)(&clock_source) )
+ *    fprintf(stderr,"TSC is not the system clocksource.\n");
+ *  unsigned int tsc_lo, tsc_hi, tsc_cpu;
+ *  asm volatile
+ *  ( "rdtscp" : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu) );
+ *  unsigned long tsc = (((unsigned long)tsc_hi) << 32) | tsc_lo;
+ *  unsigned long nanoseconds =
+ *   (( clock_source . mult ) * tsc ) >> (clock_source . shift);
+ *
+ *  nanoseconds is now TSC value converted to nanoseconds,
+ *  according to Linux' clocksource calibration values.
+ *  Incidentally, 'tsc_cpu' is the number of the CPU the task is running on.
+ *
+ * 


Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-14 Thread Jason Vas Dias
Thanks for the helpful comments, Peter -
re:
On 14/03/2018, Peter Zijlstra  wrote:
>
>> Yes, I am sampling perf counters,
>
> You're not in fact sampling, you're just reading the counters.

Correct, using Linux-ese terminology - but "sampling" in looser English.


>> Reading performance counters does involve  2 ioctls and a read() ,
>
> So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and
> just let them run and do:
>
>   read(group_fd, _pre, size);
>   /* your code section */
>   read(group_fd, _post, size);
>
>   /* compute buf_post - buf_pre */
>
> Which is only 2 system calls, not 4.

But I can't, really - I am trying to restrict the
performance counter measurements
to only a subset of the code, and exclude
performance measurement result processing  -
so the timeline is like:
  struct timespec t_start, t_end;
  perf_event_open(...);
  thread_main_loop() { ... do {
  t   _  clock_gettime(CLOCK_MONOTONIC_RAW, &t_start);
  t+x _  enable_perf();
         total_work = do_some_work();
         disable_perf();
         clock_gettime(CLOCK_MONOTONIC_RAW, &t_end);
  t+y _
         read_perf_counters_and_store_results
          ( perf_grp_fd, ..., total_work,
            TS2T( &t_end ) - TS2T( &t_start)
          );
  } while ( ... );
  }

   Now, here the bandwidth / performance results recorded by
   my 'read_perf_counters_and_store_results' method
   are very sensitive to the measurement of the OUTER
   elapsed time.

>
> Also, a while back there was the proposal to extend the mmap()
> self-monitoring interface to groups, see:
>
> https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowo...@hirez.programming.kicks-ass.net
>
> I never did get around to writing the actual code for it, but it
> shouldn't be too hard.
>

Great, I'm looking forward to trying it - but meanwhile,
to get NON-MULTIPLEXED measurements for the SAME CODE SEQUENCE
over the SAME TIME I believe the group FD method is what is implemented
and what works.
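
(For concreteness, the group-FD setup I mean looks roughly like this -
 a sketch only, with error handling and the real event list omitted:)

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <sys/ioctl.h>
  #include <string.h>
  #include <stdio.h>
  #include <unistd.h>

  static int perf_open(struct perf_event_attr *pea, int group_fd)
  {
        /* pid 0 = this task, cpu -1 = any CPU */
        return (int) syscall(__NR_perf_event_open, pea, 0, -1, group_fd, 0);
  }

  int main(void)
  {
        struct perf_event_attr pea;
        struct { unsigned long long nr, time_enabled, time_running, value; } buf;

        memset(&pea, 0, sizeof(pea));
        pea.size = sizeof(pea);
        pea.type = PERF_TYPE_SOFTWARE;
        pea.config = PERF_COUNT_SW_CPU_CLOCK;  /* the CPU_CLOCK counter discussed */
        pea.disabled = 1;
        pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED |
                          PERF_FORMAT_TOTAL_TIME_RUNNING;

        int perf_grp_fd = perf_open(&pea, -1); /* group leader; add more events
                                                * with perf_open(&attr, perf_grp_fd) */

        ioctl(perf_grp_fd, PERF_EVENT_IOC_ENABLE, 0);   /* enable_perf()  */
        /* ... the measured code section ... */
        ioctl(perf_grp_fd, PERF_EVENT_IOC_DISABLE, 0);  /* disable_perf() */

        /* one read() returns nr, time_enabled, time_running and one value
         * per group member, all for the same enabled interval */
        read(perf_grp_fd, &buf, sizeof(buf));
        printf("enabled: %llu ns  running: %llu ns\n",
               buf.time_enabled, buf.time_running);
        return 0;
  }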


>> The CPU_CLOCK software counter should give the converted TSC cycles
>> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...)
>> and the  ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the
>> difference between the event->time_running and time_enabled
>> should also measure elapsed time .
>
> While CPU_CLOCK is TSC based, there is no guarantee it has any
> correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based).
>
> (although, I think I might have fixed that recently and it might just
> work, but it's very much not guaranteed).

Yes, I believe the CPU_CLOCK is effectively the converted TSC -
it does appear to correlate well with the new CLOCK_MONOTONIC_RAW
values from the patched VDSO.

> If you want to correlate to CLOCK_MONOTONIC_RAW you have to read
> CLOCK_MONOTONIC_RAW and not some random other clock value.
>

Exactly ! Hence the need for the patch so that users can get
CLOCK_MONOTONIC_RAW values with low latency and correlate them
with PERF CPU_CLOCK values.

>> This gives the "inner" elapsed time, from the perspective of the kernel,
>> while the measured code section had the counters enabled.
>>
>> But unless the user-space program  also has a way of measuring elapsed
>> time from the CPU's perspective , ie. without being subject to
>> operator or NTP / PTP adjustment, it has no way of correlating this
>> inner elapsed time with any "outer"
>
> You could read the time using the group_fd's mmap() page. That actually
> includes the TSC mult,shift,offset as used by perf clocks.
>

Yes, but as mentioned earlier, that presupposes I want to use the mmap()
sample method - I don't - I want to use the Group FD method, so
that I can be sure the measurements are for the same code sequence
over the same period of time.

>> Currently, users must parse the log file or use gdb / objdump to
>> inspect /proc/kcore to get the TSC calibration and exact
>> mult+shift values for the TSC value conversion.
>
> Which ;-) there's multiple floating around..
>

Yes, but why must Linux make it so difficult ?
I think it has to be recognized that the vDSO (or the user-space program
itself) is the only place where clock values can be generated with latencies
low enough to be useful to user-space programs.
So why does Linux not export the TSC calibration, which is so complex to
compute and is available nowhere else ?
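
(All that user code really needs from the kernel is the clocksource's
 mult/shift pair; the conversion itself is the same arithmetic the patched
 vgetsns_raw() does - a sketch:)

  /* raw TSC delta -> nanoseconds, using the exported calibration */
  static inline unsigned long long tsc_delta_to_ns(unsigned long long cycles,
                                                   unsigned int mult,
                                                   unsigned int shift)
  {
        return (cycles * mult) >> shift; /* == (v * raw_mult) >> raw_shift */
  }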


>> Intel does not publish, nor does the CPU come with in ROM or firmware,
>> the actual precise TSC frequency - this must be calibrated against the
>> other clocks , according to a complicated procedure in section 18.2 of
>> the SDM . My TSC has a "rated" / nominal TSC frequency , which one
>> can compute from CPUID leaves, of 2.3ghz, but the "Refined TSC frequency"
>> is 2.8333ghz .
>
> You might want to look at commit:

[PATCH v4.16-rc5 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-13 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..fbc7371 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc
+ , last = gtod->raw_cycle_last;
+
+   tsc   = rdtsc_ordered();
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *mode)
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +279,27 @@ notrace static int __always_inline do_monotonic(struct 
timespec *ts)
return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct 
timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd..5af7093 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -45,6 +45,11 @@ void update_vsyscall(struct timekeeper *tk)
vdata->mult = tk->tkr_mono.mult;
vdata->shift= tk->tkr_mono.shift;
 
+   vdata->raw_cycle_last   = tk->tkr_raw.cycle_last;
+   vdata->raw_mask = tk->tkr_raw.mask;
+   vdata->raw_mult = tk->tkr_raw.mult;
+   vdata->raw_shift= tk->tkr_raw.shift;
+
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
 
@@ -74,5 +79,8 @@ void update_vsyscall(struct timekeeper *tk)
vdata->monotonic_time_coarse_sec++;
}
 
+   vdata->monotonic_time_raw_sec  = tk->raw_sec;
+   vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;
+
gtod_write_end(vdata);
 }
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..24e4d45 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -22,6 +22,10 @@ struct vsyscall_gtod_data {
u64 mask;
u32 mult;
u32 shift;
+   u64 raw_cycle_last;
+   u64 raw_mask;
+   u32 raw_mult;
+   u32 raw_shift;
 
/* open coded 'struct timespec' */
u64 wall_time_snsec;
@@ -32,6 +36,8 @@ struct vsyscall_gtod_data {
gtod_long_t wall_time_coarse_nsec;
gtod_long_t monotonic_time_coarse_sec;
gtod_long_t monotonic_time_coarse_nsec;
+   gtod_long_t monotonic_time_raw_sec;
+   gtod_long_t monotonic_time_raw_nsec;
 
int tz_minuteswest;
int tz_dsttime;


[PATCH v4.16-rc5 (3)] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-13 Thread jason . vas . dias


  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values, user-space  code needs to know elapsed time from the
  perspective of the CPU no matter how "hot" / fast or "cold" / slow it
  might be running wrt NTP / PTP "real" time; when code needs this,
  the latencies associated with a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.
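
  (A quick way to reproduce the latency figures - an illustration only, not
   the actual test used - is to time back-to-back calls and take the minimum
   observable delta:)

   #include <stdio.h>
   #include <time.h>

   int main(void)
   {
        struct timespec t1, t2;
        long d, best = -1;
        int i;

        for (i = 0; i < 1000000; i++) {
                clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
                clock_gettime(CLOCK_MONOTONIC_RAW, &t2);
                d = (t2.tv_sec - t1.tv_sec) * 1000000000L
                    + (t2.tv_nsec - t1.tv_nsec);
                if (best < 0 || d < best)
                        best = d;
        }
        printf("minimum measurable delta: %ld ns\n", best);
        return 0;
   }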

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:
  
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vdso/vdso.lds.S
   arch/x86/entry/vdso/vdsox32.lds.S
   arch/x86/entry/vdso/vdso32/vdso32.lds.S  
   arch/x86/entry/vsyscall/vsyscall_gtod.c

  There are 3 patches in the series :

   Patch #1 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with 
rdtsc_ordered()

   Patch #2 makes the VDSO handle clock_gettime(CLOCK_MONOTONIC_RAW) with a new 
rdtscp() function in msr.h

   Patch #3 makes the VDSO export TSC calibration data via a new function in 
the vDSO: 
   unsigned int __vdso_linux_tsc_calibration ( struct 
linux_tsc_calibration *tsc_cal )
that user code can optionally call.

   Patches #2 & #3 should be considered "optional" .

   Patch #2 makes clock_gettime(CLOCK_MONOTONIC_RAW) calls have @ half the 
latency
   of clock_gettime(CLOCK_MONOTONIC) calls.

   I think something like Patch #3 is necessary to export TSC calibration data 
to user-space TSC readers.


Best Regards,
Jason Vas Dias


[PATCH v4.16-rc5 2/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-13 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index fbc7371..2c46675 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -184,10 +184,9 @@ notrace static u64 vread_tsc(void)
 
 notrace static u64 vread_tsc_raw(void)
 {
-   u64 tsc
+   u64 tsc  = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered())
  , last = gtod->raw_cycle_last;
 
-   tsc   = rdtsc_ordered();
if (likely(tsc >= last))
return tsc;
asm volatile ("");
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index 5af7093..0327a95 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 #include 
+#include 
+
+extern unsigned tsc_khz;
 
 int vclocks_used __read_mostly;
 
@@ -49,6 +52,7 @@ void update_vsyscall(struct timekeeper *tk)
vdata->raw_mask = tk->tkr_raw.mask;
vdata->raw_mult = tk->tkr_raw.mult;
vdata->raw_shift= tk->tkr_raw.shift;
+   vdata->has_rdtscp   = static_cpu_has(X86_FEATURE_RDTSCP);
 
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 30df295..a5ff704 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -218,6 +218,36 @@ static __always_inline unsigned long long 
rdtsc_ordered(void)
return rdtsc();
 }
 
+/**
+ * rdtscp() - read the current TSC and (optionally) CPU number, with built-in
+ *cancellation point replacing barrier - only available
+ *if static_cpu_has(X86_FEATURE_RDTSCP) .
+ * returns:   The 64-bit Time Stamp Counter (TSC) value.
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain
+ * the number (Intel CPU ID) of the CPU that the task is currently running on.
+ * As does EAX_EDT_RET, this uses the "open-coded asm" style to
+ * force the compiler + assembler to always use (eax, edx, ecx) registers,
+ * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit 
+ * variables are used - exactly the same code should be generated
+ * for this instruction on 32-bit as on 64-bit when this asm stanza is used.
+ * See: SDM , Vol #2, RDTSCP instruction.
+ */
+static __always_inline u64 rdtscp(u32 *cpu_out)
+{
+   u32 tsc_lo, tsc_hi, tsc_cpu;
+   asm volatile
+   ( "rdtscp"
+   :   "=a" (tsc_lo)
+ , "=d" (tsc_hi)
+ , "=c" (tsc_cpu)
+   ); // : eax, edx, ecx used - NOT rax, rdx, rcx
+   if (unlikely(cpu_out != ((void*)0)))
+   *cpu_out = tsc_cpu;
+   return ((((u64)tsc_hi) << 32) |
+   (((u64)tsc_lo) & 0x0ffffffffULL )
+  );
+}
+
 /* Deprecated, keep it for a cycle for easier merging: */
 #define rdtscll(now)   do { (now) = rdtsc_ordered(); } while (0)
 
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 24e4d45..e7e4804 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -26,6 +26,7 @@ struct vsyscall_gtod_data {
u64 raw_mask;
u32 raw_mult;
u32 raw_shift;
+   u32 has_rdtscp;
 
/* open coded 'struct timespec' */
u64 wall_time_snsec;


[PATCH v4.16-rc5 3/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-13 Thread jason . vas . dias
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 2c46675..772988c 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
@@ -184,7 +185,7 @@ notrace static u64 vread_tsc(void)
 
 notrace static u64 vread_tsc_raw(void)
 {
-   u64 tsc  = (gtod->has_rdtscp ? rdtscp((void*)0) : rdtsc_ordered())
+   u64 tsc  = (gtod->has_rdtscp ? rdtscp((void *)0) : rdtsc_ordered())
  , last = gtod->raw_cycle_last;
 
if (likely(tsc >= last))
@@ -383,3 +384,21 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+unsigned int __vdso_linux_tsc_calibration(
+   struct linux_tsc_calibration_s *tsc_cal);
+
+notrace unsigned int
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal)
+{
+   if ((gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void *)0UL))) {
+   tsc_cal->tsc_khz = gtod->tsc_khz;
+   tsc_cal->mult= gtod->raw_mult;
+   tsc_cal->shift   = gtod->raw_shift;
+   return 1;
+   }
+   return 0;
+}
+
+unsigned int linux_tsc_calibration(struct linux_tsc_calibration_s *tsc_cal)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..e0b5cce 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -25,6 +25,8 @@ VERSION {
__vdso_getcpu;
time;
__vdso_time;
+   linux_tsc_calibration;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S 
b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
index 422764a..17fd07f 100644
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -26,6 +26,7 @@ VERSION
__vdso_clock_gettime;
__vdso_gettimeofday;
__vdso_time;
+   __vdso_linux_tsc_calibration;
};
 
LINUX_2.5 {
diff --git a/arch/x86/entry/vdso/vdsox32.lds.S 
b/arch/x86/entry/vdso/vdsox32.lds.S
index 05cd1c5..7acac71 100644
--- a/arch/x86/entry/vdso/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdsox32.lds.S
@@ -21,6 +21,7 @@ VERSION {
__vdso_gettimeofday;
__vdso_getcpu;
__vdso_time;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c 
b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index 0327a95..692562a 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -53,6 +53,7 @@ void update_vsyscall(struct timekeeper *tk)
vdata->raw_mult = tk->tkr_raw.mult;
vdata->raw_shift= tk->tkr_raw.shift;
vdata->has_rdtscp   = static_cpu_has(X86_FEATURE_RDTSCP);
+   vdata->tsc_khz  = tsc_khz;
 
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index a5ff704..c7b2ed2 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -227,7 +227,7 @@ static __always_inline unsigned long long 
rdtsc_ordered(void)
  * the number (Intel CPU ID) of the CPU that the task is currently running on.
  * As does EAX_EDT_RET, this uses the "open-coded asm" style to
  * force the compiler + assembler to always use (eax, edx, ecx) registers,
- * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit 
+ * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit
  * variables are used - exactly the same code should be generated
  * for this instruction on 32-bit as on 64-bit when this asm stanza is used.
  * See: SDM , Vol #2, RDTSCP instruction.
@@ -236,15 +236,15 @@ static __always_inline u64 rdtscp(u32 *cpu_out)
 {
u32 tsc_lo, tsc_hi, tsc_cpu;
asm volatile
-   ( "rdtscp"
+   ("rdtscp"
:   "=a" (tsc_lo)
  , "=d" (tsc_hi)
  , "=c" (tsc_cpu)
); // : eax, edx, ecx used - NOT rax, rdx, rcx
-   if (unlikely(cpu_out != ((void*)0)))
+   if (unlikely(cpu_out != ((void *)0)))
*cpu_out = tsc_cpu;
 return ((((u64)tsc_hi) << 32) |
-   (((u64)tsc_lo) & 0x0ffffffffULL )
+   (((u64)tsc_lo) & 0x0ffffffffULL)
   );
 }
 
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index e7e4804..75078fc 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -27,6 +27,7 @@ struct vsyscall_gtod_data {
u32 raw_mult;
u32 raw_shift;
u32 has_rdtscp;
+   u32 tsc_khz;
 
   

Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-13 Thread Jason Vas Dias
On 12/03/2018, Peter Zijlstra <pet...@infradead.org> wrote:
> On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote:
>>   Sometimes, particularly when correlating elapsed time to performance
>>   counter values,
>
> So what actual problem are you tring to solve here? Perf can already
> give you sample time in various clocks, including MONOTONIC_RAW.
>
>

Yes, I am sampling perf counters, including CPU_CYCLES , INSTRUCTIONS,
CPU_CLOCK, TASK_CLOCK, etc, in a Group FD I open with
perf_event_open() , for the current thread on the current CPU -
I am doing this for 4 threads , on Intel & ARM cpus.

Reading performance counters does involve  2 ioctls and a read() ,
which takes time that  already far exceeds the time required to read
the TSC or CNTPCT in the VDSO .

The CPU_CLOCK software counter should give the converted TSC cycles
seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...)
and the  ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the
difference between the event->time_running and time_enabled
should also measure elapsed time .
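
For reference, a minimal sketch of that kind of counter setup - illustration
only, not code from the patch: a single software event instead of the full
group, no error handling, and the read_format layout as documented in the
perf_event_open(2) man page:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	/* value, time_enabled, time_running - per the requested read_format */
	struct { uint64_t value, time_enabled, time_running; } rf;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_TASK_CLOCK;
	attr.disabled = 1;
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	/* current thread (pid 0), any CPU (-1), no group leader, no flags */
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... measured section of the thread's main loop ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	read(fd, &rf, sizeof(rf));
	printf("task clock %llu ns, enabled %llu ns, running %llu ns\n",
	       (unsigned long long)rf.value,
	       (unsigned long long)rf.time_enabled,
	       (unsigned long long)rf.time_running);
	close(fd);
	return 0;
}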

This gives the "inner" elapsed time, from the perspective of the kernel,
while the measured code section had the counters enabled.

But unless the user-space program  also has a way of measuring elapsed time
from the CPU's perspective , ie. without being subject to operator or NTP / PTP
adjustment, it has no way of correlating this inner elapsed time with
any "outer"
elapsed time measurement it may have made - I also measure the time
taken by I/O operations between threads, for instance.

So that is my primary motivation - for each thread's main run loop, I
enable performance counters and count several PMU counters
and the CPU_CLOCK & TASK_CLOCK .  I want to determine
with maximal accuracy how much elapsed time was used
actually executing the task's instructions on the CPU ,
and how long they took to execute.
I want to try to exclude the time spent gathering and making
and analysing the performance measurements from the
time spent running the threads' main loop .

To do this accurately, it is best to exclude variations in time
that occur because of operator or NTP / PTP adjustments .

The CLOCK_MONOTONIC_RAW clock is the ONLY
clock that is MEANT to be immune from any adjustment.

It is meant to be a high-resolution clock with 1ns resolution
that should be subject to no adjustment, and hence one would expect
it to have the lowest latency.
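
As an illustration (not part of any patch) of what "immune from adjustment"
means in practice: any NTP / PTP slewing shows up as a drift between the
adjusted clock and the raw clock over an interval -

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long ns_of(const struct timespec *t)
{
	return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
	struct timespec m0, r0, m1, r1;

	clock_gettime(CLOCK_MONOTONIC, &m0);
	clock_gettime(CLOCK_MONOTONIC_RAW, &r0);
	sleep(10);
	clock_gettime(CLOCK_MONOTONIC, &m1);
	clock_gettime(CLOCK_MONOTONIC_RAW, &r1);

	/* a non-zero difference is the adjustment applied to CLOCK_MONOTONIC */
	printf("adjusted: %lld ns  raw: %lld ns  difference: %lld ns\n",
	       ns_of(&m1) - ns_of(&m0), ns_of(&r1) - ns_of(&r0),
	       (ns_of(&m1) - ns_of(&m0)) - (ns_of(&r1) - ns_of(&r0)));
	return 0;
}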

But the way Linux has up to now implemented it , CLOCK_MONOTONIC_RAW
has a resolution (minimum time that can be measured)
that varies from 300 - 1000ns .

I can read the TSC  and store a 16-byte timespec value in @ 8ns
on the same CPU .

I understand that linux must conform to the POSIX interface which
means it cannot provide sub-nanosecond resolution timers, but
it could allow user-space programs to easily discover the timer calibration
so that user-space programs can read the timers themselves.

Currently, users must parse the log file or use gdb / objdump to
inspect /proc/kcore to get the TSC calibration and exact
mult+shift values for the TSC value conversion.

Intel does not publish, nor does the CPU provide in ROM or firmware,
the actual precise TSC frequency - this must be calibrated against the
other clocks , according to a complicated procedure in section 18.2 of
the SDM . My TSC has a "rated" / nominal TSC frequency of 2.3GHz, which
one can compute from CPUID leaves, but the "Refined TSC frequency"
is 2.8333GHz .

Hence I think Linux should export this calibrated frequency somehow ;
its "calibration" is expressed as the raw clocksource 'mult' and 'shift'
values, and is exported to the VDSO .
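
To be clear about what those two values mean, the conversion they define is
just the usual clocksource one - a sketch, with mult chosen so that roughly
mult == (1000000 << shift) / tsc_khz :

/* nanoseconds elapsed for a TSC delta, as the kernel's timekeeping
 * (and the proposed do_monotonic_raw() in the vDSO) computes it:
 */
static inline unsigned long long cycles_to_ns(unsigned long long delta_cycles,
					      unsigned int mult,
					      unsigned int shift)
{
	return (delta_cycles * mult) >> shift;
}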

I think the VDSO should read the TSC and use the calibration
to render the raw, unadjusted time from the CPU's perspective.

Hence, the patch I am preparing , which is again attached.

I will submit it properly via email once I figure out
how to obtain the 'git-send-email' tool, and how to
use it to send multiple patches, which seems
to be the only way to submit acceptable patches.

Also the attached timer program measures a latency
of @ 20ns with my patched 4.15.9 kernel, when it
measured a latency of 300-1000ns without it.

Thanks & Regards,

Jason


vdso_clock_monotonic_raw_1.patch
Description: Binary data
/* 
 * Program to measure high-res timer latency.
 *
 */
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

int main(int argc, char *const* argv, char *const* envp)
{ clockid_t clk = CLOCK_MONOTONIC_RAW;
  bool do_dump = false;
  int argn=1;
  for(; argn < argc; argn+=1)
if( argv[argn] != NULL )
  if( *(argv[argn]) == '-')
	switch( *(argv[argn]+1) )
	{ case 'm':
	  case 'M':
	clk = CLOCK_MONOTONIC;
	break;
	  case 'd':
	  case 'D':
	do_dump = true;
	break;
	case '?':
	case 

Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias
The split patches with no checkpatch.pl failures are
attached and were just sent in separate emails
to the mailing list .

Sorry it took a few tries to get right .

This will be my last send today -
I'm off to use it at work.

Thanks & all the best,
Jason


vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch
Description: Binary data


vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#2.patch
Description: Binary data


[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198961 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.
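
  (The latency quoted here is the minimum non-zero delta between consecutive
  readings; a minimal sketch of such a measurement - much simpler than the
  timer test program attached elsewhere in this thread - is:)

#include <stdio.h>
#include <time.h>

#define N_SAMPLES 1000

int main(void)
{
	struct timespec t[N_SAMPLES];
	long long min_delta = -1;
	int i;

	for (i = 0; i < N_SAMPLES; i++)
		clock_gettime(CLOCK_MONOTONIC_RAW, &t[i]);

	for (i = 1; i < N_SAMPLES; i++) {
		long long d = (t[i].tv_sec - t[i-1].tv_sec) * 1000000000LL
			    + (t[i].tv_nsec - t[i-1].tv_nsec);
		if (d > 0 && (min_delta < 0 || d < min_delta))
			min_delta = d;
	}
	printf("minimum measurable delta: %lld ns\n", min_delta);
	return 0;
}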

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

   arch/x86/include/asm/msr.h
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c
   
  This is the second patch in the series,
  which adds use of rdtscp .

  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
2018-03-12 08:12:17.110120433 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 
08:59:21.135475862 +
@@ -187,7 +187,7 @@ notrace static u64 vread_tsc_raw(void)
u64 tsc
  , last = gtod->raw_cycle_last;
 
-   tsc   = rdtsc_ordered();
+   tsc = gtod->has_rdtscp ? rdtscp((void*)0UL) : rdtsc_ordered();
if (likely(tsc >= last))
return tsc;
asm volatile ("");
diff -up linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5-p1  
2018-03-12 07:58:07.974214168 +
+++ linux-4.16-rc5/arch/x86/entry/vsyscall/vsyscall_gtod.c  2018-03-12 
08:54:07.490267640 +
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int vclocks_used __read_mostly;
 
@@ -49,6 +50,7 @@ void update_vsyscall(struct timekeeper *
vdata->raw_mask = tk->tkr_raw.mask;
vdata->raw_mult = tk->tkr_raw.mult;
vdata->raw_shift= tk->tkr_raw.shift;
+   vdata->has_rdtscp   = static_cpu_has(X86_FEATURE_RDTSCP);
 
vdata->wall_time_sec= tk->xtime_sec;
vdata->wall_time_snsec  = tk->tkr_mono.xtime_nsec;
diff -up linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/include/asm/msr.h
--- linux-4.16-rc5/arch/x86/include/asm/msr.h.4.16-rc5-p1   2018-03-12 
00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/include/asm/msr.h   2018-03-12 09:06:03.902728749 
+
@@ -218,6 +218,36 @@ static __always_inline unsigned long lon
return rdtsc();
 }
 
+/**
+ * rdtscp() - read the current TSC and (optionally) CPU number, with built-in
+ *cancellation point replacing barrier - only available
+ *if static_cpu_has(X86_FEATURE_RDTSCP) .
+ * returns:   The 64-bit Time Stamp Counter (TSC) value.
+ * Optionally, 'cpu_out' can be non-null, and on return it will contain
+ * the number (Intel CPU ID) of the CPU that the task is currently running on.
+ * As does EAX_EDX_RET, this uses the "open-coded asm" style to
+ * force the compiler + assembler to always use (eax, edx, ecx) registers,
+ * NOT whole (rax, rdx, rcx) on x86_64 , because only 32-bit 
+ * variables are used - exactly the same code should be generated
+ * for this instruction on 32-bit as on 64-bit when this asm stanza is used.
+ * See: SDM , Vol #2, RDTSCP instruction.
+ */
+static __always_inline u64 rdtscp(u32 *cpu_out)
+{
+   u32 tsc_lo, tsc_hi, tsc_cpu;
+   asm volatile
+   ( "rdtscp"
+   :   "=a" (tsc_lo)
+ , "=d" (tsc_hi)
+ , "=c" (tsc_cpu)
+   );
+   if ( unlikely(cpu_out != ((void*)0)) )
+   *cpu_out = tsc_cpu;
+   ret

[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198961 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only these files:
  
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c
   

  There are 2 patches in the series - this first
  one handles CLOCK_MONOTONIC_RAW in VDSO using
  existing rdtsc_ordered() , and the second
  uses the new rdtscp() function which avoids
  use of an explicit barrier.
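
  For illustration, the user-space equivalent of the two read styles looks
  roughly like this - a sketch only, the kernel's rdtsc_ordered() / rdtscp()
  in asm/msr.h are the authoritative versions:

#include <stdint.h>

/* rdtsc needs a separate ordering barrier (lfence on modern Intel parts),
 * whereas rdtscp waits for prior instructions by itself and additionally
 * returns the IA32_TSC_AUX value (the CPU number on Linux) in ecx.
 */
static inline uint64_t tsc_read_lfence(void)
{
	uint32_t lo, hi;

	asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi) :: "memory");
	return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t tsc_read_rdtscp(uint32_t *cpu)
{
	uint32_t lo, hi, aux;

	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	if (cpu)
		*cpu = aux;
	return ((uint64_t)hi << 32) | lo;
}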

  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 
linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5  
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c   2018-03-12 
08:12:17.110120433 +
@@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc
+ , last = gtod->raw_cycle_last;
+
+   tsc   = rdtsc_ordered();
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +279,27 @@ notrace static int __always_inline do_mo
return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 
linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16

Re: [PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias
Good day -

On 12/03/2018, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Thomas Gleixner <t...@linutronix.de> wrote:
>
>> On Mon, 12 Mar 2018, Jason Vas Dias wrote:
>>
>> checkpatch.pl still reports:
>>
>>total: 15 errors, 3 warnings, 165 lines checked
>>

Sorry I didn't see you had responded until 40 mins ago .

I finally found where checkpatch.pl is and it now reports :

WARNING: Possible unwrapped commit description (prefer a maximum 75
chars per line)
#2:
--- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5  
2018-03-12
00:25:09.0 +

WARNING: struct  should normally be const
#55: FILE: arch/x86/entry/vdso/vclock_gettime.c:282:
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)


I don't know how to fix that, since 'ts' cannot be a const pointer.


ERROR: Missing Signed-off-by: line(s)


I guess that disappears once someone OKs the patch.

total: 1 errors, 2 warnings, 127 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
  mechanically convert to the typical style using --fix or --fix-inplace.

../vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch has style
problems, please review.

NOTE: If any of the errors are false positives, please report
  them to the maintainer, see CHECKPATCH in MAINTAINERS.


>> > +notrace static u64 vread_tsc_raw(void)
>> > +{
>> > +  u64 tsc, last=gtod->raw_cycle_last;
>> > +  if( likely( gtod->has_rdtscp ) )
>> > +  tsc = rdtscp((void*)0);
>>
>> Plus I asked more than once to split that rdtscp() stuff into a separate
>> patch.

I misunderstood - I thought you meant the rdtscp implementation
which was split into a separate file - but now it is in a separate patch ,
(attached).

>>
>> You surely are free to ignore my review comments, but rest assured that
>> I'm
>> free to ignore the crap you insist to send me as well.
>

I didn't mean to ignore any comments, and I'm really trying to fix this problem
the right way and not produce crap.


> In addition to Thomas's review feedback I'd strongly urge the careful
> reading of
> Documentation/SubmittingPatches as well:
>
>  - When sending multiple patches please use git-send-mail
>
>  - Please don't send several patch iterations per day!
>
>  - Code quality of the submitted patches is atrocious, please run them
> through
>scripts/checkpatch.pl (and make sure they pass) to at least enable the
> reading
>of them.
>
>  - ... plus dozens of other details described in
> Documentation/SubmittingPatches.
>
> Thanks,
>
>   Ingo
>

I am reading all those documents and cannot see how the code in
the attached patch contravenes any guidelines / best practices -
if you can, please clarify phrases like "atrocious style" - I cannot
see any style guidelines contravened, and I can prove that
the numeric output produced in 16-30ns is just as good
as that produced before the patch was applied in 300-700ns .

Aside from any style comments, any content comments ?

Sorry, I am new to the latest kernel guidelines.
I needed to get this problem solved the right way for use at work today.

Thanks for your advice,
Best Regards
Jason


vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-4.16-rc5#1.patch
Description: Binary data


[PATCH v4.16-rc4 1/3] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198961 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, about the same as do_monotonic(), and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  The patch affects only files:
  
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c


  This is a resend of the original patch fixing review issues -
  the next patch will add the rdtscp() function .

  The patch passes the checkpatch.pl script .

  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 
linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5  
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5.1/arch/x86/entry/vdso/vclock_gettime.c   2018-03-12 
08:12:17.110120433 +
@@ -182,6 +182,18 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc
+ , last = gtod->raw_cycle_last;
+
+   tsc = rdtsc_ordered();
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +215,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +279,27 @@ notrace static int __always_inline do_mo
return mode;
 }
 
+notrace static __always_inline int do_monotonic_raw(struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +331,10 @@ notrace int __vdso_clock_gettime(clockid
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff -up linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5 
linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc5.1/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc5   
2018-03-12 00:25:09.0 +
+++ li

[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198961 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:
  
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c   
   arch/x86/entry/vdso/vdso.lds.S
   arch/x86/entry/vdso/vdsox32.lds.S
   arch/x86/entry/vdso/vdso32/vdso32.lds.S  


  and adds one new file:
   arch/x86/include/uapi/asm/vdso_tsc_calibration.h
   
  This is a second patch in the series, 
  which adds a record of the calibrated tsc frequency to the VDSO,
  and a new header:
uapi/asm/vdso_tsc_calibration.h
  which defines a structure :
struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; };
  and a getter function in the VDSO that can optionally be used
  by user-space code to implement sub-nanosecond precision clocks .
  This second patch is entirely optional but I think greatly
  expands the scope of user-space TSC readers .
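
  For illustration, user-space code could then build its own raw TSC clock
  along these lines - a sketch only, not part of the patch: it assumes a
  glibc that lets dlopen() find the already-mapped vDSO as "linux-vdso.so.1"
  (otherwise the symbol has to be located via the AT_SYSINFO_EHDR auxv entry,
  as in the kernel's Documentation/vDSO examples), and older glibc needs -ldl:

#include <stdio.h>
#include <stdint.h>
#include <dlfcn.h>

/* mirrors the struct in the proposed uapi/asm/vdso_tsc_calibration.h */
struct linux_tsc_calibration { uint32_t tsc_khz, mult, shift; };

typedef unsigned (*tsc_cal_fn)(struct linux_tsc_calibration *);

static inline uint64_t rdtscp_u64(void)
{
	uint32_t lo, hi, aux;

	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	(void)aux;
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_NOLOAD);
	tsc_cal_fn get_cal = vdso ?
		(tsc_cal_fn)dlsym(vdso, "__vdso_linux_tsc_calibration") : 0;
	struct linux_tsc_calibration cal;
	uint64_t t0, t1;

	if (!get_cal || !get_cal(&cal)) {
		fprintf(stderr, "TSC calibration not available\n");
		return 1;
	}
	t0 = rdtscp_u64();
	/* ... code being timed ... */
	t1 = rdtscp_u64();
	/* same conversion the vDSO itself uses: (cycles * mult) >> shift */
	printf("tsc_khz=%u elapsed=%llu ns\n", cal.tsc_khz,
	       (unsigned long long)(((t1 - t0) * cal.mult) >> cal.shift));
	return 0;
}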

  Resent : Oops, in previous version of this patch (#2),
  the comments in the new vdso_tsc_calibration were wrong,
  for an earlier version - sorry about that.

  Best Regards,
 Jason Vas Dias  .

 PATCH 2/2:
---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
2018-03-12 04:29:27.296982872 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 
05:38:53.019891195 +
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+   if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) )
+   {
+   tsc_cal -> tsc_khz = gtod->tsc_khz;
+   tsc_cal -> mult= gtod->raw_mult;
+   tsc_cal -> shift   = gtod->raw_shift;
+   return 1;
+   }
+   return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1   2018-03-12 
00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S   2018-03-12 
05:18:36.380673342 +
@@ -25,6 +25,8 @@ VERSION {
__vdso_getcpu;
time;
__vdso_time;
+   linux_tsc_calibration;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1  
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S  2018-03-12 
05:19:10.765022295 +
@@ -26,6 +26,7 @@ VERSION
__vdso_clock_gettime;
__vdso_gettimeofday;
__vdso_time;
+   __vdso_linux_tsc_calibration;
};
 
LINUX_2.5 {
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.l

[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-12 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftest/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

    arch/x86/include/asm/vgtod.h
    arch/x86/entry/vdso/vclock_gettime.c
    arch/x86/entry/vsyscall/vsyscall_gtod.c
    arch/x86/entry/vdso/vdso.lds.S
    arch/x86/entry/vdso/vdsox32.lds.S
    arch/x86/entry/vdso/vdso32/vdso32.lds.S

  and adds one new file:
    arch/x86/include/uapi/asm/vdso_tsc_calibration.h
   
  This is a second patch in the series, 
  which adds a record of the calibrated tsc frequency to the VDSO,
  and a new header:
uapi/asm/vdso_tsc_calibration.h
  which defines a structure :
struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; };
  and a getter function in the VDSO that can optionally be used
  by user-space code to implement sub-nanosecond precision clocks .
  This second patch is entirely optional but I think greatly
  expands the scope of user-space TSC readers .
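
  To illustrate what a user-space reader could do with the exported
  calibration, here is a rough sketch (the names and the way the vDSO
  symbol is resolved - e.g. with the ELF parser from the kernel's vDSO
  selftests - are illustrative assumptions, not part of the patch):

/* sketch: convert raw TSC deltas to ns with the exported calibration */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

struct linux_tsc_calibration { uint32_t tsc_khz, mult, shift; };

/* signature of the new vDSO getter; how its address is found is left out */
typedef unsigned (*tsc_cal_fn)(struct linux_tsc_calibration *);

static uint64_t tsc_delta_ns(uint64_t delta, const struct linux_tsc_calibration *c)
{
	/* same mult/shift scaling the kernel clocksource conversion uses */
	return (delta * (uint64_t)c->mult) >> c->shift;
}

int demo(tsc_cal_fn linux_tsc_calibration)
{
	struct linux_tsc_calibration cal;
	uint64_t t0, t1;

	if (!linux_tsc_calibration(&cal))
		return -1;	/* TSC is not the current clocksource */

	t0 = __rdtsc();
	t1 = __rdtsc();
	printf("back-to-back rdtsc: %llu ns\n",
	       (unsigned long long)tsc_delta_ns(t1 - t0, &cal));
	return 0;
}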

  Resent : Oops, in previous version of this patch (#2),
  the comments in the new vdso_tsc_calibration were wrong,
  for an earlier version - sorry about that.

  Best Regards,
 Jason Vas Dias  .

 PATCH 2/2:
---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
2018-03-12 04:29:27.296982872 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 
05:38:53.019891195 +
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
@@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+   if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) )
+   {
+   tsc_cal -> tsc_khz = gtod->tsc_khz;
+   tsc_cal -> mult= gtod->raw_mult;
+   tsc_cal -> shift   = gtod->raw_shift;
+   return 1;
+   }
+   return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1   2018-03-12 
00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S   2018-03-12 
05:18:36.380673342 +
@@ -25,6 +25,8 @@ VERSION {
__vdso_getcpu;
time;
__vdso_time;
+   linux_tsc_calibration;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1  
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S  2018-03-12 
05:19:10.765022295 +
@@ -26,6 +26,7 @@ VERSION
__vdso_clock_gettime;
__vdso_gettimeofday;
__vdso_time;
+   __vdso_linux_tsc_calibration;
};
 
LINUX_2.5 {
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.l

[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-11 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
    tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

    arch/x86/include/asm/vgtod.h
    arch/x86/entry/vdso/vclock_gettime.c
    arch/x86/entry/vdso/vdso.lds.S
    arch/x86/entry/vdso/vdsox32.lds.S
    arch/x86/entry/vdso/vdso32/vdso32.lds.S
    arch/x86/entry/vsyscall/vsyscall_gtod.c

  This is a second patch in the series, 
  which adds a record of the calibrated tsc frequency to the VDSO,
  and a new header:
uapi/asm/vdso_tsc_calibration.h
  which defines a structure :
struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; };
  and a getter function in the VDSO that can optionally be used
  by user-space code to implement sub-nanosecond precision clocks .
  This second patch is entirely optional but I think greatly
  expands the scope of user-space TSC readers .

  Oops, previous version of this second patch
  mistakenly copied the changed part of vclock_gettime.c.

  Best Regards,
 Jason Vas Dias  .
 
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
2018-03-12 04:29:27.296982872 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 
05:38:53.019891195 +
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
@@ -385,3 +386,22 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+   if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) )
+   {
+   tsc_cal -> tsc_khz = gtod->tsc_khz;
+   tsc_cal -> mult= gtod->raw_mult;
+   tsc_cal -> shift   = gtod->raw_shift;
+   return 1;
+   }
+   return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1   2018-03-12 
00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S   2018-03-12 
05:18:36.380673342 +
@@ -25,6 +25,8 @@ VERSION {
__vdso_getcpu;
time;
__vdso_time;
+   linux_tsc_calibration;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1  
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S  2018-03-12 
05:19:10.765022295 +
@@ -26,6 +26,7 @@ VERSION
__vdso_clock_gettime;
__vdso_gettimeofday;
__vdso_time;
+   __vdso_linux_tsc_calibration;
};
 
LINUX_2.5 {
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdsox32.lds.S.4.16-rc5-p1
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso


[PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-11 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
    tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  This patch affects only files:

    arch/x86/include/asm/vgtod.h
    arch/x86/entry/vdso/vclock_gettime.c
    arch/x86/entry/vdso/vdso.lds.S
    arch/x86/entry/vdso/vdsox32.lds.S
    arch/x86/entry/vdso/vdso32/vdso32.lds.S
    arch/x86/entry/vsyscall/vsyscall_gtod.c

  This is a second patch in the series, 
  which adds a record of the calibrated tsc frequency to the VDSO,
  and a new header:
uapi/asm/vdso_tsc_calibration.h
  which defines a structure :
struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; };
  and a getter function in the VDSO that can optionally be used
  by user-space code to implement sub-nanosecond precision clocks .
  This second patch is entirely optional but I think greatly
  expands the scope of user-space TSC readers .

  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5-p1 
2018-03-12 04:29:27.296982872 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 
05:10:53.185158334 +
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
@@ -385,3 +386,41 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+   if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) )
+   {
+   tsc_cal -> tsc_khz = gtod->tsc_khz;
+   tsc_cal -> mult= gtod->raw_mult;
+   tsc_cal -> shift   = gtod->raw_shift;
+   return 1;
+   }
+   return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
+
+extern unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *);
+
+notrace unsigned
+__vdso_linux_tsc_calibration(struct linux_tsc_calibration *tsc_cal)
+{
+   if ( (gtod->vclock_mode == VCLOCK_TSC) && (tsc_cal != ((void*)0UL)) )
+   {
+   tsc_cal -> tsc_khz = gtod->tsc_khz;
+   tsc_cal -> mult= gtod->raw_mult;
+   tsc_cal -> shift   = gtod->raw_shift;
+   return 1;
+   }
+   return 0;
+}
+
+unsigned linux_tsc_calibration(void)
+   __attribute((weak, alias("__vdso_linux_tsc_calibration")));
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S.4.16-rc5-p1   2018-03-12 
00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vdso.lds.S   2018-03-12 
05:18:36.380673342 +
@@ -25,6 +25,8 @@ VERSION {
__vdso_getcpu;
time;
__vdso_time;
+   linux_tsc_calibration;
+   __vdso_linux_tsc_calibration;
local: *;
};
 }
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1 
linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S
--- linux-4.16-rc5/arch/x86/entry/vdso/vdso32/vdso32.lds.S.4.16-rc5-p1  
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5


[PATCH v4.16-rc4 1/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-11 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .
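
  The vsyscall_gtod.c hunk itself is cut off in the quoted diff below; in
  outline - reconstructed from the field names the vDSO code uses, so not
  the literal hunk, and the helper name here is made up - the update side
  copies the raw timekeeper values across inside update_vsyscall():

/* reconstructed outline of the vsyscall_gtod.c side - not the literal hunk */
static void update_vsyscall_raw(struct vsyscall_gtod_data *vdata,
				const struct timekeeper *tk)
{
	vdata->raw_cycle_last          = tk->tkr_raw.cycle_last;
	vdata->raw_mask                = tk->tkr_raw.mask;
	vdata->raw_mult                = tk->tkr_raw.mult;
	vdata->raw_shift               = tk->tkr_raw.shift;
	vdata->monotonic_time_raw_sec  = tk->raw_sec;
	vdata->monotonic_time_raw_nsec = tk->tkr_raw.xtime_nsec;
}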

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
    tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc5 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  The patch affects only files:

    arch/x86/include/asm/vgtod.h
    arch/x86/include/asm/msr.h
    arch/x86/entry/vdso/vclock_gettime.c
    arch/x86/entry/vsyscall/vsyscall_gtod.c


  This is a resend of the original patch fixing issues
  identified by tglx in mail thread of $subject -
  mainly that the rdtscp() assembler wrapper function 
  should be in msr.h - it now is.
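
  The msr.h hunk is not visible in the quoted diff below, but the wrapper
  referred to here is essentially a thin rdtscp helper - a sketch of its
  shape (the exact prototype in the patch may differ):

/* sketch of the rdtscp() helper described above - not the literal hunk */
static __always_inline unsigned long long rdtscp(unsigned int *cpu_out)
{
	unsigned int lo, hi, cpu;

	/* rdtscp waits for prior instructions to retire, unlike plain rdtsc */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (cpu));
	if (cpu_out)
		*cpu_out = cpu;
	return ((unsigned long long)hi << 32) | lo;
}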
  
  There is a second patch following in a few minutes
  which adds a record of the calibrated tsc frequency to the VDSO,
  and a new header:
uapi/asm/vdso_tsc_calibration.h
  which defines a structure :
struct linux_tsc_calibration { u32 tsc_khz, mult, shift ; };
  and a getter function in the VDSO that can optionally be used
  by user-space code to implement sub-nanosecond precision clocks .
  This second patch is entirely optional but I think greatly
  expands the scope of user-space TSC readers .

  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5 
linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc5
2018-03-12 00:25:09.0 +
+++ linux-4.16-rc5/arch/x86/entry/vdso/vclock_gettime.c 2018-03-12 
04:29:27.296982872 +
@@ -182,6 +182,19 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc, last=gtod->raw_cycle_last;
+   if( likely( gtod->has_rdtscp ) )
+   tsc = rdtscp((void*)0);
+else
+   tsc = rdtsc_ordered();
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +216,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +280,27 @@ notrace static int __always_inline do_mo
return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +332,10 @@ n


Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-11 Thread Jason Vas Dias
Thanks Thomas -

On 11/03/2018, Thomas Gleixner <t...@linutronix.de> wrote:
> On Sun, 11 Mar 2018, Jason Vas Dias wrote:
>
> This looks better now. Though running that patch through checkpatch.pl
> results in:
>
> total: 28 errors, 20 warnings, 139 lines checked
>

Hmm, I was unaware of that script, I'll run and find out why -
probably because whitespace is not visible in emacs with
my monospace font and it is very difficult to see if tabs
are used if somehow a '\t\ ' or ' \t' has slipped in .

I'll run the script, fix the errors. and repost.

> 
>
>> +notrace static u64 vread_tsc_raw(void)
>
> Why do you need a separate function? I asked you to use vread_tsc(). So you
> might have reasons for doing that, but please then explain WHY and not just
> throw the stuff in my direction w/o any comment.
>

mainly, because vread_tsc() makes its comparison against gtod->cycles_last ,
a copy of tk->tkr_mono.cycle_last, while vread_tsc_raw() uses
gtod->raw_cycle_last, a copy of tk->tkr_raw.cycle_last .

And rdtscp has a built-in "barrier", as the comments explain, making
rdtsc_ordered()'s 'barrier()' unnecessary .


>> +{
>> +u64 tsc, last=gtod->raw_cycle_last;
>> +if( likely( gtod->has_rdtscp ) ) {
>> +u32 tsc_lo, tsc_hi,
>> +tsc_cpu __attribute__((unused));
>> +asm volatile
>> +( "rdtscp"
>> +/* ^- has built-in cancellation point / pipeline stall
>> "barrier" */
>> +: "=a" (tsc_lo)
>> +, "=d" (tsc_hi)
>> +, "=c" (tsc_cpu)
>> +); // since all variables 32-bit, eax, edx, ecx used -
>> NOT rax, rdx, rcx
>> +tsc  = ((((u64)tsc_hi) & 0xFFFFFFFFUL) << 32) | (((u64)tsc_lo) & 0xFFFFFFFFUL);
>
> This is not required to make the vdso accessor for monotonic raw work.
>
> If at all then the rdtscp support wants to be in a separate patch with a
> proper explanation.
>


> Aside of that the code for rdtscp wants to be in a proper inline helper in
> the relevant header file and written according to the coding style the
> kernel uses for asm inlines.
>

Sorry, I will put the function in the same header as rdtsc_ordered () ,
in a separate patch.

> The rest looks ok.
>
> Thanks,
>
>   tglx
>

I'll re-generate patches and resend .

A complete patch , against 4.15.9, is attached , that I am using ,
including a suggested '__vdso_linux_tsc_calibration()'
function and arch/x86/include/uapi/asm/vdso_tsc_calibration.h file
that does not return any pointers into the VDSO .

Presuming this was split into separate patches as you suggest,
and was against the latest HEAD branch (4.16-rcX), would it be OK to
include the vdso_linux_tsc_calibration() work ?
It does enable user space code to develop accurate TSC readers
which are free to use different structures and pico-second resolution.
The actual user-space clock_gettime(CLOCK_MONOTONIC_RAW)
replacement I am using for work just reads the TSC ,  with a latency of
< 8ns, and uses the linux_tsc_calibration to convert using
floating-point as required.
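
A minimal sketch of the shape of that reader (illustrative names only,
assuming the calibration has been fetched once at startup via the new
getter; this is not the actual replacement code):

/* illustrative: floating-point TSC clock built on linux_tsc_calibration */
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

struct linux_tsc_calibration { uint32_t tsc_khz, mult, shift; };

static double   tsc_ticks_per_sec;	/* cal.tsc_khz * 1000.0 */
static uint64_t tsc_base;		/* TSC captured at startup */

static void raw_clock_init(const struct linux_tsc_calibration *cal)
{
	tsc_ticks_per_sec = (double)cal->tsc_khz * 1000.0;
	tsc_base = __rdtsc();
}

static double raw_clock_seconds(void)
{
	return (double)(__rdtsc() - tsc_base) / tsc_ticks_per_sec;
}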

Thanks & Regards,
Jason


vdso_gettime_monotonic_raw-4.15.9.patch
Description: Binary data




[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-11 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
    tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc4 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  The patch affects only files:

    arch/x86/include/asm/vgtod.h
    arch/x86/entry/vdso/vclock_gettime.c
    arch/x86/entry/vsyscall/vsyscall_gtod.c


  This is a resend of the original patch fixing indentation issues
  after installation of emacs Lisp cc-mode hooks in
  Documentation/coding-style.rst
  and calling 'indent-region' and 'tabify' (whitespace only changes) -
  SORRY !
  (and even after that, somehow 2 '\t\n's got left in vgtod.h -
   now removed - sorry again!) .

  Best Regards,
 Jason Vas Dias  .

  PATCH:
--- 
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 
linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4
2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 
19:00:04.630019100 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc, last=gtod->raw_cycle_last;
+   if( likely( gtod->has_rdtscp ) ) {
+   u32 tsc_lo, tsc_hi,
+   tsc_cpu __attribute__((unused));
+   asm volatile
+   ( "rdtscp"
+   /* ^- has built-in cancellation point / 
pipeline stall"barrier" */
+   :   "=a" (tsc_lo)
+ , "=d" (tsc_hi)
+ , "=c" (tsc_cpu)
+   ); // since all variables 32-bit, eax, edx, ecx used - 
NOT rax, rdx, rcx
+   tsc = ((((u64)tsc_hi) & 0xFFFFFFFFUL) << 32) | (((u64)tsc_lo) & 0xFFFFFFFFUL);
+   } else {
+   tsc = rdtsc_ordered();
+   }
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+ 


[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-11 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.

  Sometimes, particularly when correlating elapsed time to performance
  counter values,  code needs to know elapsed time from the perspective
  of the CPU no matter how "hot" / fast or "cold" / slow it might be
  running wrt NTP / PTP ; when code needs this, the latencies with
  a syscall are often unacceptably high.

  I reported this as Bug #198161 :
'https://bugzilla.kernel.org/show_bug.cgi?id=198961'
  and in previous posts with subjects matching 'CLOCK_MONOTONIC_RAW' .
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
    tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc4 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  The patch affects only files:

    arch/x86/include/asm/vgtod.h
    arch/x86/entry/vdso/vclock_gettime.c
    arch/x86/entry/vsyscall/vsyscall_gtod.c


  This is a resend of the original patch fixing indentation issues
  after installation of emacs Lisp cc-mode hooks in
  Documentation/coding-style.rst
  and calling 'indent-region' and 'tabify' (whitespace only changes) - SORRY ! 

  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 
linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4
2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 
19:00:04.630019100 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+   u64 tsc, last=gtod->raw_cycle_last;
+   if( likely( gtod->has_rdtscp ) ) {
+   u32 tsc_lo, tsc_hi,
+   tsc_cpu __attribute__((unused));
+   asm volatile
+   ( "rdtscp"
+   /* ^- has built-in cancellation point / 
pipeline stall"barrier" */
+   :   "=a" (tsc_lo)
+ , "=d" (tsc_hi)
+ , "=c" (tsc_cpu)
+   ); // since all variables 32-bit, eax, edx, ecx used - 
NOT rax, rdx, rcx
+   tsc = ((((u64)tsc_hi) & 0xFFFFFFFFUL) << 32) | (((u64)tsc_lo) & 0xFFFFFFFFUL);
+   } else {
+   tsc = rdtsc_ordered();
+   }
+   if (likely(tsc >= last))
+   return tsc;
+   asm volatile ("");
+   return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return 


Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer

2018-03-10 Thread Jason Vas Dias
Hi Thomas -

Thanks very much for your help & guidance in previous mail:

RE: On 08/03/2018, Thomas Gleixner <t...@linutronix.de> wrote:
> 
> The right way to do that is to put the raw conversion values and the raw
> seconds base value into the vdso data and implement the counterpart of
> getrawmonotonic64(). And if that is done, then it can be done for _ALL_
> clocksources which support VDSO access and not just for the TSC.
>

I have done this now with a new patch, sent in mail with subject :
  
'[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle 
CLOCK_MONOTONIC_RAW' 

which should address all the concerns you raise.

> I already  know how that works, really.

I never doubted or meant to impugn that !

I am beginning to know a little how that works also, thanks in great
part to your help last week - thanks for your patience.

I was impatient last week to get access to low latency timers for a work
project, and was trying to read the unadjusted clock .

> instead of making completely false claims about the correctness of the kernel
> timekeeping infrastructure.

I really didn't mean to make any such claims - I'm sorry if I did .  I was just 
trying
to say that by the time the results of clock_gettime(CLOCK_MONOTONIC_RAW,) 
were
available to the caller they were not of much use because of the
latencies often dwarfing the time differences .

Anyway, I hope sometime you will consider putting such a patch in the
kernel.

I have developed a version for ARM also, but that depends on making
CNTPCT + CNTFRQ registers readable in user-space, which is not meant
to be secure and is not normally done , but does work - but it is
against the Texas Instruments (ti-linux) kernel and can be enabled
with a new KConfig option, and brings latencies down from > 300ns
to < 20ns . Maybe I should post that also to kernel.org, or to
ti.com ?
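
For context, once EL0 access to the counter registers has been granted as
described, the user-space read on AArch64 reduces to a couple of
system-register reads - a sketch (using the virtual counter CNTVCT_EL0
here; the physical-counter variant mentioned above needs the same kind of
kernel enablement, and this is not the ti-linux patch itself):

/* AArch64 sketch: generic-timer read from user space */
#include <stdint.h>

static inline uint64_t read_cnt(void)
{
	uint64_t cnt;
	asm volatile("mrs %0, cntvct_el0" : "=r" (cnt));
	return cnt;
}

static inline uint64_t read_cntfrq(void)
{
	uint64_t frq;
	asm volatile("mrs %0, cntfrq_el0" : "=r" (frq));
	return frq;
}

/* elapsed ns between two counter samples (fine for short intervals;
 * long intervals need 128-bit or split arithmetic to avoid overflow)
 */
static inline uint64_t cnt_delta_ns(uint64_t start, uint64_t end)
{
	return (end - start) * 1000000000ULL / read_cntfrq();
}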

I have a separate patch for the vdso_tsc_calibration export of the
tsc_khz and calibration which no longer returns pointers into the VDSO -
I can post this as a patch if you like.

Thanks & Best Regards,
Jason Vas Dias <jason.vas.d...@gmail.com>

diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 05:08:31.137681337 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+u64 tsc, last=gtod->raw_cycle_last;
+if( likely( gtod->has_rdtscp ) ) {
+u32 tsc_lo, tsc_hi,
+tsc_cpu __attribute__((unused));
+asm volatile
+( "rdtscp"
+/* ^- has built-in cancellation point / pipeline stall "barrier" */
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx 
+tsc  = u64)tsc_hi) & 0xUL) << 32) | (((u64)tsc_lo) & 0xUL);
+} else {
+tsc  = rdtsc_ordered();
+}
+	if (likely(tsc >= last))
+		return tsc;
+asm volatile ("");
+return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
+		ns += vgetsns_raw();
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, );
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespe

Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer

2018-03-10 Thread Jason Vas Dias
Hi Thomas -

Thanks very much for your help & guidance in previous mail:

RE: On 08/03/2018, Thomas Gleixner  wrote:
> 
> The right way to do that is to put the raw conversion values and the raw
> seconds base value into the vdso data and implement the counterpart of
> getrawmonotonic64(). And if that is done, then it can be done for _ALL_
> clocksources which support VDSO access and not just for the TSC.
>

I have done this now with a new patch, sent in mail with subject :
  
'[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle 
CLOCK_MONOTONIC_RAW' 

which should address all the concerns you raise.

> I already  know how that works, really.

I never doubted or meant to impugn that !

I am beginning to know a little how that works also, thanks in great
part to your help last week - thanks for your patience.

I was impatient last week to get access to low latency timers for a work
project, and was trying to read the unadjusted clock .

> instead of making completely false claims about the correctness of the kernel
> timekeeping infrastructure.

I really didn't mean to make any such claims - I'm sorry if I did .  I was just 
trying
to say that by the time the results of clock_gettime(CLOCK_MONOTONIC_RAW,) 
were
available to the caller they were not of much use because of the
latencies often dwarfing the time differences .

Anyway, I hope sometime you will consider putting such a patch in the
kernel.

I have developed a version for ARM also, but that depends on making
CNTPCT + CNTFRQ registers readable in user-space, which is not meant
to be secure and is not normally done , but does work - but it is
against the Texas Instruments (ti-linux) kernel and can be enabled
with a new KConfig option, and brings latencies down from > 300ns
to < 20ns . Maybe I should post that also to kernel.org, or to
ti.com ?
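
For reference, the user-space side that such a kernel change enables would
look roughly like the sketch below - illustrative only, assuming the kernel
has granted EL0 access to the physical counter via CNTKCTL_EL1.EL0PCTEN
(which mainline does not do by default) and that cntfrq_el0 is programmed:

#include <stdint.h>

static inline uint64_t read_cntpct(void)
{
	uint64_t c;
	/* isb keeps the counter read from being speculated early */
	asm volatile("isb; mrs %0, cntpct_el0" : "=r" (c) :: "memory");
	return c;
}

static inline uint64_t read_cntfrq(void)
{
	uint64_t f;
	asm volatile("mrs %0, cntfrq_el0" : "=r" (f));
	return f;
}

static inline uint64_t counter_ns(uint64_t ticks)
{
	/* split the division so the multiply cannot overflow 64 bits */
	uint64_t f = read_cntfrq();
	return (ticks / f) * 1000000000ULL + ((ticks % f) * 1000000000ULL) / f;
}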

I have a separate patch for the vdso_tsc_calibration export of the
tsc_khz and calibration which no longer returns pointers into the VDSO -
I can post this as a patch if you like.

Thanks & Best Regards,
Jason Vas Dias 

diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4	2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c	2018-03-11 05:08:31.137681337 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+u64 tsc, last=gtod->raw_cycle_last;
+if( likely( gtod->has_rdtscp ) ) {
+u32 tsc_lo, tsc_hi,
+tsc_cpu __attribute__((unused));
+asm volatile
+( "rdtscp"
+/* ^- has built-in cancellation point / pipeline stall "barrier" */
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // since all variables 32-bit, eax, edx, ecx used - NOT rax, rdx, rcx 
+tsc  = ((((u64)tsc_hi) & 0xffffffffUL) << 32) | (((u64)tsc_lo) & 0xffffffffUL);
+} else {
+tsc  = rdtsc_ordered();
+}
+	if (likely(tsc >= last))
+		return tsc;
+asm volatile ("");
+return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
 	u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+	u64 v;
+	cycles_t cycles;
+
+	if (gtod->vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
+	else
+		return 0;
+	v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+	return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
 	return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_raw_sec;
+		ns = gtod->monotonic_time_raw_nsec;
		ns += vgetsns_raw(&mode);
+		ns >>= gtod->raw_shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
@@ -277,6 +342,10 @@ notrace int __

[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-10 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on 2 2.8-3.9ghz Haswell x86_64 Family'_'Model : 06_3C
  machines under various versions of Linux.
 
  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.
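
  A quick way to reproduce such latency figures is to time back-to-back
  clock_gettime() calls and report the smallest delta observed - a rough
  sketch, not the test program used for the numbers above:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
	struct timespec a, b;
	uint64_t min_ns = UINT64_MAX;

	for (int i = 0; i < 1000000; i++) {
		clock_gettime(CLOCK_MONOTONIC_RAW, &a);
		clock_gettime(CLOCK_MONOTONIC_RAW, &b);
		int64_t d = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000
			  + (b.tv_nsec - a.tv_nsec);
		if (d >= 0 && (uint64_t)d < min_ns)
			min_ns = (uint64_t)d;
	}
	printf("minimum observable delta: %llu ns\n",
	       (unsigned long long)min_ns);
	return 0;
}

  Without the patch the syscall fallback dominates and the minimum delta
  sits in the 300-700ns range; with do_monotonic_raw() handled in the vDSO
  it drops to a few tens of nanoseconds.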

  The patch is against Linus' latest 4.16-rc4 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  The patch affects only files:
  
   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c


  Best Regards,
 Jason Vas Dias  .
 
---
diff -up linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 
linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4
2018-03-04 22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 
05:08:31.137681337 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
+notrace static u64 vread_tsc_raw(void)
+{
+u64 tsc, last=gtod->raw_cycle_last;
+if( likely( gtod->has_rdtscp ) ) {
+u32 tsc_lo, tsc_hi,
+tsc_cpu __attribute__((unused));
+asm volatile
+( "rdtscp"
+/* ^- has built-in cancellation point / pipeline stall 
"barrier" */
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // since all variables 32-bit, eax, edx, ecx used - NOT 
rax, rdx, rcx 
+tsc  = ((((u64)tsc_hi) & 0xffffffffUL) << 32) |
(((u64)tsc_lo) & 0xffffffffUL);
+} else {
+tsc  = rdtsc_ordered();
+}
+   if (likely(tsc >= last))
+   return tsc;
+asm volatile ("");
+return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
return mode;
 }
 
+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +342,10 @@ notrace int __vdso_clock_gettime(clockid
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff -up linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 
linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 
2018-03-04 22:54
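
The vsyscall_gtod.c hunk is cut off above; what the update side has to do is
copy the raw timekeeper values into the vsyscall_gtod_data on every
timekeeping update, roughly along the lines of the sketch below (assuming the
4.16 timekeeper layout, tk->tkr_raw and tk->raw_sec - not the author's actual
hunk):

	/* inside update_vsyscall(struct timekeeper *tk), next to the
	 * existing mult/shift/cycle_last updates: */
	vdata->raw_cycle_last		= tk->tkr_raw.cycle_last;
	vdata->raw_mask			= tk->tkr_raw.mask;
	vdata->raw_mult			= tk->tkr_raw.mult;
	vdata->raw_shift		= tk->tkr_raw.shift;
	vdata->monotonic_time_raw_sec	= tk->raw_sec;
	vdata->monotonic_time_raw_nsec	= tk->tkr_raw.xtime_nsec;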

Re: [PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-10 Thread Jason Vas Dias
Oops, please disregard 1st mail on  $subject - I guess use of Quoted Printable
is not a way of getting past the email line length.
Patch I tried to send is attached as attachment - will resend inline using
other method.

Sorry, Regards, Jason


vdso_monotonic_raw-v4.16-rc4.patch
Description: Binary data


[PATCH v4.16-rc4 1/1] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

2018-03-10 Thread Jason Vas Dias

  Currently the VDSO does not handle
 clock_gettime( CLOCK_MONOTONIC_RAW,  )
  on Intel / AMD - it calls
 vdso_fallback_gettime()
  for this clock, which issues a syscall, having an unacceptably high
  latency (minimum measurable time or time between measurements)
  of 300-700ns on two 2.8-3.9GHz Haswell x86_64 Family_Model 06_3C
  machines under various versions of Linux.

  This patch handles CLOCK_MONOTONIC_RAW clock_gettime() in the VDSO ,
  by exporting the raw clock calibration, last cycles, last xtime_nsec,
  and last raw_sec value in the vsyscall_gtod_data during vsyscall_update() .

  Now the new do_monotonic_raw() function in the vDSO has a latency of @ 24ns
  on average, and the test program:
   tools/testing/selftests/timers/inconsistency-check.c
  succeeds with arguments: '-c 4 -t 120' or any arbitrary -t value.

  The patch is against Linus' latest 4.16-rc4 tree,
  current HEAD of :
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  .

  The patch affects only files:

   arch/x86/include/asm/vgtod.h
   arch/x86/entry/vdso/vclock_gettime.c
   arch/x86/entry/vsyscall/vsyscall_gtod.c


  Best Regards,
 Jason Vas Dias  .

---
diff -up 
linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c
--- linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c.4.16-rc4 2018-03-04 
22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vdso/vclock_gettime.c 2018-03-11 
05:08:31.137681337 +
@@ -182,6 +182,29 @@ notrace static u64 vread_tsc(void)
return last;
 }

+notrace static u64 vread_tsc_raw(void)
+{
+u64 tsc, last=gtod->raw_cycle_last;
+if( likely( gtod->has_rdtscp ) ) {
+u32 tsc_lo, tsc_hi,
+tsc_cpu __attribute__((unused));
+asm volatile
+( "rdtscp"
+/* ^- has built-in cancellation point / pipeline 
stall"barrier" */
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // since all variables 32-bit, eax, edx, ecx used - NOT 
rax, rdx, rcx
+tsc  = ((((u64)tsc_hi) & 0xffffffffUL) << 32) |
(((u64)tsc_lo) & 0xffffffffUL);
+} else {
+tsc  = rdtsc_ordered();
+}
+   if (likely(tsc >= last))
+   return tsc;
+asm volatile ("");
+return last;
+}
+
 notrace static inline u64 vgetsns(int *mode)
 {
u64 v;
@@ -203,6 +226,27 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }

+notrace static inline u64 vgetsns_raw(int *mode)
+{
+   u64 v;
+   cycles_t cycles;
+
+   if (gtod->vclock_mode == VCLOCK_TSC)
+   cycles = vread_tsc_raw();
+#ifdef CONFIG_PARAVIRT_CLOCK
+   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
+#endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
+   else
+   return 0;
+   v = (cycles - gtod->raw_cycle_last) & gtod->raw_mask;
+   return v * gtod->raw_mult;
+}
+
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
@@ -246,6 +290,27 @@ notrace static int __always_inline do_mo
return mode;
 }

+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+
+   do {
+   seq = gtod_read_begin(gtod);
+   mode = gtod->vclock_mode;
+   ts->tv_sec = gtod->monotonic_time_raw_sec;
+   ns = gtod->monotonic_time_raw_nsec;
+   ns += vgetsns_raw(&mode);
+   ns >>= gtod->raw_shift;
+   } while (unlikely(gtod_read_retry(gtod, seq)));
+
+   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   ts->tv_nsec = ns;
+
+   return mode;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +342,10 @@ notrace int __vdso_clock_gettime(clockid
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff -up linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 
linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c
--- linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c.4.16-rc4 2018-03-04 
22:54:11.0 +
+++ linux-4.16-rc4/arch/x86/entry/vsyscall/vsyscall_gtod.c 2018-03

Re: Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer

2018-03-08 Thread Jason Vas Dias
On 08/03/2018, Thomas Gleixner  wrote:
> On Tue, 6 Mar 2018, Jason Vas Dias wrote:
>> I will prepare a new patch that meets submission + coding style guidelines
>> and
>> does not expose pointers within the vsyscall_gtod_data region to
>> user-space code -
>> but I don't really understand why not, since only the gtod->mult value
>> will
>> change as long as the clocksource remains TSC, and updates to it by the
>> kernel
>> are atomic and partial values cannot be read .
>>
>> The code in the patch reverts to old behavior for clocks which are not
>> the
>> TSC and provides a way for users to determine if the  clock is still the
>> TSC
>> by calling '__vdso_linux_tsc_calibration()', which would return NULL if
>> the clock is not the TSC .
>>
>> I have never seen Linux on a modern intel box spontaneously decide to
>> switch from the TSC clocksource after calibration succeeds and
>> it has decided to use the TSC as the system / platform clock source -
>> what would make it do this ?
>>
>> But for the highly controlled systems I am doing performance testing on,
>> I can guarantee that the clocksource does not change.
>
> We are not writing code for a particular highly controlled system. We
> expose functionality which operates under all circumstances. There are
> various reasons why TSC can be disabled at runtime, crappy BIOS/SMI,
> sockets getting out of sync .
>
>> There is no way user code can write those pointers or do anything other
>> than read them, so I see no harm in exposing them to user-space ; then
>> user-space programs can issue rdtscp and use the same calibration values
>> as the kernel, and use some cached 'previous timespec value' to avoid
>> doing the long division every time.
>>
>> If the shift & mult are not accurate TSC calibration values, then the
>> kernel should put other more accurate calibration values in the gtod .
>
> The raw calibration values are as accurate as the kernel can make them. But
> they can be rather far off from converting to real nanoseconds for various
> reasons. The NTP/PTP adjusted conversion is matching real units and is
> obviously more accurate.
>
>> > Please look at the kernel side implementation of
>> > clock_gettime(CLOCK_MONOTONIC_RAW).
>> > The VDSO side can be implemented in the
>> > same way.
>> > All what is required is to expose the relevant information in the
>> > existing vsyscall_gtod_data data structure.
>>
>> I agree - that is my point entirely , & what I was trying to do .
>
> Well, you did not expose the raw conversion data in vsyscall_gtod_data. You
> are using:
>
> + tsc*= gtod->mult;
> + tsc   >>= gtod->shift;
>
> That's is the adjusted mult/shift value which can change when NTP/PTP is
> enabled and you _cannot_ use it unprotected.
>
>> void getrawmonotonic64(struct timespec64 *ts)
>> {
>>  struct timekeeper *tk = &tk_core.timekeeper;
>>  unsigned long seq;
>>  u64 nsecs;
>>
>>  do {
>>  seq = read_seqcount_begin(&tk_core.seq);
>> #   ^-- I think this is the source of the locking
>> #and the very long latencies !
>
> This protects tk->raw_sec from changing which would result in random time
> stamps. Yes, it can cause slightly larger latencies when the timekeeper is
> updated on another CPU concurrently, but that's not the main reason why
> this is slower in general than the VDSO functions. The syscall overhead is
> there for every invocation and it's substantial.
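
The protection being discussed here is a sequence counter: the writer makes
the count odd while it updates the values and even again afterwards, and
readers retry whenever they saw an odd count or a count that changed under
them. A user-space flavoured sketch of the reader side (the pattern only,
not the kernel's implementation):

#include <stdint.h>

struct clock_data {
	unsigned int seq;		/* odd while an update is in progress */
	uint64_t mult, shift, base_ns, cycle_last;
};

static uint64_t read_consistent_ns(const volatile struct clock_data *d,
				   uint64_t cycles)
{
	unsigned int seq;
	uint64_t mult, shift, base, last;

	do {
		seq = d->seq;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		mult = d->mult;
		shift = d->shift;
		base = d->base_ns;
		last = d->cycle_last;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while ((seq & 1) || seq != d->seq);

	return base + (((cycles - last) * mult) >> shift);
}
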
>
>> So in fact, when the clock source is TSC, the value recorded in 'ts'
>> by clock_gettime(CLOCK_MONOTONIC_RAW, ) is very similar to
>>   u64 tsc = rdtscp();
>>   tsc *= gtod->mult;
>>   tsc >>= gtod->shift;
>>   ts.tv_sec=tsc / NSEC_PER_SEC;
>>   ts.tv_nsec=tsc % NSEC_PER_SEC;
>>
>> which is the algorithm I was using in the VDSO fast TSC reader,
>> do_monotonic_raw() .
>
> Except that you are using the adjusted conversion values and not the raw
> ones. So your VDSO implementation of monotonic raw access is just wrong and
> not matching the syscall based implementation in any way.
>
>> The problem with doing anything more in the VDSO is that there
>> is of course nowhere in the VDSO to store any data, as it has
>> no data section or writable pages . So some kind of writable
>> page would need to be added to the vdso , complicating its
>> vdso/vma.c, etc., which is not desirable.
>
> No, you don't need any writeable memo

[PATCH v4.15.7 1/1] x86/vdso: handle clock_gettime(CLOCK_MONOTONIC_RAW, ) in VDSO

2018-03-06 Thread Jason Vas Dias
Handling clock_gettime( CLOCK_MONOTONIC_RAW, )
by calling  vdso_fallback_gettime(),  ie. syscall,  is too slow  -
latencies of  300-700ns are common on Haswell (06:3C)  CPUs .

This patch against the 4.15.7 stable branch makes the VDSO handle
clock_gettime(CLOCK_MONOTONIC_RAW, )
by issuing rdtscp in userspace,  IFF the clock source is the TSC, and converting
it to nanoseconds using the vsyscall_gtod_data 'mult' and 'shift' fields :

  volatile u32 tsc_lo, tsc_hi, tsc_cpu;
  asm volatile( "rdtscp" : (=a) tsc_lo, (=d) tsc_hi, (=c) tsc_cpu );
  u64 tsc = (((u64)tsc_hi)<<32) | ((u64)tsc_lo);
  tsc *= gtod->mult;
  tsc >>=gtod->shift;
  /* tsc is now number of nanoseconds */
  ts->tv_sec = __iter_div_u64_rem( tsc, NSEC_PER_SEC, &ts->tv_nsec);

Use of the "open coded asm" style here actually forces the compiler to
always choose the 32-bit version of rdtscp, which sets only %eax,
%edx, and %ecx and does not clear the high bits of %rax, %rdx, and
%rdx , because the
variables are declared 32-bit  - so the same 32-bit version is used whether
the code is compiled with -m32 or -m64 ( tested using gcc 5.4.0, gcc 6.4.1 ) .

The full story and test programs are in Bug #198961 :
https://bugzilla.kernel.org/show_bug.cgi?id=198961
.

The patched VDSO now handles clock_gettime(CLOCK_MONOTONIC_RAW, )
on the same machine with a latency (minimum time that can be measured)
of
around 100ns (compared with 300-700ns before patch).

I also think it makes sense to expose pointers to the live, updated
gtod->mult and gtod->shift values somehow to userspace . Then
a userspace TSC reader could re-use previous values to avoid
the long-division in most cases and obtain latencies of 10-20ns .

Hence there is now a new method in the VDSO:
   __vdso_linux_tsc_calibration()
which returns a pointer to a 'struct linux_tsc_calibration'
declared in a new header
   arch/x86/include/uapi/asm/vdso_tsc_calibration.h

If the clock source is NOT the TSC, this function returns NULL .
The pointer is only valid when the system clock source is the TSC .
User-space TSC readers can detect when TSC is modified with Events,
and now can detect when clock source changes from / to TSC with
this function .
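
The intended user-space consumer looks roughly like the sketch below. It
assumes the returned structure simply exposes the mult/shift pair (as the
&gtod->mult return in the patch below implies), and that get_cal is the
function pointer resolved from the vDSO, e.g. with vdso_sym("LINUX_2.6",
"__vdso_linux_tsc_calibration") from tools/testing/selftests/vDSO/parse_vdso.c:

#include <stdint.h>

struct linux_tsc_calibration { uint32_t mult; uint32_t shift; };

typedef const struct linux_tsc_calibration *(*tsc_cal_fn)(void);

static inline uint64_t rdtscp_u64(void)
{
	uint32_t lo, hi, cpu;
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (cpu));
	return ((uint64_t)hi << 32) | lo;
}

static uint64_t tsc_now_ns(tsc_cal_fn get_cal)
{
	const struct linux_tsc_calibration *c = get_cal();
	if (!c)
		return 0;	/* clocksource is not the TSC */
	/*
	 * Same caveat as noted elsewhere in the thread: the 64-bit multiply
	 * leaves few significant seconds bits, so real code should work on
	 * deltas rather than the absolute TSC value.
	 */
	return (rdtscp_u64() * c->mult) >> c->shift;
}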

The patch :

---
diff --git a/arch/x86/entry/vdso/vclock_gettime.c \
b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d..e840600 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 

 #define gtod (&VVAR(vsyscall_gtod_data))

@@ -246,6 +247,29 @@ notrace static int __always_inline do_monotonic\
(struct timespec *ts)
return mode;
 }

+notrace static int __always_inline do_monotonic_raw( struct timespec *ts)
+{
+volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs
generated for 64-bit as for 32-bit builds
+u64 ns;
+register u64 tsc=0;
+if (gtod->vclock_mode == VCLOCK_TSC)
+{
+asm volatile
+( "rdtscp"
+: "=a" (tsc_lo)
+, "=d" (tsc_hi)
+, "=c" (tsc_cpu)
+); // : eax, edx, ecx used - NOT rax, rdx, rcx
+tsc = ((((u64)tsc_hi) & 0xffffffffUL) << 32) |
(((u64)tsc_lo) & 0xffffffffUL);
+tsc*= gtod->mult;
+tsc   >>= gtod->shift;
+ts->tv_sec  = __iter_div_u64_rem(tsc, NSEC_PER_SEC,
&ns);
+ts->tv_nsec = ns;
+return VCLOCK_TSC;
+}
+return VCLOCK_NONE;
+}
+
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
unsigned long seq;
@@ -277,6 +301,10 @@ notrace int __vdso_clock_gettime(clockid_t clock,
struct timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+   case CLOCK_MONOTONIC_RAW:
+   if (do_monotonic_raw(ts) == VCLOCK_NONE)
+   goto fallback;
+   break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
@@ -326,3 +354,18 @@ notrace time_t __vdso_time(time_t *t)
 }
 time_t time(time_t *t)
__attribute__((weak, alias("__vdso_time")));
+
+extern const struct linux_tsc_calibration *
+__vdso_linux_tsc_calibration(void);
+
+notrace  const struct linux_tsc_calibration *
+  __vdso_linux_tsc_calibration(void)
+{
+if( gtod->vclock_mode == VCLOCK_TSC )
+return ((const struct linux_tsc_calibration*) &gtod->mult);
+return 0UL;
+}
+
+const struct linux_tsc_calibration * linux_tsc_calibration(void)
+__attribute__((weak, alias("__vdso_linux_tsc_calibration")));
+
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index d3a2dce..41a2ca5 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -24,7 +24,9 @@ VERSION {
getcpu;
__vdso_getcpu;
time;
-   

Fwd: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer

2018-03-06 Thread Jason Vas Dias
On 06/03/2018, Thomas Gleixner  wrote:
> Jason,
>
> On Mon, 5 Mar 2018, Jason Vas Dias wrote:
>
> thanks for providing this. A few formal nits first.
>
> Please read Documentation/process/submitting-patches.rst
>
> Patches need a concise subject line and the subject line wants a prefix, in
> this case 'x86/vdso'.
>
> Please don't put anything past the patch. Your delimiters are human
> readable, but cannot be handled by tools.
>
> Also please follow the kernel coding style guide lines.
>
>> It also provides a new function in the VDSO :
>>
>> struct linux_timestamp_conversion
>> { u32 mult;
>> u32 shift;
>> };
>> extern
>> const struct linux_timestamp_conversion *
>> __vdso_linux_tsc_calibration(void);
>>
>> which can be used by user-space rdtsc / rdtscp issuers
>> by using code such as in
>> tools/testing/selftests/vDSO/parse_vdso.c
>> to call vdso_sym("LINUX_2.6", "__vdso_linux_tsc_calibration"),
>> which returns a pointer to the function in the VDSO, which
>> returns the address of the 'mult' field in the vsyscall_gtod_data.
>
> No, that's just wrong. The VDSO data is solely there for the VDSO accessor
> functions and not to be exposed to random user space.
>
>> Thus user-space programs can use rdtscp and interpret its return values
>> in exactly the same way the kernel would, but without entering the
>> kernel.
>
> The VDSO clock_gettime() functions are providing exactly this mechanism.
>
>> As pointed out in Bug # 198961 :
>> https://bugzilla.kernel.org/show_bug.cgi?id=198961
>> which contains extra test programs and the full story behind this
>> change,
>> using CLOCK_MONOTONIC_RAW without the patch results in
>> a minimum measurable time (latency) of @ 300 - 700ns because of
>> the syscall used by vdso_fallback_gtod() .
>>
>> With the patch, the latency falls to @ 100ns .
>>
>> The latency would be @ 16 - 32 ns if the do_monotonic_raw()
>> handler could record its previous TSC value and seconds return value
>> somewhere, but since the VDSO has no data region or writable page,
>> of course it cannot .
>
> And even if it could, it's not as simple as you want it to be. Clocksources
> can change during runtime and without effective protection the values are
> just garbage.
>
>> Hence, to enable effective use of TSC by user space programs, Linux must
>> provide a way for them to discover the calibration mult and shift values
>> the kernel uses for the clock source ; only by doing so can user-space
>> get values that are comparable to kernel generated values.
>
> Linux must not do anything. It can provide a vdso implementation of
> CLOCK_MONOTONIC_RAW, which does not enter the kernel, but not exposure to
> data which is not reliably accessible by random user space code.
>
>> And I'd really like to know: why does the gtod->mult value change ?
>> After TSC calibration, it and the shift are calculated to render the
>> best approximation of a nanoseconds value from the TSC value.
>>
>> The TSC is MEANT to be monotonic and to continue in sleep states
>> on modern Intel CPUs . So why does the gtod->mult change ?
>
> You are missing the fact that gtod->mult/shift are used for CLOCK_MONOTONIC
> and CLOCK_REALTIME, which are adjusted by NTP/PTP to provide network
> synchronized time. That means CLOCK_MONOTONIC is providing accurate
> and slope compensated nanoseconds.
>
> The raw TSC conversion, even if it is sane hardware, provides just some
> approximation of nanoseconds which can be off by quite a margin.
>
>> But the mult value does change. Currently there is no way for user-space
>> programs to discover that such a change has occurred, or when . With this
>> very tiny simple patch, they could know instantly when such changes
>> occur, and could implement TSC readers that perform the full conversion
>> with latencies of 15-30ns (on my CPU).
>
> No. Accessing the mult/shift pair without protection is racy and can lead
> to completely erratic results.
>
>> +notrace static int __always_inline do_monotonic_raw( struct timespec
>> *ts)
>> +{
>> + volatile u32 tsc_lo=0, tsc_hi=0, tsc_cpu=0; // so same instrs
>> generated for 64-bit as for 32-bit builds
>> + u64 ns;
>> + register u64 tsc=0;
>> + if (gtod->vclock_mode == VCLOCK_TSC)
>> + { asm volatile
>> + ( "rdtscp"
>> + : "=a" (tsc_lo)
>> + , "=d" (tsc_hi)
>> + , "=c" (tsc_cpu)
>> + ); // : eax, edx, ecx used - NOT rax, rdx, rcx
>
> If you look at the existing VDSO time getters the

[PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW and export 'tsc_calibration' pointer

2018-03-04 Thread Jason Vas Dias
 "Total time: %1.1llu.%9.9lluS - Average Latency: %1.1llu.%9.9lluS\n",
  t1/10,   t1-((t1/10)*10),
  avg_ns/10,   avg_ns-((avg_ns/10)*10)
  );
  return 0;
}

: END EXAMPLE

EXAMPLE Usage :
$ gcc -std=gnu11 -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0x7ffdb9be5098: mult: 5798705 shift: 24
sum: 
Total time: 0.04859S - Average Latency: 0.00022S

Latencies are typically @ 15 - 30 ns .

That multiplication and shift really doesn't leave very many
significant seconds bits!

Please, can the VDSO include some similar functionality to NOT always
enter the kernel for CLOCK_MONOTONIC_RAW , and to export a pointer to
the LIVE (kernel updated) gtod->mult and gtod->shift values somehow .

The documentation states for CLOCK_MONOTONIC_RAW that it is the
same as CLOCK_MONOTONIC except it is NOT subject to NTP adjustments .
This is very far from the case currently, without a patch like the one above.

And the kernel should not restrict user-space programs to only being able
to either measure an NTP adjusted time value, or a time value
difference of greater
than 1000ns with any accuracy, on a modern Intel CPU whose TSC ticks 2.8 times
per nanosecond (picosecond resolution is theoretically possible).

Please, include something like the above patch in future Linux versions.

Thanks & Best Regards,
Jason Vas Dias <jason.vas.d...@gmail.com>


Re: perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .

2018-02-13 Thread Jason Vas Dias
On 13/02/2018, Jason Vas Dias <jason.vas.d...@gmail.com> wrote:
> Good day -
>
> I'd much appreciate some advice as to why, on my Intel x86_64
> ( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10,
> or Linux 3.10.0, any attempt to count all of :
>  PERF_COUNT_HW_BRANCH_INSTRUCTIONS
>   (or raw config 0xC4) , and
>  PERF_COUNT_HW_BRANCH_MISSES
>   (or raw config 0xC5), and
>  combined with
>  PERF_COUNT_HW_CACHE_REFERENCES
>  (or raw config 0x4F2E ), and
>  PERF_COUNT_HW_CACHE_MISSES
>  (or raw config 0x412E) ,
> results in ALL COUNTERS BEING 0 in a read of the Group FD or
> mmap sample area.
>
> This is demonstrated by the example program, which will
> use perf_event_open() to create a Group Leader FD  for the first event,
> and associate all other events with that Event Group , so that it
> will read all events on the group FD .
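
The group setup described above boils down to the following minimal sketch
(hypothetical helper names, not the poster's perf_bug.c): open the leader
with group_fd = -1, pass the leader's fd as group_fd for every other event,
and read them all back from the leader with PERF_FORMAT_GROUP | PERF_FORMAT_ID:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int perf_open(uint64_t config, int group_fd)
{
	struct perf_event_attr a;
	memset(&a, 0, sizeof(a));
	a.size = sizeof(a);
	a.type = PERF_TYPE_HARDWARE;
	a.config = config;
	a.disabled = (group_fd == -1);	/* only the leader starts disabled */
	a.exclude_kernel = 1;
	a.exclude_hv = 1;
	a.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
	return syscall(__NR_perf_event_open, &a, 0 /* this task */,
		       -1 /* any cpu */, group_fd, 0);
}

int main(void)
{
	int leader = perf_open(PERF_COUNT_HW_BRANCH_INSTRUCTIONS, -1);
	int member = perf_open(PERF_COUNT_HW_CACHE_REFERENCES, leader);

	ioctl(leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
	ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
	/* ... code under test ... */
	ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

	/* read layout: nr, then { value, id } per event (no TIME_* flags set) */
	uint64_t buf[1 + 2 * 2];
	read(leader, buf, sizeof(buf));
	printf("branches: %llu  cache refs: %llu\n",
	       (unsigned long long)buf[1], (unsigned long long)buf[3]);
	close(member);
	close(leader);
	return 0;
}
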
>
> The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, )
> calls all return successfully , but if I combine ANY of
> ( PERF_COUNT_HW_BRANCH_INSTRUCTIONS,
>   PERF_COUNT_HW_BRANCH_MISSES
> ) with any of
> ( PERF_COUNT_HW_CACHE_REFERENCES,
>   PERF_COUNT_HW_CACHE_MISSES
> ) in the Event Group, ALL events have '0' event->value.
>
> Demo :
> 1. Compile program to use kernel mapped Generic Events:
>   $ gcc -std=gnu11 -o perf_bug perf_bug.c
>   Running program shows all counters have 0 values, since both
>   CACHE & BRANCH hits+misses are being requested:
>
>   $ ./perf_bug
>   EVENT: Branch Instructions : 0
>   EVENT: Branch Misses : 0
>   EVENT: Instructions : 0
>   EVENT: CPU Cycles : 0
>   EVENT: Ref. CPU Cycles : 0
>   EVENT: Bus Cycles : 0
>   EVENT: Cache References : 0
>   EVENT: Cache Misses : 0
>
>   NOT registering interest in EITHER the BRANCH counters
>   OR the CACHE counters fixes the problem:
>
>   Compile without registering for BRANCH_INSTRUCTIONS
>   or BRANCH_MISSES:
>   $ gcc -std=gnu11 -DNO_BUG_NO_BRANCH  -o perf_bug perf_bug.c
>   $ ./perf_bug
>   EVENT: Instructions : 914
>   EVENT: CPU Cycles : 4110
>   EVENT: Ref. CPU Cycles : 4437
>   EVENT: Bus Cycles : 152
>   EVENT: Cache References : 1
>   EVENT: Cache Misses : 1
>
>   Compile without registering for CACHE_REFERENCES or CACHE_MISSES:
>   $ gcc -std=gnu11 -DNO_BUG_NO_CACHE  -o perf_bug perf_bug.c
>   $ ./perf_bug
> EVENT: Branch Instructions : 106
> EVENT: Branch Misses : 6
> EVENT: Instructions : 914
> EVENT: CPU Cycles : 4132
> EVENT: Ref. CPU Cycles : 8526
> EVENT: Bus Cycles : 295
>
> The same thing happens if I do not use Generic Events, but rather
> "dynamic raw PMU" events, by putting the hex values from
> /sys/bus/event_source/devices/cpu/events/? into the perf_event_attr
> config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr
> type value :
>
> $ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c
> $ ./perf_bug
> EVENT: Branch Instructions : 0
> EVENT: Branch Misses : 0
> EVENT: Instructions : 0
> EVENT: CPU Cycles : 0
> EVENT: Ref. CPU Cycles : 0
> EVENT: Bus Cycles : 0
> EVENT: Cache References : 0
> EVENT: Cache Misses : 0
>
>
> $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c
> $ ./perf_bug
> EVENT: Instructions : 914
> EVENT: CPU Cycles : 4102
> EVENT: Ref. CPU Cycles : 4959
> EVENT: Bus Cycles : 171
> EVENT: Cache References : 2
> EVENT: Cache Misses : 2
>
> $ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c
> $ ./perf_bug
> EVENT: Branch Instructions : 106
> EVENT: Branch Misses : 6
> EVENT: Instructions : 914
> EVENT: CPU Cycles : 4108
> EVENT: Ref. CPU Cycles : 10817
> EVENT: Bus Cycles : 373
>
>
> The perf tool itself seems to have the same issue:
>
> With CACHE & BRANCH counters does not work :
> $ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>r0c4
>(0.00%)
>r0c5
>(0.00%)
>r0c0
>(0.00%)
>r03c
>(0.00%)
>r0300
>(0.00%)
>r013c
>(0.00%)
>r04F2E
>(0.00%)
> r0412E
>
>1.001652932 seconds time elapsed
>
>Some events weren't counted. Try disabling the NMI watchdog:
>   echo 0 > /proc/sys/kernel/nmi_watchdog
>   perf stat ...
>   echo 1 > /proc/sys/kernel/nmi_watchdog
>
> Disabling the NMI watchdog makes no difference .
>
> It is very strange that perf thinks 'r0412E' is not supported :
>$ cat 

perf Intel x86_64 : BUG: BRANCH_INSTRUCTIONS / BRANCH_MISSES cannot be combined with CACHE_REFERENCES / CACHE_MISSES .

2018-02-13 Thread Jason Vas Dias
Good day -

I'd much appreciate some advice as to why, on my Intel x86_64
( DisplayFamily_DisplayModel : 06_3CH ), running either Linux 4.12.10,
or Linux 3.10.0, any attempt to count all of :
 PERF_COUNT_HW_BRANCH_INSTRUCTIONS
  (or raw config 0xC4) , and
 PERF_COUNT_HW_BRANCH_MISSES
  (or raw config 0xC5), and
 combined with
 PERF_COUNT_HW_CACHE_REFERENCES
 (or raw config 0x4F2E ), and
 PERF_COUNT_HW_CACHE_MISSES
 (or raw config 0x412E) ,
results in ALL COUNTERS BEING 0 in a read of the Group FD or
mmap sample area.

This is demonstrated by the example program, which will
use perf_event_open() to create a Group Leader FD  for the first event,
and associate all other events with that Event Group , so that it
will read all events on the group FD .

The perf_event_open() calls and the ioctl(event_fd, PERF_EVENT_IOC_ID, )
calls all return successfully , but if I combine ANY of
( PERF_COUNT_HW_BRANCH_INSTRUCTIONS,
  PERF_COUNT_HW_BRANCH_MISSES
) with any of
( PERF_COUNT_HW_CACHE_REFERENCES,
  PERF_COUNT_HW_CACHE_MISSES
) in the Event Group, ALL events have '0' event->value.
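
A minimal sketch of that group-FD setup (this is not the original perf_bug.c; the
two-event choice, the pid/cpu arguments and the error handling are just
illustrative) - the leader is created with group_fd == -1, the member joins the
leader's group, and one read() on the leader returns every counter in the group:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags)
{
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open one PERF_TYPE_HARDWARE counter; group_fd == -1 creates the group leader. */
static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.disabled = (group_fd == -1);      /* only the leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    attr.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
    return perf_event_open(&attr, 0 /* this task */, -1 /* any CPU */, group_fd, 0);
}

int main(void)
{
    int leader = open_counter(PERF_COUNT_HW_BRANCH_INSTRUCTIONS, -1);
    int member = open_counter(PERF_COUNT_HW_BRANCH_MISSES, leader);
    uint64_t leader_id = 0, member_id = 0;
    if (leader < 0 || member < 0) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(leader, PERF_EVENT_IOC_ID, &leader_id);
    ioctl(member, PERF_EVENT_IOC_ID, &member_id);
    ioctl(leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    /* ... code under test would run here ... */
    ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    /* PERF_FORMAT_GROUP|PERF_FORMAT_ID read layout: nr, then {value,id} pairs */
    struct { uint64_t nr; struct { uint64_t value, id; } v[2]; } data;
    if (read(leader, &data, sizeof(data)) > 0)
        for (uint64_t i = 0; i < data.nr; i++)
            printf("%s : %llu\n",
                   data.v[i].id == leader_id ? "Branch Instructions" :
                   data.v[i].id == member_id ? "Branch Misses" : "?",
                   (unsigned long long)data.v[i].value);
    return 0;
}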

Demo :
1. Compile program to use kernel mapped Generic Events:
  $ gcc -std=gnu11 -o perf_bug perf_bug.c
  Running program shows all counters have 0 values, since both
  CACHE & BRANCH hits+misses are being requested:

  $ ./perf_bug
  EVENT: Branch Instructions : 0
  EVENT: Branch Misses : 0
  EVENT: Instructions : 0
  EVENT: CPU Cycles : 0
  EVENT: Ref. CPU Cycles : 0
  EVENT: Bus Cycles : 0
  EVENT: Cache References : 0
  EVENT: Cache Misses : 0

  NOT registering interest in EITHER the BRANCH counters
  OR the CACHE counters fixes the problem:

  Compile without registering for BRANCH_INSTRUCTIONS
  or BRANCH_MISSES:
  $ gcc -std=gnu11 -DNO_BUG_NO_BRANCH  -o perf_bug perf_bug.c
  $ ./perf_bug
  EVENT: Instructions : 914
  EVENT: CPU Cycles : 4110
  EVENT: Ref. CPU Cycles : 4437
  EVENT: Bus Cycles : 152
  EVENT: Cache References : 1
  EVENT: Cache Misses : 1

  Compile without registering for CACHE_REFERENCES or CACHE_MISSES:
  $ gcc -std=gnu11 -DNO_BUG_NO_CACHE  -o perf_bug perf_bug.c
  $ ./perf_bug
EVENT: Branch Instructions : 106
EVENT: Branch Misses : 6
EVENT: Instructions : 914
EVENT: CPU Cycles : 4132
EVENT: Ref. CPU Cycles : 8526
EVENT: Bus Cycles : 295

The same thing happens if I do not use Generic Events, but rather
"dynamic raw PMU" events, by putting the hex values from
/sys/bus/event_source/devices/cpu/events/? into the perf_event_attr
config, OR'ed with (1<<63), and using the PERF_TYPE_RAW perf_event_attr
type value :

$ gcc -DUSE_RAW_PMU -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 0
EVENT: Branch Misses : 0
EVENT: Instructions : 0
EVENT: CPU Cycles : 0
EVENT: Ref. CPU Cycles : 0
EVENT: Bus Cycles : 0
EVENT: Cache References : 0
EVENT: Cache Misses : 0


$ gcc -DUSE_RAW_PMU -DNO_BUG_NO_BRANCH -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Instructions : 914
EVENT: CPU Cycles : 4102
EVENT: Ref. CPU Cycles : 4959
EVENT: Bus Cycles : 171
EVENT: Cache References : 2
EVENT: Cache Misses : 2

$ gcc -DUSE_RAW_PMU -DNO_BUG_NO_CACHE -o perf_bug perf_bug.c
$ ./perf_bug
EVENT: Branch Instructions : 106
EVENT: Branch Misses : 6
EVENT: Instructions : 914
EVENT: CPU Cycles : 4108
EVENT: Ref. CPU Cycles : 10817
EVENT: Bus Cycles : 373
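
For reference, 0x412E / 0x4F2E are the usual x86 raw encodings, (umask << 8) | event,
so 0x412E is event 0x2e with umask 0x41 (LLC misses) and 0x4F2E is the LLC
references encoding. A sketch of the corresponding attr setup - the additional
(1<<63) OR described above is the test program's own convention and is not shown
here:

#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Same fields as the generic-event sketch above, but selecting the counter
 * through PERF_TYPE_RAW with the SDM event|umask encoding. */
static void raw_counter_attr(struct perf_event_attr *attr, uint64_t raw_config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_RAW;
    attr->config = raw_config;        /* e.g. 0x412E: event 0x2e, umask 0x41 */
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
    attr->read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
}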


The perf tool itself seems to have the same issue:

With CACHE & BRANCH counters does not work :
$ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

   r0c4
   (0.00%)
   r0c5
   (0.00%)
   r0c0
   (0.00%)
   r03c
   (0.00%)
   r0300
   (0.00%)
   r013c
   (0.00%)
   r04F2E
   (0.00%)
r0412E

   1.001652932 seconds time elapsed

   Some events weren't counted. Try disabling the NMI watchdog:
echo 0 > /proc/sys/kernel/nmi_watchdog
perf stat ...
echo 1 > /proc/sys/kernel/nmi_watchdog

Disabling the NMI watchdog makes no difference .

It is very strange that perf thinks 'r0412E' is not supported :
   $ cat /sys/bus/event_source/devices/cpu/cache_misses
   event=0x2e,umask=0x41

The kernel should not be advertizing an unsupported event
in a  /sys/bus/event_source/devices/cpu/events/ file, should it ?

So perf stat has the same problem - without either the Cache or the Branch
counters it seems to work fine:

without cache:
$ perf stat -e '{r0c4,r0c5,r0c0,r03c,r0300,r013c}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

 37740  r0c4
  3557  r0c5
188552  r0c0
311684  r03c
360963  r0300
 12461  r013c

   1.001508109 seconds time elapsed

without branch:
$ perf stat -e '{r0c0,r03c,r0300,r013c,r04F2E,r0412E}:SIu' sleep 1

 Performance counter stats for 'sleep 1':

188554  r0c0

Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609

2017-02-23 Thread Jason Vas Dias
I have found a new source of weirdness with  TSC  using
clock_gettime(CLOCK_MONOTONIC_RAW,) :

The vsyscall_gtod_data.mult field changes somewhat between
calls to clock_gettime(CLOCK_MONOTONIC_RAW,),
so that sometimes an extra (2^24) nanoseconds are added or
removed from  the value derived from the TSC and stored in 'ts' .

This is demonstrated by the output of the test program in the
attached ttsc.tar  file:
$ ./tlgtd
it worked! - GTOD: clock:1 mult:5798662 shift:24
synced - mult now: 5798661

What it is doing is finding the address of the 'vsyscall_gtod_data' structure
from /proc/kallsyms, and mapping the virtual address to an ELF section
offset within /proc/kcore, and reading just the 'vsyscall_gtod_data' structure
into user-space memory .
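
A hedged sketch of that lookup: the symbol address comes from /proc/kallsyms, and
the bytes are then read out of the PT_LOAD segment of /proc/kcore that covers it.
It needs root and kptr_restrict=0, and since the vsyscall_gtod_data layout changes
between kernel versions, only raw bytes are dumped here:

#include <elf.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

static uint64_t kallsyms_lookup(const char *name)
{
    FILE *f = fopen("/proc/kallsyms", "r");
    char line[256];
    uint64_t addr = 0;
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        uint64_t a;
        char type, sym[128];
        if (sscanf(line, "%" SCNx64 " %c %127s", &a, &type, sym) == 3
            && !strcmp(sym, name)) {
            addr = a;
            break;
        }
    }
    fclose(f);
    return addr;
}

/* Read 'len' bytes at kernel virtual address 'vaddr' out of /proc/kcore. */
static int kcore_read(uint64_t vaddr, void *buf, size_t len)
{
    FILE *f = fopen("/proc/kcore", "r");
    Elf64_Ehdr eh;
    int ret = -1;
    if (!f)
        return -1;
    if (fread(&eh, sizeof(eh), 1, f) == 1) {
        for (unsigned i = 0; i < eh.e_phnum && ret < 0; i++) {
            Elf64_Phdr ph;
            if (fseek(f, (long)(eh.e_phoff + (uint64_t)i * eh.e_phentsize), SEEK_SET)
                || fread(&ph, sizeof(ph), 1, f) != 1)
                break;
            if (ph.p_type == PT_LOAD && vaddr >= ph.p_vaddr
                && vaddr + len <= ph.p_vaddr + ph.p_memsz
                && !fseek(f, (long)(ph.p_offset + (vaddr - ph.p_vaddr)), SEEK_SET)
                && fread(buf, len, 1, f) == 1)
                ret = 0;
        }
    }
    fclose(f);
    return ret;
}

int main(void)
{
    uint64_t addr = kallsyms_lookup("vsyscall_gtod_data");
    unsigned char bytes[64];
    if (!addr || kcore_read(addr, bytes, sizeof(bytes))) {
        fprintf(stderr, "lookup failed (run as root, kptr_restrict=0)\n");
        return 1;
    }
    printf("vsyscall_gtod_data @ %#" PRIx64 " :", addr);
    for (int i = 0; i < 16; i++)
        printf(" %02x", bytes[i]);
    printf("\n");
    return 0;
}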

Really, this 'mult' value, which is used to return the
seconds|nanoseconds value:
( tsc_cycles * mult ) >> shift
(where shift is 24 ), should not change from the first time it is initialized .
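
A worked example of that conversion with the mult/shift pair printed by tlgtd
above (note the kernel applies it to the TSC delta since the last timekeeping
update, not to the absolute counter value):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t mult = 5798662, shift = 24;  /* values from the tlgtd output above */
    /* 2^24 / 5798662 ~= 2.8933 cycles per ns, i.e. the 2.893299 GHz TSC
     * frequency reported elsewhere in this thread. */
    uint64_t tsc_cycles = 2893299000ULL;        /* roughly one second of TSC ticks */
    uint64_t ns = (tsc_cycles * mult) >> shift;
    printf("%llu cycles -> %llu ns\n",
           (unsigned long long)tsc_cycles, (unsigned long long)ns);
    return 0;
}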

The TSC is meant to be FIXED FREQUENCY, right ?
So how could  /  why should the conversion function from TSC ticks to
nanoseconds change ?

So now it is doubly difficult for user-space libraries to keep their
RDTSC-derived seconds|nanoseconds values correlated with those returned by
the kernel, because they must regularly re-read the updated 'mult' value
used by the kernel.

I really don't think the kernel should randomly be deciding to
increase / decrease
the TSC tick period by 2^24 nanoseconds!

Is this a bug or intentional ? I am searching for all places where a
'[.>]mult.*=' occurs, but this returns rather a lot of matches.

Please could a future version of Linux at least export the 'mult' and
'shift' values for the current clocksource!

Regards,
Jason








On 22/02/2017, Jason Vas Dias <jason.vas.d...@gmail.com> wrote:
> OK, last post on this issue today -
> can anyone explain why, with standard 4.10.0 kernel & no new
> 'notsc_adjust' option, and the same maths being used, these two runs
> should display
> such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,)
> values ? :
>
> $ J/pub/ttsc/ttsc1
> max_extended_leaf: 8008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
> ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.00641 ns2: 0.02850
> ts3 - ts2: 175 ns1: 0.00659
> ts3 - ts2: 18 ns1: 0.00643
> ts3 - ts2: 18 ns1: 0.00618
> ts3 - ts2: 17 ns1: 0.00620
> ts3 - ts2: 17 ns1: 0.00616
> ts3 - ts2: 18 ns1: 0.00641
> ts3 - ts2: 18 ns1: 0.00709
> ts3 - ts2: 20 ns1: 0.00763
> ts3 - ts2: 20 ns1: 0.00735
> ts3 - ts2: 20 ns1: 0.00761
> t1 - t0: 78200 - ns2: 0.80824
> $ J/pub/ttsc/ttsc1
> max_extended_leaf: 8008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
> ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.01294 ns2: 0.05375
> ts3 - ts2: 210 ns1: 0.01418
> ts3 - ts2: 23 ns1: 0.01399
> ts3 - ts2: 22 ns1: 0.01445
> ts3 - ts2: 25 ns1: 0.01321
> ts3 - ts2: 20 ns1: 0.01428
> ts3 - ts2: 25 ns1: 0.01367
> ts3 - ts2: 23 ns1: 0.01425
> ts3 - ts2: 23 ns1: 0.01357
> ts3 - ts2: 22 ns1: 0.01487
> ts3 - ts2: 25 ns1: 0.01377
> t1 - t0: 145753 - ns2: 0.000150781
>
> (complete source of test program ttsc1 attached in ttsc.tar
>  $ tar -xpf ttsc.tar
>  $ cd ttsc
>  $ make
> ).
>
> On 22/02/2017, Jason Vas Dias <jason.vas.d...@gmail.com> wrote:
>> I actually tried adding a 'notsc_adjust' kernel option to disable any
>> setting or
>> access to the TSC_ADJUST MSR, but then I see the problems  - a big
>> disparity
>> in values depending on which CPU the thread is scheduled -  and no
>> improvement in clock_gettime() latency.  So I don't think the new
>> TSC_ADJUST
>> code in ts_sync.c itself is the issue - but something added @ 460ns
>> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 .
>> As I don't think fixing the clock_gettime() latency issue is my problem
>> or
>> even
>> possible with current clock architecture approach, it is a non-issue.
>>
>> But please, can anyone tell me if are there any plans to move the time
>> infrastructure  out of the kernel and into glibc along the lines
>> outlined
>> in previous mail - if not, I am going to concentrate on this more radical
>> overhaul approach for my own systems .
>>
>> At least, I think mapping the clocksource information structure itself in
>> some
>> kind of sharable page makes sense . Processes could map that page
>> copy-on-write
>> so they could start off with all the timing parameters preloaded,  then
>> keep
>> their copy updated using the rdtscp instruction , or msync() (read-only)
>> with the kernel's single copy to get the latest time any proces
