It's worth noting that you can get rid of a /lot/ of the variance on a modern Linux box:

1) Set the CPU to run at the same speed at all times (generally "max performance", but which way you do it doesn't really matter)
2) Set processor affinity masks so that no processes other than your timing code run on a core of your choice (see the sched_setaffinity() sketch below). On hyperthreaded processors, make sure nothing is scheduled on the other 'half' of that core.
3) Set IRQ affinities so that no interrupts are delivered to that core
4) Make sure your timing code fits in the L1 cache
5) When possible, make sure you don't conditionally branch. That last point means that instead of doing something like this:

while (1) {
    if (x < y)
        continue;       /* conditional branch -- can mispredict */
    *hw_reg = 1;        /* write to the hardware register */
}

You do something more like:

while (1) {
    /* compare x to y and store the result unconditionally: this compiles
       to a compare plus a setcc/cmov and a plain store, so there is no
       conditional branch to mispredict */
    *hw_reg = (x > y);
}

(and if possible, write to memory-mapped hardware pages, rather than making calls into the kernel)
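A minimal sketch of the memory-mapped route, assuming a UIO-style driver exposes the device's register page (the /dev/uio0 node, page size, and register layout here are placeholders, not any particular device):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a (hypothetical) device's register page into user space once,
   so the loop above can poke it with plain stores instead of
   making syscalls. */
static volatile uint32_t *map_hw_reg(void)
{
    int fd = open("/dev/uio0", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}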

This guarantees both a) that the latency of the write to hardware is consistent on every loop pass (though hardware-induced jitter isn't), and b) that there are no branch mispredicts, because there are no conditional branches -- conditional move instructions take a constant time to execute (plus or minus memory access latency).

This basically removes the entire kernel from the picture, any other processes from the picture, and shared CPU resources from the picture, except for those times that you have no choice but to access the memory bus and such. Otherwise, your code will just sit there on its own core doing its own thing and nothing will interrupt it and most sources of unknown jitter are removed.
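The userspace half of the core pinning in item 2 is small. A sketch, assuming core 3 has been reserved (isolcpus=3 on the kernel command line handles the scheduler side, and the masks in /proc/irq/*/smp_affinity handle item 3):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one core; everything else has been
   steered away from that core via isolcpus and IRQ affinity masks. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* 0 = this thread */
}

Call pin_to_core(3) once before entering the timing loop.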

(It's not perfect, but it's probably the closest you'll get on a PC without specialized hardware. Though I _do_ wonder what could be done with something like the Intel i210AT chips on something like the apu2 boards, which can do hardware PPS out and hardware event timestamping...)

-j

On 4/11/2018 4:01 PM, Hal Murray wrote:
kb...@n1k.org said:
Except that's not the way most timers run.  The silicon needed to get a
programmable divider to work at 2.4 GHz is expensive.  If you dig into the
hardware descriptions, the clock tree feeds something much slower to the
"top end" of the typical timer in a CPU or MCU.  The exception is the high
perf timers in some of the Intel chips.  There the issue is getting them to
relate to anything "outside" the chip.

I think I got started in this area back in the early DEC Alpha days.  They
had a register that counted raw clock cycles.  Simple.  I got stuck thinking
that was the obvious/clean way to do things.

Many thanks for giving me a poke to go learn more about this area.

That was back before battery operation was as interesting as it is today.  I
suspect power is now the more likely critical factor.  Bit n of a counter
toggles every 2^n cycles, so half of all the toggling (and thus roughly half
the power) goes into the low order bit.  Counting by 4 every 4th cycle rather
than by 1 every cycle drops bits 0 and 1 entirely, saving 3/4 of the power.


That may be what the kernel does, but it implements the result as a drop/add
to a counter.
If the source of time is a register counting CPU clock ticks, and the CPU
clock (2 or 3 GHz) is faster than the resolution of the clock (1 ns) it will
be hard to see any drop/add.  However, if the time register is significantly
slower, then the drop/add is easy to spot.  But all that is lost in the noise
of cache misses and such.

Here is a histogram from an Intel Atom running at 1.6 GHz.

First pass, using rpcc.
     cycles      Hits
         24     86932
         36    904825
         48      8011
         60       122
         72         1
        144        11
...
So it looks like the cycle counter gets bumped by 12.  That's a strange
number.  I suspect it's tangled up with changing the clock speed to save
power.  There are conflicting interests in this area.  If you want to keep
time, you need a register that ticks at a constant rate as you change speed.
If you are doing performance analysis, you want a register that counts cycles
at whatever speed the CPU is running.  Or maybe I'm confused.
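
(On x86 the equivalent of Alpha's rpcc is a raw TSC read.  A back-to-back
read that shows the step size might look like this -- the __rdtsc()
intrinsic is GCC/Clang's, and the details are guesses, not the code that
produced the table above:)

#include <stdio.h>
#include <x86intrin.h>

/* Back-to-back TSC reads.  On the Atom above, the deltas all come
   out as multiples of 12. */
int main(void)
{
    for (int i = 0; i < 10; i++) {
        unsigned long long a = __rdtsc();
        unsigned long long b = __rdtsc();
        printf("delta = %llu cycles\n", b - a);
    }
    return 0;
}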

Second pass, using clock_gettime.
       nSec      Hits
        698         2
        768         5
        769         2
        838         3
        908         2
        977         1
        978         3
       1047    237102
       1048    383246
       1117    204072
       1118    172490
       1187       275
       1188       135
       1257       263
       1258        47
       1326         7
       1327       216
...
The clock seems to be ticking in 70ns steps.  That doesn't match 12 clock
cycles so I assume they are using something else.

From another system:
Second pass, using clock_gettime.
       nSec      Hits
         19     45693
         20    347538
         21    591129
         22     15284
         23        63
         24        34
         25        32
...
Note that this is 50 times faster than the previous example.
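
The loop behind these tables is presumably just back-to-back clock reads
with the deltas histogrammed.  A minimal sketch, assuming that shape (the
bucket range and output format are made up):

#include <stdio.h>
#include <time.h>

/* Read CLOCK_MONOTONIC twice in a row, histogram the nanosecond
   deltas, and print the nonzero buckets. */
int main(void)
{
    enum { N = 1000000, MAXNS = 4096 };
    static long hist[MAXNS];
    struct timespec a, b;

    for (long i = 0; i < N; i++) {
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        long d = (b.tv_sec - a.tv_sec) * 1000000000L
               + (b.tv_nsec - a.tv_nsec);
        if (d >= 0 && d < MAXNS)
            hist[d]++;
    }
    for (long d = 0; d < MAXNS; d++)
        if (hist[d])
            printf("%10ld %10ld\n", d, hist[d]);
    return 0;
}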

I haven't figured out the kernel and library software for reading the clock.
There is a special path for some functions like reading the clock that avoids
the overhead of getting in/out of the kernel.  I assume there is some shared
memory.
   https://en.wikipedia.org/wiki/VDSO

Again, thanks Bob.

TICC arrived today.




_______________________________________________
time-nuts mailing list -- time-nuts@febo.com
To unsubscribe, go to https://www.febo.com/cgi-bin/mailman/listinfo/time-nuts
and follow the instructions there.

