It's worth noting that you can get rid of a /lot/ of the variance on a
modern Linux box:
1) Set the CPU to run at the same speed at all times (generally "max
performance", but which way you do it doesn't really matter -- steps
1-3 are sketched in code a little further down)
2) Set processor masks so that no processes other than your timing code
run on a core of your choice. On hyperthreaded processors, make sure
nothing is scheduled on the other 'half' of that core.
3) Set your interrupt affinities so that no interrupts are delivered to
that core
4) Make sure your timing code fits in the L1 cache
5) When possible, make sure you don't conditionally branch. That means
instead of doing something like this:
while (1) {
    if (x < y) {
        continue;           /* not time yet, keep spinning */
    } else {
        *hw_reg = 1;        /* write to hardware */
    }
}
You do something more like:
while (1) {
    /* no conditional branch: compare, conditionally move 1 or 0,
       and store to the hardware register on every single pass */
    *hw_reg = (x >= y) ? 1 : 0;
}
(and if possible, write to memory-mapped hardware pages, rather than
making calls into the kernel)
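As a concrete (if hypothetical) version of that, here's a sketch
assuming the device exposes its register page through something like
the UIO driver -- the /dev/uio0 path and the register offset are made
up for illustration:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* map one page of (hypothetical) device registers so stores
           go straight to the hardware, not through the kernel */
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0)
            return 1;
        volatile uint32_t *regs = mmap(NULL, 4096,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED)
            return 1;

        uint64_t x = 0, y = 1000000;
        for (;;) {
            x++;                    /* stand-in for reading a clock */
            /* branchless select -- gcc/clang normally turn this into
               a compare plus cmov, and the store happens on every
               pass, so the write latency is the same either way
               (check the generated assembly to be sure) */
            regs[0] = (x >= y) ? 1u : 0u;
        }
    }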
This guarantees both a) that the latency writing to hardware is
consistent every loop pass (though hardware-induced jitter isn't), and
b) that there are no branch mispredicts, because there are no
conditional branches -- conditional move instructions take a constant
time to execute (plus or minus memory access latency).
This basically removes the entire kernel, any other processes, and
shared CPU resources from the picture, except for those times when you
have no choice but to touch the memory bus and such. Otherwise, your
code will just sit there on its own core doing its own thing, nothing
will interrupt it, and most sources of unknown jitter are removed.
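To make steps 1-3 concrete, here's a minimal sketch of the userspace
side. It assumes core 3 has already been fenced off from the scheduler
with isolcpus=3 on the kernel command line; the sysfs/procfs paths and
the 8-CPU IRQ mask are placeholders for whatever your box actually has:

    /* sketch: claim an isolated core for timing code (run as root) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define TIMING_CORE 3        /* assumed isolated via isolcpus=3 */

    int main(void)
    {
        FILE *f;

        /* step 1: pin the clock speed (path varies by driver) */
        f = fopen("/sys/devices/system/cpu/cpu3/cpufreq/"
                  "scaling_governor", "w");
        if (f) { fputs("performance\n", f); fclose(f); }

        /* step 3: steer newly allocated IRQs away from core 3;
           already-allocated IRQs need the same treatment via
           /proc/irq/<n>/smp_affinity */
        f = fopen("/proc/irq/default_smp_affinity", "w");
        if (f) { fputs("f7\n", f); fclose(f); } /* 0xff minus core 3 */

        /* step 2: pin this process onto the isolated core */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(TIMING_CORE, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        /* helps with step 4: keep pages resident so the timing
           loop never takes a fault */
        mlockall(MCL_CURRENT | MCL_FUTURE);

        /* ... timing loop goes here ... */
        return 0;
    }

Note that sched_setaffinity() only pins this one process; it's the
isolcpus fence (with the HT sibling also kept out of every other mask)
that keeps everything else off the core.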
(It's not perfect, but it's probably the closest you'll get on a PC
without specialized hardware. Though I _do_ wonder what could be done
with something like the Intel i210AT chips on the apu2 boards, which
can do hardware PPS out and hardware event timestamping...)
-j
On 4/11/2018 4:01 PM, Hal Murray wrote:
[email protected] said:
Except that's not the way most timers run. The silicon needed to get a
programmable divider to work at 2.4 GHz is expensive. If you dig into the
hardware descriptions, the clock tree feeds something much slower to the
"top end" of the typical timer in a CPU or MCU. The exception is the high
perf timers in some of the Intel chips. There the issue is getting them to
relate to anything "outside" the chip.
I think I got started in this area back in the early DEC Alpha days. They
had a register that counted raw clock cycles. Simple. I got stuck thinking
that was the obvious/clean way to do things.
Many thanks for giving me a poke to go learn more about this area.
That was back before battery operation was as interesting as it is today. I
suspect power is more likely the critical factor. Half the power goes into
the low order bit (each bit of a counter toggles half as often as the one
below it, so bit 0 alone accounts for half the toggling), so counting by 4
every 4th cycle rather than by 1 every cycle freezes the bottom two bits,
runs the rest at a quarter rate, and saves 3/4 of the power.
That may be what the kernel does, but it implements the result as a drop /
add to a counter.
If the source of time is a register counting CPU clock ticks, and the CPU
clock (2 or 3 GHz) is faster than the resolution of the clock (1 ns) it will
be hard to see any drop/add. However, if the time register is significantly
slower, then the drop/add is easy to spot. But all that is lost in the noise
of cache misses and such.
Here is a histogram from an Intel Atom running at 1.6 GHz.
First pass, using rpcc.
cycles     Hits
    24    86932
    36   904825
    48     8011
    60      122
    72        1
   144       11
   ...
So it looks like the cycle counter gets bumped by 12. That's a strange
number. I suspect it's tangled up with changing the clock speed to save
power. There are conflicting interests in this area. If you want to keep
time, you need a register that ticks at a constant rate as you change speed.
If you are doing performance analysis, you want a register that counts cycles
at whatever speed the CPU is running. Or maybe I'm confused.
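(For reference, a delta histogram like the one above can be collected
with something along these lines -- a sketch using the x86 __rdtsc()
intrinsic, rpcc's x86 counterpart, not the exact program that produced
these numbers:)

    #include <stdio.h>
    #include <x86intrin.h>           /* __rdtsc() on gcc/clang */

    #define SAMPLES  1000000
    #define MAXDELTA 1024

    int main(void)
    {
        static long hist[MAXDELTA];
        unsigned long long prev = __rdtsc();

        for (long i = 0; i < SAMPLES; i++) {
            unsigned long long now = __rdtsc();
            unsigned long long d = now - prev; /* cycles between reads */
            prev = now;
            if (d < MAXDELTA)
                hist[d]++;
        }
        for (int d = 0; d < MAXDELTA; d++)
            if (hist[d])
                printf("%6d %8ld\n", d, hist[d]);
        return 0;
    }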
Second pass, using clock_gettime.
 nSec     Hits
  698        2
  768        5
  769        2
  838        3
  908        2
  977        1
  978        3
 1047   237102
 1048   383246
 1117   204072
 1118   172490
 1187      275
 1188      135
 1257      263
 1258       47
 1326        7
 1327      216
  ...
The clock seems to be ticking in 70ns steps. That doesn't match 12 clock
cycles so I assume they are using something else.
From another system:
Second pass, using clock_gettime.
nSec     Hits
  19    45693
  20   347538
  21   591129
  22    15284
  23       63
  24       34
  25       32
  ...
Note that this is 50 times faster than the previous example.
I haven't figured out the kernel and library software for reading the clock.
There is a special path for some functions like reading the clock that avoids
the overhead of getting in/out of the kernel. I assume there is some shared
memory.
https://en.wikipedia.org/wiki/VDSO
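One way to see that fast path from userspace is to time the ordinary
libc call against a forced trip through the real syscall entry point;
a sketch:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define N 1000000

    static long long ns(const struct timespec *t)
    {
        return t->tv_sec * 1000000000LL + t->tv_nsec;
    }

    int main(void)
    {
        struct timespec a, b, ts;

        /* normal path: libc usually lands in the vDSO, which reads
           kernel-maintained time data out of a shared page with no
           mode switch */
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < N; i++)
            clock_gettime(CLOCK_MONOTONIC, &ts);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("libc/vDSO  : %lld ns/call\n", (ns(&b) - ns(&a)) / N);

        /* forced path: always enters the kernel */
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < N; i++)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("raw syscall: %lld ns/call\n", (ns(&b) - ns(&a)) / N);
        return 0;
    }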
Again, thanks Bob.
TICC arrived today.
_______________________________________________
time-nuts mailing list -- [email protected]
To unsubscribe, go to https://www.febo.com/cgi-bin/mailman/listinfo/time-nuts
and follow the instructions there.