Answering each of your questions in one mail, with a few notes at the end. First of all, this is just a wild idea I had on the train the other day, and I haven't written any code for it. Then:
On 28/03/2017 at 18:01, Mouse wrote:
(1) Please provide a kernel build option to remove the restriction. [...]
My original plan was to use two sysctls, as suggested by Manuel: one to enable/disable the feature, another to log the segfaults.

On 28/03/2017 at 18:01, Mouse wrote:
(2) Does that actually help, or does it just compel the attacker to use cruder timers and thus longer test runs? (Or is that enough difference that you believe it would actually help in practice?)
It does help, and that's the conclusion of most papers. There is, however, another technique (a software clock) that tries to estimate the number of cycles an operation takes, but the resulting accuracy is very low, and not sufficient to detect cache misses via latency.

On 28/03/2017 at 18:30, David Young wrote:
Why do you single out the rdtsc instruction instead of other time sources?
Because of accuracy. As the papers point out, detecting cache misses requires a precision of roughly 50 cycles, and only rdtsc offers it. Syscalls and other software-based timers have a non-deterministic overhead larger than ~50 cycles, which therefore pollutes the relevant information.

On 28/03/2017 at 18:30, David Young wrote:
What do you mean by "legitimately" use rdtsc? It seems to me that it is legitimate for a user to use a high-resolution timer to profile some code that's under development. They may want to avoid running that code with root privileges under most circumstances.
Just as you said: some users need to profile code they are developing. They may indeed want to avoid running their tests with root privileges, and that's where the sysctl is useful - they can disable the feature if they want to.

A few notes now. In fact, the rdpmc instruction can also be used for side-channel attacks, but we don't currently enable it, so it does not matter.

Regarding serialization, I may not have been clear enough either. rdtsc is not serializing, which means that it does not wait for the previous instructions to execute completely before being executed. To compensate for that, the user first executes a serializing instruction like cpuid, and puts the rdtsc right after it. With the fault approach, serialization is ensured, because 'iret' is used when returning to userland, and 'iret' is serializing. So we get an 'iret+rdtsc', which has the same effect as 'cpuid+rdtsc'.

Also, a detail about my remark on accuracy. The basic use case for rdtsc is the following:

    start = rdtsc
    work
    end = rdtsc
    elapsed = end - start

Here, we will fault on the first rdtsc, so the kernel will be entered, and many cycles will be consumed there. But it does not matter, since the first rdtsc is the starting point, and we don't care about adding cycles before it. Therefore, the number of elapsed cycles is the same with and without the feature.

Finally, I'll add that there are other mitigations available on rdtsc, which consist for example in adding a small random delta directly to the counter, in order to fuzz the results. But then comes the question of how big this delta needs to be: big enough to mitigate side channels, small enough to still give relevant - yet slightly inaccurate - information back to userland.