Colm MacCárthaigh <c...@allcosts.net> writes:

>since it's important not to leak timing, we use a special
>"constant_time_copy_or_dont" routine to do the over-write:
>
>https://github.com/awslabs/s2n/blob/master/utils/s2n_safety.c#L70
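
For anyone who hasn't run into the idiom, routines like this are usually
built by expanding the one-bit do/don't flag into an all-zeroes or all-ones
mask and selecting through it, so that both cases execute the same
instruction sequence.  A from-memory sketch of the general idea (not s2n's
actual code, the names are mine):

  #include <stddef.h>
  #include <stdint.h>

  /* Copy src to dest if dont is 0, leave dest unchanged if dont is 1,
     performing the same memory accesses and arithmetic either way */
  void ct_copy_or_dont( uint8_t *dest, const uint8_t *src, size_t len,
                        uint8_t dont )
      {
      const uint8_t mask = ( uint8_t ) ( 0 - ( dont & 1 ) );
                                  /* dont = 0 -> 0x00, 1 -> 0xFF */
      size_t i;

      for( i = 0; i < len; i++ )
          dest[ i ] = ( src[ i ] & ~mask ) | ( dest[ i ] & mask );
      }

Even at a handful of lines it takes a moment's thought to convince yourself
that the mask logic is right, which is relevant to the auditability point
below.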
Has anyone actually been able to measure the impact of issues like this? I
know about papers like "Remote timing attacks are practical" that work with
things like Montgomery modmult, but can much be practically done with a few
instructions here and there, not repeated in a loop like a modmult?

The reason I ask is a combination of two things: firstly, I've done timing
measurements on my own code to try and detect differences in behaviour and
couldn't find anything among the general noise of code execution, and
secondly, any constant-time code is a tradeoff between
readability/auditability and (near-)constant-time operation ("near" because
you have no control over what a compiler will do to optimise your code, so
it may not be anywhere near constant-time once it's in binary form).

Something like s2n_constant_time_copy_or_dont() is barely passable, but I've
seen constant-time PKCS #1 decoding and similar that's essentially
incomprehensible: there's no way to look at it and see that it's correct.
Even something as simple as the fix described in
https://blog.cloudflare.com/yet-another-padding-oracle-in-openssl-cbc-ciphersuites/
had errors, because the constant-time code is so difficult to audit.

>Next, when the handshake does fail, we do two non-standard things. The first
>is that we don't return an alert message, we just close the connection.

That's actually really annoying, because it implies something other than a
TLS crypto problem, e.g. a misconfigured firewall or who knows what. I don't
know how diversely-used s2n is, in other words whether you can assume that
the clients and servers employed will know how to react to this, but in
general use it'll make getting things working a major pain, since it changes
the error indicator from "the TLS handshake failed due to crypto issues" to
"something failed somewhere, probably at the network level".

>The second non-standard thing we do is that in all error cases, s2n behaves
>as if something suspicious is going on and in case timing is involved, we add
>a random delay. It's well known that random delays are only partially
>effective against timing attacks, but we add a very very big one. We wait a
>random amount of time between a minimum of 10 seconds, and a maximum of 30
>seconds. With the delay being granular to nanoseconds.

I do the same thing, but I use 1s rather than 10 (although I've just bumped
this to 3 in the latest code):

  /* Delay by a small random amount, used to dither the results of
     (failed) crypto operations to make timing attacks harder.

     As with anything involving randomness, there's been a lot of smoke
     and not much light generated over the use of random delays to
     address timing attacks.  The generic answer is that it doesn't
     help, so there's no point in doing it and everyone should just
     write constant-time code, another prime example of le mieux est
     l'ennemi du bien (the best is the enemy of the good).  If the
     arguments ever get beyond abstract theory (theoretical constant-
     time code is better than theoretical random delays), the argument
     given is usually that we're measuring timing differences in us or
     ns while the delay functions are typically in ms.  Another
     argument is that all an attacker has to do is repeat the
     measurements in order to filter out the delays.

     The counterargument for the former is that if you've got an
     attacker sitting on your local LAN segment making ns-scale timing
     measurements on your crypto then you've got bigger things to worry
     about than side-channel attacks (over more distant connections,
     you can at best get tens to hundreds of us using thousands of
     measurements, see e.g. "Opportunities and Limits of Remote Timing
     Attacks" by Crosby, Reid, and Wallach, which also required
     multiple days of data-gathering beforehand to characterise the
     network).  In addition, in cryptlib's case, since all operations
     go via the kernel, you don't have direct (timing) access to the
     low-level crypto ops but only get strongly dithered results at
     that level.  As a result, while you can still use repeated
     measurements to eventually filter out the dithering, it's now made
     the attacker's job much, much harder.  Since it's essentially free
     (it's only invoked if an operation fails), it makes sense to add
     the delay.

     Another issue is how long to make the delay.  Obviously the longer
     the better; however, if we make it too long then it can end up
     looking like a different problem, e.g. a networking issue rather
     than a crypto defence, which will be a pain to diagnose for users.
     3s seems to be a good tradeoff between dithering operations and
     slowing down an attacker while not adding an excessive delay that
     looks like it's caused by something else */

Again, it's for ease of deployment: a 1s (or 3s) delay won't get noticed,
while a 10s delay looks like a network problem and 30s looks like a major
issue somewhere.
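
For what it's worth, the mechanism itself is only a few lines either way.  A
rough sketch (not s2n's or cryptlib's actual code, random_bytes() is a
stand-in for whatever CSPRNG you've got), here with s2n's 10s-30s range:

  #include <stddef.h>
  #include <stdint.h>
  #include <time.h>

  extern void random_bytes( void *buf, size_t len );  /* Your CSPRNG here */

  #define NS_PER_SEC    1000000000ULL
  #define MIN_DELAY_NS  ( 10 * NS_PER_SEC )   /* 10s...30s as in s2n, scale */
  #define MAX_DELAY_NS  ( 30 * NS_PER_SEC )   /* down for the approach above */

  /* Sleep for a uniformly random interval in [ MIN_DELAY_NS,
     MAX_DELAY_NS ), granular to nanoseconds.  Call on any failed
     crypto operation before returning the error */
  static void random_delay( void )
      {
      uint64_t delayNs;
      struct timespec ts;

      /* The modulo bias is irrelevant here since we're dithering
         timing, not generating keys */
      random_bytes( &delayNs, sizeof( delayNs ) );
      delayNs = MIN_DELAY_NS + ( delayNs % ( MAX_DELAY_NS - MIN_DELAY_NS ) );

      ts.tv_sec = ( time_t ) ( delayNs / NS_PER_SEC );
      ts.tv_nsec = ( long ) ( delayNs % NS_PER_SEC );
      ( void ) nanosleep( &ts, NULL );    /* POSIX; EINTR retry omitted */
      }

Since it's only ever called on a failure it costs nothing in the normal
case, and the only knob that matters is the range, which is exactly the
deployment tradeoff above.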
>The solutions here typically are not small routines that are rigorously
>constant time, but instead rely on a degree of code-balancing: taking code
>paths that are "similar enough" in timing so as not to be measurable.

Yup, that's what I try and do.

>trying to be rigorously constant time on large complicated algorithms that
>were not designed to be constant time can seriously backfire by introducing
>logic errors in the very hard-to-follow constant-time adaptations.

You don't even need large complicated algorithms, you can create logic
errors in a few lines of code, as your cited example shows. That's why I
prefer obvious correctness (or probably-correctness :-) and defences in
other places.

Peter.