Re: Thread sanitiser problems
The locks aren’t being held for long. I have noticed that the lock in the OPENSSL_CTX only seems to be locked for reading and only in one place. crypto/context.c in openssl_ctx_get_data(). Writes seem to go via CRYPTO_alloc_ex_data() which uses a global write lock (on a different lock). Would it be feasible to drop the OPENSSL_CTX lock completely? This would address the issue. Matt? I think you know this code the best. CRYPTO_alloc_ex_data() is calling the new_func pointer under the read lock and the new function can in turn cause a search of the providers for implementations which grabs the provider lock. Conversely, searching the providers for implementations grabs the provider lock and can lead to openssl_ctx_get_data() grabbing the context lock. We don’t appear to have recursion at the moment, but the setup seems a little fragile. We do have two different ordering for grabbing locks which is also bad. Pauli -- Dr Paul Dale | Cryptographer | Network Security & Encryption Phone +61 7 3031 7217 Oracle Australia > On 31 Jul 2019, at 2:10 pm, Viktor Dukhovni > wrote: > >> On Jul 30, 2019, at 10:02 PM, Dr Paul Dale wrote: >> >> The #9454 description includes thread sanitisizer logs showing different >> lock orderings — this has the potential to dead lock. Agreed with Rich that >> giving up the lock would make sense, but I don’t see a way for this to be >> easily done. > > My take is that we should never hold any lock long enough to even consider > acquiring another lock. No more than one lock should be held at any one > time, and only long enough to bump a reference count to ensure that the > object of interest is no deallocated before we (or our caller in a "get1" > type interface) is done with it. > > I don't know what "long-term" locks we're holding, but it would be great > if it were possible to never (or never recursively) hold any such locks. > > -- > Viktor. >
Re: Thread sanitiser problems
> On Jul 30, 2019, at 10:02 PM, Dr Paul Dale wrote: > > The #9454 description includes thread sanitisizer logs showing different lock > orderings — this has the potential to dead lock. Agreed with Rich that > giving up the lock would make sense, but I don’t see a way for this to be > easily done. My take is that we should never hold any lock long enough to even consider acquiring another lock. No more than one lock should be held at any one time, and only long enough to bump a reference count to ensure that the object of interest is no deallocated before we (or our caller in a "get1" type interface) is done with it. I don't know what "long-term" locks we're holding, but it would be great if it were possible to never (or never recursively) hold any such locks. -- Viktor.
Re: Thread sanitiser problems
Yes, I’m mostly talking about #9454 here. #9455 is a bug (clearing the flush flag after flushing not before). The fix in #9477 addresses this and also removes the dependence on RAND_bytes. The #9454 description includes thread sanitisizer logs showing different lock orderings — this has the potential to dead lock. Agreed with Rich that giving up the lock would make sense, but I don’t see a way for this to be easily done. That this also involves the RAND subsystem could be coincidental. The other stack trace is getting an MD via a digest operation (i.e. both traces are for algorithms that use other algorithms). The locks in question are the provider store lock and the context lock. crypto/context.c: OPENSSL_CTX *ctx; /* function argument */ CRYPTO_THREAD_read_lock(ctx->oncelock); crypto/provider_core.c: OPENSSL_CTX *ctx; /* function argument */ struct provider_store_st *store = get_provider_store(ctx); CRYPTO_THREAD_read_lock(store->lock); Pauli -- Dr Paul Dale | Cryptographer | Network Security & Encryption Phone +61 7 3031 7217 Oracle Australia > On 30 Jul 2019, at 8:52 pm, Matthias St. Pierre > wrote: > > Sorry, my reply was misleading, since Pauli is talking mainly about #9454. > Please take a look at the issue description > > https://github.com/openssl/openssl/issues/9454 > > instead. > > Matthias > > > > On 30.07.19 12:47, Matthias St. Pierre wrote: >> >> >> On 30.07.19 12:43, Kurt Roeckx wrote: >>> >>> I currently fail to see how that's a problem, unless that >>> EVP_CIPHER_CTX tries to use a DRBG. >> >> This is what I mean when I say that things have gotten more complicated >> under the hood >> due to the replumbing. To understand the problem, please take at a look of >> the sanitizer >> callstack in >> >> https://github.com/openssl/openssl/issues/9455 >> >> >> Matthias >> >> >> >
Re: Thread sanitiser problems
Do you need to hold the lock across dependant items? For example, why can't the DRBG code unlock before fetching the AES-CTR code?
Re: Thread sanitiser problems
Sorry, my reply was misleading, since Pauli is talking mainly about #9454. Please take a look at the issue description https://github.com/openssl/openssl/issues/9454 instead. Matthias On 30.07.19 12:47, Matthias St. Pierre wrote: On 30.07.19 12:43, Kurt Roeckx wrote: I currently fail to see how that's a problem, unless that EVP_CIPHER_CTX tries to use a DRBG. This is what I mean when I say that things have gotten more complicated under the hood due to the replumbing. To understand the problem, please take at a look of the sanitizer callstack in https://github.com/openssl/openssl/issues/9455 Matthias
Re: Thread sanitiser problems
On 30.07.19 12:43, Kurt Roeckx wrote: I currently fail to see how that's a problem, unless that EVP_CIPHER_CTX tries to use a DRBG. This is what I mean when I say that things have gotten more complicated under the hood due to the replumbing. To understand the problem, please take at a look of the sanitizer callstack in https://github.com/openssl/openssl/issues/9455 Matthias
Re: Thread sanitiser problems
On Tue, Jul 30, 2019 at 12:41:16PM +0200, Matthias St. Pierre wrote: > On 30.07.19 11:59, Kurt Roeckx wrote: > > On Tue, Jul 30, 2019 at 12:42:33PM +1000, Dr Paul Dale wrote: > > > Overly simplified, the problem boils down to the CTR DRBG needing an AES > > > CTR cipher context to work. When creating the former, a recursive call > > > is made to get the latter. > > I'm not sure what you mean with "CTR" both times. > > > > Are you saying that an AES requires a DRBG now? > > > > No. Pauli simply meant that the CTR DRBG utilizes an EVP_CIPHER_CTX for its > internal implementation. I currently fail to see how that's a problem, unless that EVP_CIPHER_CTX tries to use a DRBG. Kurt
Re: Thread sanitiser problems
On Tue, Jul 30, 2019 at 12:42:33PM +1000, Dr Paul Dale wrote: > Overly simplified, the problem boils down to the CTR DRBG needing an AES CTR > cipher context to work. When creating the former, a recursive call is made > to get the latter. I'm not sure what you mean with "CTR" both times. Are you saying that an AES requires a DRBG now? Kurt
Re: Thread sanitiser problems
On 30.07.19 04:42, Dr Paul Dale wrote: > Bringing the discussions over to the project list. That's a very good idea Pauli to bring this subject to a wider audience for discussion. I would like to take the opportunity to re-post a general remark which I made in https://github.com/openssl/openssl/issues/9455#issuecomment-515340391 > I am convinced that issues #9454 and #9455 might be only the tip of an iceberg > and we shouldn't just narrow down our focus and fix them as isolated issues. > Instead, the @openssl/omc should take them as an indication that it might be > necessary to pause and rethink the rules for how and when the low level core > routines are allowed to utilize higer level crypto routines (like RAND_bytes()). > Also, locking rules might be necessary to prevent lock-order inversion (#9454 (comment)). > Or it might be necessary to simplify the design, e.g. by replacing the context lock > and the store lock by a single lock. > > There has been a lot of replumbing going on recently and we need to take care that > the overall structure of OpenSSL remains stable and manageable. The double and > recursive lock issues are an indicator that things have become more complicated > "under the hood" (or should I say more appropriately "under the washing stand"?) > The original OpenSSL 3.0.0 Design document is only a snapshot from the very beginning. > It has not changed recently, and it might be a good time now to explitly write down > all the changes and innovations which have taken place since then. Matthias
Thread sanitiser problems
Bringing the discussions over to the project list. The problem is initially mentioned in #9454. Followup issues with infinite recursion and use after free also get mentioned in #9455. These are addressed by #9477 which clears the flush flag before flushing and removes the dependence on the RAND call. Reproduction seems to require gcc-8, gcc-7 doesn’t have the appropriate thread sanitisation. Overly simplified, the problem boils down to the CTR DRBG needing an AES CTR cipher context to work. When creating the former, a recursive call is made to get the latter. A deadlock results due to a cycle in the locking. The RAND/DRBG code will not be the only place where this occurs. KDF, MAC and some public key operations will suffer a similar issue. This is almost certainly going to hurt moving forward. On to possible mitigations: 1. Make our locks recursive. Add a reference counter and a thread ID to the lock structure. There would be no need to make either use atomic operations. Getting a read lock after a write lock would be considered an extra write lock, handing a write lock while holding a read lock could be problematic. This seems very messy. 2. Return dependent algorithms as part of the registration process. The particular algorithm could be preloaded somehow. I’m not sure how ugly this will become but it will need names (nids) for each possible DRBG type. Thoughts anyone? Any better solutions? Any other solutions? Pauli -- Dr Paul Dale | Cryptographer | Network Security & Encryption Phone +61 7 3031 7217 Oracle Australia