Re: Thread sanitiser problems

2019-07-31 Thread Dr Paul Dale
The locks aren’t being held for long.

I have noticed that the lock in the OPENSSL_CTX only seems to be locked for 
reading and only in one place.  crypto/context.c in openssl_ctx_get_data().  
Writes seem to go via CRYPTO_alloc_ex_data() which uses a global write lock (on 
a different lock).  Would it be feasible to drop the OPENSSL_CTX lock 
completely?  This would address the issue.  Matt?  I think you know this code 
the best.

CRYPTO_alloc_ex_data() is calling the new_func pointer under the read lock and 
the new function can in turn cause a search of the providers for 
implementations which grabs the provider lock.  Conversely, searching the 
providers for implementations grabs the provider lock and can lead to 
openssl_ctx_get_data() grabbing the context lock.

We don’t appear to have recursion at the moment, but the setup seems a little 
fragile.  We do have two different ordering for grabbing locks which is also 
bad.


Pauli
-- 
Dr Paul Dale | Cryptographer | Network Security & Encryption 
Phone +61 7 3031 7217
Oracle Australia



> On 31 Jul 2019, at 2:10 pm, Viktor Dukhovni  
> wrote:
> 
>> On Jul 30, 2019, at 10:02 PM, Dr Paul Dale  wrote:
>> 
>> The #9454 description includes thread sanitisizer logs showing different 
>> lock orderings — this has the potential to dead lock.  Agreed with Rich that 
>> giving up the lock would make sense, but I don’t see a way for this to be 
>> easily done.
> 
> My take is that we should never hold any lock long enough to even consider
> acquiring another lock.  No more than one lock should be held at any one
> time, and only long enough to bump a reference count to ensure that the
> object of interest is no deallocated before we (or our caller in a "get1"
> type interface) is done with it.
> 
> I don't know what "long-term" locks we're holding, but it would be great
> if it were possible to never (or never recursively) hold any such locks.
> 
> -- 
>   Viktor.
> 



Re: Thread sanitiser problems

2019-07-30 Thread Viktor Dukhovni
> On Jul 30, 2019, at 10:02 PM, Dr Paul Dale  wrote:
> 
> The #9454 description includes thread sanitisizer logs showing different lock 
> orderings — this has the potential to dead lock.  Agreed with Rich that 
> giving up the lock would make sense, but I don’t see a way for this to be 
> easily done.

My take is that we should never hold any lock long enough to even consider
acquiring another lock.  No more than one lock should be held at any one
time, and only long enough to bump a reference count to ensure that the
object of interest is no deallocated before we (or our caller in a "get1"
type interface) is done with it.

I don't know what "long-term" locks we're holding, but it would be great
if it were possible to never (or never recursively) hold any such locks.

-- 
Viktor.



Re: Thread sanitiser problems

2019-07-30 Thread Dr Paul Dale
Yes, I’m mostly talking about #9454 here.  #9455 is a bug (clearing the flush 
flag after flushing not before).  The fix in #9477 addresses this and also 
removes the dependence on RAND_bytes.

The #9454 description includes thread sanitisizer logs showing different lock 
orderings — this has the potential to dead lock.  Agreed with Rich that giving 
up the lock would make sense, but I don’t see a way for this to be easily done.

That this also involves the RAND subsystem could be coincidental.  The other 
stack trace is getting an MD via a digest operation (i.e. both traces are for 
algorithms that use other algorithms).  The locks in question are the provider 
store lock and the context lock.

crypto/context.c:
OPENSSL_CTX *ctx; /* function argument */
CRYPTO_THREAD_read_lock(ctx->oncelock);

crypto/provider_core.c:
OPENSSL_CTX *ctx; /* function argument */
struct provider_store_st *store = get_provider_store(ctx);
CRYPTO_THREAD_read_lock(store->lock);


Pauli
-- 
Dr Paul Dale | Cryptographer | Network Security & Encryption 
Phone +61 7 3031 7217
Oracle Australia



> On 30 Jul 2019, at 8:52 pm, Matthias St. Pierre 
>  wrote:
> 
> Sorry, my reply was misleading, since Pauli is talking mainly about #9454.
> Please take a look at the issue description
> 
> https://github.com/openssl/openssl/issues/9454
> 
> instead.
> 
> Matthias
> 
> 
> 
> On 30.07.19 12:47, Matthias St. Pierre wrote:
>> 
>> 
>> On 30.07.19 12:43, Kurt Roeckx wrote:
>>> 
>>> I currently fail to see how that's a problem, unless that
>>> EVP_CIPHER_CTX tries to use a DRBG.
>> 
>> This is what I mean when I say that things have gotten more complicated 
>> under the hood
>> due to the replumbing. To understand the problem, please take at a look of 
>> the sanitizer
>> callstack in
>> 
>> https://github.com/openssl/openssl/issues/9455
>> 
>> 
>> Matthias
>> 
>> 
>> 
> 



Re: Thread sanitiser problems

2019-07-30 Thread Salz, Rich
Do you need to hold the lock across dependant items? For example, why can't the 
DRBG code unlock before fetching the AES-CTR code?





Re: Thread sanitiser problems

2019-07-30 Thread Matthias St. Pierre

Sorry, my reply was misleading, since Pauli is talking mainly about #9454.
Please take a look at the issue description

https://github.com/openssl/openssl/issues/9454

instead.

Matthias



On 30.07.19 12:47, Matthias St. Pierre wrote:



On 30.07.19 12:43, Kurt Roeckx wrote:


I currently fail to see how that's a problem, unless that
EVP_CIPHER_CTX tries to use a DRBG.


This is what I mean when I say that things have gotten more complicated under 
the hood
due to the replumbing. To understand the problem, please take at a look of the 
sanitizer
callstack in

https://github.com/openssl/openssl/issues/9455


Matthias







Re: Thread sanitiser problems

2019-07-30 Thread Matthias St. Pierre




On 30.07.19 12:43, Kurt Roeckx wrote:


I currently fail to see how that's a problem, unless that
EVP_CIPHER_CTX tries to use a DRBG.


This is what I mean when I say that things have gotten more complicated under 
the hood
due to the replumbing. To understand the problem, please take at a look of the 
sanitizer
callstack in

https://github.com/openssl/openssl/issues/9455


Matthias





Re: Thread sanitiser problems

2019-07-30 Thread Kurt Roeckx
On Tue, Jul 30, 2019 at 12:41:16PM +0200, Matthias St. Pierre wrote:
> On 30.07.19 11:59, Kurt Roeckx wrote:
> > On Tue, Jul 30, 2019 at 12:42:33PM +1000, Dr Paul Dale wrote:
> > > Overly simplified, the problem boils down to the CTR DRBG needing an AES 
> > > CTR cipher context to work.  When creating the former, a recursive call 
> > > is made to get the latter.
> > I'm not sure what you mean with "CTR" both times.
> > 
> > Are you saying that an AES requires a DRBG now?
> > 
> 
> No. Pauli simply meant that the CTR DRBG utilizes an EVP_CIPHER_CTX for its 
> internal implementation.

I currently fail to see how that's a problem, unless that
EVP_CIPHER_CTX tries to use a DRBG.


Kurt



Re: Thread sanitiser problems

2019-07-30 Thread Matthias St. Pierre

On 30.07.19 11:59, Kurt Roeckx wrote:

On Tue, Jul 30, 2019 at 12:42:33PM +1000, Dr Paul Dale wrote:

Overly simplified, the problem boils down to the CTR DRBG needing an AES CTR 
cipher context to work.  When creating the former, a recursive call is made to 
get the latter.

I'm not sure what you mean with "CTR" both times.

Are you saying that an AES requires a DRBG now?



No. Pauli simply meant that the CTR DRBG utilizes an EVP_CIPHER_CTX for its 
internal implementation.
(The original FIPS 2.0 implementation was based on low level crypto calls, but 
that was changed by you
to EVP in commit

https://github.com/openssl/openssl/commit/dbdcc04f27db70ac71748eb595ce23c9733afbe7

for performance reasons.)

Matthias


Re: Thread sanitiser problems

2019-07-30 Thread Kurt Roeckx
On Tue, Jul 30, 2019 at 12:42:33PM +1000, Dr Paul Dale wrote:
> Overly simplified, the problem boils down to the CTR DRBG needing an AES CTR 
> cipher context to work.  When creating the former, a recursive call is made 
> to get the latter.

I'm not sure what you mean with "CTR" both times.

Are you saying that an AES requires a DRBG now?


Kurt



Re: Thread sanitiser problems

2019-07-30 Thread Matthias St. Pierre

On 30.07.19 04:42, Dr Paul Dale wrote:
> Bringing the discussions over to the project list.

That's a very good idea Pauli to bring this subject to a wider audience for 
discussion.
I would like to take the opportunity to re-post  a general remark which I made 
in
https://github.com/openssl/openssl/issues/9455#issuecomment-515340391

> I am convinced that issues #9454 and #9455 might be only the tip of an iceberg
> and we shouldn't just narrow down our focus and fix them as isolated issues.
> Instead, the @openssl/omc should take them as an indication that it might be
> necessary to pause and rethink the rules for how and when the low level core
> routines are allowed to utilize higer level crypto routines (like 
RAND_bytes()).
> Also, locking rules might be necessary to prevent lock-order inversion (#9454 
(comment)).
> Or it might be necessary to simplify the design, e.g. by replacing the 
context lock
> and the store lock by a single lock.
>
> There has been a lot of replumbing going on recently and we need to take care 
that
> the overall structure of OpenSSL remains stable and manageable. The double and
> recursive lock issues are an indicator that things have become more 
complicated
> "under the hood" (or should I say more appropriately "under the washing 
stand"?)
> The original OpenSSL 3.0.0 Design document is only a snapshot from the very 
beginning.
> It has not changed recently, and it might be a good time now to explitly 
write down
> all the changes and innovations which have taken place since then.


Matthias


Thread sanitiser problems

2019-07-29 Thread Dr Paul Dale
Bringing the discussions over to the project list.

The problem is initially mentioned in #9454.  Followup issues with infinite 
recursion and use after free also get mentioned in #9455.  These are addressed 
by #9477 which clears the flush flag before flushing and removes the dependence 
on the RAND call.  Reproduction seems to require gcc-8, gcc-7 doesn’t have the 
appropriate thread sanitisation.

Overly simplified, the problem boils down to the CTR DRBG needing an AES CTR 
cipher context to work.  When creating the former, a recursive call is made to 
get the latter.  A deadlock results due to a cycle in the locking.  The 
RAND/DRBG code will not be the only place where this occurs.  KDF, MAC and some 
public key operations will suffer a similar issue.

This is almost certainly going to hurt moving forward.


On to possible mitigations:

1. Make our locks recursive.  Add a reference counter and a thread ID to the 
lock structure.  There would be no need to make either use atomic operations.  
Getting a read lock after a write lock would be considered an extra write lock, 
handing a write lock while holding a read lock could be problematic.  This 
seems very messy.

2. Return dependent algorithms as part of the registration process.  The 
particular algorithm could be preloaded somehow.  I’m not sure how ugly this 
will become but it will need names (nids) for each possible DRBG type.



Thoughts anyone?
Any better solutions?
Any other solutions?


Pauli
-- 
Dr Paul Dale | Cryptographer | Network Security & Encryption 
Phone +61 7 3031 7217
Oracle Australia