Re: PATCH #2: connection_reuse
On 8/20/2020 2:38 PM, Wietse Venema wrote:
> Thorsten Habich:
>> On 8/19/2020 4:31 PM, Viktor Dukhovni wrote:
>>> Do *resumed* sessions always fail to validate? Or is that intermittent?
>> As far as I could see resumed sessions that failed keep failing
> That's not what he asked.
>
> What he asked is:
>
> - Do FAILURES happen ONLY after a session is RESUMED.
>
> Wietse

Sorry, no. The first connection decides if the problem occurs or not.
If the session is resumed the error only occurs *if the first connection failed*. If the first connection was successful the error will not appear. The status then seems to change in case of a restart (which, as clarified by Viktor, clears the session cache) or after, I assume, tlsproxy_tls_session_cache_timeout (default: 3600).

In the examples I found in our logs, after a failed connection, the first successful delivery without a restart was made after 1h + x minutes.

For sessions which do not get resumed at all the error occurs frequently, too.

If I remember correctly the certificate verification with connection reuse (so the tlsproxy gets involved) was fixed with:

20200620

    Bugfix (introduced: Postfix 3.4): SMTP over TLS connection
    reuse was broken for configurations that use explicit trust
    anchors. Reported by Thorsten Habich. Fixed by calling DANE
    initialization unconditionally (WTF). File: tlsproxy/tlsproxy.c.

Might there still be a problem?
Re: PATCH #2: connection_reuse
Thorsten Habich:
> If I remember correctly the certificate verification with connection
> reuse (so the tlsproxy gets involved) was fixed with:
>
> 20200620
>
>     Bugfix (introduced: Postfix 3.4): SMTP over TLS connection
>     reuse was broken for configurations that use explicit trust
>     anchors. Reported by Thorsten Habich. Fixed by calling DANE
>     initialization unconditionally (WTF). File: tlsproxy/tlsproxy.c.
>
> Might there still be a problem?

YOU can verify that, by using a transport map to SELECTIVELY send mail over an SMTP client that has TLS connection reuse turned off, so that it does not use tlsproxy.

main.cf:
    transport_maps = hash:/etc/postfix/transport

master.cf:
    smtp-noreuse .. .. .. .. .. .. smtp
        -o smtp_tls_connection_reuse=no

/etc/postfix/transport:
    example.com smtp-noreuse:

    Wietse
Re: PATCH #2: connection_reuse
On Thu, Aug 20, 2020 at 04:59:49PM +0300, Thorsten Habich wrote:

> > - Do FAILURES happen ONLY after a session is RESUMED.
>
> Sorry, no. The first connection decides if the problem occurs or not.
> If the session is resumed the error only occurs *if the first
> connection failed*.

Thanks for the answer. This means that there are no issues recording the proper validation status in the session cache, and the issue is entirely validation failure on the initial handshake.

I don't recall seeing any logging posted showing those initial validation failures. This might be as good a time as any to address that (the failure logs for the initial connection should have been part of the post that started this thread).

> If the first connection was successful the error will not appear. The
> status then seems to change in case of a restart (as clarified by Viktor
> that clears the session cache) or after I assume
> tlsproxy_tls_session_cache_timeout (default: 3600).
>
> In the examples I found in our logs, after a failed connection, the
> first successful delivery without a restart was made after 1h + x minutes.

This is of course expected. With a 1h session cache lifetime, new full handshakes happen only after the previous saved session has expired. I would recommend a shorter session lifetime for now. It will help to get a better handle on the problem, by doing the initial handshake more frequently.

> For sessions which do not get resumed at all the error occurs
> frequently, too.

Yes, that's why you're seeing problems on resumption.

> If I remember correctly the certificate verification with connection
> reuse (so the tlsproxy gets involved) was fixed with:

You keep talking about connection reuse, as though it were somehow relevant, even though I haven't seen anything in this thread that suggests that connection reuse is in any way involved. Why do you believe that connection reuse is a factor in this issue? I hope you're not still conflating session resumption with connection reuse.

-- Viktor.
Re: PATCH #3 (Postfix 3.4 + 3.5): TLS connection_reuse with "tafile"
On Thu, Aug 20, 2020 at 01:20:00PM -0400, Wietse Venema wrote:

> Viktor Dukhovni:
>
> > -            && TLS_DANE_BASED(state->client_start_props->tls_level))
> > +            && TLS_DANE_HASTA(state->client_start_props->dane))
> > @@ -1427,7 +1427,7 @@ static void tlsp_get_request_event(int event, void *context)
> > -            TLS_DANE_BASED(state->client_start_props->tls_level));
> > +            TLS_DANE_HASTA(state->client_start_props->dane));
>
> This looks weird. I thought that the problem was with trust anchors, not DANE?

Yes, the problem is with trust anchors, but DANE is the general case of:

  * Policy-based end-entity cert matching:

    - DANE "_25._tcp.example.net. IN TLSA 3 ? ? ..."
    - The Postfix "fingerprint" security level

  * Policy-based issuer CA cert matching:

    - DANE "_25._tcp.example.net. IN TLSA 2 ? ? ..."
    - The Postfix verify/secure levels with a custom per-site "tafile".

Actual DANE TLSA RRsets can have either or both DANE-EE or DANE-TA records, with verification ultimately matching either or both. The "fingerprint" level is mapped to DANE-EE, while "tafile" support is mapped to DANE-TA.

Thus actual DANE, fingerprint and secure/verify with a "tafile" are all handled via the "general case" of "some sort of DANE-like policy".

In Postfix 3.6, the job of validating "some sort of DANE-like policy" is entirely delegated to OpenSSL. You'll be pleased to know that in Postfix 3.6 the TLS_DANE_HASTA() and TLS_DANE_HASEE() macros are gone. We no longer need to treat the various DANE-like matching differently.

-- Viktor.
Re: PATCH #3 (Postfix 3.4 + 3.5): TLS connection_reuse with "tafile"
Viktor Dukhovni:
> On Thu, Aug 20, 2020 at 01:20:00PM -0400, Wietse Venema wrote:
>
> > Viktor Dukhovni:
> >
> > > -            && TLS_DANE_BASED(state->client_start_props->tls_level))
> > > +            && TLS_DANE_HASTA(state->client_start_props->dane))
> > > @@ -1427,7 +1427,7 @@ static void tlsp_get_request_event(int event, void *context)
> > > -            TLS_DANE_BASED(state->client_start_props->tls_level));
> > > +            TLS_DANE_HASTA(state->client_start_props->dane));
> >
> > This looks weird. I thought that the problem was with trust anchors, not
> > DANE?
>
> Yes, the problem is with trust anchors, but DANE is the general case of:
>
>   * Policy-based end-entity cert matching:
>
>     - DANE "_25._tcp.example.net. IN TLSA 3 ? ? ..."
>     - The Postfix "fingerprint" security level
>
>   * Policy-based issuer CA cert matching:
>
>     - DANE "_25._tcp.example.net. IN TLSA 2 ? ? ..."
>     - The Postfix verify/secure levels with a custom per-site
>       "tafile".
>
> Actual DANE TLSA RRsets can have either or both DANE-EE or DANE-TA
> records, with verification ultimately matching either or both. The
> "fingerprint" level is mapped to DANE-EE, while "tafile" support is
> mapped to DANE-TA.
>
> Thus actual DANE, fingerprint and secure/verify with a "tafile" are all
> handled via the "general case" of "some sort of DANE-like policy".
>
> In Postfix 3.6, the job of validating "some sort of DANE-like policy" is
> entirely delegated to OpenSSL. You'll be pleased to know that in
> Postfix 3.6 the TLS_DANE_HASTA() and TLS_DANE_HASEE() macros are gone.
> We no longer need to treat the various DANE-like matching differently.

As discussed offlist, I would have structured the code in a different manner, such that trust-anchor support does not call into the DANE stack, but DANE and trust anchors are entirely separate features that call into a common infrastructure.

In any case most of that code is gone with Postfix 3.6.

    Wietse
Re: PATCH #3 (Postfix 3.4 + 3.5): TLS connection_reuse with "tafile"
Viktor Dukhovni:
>      state->client_start_props->fd = state->ciphertext_fd;
>      /* These predicates and warning belong inside tls_client_start(). */
>      if (!tls_dane_avail()              /* mandatory side effects!! */
> -        && TLS_DANE_BASED(state->client_start_props->tls_level))
> +        && TLS_DANE_HASTA(state->client_start_props->dane))
>          msg_warn("%s: DANE requested, but not available",
>                   state->client_start_props->namaddr);
>      else
> @@ -1427,7 +1427,7 @@ static void tlsp_get_request_event(int event, void *context)
>              }
>              state->appl_state = tlsp_client_init(state->tls_params,
>                                                   state->client_init_props,
> -                    TLS_DANE_BASED(state->client_start_props->tls_level));
> +                    TLS_DANE_HASTA(state->client_start_props->dane));
>              ready = state->appl_state != 0;
>              break;
>          case TLS_PROXY_FLAG_ROLE_SERVER:

This looks weird. I thought that the problem was with trust anchors, not DANE?

    Wietse
Re: PATCH #2: connection_reuse
On 8/19/2020 4:31 PM, Viktor Dukhovni wrote:
>
> Do *resumed* sessions always fail to validate? Or is that intermittent?

As far as I could see, resumed sessions that failed keep failing (probably until the session cache expires), but most times I had to restart Postfix before that happened.

> When resumption fails, was the preceding non-resumed session successful?

Yes. Other connections with tafile or with CApath configuration were successfully made. I saw a tlsproxy process with the same process ID which had a failed tafile-based session and another successful non-tafile connection right afterwards.

I think I will increase the debugging today and afterwards turn off connection_reuse for the tafile-based configurations. At least with Postfix <3.5.4 the verification only failed when connection_reuse was on.

> Have you considered as a differential diagnostic procedure setting up a
> separate transport for the problem domain, and using the trust-anchors
> in question as the CAfile for the transport instead of a per-destination
> policy "tafile"?
>
> Are the trust-anchors self-signed CA certs, or are they "intermediate" certs
> signed by some other CA? If intermediate, it takes a bit more effort to
> turn them into a usable CAfile, because they'd need to be encapsulated
> as "TRUSTED CERTIFICATE" PEM objects, with a trust EKU of "serverAuth".
> I can post an example of how to do that if necessary.

It would be nice if you could post an example. I need to discuss that with my colleagues.

> Also, can you test the Postfix 3.6-20200725 snapshot? In Postfix 3.6
> the "tafile" code is based on the DANE support in OpenSSL 1.1.1, rather
> than the older DANE certificate validation code in Postfix itself.

I tried the same setup on a test system yesterday and wasn't able to reproduce the problem. So I guess testing with Postfix 3.6 isn't possible until it becomes stable. Is any backport for Postfix 3.5 possible?

Am I right that posttls-finger always fails verification with the -A option?
Re: PATCH #2: connection_reuse
Thorsten Habich:
>
> On 8/19/2020 4:31 PM, Viktor Dukhovni wrote:
> >
> > Do *resumed* sessions always fail to validate? Or is that intermittent?
>
> As far as I could see resumed sessions that failed keep failing

That's not what he asked.

What he asked is:

- Do FAILURES happen ONLY after a session is RESUMED.

    Wietse