Re: PATCH #2: connection_reuse

2020-08-20 Thread Thorsten Habich


On 8/20/2020 2:38 PM, Wietse Venema wrote:
> Thorsten Habich:
>> On 8/19/2020 4:31 PM, Viktor Dukhovni wrote:
>>> Do *resumed* sessions always fail to validate?  Or is that intermittent?
>> As far as I could see resumed sessions that failed keep failing
> That's not what he asked.
>
> What he asked is:
>
> - Do FAILURES happen ONLY after a session is RESUMED.
>
>   Wietse

Sorry, no. The first connection decides if the problem occurs or not. If
the session is resumed the error only occurs *if the first connection
failed*.
If the first connection was successful the error will not appear. The
status then seems to change after a restart (which, as Viktor clarified,
clears the session cache) or, I assume, once
tlsproxy_tls_session_cache_timeout (default: 3600) has elapsed.

In the examples I found in our logs, after a failed connection, the
first successful delivery without a restart was made after 1h + x minutes.

For sessions which do not get resumed at all the error occurs
frequently, too.

If I remember correctly the certificate verification with connection
reuse (so the tlsproxy gets involved) was fixed with:

20200620

    Bugfix (introduced: Postfix 3.4): SMTP over TLS connection
    reuse was broken for configurations that use explicit trust
    anchors. Reported by Thorsten Habich. Fixed by calling DANE
    initialization unconditionally (WTF). File: tlsproxy/tlsproxy.c.

Might there still be a problem?




Re: PATCH #2: connection_reuse

2020-08-20 Thread Wietse Venema
Thorsten Habich:
> If I remember correctly the certificate verification with connection
> reuse (so the tlsproxy gets involved) was fixed with:
> 
> 20200620
> 
>     Bugfix (introduced: Postfix 3.4): SMTP over TLS connection
>     reuse was broken for configurations that use explicit trust
>     anchors. Reported by Thorsten Habich. Fixed by calling DANE
>     initialization unconditionally (WTF). File: tlsproxy/tlsproxy.c.
> 
> Might there still be a problem?

YOU can verify that, by using a transport map to SELECTIVELY send
mail over an SMTP client that has TLS smtp connection reuse turned
off so that it does not use tlsproxy.

main.cf:
    transport_maps = hash:/etc/postfix/transport

master.cf:
    smtp-noreuse .. .. .. .. .. .. smtp
        -o smtp_tls_connection_reuse=no

/etc/postfix/transport:
    example.com smtp-noreuse:

Wietse


Re: PATCH #2: connection_reuse

2020-08-20 Thread Viktor Dukhovni
On Thu, Aug 20, 2020 at 04:59:49PM +0300, Thorsten Habich wrote:

> > - Do FAILURES happen ONLY after a session is RESUMED.
> 
> Sorry, no. The first connection decides if the problem occurs or not.
> If the session is resumed the error only occurs *if the first
> connection failed*.

Thanks for the answer.  This means that there are no issues recording
the proper validation status in the session cache, and the issue is
entirely validation failure on initial handshake.

I don't recall seeing any logging posted showing those initial
validation failures.  This might be as good a time as any to address
that (the failure logs for the initial connection should have been part
of the post that started this thread).

> If the first connection was successful the error will not appear. The
> status then seems to change after a restart (which, as Viktor clarified,
> clears the session cache) or, I assume, once
> tlsproxy_tls_session_cache_timeout (default: 3600) has elapsed.
> 
> In the examples I found in our logs, after a failed connection, the
> first successful delivery without a restart was made after 1h + x minutes.

This is of course expected.  With a 1h session cache lifetime, new full
handshakes happen only after the previous saved session has expired.  I
would recommend a shorter session lifetime for now.  It will help to get
a better handle on the problem, by doing the initial handshake more
frequently.
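
A minimal sketch of such a change, assuming the SMTP client's session
cache lifetime is what governs these handshakes.  The parameter
smtp_tls_session_cache_timeout exists in Postfix (default: 3600s); the
600s value is only an illustrative choice, and whether
tlsproxy_tls_session_cache_timeout (named earlier in this thread) also
applies to this setup has not been established here.

main.cf:
    # Illustrative value only: expire cached TLS sessions after ten
    # minutes so that full handshakes happen more often while debugging.
    smtp_tls_session_cache_timeout = 600s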

> For sessions which do not get resumed at all the error occurs
> frequently, too.

Yes, that's why you're seeing problems on resumption.

> If I remember correctly the certificate verification with connection
> reuse (so the tlsproxy gets involved) was fixed with:

You keep talking about connection reuse, as though it were somehow
relevant, even though I haven't seen anything in this thread that
suggests that connection reuse is in any way involved.  Why do you
believe that connection reuse is a factor in this issue?

I hope you're not still conflating session resumption with connection
reuse.

-- 
Viktor.


Re: PATCH #3 (Postfix 3.4 + 3.5): TLS connection_reuse with "tafile"

2020-08-20 Thread Viktor Dukhovni
On Thu, Aug 20, 2020 at 01:20:00PM -0400, Wietse Venema wrote:

> Viktor Dukhovni:
>
> > -   && TLS_DANE_BASED(state->client_start_props->tls_level))
> > +   && TLS_DANE_HASTA(state->client_start_props->dane))
> > @@ -1427,7 +1427,7 @@ static void tlsp_get_request_event(int event, void *context)
> > -   TLS_DANE_BASED(state->client_start_props->tls_level));
> > +   TLS_DANE_HASTA(state->client_start_props->dane));
> 
> This looks weird. I thought that the problem was with trust anchors, not DANE?

Yes, the problem is with trust anchors, but DANE is the general case of:

* Policy-based end-entity cert matching:

- DANE "_25._tcp.example.net. IN TLSA 3 ? ? ..." 

- The Postfix "fingerprint" security level

* Policy-based issuer CA cert matching:

- DANE "_25._tcp.example.net. IN TLSA 2 ? ? ..." 

- The Postfix verify/secure levels with a custom per-site
  "tafile" .

Actual DANE TLSA RRsets can have either or both DANE-EE or DANE-TA
records, with verification ultimately matching either or both.  The
"fingerprint" level is mapped to DANE-EE, while "tafile" support is
mapped to DANE-TA.

Thus actual DANE, fingerprint and secure/verify with a "tafile" are all
handled via the "general case" of "some sort of DANE-like policy".
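
As an illustration only, the three cases might appear in an
smtp_tls_policy_maps table roughly as follows.  This is a sketch, not
taken from the reporter's configuration: the table file name, the
domains, the trust-anchor path and the fingerprint placeholder are all
hypothetical.

/etc/postfix/tls_policy:
    # Actual DANE: TLSA records published in DNSSEC-signed DNS.
    example.net    dane-only
    # End-entity matching (handled like DANE-EE): certificate fingerprint.
    example.org    fingerprint match=<hypothetical-certificate-fingerprint>
    # Issuer matching (handled like DANE-TA): per-site trust-anchor file.
    example.com    secure tafile=/etc/postfix/example-com-ta.pem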

In Postfix 3.6, the job of validating "some sort of DANE-like policy" is
entirely delegated to OpenSSL.  You'll be pleased to know that in
Postfix 3.6 the TLS_DANE_HASTA() and TLS_DANE_HASEE() macros are gone.
We no longer need to treat the various kinds of DANE-like matching differently.

-- 
Viktor.


Re: PATCH #3 (Postfix 3.4 + 3.5): TLS connection_reuse with "tafile"

2020-08-20 Thread Wietse Venema
Viktor Dukhovni:
> On Thu, Aug 20, 2020 at 01:20:00PM -0400, Wietse Venema wrote:
> 
> > Viktor Dukhovni:
> >
> > > -   && TLS_DANE_BASED(state->client_start_props->tls_level))
> > > +   && TLS_DANE_HASTA(state->client_start_props->dane))
> > > @@ -1427,7 +1427,7 @@ static void tlsp_get_request_event(int event, void *context)
> > > -   TLS_DANE_BASED(state->client_start_props->tls_level));
> > > +   TLS_DANE_HASTA(state->client_start_props->dane));
> >
> > This looks weird. I thought that the problem was with trust anchors, not DANE?
> 
> Yes, the problem is with trust anchors, but DANE is the general case of:
> 
> * Policy-based end-entity cert matching:
> 
> - DANE "_25._tcp.example.net. IN TLSA 3 ? ? ..." 
> 
> - The Postfix "fingerprint" security level
> 
> * Policy-based issuer CA cert matching:
> 
> - DANE "_25._tcp.example.net. IN TLSA 2 ? ? ..." 
> 
> - The Postfix verify/secure levels with a custom per-site
>   "tafile" .
> 
> Actual DANE TLSA RRsets can have either or both DANE-EE or DANE-TA
> records, with verification ultimately matching either or both.  The
> "fingerprint" level is mapped to DANE-EE, while "tafile" support is
> mapped to DANE-TA.
> 
> Thus actual DANE, fingerprint and secure/verify with a "tafile" are all
> handled via the "general case" of "some sort of DANE-like policy".
> 
> In Postfix 3.6, the job of validating "some sort of DANE-like policy" is
> entirely delegated to OpenSSL.  You'll be pleased to know that in
> Postfix 3.6 the TLS_DANE_HASTA() and TLS_DANE_HASEE() macros are gone.
> We no longer need to treat the various kinds of DANE-like matching differently.

As discussed offlist, I would have structured the code in a different
manner, such that trust-anchor support does not call into the DANE
stack, but DANE and trust anchors are entirely separate features that
call into a common infrastructure.

In any case most of that code is gone with Postfix 3.6.

Wietse


Re: PATCH #3 (Postfix 3.4 + 3.5): TLS connection_reuse with "tafile"

2020-08-20 Thread Wietse Venema
Viktor Dukhovni:
>      state->client_start_props->fd = state->ciphertext_fd;
>      /* These predicates and warning belong inside tls_client_start(). */
>      if (!tls_dane_avail()            /* mandatory side effects!! */
> -        && TLS_DANE_BASED(state->client_start_props->tls_level))
> +        && TLS_DANE_HASTA(state->client_start_props->dane))
>          msg_warn("%s: DANE requested, but not available",
>                   state->client_start_props->namaddr);
>      else
> @@ -1427,7 +1427,7 @@ static void tlsp_get_request_event(int event, void *context)
>          }
>          state->appl_state = tlsp_client_init(state->tls_params,
>                                               state->client_init_props,
> -                  TLS_DANE_BASED(state->client_start_props->tls_level));
> +                  TLS_DANE_HASTA(state->client_start_props->dane));
>          ready = state->appl_state != 0;
>          break;
>      case TLS_PROXY_FLAG_ROLE_SERVER:

This looks weird. I thought that the problem was with trust anchors, not DANE?

Wietse


Re: PATCH #2: connection_reuse

2020-08-20 Thread Thorsten Habich


On 8/19/2020 4:31 PM, Viktor Dukhovni wrote:
>
> Do *resumed* sessions always fail to validate?  Or is that intermittent?

As far as I could see resumed sessions that failed keep failing
(probably until the session cache expires) but I had to restart
Postfix most times before that happened.

> When resumption fails, was the preceding non-resumed session successful?

Yes. Other connections with tafile or with CApath configuration were
successfully made.
I saw a tlsproxy process with the same process ID which had a failed
tafile based session and another successful non-tafile connection right
afterwards.

I think I will increase the debugging today and afterwards turn off
connection_reuse for the tafile based configurations. At least with
Postfix <3.5.4 the verification only failed when connection_reuse was on.

>
> Have you considered, as a differential diagnostic procedure, setting up a
> separate transport for the problem domain, and using the trust-anchors in
> question as the CAfile for the transport instead of a per-destination
> policy "tafile"?
>
> Are the trust-anchors self-signed CA certs, or are they "intermediate" certs
> signed by some other CA?  If intermediate, it takes a bit more effort to
> turn them into a usable CAfile, because they'd need to be encapsulated
> as "TRUSTED CERTIFICATE" PEM objects, with a trust EKU of "serverAuth".
> I can post an example of how to do that if necessary.

It would be nice if you could post an example. I need to discuss that
with my colleagues.

> Also, can you test the Postfix 3.6-20200725 snapshot?  In Postfix 3.6
> the "tafile" code is based on the DANE support in OpenSSL 1.1.1, rather
> than the older DANE certificate validation code in Postfix itself.

I tried the same setup on a test system yesterday and wasn't able to
reproduce the problem. So I guess testing with Postfix 3.6 isn't
possible until it becomes stable.
Is a backport to Postfix 3.5 possible?

Am I right that posttls-finger always fails verification with the -A option?




Re: PATCH #2: connection_reuse

2020-08-20 Thread Wietse Venema
Thorsten Habich:
> 
> On 8/19/2020 4:31 PM, Viktor Dukhovni wrote:
> >
> > Do *resumed* sessions always fail to validate?  Or is that intermittent?
> 
> As far as I could see resumed sessions that failed keep failing

That's not what he asked.

What he asked is:

- Do FAILURES happen ONLY after a session is RESUMED.

Wietse