Re: [Standards] Proper SRV Record Fallback

Dave Cridland Fri, 12 Jan 2018 01:45:07 -0800

On 12 January 2018 at 04:05, Travis Burtrum <[email protected]> wrote:
> Hello,
>
> My replies in-line as well.
>
> On 01/10/2018 03:20 AM, Jonas Wielicki wrote:
>> Hi Travis,
>>
>> Notes inline.
>>
>> On Montag, 8. Januar 2018 23:19:36 CET Travis Burtrum wrote:
>>> First, what do docs say:
>>>
>>> RFC-6120[2] Section-3.2.1 #7 says:
>>>> 7. If the initiating entity fails to connect using all resolved IP
>>>>
>>>>       addresses for a given FDQN, then it repeats the process of
>>>>       resolution and connection for the next FQDN returned by the SRV
>>>>       lookup based on the priority and weight as defined in [DNS-SRV].
>>>
>>> 'fails to connect' does this mean the TCP connection fails, or the XMPP
>>> connection fails?
>>>
>>> #8 might leave a hint:
>>>> 8. If the initiating entity receives a response to its SRV query but
>>>>
>>>>       it is not able to establish an XMPP connection using the data
>>>>       received in the response, it SHOULD NOT attempt the fallback
>>>>       process described in the next section (this helps to prevent a
>>>>       state mismatch between inbound and outbound connections).
>>>
>>> This clearly says XMPP connection, but does it apply to #7 ?
>>>
>>> It is also clear I didn't think about this too hard when writing
>>> XEP-0368, because I clearly (to me) assume SRV fallback
>>
>> The text you quote is *not* about SRV fallback. It refers to the fallback to
>> A/AAAA records in the next section (3.2.2). (which, holy cow, we should
>> really not ever ever do if we got SRV records.)
>>
>> The only wording in the RFC for SRV iteration is #7 you quoted. So it is all
>> about the definition of "fails to connect". The elders may have information
>> on what was originally meant by that and if there is some more wisdom on the
>> reasoning for this.
>
> Yes I fully agree RFC-wise this all hinges on what 'fails to connect'
> means, I quoted 8 because it was the only one that didn't use those
> exact words and said 'XMPP connection' instead.  Whether that means
> anything or not is up for interpretation.  I'd also vote since it's at
> best ambiguous we just decide what the 'right' thing is anyway.
>


The question is at what point do we declare a connection "complete",
so a subsequent failure is considered a connection failure for the
domain as a whole.

In most code I can immediately see, it's typically been implemented as
the TCP connection.

I can see an argument that we should make it an XMLStream instead
(even one that immediately gives an error).

I can just about go along with adding a TLS-protected XMLStream to the
mix, but quite honestly I'd be uncomfortable here.
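
To make the candidate cut-off points concrete, here is a minimal Python
sketch. The `Stage` names, the `try_targets` helper and the `attempt`
callback are all hypothetical, invented for illustration; none of them come
from RFC 6120 or from any existing implementation. The idea is that a target
counts as "connected" once it reaches the chosen stage, and only failures
before that stage move on to the next SRV target.

```python
from enum import IntEnum
from typing import Callable, Iterable, Optional


class Stage(IntEnum):
    """How far a connection attempt got (hypothetical names)."""
    NONE = 0        # TCP connect failed
    TCP = 1         # TCP connection established
    XML_STREAM = 2  # an XML stream was opened (even one that errors at once)
    TLS_STREAM = 3  # a TLS-protected XML stream was opened


def try_targets(
    targets: Iterable[str],
    attempt: Callable[[str], Stage],
    complete_at: Stage,
) -> Optional[str]:
    """Return the first target that reaches `complete_at`, else None.

    `attempt` is an assumed callback that reports the furthest stage reached
    for one SRV target.
    """
    for target in targets:
        if attempt(target) >= complete_at:
            # From here on, a failure is a failure for the domain as a whole.
            return target
    return None
```

With `complete_at=Stage.TCP` this behaves like the common TCP-level reading;
raising it to `Stage.XML_STREAM` or `Stage.TLS_STREAM` gives the two stricter
readings discussed above.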

>>> will happen if a
>>> complete XMPP connection is not successful, because under Implementation
>>> Notes I say:
>>>> Server operators should not expect multiplexing (via ALPN) to work in
>>>> all scenarios and therefore should provide additional SRV record(s)
>>>> that do not require multiplexing (either standard STARTTLS or
>>>> dedicated direct XMPP-over-TLS). This is a result of relying on ALPN
>>>> for multiplexing, where ALPN might not be supported by all devices or
>>>> may be disabled by a user due to privacy reasons.
>>>
>>> While I don't explicitly say it, if a port required ALPN to multiplex,
>>> it will generally end up connecting you to a non-XMPP server without
>>> ALPN, meaning you will get back invalid XML, other junk, and/or an
>>> invalid TLS cert.
>>
>> This definitely could use wording in '368.
>
> Absolutely agree, however I'd like to wait to update it so I can also
> note what we decide here, I think.
>
>> While I’m at it, I am really uncomfortable with further supporting the "put
>> everything behind SSL on 443" and move the Deep-Packet-Inspection-war behind
>> the TLS, driving us to a world where we’ll run everything on port 443, with
>> ALPN-based multiplexing. But that’s kinda OT. (but this is why I’m hesitant with
>> making ALPN a MUST.)
>
> Yea this has been addressed a few times over the course of this XEP and
> while I agree with the sentiment, I'd prefer to connect by any means
> possible than to stay unconnected knowing it's more 'pure' that way or
> something. :) (though by all means, when you find evil networks, try to
> get them to change)
>
>>> Now that the docs are out of the way, on to the discussion:
>>>
>>> In my opinion, at least all of cannot-connect-to-port, non-XML,
>>> not-proper-stream and invalid TLS cert should trigger a fallback to the
>>> next highest priority SRV record.
>>
>> Is there a guarantee or requirement that servers in two different SRV
>> priorities can be used at the same time? If not, it seems a bad idea to
>> fall back on them for purely application-layer reasons.
>
> I believe so?  At least my understanding of SRV is that clients can end
> up connecting to any at any time.
>

I don't think a lower priority SRV record can be used if a higher
priority one is available, but that's not quite the same thing.

However, this doesn't mean that a lower priority server instance can't
be in use at the same time as a higher priority one - networking
failures might cause the higher priority ones to be unreachable to
some clients (but not others).
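
For reference, a rough Python sketch of the ordering I have in mind, based on
my reading of RFC 2782: lowest priority value first, weighted-random order
within one priority. The `Srv` tuple and `order_srv` function are hypothetical
names, not taken from any library. A client walks the resulting list from the
top, so a lower priority target is only reached after every higher priority
one has failed for that particular client.

```python
import random
from typing import List, NamedTuple, Optional


class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records: List[Srv], rng: Optional[random.Random] = None) -> List[Srv]:
    """Order records by ascending priority, weighted-random within a priority."""
    rng = rng or random.Random()
    ordered: List[Srv] = []
    for prio in sorted({r.priority for r in records}):
        # Weight-0 records go first so they keep a small chance of selection.
        group = sorted((r for r in records if r.priority == prio),
                       key=lambda r: r.weight != 0)
        while group:
            pick = rng.randint(0, sum(r.weight for r in group))
            running = 0
            for i, rec in enumerate(group):
                running += rec.weight
                if running >= pick:
                    # First record whose running sum reaches the pick wins.
                    ordered.append(group.pop(i))
                    break
    return ordered
```

Two clients with different network views can therefore end up on different
priorities at the same moment, even though each of them followed the same
ordering.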

>>> Everyone in the MUC seemed to agree
>>> if authentication fails a fallback would be a bad idea.
>>>
>>> Sam Whited said that if a TCP connection is established fallback should
>>> cease, that it shouldn't have anything to do with or any knowledge of
>>> XMPP, and that it might have security implications to do otherwise.
>>> (please correct and forgive me if I misunderstood)  I disagree with
>>> this, I think if Eve has control over DNS (and no DNSSEC) she can return
>>> arbitrary records anyway so SRV fallback doesn't matter.
>>
>> That’s not true. As soon as one of the SRV records points to another
>> (possibly unsigned) zone, Mallory could forge DNS replies there even without
>> the ability to forge the whole SRV RRset (a low TTL on one of the SRV target
>> host names (compared to the SRV records themselves) could also ease an attack
>> on those host names compared to the SRV records). The other servers could be
>> taken down by (D)DoS, or if you’re on-path, by messing with the TLS handshake.
>>
>> Now this is irrelevant to the current discussion insofar that this is a
>> vector already present in RFC6120 behaviour where "failed to connect" is
>> interpreted at the TCP level, but it shows that there are cases we haven’t
>> thought of yet.
>>
>> I can’t think of any example which is only allowed by the "new" fallback
>> rules right now, but that doesn’t--unfortunately--mean that none exist.
>
> Right DNSSEC only protects if all domains are equally protected, both
> SRV and A/AAAA.  My only point is with interpreting 'connects to TCP' as
> 'stop trying other SRV records' then someone only has to DOS the highest
> priority server, instead of all of them.
>

Well, DNSSEC only provides protection where it's deployed.

But protecting the SRV record is providing some protection even if the
address records are unprotected.

> I also kind of object to calling these "new" rules, it's how I've always
> interpreted how it should work, how conversations works, and quite
> possibly many others work this way too.  I'm just looking for consensus
> since there doesn't seem to be one. :)
>
>> One argument which could be made is that we assume that certificate
>> validation is safe. In that case, anything post-TLS is (I think) safe to use
>> as a cause for fallback, because if an attacker is able to play Mallory
>> (Dolev-Yao) on the post-TLS (inside TLS) stream, it’s kind of game-over
>> anyways. At least I can’t think of anything which can be gained from
>> diverting traffic (by deliberately causing SRV fallback with e.g.
>> invalid-XML post-TLS) to another host here (the attacker already has full
>> control of the traffic). (If they can only Eavesdrop and not manipulate, I
>> don’t see how they could divert the traffic with things happening post-TLS
>> which couldn’t be applied pre-TLS too (e.g. DoS of the connection)).
>>
>> So if this argument holds, we only need to take special care (with respect to
>> security issues) for pre-TLS fallback rules. Since we’re talking about
>> connecting to xmpps-server, there is no pre-TLS as far as the XML stream is
>> concerned (e.g. invalid-XML would be safe to fall-back on).
>>
>>
>> Now things become tricky if we look at how to handle invalid certificates and
>> other TLS issues. (FWIW, when I asked about this years ago in, I think,
>> jdev@, it was suggested to me to fall back on about anything which isn’t
>> authn failure. Unfortunately, I can’t recall who said that.)
>>
>> So the worst an attacker could do (assuming that we do strict certificate
>> validation, don’t allow non-TLS and that TLS is safe), is DoS, I think. Any
>> modification of the TLS (and pre-TLS) handshake would lead the client to
>> either fail to set up TLS or succeed to set up TLS with the target host, at
>> which point the post-TLS argument from above takes hold. (If a client can be
>> tricked to use a non-TLS stream, that’s a problem all by itself I guess.)
>>
>> Being able to cause a failure to set up TLS (e.g. when stripping the
>> <starttls/> feature with a client who doesn’t attempt starttls independent of
>> the presence of the feature; or by actively MitM-ing the TLS exchange in an
>> attempt to impersonate the target host with an invalid certificate
>> (#corporatefirewall)) is a DoS vector, which can, indeed, be circumvented by
>> simply trying the next SRV record (assuming that the attacker cannot
>> influence that path, too).
>>
>>
>> Do these arguments make sense?
>
> I think so, TLS failure surely shouldn't stop fallback since an attacker
> can easily set that up.  But also I don't think that's the cut-off, as
> you said next, if the *right* (tls-authed) server sends
> <internal-server-error/> you'd want to continue fallback too.
>

What about if the right server responds perfectly, but has a high
latency (or packet loss)?

What about if the right server responds perfectly but you keep losing
the connection?

>> Now one case where this could be a problem, I imagine, is where different SRV
>> priorities are used to group primary and hot-standby servers respectively,
>> with the hot-standby servers being unusable while the primaries are being
>> used. If the hot-standbys cannot reject connections while the primaries are
>> being used, a client could be tricked to connecting to the hot-standbys,
>> potentially getting out-of-sync with the rest of the domain and isolating
>> them on a seemingly empty server with no s2s connectivity.
>>
>> But that can easily happen with DNS connectivity issues already and I’d argue
>> that this is then an issue with the zone which set this up.
>
> Yes I think this is improper SRV use, all servers in your SRV records
> should be usable at any time.
>
>>> I think my proposal is even more generic than the above, I think
>>> authentication-response should be the point when fallback ceases.
>>
>> I disagree. I think the point where authentication is about to start (i.e.
>> the point right before selection of the SASL mechanism) should be where
>> fallback ceases. In addition, no fallback should be made if a required stream
>> feature is not offered.
>>
>> I think it is reasonable to assume that all servers which can be used
>> interchangeably will have identical or equivalent stream- and other features.
>> Thus fallback should not be attempted if there is a problem with the offered
>> stream features. Examples: (a) client requires starttls, server doesn’t
>> offer; (b) client does not allow DIGEST-MD5 or PLAIN for policy reasons,
>> server only offers those.
>
> I think that's equally sensible, I think either one of these would solve
> 99% of the problem.  I suppose it's *possible* in a migration sense if
> servers are different versions or something to offer different SASL
> mechanisms or digests, but I can't imagine it would be common enough to
> worry about in the wild.
>
>> Stream errors which happen before authentication are more difficult.
>> (<internal-server-error/> would be a good candidate for "try the next host".)
>> But I can see how "try the next host" could be a reasonable course of action
>> here.
>>
>>
>>> […] after authentication, whether it's successful or not,
>>> you no longer fall back anymore.
>>
>> I wholeheartedly agree on this one. While failed authn can be an issue on the
>> server-side affecting only a single host, I think it will in most cases
>> simply be a typo in the password or a changed password. In both cases, early
>> user feedback is important (now a clever client could ask the user for the
>> password and also try the other SRV options in the background to rule out
>> server config issues, but that’s nothing we should specify.)
>>
>> (I would argue that it is good practice to block (e.g. with a proper stream
>> error) new connection attempts entirely if you know you won’t be able to
>> handle authentication currently.)
>
> You read minds, I was going to say UI wouldn't have to show a dumb
> 'connecting' the entire time, it could say 'trying server 1', 'trying
> server 2', 'are you sure password is correct? trying server 3' etc etc
>
> It's possible falling back would fix bad username/password too (database
> replication on primary down or something), but at this point we are
> going down the rabbit hole zinid mentioned, what about bookmarks, mam
> sync, etc etc.  This seems like one of those sensible 99% fix points to me.
>
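
Taken together, the rule proposed in this part of the thread could be written
down roughly as below. This is only a sketch of one reading: the `Failure`
categories and `should_try_next_target` are hypothetical names, and treating
pre-authentication stream errors as a reason to move on is an assumption,
since that case is described above as a reasonable course of action rather
than a settled one.

```python
from enum import Enum, auto


class Failure(Enum):
    """Hypothetical failure categories drawn from the discussion above."""
    TCP_CONNECT = auto()
    TLS_HANDSHAKE = auto()             # includes an invalid certificate
    INVALID_XML = auto()
    STREAM_ERROR_PRE_AUTH = auto()     # e.g. <internal-server-error/>
    REQUIRED_FEATURE_MISSING = auto()  # e.g. no starttls, no acceptable SASL mechanism
    AUTH_FAILED = auto()


# Failures after which this reading would NOT try the next SRV target.
NO_FALLBACK = {Failure.REQUIRED_FEATURE_MISSING, Failure.AUTH_FAILED}


def should_try_next_target(failure: Failure) -> bool:
    """True if the client should move on to the next SRV target."""
    return failure not in NO_FALLBACK
```

Under this reading, everything up to the point where SASL mechanism selection
starts can trigger fallback, and nothing after it can.
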
>>> Depending what we decide, I plan to set up various domain/SRV record
>>> combinations for testing, probably clients and servers both need this
>>> type of testing, and I doubt it is done often.
>>
>> Setting up test domains sounds like a great thing to do. I’d like to
>> integrate that in my test suite.
>
> Still going to hold off a bit to try to reach consensus, but sounds
> great, I'll talk to you about it. :)
>
>> kind regards,
>> Jonas
>
> Thanks much!
> Travis
> _______________________________________________
> Standards mailing list
> Info: https://mail.jabber.org/mailman/listinfo/standards
> Unsubscribe: [email protected]
> _______________________________________________
