Re: [Standards] Proper SRV Record Fallback

Travis Burtrum Thu, 11 Jan 2018 20:06:32 -0800

Hello,

My replies in-line as well.


On 01/10/2018 03:20 AM, Jonas Wielicki wrote:
> Hi Travis,
> 
> Notes inline.
> 
> On Montag, 8. Januar 2018 23:19:36 CET Travis Burtrum wrote:
>> First, what do docs say:
>>
>> RFC-6120[2] Section-3.2.1 #7 says:
>>> 7. If the initiating entity fails to connect using all resolved IP
>>>       addresses for a given FQDN, then it repeats the process of
>>>       resolution and connection for the next FQDN returned by the SRV
>>>       lookup based on the priority and weight as defined in [DNS-SRV].
>>
>> Does 'fails to connect' mean the TCP connection fails, or the XMPP
>> connection fails?
>>
>> #8 might leave a hint:
>>> 8. If the initiating entity receives a response to its SRV query but
>>>       it is not able to establish an XMPP connection using the data
>>>       received in the response, it SHOULD NOT attempt the fallback
>>>       process described in the next section (this helps to prevent a
>>>       state mismatch between inbound and outbound connections).
>>
>> This clearly says XMPP connection, but does it apply to #7 ?
>>
>> It is also clear I didn't think about this too hard when writing
>> XEP-0368, because I clearly (to me) assume SRV fallback 
> 
> The text you quote is *not* about SRV fallback. It refers to the fallback to 
> A/AAAA records in the next section (3.2.2). (which, holy cow, we should
> really not ever ever do if we got SRV records.)
> 
> The only wording in the RFC for SRV iteration is #7 you quoted. So it is all
> about the definition of "fails to connect". The elders may have information
> on what was originally meant by that and if there is some more wisdom on
> the reasoning for this.

Yes, I fully agree that RFC-wise this all hinges on what 'fails to
connect' means. I quoted #8 because it was the only one that didn't use
those exact words and said 'XMPP connection' instead. Whether that means
anything or not is up for interpretation. Since it's at best ambiguous,
I'd also vote that we just decide what the 'right' thing is anyway.
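
For reference, the SRV iteration that #7 defers to [DNS-SRV] for is the RFC 2782 ordering: lower priority values first, then weighted random selection among records of equal priority. A minimal Python sketch of that ordering; the SrvRecord type and order_srv_records name are illustrative only, not taken from any XMPP library:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def order_srv_records(records, rng=random):
    """Return SRV records in the order a client should try them (RFC 2782).

    Lower priority values come first; within one priority, records are
    picked by weighted random selection, so heavier records tend to be
    tried earlier.
    """
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # RFC 2782: weight-0 records go at the start of the list before
        # the running sums are computed (False sorts before True).
        group.sort(key=lambda r: r.weight != 0)
        while group:
            total = sum(r.weight for r in group)
            pick = rng.randint(0, total)  # uniform in [0, total], inclusive
            running = 0
            for r in group:
                running += r.weight
                # Select the first record whose running sum reaches the pick.
                if running >= pick:
                    ordered.append(r)
                    group.remove(r)
                    break
    return ordered
```

Whatever 'fails to connect' ends up meaning, it only decides when the client moves from one entry of this list to the next.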

>> will happen if a
>> complete XMPP connection is not successful, because under Implementation
>> Notes I say:
>>> Server operators should not expect multiplexing (via ALPN) to work in
>>> all scenarios and therefore should provide additional SRV record(s)
>>> that do not require multiplexing (either standard STARTTLS or
>>> dedicated direct XMPP-over-TLS). This is a result of relying on ALPN
>>> for multiplexing, where ALPN might not be supported by all devices or
>>> may be disabled by a user due to privacy reasons.
>>
>> While I don't explicitly say it, if a port required ALPN to multiplex,
>> it will generally end up connecting you to a non-XMPP server without
>> ALPN, meaning you will get back invalid XML, other junk, and/or an
>> invalid TLS cert.
> 
> This definitely could use wording in '368.

Absolutely agree, however I'd like to wait to update it so I can also
note what we decide here, I think.
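
To make the ALPN case concrete: a client can tell fairly cheaply that it reached a non-XMPP service by checking whether the first element it receives is a stream header in the http://etherx.jabber.org/streams namespace. A rough Python sketch using the standard library's incremental XML parser; the function name and its three-way result are assumptions for illustration, not part of XEP-0368:

```python
import xml.etree.ElementTree as ET
from typing import Optional

STREAMS_NS = "http://etherx.jabber.org/streams"


def looks_like_xmpp_stream(data: bytes) -> Optional[bool]:
    """Classify the first bytes a server sent after connecting.

    True  -> the first element is a <stream:stream> header
    False -> not well-formed XML, or some other root element (e.g. HTML)
    None  -> no start tag seen yet; read more bytes before deciding
    """
    parser = ET.XMLPullParser(events=("start",))
    try:
        parser.feed(data)
        # Parse errors are queued by feed() and raised here.
        for _event, elem in parser.read_events():
            # Only the first start tag matters; ElementTree expands the
            # "stream:" prefix to "{namespace}localname".
            return elem.tag == "{%s}stream" % STREAMS_NS
    except ET.ParseError:
        return False
    return None
```

A False result here (junk, HTML, an HTTP error page) is exactly the 'non-XML' / 'not-proper-stream' case below.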

> While I’m at it, I am really uncomfortable with further supporting the "put
> everything behind SSL on 443" approach and moving the Deep-Packet-Inspection
> war behind the TLS, driving us to a world where we’ll run everything on port
> 443 with ALPN-based multiplexing. But that’s kinda OT. (But this is why I’m
> hesitant to make ALPN a MUST.)

Yeah, this has been addressed a few times over the course of this XEP,
and while I agree with the sentiment, I'd prefer to connect by any means
possible rather than stay unconnected knowing it's more 'pure' that way
or something. :) (Though by all means, when you find evil networks, try
to get them to change.)

>> Now that the docs are out of the way, on to the discussion:
>>
>> In my opinion, at least all of cannot-connect-to-port, non-XML,
>> not-proper-stream and invalid TLS cert should trigger a fallback to the
>> next highest priority SRV record.
> 
> Is there a guarantee or requirement that servers in two different SRV
> priorities can be used at the same time? If not, it seems a bad idea to
> fall back on them for purely application-layer reasons.

I believe so?  At least my understanding of SRV is that clients can end
up connecting to any at any time.

>> Everyone in the MUC seemed to agree
>> if authentication fails a fallback would be a bad idea.
>>
>> Sam Whited said that if a TCP connection is established fallback should
>> cease, that it shouldn't have anything to do with or any knowledge of
>> XMPP, and that it might have security implications to do otherwise.
>> (please correct and forgive me if I misunderstood)  I disagree with
>> this, I think if Eve has control over DNS (and no DNSSEC) she can return
>> arbitrary records anyway so SRV fallback doesn't matter. 
> 
> That’s not true. As soon as one of the SRV records points to another
> (possibly unsigned) zone, Mallory could forge DNS replies there even
> without the ability to forge the whole SRV RRset (a low TTL on one of the
> SRV target host names, compared to the SRV records themselves, could also
> ease an attack on those host names). The other servers could be taken down
> by (D)DoS, or, if you’re on-path, by messing with the TLS handshake.
> 
> Now this is irrelevant to the current discussion insofar as this is a
> vector already present in RFC 6120 behaviour, where "failed to connect" is
> interpreted at the TCP level, but it shows that there are cases we haven’t
> thought of yet.
> 
> I can’t think of any example which is only allowed by the "new" fallback
> rules right now, but that doesn’t, unfortunately, mean that none exist.

Right, DNSSEC only protects you if all domains are equally protected,
both SRV and A/AAAA. My only point is that if we interpret 'connects to
TCP' as 'stop trying other SRV records', then someone only has to DoS
the highest-priority server instead of all of them.
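
A toy Python model of the two readings of #7, assuming a hypothetical attempt(target) callback that reports how far a connection got (the Stage names and host names are made up for illustration). With the TCP-level reading, an attacker who only breaks the highest-priority target after the TCP handshake keeps the client from ever reaching the others:

```python
from enum import Enum


class Stage(Enum):
    """How far a (simulated) connection attempt to one SRV target got."""
    TCP_FAILED = 1    # could not open a TCP connection at all
    TLS_FAILED = 2    # handshake failed or the certificate did not validate
    BAD_STREAM = 3    # junk, non-XMPP XML, or a pre-auth stream error
    AUTH_STARTED = 4  # reached SASL: a real XMPP server for this domain


def connect_with_fallback(targets, attempt, stop_after_tcp):
    """Try SRV targets in order and return the one used, or None.

    attempt(target) is a hypothetical callback returning the Stage reached.
    stop_after_tcp=True models reading "fails to connect" as a TCP failure
    only; False models stopping fallback once authentication starts.
    """
    for target in targets:
        stage = attempt(target)
        if stage is Stage.AUTH_STARTED:
            return target
        if stage is not Stage.TCP_FAILED and stop_after_tcp:
            return None  # TCP succeeded, so this reading allows no fallback
    return None


# Mallory only needs to break the highest-priority target after TCP connect.
behaviour = {"xmpp1.example.com": Stage.BAD_STREAM,
             "xmpp2.example.com": Stage.AUTH_STARTED}
print(connect_with_fallback(list(behaviour), behaviour.get, stop_after_tcp=True))
# prints None
print(connect_with_fallback(list(behaviour), behaviour.get, stop_after_tcp=False))
# prints xmpp2.example.com
```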

I also kind of object to calling these "new" rules: it's how I've always
interpreted it, it's how Conversations works, and quite possibly many
others work this way too. I'm just looking for consensus since there
doesn't seem to be one. :)

> One argument which could be made is that we assume that certificate
> validation is safe. In that case, anything post-TLS is (I think) safe to use
> as a cause for fallback, because if an attacker is able to play Mallory
> (Dolev-Yao) on the post-TLS (inside TLS) stream, it’s kind of game-over
> anyway. At least I can’t think of anything which can be gained from
> diverting traffic (by deliberately causing SRV fallback with e.g.
> invalid-XML post-TLS) to another host here (the attacker already has full
> control of the traffic). (If they can only Eavesdrop and not manipulate, I
> don’t see how they could divert the traffic with things happening post-TLS
> which couldn’t be applied pre-TLS too (e.g. DoS of the connection)).
> 
> So if this argument holds, we only need to take special care (with respect to 
> security issues) for pre-TLS fallback rules. Since we’re talking about 
> connecting to xmpps-server, there is no pre-TLS as far as the XML stream is 
> concerned (e.g. invalid-XML would be safe to fall-back on).
> 
> 
> Now things become tricky if we look at how to handle invalid certificates and 
> other TLS issues. (FWIW, when I asked about this years ago in, I think,
> jdev@, it was suggested to me to fall back on about anything which isn’t
> authn failure. Unfortunately, I can’t recall who said that.)
> 
> So the worst an attacker could do (assuming that we do strict certificate 
> validation, don’t allow non-TLS and that TLS is safe), is DoS, I think. Any 
> modification of the TLS (and pre-TLS) handshake would lead the client to 
> either fail to set up TLS or succeed in setting up TLS with the target host, at
> which point the post-TLS argument from above takes hold. (If a client can be 
> tricked to use a non-TLS stream, that’s a problem all by itself I guess.)
> 
> Being able to cause a failure to set up TLS (e.g. when stripping the 
> <starttls/> feature with a client who doesn’t attempt starttls independent of 
> the presence of the feature; or by actively MitM-ing the TLS exchange in an 
> attempt to impersonate the target host with an invalid certificate 
> (#corporatefirewall)) is a DoS vector, which can, indeed, be circumvented by 
> simply trying the next SRV record (assuming that the attacker cannot
> influence that path, too).
> 
> 
> Do these arguments make sense?

I think so. A TLS failure surely shouldn't stop fallback, since an
attacker can easily cause one. But I also don't think that's the
cut-off: as you say next, if the *right* (TLS-authenticated) server
sends <internal-server-error/>, you'd want to continue fallback too.

> Now one case where this could be a problem, I imagine, is where different SRV 
> priorities are used to group primary and hot-standby servers respectively, 
> with the hot-standby servers being unusable while the primaries are being 
> used. If the hot-standbys cannot reject connections while the primaries are 
> being used, a client could be tricked into connecting to the hot-standbys,
> potentially getting out of sync with the rest of the domain and isolating
> them on a seemingly empty server with no s2s connectivity.
> 
> But that can easily happen with DNS connectivity issues already and I’d argue 
> that this is then an issue with the zone which set this up.

Yes I think this is improper SRV use, all servers in your SRV records
should be usable at any time.

>> I think my proposal is even more generic than the above: I think the
>> authentication response should be the point when fallback ceases.
> 
> I disagree. I think the point where authentication is about to start (i.e.
> the point right before selection of the SASL mechanism) should be where
> fallback ceases. In addition, no fallback should be made if a required
> stream feature is not offered.
> 
> I think it is reasonable to assume that all servers which can be used
> interchangeably will have identical or equivalent stream and other
> features. Thus fallback should not be attempted if there is a problem with
> the offered stream features. Examples: (a) client requires starttls, server
> doesn’t offer it; (b) client does not allow DIGEST-MD5 or PLAIN for policy
> reasons, server only offers those.

I think that's equally sensible; either one of these would solve 99% of
the problem. I suppose it's *possible*, in a migration where servers run
different versions or something, for them to offer different SASL
mechanisms or digests, but I can't imagine it would be common enough to
worry about in the wild.
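
Jonas's two examples translate into a simple check on the offered stream features, run before any SASL mechanism is selected. A Python sketch, assuming the client already has the <stream:features/> element as a string; the function name and return convention are illustrative only:

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional

TLS_NS = "urn:ietf:params:xml:ns:xmpp-tls"
SASL_NS = "urn:ietf:params:xml:ns:xmpp-sasl"


def check_features(features_xml: str, require_starttls: bool,
                   allowed_mechanisms: Iterable[str]) -> Optional[str]:
    """Return None if the offered stream features are acceptable, else a reason.

    A non-None result means "give up, do not fall back": interchangeable SRV
    targets are assumed to offer equivalent features, so another host would
    fail the same way. Pass require_starttls=True only for the pre-TLS
    feature set; the post-TLS set no longer advertises <starttls/>.
    """
    features = ET.fromstring(features_xml)
    if require_starttls and features.find("{%s}starttls" % TLS_NS) is None:
        return "starttls required but not offered"
    offered = {
        (m.text or "").strip()
        for m in features.iterfind("{%s}mechanisms/{%s}mechanism" % (SASL_NS, SASL_NS))
    }
    # An empty set means SASL is not offered yet (e.g. pre-TLS features).
    if offered and not offered & set(allowed_mechanisms):
        return "no acceptable SASL mechanism"
    return None
```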

> Stream errors which happen before authentication are more difficult. 
> (<internal-server-error/> would be a good candidate for "try the next host".) 
> But I can see how "try the next host" could be a reasonable course of action 
> here.
> 
> 
>> […] after authentication, whether it's successful or not,
>> you no longer fall back anymore.
> 
> I wholeheartedly agree on this one. While failed authn can be an issue on
> the server side affecting only a single host, I think it will in most cases
> simply be a typo in the password or a changed password. In both cases,
> early user feedback is important (now a clever client could ask the user
> for the password and also try the other SRV options in the background to
> rule out server config issues, but that’s nothing we should specify.)
> 
> (I would argue that it is good practice to block (e.g. with a proper stream 
> error) new connection attempts entirely if you know you won’t be able to 
> handle authentication currently.)

You read minds: I was going to say the UI wouldn't have to show a dumb
'connecting' the entire time; it could say 'trying server 1', 'trying
server 2', 'are you sure the password is correct? trying server 3', etc.

It's possible falling back would fix a bad username/password too
(database replication on the primary being down, or something), but at
this point we are going down the rabbit hole zinid mentioned: what about
bookmarks, MAM sync, etc.? This seems like one of those sensible 99% fix
points to me.
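
Pulling the points of this exchange together, the fallback decision could look roughly like the Python sketch below. It reflects only the positions in this thread, not an agreed rule; the Failure names are hypothetical, and which pre-authentication stream errors should qualify is still open (only <internal-server-error/> is treated as retryable here):

```python
from enum import Enum, auto
from typing import Optional


class Failure(Enum):
    """Hypothetical names for where an attempt on one SRV target failed."""
    TCP_CONNECT = auto()
    TLS_HANDSHAKE = auto()             # includes an invalid certificate
    NOT_XML = auto()                   # e.g. a non-XMPP service behind an ALPN port
    NOT_A_STREAM = auto()              # well-formed XML, but no <stream:stream>
    STREAM_ERROR = auto()              # a stream error before authentication
    MISSING_REQUIRED_FEATURE = auto()  # e.g. <starttls/> required but not offered
    NO_ACCEPTABLE_SASL = auto()        # e.g. only PLAIN/DIGEST-MD5 offered
    AUTH_FAILED = auto()               # authentication was attempted and failed


# Failures assumed to be identical on every interchangeable SRV target,
# or that need early user feedback (a wrong password), so fallback stops.
NO_FALLBACK = {
    Failure.MISSING_REQUIRED_FEATURE,
    Failure.NO_ACCEPTABLE_SASL,
    Failure.AUTH_FAILED,
}


def should_try_next_target(failure: Failure,
                           stream_error: Optional[str] = None) -> bool:
    """Return True if the client should move on to the next SRV target.

    stream_error is the RFC 6120 condition name (e.g. "internal-server-error")
    and only matters for Failure.STREAM_ERROR.
    """
    if failure in NO_FALLBACK:
        return False
    if failure is Failure.STREAM_ERROR:
        # Still open in this thread; only internal-server-error is retried here.
        return stream_error == "internal-server-error"
    # Connection, TLS and malformed-stream failures are all treated as
    # "this host is unusable", so the next target is tried.
    return True
```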

>> Depending what we decide, I plan to set up various domain/SRV record
>> combinations for testing, probably clients and servers both need this
>> type of testing, and I doubt it is done often.
> 
> Setting up test domains sounds like a great thing to do. I’d like to
> integrate that in my test suite.

Still going to hold off a bit to try to reach consensus, but sounds
great, I'll talk to you about it. :)

> kind regards,
> Jonas

Thanks much!
Travis
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
