Re: [sidr] WGLC: draft-ietf-sidr-origin-ops

Shane Amante Sat, 29 Oct 2011 22:38:14 -0700

I have some questions that pertain to this document, specifically around:
- whether it's intended or 'safe' to use BGP Attributes (MED, communities) to 
convey validity of prefixes from one ASN to another ASN
- better guidance/recommendations around the number, placement and 
synchronization characteristics of RPKI caches within an SP.



1)  From Section 3:
---snip---
   A local valid cache containing all RPKI data may be gathered from the
   global distributed database using the rsync protocol, [RFC5781], and
   a validation tool such as rcynic [rcynic].
---snip---

Would it be possible to mention and/or point to how the above process is 
supposed to be bootstrapped?  IOW, is it expected that, eventually, the RIRs 
are going to publish to their end-users and maintain URIs of RPKI publication 
points?  Since this is an Ops guidelines document, some guidance and/or 
pointers are likely to save [lots of] questions down the road.  I'm not 
expecting this to be a tutorial document, but some idea on the theory of how a 
new SP bootstraps their cache(s) would be helpful.

2)  Given that, to my knowledge, the RPKI is [very] loosely synchronized in a 
"pull-only" fashion, shouldn't there be some text added below to that effect 
that:
    a)  It may be best not to go more than, say, 2 levels of RPKI caches deep 
inside a single organization/ASN, to keep RPKI caches from falling out of sync 
with each other?  IOW, there is likely a small set of 1st/top-level RPKI 
caches that speak externally to fetch RPKI cache information, (similar to 
'hidden' authoritative DNS servers), then a second tier of RPKI caches that 
synchronize (only) from the top-level RPKI caches, (similar to external, 
anycast authoritative DNS servers). 
    b)  Operators should look at running more aggressive synchronization 
intervals _internally_ within their organization/ASN, from "children" 
(2nd-level) RPKI caches to the 'parent' (top-level) RPKI cache in their 
organization/ASN, compared to more "relaxed" synchronization intervals to RPKI 
caches external to their organization (top-level RPKI caches in their ASN to 
RIRs)?
---snip---
   Validated caches may also be created and maintained from other
   validated caches.  Network operators SHOULD take maximum advantage of
   this feature to minimize load on the global distributed RPKI
   database.  Of course, the recipient SHOULD re-validate the data.
---snip---
While I'm here, I don't think the text in Section 6, "Notes", addresses the 
above concerns, at all.  In fact, I find it extremely unhelpful to just dismiss 
this concern, out of hand, with the text: "There is no 'fix' for this, it is 
the nature of distributed data with distributed caches".  We know what the 
answer is here: you tune the synchronization intervals to strike the 
appropriate balance between [very] tight synchronization vs. increased load on 
the systems being synchronized.  I find it hard to believe that a simple 
suggestion such as this is not proposed in the text, even if qualified with a 
phrase such as "the 
suggested values for such synchronization are outside the scope of this 
document, but will likely be subject to further studies to determine optimal 
values based on field experience".
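
A minimal sketch of the tuning trade-off described above, in Python.  The 
function names and the interval values are assumptions chosen for illustration, 
not values from the draft.  It models each cache tier as polling its parent on 
a fixed interval, so the worst-case delay before a change reaches the last tier 
is the sum of the intervals along the chain.

```python
def worst_case_staleness(poll_intervals_s):
    """Upper bound (seconds) on propagation delay through a chain of caches.

    poll_intervals_s[0] is the top-level cache polling the external
    repository; later entries are internal tiers polling their parent.
    """
    return sum(poll_intervals_s)


def fetches_per_day(poll_interval_s):
    """Load one cache places on its parent, in fetches per day."""
    return 86400 // poll_interval_s


# Assumed example values, not recommendations.
EXTERNAL = 3600  # top-level cache polls the global RPKI hourly
INTERNAL = 300   # second-tier caches poll the top-level cache every 5 minutes

two_tier = worst_case_staleness([EXTERNAL, INTERNAL])
three_tier_relaxed = worst_case_staleness([EXTERNAL, EXTERNAL, EXTERNAL])

print(two_tier, three_tier_relaxed, fetches_per_day(INTERNAL))
```

Under these assumed numbers, a shallow hierarchy with aggressive internal 
polling stays within about 65 minutes of the publication point, while three 
tiers at the relaxed interval can lag by up to 3 hours.  The aggressive 
internal interval costs 288 fetches per day per child cache, but that load 
lands on the operator's own top-level cache rather than on the RIRs.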

3)  Granted, the following text is only a "SHOULD", but the text offers no 
reasoning as to why caches should be placed close to routers, i.e.: are there 
latency concerns (for the RPKI <-> cache protocol), or is it that a 
geographically distributed system is one way to avoid a 
single-point-of-failure, or something else entirely?  As a start, just defining 
"close" would help, e.g.: same POP, same (U.S.) state, same country, same 
timezone … but then a statement as to any latency or resiliency requirement 
for geographic deployment of RPKI caches would be useful.

    Furthermore, given the [very] loosely synchronized nature of the RPKI, 
should the text point out that the number of RPKI caches (internal to the 
organization) be balanced against the potential need of an organization to 
maintain a more tightly synchronized view, across their entire network, of 
validated routing information?  A concern might be that if routers in Continent 
A pull information from their RPKI caches that tell them that ROA is now 
"Invalid", but other routers in Continent B are still using 'older' information 
in RPKI caches in Continent B that says the same ROA is either "Not Found" or 
"Valid", then the result might be that BGP Path Selection swings all traffic 
from Continent A to Continent B.  At a minimum, this could lead to 
substantially increased latency or, at worst, congestion, packet loss or an 
unintended DoS.  
---snip---
   As RPKI-based origin validation relies on the availability of RPKI
   data, operators SHOULD locate caches close to routers that require
   these data and services.  A router can peer with one or more nearby
   caches.
---snip---
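
To make the Continent A / Continent B concern concrete, here is a minimal 
sketch in Python.  The LOCAL_PREF values, the region names and the 
`preferred_exits` helper are all assumptions for illustration; none of them 
come from the draft.  It assumes a policy where each border router sets 
LOCAL_PREF from the validation state reported by its nearby cache.

```python
# Assumed policy: LOCAL_PREF assigned per validation state (illustrative values).
LOCAL_PREF = {"valid": 110, "not-found": 100, "invalid": 50}


def preferred_exits(cache_view):
    """Return the regions whose border routers win on LOCAL_PREF.

    cache_view maps a region name to the validation state that region's
    RPKI cache currently reports for the same announcement.
    """
    best = max(LOCAL_PREF[state] for state in cache_view.values())
    return sorted(r for r, state in cache_view.items() if LOCAL_PREF[state] == best)


# Both caches agree: traffic can use either exit.
in_sync = preferred_exits({"A": "valid", "B": "valid"})

# Cache A has the newer data, cache B is still stale: only B's exit wins.
out_of_sync = preferred_exits({"A": "invalid", "B": "valid"})

print(in_sync, out_of_sync)
```

In the out-of-sync case every router in the AS prefers the exit in Continent B 
until the stale cache catches up, which is the traffic swing described above.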

In Section 5, "Routing Policy":
4)  From a practical standpoint, LOCAL_PREF is already widely used to influence 
Traffic Engineering, both by an SP and by the SP's customers (through the use 
of "TE communities" sent by a downstream customer to the SP).  Customers do 
the latter to influence traffic from the SP toward themselves, (e.g.: a 
customer prefers that a circuit be 'backup' for another circuit only if their 
other SP is not announcing that same prefix).  In reality, I think that there 
will have to be significant re-work of an SP's existing BGP policies to encode 
dual meanings inside a single LOCAL_PREF attribute, (route validity + TE 
preference).  It may be good to acknowledge this by recommending, in the text, 
something like:
====
    In the short-term, the LOCAL_PREF Attribute may be used to carry both the 
validity state of a prefix along with its Traffic Engineering 
characteristic(s).  It is likely that the SP will have to change their BGP 
policies such that they can encode these two, separate characteristics in the 
same BGP attribute without negatively impacting their existing use or leading 
to accidental privilege escalation attacks. 
====
---snip---
Some may choose to use the large Local-Preference hammer.
---snip---
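
One way to picture encoding both meanings in a single LOCAL_PREF is sketched 
below in Python.  The band layout, the constants and the `local_pref` helper 
are hypothetical, and the sketch makes a deliberate assumption that validity 
always outranks TE preference, which an operator may well not want.  The point 
is only that the customer-controlled TE value is clamped to its band, so a TE 
community cannot lift an Invalid route above a Valid one.

```python
# Hypothetical layout: each validation state owns a band of LOCAL_PREF values.
BAND_WIDTH = 100
BAND_BASE = {"invalid": 0, "not-found": 100, "valid": 200}


def local_pref(validity, te_offset):
    """Combine validation state and a TE preference into one LOCAL_PREF.

    te_offset is the value requested via the customer's TE community; it is
    clamped to [0, BAND_WIDTH - 1] so it can never cross into another band.
    """
    clamped = max(0, min(te_offset, BAND_WIDTH - 1))
    return BAND_BASE[validity] + clamped


# A customer asking for an absurdly high preference on an Invalid route
# still loses to a Valid route with the lowest TE preference.
escalation_attempt = local_pref("invalid", 9999)
valid_backup = local_pref("valid", 0)

print(escalation_attempt, valid_backup)
```

Within a band, existing TE behaviour (primary vs. backup) is preserved, but 
any policy that compares routes across bands changes, which is the re-work 
referred to above.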

5)  I have three comments on the below:
    a)  It's not clear, to me, what is meant by "internal metric" below.  Do 
you mean MED or IGP metric or something else?  I don't see IGP metric as being 
practical, so I'm assuming you mean additively altering MED (up|down) based on 
validity state.  Regardless, I would recommend you state more precisely which 
BGP Path Attribute you're referring to below.
    b)  Since MED is passed from one ASN to (only) a second, downstream ASN to 
influence ingress TE policy, is it "OK" from a security PoV that MED is a 
*trusted* means to convey ROA validity information from one ASN to a second?  
Presumably, the answer should be "heck, no", right?  If that's the case, then 
wouldn't it be wise to state that:
        i)  MEDs, encoded with any ROA validity information, should get reset 
on egress from an ASN to remove said validity information and only carry TE 
information, as appropriate; and,
        ii) MEDs should not be trusted on ingress to convey any meaning with 
respect to validity information?
    c)  What is meant by the statement, "might choose to let AS-Path rule"?  Is 
your intent to state that an SP may choose to just use MED, which follows after 
LOCAL_PREF & AS_PATH in the BGP Path Selection Algorithm, as a means of 
reflecting the validity of a particular prefix?  If so, then it would be much 
clearer if you just stated that, e.g.:
====
    If LOCAL_PREF is not used to convey validity information, then MED is 
likely the next best candidate BGP Attribute that can be used to influence path 
selection based on the validity of a particular prefix.  As with LOCAL_PREF, 
care must be taken that changes to the MED attribute do not create privilege 
escalation attacks.
====
---snip---
   […]  Others
   might choose to let AS-Path rule and set their internal metric, which
   comes after AS-Path in the BGP decision process.
---snip---
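
The MED handling suggested in 5b could look roughly like the Python sketch 
below.  It assumes "internal metric" means MED, and the penalty values and 
function names are hypothetical.  Validity is added to MED only inside the AS, 
stripped on egress, and never inferred from a MED received from a neighbor.

```python
# Assumed additive penalties; lower MED is preferred.
VALIDITY_PENALTY = {"valid": 0, "not-found": 100, "invalid": 1000}


def internal_med(te_med, validity):
    """MED used inside the AS: TE value plus a locally derived penalty."""
    return te_med + VALIDITY_PENALTY[validity]


def egress_med(te_med):
    """MED sent to a neighbor AS carries TE only; the penalty is not exported."""
    return te_med


def ingress_med(received_med, local_validity):
    """Treat a received MED purely as TE input.

    The validation state comes from our own caches, never from the
    neighbor's MED value.
    """
    return internal_med(received_med, local_validity)


inside = internal_med(10, "invalid")
exported = egress_med(10)
# A neighbor sending an inflated MED does not make the route "invalid" here.
received = ingress_med(1010, "valid")

print(inside, exported, received)
```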



Other Comments:
6)  Related to #5, above, BGP Communities are another transitive attribute that 
/might/ be used to convey validity information of a prefix, or lack thereof, 
from one ASN to a second ASN (or more).  However, as we know, there is no 
means to authenticate BGP Attributes, from one ASN to the next.  So, from a 
security hygiene perspective, would it be best to say something along the lines 
of:
====
The validity state of routes MUST NOT be transmitted beyond the borders of an 
SP's ASN, since: a) there is no means to authenticate BGP Attributes; and, b) 
this would place hidden dependencies on the ability of the upstream ASN to 
validate routes and pass them along to others, which would increase the 
fragility of the overall system.  Finally, ASNs MUST NOT rely on BGP Attributes 
received on an eBGP session to convey any meaning with respect to validity of a 
particular prefix, for the reasons just stated.
====
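
As an illustration of the proposed text, the Python sketch below scrubs 
validity-marking communities at the AS border in both directions.  The 
community values are hypothetical (AS 64500 is from the range reserved for 
documentation), as are the function names.

```python
# Hypothetical communities an operator might use internally to tag validity.
VALIDITY_COMMUNITIES = frozenset({"64500:100", "64500:101", "64500:102"})


def scrub(communities):
    """Remove any validity-marking communities, keeping everything else."""
    return [c for c in communities if c not in VALIDITY_COMMUNITIES]


def on_ebgp_ingress(communities, local_validity_tag):
    """Drop whatever validity tags the neighbor sent, then apply our own.

    local_validity_tag must come from local validation against our caches.
    """
    return scrub(communities) + [local_validity_tag]


def on_ebgp_egress(communities):
    """Never export validity state beyond the AS border."""
    return scrub(communities)


# Neighbor claims "valid" (64500:100); our own caches say otherwise.
received = on_ebgp_ingress(["64500:100", "64496:20"], "64500:102")
exported = on_ebgp_egress(received)

print(received, exported)
```

Other communities, such as the neighbor's TE community in the example, pass 
through untouched; only the validity tags are stripped and re-derived locally.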

7)  Is this document only intended (scoped?) to cover PEs that can (or, 
eventually, will) speak the RPKI-RTR protocol for validation?  Or is this 
document intended to also cover PEs that do not speak RPKI-RTR?  Those PEs 
would obviously need some other mechanism, (e.g.: periodically pushing an 
updated config to them based on RPKI validated data), in order that they could 
influence the policy applied to valid routes in a way that is consistent with 
other, more modern routers that do run the RPKI-RTR protocol.  If so, wouldn't 
it be good to suggest this, even if only as a means to increase the deployment 
speed?  Or, to at least let readers know that this needs to be considered 
during their deployment so that they can factor in the load on their [existing] 
systems that might do this work as well as the effects of the 'loosely 
synchronized' aspects of the RPKI?
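
A rough Python sketch of the "push an updated config" idea for PEs that do not 
speak RPKI-RTR follows.  The input tuple layout (prefix, max length, origin 
ASN) and the output line format are assumptions; the output is deliberately 
not any vendor's real syntax, and the prefixes and ASN are documentation 
values.

```python
import ipaddress


def render_filter(vrps):
    """Turn validated (prefix, max_len, origin_asn) tuples into filter lines.

    The "permit ... le ... origin AS..." format is a neutral placeholder,
    not real router configuration syntax.
    """
    entries = []
    for prefix, max_len, asn in vrps:
        net = ipaddress.ip_network(prefix)  # raises ValueError if malformed
        if not net.prefixlen <= max_len <= net.max_prefixlen:
            raise ValueError(f"bad max length {max_len} for {net}")
        sort_key = (net.version, int(net.network_address), net.prefixlen, max_len, asn)
        entries.append((sort_key, f"permit {net} le {max_len} origin AS{asn}"))
    # Stable ordering keeps successive pushes diff-friendly.
    return [line for _, line in sorted(entries)]


lines = render_filter([
    ("2001:db8::/32", 48, 64500),
    ("192.0.2.0/24", 24, 64500),
])
for line in lines:
    print(line)
```

Every regeneration and push adds another polling step between the RPKI and the 
router, so such PEs will generally lag further behind than RPKI-RTR speakers 
fed from the same cache.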

-shane


On Oct 28, 2011, at 7:59 AM, Christopher Morrow wrote:
> Two folks seem to have given this a read-through, is that all the
> interest that exists? is documenting how originators of routes ought
> to think/use/abuse RPKI not something we should do here?
> 
> please chime in if you've given this a read and are onboard with it
> moving forward.
> 
> -chris
> 
> On Sat, Oct 15, 2011 at 12:22 AM, Randy Bush <[email protected]> wrote:
>>>> What's the rationale of this change from version 10 to 11?
>>> after much discussion with ops and security folk, it is the purpose of
>>> the whole exercise.  you wanna stop 7007?
>> 
>> fwiw, it has swung back and forth a few times
>> 
>> randy
>> _______________________________________________
>> sidr mailing list
>> [email protected]
>> https://www.ietf.org/mailman/listinfo/sidr
>> 
