Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
Dear all,

I'm very happy to see the direction this conversation has taken; it seems we've moved on towards focussing on solutions and outcomes - this is encouraging.

On Mon, Oct 01, 2018 at 05:44:17PM +0100, Nick Hilliard wrote:
> John Curran wrote on 01/10/2018 00:21:
> > There is likely some on the nanog mailing list who have a view on
> > this matter, so I pose the question of "who should be responsible"
> > for consequences of RPKI RIR CA failure to this list for further
> > discussion.
>
> other replies in this thread have assumed that RPKI CA failure modes
> are restricted to loss of availability, but there are others failure
> modes, for example:
>
> - fraud: rogue CA employee / external threat actor signs ROAs
> illegitimately
>
> - negligence: CA accidentally signs illegitimate ROAs due to e.g.
> software bug
>
> - force majeure: e.g. court orders CA to sign prefix with AS0,
> complicated by NIR RPKI delegation in jurisdictions which may have
> difficult relations with other parts of the world.
>
> These types of situations are well-trodden territory for other types
> of PKI CA, where users
>
> Otherwise, as other people have pointed out, catastrophic systems
> failure at the CA is designed to be fail-safe. I.e. if the CA goes
> away, ROAs will be evaluated as "unknown" and life will continue on.
> If people misconfigure their networks and do silly things with this
> specific failure mode, that's their problem. You can't stop people
> from aiming guns at their feet and pulling the trigger.

There are a number of failure modes and I believe the operational community has yet to fully explore how to mitigate most risks. Over time I expect we'll develop BCPs on how to improve the robustness of the system; these BCPs can only come into existence driven by actual operational experience.

A positive development that addresses some aspects of the concerns raised is Certificate Transparency.
Cloudflare set up a CT log (https://groups.google.com/forum/#!topic/certificate-transparency/_deL5iGB5sY) and I hope others like Google will also consider doing this. CT is a great tool to help keep the roots performing in line with community expectations.

I consider it the operator community's responsibility to figure out how to deal with outages. I don't intend to hold the RIRs liable - we'll need to learn to protect ourselves.

Kind regards,

Job
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1 Oct 2018, at 9:44 AM, Nick Hilliard wrote: > > John Curran wrote on 01/10/2018 00:21: >> There is likely some on the nanog mailing list who have a view on this >> matter, so I pose the question of "who should be responsible" for >> consequences of RPKI RIR CA failure to this list for further discussion. > > other replies in this thread have assumed that RPKI CA failure modes are > restricted to loss of availability, but there are others failure modes, for > example: > > - fraud: rogue CA employee / external threat actor signs ROAs illegitimately > > - negligence: CA accidentally signs illegitimate ROAs due to e.g. software bug > > - force majeure: e.g. court orders CA to sign prefix with AS0, complicated by > NIR RPKI delegation in jurisdictions which may have difficult relations with > other parts of the world. Nick - Agreed… My question was specific to liability consequential to an operational outage of an RIR CA, since the community’s view of the proper allocation of liability from loss of availability will significantly shape the necessary legalities. (Liability from fraud or gross negligence is unlikely to respect such terms in any case) > Otherwise, as other people have pointed out, catastrophic systems failure at > the CA is designed to be fail-safe. I.e. if the CA goes away, ROAs will be > evaluated as "unknown" and life will continue on. If people misconfigure > their networks and do silly things with this specific failure mode, that's > their problem. One would expect as much (i.e. it’s their problem for networks doing silly things), but we’ve heard some folks suggest it should be the RIR's problem (given the RIR CA's role in triggering events by going unavailable.) Thanks! /John John Curran President and CEO ARIN
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
John Curran wrote on 01/10/2018 00:21:
> There is likely some on the nanog mailing list who have a view on this matter, so I pose the question of "who should be responsible" for consequences of RPKI RIR CA failure to this list for further discussion.

other replies in this thread have assumed that RPKI CA failure modes are restricted to loss of availability, but there are other failure modes, for example:

- fraud: rogue CA employee / external threat actor signs ROAs illegitimately

- negligence: CA accidentally signs illegitimate ROAs due to e.g. software bug

- force majeure: e.g. court orders CA to sign prefix with AS0, complicated by NIR RPKI delegation in jurisdictions which may have difficult relations with other parts of the world.

These types of situations are well-trodden territory for other types of PKI CA, where users

Otherwise, as other people have pointed out, catastrophic systems failure at the CA is designed to be fail-safe. I.e. if the CA goes away, ROAs will be evaluated as "unknown" and life will continue on. If people misconfigure their networks and do silly things with this specific failure mode, that's their problem. You can't stop people from aiming guns at their feet and pulling the trigger.

Nick
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1 Oct 2018, at 1:20 AM, Mark Tinka <mark.ti...@seacom.mu> wrote:
> On 1/Oct/18 01:21, John Curran wrote:
>> It is possible to architect the various legalities surrounding RPKI to support any of the above outcomes, but it first requires a shared understanding of what the network community believes is the correct outcome. There is likely some on the nanog mailing list who have a view on this matter, so I pose the question of "who should be responsible" for consequences of RPKI RIR CA failure to this list for further discussion.
>
> John, in the instance where all RIR's transition to a single "All Resource" TA, what would, in your mind, be the (potential) liability considerations?

Mark -

If there were to be an RIR CA outage, it would not appear that the RIRs' use of “All Resources” TAs would materially affect the resulting operational impact to the Internet. (As noted earlier, the impact would be predominantly proportional to the number of ISPs that fail to follow best practices in route processing and fall back properly when their received routes end up with status NotFound, i.e. no longer match against their cache of validated ROAs since the cache has expired.)

The “All Resources” TA used by each RIR is done to avoid CA invalidation due to overclaiming (as detailed in https://datatracker.ietf.org/doc/rfc8360) – it reduces the probability of a different and hopefully rare RPKI failure scenario (involving the possible accidental invalidation of an RIR CA) until such time as a slightly different RPKI validation algorithm can be deployed that would limit any such invalidation solely to the resources in the overlap.

(That’s my high-level understanding of the situation; comments on this question from those closer to the actual network bits would be most welcome…)

Thanks!
/John

John Curran
President and CEO
ARIN
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1/Oct/18 14:23, Jason Lixfeld wrote: > > > I need to swap out the wheels on my car. I think I know better than > to read the manual to, say, understand how much torque I should apply > to each bolt, or what pattern I should use when tightening the bolts. > Or, I read the manual but decide it’s too hard to understand, and I > don’t ask for help in clearing up some of the grey areas. > > I change the wheels anyway. In the end, it looks right. They roll. > Meh. All good. > > Then the wheels fall off. > > There is absolutely no one to blame for any of that but me. > > In my view, I see no difference here. As with anything else operators need to be responsible for their networks when running them. If I want to participate in the BGP on the Internet, I need to learn how to run BGP. If my part of the BGP breaks my network or those of others because I did not school myself on BGP, it's no one else's fault but mine. I can't blame the IETF for this. There is plenty of text freely available on the Internet about RPKI. In fact, I'd go as far as saying all RIR's have been running RPKI workshops for years. Mark.
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
> On Oct 1, 2018, at 4:36 AM, Mark Tinka wrote: > > On 1/Oct/18 10:26, John Curran wrote: > >> Indeed… Hence the question of liability during a RIR CA outage, should the >> liability for misconfigured ISPs (those handful of ISPs who do not properly >> fall back to using state NotFound routes) be the responsibility of each ISP, >> or perhaps those who announce ROAs, or should be with the RIR? > > Any equipment misconfigurations should be the responsibility of the operator. ^^ > Responsibility for ROA's should lie with the resource holder, in ensuring > that not only is the information true, but that also all announced prefixes > are covered by a ROA. ^^ I need to swap out the wheels on my car. I think I know better than to read the manual to, say, understand how much torque I should apply to each bolt, or what pattern I should use when tightening the bolts. Or, I read the manual but decide it’s too hard to understand, and I don’t ask for help in clearing up some of the grey areas. I change the wheels anyway. In the end, it looks right. They roll. Meh. All good. Then the wheels fall off. There is absolutely no one to blame for any of that but me. In my view, I see no difference here.
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1/Oct/18 10:26, John Curran wrote: > > Of course, this presumes correct routing configuration by the ISP when > setting up RPKI route validation; while one would hope that the vast majority > handle this situation correctly, there is no assurance that will be true > without exception. If RPKI routing validation is widely deployed, tens of > thousands of ISPs will be setting up such a configuration, with customer > impact during an RPKI CA outage occurring for those who somehow failure to > fall back to using NotFound routes. If only a small percentage get this > wrong, it will still represent dozens of ISPs going dark as a result. It is equally important to understand how vendors have interpreted the RFC for default treatment of RPKI data. When we started testing IOS and IOS XE back in 2014/2015, we hit an issue where Cisco were automatically applying policy to RPKI state without configuration from the operator. This was fixed in later code, but goes to show that one should not assume that vendors are always doing the right thing, or at the very least, fully understand what their view on RPKI might be on the wider Internet, in real production. So before deploying network-wide, I encourage operators to test what their equipment will do when RPKI is enabled but without any manual policy applied. > Indeed… Hence the question of liability during a RIR CA outage, should the > liability for misconfigured ISPs (those handful of ISPs who do not properly > fall back to using state NotFound routes) be the responsibility of each ISP, > or perhaps those who announce ROAs, or should be with the RIR? Any equipment misconfigurations should be the responsibility of the operator. Responsibility for ROA's should lie with the resource holder, in ensuring that not only is the information true, but that also all announced prefixes are covered by a ROA. An RIR CA outage would, in my mind, be the responsibility of the RIR. 
But this comes back to my question of how this is handled with an "all resource" TA.

Mark.
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On Mon, Oct 01, 2018 at 09:47:43AM +0200, Alex Band wrote: > Hello, > > To avoid any misunderstanding in this discussion going forward, I would > like to reiterate that an RPKI ROA is a positive attestation. An > unavailable, expired or invalid ROA will result in a BGP announcement > with the status NotFound. The announcement will *not* become INVALID, > thereby being dropped. > > Please read Section 5 of RFC 7115 that John linked carefully: > > Bush Best Current Practice [Page 7] > > RFC 7115 RPKI-Based Origin Validation OpJanuary 2014 > > > Announcements with NotFound origins should be preferred over those > with Invalid origins. > > Announcements with Invalid origins SHOULD NOT be used, but may be > used to meet special operational needs. In such circumstances, the > announcement should have a lower preference than that given to Valid > or NotFound. > > Thus, a continued outage of an RPKI CA (or publication server) will > result in announcements with status NotFound. This means that the > prefixes held by this CA will no longer benefit from protection by the > RPKI. However, since only *invalid* announcements should be dropped, > this should not lead to large scale outages in routing. > > It is important to be aware of the impact of such an outage when > considering questions of liability. This depends if the prefix in question is covered by another ROA. Because in that case it is well possible that the prefix is marked INVALID. This is especially an issue if a partial failure of a publication server is taking out the more specifics but leaves a large covering ROA (maybe even one with origin AS 0). In the end from a security standpoint it is probably better to fail closed because the alternative is no RPKI and then hijacks become possible and MITM attacks or DNS spoofing can be done leaving every Internet user at risk. Also consider that not using best common practice to protect a service is also putting you at risk for liability charges. 
So ignoring RPKI because of possible liability concerns may well backfire on you.

--
:wq Claudio
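As a rough illustration of the point about covering ROAs, here is a minimal Python sketch of origin validation in the style of RFC 6811. The VRP class, the validate function, the prefixes and the AS numbers are all made up for illustration; real relying-party software does considerably more (certificate chain checks, manifests, expiry handling).

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class VRP:
    """Validated ROA Payload: prefix, maximum length, authorized origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: list) -> str:
    """Classify a route as Valid, Invalid or NotFound against a set of VRPs."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        roa_net = ip_network(vrp.prefix)
        # Skip VRPs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # AS 0 never matches any origin; it only marks space as "do not route".
        if (vrp.asn != 0 and vrp.asn == origin_asn
                and route.prefixlen <= vrp.max_length):
            return "Valid"
    # Covered by at least one VRP but matched by none: Invalid.
    # Covered by no VRP at all: NotFound.
    return "Invalid" if covered else "NotFound"


# Hypothetical data: a specific ROA plus a large covering AS 0 ROA.
specific = VRP("198.51.100.0/24", 24, 64500)
covering_as0 = VRP("198.51.0.0/16", 16, 0)

all_published = [specific, covering_as0]   # normal operation
partial_failure = [covering_as0]           # the more specific ROA is lost
total_failure = []                         # nothing left in the cache
```

With these inputs, the announcement 198.51.100.0/24 from AS 64500 is Valid while both ROAs are present, Invalid when only the covering AS 0 ROA survives, and NotFound when the cache is empty. Only the last case is the fail-open scenario discussed elsewhere in this thread.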
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1 Oct 2018, at 12:47 AM, Alex Band wrote:
>
> Hello,
>
> To avoid any misunderstanding in this discussion going forward, I would like
> to reiterate that an RPKI ROA is a positive attestation. An unavailable,
> expired or invalid ROA will result in a BGP announcement with the status
> NotFound. The announcement will *not* become INVALID, thereby being dropped.
>
> Please read Section 5 of RFC 7115 that John linked carefully:
> ...
>
> Thus, a continued outage of an RPKI CA (or publication server) will result in
> announcements with status NotFound. This means that the prefixes held by this
> CA will no longer benefit from protection by the RPKI. However, since only
> *invalid* announcements should be dropped, this should not lead to large
> scale outages in routing.

Alex -

Yes – ISPs who have configured RPKI route validation and are using it to preference routes should continue to utilize routes that have NotFound status due to lack of RPKI repository data. As RFC 7115 notes -

" Hence, an operator's policy should not be overly strict and should prefer Valid announcements; it should attach a lower preference to, but still use, NotFound announcements, and drop or give a very low preference to Invalid announcements. "

Of course, this presumes correct routing configuration by the ISP when setting up RPKI route validation; while one would hope that the vast majority handle this situation correctly, there is no assurance that will be true without exception. If RPKI routing validation is widely deployed, tens of thousands of ISPs will be setting up such a configuration, with customer impact during an RPKI CA outage occurring for those who somehow fail to fall back to using NotFound routes. If only a small percentage get this wrong, it will still represent dozens of ISPs going dark as a result.

> It is important to be aware of the impact of such an outage when considering
> questions of liability.
Indeed… Hence the question of liability during a RIR CA outage, should the liability for misconfigured ISPs (those handful of ISPs who do not properly fall back to using state NotFound routes) be the responsibility of each ISP, or perhaps those who announce ROAs, or should be with the RIR? Thanks! /John John Curran President and CEO ARIN
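The fallback behaviour under discussion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local-preference values, the function names and the route counts are hypothetical, and nothing here is taken from any vendor's implementation. It contrasts a policy along the lines of the RFC 7115 text quoted above with an overly strict one that accepts only Valid routes.

```python
from typing import Callable, Optional

# Hypothetical local-preference values, chosen only for this example.
PREF_VALID = 200
PREF_NOTFOUND = 100


def rfc7115_style_policy(state: str) -> Optional[int]:
    """Prefer Valid, still use NotFound, drop Invalid (None means reject)."""
    if state == "Valid":
        return PREF_VALID
    if state == "Invalid":
        return None
    # NotFound, and any state we do not recognise, is still used.
    return PREF_NOTFOUND


def overly_strict_policy(state: str) -> Optional[int]:
    """Misconfiguration: anything that is not Valid is rejected."""
    return PREF_VALID if state == "Valid" else None


def usable_routes(policy: Callable[[str], Optional[int]], states: list) -> int:
    """Count how many received routes the policy would accept."""
    return sum(1 for state in states if policy(state) is not None)


# Ten hypothetical routes: in normal operation most validate; once the
# relying party's cached data has expired, every route becomes NotFound.
normal_operation = ["Valid"] * 8 + ["NotFound"] * 2
during_ca_outage = ["NotFound"] * 10
```

Under these assumptions the RFC 7115 style policy keeps all ten routes in both situations, while the overly strict policy keeps eight in normal operation and none during the outage, which is the "going dark" case the liability question is about.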
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1/Oct/18 01:21, John Curran wrote: > > > > It is possible to architect the various legalities surrounding RPKI to > support any of the above outcomes, but it first requires a shared > understanding of what the network community believes is the correct > outcome. There is likely some on the nanog mailing list who have a > view on this matter, so I pose the question of "who should be > responsible" for consequences of RPKI RIR CA failure to this list for > further discussion. John, in the instance where all RIR's transition to a single "All Resource" TA, what would, in your mind, be the (potential) liability considerations? Mark.
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
On 1/Oct/18 09:47, Alex Band wrote: > > Thus, a continued outage of an RPKI CA (or publication server) will result in > announcements with status NotFound. This means that the prefixes held by this > CA will no longer benefit from protection by the RPKI. However, since only > *invalid* announcements should be dropped, this should not lead to large > scale outages in routing. Indeed, and this is on the basis that operators are not overzealous about aggressively acting against a "NotFound" RPKI state. Mark.
Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
Hello,

To avoid any misunderstanding in this discussion going forward, I would like to reiterate that an RPKI ROA is a positive attestation. An unavailable, expired or invalid ROA will result in a BGP announcement with the status NotFound. The announcement will *not* become INVALID, thereby being dropped.

Please read Section 5 of RFC 7115 that John linked carefully:

   Announcements with NotFound origins should be preferred over those
   with Invalid origins.

   Announcements with Invalid origins SHOULD NOT be used, but may be
   used to meet special operational needs. In such circumstances, the
   announcement should have a lower preference than that given to Valid
   or NotFound.

Thus, a continued outage of an RPKI CA (or publication server) will result in announcements with status NotFound. This means that the prefixes held by this CA will no longer benefit from protection by the RPKI. However, since only *invalid* announcements should be dropped, this should not lead to large scale outages in routing.

It is important to be aware of the impact of such an outage when considering questions of liability.

Kind regards,

Alex Band
NLnet Labs

> On 1 Oct 2018, at 01:21, John Curran wrote:
>
> Folks -
>
> Perhaps it would be helpful to confirm that we have common goals in the
> network operator community regarding RPKI, and then work from those goals on
> the necessary plans to achieve them.
>
> It appears that many network operators would like to improve the integrity of
> their network routing via RPKI deployment.
The Regional Internet Registries > (RIRs) have all worked to support RPKI services, and while there are > different opinions among operators regarding the cost/benefit tradeoffs of > RPKI Route Origin Validation (ROV), it is clear that we have to collectively > work together now if we are ever to have overall RPKI deployment sufficient > to create the network effects that will ensure compelling long-term value for > its deployment. > > Let’s presume that we’ve achieved that very outcome at some point in future; > i.e. we’re have an Internet where nearly all network operators are publishing > Route Origin Authorizations (ROAs) via RIR RPKI services and are using RPKI > data for route validation. It is reasonable to presume that over the next > decade the Internet will become even more pervasive in everyday life, > including being essential for many connected devices to function, and relied > upon for everything from daily personal communication and conducting business > to even more innovative uses such as payment & sale systems, delivery of > medical care, etc. > > Recognizing that purpose of RPKI is improve integrity of routing, and not add > undo fragility to the network, it is reasonable to expect that many network > operators will take due care with the introduction of route validation into > their network routing, including best practices such as falling back > successfully in the event of unavailability of an RIR RPKI Certificate > Authority (CA) and resulting cache timeouts. It is also reasonable expect > that RIR RPKI CA services are provisioned with appropriate robustness of > systems and controls that befit the highly network-critical nature of these > services. 
> > Presuming we all share this common goal, the question that arises is whether > we have a common vision regarding what should happen when something goes > wrong in this wonderful RPKI-rich Internet of the future… More than anyone, > network operators realize that even with excellent systems, procedures, and > redundancy, outages can (and do) still occur. Hopefully, these are quite > rare, and limited to occasions where Murphy’s Law has somehow resulted in > nearly unimaginable patterns of coincident failures, but it would > irresponsible to not consider the “what if” scenarios for RPKI failure and > whether there is shared vision of the resulting consequences. > > In particular, it would be good to consider the case of an RIR RPKI CA system > failure, one sufficient to result in widespread cache expirations for relying > parties. Ideally, we will never have to see this scenario when RPKI is > widely deployed, but it also not completely inconceivable that an RIR RPKI CA > experience such an outage [1]. For network operators following reasonable > deployment practices, an RIR RPKI CA outage should result in a fallback to > unvalidated network routing data and no significant network impacts. > However, it’s likely not a reasonable assumption that all network operators > will have properly designed and implemented best practices in this regard, so > there will very likely be some networks that experience significant impacts > consequential to any RIR RPKI CA outage. Even if this is only 1 or 2 percent > of net
Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)
Folks -

Perhaps it would be helpful to confirm that we have common goals in the network operator community regarding RPKI, and then work from those goals on the necessary plans to achieve them.

It appears that many network operators would like to improve the integrity of their network routing via RPKI deployment. The Regional Internet Registries (RIRs) have all worked to support RPKI services, and while there are different opinions among operators regarding the cost/benefit tradeoffs of RPKI Route Origin Validation (ROV), it is clear that we have to collectively work together now if we are ever to have overall RPKI deployment sufficient to create the network effects that will ensure compelling long-term value for its deployment.

Let’s presume that we’ve achieved that very outcome at some point in the future; i.e. we have an Internet where nearly all network operators are publishing Route Origin Authorizations (ROAs) via RIR RPKI services and are using RPKI data for route validation. It is reasonable to presume that over the next decade the Internet will become even more pervasive in everyday life, including being essential for many connected devices to function, and relied upon for everything from daily personal communication and conducting business to even more innovative uses such as payment & sale systems, delivery of medical care, etc.

Recognizing that the purpose of RPKI is to improve the integrity of routing, and not to add undue fragility to the network, it is reasonable to expect that many network operators will take due care with the introduction of route validation into their network routing, including best practices such as falling back successfully in the event of unavailability of an RIR RPKI Certificate Authority (CA) and resulting cache timeouts. It is also reasonable to expect that RIR RPKI CA services are provisioned with appropriate robustness of systems and controls that befit the highly network-critical nature of these services.
Presuming we all share this common goal, the question that arises is whether we have a common vision regarding what should happen when something goes wrong in this wonderful RPKI-rich Internet of the future… More than anyone, network operators realize that even with excellent systems, procedures, and redundancy, outages can (and do) still occur. Hopefully, these are quite rare, and limited to occasions where Murphy’s Law has somehow resulted in nearly unimaginable patterns of coincident failures, but it would be irresponsible to not consider the “what if” scenarios for RPKI failure and whether there is a shared vision of the resulting consequences.

In particular, it would be good to consider the case of an RIR RPKI CA system failure, one sufficient to result in widespread cache expirations for relying parties. Ideally, we will never have to see this scenario when RPKI is widely deployed, but it is also not completely inconceivable that an RIR RPKI CA could experience such an outage [1]. For network operators following reasonable deployment practices, an RIR RPKI CA outage should result in a fallback to unvalidated network routing data and no significant network impacts. However, it’s likely not a reasonable assumption that all network operators will have properly designed and implemented best practices in this regard, so there will very likely be some networks that experience significant impacts consequential to any RIR RPKI CA outage. Even if this is only 1 or 2 percent of network operators with such configuration issues, it will mean hundreds of ISP outages occurring simultaneously throughout the Internet and millions of customers (individuals and businesses) affected globally.

While the Internet is the world’s largest cooperative endeavor, there inevitably will be many folks impacted by an RIR RPKI outage, including some asking (appropriately) the question of “who should bear responsibility” for the harm that they suffered.
It is worth understanding what the network community believes is the most appropriate answer to this question, since a common outlook on this question can be used to guide implementation details to match. Additionally, a common understanding on this question will provide real insight into how the network community intends risk of the system to be distributed among the participants. There are several possible options worth considering:

A) The most obvious answer for the party that should be held liable for the impacts that result from an RPKI CA failure would be the respective RIR that experienced the outage. This seems rather straightforward until one considers that the RIRs are providing these services specifically noting that they may not be (despite all precautions) available 100 percent of the time, and with clearly documented expectations that those relying on RPKI CA information for routing origin validation should fall back to routing with not-validated state [2]. The impacted parti