Hey Tim,

My turn to apologize for the delay. I have taken your very excellent feedback and tried to address it in a new revision of the tech note. That should be coming out soon. Please do let me know if it seems off again!
Eric

On Nov 23, 2012, at 11:26 AM, Tim Bruijnzeels wrote:

> Hi Eric,
>
> Sorry for the late response..
>
> With regards to some of your comments about requirements and ideas for alternative deltas, some responses inline below; other than that, see the thread where both Bryan Weber and I talk about our ideas. This is somewhat separate from the questions about "how a fully deployed RPKI tree would look", "what churn we might see", and "how many connecting RPs, and how frequently". All that can feed into requirements, but it is in essence another discussion, so I suggest we keep them separate.
>
> On Nov 16, 2012, at 10:44 PM, Eric Osterweil <[email protected]> wrote:
>
>> On Nov 16, 2012, at 10:45 AM, Tim Bruijnzeels wrote:
>>
>>> Hi,
>>>
>>> Some more comments on the numbers and formula..
>>>
>>> On Nov 15, 2012, at 5:36 AM, Arturo Servin <[email protected]> wrote:
>>
>> <snip>
>>
>>> Apart from any signed objects it may publish, every CA typically has:
>>> = 1 certificate published by its parent
>>
>> But, as I asked Arturo, would we expect to have a CA from each parent (i.e. each RIR that an org may have allocations from)? While this may often still just be 1, it seems important to note, no? That was one reason we had for breaking it apart. I'm more than willing to believe that I'm wrong here, but I'd like to understand how.
>>
>>> = 1 manifest
>>> = 1 CRL
>>> = 1 GB record (as Arturo said, not widely deployed, but let's throw it in for full deployment)
>>
>> How does the RPKI certify ASN allocations? This is needed to certify router EE certs (they are tied to an ASN, no?). Such an AS cert would add another 1 to the above and make it 5, right? Otherwise, how does a forward signing peer associate itself with the ASN, instead of some prefix? The ASN isn't necessarily allocated by the same RIR as any of the allocations, so IP allocation and ASN allocation are orthogonal, no?
>
> Okay, let me rephrase..
> It gets confusing when talking about organisations as CAs versus CA certificates, i.e. certificates that have the CA bit set and can sign and publish other certificates, signed objects, etc.
>
> We're talking 4 objects for each CA certificate:
> = 1 CA certificate with IPv4, IPv6 and/or ASN resources
> = 1 mft
> = 1 CRL
> = 1 GB
>
> An organisation may have more than one CA certificate:
> = if it has different parents and gets certificates from more than one
> = if it has 'minority space'. For example, if one of our members has resources for which another RIR is the majority holder, then this is not on our normal certificate. We get a certificate with all such resources signed by the other RIR, and then we use this to sign an additional certificate for this member. Certificates can have only one parent, so we cannot merge this.
>
> Regarding the second point: I haven't done detailed analysis on our data yet. This does affect at least a portion of our members, so the average number of CA certs per member organisation is more than 1. I expect that it will not be a *lot* higher, but like I said I haven't analysed the data yet.
>
> Having said that, it seems that the relative contribution of these, in a sense, boilerplate objects to the total count is of a different order than the number of expected ROAs and router certificates.
>
>>> So that's 4 objects.
>>>
>>> During a key roll the CA will have the following additional objects:
>>> = 1 cert published by the parent
>>> = 1 manifest
>>> = 1 CRL
>>>
>>> Making 7 objects. But typically not all CAs roll at the same time.
>>
>> Unless it is an algorithm rollover, and that is expected to last for years (iirc). Then this set would be doubled (plus double the numbers below), right?
>
> I did not consider algorithm rollover, but I think you're right.
>
>>> The number of signed ROAs and Router certificates does,
>>
>> And EE certs.
>> While 1:1 with ROAs, they require additional (very different) processing, especially if you start down the road of HSMs. So, we claimed this additional operational requirement means that even if you double up on the downloads, those are still two separate objects. You have to manage EE rollover, keeping crypto material the same or changing it, depending on details of the ROA, etc. That won't come for free, and (again) needing HSMs makes this a big deal. So, we really felt it was important to call this complexity out by counting each.
>
> I don't understand the use of EE certs and HSMs here. And I don't see how this can significantly raise the object count.
>
> In the RPKI, EE certs can only be used to sign RPKI signed objects, and they are embedded in those objects, not published separately.
>
> All keys in our online system are protected by HSMs. The keys in EE certificates even more so: they are generated in an HSM, used only once, and then forgotten. In the online system the HSM protects against key theft if an attacker somehow gets access to the database/file system. They will not be able to export the keys without access to a quorum of key cards that protect the internal key of the HSM.
>
> If you want to use an *offline* key protected by an HSM, then you would have an additional CA cert (and mft, CRL and GB). This is what we do with our TA at least:
>
> TA (offline) -> signs CA cert for online use -> signs member CAs
>
> The idea being that if the worst happens to our online system, we can rebuild and re-issue using the more secure offline key. The hosted members don't have their own offline key; they are protected by the same one.. Non-hosted CAs may want to use an intermediate offline certificate. This adds some overhead: 4 objects per offline key (CA cert, mft, CRL, GB).
>>> in my opinion, not depend on the number of CAs, but on:
>>> = ROA -> the number of announcements seen in BGP * some aggregation factor (1 / average # of prefixes on one ROA)
>>
>> I pretty much agree with this (as I think the tech-note said). I do, however, have to note that with MOAS, you need multiple ROAs. Small point, but worth stating. :)
>
> Yes, a ROA can have one ASN only, but multiple prefixes.
>
> We aggregate prefixes for the same ASN as much as we can on a single ROA. So far we are managing a factor of around 3 prefixes per ROA. Lacnic is similar. The other RIRs seem to aggregate a bit less at this time.
>
> See here:
> http://certification-stats.ripe.net/
>
>>> = Router certs -> the number of ASNs * the number of keys for each ASN
>>
>> \times the number of eBGP speakers, you mean, right?
>
> Yes. My model assumes that the number of speakers can be related to the number of ASNs.
>
> Randy suggested a number of physical bgpsec-speaking routers, and that they may have two keys. That may well be a better model.
>
>>> So I think a better model would be to say:
>>>
>>> number    object
>>> #CA       CA cert
>>> #CA       MFT
>>> #CA       CRL
>>> #CA       GB
>>
>> Ack, we just estimated this as the # of SIAs, and then varied it from 5 to 42,000.
>
> 5 here would mean the current RIR TAs without any other signed content.
>
> The total object count for this depends on the number of CA certs. See my previous best estimate below.
>
>>> #prefixes * X    ROAs
>>
>> Yeah, but we didn't guesstimate the $X$ value. It sounds like we should, but is there any data we can use to do so?
>>> #ASNs * Y    Router certs
>>>
>>> Ototal = 4 * #CA + #prefixes * X + #ASNs * Y
>>
>> Re: the above, I think this would be
>>
>> O^Total = 4 * #CA + #ASNs + #prefixes + X * #prefixes + Y * #ASNs
>>
>> We had called out the need for an AS EE cert (which was not in the equation you outlined), and we felt it was important not to omit EE certs (if for no other reason than the operational complexity they bring).
>>
>>> As for the numbers.. this is a bit of a guessing game.. we just really don't know at this time. We can take our best guess, but should keep in mind that our best guess is probably off and needs re-evaluation in years to come.
>>
>> 100% agree. That is why we called this a back-of-the-envelope calculation; we are totally seeking feedback, and are absolutely interested in pushing revisions as things evolve.
>>
>>> #CAs
>>>
>>> If this were the total number of current members for all RIRs, this number would be around 40k. However, there are also PI users that are not direct members of the RIRs, and some members will delegate some of their resources further. For reference, I believe that in the RIPE region we have around 25k PI prefixes. I expect that a lot of the organisations that hold these resources will be happy to let a sponsoring RIR member (LIR in our region) manage their ROAs. But not all.. So I think that in a full-deployment world this number may be significantly bigger. If anyone has any ideas on this, please chime in… Going on nothing more than gut feeling, I would say the total could be in the order of:
>>>
>>> = 40k RIR members plus 40k self-managing PI holders / children of members?
>>>
>>> 80k.
>>
>> Really? I had been thinking that this number was tied to the origins, but I can see your logic. It would be great to try and find a way to estimate this, so I'd like to echo your request for anyone with info to chime in.
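[Editor's note: to make the two competing formulas concrete, here is a small Python sketch. The function name, the `count_ee_certs` switch, and the sample inputs are mine; the inputs are the rough guesses from this thread (~80k CA certs, ~500k announced prefixes, ~40k ASNs, X between ~0.3 best-case aggregation and 1 no aggregation), not measured values.]

```python
def total_objects(n_ca, n_prefixes, n_asns, x, y, count_ee_certs=False):
    """Back-of-the-envelope count of published RPKI objects.

    Tim's model:  Ototal  = 4 * #CA + #prefixes * X + #ASNs * Y
    Eric's model (count_ee_certs=True) additionally counts
    #prefixes + #ASNs EE certs, as written in the thread:
                  O^Total = 4 * #CA + #ASNs + #prefixes
                            + X * #prefixes + Y * #ASNs
    """
    total = 4 * n_ca + n_prefixes * x + n_asns * y
    if count_ee_certs:
        total += n_prefixes + n_asns
    return total

# X = mean of 0.3 (RIPE NCC's current aggregation) and 1 (none); Y = 1.
tim_estimate = total_objects(80_000, 500_000, 40_000, x=0.65, y=1)
eric_estimate = total_objects(80_000, 500_000, 40_000, 0.65, 1,
                              count_ee_certs=True)
```

With these inputs the two models give roughly 685k versus 1.2M objects, so the EE-cert accounting question changes the estimate by almost a factor of two.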
>>> #prefixes and 'X'
>>>
>>> The number of announced prefixes is still rising. Currently we are nearing 500k.
>>> Worst case, X is 1, meaning every ASN-prefix combination has its own ROA.
>>>
>>> In reality this number will be lower because we can and do aggregate. But not all implementers will do this. There is something to be said for *not* fate-sharing ROAs for different prefixes from the same ASN. Also, most of our members are fairly small, and they do not do huge numbers of announcements individually. Our current aggregation rate -- and we really try -- is:
>>> 792 ROAs / (2197 IPv4 + 468 IPv6 prefixes) = 0.3 ROA/prefix
>>>
>>> (see: http://certification-stats.ripe.net/)
>>>
>>> For scalability assessment, I am not sure though that a factor of (1/0.3 =) 3 between this level of aggregation, which seems best case, and no aggregation, worst case, is that significant in the big picture. I will use the mean of these two numbers below..
>>
>> Fair enough. I can certainly see the logic here, but if we wound up with a good way to do the estimation, that would be even better, no? :)
>>
>> <snip>
>>
>>> In principle I like the approach of turning this around and defining what an acceptable average delivery rate would be, given the total number of objects and the maximum sync time. But on the other hand this can lead to rejecting any infrastructure we could come up with.. just set the goals high enough and nothing will be enough. So I think we should be cautious here. If there are absolute, objective minimal requirements it would be good to know them. But other than that, it may be best to be pragmatic about it and turn this back.. try to think of other ways and see if they actually perform significantly better..
>>
>> I can respect this concern, but we really do have to deal with any systemic/complexity/operational/etc. facets of the system that we have designed.
>> We need to know how this design is going to behave if we are going to enshrine it. For example, the above calculations, and ours, and any derivative formulation would make revoking a key within an hour seem to be impossible. Is it a day, a week, a month, etc.? That may still be unclear (depending on how we model this), but how can we go any further forward without taking a careful look at this design? We must know if it meets our requirements, and I think measurements like these help tell us how feasible this will all be.
>>
>>> Staying with the document, if we take the current rsync repositories as a starting point to see where we are heading without changes:
>>> = It should be noted that fetch times depend on layout; a hierarchical layout, allowing recursive fetches, saves a lot of overhead (and latency)
>>> = We *do* recursive fetches on all current RIR repositories (yes, we hacked in which base directory to use)
>>> = Testing on my laptop I typically see fetch times of around 20ms per object, not 628ms
>>
>> To be perfectly fair, I just used the numbers I found from the BGPSEC design team's measurements. I was very hesitant to use any particular set of numbers here. As I'm sure you know, there are opinions about rsync: what it looks like under load, with asynchrony, repos' operational uptimes, restarting because of changes in the repository (I think that was something from your preso), etc. I think you likely know (much better than me) that if a repo is under heavy load, churn, or just having an outage, that can cause caches' sync times to suffer. Hence, I really liked getting real operational data from non-lab measurements. I actually really feel that data is quite useful. Consider this: you all (who are running these things) are the experts. If we wind up with ~42,000 repos, they will _not_ all be run as well as you run them.
> I understand your concern about many repos, some of them small and possibly not well managed.
>
> But the number of repos is not 1:1 with the number of organisations acting as CAs.
>
> We currently see 5 different repositories for the hosted solutions provided by the RIRs. Non-hosted is being worked on, but so far not done in the production environment. When this arrives, some non-hosted people will want to do their own publishing; others will use a bigger repository, for example with their RIR, or it may be that 3rd parties will start providing this service. In any case 42k repositories seems a bit much, though more than 5 is very likely.
>
>>> A full fetch based on today's numbers would then take 1M * 20 ms = 20k seconds = 330 minutes = 5.5 hours = 0.25 days.
>>
>> Sorry, but I really think this has some problems. First, the numbers I see in the cited preso are way larger than this, and just for the object sets we see today. So, I have to say that this calculation doesn't seem to jibe with Randy's numbers.
>
> As you can read in my other emails, I agree in general. I don't think rsync can scale to the levels we need.
>
> But the numbers on today's *small* repositories can be improved a lot by making them hierarchical, or if your validator happens to know that it can use a higher directory. We do that last thing; rcynic does not. We made our repository hierarchical though.
>
> I think Randy's numbers may be outdated and represent the totals for our old *flat* rsync repo. This adds a huge amount of latency and setup overhead.
>
> The numbers I got were from just running our validator* and watching the log for lines like:
> 16:51:12,883 INFO Prefetching 'rsync://rpki.ripe.net/repository/'
> 16:52:06,980 INFO Done prefetching for 'rsync://rpki.ripe.net/repository/'
> 16:52:26,189 INFO Finished validating RIPE NCC RPKI Root, fetched 4447 valid objects
>
> So the crypto took 20 seconds.
> The fetching of 4447 objects took 54 seconds => 12 ms/object. For Lacnic: 280 objects in 5 seconds => 18 ms/object, from my laptop on wireless in Amsterdam to South America.
>
> *:
> http://www.ripe.net/lir-services/resource-management/certification/tools-and-resources
>
>>> So this is quite a bit faster. Eric, Danny, how did you get to the numbers in table 3? Like I said: I just got some times from validation runs done on my laptop. We do collect data from our validators 'in the wild' though.. unfortunately the format of this data (we store a *lot*) is such that it will take some time for me to dig out more representative numbers. More time than I have now, but I will try….
>>
>> The citation at the end of the document comes from Randy. It shows (MRT?) graphs with these numbers on them. This doesn't include issues like repos under load from 42,000 caches, DNS, outages, etc. I think the numbers taken from his preso are very charitable, and they are actual measurements. Moreover, his own experiments showed that replicating this all inside a "fairly large scale" experiment took 660 minutes by itself... and that was with just 14,000 objects. This actually totally contradicts the numbers you calculated above. Sorry, I think it just doesn't add up.
>>
>>> Another important factor is the number of RPs that we can expect.. I know that Rob and Randy et al. are looking into ways to let RPs share data and be less reliant on central repository servers. On the other hand, if all ASNs run at least 1 RPKI validation cache that talks to the repositories directly, then we're looking at 40k clients. If they want updates, say just 3 times per day, that's 120k requests per day, so something like 1 - 1.5 per second.
>>
>> Again, I used Randy's numbers, and even his experiments on 14k objects on a small topology show it takes twice as long as your global estimates.
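[Editor's note: the gap between the two estimates can be made explicit with a quick sketch. The helper name is mine; the latencies and object counts are the ones quoted in this thread (20 ms/object from Tim's hierarchical-repo runs; 660 minutes for 14k objects from Randy's experiment).]

```python
def full_sync_hours(n_objects, ms_per_object):
    """Full-sync time = total objects * per-object fetch latency."""
    return n_objects * ms_per_object / 1000 / 3600

# Tim's hierarchical-repo measurement (~20 ms/object):
hours_hierarchical = full_sync_hours(1_000_000, 20)    # ~5.6 hours for 1M objects

# Randy's experiment implies a much higher effective per-object cost:
ms_randy = 660 * 60 * 1000 / 14_000                    # ~2,829 ms/object
hours_randy = full_sync_hours(1_000_000, ms_randy)     # ~786 hours at that rate
```

The two per-object costs differ by more than two orders of magnitude, which is why the thread cannot agree on whether a full sync takes hours or weeks; the flat-versus-hierarchical repo layout is the leading explanation offered above.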
> We are talking about different things here:
> = You're looking at how long it takes for 1 RP to get in sync.
> = I was referring to the number of RPs that a repository can expect to connect per second.
>
>>> This slide shows the happy case where all RPs are up to date, and they just check with the server to see if there are any updates. So, importantly, this does not include long-running data transfers.. This is not server-grade hardware (just a Mac mini), but it's useful as an order-of-magnitude indication imo. We see that the total number of RPs just checking for updates that the server can handle per second depends linearly on the repository size (it needs to do an O(n) scan). The total number of concurrent RPs also depends on the repository size. It appears that some list / index is kept in memory for every connection. Long story short, we can only process small numbers of RPs per second, and it's quite trivial to end up with too many concurrent RPs, pushing the servers to the memory-limited cliff for huge repositories.
>>
>> Yeah, I was wondering about that. It felt like it was beyond my perspective to estimate, so I tried to focus this sizing analysis on a more general systemic view. I totally appreciate the above comment, but maybe we could try to model that in another tech-note? I'm happy to try and help, if you would like.
>
> This is one of the main reasons why I think that rsync won't scale to the needs we can expect in full deployment, even if all RIR repositories are hierarchical and we don't see a lot of non-hosted CAs publishing elsewhere. We cannot expect to keep getting numbers like 20 ms/object.. not without very *huge* investments in setting up some home-grown rsync CDN, spreading instances like very busy root servers over the world. Not something that the non-hosted folk will likely want to do either, btw..
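[Editor's note: for reference, the polling-rate arithmetic mentioned earlier (40k caches checking 3 times per day) works out as below. The helper name is mine; the inputs are the thread's assumptions.]

```python
def requests_per_second(n_clients, polls_per_day):
    """Average repository request rate from periodically polling RPs."""
    return n_clients * polls_per_day / 86_400  # 86,400 seconds per day

rate = requests_per_second(40_000, 3)  # ~1.4 requests/second on average
```

Note this is only an average: since each connection costs the server an O(n) repository scan (per the slide discussion above), synchronized polling bursts would be far more punishing than the mean suggests.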
> Then, regarding non-hosted CAs not publishing in their RIR repository.. I hadn't really thought of this before, but the advantage of the recursive rsync is lost here. Say 5 organisations do non-hosted and they all publish with a 3rd party providing a repository service.. they are siblings. This repo is flat.
>
> So, apart from the other things I list in the document I sent yesterday, I think we have a requirement that the delta protocol is not dependent on the PKI hierarchy.
>
>>> It's because of this that I keep going on about:
>>> = We should have a separate delta protocol and notification mechanism, and not rely on rsync for this
>>> = For scalability:
>>>   = the hard work (CPU) should be done by the clients, not the server
>>>   = it should be possible to offload connections & memory away from the server (proxies)
>>> = It makes sense to look at HTTP and the scalability of existing CDNs for delivery
>>
>> ibid.
>>
>>> My gut feeling (yes, it's been involved a lot today) tells me that this SHOULD scale a lot better.. For example, serving a small update notification file over HTTP using a CDN: 10ks of requests / second, easy.. Data transfer to RPs.. probably not a whole lot better actually if you need it all -- though using existing CDNs with a global infrastructure closer to RPs may help here. But if we have a notification file that points to small deltas, and fetching these small files is cheap, this may actually be a big improvement.. So although it may take a while to do the first sync, *staying* in sync may actually perform a lot better. Well.. all this is a thought experiment at this stage. Without a pilot and actual measurements it's hard to be sure.
>>
>> I really think these are the types of conversations that we should be having. Thank you very much for putting your thoughts here!
>>
>> Eric

_______________________________________________
sidr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/sidr
