Hi Eric,

Sorry for the late response..
With regards to some of your comments on requirements and ideas for alternative deltas: some responses inline below, but other than that see the thread where both Bryan Weber and I talk about our ideas. This is somewhat separate from the questions about "what a fully deployed RPKI tree would look like", "what churn we might see", and "how many connecting RPs, and how frequently". All of that can feed into requirements, but it is in essence another discussion, so I suggest we keep them separate.

On Nov 16, 2012, at 10:44 PM, Eric Osterweil <[email protected]> wrote:

> 
> On Nov 16, 2012, at 10:45 AM, Tim Bruijnzeels wrote:
> 
>> Hi,
>> 
>> Some more comments on the numbers and formula..
>> 
>> On Nov 15, 2012, at 5:36 AM, Arturo Servin <[email protected]> wrote:
>> 
> 
> <snip>
> 
>> Apart from any signed objects it may publish, every CA typically has:
>> = 1 certificate published by its parent
> 
> But, as I asked Arturo, would we expect to have a CA from each parent (i.e.
> each RIR that an org may have allocations from)? While this may often still
> just be 1, it seems important to note, no? That was one reason we had for
> breaking it apart. I'm more than willing to believe that I'm wrong here, but
> I'd like to understand how.
> 
>> = 1 manifest
>> = 1 CRL
>> = 1 GB record (as Arturo said not widely deployed, but let's throw it in
>> for full deployment)
> 
> How does the RPKI certify ASN allocations? This is needed to certify router
> EE certs (they are tied to an ASN, no?). Such an AS cert would add another 1
> to the above and make it 5, right? Otherwise, how does a forward signing
> peer associate itself with the ASN, instead of some prefix? The ASN isn't
> necessarily allocated by the same RIR as any of the allocations, so IP
> allocation and ASN allocation are orthogonal, no?

Okay, let me rephrase.. it gets confusing when talking about organisations as CAs, and CA certificates: i.e.
certificates that have the CA bit set and can sign and publish other certificates, signed objects, etc.

We're talking about 4 objects for each CA certificate:
= 1 CA certificate with IPv4, IPv6 and/or ASN resources
= 1 mft
= 1 CRL
= 1 GB

An organisation may have more than one CA certificate:
= if it has different parents and gets certificates from more than one
= if it has 'minority space'. For example, if one of our members has resources for which another RIR is the majority holder, then these are not on our normal certificate. We get a certificate with all such resources signed by the other RIR, and then we use this to sign an additional certificate for this member. Certificates can have only one parent, so we cannot merge this.

Regarding the second point: I haven't done a detailed analysis on our data yet. This does affect at least a portion of our members, so the average number of CA certs per member organisation is more than 1. I expect that it will not be a *lot* higher, but like I said I haven't analysed the data yet.

Having said that, it seems that the relative contribution of these, in a sense, boilerplate objects to the total count is of a different order than the number of expected ROAs and router certificates.

>> 
>> So that's 4 objects.
>> 
>> During a key roll the CA will have the following additional objects:
>> = 1 cert published by the parent
>> = 1 manifest
>> = 1 CRL
>> 
>> Making 7 objects. But typically not all CAs roll at the same time.
> 
> Unless it is an algorithm rollover, and that is expected to last for years
> (iirc). Then this set would be doubled (plus double the numbers below),
> right?
> 

I did not consider algorithm rollover, but I think you're right.

>> 
>> The number of signed ROAs and Router certificates does,
> 
> And EE certs. While 1:1 with ROAs, they require additional (very different)
> processing, esp if you start down the road of HSMs.
> So, we claimed this
> additional operational requirement means that even if you double up on the
> downloads, those are still two separate objects. You have to manage EE
> rollover, keeping crypto material the same or changing it, depending on
> details of the ROA, etc. That won't come for free, and (again) needing HSMs
> makes this a big deal. So, we really felt it was important to call this
> complexity out by counting each.

I don't understand the use of EE certs and HSMs here, and I don't see how this can significantly raise the object count. In the RPKI, EE certs can only be used to sign RPKI signed objects, and they are embedded in those objects, not published separately.

All keys in our online system are protected by HSMs. The keys in EE certificates even more so: they are generated in an HSM, used only once, and then forgotten. In the online system the HSM protects against key theft if an attacker somehow gets access to the database/file system.. They will not be able to export the keys without access to a quorum of key cards that protect the internal key of the HSM.

If you want to use an *offline* key protected by an HSM, then you would have an additional CA cert (and mft, crl and gb). This is what we do with our TA at least:

TA (offline) -> signs CA cert for online use => signs member CAs

The idea being that if the worst happens to our online system, we can rebuild and re-issue using the more secure offline key. The hosted members don't have their own offline key; they are protected by the same one.. Non-hosted CAs may want to use an intermediate offline certificate. This adds some overhead: 4 objects per offline key (CA cert, mft, crl, gb).

>> in my opinion, not depend on the number of CAs, but:
>> = ROA -> The number of announcements seen in BGP * some aggregation factor
>> (1 / # average prefixes on one ROA)
> 
> I pretty much agree with this (as I think the tech-note said). I do,
> however, have to note that with MOAS, you need multiple ROAs.
> Small point,
> but worth stating. :)

Yes, a ROA can have only one ASN, but multiple prefixes. We aggregate prefixes for the same ASN as much as we can on a single ROA. So far we are managing a factor of around 3 prefixes per ROA. Lacnic is similar. The other RIRs seem to aggregate a bit less at this time. See here: http://certification-stats.ripe.net/

> 
>> = Router certs -> The number of ASNs * the number of keys for each ASN
> 
> \times The number of eBGP speakers you mean, right?

Yes. My model assumes that the number of speakers can be related to the number of ASNs. Randy suggested counting the number of physical bgpsec-speaking routers, and that they may have two keys each. That may well be a better model.

> 
>> 
>> So I think a better model would be to say:
>> 
>> number    object
>> #CA       CA cert
>> #CA       MFT
>> #CA       CRL
>> #CA       GB
> 
> Ack, we just estimated this as the # of SIAs, and then varied it from 5 to
> 42,000.

5 here would mean the current RIR TAs without any other signed content. The total object count for this depends on the number of CA certs. See my previous best estimate below.

> 
>> 
>> #prefixes * X    ROAs
> 
> Yeah, but we didn't guestimate the $X$ value. It sounds like we should, but
> is there any data we can use to do so?
> 
>> 
>> #ASNs * Y    Router certs
>> 
>> Ototal = 4 * #CA + #prefixes * X + #ASNs * Y
> 
> Re: the above, I think this would be
> 
> O^Total = 4 * #CA + #ASNs + #prefixes + X * #prefixes + Y * #ASNs
> 
> We had called out the need for an AS EE cert (which was not in the equation
> you outlined), and we felt it was important not to omit EE certs (if for no
> other reason than the operational complexity they bring).
> 
>> 
>> As for the numbers.. this is a bit of a guessing game.. we just really
>> don't know at this time. We can take our best guess, but should keep in
>> mind that our best guess is probably off, and needs re-evaluation in years
>> to come.
> 
> 100% agree.
> That is why we called this a back-of-the-envelope calculation
> and are totally seeking feedback, and are absolutely interested in pushing
> revisions as things evolve.
> 
>> 
>> #CAs
>> 
>> If this were the total number of current members for all RIRs, this number
>> would be around 40k. However, there are also PI users that are not direct
>> members of the RIRs, and some members will delegate some of their resources
>> further. For reference, I believe that in the RIPE region we have around
>> 25k PI prefixes. I expect that a lot of the organisations that hold these
>> resources will be happy to let a sponsoring RIR member (LIR in our region)
>> manage their ROAs. But not all.. So I think that in a full deployment world
>> this number may be significantly bigger. If anyone has any ideas on this,
>> please chime in… Going on nothing more than gut feeling I would say the
>> total could be in the order of:
>> 
>> = 40k RIR members plus 40k self-managing PI holders / children of members?
>> 
>> 80k.
> 
> Really? I had been thinking that this number was tied to the origins, but I
> can see your logic. It would be great to try and find a way to estimate
> this, so I'd like to echo your request for anyone with info to chime in.
> 
>> 
>> #prefixes and 'X'
>> 
>> The number of announced prefixes is still rising. Currently we are nearing
>> 500k.
>> Worst case X is 1, meaning every ASN - prefix combination has its own ROA.
>> 
>> In reality this number will be lower because we can and do aggregate. But
>> not all implementers will do this. There is something to be said for *not*
>> fate-sharing ROAs for different prefixes from the same ASN. Also, most of
>> our members are fairly small, and they do not do huge numbers of
>> announcements individually. Our current aggregation rate -- and we really
>> try...
>> is:
>> 
>> 792 ROAs / (2197 IPv4 + 468 IPv6 prefixes) = 0.3 ROA/prefix
>> 
>> (see: http://certification-stats.ripe.net/)
>> 
>> For scalability assessment I am not sure, though, that a factor of
>> (1/0.3 =) 3 between this level of aggregation, which seems best case, and
>> no aggregation, worst case, is that significant in the big picture. I will
>> use the mean of these two numbers below..
> 
> Fair enough. I can certainly see the logic here, but if we wound up with a
> good way to do the estimation that would be even better, no? :)
> 
> <snip>
> 
>> In principle I like the approach of turning this around and defining what
>> an acceptable average delivery rate would be, given the total number of
>> objects and the maximum sync time. But on the other hand this can lead to
>> rejecting any infrastructure we could come up with.. just set the goals
>> high enough and nothing will be enough. So I think we should be cautious
>> here. If there are absolute, objective minimal requirements it would be
>> good to know them. But other than that it may be best to be pragmatic
>> about it and turn this back.. try to think of other ways and see if they
>> actually perform significantly better..
> 
> I can respect this concern, but we really do have to deal with any
> systemic/complexity/operational/etc. facets of the system that we have
> designed. We need to know how this design is going to behave if we are
> going to enshrine it. For example, the above calculations, and ours, and
> any derivative formulation would make revoking a key within an hour seem to
> be impossible. Is it a day, a week, a month, etc.? That may still be
> unclear (depending on how we model this), but how can we go any farther
> forward without taking a careful look at this design? We must know if it
> meets our requirements, and I think measurements like these help tell us
> how feasible this will all be.
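Pulling the model and the guesses from this thread together, a quick back-of-the-envelope sketch. Every input here is one of the rough guesses discussed above (80k CA certs, ~500k prefixes, X between 0.3 and 1), and the ASN count and Y are pure assumptions for illustration, not measurements:

```python
# Back-of-the-envelope total object count, per the model in this thread:
#   Ototal = 4 * #CA + #prefixes * X + #ASNs * Y
# The 4 per-CA-certificate objects: CA cert, manifest, CRL, GB record.

def total_objects(n_ca, n_prefixes, n_asns, x_roas_per_prefix, y_keys_per_asn):
    return 4 * n_ca + x_roas_per_prefix * n_prefixes + y_keys_per_asn * n_asns

n_ca = 80_000          # 40k RIR members + 40k self-managing PI holders (guess)
n_prefixes = 500_000   # announced prefixes, still rising
n_asns = 42_000        # order-of-magnitude assumption
x = (1.0 + 0.3) / 2    # mean of worst case (1) and current best case (~0.3)
y = 2                  # e.g. two router keys per ASN (assumption)

print(int(total_objects(n_ca, n_prefixes, n_asns, x, y)))  # 729000
```

With these guesses the boilerplate (4 per CA cert) and the ROAs end up in the same ballpark, so neither term can be dropped, and re-evaluation as real numbers come in is cheap: change one input.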
> 
>> Bearing with the document, if we take the current rsync repositories as a
>> starting point to see where we are heading without changes:
>> = It should be noted that fetch times depend on lay-out; a hierarchical
>> layout, allowing recursive fetches, saves a lot of overhead (and latency)
>> = We *do* recursive fetches on all current RIR repositories (yes, we
>> hacked in which base directory to use)
>> = Testing on my laptop I typically see fetch times of around 20 ms per
>> object, not 628 ms
> 
> To be perfectly fair, I just used the #s I found from the BGPSEC design
> team's measurements. I was very hesitant to use any particular set of
> numbers here. As I'm sure you know, there are opinions about rsync, what it
> looks like under load, with asynchrony, repos' operational uptimes,
> restarting because of changes in the repository (I think that was something
> from your preso), etc. I think you likely know (much better than me) that
> if a repo is under heavy load, churn, or just having an outage, that can
> cause a cache's sync time to suffer. Hence, I really liked getting real
> operational data from non-lab measurements. I actually really feel that
> data is quite useful. Consider this: you all (who are running these things)
> are the experts. If we wind up w/ ~42,000 repos, they will _not_ all be run
> as well as you run them.
> 

I understand your concern about many repos, some of them small and possibly not well managed. But the number of repos is not 1:1 with the number of organisations acting as CAs. We currently see 5 different repositories for the hosted solutions provided by the RIRs. Non-hosted is being worked on, but so far not done in the production environment. When this arrives, some non-hosted people will want to do their own publishing, and others will use a bigger repository, for example with their RIR, or it may be that 3rd parties will start providing this service. In any case 42k repositories seems a bit much, though more than 5 is very likely.
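The linear scaling assumed in the fetch-time estimates in this thread can be made explicit with a small sketch. The per-object costs used here are the laptop measurements mentioned in this mail (12-20 ms against hierarchical rsync repos); none of this is a server-grade benchmark:

```python
# Rough full-sync wall time if the fetch cost is roughly constant per
# object. The ms/object figures are laptop measurements from this
# thread, not server numbers, so treat the output as order-of-magnitude.

def full_sync_hours(n_objects, ms_per_object):
    return n_objects * ms_per_object / 1000.0 / 3600.0

# ~1M objects at 20 ms/object works out to roughly 5.5 hours per full fetch;
# a 10x larger tree at the same per-object cost is already over two days.
for n, ms in [(1_000_000, 20), (1_000_000, 12), (10_000_000, 20)]:
    print(n, ms, round(full_sync_hours(n, ms), 1))
```

The point of the sketch: under a constant per-object cost assumption, only two levers exist, object count and per-object latency, which is why repository layout (hierarchical vs. flat) matters so much.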
>> 
>> A full fetch based on today's numbers would then take 1M * 20 ms = 20k
>> seconds = 330 minutes = 5.5 hours = 0.25 days.
> 
> Sorry, but I really think this has some problems. First, the numbers I see
> in the cited preso are way larger than this, and just for the object sets
> we see today. So, I have to say that this calculation doesn't seem to jive
> with Randy's numbers.
> 

As you can read in my other emails, I agree in general: I don't think rsync can scale to the levels we need. But the numbers on today's *small* repositories can be improved a lot by making them hierarchical, or if your validator happens to know that it can use a higher base directory. We do that last thing; rcynic does not. We made our repository hierarchical though. I think Randy's numbers may be outdated and represent the totals for our old *flat* rsync repo. That adds a huge amount of latency and setup overhead.

The numbers I got were from just running our validator* and watching the log for lines like:

16:51:12,883 INFO Prefetching 'rsync://rpki.ripe.net/repository/'
16:52:06,980 INFO Done prefetching for 'rsync://rpki.ripe.net/repository/'
16:52:26,189 INFO Finished validating RIPE NCC RPKI Root, fetched 4447 valid Objects

So the crypto took 20 seconds. The fetching of 4447 objects took 54 seconds => 12 ms/object. For Lacnic: 280 objects in 5 seconds => 17 ms/object, from my laptop on a wireless connection in Amsterdam to South America.

*: http://www.ripe.net/lir-services/resource-management/certification/tools-and-resources

>> 
>> So this is quite a bit faster. Eric, Danny, how did you get to the numbers
>> in table 3? Like I said: I just got some times from validation runs done
>> on my laptop. We do collect data from our validators 'in the wild'
>> though.. unfortunately the format of this data (we store a *lot*) is such
>> that it will take some time for me to dig out more representative numbers.
>> More time than I have now, but I will try….
> 
> The citation at the end of the document comes from Randy.
> It shows (MRT?)
> graphs with these numbers on them. This doesn't include issues like repos
> under load from 42,000 caches, DNS, outages, etc. I think the numbers taken
> from his preso are very charitable, and they are actual measurements.
> Moreover, his own experiments showed that replicating this all inside a
> ``fairly large scale'' experiment took 660 minutes by itself... and that
> was with just 14,000 objects. This actually totally contradicts the numbers
> you calculated above. Sorry, I think it just doesn't add up.
> 
>> 
>> Another important factor is the number of RPs that we can expect.. I know
>> that Rob and Randy et al. are looking into ways to let RPs share data and
>> be less reliant on central repository servers. On the other hand, if all
>> ASNs run at least 1 rpki validation cache that talks to the repositories
>> directly, then we're looking at 40k clients. If they want updates, say
>> just 3 times per day, that's 120k requests per day, so something like
>> 1 - 1.5 per second.
> 
> Again, I used Randy's numbers, and even his experiments on 14k objects on a
> small topology show it takes twice as long as your global estimates.
> 

We are talking about different things here:
= You're looking at how long it takes for 1 RP to get in sync.
= I was referring to the number of RPs that a repository can expect to connect per second.

>> 
>> This slide shows the happy case where all RPs are up to date, and they
>> just check with the server to see if there are any updates. So importantly
>> this does not include long-running data transfers.. This is not server
>> grade hardware (just a mac mini) but it's useful as an order of magnitude
>> indication imo. We see that the total number of RPs just checking for
>> updates that the server can handle per second depends linearly on the
>> repository size (it needs to do an O(n) scan). The total number of
>> concurrent RPs also depends on the repository size.
>> It appears that some
>> list / index is kept in memory for every connection. Long story short: we
>> can only process small numbers of RPs per second, and it's quite trivial
>> to end up with too many concurrent RPs, pushing the servers to the
>> memory-limit cliff for huge repositories.
> 
> Yeah, I was wondering about that. It felt like it was beyond my perspective
> to estimate, so I tried to focus this sizing analysis on a more general
> systemic view. I totally appreciate the above comment, but maybe we could
> try to model that in another tech-note? I'm happy to try and help, if you
> would like.
> 

This is one of the main reasons why I think that rsync won't scale to the needs we can expect in full deployment, even if all RIR repositories are hierarchical and we don't see a lot of non-hosted CAs publishing elsewhere. We cannot expect to keep getting that number of, say, 20 ms/object.. Not without *huge* investments in setting up some home-grown rsync CDN, spreading servers like very busy root servers over the world. Not something that the non-hosted folk will likely want to do either, btw..

Then, regarding non-hosted CAs not publishing in their RIR repository.. I hadn't really thought of this before, but the advantage of the recursive rsync fetch is lost here. Say 5 organisations do non-hosted and they all publish with a 3rd party providing a repository service.. they are siblings. This repo is flat. So apart from the other things I list in the document I sent yesterday, I think we have a requirement that the delta protocol is not dependent on the PKI hierarchy.
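For the repository-side load argument, the connect-rate arithmetic used earlier in this mail, as a sketch. The 40k caches and 3 checks per day are the guesses from this thread, not measurements:

```python
# Expected update-check rate at a repository if every ASN runs one
# validation cache polling the repository directly. All inputs are
# guesses discussed in this thread.

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def checks_per_second(n_caches, checks_per_day_each):
    return n_caches * checks_per_day_each / SECONDS_PER_DAY

print(round(checks_per_second(40_000, 3), 2))  # ~1.39 requests/second
```

Note this models only the happy case of cheap "any updates?" checks; if each check triggers an O(n) scan of a huge repository, even 1-2 such requests per second becomes the bottleneck described above.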
>> 
>> It's because of this that I keep going on about:
>> = We should have a separate delta protocol and notification mechanism,
>> and not rely on rsync for this
>> = For scalability:
>>   = the hard work (CPU) should be done by the clients, not the server
>>   = it should be possible to offload connections & memory away from the
>>   server (proxies)
>> = It makes sense to look at http and the scalability of existing CDNs for
>> delivery
> 
> ibid.
> 
>> 
>> My gut feeling (yes, it's been invoked a lot today) tells me that this
>> SHOULD scale a lot better.. For example, serving a small update
>> notification file over http using a CDN: 10ks of requests / second,
>> easy.. Data transfer to RPs.. probably not a whole lot better actually if
>> you need it all -- though using existing CDNs with a global infrastructure
>> closer to RPs may help here. But if we have a notification file that
>> points to small deltas, and fetching these small files is cheap, this may
>> actually be a big improvement.. So although it may take a while to do the
>> first sync, *staying* in sync may actually perform a lot better. Well..
>> all this is a thought experiment at this stage. Without a pilot and actual
>> measurements it's hard to be sure.
> 
> I really think these are the types of conversations that we should be
> having. Thank you very much for putting your thoughts here!
> 
> Eric
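To make the notification-plus-deltas thought experiment above slightly more concrete, here is a minimal sketch of what the RP side could look like. Everything in it (the JSON shape, field names, URLs) is a hypothetical illustration, not a defined protocol:

```python
import json

# Hypothetical RP catch-up logic for the "notification file + small
# deltas" idea: poll a tiny notification file (cheap to serve, CDN
# friendly) and fetch only the delta files not yet applied locally.

def deltas_to_fetch(notification_text, local_serial):
    """Return the URIs of the delta files this RP still needs."""
    note = json.loads(notification_text)
    if note["serial"] <= local_serial:
        return []  # already in sync: the only cost was one tiny fetch
    return [d["uri"] for d in note["deltas"] if d["serial"] > local_serial]

# Example: an RP at serial 41 sees a notification at serial 43.
notification = json.dumps({
    "serial": 43,
    "deltas": [
        {"serial": 42, "uri": "https://cdn.example/42.xml"},
        {"serial": 43, "uri": "https://cdn.example/43.xml"},
    ],
})
print(deltas_to_fetch(notification, 41))
```

The server does no per-client work beyond serving static files, which matches the requirements listed above: CPU on the client, connections offloadable to proxies/CDNs, and no dependence on the PKI hierarchy.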
_______________________________________________
sidr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/sidr
