On Nov 16, 2012, at 10:45 AM, Tim Bruijnzeels wrote:

> Hi,
>
> Some more comments on the numbers and formula..
>
> On Nov 15, 2012, at 5:36 AM, Arturo Servin <[email protected]> wrote:
<snip>

> Apart from any signed objects it may publish, every CA typically has:
> = 1 certificate published by its parent

But, as I asked Arturo, would we expect to have a CA from each parent (i.e.
each RIR that an org may have allocations from)? While this may often still
just be 1, it seems important to note, no? That was one reason we had for
breaking it apart. I'm more than willing to believe that I'm wrong here,
but I'd like to understand how.

> = 1 manifest
> = 1 CRL
> = 1 GB record (as Arturo said not widely deployed, but let's throw it in
>   for full deployment)

How does the RPKI certify ASN allocations? This is needed to certify router
EE certs (they are tied to an ASN, no?). Such an AS cert would add another
1 to the above and make it 5, right? Otherwise, how does a forward signing
peer associate itself with the ASN, instead of some prefix? The ASN isn't
necessarily allocated by the same RIR as any of the allocations, so IP
allocation and ASN allocation are orthogonal, no?

> So that's 4 objects.
>
> During a key roll the CA will have the following additional objects:
> = 1 cert published by the parent
> = 1 manifest
> = 1 CRL
>
> Making 7 objects. But typically not all CAs roll at the same time.

Unless it is an algorithm rollover, and that is expected to last for years
(iirc). Then this set would be doubled (plus double the numbers below),
right?

> The number of signed ROAs and Router certificates does,

And EE certs. While 1:1 with ROAs, they require additional (very different)
processing, especially if you start down the road of HSMs. So, we claimed
this additional operational requirement means that even if you double up on
the downloads, those are still two separate objects. You have to manage EE
rollover keeping crypto material the same or changing it, depending on
details of the ROA, etc. That won't come for free, and (again) needing HSMs
makes this a big deal. So, we really felt it was important to call this
complexity out by counting each.
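The per-CA object counts being debated above can be turned into a quick
script (a back-of-the-envelope sketch using only this thread's numbers;
`objects_per_ca` and its parameters are made-up names for illustration, not
anything from the RPKI specs):

```python
# Sketch of how many objects a single CA publishes, per the counts in this
# thread: 4 in steady state, 5 if ASN resources need a separate AS cert,
# 7 during a key roll. Names/parameters here are hypothetical.

STEADY_STATE = ["CA cert (from parent)", "manifest", "CRL", "GB record"]
KEY_ROLL_EXTRA = ["new CA cert (from parent)", "new manifest", "new CRL"]

def objects_per_ca(rolling=False, parents=1, with_as_cert=False):
    """Count published objects for one CA.

    parents      -- a CA may hold a cert from each RIR it has allocations
                    from (the point raised above), so this may exceed 1.
    with_as_cert -- count an extra AS cert if ASN resources are certified
                    separately (the open question in this thread).
    rolling      -- during a key roll the new key's cert/MFT/CRL coexist
                    with the old set.
    """
    n = len(STEADY_STATE) + (parents - 1)  # one extra cert per extra parent
    if with_as_cert:
        n += 1
    if rolling:
        n += len(KEY_ROLL_EXTRA)
    return n

print(objects_per_ca())                   # 4, the thread's baseline
print(objects_per_ca(with_as_cert=True))  # 5, with a separate AS cert
print(objects_per_ca(rolling=True))       # 7, during a key roll
```

Multiplying any of these per-CA counts by the #CA estimates discussed below
(40k to 80k) gives the fixed portion of the repository size.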
> in my opinion, not depend on the number of CAs, but:
> = ROA -> The number of announcements seen in BGP * some aggregation
>   factor (1 / # average prefixes on one ROA)

I pretty much agree with this (as I think the tech-note said). I do,
however, have to note that with MOAS, you need multiple ROAs. Small point,
but worth stating. :)

> = Router certs -> The number of ASNs * the number of keys for each ASN

...times the number of eBGP speakers, you mean, right?

> So I think a better model would be to say:
>
> number          object
> #CA             CA cert
> #CA             MFT
> #CA             CRL
> #CA             GB

Ack, we just estimated this as the # of SIAs, and then varied it from 5 to
42,000.

> #prefixes * X   ROAs

Yeah, but we didn't guestimate the X value. It sounds like we should, but
is there any data we can use to do so?

> #ASNs * Y       Router certs
>
> Ototal = 4 * #CA + #prefixes * X + #ASNs * Y

Re: the above, I think this would be:

  O_total = 4 * #CA + #ASNs + #prefixes + X * #prefixes + Y * #ASNs

We had called out the need for the AS EE cert (which was not in the
equation you outlined), and we felt it was important not to omit EE certs
(if for no other reason than the operational complexity they bring).

> As for the numbers.. this is a bit of a guessing game.. we just really
> don't know at this time. We can take our best guess, but should keep in
> mind that our best guess is probably off, and needs re-evaluation in
> years to come.

100% agree. That is why we called this a back-of-the-envelope calculation
and are totally seeking feedback, and are absolutely interested in pushing
revisions as things evolve.

> #CAs
>
> If this were the total number of current members for all RIRs this number
> would be around 40k. However, there are also PI users that are not direct
> members of the RIRs, and some members will delegate some of their
> resources further. For reference I believe that in the RIPE region we
> have around 25k PI prefixes.
> I expect that a lot of the organisations that hold these resources will
> be happy to let a sponsoring RIR member (LIR in our region) manage their
> ROAs. But not all.. So I think that in a full deployment world this
> number may be significantly bigger. If anyone has any ideas on this,
> please chime in… Going on nothing more than gut feeling I would say the
> total could be in the order of:
>
> = 40k RIR members plus 40k self managing PI holders / children of members?
>
> 80k.

Really? I had been thinking that this number was tied to the origins, but
I can see your logic. It would be great to try and find a way to estimate
this, so I'd like to echo your request for anyone with info to chime in.

> #prefixes and 'X'
>
> The number of announced prefixes is still rising. Currently we are
> nearing 500k.
> Worst case X is 1, meaning every ASN - prefix combination has its own ROA.
>
> In reality this number will be lower because we can and do aggregate. But
> not all implementers will do this. There is something to be said for
> *not* fate-sharing ROAs for different prefixes from the same ASN. Also,
> most of our members are fairly small, and they do not do huge numbers of
> announcements individually. Our current aggregation rate -- and we really
> try... is:
>
> 792 ROAs / (2197 IPv4 + 468 IPv6 prefixes) = 0.3 ROA/prefix
>
> (see: http://certification-stats.ripe.net/)
>
> For scalability assessment I am not sure though that a factor of
> (1/0.3=) 3 between this level of aggregation, which seems best case, and
> no aggregation, worst case, is that significant in the big picture. I
> will use the mean of these two numbers below..

Fair enough. I can certainly see the logic here, but if we wound up with a
good way to do the estimation that would be even better, no? :)

<snip>

> In principle I like the approach to turn this around and define what an
> acceptable average delivery rate would be given the total number of
> objects and the maximum sync time.
> But on the other hand this can lead to rejecting any infrastructure we
> could come up with.. just set the goals high enough and nothing will be
> enough. So I think we should be cautious here. If there are absolute
> objective minimal requirements it would be good to know them. But other
> than that it may be best to be pragmatic about it and turn this back..
> try to think of other ways and see if they actually perform significantly
> better..

I can respect this concern, but we really do have to deal with any
systemic/complexity/operational/etc. facets of the system that we have
designed. We need to know how this design is going to behave if we are
going to enshrine it. For example, the above calculations, and ours, and
any derivative formulation would make revoking a key within an hour seem
to be impossible. Is it a day, a week, a month, etc.? That may still be
unclear (depending on how we model this), but how can we go any farther
forward without taking a careful look at this design? We must know if it
meets our requirements, and I think measurements like these help tell us
how feasible this will all be.

> Bearing with the document, if we take the current rsync repositories as a
> starting point to see where we are heading without changes:
> = It should be noted that fetch times depend on lay-out; a hierarchical
>   layout, allowing recursive fetches, saves a lot of overhead (and
>   latency)
> = We *do* recursive fetches on all current RIR repositories (yes, we
>   hacked in which base directory to use)
> = Testing on my laptop I typically see fetch times of around 20ms per
>   object, not 628ms

To be perfectly fair, I just used the numbers I found from the BGPSEC
design team's measurements. I was very hesitant to use any particular set
of numbers here. As I'm sure you know, there are opinions about rsync,
what it looks like under load, with asynchrony, repos' operational
uptimes, restarting because of changes in the repository (I think that was
something from your preso), etc.
I think you likely know (much better than me) that if a repo is under
heavy load, churn, or just having an outage, that can cause caches' sync
times to suffer. Hence, I really liked getting real operational data from
non-lab measurements. I actually really feel that data is quite useful.
Consider this: you all (who are running these things) are the experts. If
we wind up with ~42,000 repos, they will _not_ all be run as well as you
run them.

> A full fetch based on today's numbers would then take 1M * 20 ms = 20k
> seconds = 330 minutes = 5.5 hours = 0.25 days.

Sorry, but I really think this has some problems. First, the numbers I see
in the cited preso are way larger than this, and just for the object sets
we see today. So, I have to say that this calculation doesn't seem to jibe
with Randy's numbers.

> So this is quite a bit faster. Eric, Danny how did you get to the numbers
> in table 3? Like I said: I just got some times from validation runs done
> on my laptop. We do collect data from our validators 'in the wild'
> though.. unfortunately the format of this data (we store a *lot*) is such
> that it will take some time for me to dig out more representative
> numbers. More time than I have now, but I will try….

The citation at the end of the document comes from Randy. It shows (MRT?)
graphs with these numbers on them. This doesn't include issues like repos
under load from 42,000 caches, DNS, outages, etc. I think the numbers
taken from his preso are very charitable, and they are actual
measurements. Moreover, his own experiments showed that replicating this
all inside a "fairly large scale" experiment took 660 minutes by itself...
and that was with just 14,000 objects. This actually totally contradicts
the numbers you calculated above. Sorry, I think it just doesn't add up.

> Another important factor is the amount of RPs that we can expect..
> I know that Rob and Randy et al are looking into ways to let RPs share
> data and be less reliant on central repository servers. On the other hand
> if all ASNs run at least 1 rpki validation cache that talks to the
> repositories directly then we're looking at 40k clients. If they want
> updates, say just 3 times per day, that's 120k requests per day, so
> something like 1 - 1.5 per second.

Again, I used Randy's numbers, and even his experiments on 14k objects on
a small topology show it takes twice as long as your global estimates.

> This slide shows the happy case where all RPs are up to date, and they
> just check with the server to see if there are any updates. So
> importantly this does not include long running data transfers.. This is
> not server grade hardware (just a mac mini) but it's useful as an order
> of magnitude indication imo. We see that the total number of RPs just
> checking for updates that the server can handle / second depends linearly
> on the repository size (it needs to do an O(n) scan). The total number of
> concurrent RPs also depends on the repository size. It appears that some
> list / index is kept in memory for every connection. Long story short, we
> can only process small numbers of RPs per second and it's quite trivial
> to end up with too many concurrent RPs pushing the servers to the
> memory-limited cliff for huge repositories.

Yeah, I was wondering about that. It felt like it was beyond my
perspective to estimate, so I tried to focus this sizing analysis on a
more general systemic view. I totally appreciate the above comment, but
maybe we could try to model that in another tech-note? I'm happy to try
and help, if you would like.
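To make the arithmetic being traded here concrete, here is a quick script
redoing the polling-load and sync-time figures. All inputs are numbers
quoted in this thread; the final extrapolation from Randy's 660-minute /
14k-object measurement out to 1M objects is my own back-of-the-envelope,
not a figure from either preso:

```python
# Redo the thread's polling and sync arithmetic. Inputs come from the
# messages above; the last extrapolation is my own rough scaling.

SECONDS_PER_DAY = 86_400

# RP polling load: 40k caches, each checking for updates 3 times per day.
caches = 40_000
checks_per_day = 3
req_per_sec = caches * checks_per_day / SECONDS_PER_DAY
print(f"polling: {req_per_sec:.2f} req/s")        # matches "1 - 1.5" above

# Full sync of 1M objects at 20 ms/object (the laptop figure).
objects = 1_000_000
laptop_hours = objects * 0.020 / 3600
print(f"laptop estimate: {laptop_hours:.1f} h")

# Same full sync, scaled from the measured 660 minutes for 14k objects.
per_object_s = 660 * 60 / 14_000                  # ~2.8 s/object
measured_days = objects * per_object_s / SECONDS_PER_DAY
print(f"scaled from measurement: {measured_days:.0f} days")
```

The two per-object costs differ by a factor of roughly 140, which is the
gap being argued over in this exchange.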
> It's because of this that I keep going on about:
> = We should have a separate delta protocol and notification mechanism and
>   not rely on rsync for this
> = For scalability:
>   = the hard work (CPU) should be done by the clients, not the server
>   = it should be possible to offload connections & memory away from the
>     server (proxies)
> = It makes sense to look at http and scalability of existing CDNs for
>   delivery

ibid.

> My gut feeling (yes, it's been involved a lot today) tells me that this
> SHOULD scale a lot better.. For example serving a small update
> notification file over http using a CDN, 10ks of requests / second,
> easy.. Data transfer to RPs.. probably not a whole lot better actually if
> you need it all -- though using existing CDNs with a global
> infrastructure closer to RPs may help here. But if we have a notification
> file that points to small deltas and fetching these small files is cheap
> this may actually be a big improvement.. So although it may take a while
> to do the first sync, *staying* in sync may actually perform a lot
> better. Well.. all this is a thought experiment at this stage. Without a
> pilot and actual measurements it's hard to be sure.

I really think these are the types of conversations that we should be
having. Thank you very much for putting your thoughts here!

Eric

_______________________________________________
sidr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/sidr
