Re: [sidr] sidr-arch-09 refresh cycle time

George Michaelson Tue, 27 Oct 2009 17:45:07 -0700

Matt, if you don't mind, I'd like to add some input to this discussion too.


On 28/10/2009, at 6:28 AM, Matt Lepinski wrote:

Geoff,
I'm happy to accept that the new wording is poor, but I'm pretty sure the old wording was also bad, and I think this discussion is important.
The old wording could easily be interpreted to suggest that once per day was the correct frequency for pulling from a repository. (That is, I believe the previous version was making a de facto recommendation for a default behavior of one pull every 24 hours ... there wasn't a RECOMMEND in the text, but we all know that examples tend to be normative in this type of document.)
1) So the first implicit question is: Should the working group be making a recommendation as to the frequency with which a relying party pulls from the repository?

I think we have to think about it, yes. But in what document? Why in an Architecture document, this close to 'closure'?

Or equivalently: Is there a "wrong" frequency that people might use if we didn't give them any guidance?
It seems that retrieving updates "too frequently" (e.g., every 5 minutes) strains the repository system and that retrieving updates "too infrequently" (e.g., monthly) means that when I inject a new ROA into the system, it will take "unacceptably long" for this information to propagate to the relying parties that make use of this information. Therefore, we should have text in the document that articulates some middle ground that we believe is reasonable for the Internet. (I make no claims that the current text in the document achieves this goal.)
2) The second question is: If we make a recommendation regarding the frequency with which relying parties should pull updates, what frequency should we recommend?

The goal set earlier on in the life of this project was to stabilize the system in two complete 24h work cycles of the repository system as a whole. And yes, this was predicated on a MINIMUM of one fetch per 24h per RP.


At least, that's what I understood. Maybe I was wrong?

If certification-active 'producers' approach each other in the provisioning process at least 4 times a day, and this is suitably spread into 24h, I believe that we satisfy this 2 x 24h cycle time goal, for relying parties, for the expected 'depth' maximums in the observed allocation hierarchy today.

I have not seen any discussion which suggests either we're changing from 2 x 24h relying party cycles, or from an expected depth of at worst 8 levels of delegation. (There might well be deeper trees. I expect that they lie in the margins of global routing. I also believe now that the tree will be far shallower for the overwhelming majority of the space.)
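
One way to read the arithmetic behind these figures is sketched below. It assumes, and this is only an illustrative model rather than anything stated in the draft, that a change moves down one level of delegation each time a CA next contacts its parent, so the worst case is the depth multiplied by the contact interval. The function name is made up for this sketch.

```python
# Illustrative model only: assume a change at the top of the tree moves down
# one delegation level per provisioning contact between a CA and its parent.
def worst_case_propagation_hours(depth_levels: int, contacts_per_day: int) -> float:
    """Worst-case hours for a change to reach the bottom of the tree."""
    interval_hours = 24 / contacts_per_day  # time between provisioning contacts
    return depth_levels * interval_hours


# Figures from the message: at worst 8 levels of delegation, 4 contacts a day.
hours = worst_case_propagation_hours(depth_levels=8, contacts_per_day=4)
print(hours)             # 48.0
print(hours <= 2 * 24)   # True: fits within two 24h cycles
```

Under that assumed model, 8 levels at one contact every 6 hours comes to 48 hours, which is exactly the 2 x 24h target.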

So, I don't understand why we've moved to a 3-hourly cycle time for RPs. I don't see where this came from.

Here, I understand that "everyone hitting the repository system at once" is a bad outcome regardless of the frequency that we recommend. That is, regardless of whether we recommend "once per day", "once per month", or "eight times daily" we will likely see problems with too much server load at midnight. If anyone can recommend text to avoid this phenomenon (i.e., to encourage people to spread out their queries to the repository system), please send text.

Well, the approach discussed informally has been to use a simple randomization mechanism to select a cron time in the 24h window, such that people choose a start time at random, and then cycle in a reasonable sub-multiple of that time.


This has worked for other processes historically.

Here is one example, from a Google search:

        http://www.moundalexis.com/archives/000076.php

It's not directly what I mean, but I am sure people can go back into their UUCP dial-sync days, or USENET call memory-stack, and remember approaches for doing this.

Or, a provider could publish a simple 'pick a token' system and advise RPs to go there, and be assigned a random slot for fetch to balance load.
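
A minimal sketch of the randomized start time idea, assuming each RP picks its own offset locally and then fetches at evenly spaced intervals through the day. The function names and the minute-level granularity are choices made for this illustration, not part of any proposal on the list.

```python
import random

MINUTES_PER_DAY = 24 * 60


def pick_fetch_schedule(fetches_per_day, rng=None):
    """Pick a random start minute, then space the day's fetches evenly.

    Returns the fetch times as sorted minutes after midnight.
    """
    if fetches_per_day < 1 or MINUTES_PER_DAY % fetches_per_day != 0:
        raise ValueError("fetches_per_day must divide the 24h window evenly")
    if rng is None:
        rng = random.Random()
    period = MINUTES_PER_DAY // fetches_per_day  # sub-multiple of 24h
    start = rng.randrange(MINUTES_PER_DAY)       # random start time
    return sorted((start + i * period) % MINUTES_PER_DAY
                  for i in range(fetches_per_day))


def as_hhmm(minute):
    """Format minutes after midnight as HH:MM, e.g. for a cron entry."""
    return f"{minute // 60:02d}:{minute % 60:02d}"


if __name__ == "__main__":
    # Each RP that runs this gets a different set of four fetch times.
    print(", ".join(as_hhmm(m) for m in pick_fetch_schedule(4)))
```

Because every RP draws its own start minute, the aggregate load spreads across the window instead of piling up at midnight, whatever fetch frequency is eventually chosen.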

I don't think we can recommend a fetch cycle period until we understand the dynamics of change in the repository.


So I'd pose questions at this time:

what amount of delay after a change in the repository is acceptable to an RP?

what impact does depth in the tree have on change?

how many cycles of update are allowed to pass before the tree is considered 'stable' against significant change?

is this in fact not deterministic, but dependent on context which is out of scope?

I agree that there are roughly 30,000 AS numbers visible in BGP, so it's reasonable to assume on the order of 30,000 relying parties who will be routinely querying the repository system. We might also assume that 30,000 is a reasonable order of magnitude for the number of CAs in the RPKI (we might easily average 2 CAs per AS, but surely not 10 CAs per AS).

Certainly when I modelled a large tree, the numbers I used were of this order of magnitude. I think I did around 10,000 distinct CAs.

However, one thing that wasn't clear from reading your analysis was how many CAs a given repository server would be hosting. If a server run by a large ISP or an RIR was providing a cache of all RPKI data, then clients would have longer connections to this server (as they could retrieve much of the data they need in one place), but they would be unlikely to receive requests from all 30,000 relying parties (e.g. an ISP might provide a complete cache for their customers but for non-customers they would typically only serve data for which they are authoritative). Alternatively, if a server is only serving data for a small subset of the CAs in the RPKI, then it might receive requests from all relying parties, but those sessions would tend to be short (especially when nothing has changed).

RIRs have of the order of 2000-5000 direct entities to whom they will be issuing CA certificates. If they run hosted portals, it's numbers of this magnitude which potentially could seek to use a hosted portal for EE certificate facing activity.

I don't believe individual entities below the RIR level face administering CA spaces of this complexity. I think they are more probably under the 100 to 1000 CA/EE level.

If people believe they want to host third party solutions, then I think the size lies anywhere between the 1-10 and the 30,000 level. "It depends."

Aggregated repositories would of course have different dynamics of change.

In any case, I believe the way forward (with regards to server load) is to answer the question, "How many simultaneous connections are reasonable for a server that hosts publication points for X CAs?" and then work backwards from there to determine if a given interval of relying party requests is reasonable from the server standpoint. I admit that I haven't completely thought through re-key, but I'll try to dig up some rough connection-time numbers based on our relying party software, and do a few back-of-the-envelope computations.

This would be good. But I don't think it goes to an architecture question. I think it would be far better to do this work, and draft an operational guidelines document instead, which puts this experience into a context which is more amenable to change over time.
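
As a rough illustration of the kind of back-of-the-envelope computation proposed above, the sketch below applies Little's law (mean concurrent sessions = arrival rate x mean session time). It assumes fetches are spread evenly over the day, which is the best case. The 10-second session length is a placeholder invented for the example and not a measured figure, and the function name is likewise made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mean_concurrent_sessions(relying_parties, fetches_per_day, mean_session_seconds):
    """Little's law: mean sessions in progress = arrival rate * session time.

    Assumes fetches are spread evenly across the day (no midnight spike).
    """
    arrivals_per_second = relying_parties * fetches_per_day / SECONDS_PER_DAY
    return arrivals_per_second * mean_session_seconds


# 30,000 relying parties is the figure from the thread; 8 fetches a day is the
# 3-hourly cycle; the 10 s session length is an ASSUMED placeholder.
load = mean_concurrent_sessions(30_000, fetches_per_day=8, mean_session_seconds=10)
print(f"about {load:.1f} sessions in progress on average")
```

Working backwards, a server operator would plug in a measured session time and compare the result with the number of simultaneous connections the server can reasonably hold. Unevenly spread fetches would push the peak well above this average.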

With regards to client load, I'm not convinced that there's any problem with frequent queries to the repository system. If the relying party queries a publication point and rsync determines that nothing has changed, then no changes are required to the relying party's local cache and no cryptographic calculations are required.

Please bear in mind that some amount of held state exists server side to determine what has changed. It might be a file-system walk and on-demand checks on files, for stat() information. It might be a simple DB lookup. But in either case, it's not 'for free': some level of work is being done.


And, held TCP/IP session state.

Is this of the same order of magnitude as running a web server? I think that it very probably is. Particularly ones which are doing dynamic content serving, rather than simple cached state.

Their solutions (million-chickens, reverse caches, memcached ...) are probably applicable here.

If something has changed, then the relying party has to perform validation (which includes cryptographic signature verification) on the manifest and any new objects that have been added. (Additionally, there may be resulting changes to the client's local cache ... e.g., if a new CRL revokes a previously valid certificate ... but such changes don't require new cryptographic computations, and so I believe the bottleneck is going to be the one or two signature verifications per object changed [1]). Now the point from the relying party side is that if 5,000 manifests change and 10,000 signed objects are added to the repository system on a given day, then the relying party needs to do roughly 30,000 signature verifications regardless of whether it learns of all these changes at once, or whether it learns of them in small batches throughout the course of the day. Therefore, I don't see how making frequent checks for new data has a significant impact on the relying party's processing load.
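
The sum above can be checked with a few lines. This sketch takes the upper figure of two signature verifications per changed object, and assumes for illustration that the day's changes arrive in eight equal batches (one per 3-hourly fetch). The function name is made up for the example.

```python
def signature_verifications(objects_changed, per_object=2):
    """Verifications needed, taking the upper bound of two per changed object."""
    return objects_changed * per_object


changed_manifests = 5_000
new_signed_objects = 10_000
total_objects = changed_manifests + new_signed_objects

# Learning of everything in a single daily fetch.
all_at_once = signature_verifications(total_objects)

# Learning of the same changes in eight equal batches (assumed even split).
batches = [total_objects // 8] * 8
in_batches = sum(signature_verifications(b) for b in batches)

print(all_at_once, in_batches)  # 30000 30000
```

The total is the same either way. Fetch frequency changes when the verification work happens, not how much of it there is.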

I'm also unsure from this sum what the benefit of frequent checks was. Is it faster completion of the cycle time around re-key? What's it benefiting?

Finally, in addition to server and relying party processing loads, one must also look at the benefit of frequent repository fetches. Keep in mind that a relying party has no way of distinguishing the following two events: (A) a route advertisement is originated by an AS that is authorized to advertise the route, but the relying party hasn't fetched recently enough to obtain the new ROA; and (B) a route advertisement is originated by an unauthorized entity that is attempting to hijack address space. In this discussion, it is also important to note that manifests can guarantee that the relying party received all signed objects that existed at the moment that the manifest was published (i.e., a manifest can detect malicious deletion of data from a repository or corruption of data in transit) but the manifest says nothing about data that may have been added since the manifest was issued. This is why there is benefit in a relying party going back to the publication point periodically to see whether a new manifest has been issued.
In any case, it's good to know that we'll have plenty to talk about in Hiroshima.


Yes. I think there is a discussion here.

If we could abstract these questions to a yet-to-be-done Operations document, I think we could continue to talk about Architecture documents as plausibly 'done' and this means we can progress WGLC.


What do you think?

-George


- Matt Lepinski

