Matt, if you don't mind, I'd like to add some input to this discussion
too.
On 28/10/2009, at 6:28 AM, Matt Lepinski wrote:
Geoff,
I'm happy to accept that the new wording is poor, but I'm pretty
sure the old wording was also bad, and I think this discussion is
important.
The old wording could easily be interpreted to suggest that once per
day was the correct frequency for pulling from a repository. (That
is, I believe the previous version was making a de facto
recommendation for a default behavior of one pull every 24 hours ...
there wasn't a RECOMMEND in the text, but we all know that examples
tend to be normative in this type of document.)
1) So the first implicit question is: Should the working group be
making a recommendation as to the frequency with which a relying
party pulls from the repository?
I think we have to think about it, yes. But in what document? Why in
an Architecture document, this close to 'closure'?
Or equivalently: Is there a "wrong" frequency that people might use
if we didn't give them any guidance?
It seems that retrieving updates "too frequently" (e.g., every 5
minutes) strains the repository system and that retrieving updates
"too infrequently" (e.g., monthly) means that when I inject a new
ROA into the system, it will take "unacceptably long" for this
information to propagate to the relying parties that make use of
this information. Therefore, we should have text in the document
that articulates some middle ground that we believe is reasonable
for the Internet. (I make no claims that the current text in the
document achieves this goal.)
2) The second question is: If we make a recommendation regarding
frequency with which relying parties should pull updates, what
frequency should we recommend.
The goal set earlier on in the life of this project was to stabilize
the system in two complete 24h work cycles of the repository system as
a whole. And yes, this was predicated on a MINIMUM of one fetch per
24h per RP.
At least, that's what I understood. Maybe I was wrong?
If certification-active 'producers' approach each other in the
provisioning process at least 4 times a day, and this is suitably
spread into 24h, I believe that we satisfy this 2x 24h cycle time
goal, for relying parties, for the expected 'depth' maximums in the
observed allocation hierarchy today.
I have not seen any discussion which suggests either we're changing
from 2 x 24h relying party cycles, or from an expected depth of at
worst 8 levels of delegation. (there might well be deeper trees. I
expect that they lie in the margins of global routing. I also believe
now that the tree will be far shallower, for the overwhelming majority
of the space)
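
One possible reading of the arithmetic behind these numbers, offered as
a rough sketch rather than anything worked out on the list: assume a
change has to wait up to one full sync interval at every level of
delegation before it moves one level further. The function name and the
"one interval per level" model are assumptions made for illustration.

```python
def worst_case_propagation_hours(depth: int, syncs_per_day: int) -> float:
    """Worst-case hours for a change to cross `depth` levels.

    Assumed model: at each level the change waits at most one full
    sync interval (24h / syncs_per_day) before being picked up.
    """
    interval_hours = 24 / syncs_per_day
    return depth * interval_hours


# 8 levels of delegation, 4 syncs per day (one every 6 hours).
print(worst_case_propagation_hours(depth=8, syncs_per_day=4))  # 48.0
```

Under that assumed model, 8 levels at 4 syncs a day lands exactly on the
2 x 24h goal. A shallower tree finishes proportionally sooner.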
So, I don't understand why we've moved to 3 hourly cycle time for RPs.
I don't see where this came from.
Here, I understand that "everyone hitting the repository system at
once" is a bad outcome regardless of the frequency that we
recommend. That is, regardless of whether we recommend "once per
day", "once per month", or "eight times daily" we will likely see
problems with too much server load at midnight. If anyone can
recommend text to avoid this phenomenon (i.e., to encourage people to
spread out their queries to the repository system), please send text.
Well, the discussed approach informally has been to use a simple
randomization mechanism to select a cron time in the 24h window, such
that people choose a start time at random, and then cycle in a
reasonable sub-multiple of that time.
This has worked for other processes historically.
Here is one example, from a Google search:
http://www.moundalexis.com/archives/000076.php
It's not directly what I mean, but I am sure people can go back into
their UUCP dial-sync days, or USENET call memory-stack, and remember
approaches for doing this.
Or, a provider could publish a simple 'pick a token' system and advise
RPs to go there, and be assigned a random slot for fetch to balance
load.
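
The random start time idea above could be sketched roughly as follows.
This is only an illustration: the function names are hypothetical, and
the requirement that the number of fetches divide the day evenly is an
assumption standing in for "a reasonable sub-multiple".

```python
import random

SECONDS_PER_DAY = 24 * 60 * 60


def pick_fetch_schedule(fetches_per_day, rng=None):
    """Return sorted fetch times, in seconds after midnight.

    A start time is drawn uniformly from the 24h window, then the
    fetches repeat at an even sub-multiple of the day from that start.
    """
    if fetches_per_day < 1 or SECONDS_PER_DAY % fetches_per_day != 0:
        raise ValueError("fetches_per_day must divide the day evenly")
    if rng is None:
        rng = random.Random()
    period = SECONDS_PER_DAY // fetches_per_day
    start = rng.randrange(SECONDS_PER_DAY)
    return sorted(
        (start + i * period) % SECONDS_PER_DAY for i in range(fetches_per_day)
    )


def as_clock(seconds):
    """Format seconds after midnight as HH:MM, e.g. for a cron entry."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}"


# Each RP runs this once at install time and keeps the result.
print([as_clock(s) for s in pick_fetch_schedule(4)])
```

Because every RP draws its own start time, the population as a whole
spreads across the day even if everyone uses the same period.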
I don't think we can recommend a fetch cycle period until we
understand the dynamics of change in the repository.
So I'd pose questions at this time:
what rate of delay against a change in the repository is acceptable
to an RP?
what impact does depth in the tree have on change?
how many cycles of update are allowed to pass before the tree is
considered 'stable' against significant change?
is this in fact not deterministic but depends on context which is
out of scope?
I agree that there are roughly 30,000 AS numbers visible in BGP, so
it's reasonable to assume on the order of 30,000 relying parties who
will be routinely querying the repository system. We might also
assume that 30,000 is a reasonable order of magnitude for the number
of CAs in the RPKI (we might easily average 2 CAs per AS, but surely
not 10 CAs per AS).
Certainly when I modelled a large tree, these numbers were of the
order of magnitude I modelled. I think I did around 10,000 distinct CAs.
However, one thing that wasn't clear from reading your analysis was
how many CAs a given repository server would be hosting. If a server
run by a large ISP or an RIR was providing a cache of all RPKI data,
then clients would have longer connections to this server (as they
could retrieve much of the data they need in one place), but they
would be unlikely to receive requests from all 30,000 relying
parties (e.g. an ISP might provide a complete cache for their
customers but for non-customers they would typically only serve data
for which they are authoritative). Alternatively, if a server is
only serving data for a small subset of the CAs in the RPKI, then it
might receive requests from all relying parties, but those sessions
would tend to be short (especially when nothing has changed).
RIRs have on the order of 2000-5000 direct entities to whom they will
be issuing CA certificates. If they run hosted portals, it's numbers of
this magnitude which potentially could seek to use a hosted portal for
EE certificate facing activity.
I don't believe individual entities below the RIR level face
administering CA spaces of this complexity. I think they are more
probably under the 100 to 1000 CA/EE level.
If people believe they want to host third party solutions, then I
think the size lies anywhere between the 1-10 and the 30,000 level.
"it depends"
Aggregated repositories would of course have different dynamics of
change.
In any case, I believe the way forward (with regards to server load)
is to answer the question, "How many simultaneous connections are
reasonable for a server that hosts publication points for X CAs?"
and then work backwards from there to determine if a given interval
of relying party requests is reasonable from the server standpoint.
I admit that I haven't completely thought through re-key, but I'll
try to dig up some rough connection-time numbers based on our
relying party software, and do a few back-of-the-envelope
computations.
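
As a rough illustration of the kind of back-of-the-envelope computation
meant here, the average number of simultaneous sessions can be estimated
as arrival rate times session length (Little's law). The 10-second
session length below is a made-up placeholder, not a measured number,
and the model assumes relying parties spread their fetches evenly over
the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_concurrent_sessions(relying_parties, fetches_per_day, session_seconds):
    """Average simultaneous sessions at one server (Little's law).

    Assumes every relying party contacts this server on each fetch and
    that fetches are spread evenly across the day.
    """
    arrivals_per_second = relying_parties * fetches_per_day / SECONDS_PER_DAY
    return arrivals_per_second * session_seconds


# 30,000 relying parties; session length of 10s is a placeholder.
for fetches in (1, 8):
    load = average_concurrent_sessions(30_000, fetches, session_seconds=10)
    print(f"{fetches} fetch(es)/day -> about {load:.1f} sessions at once")
```

Working backwards is the same formula rearranged: given a tolerable
number of simultaneous sessions, solve for fetches per day. If fetches
cluster at midnight instead of spreading out, the peak is far higher
than this average.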
This would be good. But, I don't think it goes to an architecture
question. I think it would be far better to do this work, and draft an
operational guidelines document instead which puts this experience
into a context which is more amenable to change over time.
With regards to client load, I'm not convinced that there's any
problem with frequent queries to the repository system. If the
relying party queries a publication point and rsync determines that
nothing has changed, then no changes are required to the relying
party's local cache and no cryptographic calculations are required.
Please bear in mind that some amount of held state exists server side
to determine what has changed. It might be a file-system walk with
on-demand stat() checks on files, or it might be a simple DB lookup.
But in either case, it's not 'for free': some level of work is being
done. And there is held TCP/IP session state.
Is this of the same order of magnitude as running a web server? I
think that it very probably is. Particularly ones which are doing
dynamic content serve, rather than simple cached state.
Their solutions (million-chickens, reverse caches, memcached ...) are
probably applicable here.
If something has changed, then the relying party has to perform
validation (which includes cryptographic signature verification) on
the manifest and any new objects that have been added.
(Additionally, there may be resulting changes to the client's local
cache ... e.g., if a new CRL revokes a previously valid
certificate ... but such changes don't require new cryptographic
computations, and so I believe the bottleneck is going to be the one
or two signature verifications per object changed [1]). Now the
point from the relying party side is that if 5,000 manifests change
and 10,000 signed objects are added to the repository system on a
given day, then the relying party needs to do roughly 30,000
signature verifications regardless of whether it learns of all these
changes at once, or whether it learns of them in small batches
throughout the course of the day. Therefore, I don't see how making
frequent checks for new data has a significant impact on the relying
party's processing load.
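
The arithmetic in the paragraph above can be checked with a short
sketch. It takes the upper figure of two signature verifications per
changed object, and it assumes each object changes only once during the
day; the function name and the choice of eight batches are purely
illustrative.

```python
def signature_verifications(changed_manifests, new_objects, sigs_per_object=2):
    """Signature checks needed for one batch of repository changes."""
    return (changed_manifests + new_objects) * sigs_per_object


# Everything learned in a single daily fetch.
all_at_once = signature_verifications(5_000, 10_000)

# The same changes learned in 8 equal batches through the day.
batches = 8
in_batches = sum(
    signature_verifications(5_000 // batches, 10_000 // batches)
    for _ in range(batches)
)

print(all_at_once, in_batches)  # 30000 30000
```

The totals match because the work is proportional to the number of
changed objects, not to the number of fetches. If the same manifest were
reissued several times in a day, a frequent fetcher would verify each
intermediate version, which this simple model does not capture.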
I'm also unsure from this sum what the benefit of frequent checks was.
Is it faster completion of the cycle time around re-key? What's it
benefiting?
Finally, in addition to server and relying party processing loads,
one must also look at the benefit of frequent repository fetches.
Keep in mind, that a relying party has no way of distinguishing the
following two events: (A) a route advertisement is originated by an
AS that is authorized to advertise the route, but the relying party
hasn't fetched recently enough to obtain the new ROA; and (B) a
route advertisement is originated by an unauthorized entity that is
attempting to hijack address space. In this discussion, it is also
important to note that manifests can guarantee that the relying
party received all signed objects that existed at the moment that
the manifest was published (i.e., a manifest can detect malicious
deletion of data from a repository or corruption of data in transit)
but the manifest says nothing about data that may have been added
since the manifest was issued. This is why there is benefit in a
relying party going back to the publication point periodically to
see whether a new manifest has been issued.
In any case, it's good to know that we'll have plenty to talk about
in Hiroshima.
Yes. I think there is a discussion here.
If we could abstract these questions to a yet-to-be-done Operations
document, I think we could continue to talk about Architecture
documents as plausibly 'done' and this means we can progress WGLC.
What do you think?
-George
- Matt Lepinski