Re: [sidr] Updates to rpki-rtr protocol (RFC 6810 bis)

Rob Austein Thu, 27 Mar 2014 12:20:14 -0700

At Fri, 21 Mar 2014 17:23:05 +0100, Tim Bruijnzeels wrote:
> 
> Sorry for the late reply.


Likewise :)

> > 2) We added a few timing parameters to the End Of Data PDU.  These,
> >   like the Serial Number mechanism, are lifted almost verbatim from
> >   the DNS zone transfer protocol.  We left them out of RFC 6810, but
> >   subsequent exploration of some of the corner cases of the RPKI
> >   Router protocol convinced us that leaving these timing parameters
> >   out of the protocol had been a mistake.
> 
> Can you elaborate a bit on why this had been a mistake?

Back around Halloween, Randy and I received an off-list message asking
about the end of RFC 6810 section 8.  Our correspondent is welcome to
jump in here and speak for himself, but he's not one of the usual IETF
suspects and may not be on list, so I'll attempt to channel him for
the moment.

He pointed out that RFC 6810 is inconsistent in its advice about what
the router should do if it loses connectivity to the cache for more
than two hours: in one place we say that the router SHOULD delete the
data it got from the cache, in another place we say that the router
should retain the data.  Furthermore, RFC 6810 is unclear on whether
the router should retain the data at all and, if it does, what the
expiration period should be.

The DNS zone transfer protocol, on which the rpki-rtr protocol is
based, conveys explicit answers to all of these questions to the
router, in the form of the timing parameters included in the SOA
record.  Upon reading this question, I concluded that leaving the rest
of these out of the rpki-rtr protocol had been a mistake.  Granted,
this just pushes some of the decisions back from the router to the
cache, but the cache is (probably) in a better position to know the
right answers, or, at least, plausible harmless answers.

That said: the proposed defaults in the current draft, while not quite
picked out of thin air, were not the result of years of detailed
analysis, so I'm open to suggestions for better values.

> With regards to the minimum allowed value for the refresh. Why is it
> 120 seconds? That seems rather long to me. I understand that
> currently updates do not happen that often and propagate that
> quickly, but that may change. And if it does, and the RP tool sees
> more regular changes and it is confident that it can deal with load,
> then I think it would be better if it could use a lower number
> without having to revisit this text.

Part of the point of the REFRESH and RETRY parameters is to allow the
cache to specify how much beating it is willing to accept from the
router(s).  In that context, telling the router to check for updates
every minute seems a bit excessive.

Keep in mind that the current recommendation is to poll hourly, and
that the real mechanism for pushing updates out quickly is the Notify
PDU (which we also lifted almost verbatim from DNS, no surprise).

There may be a better way to phrase the text in the draft, but the
REFRESH parameter is intended to control how often the router polls in
the absence of a Notify PDU.  Assuming that the cache generates Notify
PDUs, this kind of polling is almost unnecessary; we included it
because Notify PDUs are just a hint to the router that now would be a
good time to poll, which the router is free to ignore.  The REFRESH
value tells the router that it really should consider polling at that
interval if it hasn't been responding to Notify PDUs.
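
A minimal sketch of the router-side decision described above may make
the interplay clearer.  This is illustrative Python, not text from the
draft: the names Timers, should_poll, and data_expired are
hypothetical, the timer values in the usage note are placeholders, and
the choice to let RETRY take precedence over a pending Notify is an
assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timers:
    """Timing parameters as advertised by the cache (hypothetical container)."""
    refresh: int  # seconds between polls when no Notify arrives
    retry: int    # seconds to wait after a failed poll before trying again
    expire: int   # seconds after which unrefreshed data should be dropped


def should_poll(now: int, last_success: int, last_failure: Optional[int],
                notify_seen: bool, t: Timers) -> bool:
    """Return True if the router should query the cache now."""
    # A failure more recent than the last success puts us in retry mode.
    # Assumption of this sketch: RETRY wins even if a Notify is pending.
    if last_failure is not None and last_failure > last_success:
        return now - last_failure >= t.retry
    # A Notify is only a hint, but a cooperative router acts on it.
    if notify_seen:
        return True
    # Otherwise fall back to periodic polling at the REFRESH interval.
    return now - last_success >= t.refresh


def data_expired(now: int, last_success: int, t: Timers) -> bool:
    """Return True if data from this cache should no longer be used."""
    return now - last_success >= t.expire
```

With Timers(refresh=3600, retry=600, expire=7200), a router that never
sees a Notify polls an hour after its last successful sync, and one
that has been unable to sync for two hours treats its data as expired.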

As a demented counterexample, I guess one could build a router which
refuses to process Notify PDUs at all but which wants to be kept up to
date so badly that it wants to poll every five seconds.  I will
confess to not having a lot of sympathy with such a design.

All that said, if you have a proposal for a better minimum, holler.

> With regards to the retry and expire intervals: I am not sure what I
> (as cache) should tell the router, so I would probably just go for
> the defaults.

That's fine, if you have no better idea, although you might consider
making them configurable by the cache operator.

> Reading them, they look to me more like responsibilities of the
> client. But please let me know if I am missing something below.

The thing you may be missing is that these are likely to be tied to
things like loading on your cache (e.g., how many routers is this poor
little cache serving, and are you running the cache on a TRS-80), how
often your cache fetches data from the RPKI, and what kinds of
certificate expiration times people are using to populate the RPKI.
Granted, the cache may not be much smarter about some of this than the
router is, but, pretty much by definition, the cache can't possibly be
dumber about these issues than the router is.

> Did you intend for the expire interval to be something that an
> operator can configure in the cache, and have it automatically
> propagated to the client routers? I think I would prefer to
> configure it on the routers.

Says the author of a cache implementation :)

What I got from the original query that started me down this path is
that the router guys don't really know either.  Nothing you say can
require them to hold data longer than they want to do so, but you, as
the cache, ought to be able to tell them that it's a really bad idea
to hold data longer than X.

You, as the cache, have some idea of what your own update frequency is
and can, if you choose, measure what kind of churn rate you're seeing,
from which you can determine (at least as well as the router can) how
long it is before N percent of the data have changed.  Granted, you
might not bother to do that; you might instead choose to hardwire an
advertised expiration time measured in years.  But since the cache is
stripping off all the certificate expiration timestamps, the cache by
definition knows more about this than the router does, so it seems
advisable to have some way for the cache to tell the router when old
data should go away.
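
As a rough illustration of the churn estimate mentioned above, the
sketch below assumes a deliberately simple model: each cache update
changes a fixed fraction of the records, independently of earlier
updates.  That model, the function name seconds_until_changed, and
its parameters are all assumptions made for the example, not anything
specified in the draft.

```python
import math


def seconds_until_changed(fraction_per_update: float,
                          update_interval: float,
                          target_fraction: float) -> float:
    """Estimate the time until target_fraction of the data has changed.

    Assumes each update independently changes fraction_per_update of
    the records, so the unchanged share after k updates is
    (1 - fraction_per_update) ** k.
    """
    if not 0 < fraction_per_update < 1:
        raise ValueError("fraction_per_update must be between 0 and 1")
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be between 0 and 1")
    # Solve (1 - p) ** k = 1 - N for k, the number of updates needed.
    updates = math.log(1 - target_fraction) / math.log(1 - fraction_per_update)
    return updates * update_interval
```

A cache operator could feed measured values into something like this
to pick an advertised expiration time, rather than hardwiring one.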

> With regards to the retry interval, I am not sure if there should be
> any limit. Why can't the router decide, and what is the synergy
> with having multiple caches configured in a router, as described in
> section 10?

RETRY is one cache telling the routers that use it how often it's
willing to be beaten up when there's a problem refreshing data.
Different cache, (potentially) different RETRY.

> I think I may prefer to not limit the router's retries at all, and
> recommend operators to monitor their cache. Do we need a standard
> query type for the latter? At least to get the cache's own opinion
> on its health that could be integrated with monitoring?

I don't understand how this would work well enough to have an informed
opinion, but it sounds more complicated than DNS-like RETRY.  Where's
the gain?  Also, "recommend operators to monitor their cache" sounds
like you're talking about human beings.  Certainly there are some in
the NOC, but do we really need to drag them in whenever a router fails
to sync with its cache(s)?

Keep in mind that, for all of these parameters, the cache's
responsibility is probably just to pass along settings that were
configured by the human who installed the cache software (which may in
turn just be passing along the default values from the spec).  The
point of the exercise here is to make these things explicit, visible,
and configurable by the operator, rather than leaving them in the
realm of things which the router implementor may or may not have made
configurable in some implementation-specific way.
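
To make the "pass along what the operator configured, else the spec
defaults" idea concrete, here is a small sketch of what a cache might
do at startup.  The default values below are placeholders chosen for
illustration, not the draft's actual defaults; the 120-second floor is
the minimum REFRESH discussed earlier in this thread; the check that
EXPIRE exceeds REFRESH is a sanity check assumed for this sketch; and
effective_timers is a hypothetical name.

```python
# Placeholder defaults for illustration only, NOT the draft's values.
SPEC_DEFAULTS = {"refresh": 3600, "retry": 600, "expire": 7200}

# Minimum REFRESH value as discussed in this thread.
MIN_REFRESH = 120


def effective_timers(operator_config: dict) -> dict:
    """Merge operator settings over the defaults and sanity-check them."""
    timers = dict(SPEC_DEFAULTS)
    for name, value in operator_config.items():
        if name not in SPEC_DEFAULTS:
            raise KeyError(f"unknown timing parameter: {name}")
        timers[name] = int(value)
    if timers["refresh"] < MIN_REFRESH:
        raise ValueError(f"refresh must be at least {MIN_REFRESH} seconds")
    # Assumed sanity check: data should outlive at least one poll interval.
    if timers["expire"] <= timers["refresh"]:
        raise ValueError("expire should be larger than refresh")
    return timers
```

The resulting values are what the cache would advertise to routers; an
empty operator configuration simply yields the defaults.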

