Hi,

There are a number of separate discussions about problems with the rpki 
repository and ways to mitigate those problems going on on the list at the 
moment.

First of all, let me say: the current system works most of the time, but we 
are finding issues that I think should be fixed, and they are not trivial.

So my take on this is that we've learned from operational experience, and it 
would now be good to (1) enumerate the problems that we see, (2) refine a 
list of requirements for improvement, and then (3) find ways forward to address 
these requirements without breaking the existing infrastructure. As some of 
you know from discussions we had, I do actually have some ideas in the solution 
space, but before going there in detail (beyond proof of concept) I think the 
WG should address (1) and (2) first.

It may be best if we could discuss this face to face, e.g. at one of the 
upcoming interim meetings. I am not a huge fan of interim meetings, but I am 
afraid this is a difficult subject on which to find consensus on the list, and 
too big to discuss in a two-hour IETF slot. My preference would be to discuss 
this in the meeting planned just before the Vancouver IETF: I am already 
planning to travel to that one, and I expect it will be the easiest to attend 
for most people.


Since I don't know if and when this will happen, though, let me write down my 
ideas here, without going into solution space, where I think face-to-face or at 
least interactive presentations are most needed to make progress.


1 = Current problems we encounter (implementing a validator and running a 
publication point):

= Updates happen at the publication point while the validator is rsync'ing:
  = We may miss objects that are on the manifest
  = We may find objects that are not on the manifest (we actually ignore these)
  = The CRL may be newer and revoke the MFT
  = The MFT may be newer, so the CRL hash it lists is wrong

  All this makes it very difficult for the RP to make clear automated 
validation decisions.
  And we all know that *no one* reads the logs... or even if they do, most 
people won't know how to decide.
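To make the race concrete, here is a minimal sketch of the cross-check an RP 
ends up doing after a fetch: compare what it received against the manifest 
listing and classify the differences. The filenames and object contents are 
hypothetical; a real validator would of course work on DER-encoded objects and 
also handle the CRL/MFT ordering problems above.

```python
import hashlib

def check_against_manifest(manifest_entries, fetched_objects):
    """Compare fetched repository objects against the manifest listing.

    manifest_entries: dict of filename -> expected SHA-256 hex digest
    fetched_objects:  dict of filename -> raw object bytes

    Returns (missing, extra, mismatched) filename lists.
    """
    missing = [n for n in manifest_entries if n not in fetched_objects]
    extra = [n for n in fetched_objects if n not in manifest_entries]
    mismatched = []
    for name, expected in manifest_entries.items():
        if name in fetched_objects:
            actual = hashlib.sha256(fetched_objects[name]).hexdigest()
            if actual != expected:
                mismatched.append(name)
    return missing, extra, mismatched

# Toy example: one object missing, one not listed, one with a stale hash.
mft = {
    "roa1.roa": hashlib.sha256(b"roa1-content").hexdigest(),
    "roa2.roa": hashlib.sha256(b"roa2-content").hexdigest(),
}
fetched = {
    "roa1.roa": b"roa1-STALE",    # changed mid-transfer
    "orphan.roa": b"not-on-mft",  # present on disk, absent from the MFT
}
missing, extra, mismatched = check_against_manifest(mft, fetched)
print(missing, extra, mismatched)  # -> ['roa2.roa'] ['orphan.roa'] ['roa1.roa']
```

The hard part is not the check itself but what to *do* with each of the three 
outcomes, which is exactly where the ambiguity lies.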

= PFX-validate (prefix origin validation) assumes knowledge of *all* applicable 
ROAs, but the RP may not get all the data
  = The only way an RP can detect this is by looking at the manifest.
  = But the manifest is considered, at least by the people I talk to, not to be 
authoritative on this.
  = The reason why a ROA might be missing is not clear to the RP:
      = A man in the middle may filter bits of information (hold back a ROA, 
MFT, or CRL)
      = There may be a bug in the CA
      = There may be a bug or problem at the publisher (e.g. someone *deleted* 
the ROA on disk)
      = There may be a race condition: stuff is changing while we look, as 
described above

= We need to call 'rsync' on the system (there is only one implementation, and 
libraries are unavailable for most programming languages)
  = Forks cause CPU overhead and may (temporarily) double the memory usage (at 
least on the JVM)
  = Parallel processing that requires system calls does not scale
  = We need rsync installed and on the path
  = We have to trust whatever version is installed
  = We need to make sense of exit codes to produce useful error messages and 
take action (or inform the user)
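For illustration, a sketch of what shelling out to the system rsync looks like 
in Python. The exit-code meanings are the ones documented in the rsync(1) man 
page, and fetch_repository is a hypothetical helper, not any validator's real 
API; it shows the PATH check and exit-code translation that every 
implementation ends up duplicating.

```python
import shutil
import subprocess

# A few exit codes documented in rsync(1); anything else is reported raw.
RSYNC_EXIT_CODES = {
    0: "success",
    23: "partial transfer due to error",
    24: "partial transfer due to vanished source files",
    30: "timeout in data send/receive",
    35: "timeout waiting for daemon connection",
}

def fetch_repository(uri, local_dir):
    """Fetch a publication point with the system rsync binary.

    Returns (ok, message). A real validator would also catch
    subprocess.TimeoutExpired and decide whether to keep stale data.
    """
    rsync = shutil.which("rsync")
    if rsync is None:
        return False, "rsync not found on PATH"
    result = subprocess.run(
        [rsync, "-rtz", "--delete", uri, local_dir],
        capture_output=True, text=True, timeout=300,
    )
    message = RSYNC_EXIT_CODES.get(
        result.returncode, "rsync exited with code %d" % result.returncode)
    return result.returncode == 0, message
```

Every RP implementation has to write (and maintain) something like this, per 
language, because there is no library to link against.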

= Publication point availability
  = We don't know of any commercial CDNs that support rsync
  = Doing this ourselves with multiple rsync servers (and e.g. anycast) is not 
trivial
  = An RP may end up on different (out-of-sync) mirror servers and get 
inconsistent responses, which does not help
  = To limit load on back-ends, the general advice to RPs has been not to be 
too aggressive, even though they 'want' fresh data

= Local policy knobs in validation
  = The absence of sensible defaults makes it difficult to automate validation
  = Uncertainty here for implementers is even worse for end users: they really 
just want to know:
       "so, is this thing *valid*, or not?"
  = Giving them a knob confuses them and lowers trust in the system


These are my major findings, at least; there may be more, e.g. with regard to 
the hierarchical repository lay-out, or the absence thereof, causing issues. 
Our validator gets around that by having additional configuration for the RIR 
TAs to do additional pre-fetching of the repositories. This helps, so yes, 
either this work-around or a change in those repository lay-outs would help 
RPs. Having said that, the problems that I enumerated above remain, in my 
opinion.


Chris suggested that we do some more coordinated measurements. I think this is 
an excellent idea and I would like to help with this effort. If possible it 
would be very worthwhile to get some quantifiable sense of the issues. Apart 
from monitoring over time, it would also be interesting to do some aggressive 
testing: load-stressing a publication point, setting up a large, high-churn 
test repository, or setting up different validators to update far more often 
(like every few minutes) and seeing what breaks.
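On the monitoring side, a sketch of one cheap way to quantify churn, assuming 
(hypothetically) that we can poll a publication point and list its objects: 
digest each snapshot and count how often consecutive polls differ.

```python
import hashlib

def snapshot_digest(objects):
    """One digest per repository snapshot: hash over sorted
    (name, content-hash) pairs, so two polls can be compared cheaply."""
    h = hashlib.sha256()
    for name in sorted(objects):
        h.update(name.encode("utf-8"))
        h.update(hashlib.sha256(objects[name]).digest())
    return h.hexdigest()

def count_changes(polls):
    """Count how many consecutive polls saw a changed repository state."""
    digests = [snapshot_digest(p) for p in polls]
    return sum(1 for a, b in zip(digests, digests[1:]) if a != b)

# Three polls of a toy repository: one object changes between poll 1 and 2.
polls = [
    {"roa1.roa": b"v1"},
    {"roa1.roa": b"v2"},
    {"roa1.roa": b"v2"},
]
print(count_changes(polls))  # -> 1
```

Run against a real publication point every few minutes, this would also surface 
the mid-transfer inconsistencies described in section 1, since a snapshot taken 
during an update would digest differently from both its neighbours.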



2 = Possible requirements for moving forward:

This is not a complete or final list, of course. I am very interested in your 
additions. Even though there may be requirements that we're not able to meet, 
there is still value in listing them and deciding.

= New schemes should iterate on existing infrastructure without breaking it
= If more than one retrieval mechanism is allowed, then *objects* should be 
uniquely identifiable
= Serving inconsistent data to relying parties should be prevented
= It should be detectable for an RP when it does not get all the data a CA 
intended to publish
= Protocols should have ubiquitous support across programming languages
= Update semantics and the protocol should allow for distributed caching
= Local policy knobs arising from validation uncertainties should be avoided 
as much as possible
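On the "uniquely identifiable objects" requirement, a minimal sketch of what I 
mean: derive the identifier from the object's content, so that the retrieval 
mechanism becomes irrelevant to identity. This is a hypothetical naming scheme 
for illustration, not an existing mechanism.

```python
import hashlib

def object_id(der_bytes):
    """Content-derived identifier: the same object fetched over rsync or any
    other transport maps to the same id (hypothetical scheme)."""
    return hashlib.sha256(der_bytes).hexdigest()

# Two copies fetched via different mechanisms are recognisably the same
# object; a modified copy is not.
a = object_id(b"example-der")
b = object_id(b"example-der")
c = object_id(b"example-DER")
print(a == b, a == c)  # -> True False
```

With identifiers like this, a mirror or cache can serve objects without being 
trusted for their identity, which also helps the distributed-caching 
requirement above.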






Regards,

Tim

_______________________________________________
sidr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/sidr