Re: [routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)

2021-02-22 Thread Nathalie Trenaman
Hi Daniel,

> On 19 Feb 2021, at 09:33, Daniel Suchy via routing-wg wrote:
> 
> Hi Nathalie,
> 
> On 2/19/21 9:15 AM, Nathalie Trenaman wrote:
>> We can do many things but our main concern is to implement what is
>> needed in a way that we can manage effectively and with input from all
>> the relevant stakeholders. You've provided a big list here and some of
>> these are already on our roadmap. For example, ROV in AS, we are
>> working on this, and we expect to come with an announcement soon.
> 
> Can you share your roadmap? I think plans and timelines should also be open.
> As these plans exist, you can simply publish the document(s) for those who are
> interested.
> I think the community should be informed about plans in advance, not just
> ex post through "marketing" announcements about things already done.

I shared the RPKI roadmap on RIPE Labs last year:
https://labs.ripe.net/Members/nathalie_nathalie/where-were-at-with-rpki-resiliency

The work plan for this year has recently been finalised and will first be
presented to the Executive Board in March. After that, I will publish another
RIPE Labs article with the work plan for this year and announce it in this
working group as well.


> 
>> Also, open-sourcing the RPKI core is on our roadmap, but this will take
>> a bit longer.
> 
> Can you explain in detail what the problem is with open-sourcing the RPKI core
> (publishing its code)? Are there legal reasons, or is something else
> blocking publication of the code you're using? As above, how long do we have
> to wait? I think (open) community review is also important from a security
> perspective for this critical part of Internet infrastructure.
> 

I agree that open sourcing the RPKI core is important, and so are many other 
things that we are working on. Please be assured that this is on our radar and 
we’ll move forward with this as soon as we can. 

> - Daniel
> 

Kind regards,
Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC

Re: [routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)

2021-02-19 Thread Daniel Suchy via routing-wg

Hi Nathalie,

On 2/19/21 9:15 AM, Nathalie Trenaman wrote:

> We can do many things but our main concern is to implement what is
> needed in a way that we can manage effectively and with input from all
> the relevant stakeholders. You've provided a big list here and some of
> these are already on our roadmap. For example, ROV in AS, we are
> working on this, and we expect to come with an announcement soon.


Can you share your roadmap? I think plans and timelines should also be
open. As these plans exist, you can simply publish the document(s) for
those who are interested.
I think the community should be informed about plans in advance, not just
ex post through "marketing" announcements about things already done.



> Also, open-sourcing the RPKI core is on our roadmap, but this will take
> a bit longer.


Can you explain in detail what the problem is with open-sourcing the RPKI core
(publishing its code)? Are there legal reasons, or is something
else blocking publication of the code you're using? As above, how long
do we have to wait? I think (open) community review is also important from a
security perspective for this critical part of Internet infrastructure.


- Daniel



Re: [routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)

2021-02-19 Thread Nathalie Trenaman
Hi Job,

See my responses inline in your final section...

> On 17 Feb 2021, at 16:58, Job Snijders via routing-wg wrote:
> 
> Dear RIPE NCC,
> 
> On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
>>> The multitude of RPKI service-impacting events resulting from
>>> maloperation of the RIPE NCC trust anchor are starting to give me
>>> cause for concern.
>> 
>> I’m sorry to hear this. Transparency is key for us, this means that we
>> report any event. In this case, we were not compliant with our CPS and
>> this non-publishing period had operational impact.
> 
> From the previous email there might be a misunderstanding about what
> rpki-client and Routinator do. Both utilities help Relying Parties
> validate X.509 signed CMS objects and transform the validated content
> into authorizations and attestations. Neither utility is an SLA or CPS
> compliance monitor. RIPE NCC - as CA operator - needs different tools.
> 
> Neither utility has been designed to interpret the Certification
> Practice Statement (written in natural language) and subsequently
> programmatically transform the described 'Service Level' into metrics
> suitable for monitoring.
> 
> A relying party can never tell the difference between a publication
> pipeline being empty because CAs didn't issue new objects, or a
> publication pipeline being empty because of a malfunction in one of RIPE
> NCC's RPKI subsystems.
> 
> More examples of 'out of scope' functionality for Relying Party
> software: validators don't monitor whether lirportal.ripe.net is
> functional, whether RIPE NCC's BPKI API endpoints are operational, or
> whether LIRs paid their invoices, the list is quite long. The validators
> by themselves are the wrong tool for RPKI CPS/SLA monitoring.
> 
> You state "transparency is key for us", but I fear ad-hoc low-quality
> a-posteriori reports are not the appropriate mechanism to impress and
> reassure this community regarding 'transparency'.
> 
> I have some tangible suggestions to RIPE NCC that will improve the
> reliability of the Trust Anchor and potentially help rebuild trust:
> 
> A need for Certificate Transparency
> ---
> 
> RIPE NCC should set up a Certificate Transparency project which publicly
> shows which certificates (fingerprints) were issued when, and store such
> information in immutable logs, accessible to all.
> 
> How can anyone trust a Trust Anchor which does not offer transparency
> about its issuance process?
> 
> Lack of transparency to signer software
> ---
> 
> The RIPE NCC WHOIS database software is open source, as is most of the
> software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has
> undertaken over the years.
> 
> Why has the signer source code still not been open sourced? Why can't members
> review the code related to scheduled changes? Why is an organisation
> proclaiming 'transparency' being opaque about how the RPKI certificates
> are issued?
> 
> Lack of Public status dashboard
> ---
> 
> RIPE NCC should set up a website like https://rpki-status.ripe.net/
> which shows dashboards with graphs and traffic lights related to each
> (best effort) commitment listed in the CPS. RIPE NCC should continuously
> publish & revoke & delete objects and verify whether those activities
> are visible externally, and then automatically report whether any
> potential delays observed are within the Service Levels outlined in the
> CPS.
> 
> Metrics that come to mind:
> 
> * delta between last certificate issuance & successful publication
> * Object count in the repository, repo size (and graphs)
> * Time-To-Keyroll (and graphs on duration & frequency)
> * Resource utilisation of various RPKI subsystems
> * aggregate bandwidth consumption for RPKI endpoints (including rrdp, API,
>  rsync)
> * Graphs & logs of overlap between INRs listed on EE certificates under
>  the RIPE TA and other commonly used TAs, matched against known
>  transfers. This will help detect compromises as well as understand
>  whether transfers are successful or not.
> * Unique client IP count for RSYNC & RRDP for last hour/day/week
> * Number of CS/hostmaster tickets mentioning RPKI
> 
> There are plenty of aspects to monitor; perhaps some notes should be
> copied from how the DNS root is monitored.
> 
> Lack of operational experience with BGP ROV at RIPE NCC
> ---
> 
> I believe a number of potential future incidents related to the RIPE
> NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE
> NCC themselves apply RPKI based BGP Origin Validation 'invalid ==
> reject' policies on the AS  EBGP border routers. RIPE NCC OPS
> themselves having a dependency on the RPKI services will increase
> organization-wide exposure to the (lack of) well-being of the Trust
> Anchor services, and given the short communication channels between the

Re: [routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)

2021-02-17 Thread Hank Nussbacher

  
  
+1.

-Hank

On 17/02/2021 17:58, Job Snijders via routing-wg wrote:
> [...]

Re: [routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)

2021-02-17 Thread Daniel Suchy via routing-wg

Hello,
I agree with Job that the reliability of the TA needs to be improved, and I
fully support the ideas described by Job below.


- Daniel


On 2/17/21 4:58 PM, Job Snijders via routing-wg wrote:
> [...]

[routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)

2021-02-17 Thread Job Snijders via routing-wg
Dear RIPE NCC,

On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
> > The multitude of RPKI service-impacting events resulting from
> > maloperation of the RIPE NCC trust anchor are starting to give me
> > cause for concern.
> 
> I’m sorry to hear this. Transparency is key for us, this means that we
> report any event. In this case, we were not compliant with our CPS and
> this non-publishing period had operational impact.

From the previous email there might be a misunderstanding about what
rpki-client and Routinator do. Both utilities help Relying Parties
validate X.509 signed CMS objects and transform the validated content
into authorizations and attestations. Neither utility is an SLA or CPS
compliance monitor. RIPE NCC - as CA operator - needs different tools.

Neither utility has been designed to interpret the Certification
Practice Statement (written in natural language) and subsequently
programmatically transform the described 'Service Level' into metrics
suitable for monitoring.

A relying party can never tell the difference between a publication
pipeline being empty because CAs didn't issue new objects, or a
publication pipeline being empty because of a malfunction in one of RIPE
NCC's RPKI subsystems.

More examples of 'out of scope' functionality for Relying Party
software: validators don't monitor whether lirportal.ripe.net is
functional, whether RIPE NCC's BPKI API endpoints are operational, or
whether LIRs paid their invoices, the list is quite long. The validators
by themselves are the wrong tool for RPKI CPS/SLA monitoring.

You state "transparency is key for us", but I fear ad-hoc low-quality
a-posteriori reports are not the appropriate mechanism to impress and
reassure this community regarding 'transparency'.

I have some tangible suggestions to RIPE NCC that will improve the
reliability of the Trust Anchor and potentially help rebuild trust:

A need for Certificate Transparency
---

RIPE NCC should set up a Certificate Transparency project which publicly
shows which certificates (fingerprints) were issued when, and store such
information in immutable logs, accessible to all.

How can anyone trust a Trust Anchor which does not offer transparency
about its issuance process?
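
To make the "immutable logs" idea concrete, here is a minimal sketch of an
append-only issuance log. It is an illustration only, not RIPE NCC's code and
not Certificate Transparency as specified in RFC 6962 (which uses Merkle
trees): it swaps in a simpler hash chain, where every entry commits to the one
before it. The names IssuanceLog, append and verify are hypothetical, as are
the record fields.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _hash_record(record: dict) -> str:
    """SHA-256 over the fields that an entry commits to."""
    body = {k: record[k] for k in ("fingerprint", "issued_at", "prev_hash")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class IssuanceLog:
    """Hypothetical append-only log; each entry chains to its predecessor."""
    entries: list = field(default_factory=list)

    def append(self, fingerprint: str, issued_at: str) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else GENESIS
        record = {
            "fingerprint": fingerprint,
            "issued_at": issued_at,
            "prev_hash": prev_hash,
        }
        record["entry_hash"] = _hash_record(record)
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = GENESIS
        for record in self.entries:
            if record["prev_hash"] != prev_hash:
                return False
            if record["entry_hash"] != _hash_record(record):
                return False
            prev_hash = record["entry_hash"]
        return True
```

Anyone holding a copy of the latest entry hash can detect later rewriting of
history, which is the property the suggestion above asks for. A production
design would more likely follow the Merkle-tree approach so that inclusion can
be proven without downloading the whole log.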

Lack of transparency to signer software
---

The RIPE NCC WHOIS database software is open source, as is most of the
software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has
undertaken over the years.

Why has the signer source code still not been open sourced? Why can't members
review the code related to scheduled changes? Why is an organisation
proclaiming 'transparency' being opaque about how the RPKI certificates
are issued?

Lack of Public status dashboard
---

RIPE NCC should set up a website like https://rpki-status.ripe.net/
which shows dashboards with graphs and traffic lights related to each
(best effort) commitment listed in the CPS. RIPE NCC should continuously
publish & revoke & delete objects and verify whether those activities
are visible externally, and then automatically report whether any
potential delays observed are within the Service Levels outlined in the
CPS.
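
The publish-and-verify loop described above could be sketched roughly as
follows. This is an assumption-laden illustration: the function name
publication_status is hypothetical, and the maximum delay is a caller-supplied
placeholder, not a figure taken from the CPS. The idea is that a canary object
is issued at a known time, an external vantage point records when it first
sees the object, and the difference is compared with the committed limit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def publication_status(
    issued_at: datetime,
    observed_at: Optional[datetime],
    now: datetime,
    max_delay: timedelta,
) -> str:
    """Classify one canary object as 'ok', 'pending' or 'breach'.

    issued_at   -- when the CA signed the canary object
    observed_at -- when an external probe first saw it, or None if not yet
    now         -- current time at the monitor
    max_delay   -- allowed publication delay (placeholder, not from the CPS)
    """
    deadline = issued_at + max_delay
    if observed_at is not None:
        return "ok" if observed_at <= deadline else "breach"
    # Not visible yet: still within the window, or already too late.
    return "pending" if now <= deadline else "breach"


# Example with made-up timestamps and a made-up 8-hour limit.
t0 = datetime(2021, 2, 17, 10, 0, tzinfo=timezone.utc)
limit = timedelta(hours=8)
print(publication_status(t0, t0 + timedelta(minutes=5), t0 + timedelta(hours=1), limit))
```

A dashboard "traffic light" would then be a direct rendering of the three
return values, one per commitment being monitored.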

Metrics that come to mind:

* delta between last certificate issuance & successful publication
* Object count in the repository, repo size (and graphs)
* Time-To-Keyroll (and graphs on duration & frequency)
* Resource utilisation of various RPKI subsystems
* aggregate bandwidth consumption for RPKI endpoints (including rrdp, API,
  rsync)
* Graphs & logs of overlap between INRs listed on EE certificates under
  the RIPE TA and other commonly used TAs, matched against known
  transfers. This will help detect compromises as well as understand
  whether transfers are successful or not.
* Unique client IP count for RSYNC & RRDP for last hour/day/week
* Number of CS/hostmaster tickets mentioning RPKI
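
As one example from the list above, the overlap metric could be computed along
these lines. This is only a sketch for IP prefixes, using Python's standard
ipaddress module; the function name find_overlaps and the shape of its inputs
are assumptions, and the prefixes in the usage line are documentation ranges,
not real data. Each overlap between a prefix certified under the RIPE TA and
one certified under another TA is reported, together with a flag saying
whether a known transfer explains it.

```python
import ipaddress


def find_overlaps(ripe_prefixes, other_prefixes, known_transfers=()):
    """Return (ripe_prefix, other_prefix, explained_by_transfer) tuples."""
    others = [ipaddress.ip_network(p) for p in other_prefixes]
    transfers = [ipaddress.ip_network(p) for p in known_transfers]
    findings = []
    for ripe_net in map(ipaddress.ip_network, ripe_prefixes):
        for other_net in others:
            if ripe_net.version != other_net.version:
                continue  # never compare IPv4 with IPv6
            if not ripe_net.overlaps(other_net):
                continue
            # Overlapping CIDR blocks are nested, so the shared space is
            # simply the more specific (longer) of the two prefixes.
            shared = max(ripe_net, other_net, key=lambda n: n.prefixlen)
            explained = any(
                t.version == shared.version and shared.subnet_of(t)
                for t in transfers
            )
            findings.append((str(ripe_net), str(other_net), explained))
    return findings


print(find_overlaps(["192.0.2.0/24"], ["192.0.2.128/25"], ["192.0.2.128/25"]))
```

Overlaps that no transfer explains are the ones worth alerting on; explained
overlaps are still useful to graph, since they show transfers in progress.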

There are plenty of aspects to monitor; perhaps some notes should be
copied from how the DNS root is monitored.

Lack of operational experience with BGP ROV at RIPE NCC
---

I believe a number of potential future incidents related to the RIPE
NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE
NCC themselves apply RPKI based BGP Origin Validation 'invalid ==
reject' policies on the AS  EBGP border routers. RIPE NCC OPS
themselves having a dependency on the RPKI services will increase
organization-wide exposure to the (lack of) well-being of the Trust
Anchor services, and given the short communication channels between the
OPS team and the RPKI team my expectation is that we'll see problems
being solved faster and perhaps even problems being prevented.
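
For readers less familiar with what an 'invalid == reject' policy means in
practice, the decision can be sketched as below, following the origin
validation procedure of RFC 6811. This is a simplified model, not router
configuration and not any validator's actual code: the names VRP,
validate_origin and accept_route are hypothetical, and the prefixes and AS
numbers in the usage lines are documentation values.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """Validated ROA Payload: prefix, maximum length, authorised origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid' or 'not-found' for one BGP announcement."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        if vrp_net.version != route.version or not route.subnet_of(vrp_net):
            continue  # this VRP does not cover the route
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if covered else "not-found"


def accept_route(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> bool:
    """'invalid == reject': drop invalids, keep valid and not-found routes."""
    return validate_origin(route_prefix, origin_asn, vrps) != "invalid"


vrps = [VRP("192.0.2.0/24", 24, 64496)]
print(validate_origin("192.0.2.0/24", 64496, vrps))
print(accept_route("192.0.2.0/25", 64496, vrps))
```

Note that the outcome depends entirely on the set of VRPs the validator
produced, which is why an operator applying this policy notices quickly when
the Trust Anchor's published data is stale or missing.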

An analogy: RIPE NCC is a chef refusing to eat their own food.
How can we trust RIPE NCC to operate RPKI services, when RIPE NCC
themselves refuse to apply the cryptographic products to their BGP