Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Hank Nussbacher

On 16/02/2022 19:46, Gert Doering wrote:

> Hi,
>
> On Wed, Feb 16, 2022 at 05:01:25PM +0100, Ties de Kock wrote:
> > This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
> > [..]
>
> Thanks for this great postmortem writeup, and for being open about
> what happened, and how things always go wrong at the same time
> (service and monitoring).


Let me add to what Gert said.  Additional bonus points:
- sending out the postmortem on the same day as the incident
- total transparency

If this type of incident can happen to RIPE it can so easily happen to 
any of the best of us.


-Hank


--

To unsubscribe from this mailing list, get a password reminder, or change your 
subscription options, please visit: 
https://lists.ripe.net/mailman/listinfo/routing-wg


Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Job Snijders via routing-wg
On Wed, 16 Feb 2022 at 19:49, Rob Austein wrote:

> On Wed, 16 Feb 2022 13:10:27 -0500, Job Snijders wrote:
> > On Wed, 16 Feb 2022 at 19:07, Randy Bush wrote:
> > >
> > > sra commented to me that, an rp doing protocol fall-over from rrdp to
> > > rsync, or vice versa, has to do the full download as the data structure
> > > is so different.  i.e. load spike
> > >
> > Perhaps it doesn’t need to be a full load: “rsync --compare-dest”
> > (against a previously downloaded and validated set of signed
> > objects) offers a path towards optimising the protocol fall-over.
>
> Even assuming the RRDP client stores and believes the rsync URIs in
> the RRDP data stream, and further assuming that the client is clever
> enough to write out its RRDP-derived database into a directory tree
> which exactly matches an rsync filesystem layout before failing over,



The OpenBSD RPKI validator does the above, while maintaining robust
cryptographic integrity (in version 7.6 and higher). I hope other
validators take inspiration from this, similar to how we (OpenBSD) took
inspiration from the Dragon Labs implementation. Your work lives on and on,
hat tip to you Rob! :-)


> RRDP doesn't convey things like file modification dates that rsync
> needs to perform an efficient incremental transfer, so the first rsync
> pass is still going to be expensive.
>
> Not obvious to me that there's any good way to optimize this.  YMMV.
>

Ties once pointed me at the GPL rsync “-c” (checksum) option, which makes
transfers focus on content rather than filesystem attributes. From my
side (openrsync) this is still work to be done. I see a path :-)
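
To make the idea concrete, here is a minimal sketch of how a fall-over fetch
could be assembled, assuming GPL rsync is installed and the validator keeps
its last validated objects in a local directory. The function name, the
directory layout and the repository URI are hypothetical; only the
--compare-dest and -c (--checksum) options come from the discussion above.
The sketch only builds the command line, it does not run it.

```python
import shlex
from pathlib import Path
from typing import List


def build_fallback_rsync_cmd(source_uri: str, validated_dir: Path,
                             staging_dir: Path) -> List[str]:
    """Build (but do not run) an rsync command for an RRDP -> rsync fall-over.

    validated_dir: hypothetical directory holding previously validated objects.
    staging_dir:   hypothetical directory that receives new or changed objects.
    """
    return [
        "rsync",
        "--recursive",
        # -c: compare files by content checksum instead of size and mtime,
        # since RRDP-derived files carry no meaningful modification dates.
        "--checksum",
        # Files that are identical in validated_dir are not transferred again.
        # An absolute path is used because rsync treats a relative
        # --compare-dest path as relative to the destination directory.
        f"--compare-dest={validated_dir.resolve()}",
        source_uri,
        str(staging_dir),
    ]


if __name__ == "__main__":
    cmd = build_fallback_rsync_cmd(
        "rsync://rpki.example.net/repo/",   # assumption: placeholder URI
        Path("/var/cache/rpki/valid"),      # assumption: placeholder path
        Path("/var/cache/rpki/staging"),    # assumption: placeholder path
    )
    print(shlex.join(cmd))
```

Note that checksum comparison means both sides read every file to compute
checksums, so it trades transfer volume for disk I/O; whether that is a net
win during a fall-over spike would need measuring.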

Kind regards,

Job


Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Job Snijders via routing-wg
On Wed, 16 Feb 2022 at 19:07, Randy Bush wrote:

> thanks for the post mortem, ties.
>
> sra commented to me that, an rp doing protocol fall-over from rrdp to
> rsync, or vice versa, has to do the full download as the data structure
> is so different.  i.e. load spike


Perhaps it doesn’t need to be a full load: “rsync --compare-dest” (against a
previously downloaded and validated set of signed objects) offers a path
towards optimising the protocol fall-over.

Kind regards,

Job



Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Randy Bush
thanks for the post mortem, ties.

sra commented to me that, an rp doing protocol fall-over from rrdp to
rsync, or vice versa, has to do the full download as the data structure
is so different.  i.e. load spike.

randy



Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Gert Doering
Hi,

On Wed, Feb 16, 2022 at 05:01:25PM +0100, Ties de Kock wrote:
> This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
[..]

Thanks for this great postmortem writeup, and for being open about
what happened, and how things always go wrong at the same time
(service and monitoring).

Gert Doering
-- NetMaster
-- 
have you enabled IPv6 on something today...?

SpaceNet AG                      Vorstand: Sebastian v. Bomhard, Michael Emmer
Joseph-Dollinger-Bogen 14        Aufsichtsratsvors.: A. Grundner-Culemann
D-80807 Muenchen                 HRB: 136055 (AG Muenchen)
Tel: +49 (0)89/32356-444         USt-IdNr.: DE813185279




[routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Ties de Kock

Dear colleagues,

This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
During this period, a significant fraction of relying party instances attempting
to fall back to rsync://rpki.ripe.net could not retrieve objects due to capacity
constraints.

At approximately 13:00 UTC, the RPKI team attempted to move the DNS records for
rrdp.ripe.net out of the ripe.net zone file into a separate include file.
We made this change to prepare for implementing automated failover between the
CDNs.

This resulted in an outage of the RRDP service, which was caused by an issue in
the ripe.net zone file. The file contains several $ORIGIN directives, but they
are not reset properly when a block ends. As a consequence, later relative
names in the zone file accidentally get the wrong origin applied to them, and
this is easy to miss if the $ORIGIN directive appears much earlier in the file.

To prevent such DNS issues in the future, all the blocks will be moved out of
the main zone file into separate include files, because an $ORIGIN directive in
an include file does not persist beyond the end of that file.
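
As an illustration only, the following sketch models how $ORIGIN and $INCLUDE
scoping affect relative owner names. It is a heavily simplified toy parser
(no comments, TTLs, parentheses or inherited owner names), and the zone
contents, names and file name are hypothetical examples, not the actual
ripe.net zone.

```python
from typing import Dict, List


def qualify(name: str, origin: str) -> str:
    """Return the absolute form of a name, appending origin if it is relative."""
    if name == "@":
        return origin
    if name.endswith("."):
        return name
    return f"{name}.{origin}"


def owner_names(lines: List[str], origin: str,
                includes: Dict[str, List[str]]) -> List[str]:
    """Collect the fully qualified owner name of every record line."""
    names: List[str] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "$ORIGIN":
            # Stays in effect for the rest of this file, until the next $ORIGIN.
            origin = qualify(fields[1], origin)
        elif fields[0] == "$INCLUDE":
            # The included file starts from the current origin. An $ORIGIN
            # inside it only changes the callee's local variable, so the
            # parent's origin is unchanged after the include returns.
            names += owner_names(includes[fields[1]], origin, includes)
        else:
            names.append(qualify(fields[0], origin))
    return names


# Everything in one file: the second $ORIGIN is never reset.
SINGLE_FILE = [
    "$ORIGIN example.net.",
    "www   IN A     192.0.2.1",
    "$ORIGIN lab.example.net.",
    "host1 IN A     192.0.2.10",
    "rrdp  IN CNAME cdn.example.org.",  # intended: rrdp.example.net.
]

# Same records, with the lab block moved into an include file.
SPLIT_MAIN = [
    "$ORIGIN example.net.",
    "www   IN A     192.0.2.1",
    "$INCLUDE lab.zone",
    "rrdp  IN CNAME cdn.example.org.",
]
INCLUDES = {"lab.zone": ["$ORIGIN lab.example.net.", "host1 IN A 192.0.2.10"]}

if __name__ == "__main__":
    print(owner_names(SINGLE_FILE, "example.net.", {}))
    print(owner_names(SPLIT_MAIN, "example.net.", INCLUDES))
```

In the single-file case the last record ends up as rrdp.lab.example.net.,
which has the same effect as removing the intended name from the zone. In the
split case it resolves to rrdp.example.net. as intended.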

Also, earlier today, our monitoring had been broken by a change to the
Prometheus configuration file. This reduced our visibility into the outage and
meant no alerts were sent until monitoring recovered.

A third contributing factor was that a secondary system that monitors the RPKI
Prometheus infrastructure did not alert, because the web interface returned
HTTP 200 despite the broken configuration.
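
As an illustration only, a health probe can look at a signal inside the
response instead of trusting the status code alone. The sketch below assumes
the probe fetches the Prometheus /metrics page and reads the
prometheus_config_last_reload_successful gauge; the function names are
hypothetical, and whether this particular signal would have caught the failure
described above is not known.

```python
from typing import Optional


def metric_value(exposition_text: str, metric_name: str) -> Optional[float]:
    """Return the value of an unlabelled sample in Prometheus text format."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric_name:
            return float(parts[1])
    return None


def monitoring_is_healthy(status_code: int, body: str) -> bool:
    """Treat HTTP 200 as necessary but not sufficient."""
    if status_code != 200:
        return False
    # A missing metric counts as unhealthy rather than being assumed fine.
    return metric_value(body, "prometheus_config_last_reload_successful") == 1.0


if __name__ == "__main__":
    broken = (
        "# TYPE prometheus_config_last_reload_successful gauge\n"
        "prometheus_config_last_reload_successful 0\n"
    )
    print(monitoring_is_healthy(200, broken))
```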

A final factor was that the capacity of rsync://rpki.ripe.net is limited. Only
some of the relying party instances that attempted to fall back could update
over rsync; the rest could not retrieve new objects.

Full timeline:
  * 07:04 UTC: broken alert configuration committed
  * 08:46 UTC: broken alert configuration applied, breaking monitoring
  * 13:02 UTC: DNS change (effectively removing rrdp.ripe.net from zone) applied
  * 13:44 UTC: alert configuration reverted
  * 14:10 UTC: DNS configuration recovered
  * 14:25 UTC: rsync connection rate back at baseline level

During the period where rrdp.ripe.net was not available, many relying party
instances started falling back to rsync. In the partial data available, we
observed a median rsync connection duration of 300 seconds and a 99th
percentile of 1660 seconds, with ~55% of rsync connections disconnecting with
an error code. We take this preliminary data as indicative of an underlying I/O
limitation in our NFS setup, which we will investigate further.
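
As an illustration only, statistics of this kind can be derived from
per-connection records as sketched below. The record format (duration in
seconds, rsync exit code) and the sample values are made up for the example
and are not the measured data; the nearest-rank percentile definition is also
an assumption, since the method used for the figures above is not stated.

```python
import math
from statistics import median
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(connections: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarise (duration_seconds, exit_code) records; exit code 0 means success."""
    durations = [duration for duration, _ in connections]
    errors = sum(1 for _, code in connections if code != 0)
    return {
        "median_s": median(durations),
        "p99_s": percentile(durations, 99),
        "error_fraction": errors / len(connections),
    }


if __name__ == "__main__":
    # Made-up sample records, not measurements.
    sample = [(40.0, 0), (90.0, 12), (200.0, 0), (800.0, 30), (150.0, 0)]
    print(summarise(sample))
```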

During the period of outage, our rsync servers returned 5043 “max connection
reached” errors to 2307 unique IP addresses.

We have applied one mitigation (linting of alert configuration). We are also
working on improving our external monitoring without a dependency on on-premise
infrastructure.

Kind regards,
Ties

