Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Hank Nussbacher

On 16/02/2022 19:46, Gert Doering wrote:

> Hi,
>
> On Wed, Feb 16, 2022 at 05:01:25PM +0100, Ties de Kock wrote:
>> This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
> [..]
>
> Thanks for this great postmortem writeup, and for being open about
> what happened, and how things always go wrong at the same time
> (service and monitoring).


Let me add to what Gert said.  Additional bonus points:
- sending out the postmortem on the same day as the incident
- total transparency

If this type of incident can happen to RIPE it can so easily happen to 
any of the best of us.


-Hank


--

To unsubscribe from this mailing list, get a password reminder, or change your 
subscription options, please visit: 
https://lists.ripe.net/mailman/listinfo/routing-wg


Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Job Snijders via routing-wg
On Wed, 16 Feb 2022 at 19:49, Rob Austein wrote:

> On Wed, 16 Feb 2022 13:10:27 -0500, Job Snijders wrote:
> > On Wed, 16 Feb 2022 at 19:07, Randy Bush wrote:
> >
> >  sra commented to me that, an rp doing protocol fall-over from rrdp to
> >  rsync, or vice versa, has to do the full download as the data structure
> >  is so different.  i.e. load spike
> >
> > Perhaps it doesn’t need to be a full load: “rsync --compare-dest”
> > (against a previously downloaded and validated set of signed
> > objects) offers a path towards optimising the protocol fall-over.
>
> Even assuming the RRDP client stores and believes the rsync URIs in
> the RRDP data stream, and further assuming that the client is clever
> enough to write out its RRDP-derived database into a directory tree
> which exactly matches an rsync filesystem layout before failing over,



The OpenBSD RPKI validator does the above, while maintaining robust
cryptographic integrity (in version 7.6 and higher). I hope other
validators take inspiration from this, similar to how we (OpenBSD) took
inspiration from the Dragon Labs implementation. Your work lives on and on,
hat tip to you Rob! :-)


> RRDP doesn't convey things like file modification dates that rsync
> needs to perform an efficient incremental transfer, so the first rsync
> pass is still going to be expensive.
>
> Not obvious to me that there's any good way to optimize this.  YMMV.
>

Ties once pointed me at the GPL rsync “-c” (checksum) option, which makes
transfers focus on file content rather than filesystem attributes. On my
side (openrsync), this is still work to be done. I see a path :-)
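
As a rough illustration of the two options discussed in this thread, the Python sketch below assembles an rsync command line that combines --compare-dest with --checksum. It only builds the argument list and does not run anything. The function name and both directory paths are hypothetical; the flags themselves (--no-motd, -rt, --checksum, --compare-dest) are standard GPL rsync options, and nothing here describes how any particular validator actually invokes rsync.

```python
def build_fallback_rsync_argv(source_uri, dest_dir, validated_dir, use_checksum=True):
    """Build an rsync argv for an RRDP-to-rsync fallback fetch.

    validated_dir is a previously downloaded and validated copy of the
    repository (hypothetical layout). Files that match a copy found there
    are skipped instead of being transferred into dest_dir.
    """
    argv = ["rsync", "--no-motd", "-rt"]
    if use_checksum:
        # Compare by content checksum rather than by size and mtime,
        # since RRDP does not convey modification times.
        argv.append("--checksum")
    # Use an absolute path: a relative DIR is interpreted relative to
    # the destination directory.
    argv.append(f"--compare-dest={validated_dir}")
    argv += [source_uri, dest_dir]
    return argv


# Hypothetical paths, for illustration only.
argv = build_fallback_rsync_argv(
    "rsync://rpki.ripe.net/repository/",
    "/var/cache/rpki/staging/",
    "/var/cache/rpki/validated/",
)
```

The checksum pass still costs disk reads on both ends, so this reduces transferred bytes rather than server-side IO.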

Kind regards,

Job


Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Job Snijders via routing-wg
On Wed, 16 Feb 2022 at 19:07, Randy Bush wrote:

> thanks for the post mortem, ties.
>
> sra commented to me that, an rp doing protocol fall-over from rrdp to
> rsync, or vice versa, has to do the full download as the data structure
> is so different.  i.e. load spike


Perhaps it doesn’t need to be a full load: “rsync --compare-dest” (against a
previously downloaded and validated set of signed objects) offers a path
towards optimising the protocol fall-over.

Kind regards,

Job



Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Randy Bush
thanks for the post mortem, ties.

sra commented to me that, an rp doing protocol fall-over from rrdp to
rsync, or vice versa, has to do the full download as the data structure
is so different.  i.e. load spike.

randy



Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Gert Doering
Hi,

On Wed, Feb 16, 2022 at 05:01:25PM +0100, Ties de Kock wrote:
> This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
[..]

Thanks for this great postmortem writeup, and for being open about
what happened, and how things always go wrong at the same time
(service and monitoring).

Gert Doering
-- NetMaster
-- 
have you enabled IPv6 on something today...?

SpaceNet AG  Vorstand: Sebastian v. Bomhard, Michael Emmer
Joseph-Dollinger-Bogen 14  Aufsichtsratsvors.: A. Grundner-Culemann
D-80807 Muenchen HRB: 136055 (AG Muenchen)
Tel: +49 (0)89/32356-444 USt-IdNr.: DE813185279




[routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022

2022-02-16 Thread Ties de Kock

Dear colleagues,

This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
During this period, a significant fraction of relying party instances attempting
to fall back to rsync://rpki.ripe.net could not retrieve objects due to capacity
constraints.

At approximately 13:00 UTC, the RPKI team attempted to move the DNS records for
rrdp.ripe.net out of the ripe.net zone file into a separate include file.
We made this change to prepare for implementing automated failover between the
CDNs.

This resulted in an outage of the RRDP service, caused by an issue in the
ripe.net zone file. The file contains several $ORIGIN directives, but the
origin is not reset when a block ends. As a consequence, later relative names
in the zone file silently get the wrong origin applied to them, and this is easy
to miss if the $ORIGIN directive appears much earlier in the file.

To prevent such DNS issues in the future, all the blocks will be moved out of
the main zone file into separate include files, because $ORIGIN directives in
them do not persist beyond the end of the file.
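
To illustrate the failure mode described above, here is a minimal Python sketch of how a zone-file parser applies $ORIGIN to relative owner names. It is not the RIPE NCC's tooling and not a complete master-file parser (one record per line, no $INCLUDE handling); the zone lines and the names under example.net are hypothetical.

```python
def resolve_owner_names(lines, origin):
    """Return the fully qualified owner name of each record.

    An $ORIGIN directive stays in effect for every following line until
    the next $ORIGIN or the end of the file; there is no "end of block".
    """
    names = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith(";"):
            continue  # skip blank lines and comments
        if fields[0] == "$ORIGIN":
            origin = fields[1]  # persists from here on
            continue
        owner = fields[0]
        if owner == "@":
            names.append(origin)
        elif owner.endswith("."):
            names.append(owner)  # already absolute
        else:
            names.append(f"{owner}.{origin}")  # relative: append the origin
    return names


# Hypothetical zone content: the author intended "rrdp" to live directly
# under example.net., but the earlier $ORIGIN is still active.
zone = [
    "$ORIGIN lab.example.net.",
    "host1 A 192.0.2.1",
    "; end of the lab block, but the origin is never reset",
    "rrdp CNAME cdn.example.org.",
]
names = resolve_owner_names(zone, "example.net.")
```

In this sketch the last record ends up as rrdp.lab.example.net., so the intended name effectively disappears from the zone.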

Also, earlier today, our monitoring had been broken by a change to the
Prometheus configuration file. This reduced our visibility into the outage and
meant no alerts were sent until monitoring recovered.

A third contributing factor was that a secondary monitoring system watching
the RPKI Prometheus infrastructure did not alert, because the web interface
returned an HTTP 200 despite the broken configuration.
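
One common way to catch this kind of silent monitoring failure is a heartbeat (dead man's switch) check: instead of trusting an HTTP status code, an independent checker verifies that the monitoring system has recently produced a signal. The Python sketch below is an assumption about how such a check could look, not a description of the RIPE NCC setup; the function name and the 5-minute threshold are hypothetical, and the timestamps are taken from the timeline in this message purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_heartbeat, now, max_age=timedelta(minutes=5)):
    """Return True when the last heartbeat is older than max_age.

    A stale heartbeat means the alerting pipeline itself has stopped
    working, regardless of what its web interface returns.
    """
    return now - last_heartbeat > max_age


# Illustrative values: last heartbeat when alerting broke (08:46 UTC),
# checked at the time of the DNS change (13:02 UTC).
last_seen = datetime(2022, 2, 16, 8, 46, tzinfo=timezone.utc)
checked_at = datetime(2022, 2, 16, 13, 2, tzinfo=timezone.utc)
stale = heartbeat_is_stale(last_seen, checked_at)
```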

A final factor was that the capacity of rsync://rpki.ripe.net is limited. Only
part of the relying party instances that attempted to fall back could update
from rsync. This prevented relying party instances from retrieving new objects.

Full timeline:
  * 07:04 UTC: broken alert configuration committed
  * 08:46 UTC: broken alert configuration applied, breaking monitoring.
  * 13:02 UTC: DNS change (effectively removing rrdp.ripe.net from zone) applied
  * 13:44 UTC: alert configuration reverted
  * 14:10 UTC: DNS configuration recovered
  * 14:25 UTC: rsync connection rate back at baseline level
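
The intervals implied by the timeline above can be worked out with a few lines of Python. This is only arithmetic on the timestamps listed; the helper name and the interval labels are mine, not part of the RFO.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Alerting broken: configuration applied until configuration reverted.
monitoring_broken = minutes_between("08:46", "13:44")
# RRDP hostname missing: DNS change applied until DNS recovered.
rrdp_unavailable = minutes_between("13:02", "14:10")
# Overlap: RRDP already down while alerting was still broken.
outage_without_alerts = minutes_between("13:02", "13:44")
```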

During the period where rrdp.ripe.net was not available, many relying party
instances started falling back to rsync. Based on the partial data available, we
observed a median rsync connection duration of 300 seconds and a 99th percentile
of 1660 seconds, with ~55% of rsync connections disconnecting with an error
code. This preliminary data points to an underlying I/O limitation in our NFS
setup. We will investigate this further.

During the period of outage, our rsync servers returned 5043 “max connection
reached” errors to 2307 unique IP addresses.

We have applied one mitigation (linting of alert configuration). We are also
working on improving our external monitoring without a dependency on on-premise
infrastructure.

Kind regards,
Ties




Re: [routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?

2022-02-16 Thread Job Snijders via routing-wg
Hi Ties,

Thank you for the quick reply.

On Wed, Feb 16, 2022 at 03:32:06PM +0100, Ties de Kock wrote:
> Ouch. Fallback to rsync due to a DNS misconfiguration (which should
> have recovered).

Thanks for the confirmation. Indeed, my monitors seem to have returned
to 'all clear'.

> There are multiple instances behind a load-balancer. The current
> storage is on NFS which has a performance limitation - it peaked at
> about 80K operations/second (2m average).

Welp! That's a lot of IO.

Sharing from my own experience with a tiny publication point: I estimate
there are about 4,000 RPs deployed on the Internet. Assuming their
synchronisation attempts are evenly distributed across the hour, a
naive calculation suggests that roughly one new client attempts to
connect every second.
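
The naive calculation above, written out as a small Python sketch. The 4,000 RPs and the one-hour interval are the estimates from this message. The concurrency figure additionally assumes, as a rough simplification, that sessions last about 300 seconds (the median reported in the RFO in this thread, treated here as if it were a mean) and applies Little's law; it is an illustration, not a measurement.

```python
def mean_connection_rate(relying_parties, sync_interval_seconds):
    """New connections per second if syncs are spread evenly over the interval."""
    return relying_parties / sync_interval_seconds


def mean_concurrent_sessions(rate_per_second, mean_session_seconds):
    """Little's law: average sessions in progress = arrival rate * mean duration."""
    return rate_per_second * mean_session_seconds


rate = mean_connection_rate(4000, 3600)            # about 1.1 new clients per second
concurrent = mean_concurrent_sessions(rate, 300)   # about 333 sessions at once
```

Under these assumptions the steady-state number of simultaneous sessions is a few hundred, which gives a feel for how connection limits and session duration interact when every RP falls back to rsync.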

> We will follow up with a more detailed post-mortem.

Much appreciated!

Kind regards,

Job



Re: [routing-wg] Open-sourcing of the RIPE NCC’s RPKI core software

2022-02-16 Thread Cynthia Revström via routing-wg
Hi,

On Thu, Feb 10, 2022 at 6:35 PM Shane Kerr wrote:
>
> I'm a little disappointed that you didn't choose a copyleft style
> license, like with the RIPE Atlas Software Probe, which uses GPLv3. That
> would help ensure that the work of the RIPE NCC employees is not used by
> a proprietary product or service by a company unwilling or unable to
> share their changes. Probably in the presentation at RIPE 84 there will
> be a bit of explanation about the choice of using a license so easy to
> convert back to closed source. 

There are issues regardless of what license you pick and remember that
there are proprietary HSM dependencies used, which could potentially
cause issues with GPLv3 for the NCC itself as far as I understand it.

-Cynthia



Re: [routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?

2022-02-16 Thread Ties de Kock
Hi Job.

> On 16 Feb 2022, at 15:05, Job Snijders via routing-wg wrote:
> 
> Hi all,
> 
> I noticed the RIPE NCC RRDP service (https://rrdp.ripe.net/) became
> unreachable at 2022-02-16 13:34:10 UTC+0 (and still is down).

Ouch. Fallback to rsync due to a DNS misconfiguration (which should have
recovered).

> This RRDP outage event should not pose an issue for most RPKI
> validators, because most RPKI cache implementations (which follow best
> practices) will attempt to synchronize via RSYNC, in case RRDP is
> unavailable.
> 
> However, it seems RIPE NCC adjusted the default rsyncd settings and
> lowered the concurrent connection count from 200 (which already is too
> low for RPKI Repository Servers) to 150?
> 
> $ rsync --no-motd -rt rsync://rpki.ripe.net/repository/
> @ERROR: max connections (150) reached -- try again later
> rsync error: error starting client-server protocol (code 5) at main.c(1666) [Receiver=3.1.2]
> 
> I'm not familiar with the RIPE RPKI RSYNC service architecture, so the
> above error could be misleading: perhaps there is a loadbalancer
> distributing TCP sessions across multiple backends, each backend
> configured to serve up to 150 clients? Or perhaps there is a single
> rsyncd instance (in which case 150 definitely is too low).

We have described our rsync infrastructure extensively in earlier messages
(e.g. [0]). There are multiple instances behind a load-balancer. The current
storage is on NFS which has a performance limitation - it peaked at about 80K
operations/second (2m average).

We will follow up with a more detailed post-mortem.

Kind regards,
Ties


[0]: https://www.ripe.net/ripe/mail/archives/routing-wg/2021-June/004351.html




Re: [routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?

2022-02-16 Thread Job Snijders via routing-wg
On Wed, Feb 16, 2022 at 03:05:30PM +0100, Job Snijders wrote:
> However, it seems RIPE NCC adjusted the default rsyncd settings and
> lowered the concurrent connection count from 200 (which already is too
> low for RPKI Repository Servers) to 150?

Small correction: I appear to have been mistaken about 200 being the default;
according to the documentation, the default is 'unlimited'.

Kind regards,

Job



[routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?

2022-02-16 Thread Job Snijders via routing-wg
Hi all,

I noticed the RIPE NCC RRDP service (https://rrdp.ripe.net/) became
unreachable at 2022-02-16 13:34:10 UTC+0 (and is still down).

This RRDP outage event should not pose an issue for most RPKI
validators, because most RPKI cache implementations (which follow best
practices) will attempt to synchronize via RSYNC, in case RRDP is
unavailable.

However, it seems RIPE NCC adjusted the default rsyncd settings and
lowered the concurrent connection count from 200 (which already is too
low for RPKI Repository Servers) to 150?

    $ rsync --no-motd -rt rsync://rpki.ripe.net/repository/
    @ERROR: max connections (150) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1666) [Receiver=3.1.2]

I'm not familiar with the RIPE RPKI RSYNC service architecture, so the
above error could be misleading: perhaps there is a loadbalancer
distributing TCP sessions across multiple backends, each backend
configured to serve up to 150 clients? Or perhaps there is a single
rsyncd instance (in which case 150 definitely is too low).

Is the RIPE NCC RPKI RSYNC service underprovisioned? If yes, why?

Kind regards,

Job
