Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
On 16/02/2022 19:46, Gert Doering wrote:
> Hi,
>
> On Wed, Feb 16, 2022 at 05:01:25PM +0100, Ties de Kock wrote:
> > This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
> [..]
>
> Thanks for this great postmortem writeup, and for being open about what
> happened, and how things always go wrong at the same time (service and
> monitoring).

Let me add to what Gert said. Additional bonus points:
- sending out the postmortem on the same day as the incident
- total transparency

If this type of incident can happen to RIPE it can so easily happen to any of the best of us.

-Hank

--
To unsubscribe from this mailing list, get a password reminder, or change your subscription options, please visit: https://lists.ripe.net/mailman/listinfo/routing-wg
Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
On Wed, 16 Feb 2022 at 19:49, Rob Austein wrote:

> On Wed, 16 Feb 2022 13:10:27 -0500, Job Snijders wrote:
> > On Wed, 16 Feb 2022 at 19:07, Randy Bush wrote:
> > >
> > > sra commented to me that, an rp doing protocol fall-over from rrdp to
> > > rsync, or vice versa, has to do the full download as the data structure
> > > is so different. i.e. load spike
> >
> > Perhaps it doesn’t need to be a full load: “rsync --compare-dest”
> > (against a previously downloaded and validated set of signed
> > objects) offers a path towards optimising the protocol fall-over.
>
> Even assuming the RRDP client stores and believes the rsync URIs in
> the RRDP data stream, and further assuming that the client is clever
> enough to write out its RRDP-derived database into a directory tree
> which exactly matches an rsync filesystem layout before failing over,

The OpenBSD RPKI validator does the above, while maintaining robust cryptographic integrity (in version 7.6 and higher). I hope other validators take inspiration from this, similar to how we (OpenBSD) took inspiration from the Dragon Labs implementation. Your work lives on and on, hat tip to you Rob! :-)

> RRDP doesn't convey things like file modification dates that rsync
> needs to perform an efficient incremental transfer, so the first rsync
> pass is still going to be expensive.
>
> Not obvious to me that there's any good way to optimize this. YMMV.

Ties once pointed me at the GPL rsync “-c” (checksum) option, which makes transfers focus on content rather than filesystem attributes. From my (openrsync) perspective this is still work to be done. I see a path :-)

Kind regards,

Job
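To make the fall-over idea discussed above concrete, here is a minimal sketch that only assembles an rsync command line combining `--compare-dest` with `--checksum`; it does not run anything. The function name, the directory paths and the choice to keep `--no-motd -rt` (taken from the command quoted elsewhere in this thread) are assumptions for illustration, not how any particular validator invokes rsync.

```python
from pathlib import PurePosixPath
from typing import List


def build_fallback_rsync_cmd(
    uri: str,
    validated_dir: PurePosixPath,
    staging_dir: PurePosixPath,
    use_checksum: bool = True,
) -> List[str]:
    """Assemble (but do not execute) an rsync command for RRDP-to-rsync fall-over.

    validated_dir is a hypothetical directory holding previously downloaded and
    validated objects, laid out like the rsync module. It must be absolute,
    because rsync interprets a relative --compare-dest relative to the
    destination directory.
    """
    if not validated_dir.is_absolute():
        raise ValueError("validated_dir must be an absolute path")

    cmd = ["rsync", "--no-motd", "-rt"]
    if use_checksum:
        # Compare by content rather than by size and modification time, since
        # RRDP-derived files carry no meaningful modification times.
        cmd.append("--checksum")
    # Files identical to those under validated_dir are skipped.
    cmd.append(f"--compare-dest={validated_dir}")
    cmd.extend([uri, str(staging_dir)])
    return cmd


cmd = build_fallback_rsync_cmd(
    "rsync://rpki.ripe.net/repository/",
    PurePosixPath("/cache/validated"),  # hypothetical path
    PurePosixPath("/cache/staging"),    # hypothetical path
)
print(" ".join(cmd))
```

Checksumming shifts work from the network to disk and CPU on both ends, so whether this actually reduces the load spike on the server side would need measuring.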
Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
On Wed, 16 Feb 2022 at 19:07, Randy Bush wrote:

> thanks for the post mortem, ties.
>
> sra commented to me that, an rp doing protocol fall-over from rrdp to
> rsync, or vice versa, has to do the full download as the data structure
> is so different. i.e. load spike

Perhaps it doesn’t need to be a full load: “rsync --compare-dest” (against a previously downloaded and validated set of signed objects) offers a path towards optimising the protocol fall-over.

Kind regards,

Job
Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
thanks for the post mortem, ties.

sra commented to me that, an rp doing protocol fall-over from rrdp to rsync, or vice versa, has to do the full download as the data structure is so different. i.e. load spike.

randy
Re: [routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
Hi,

On Wed, Feb 16, 2022 at 05:01:25PM +0100, Ties de Kock wrote:
> This afternoon, between 13:00 UTC and 14:10 UTC rrdp.ripe.net was unavailable.
[..]

Thanks for this great postmortem writeup, and for being open about what happened, and how things always go wrong at the same time (service and monitoring).

Gert Doering
        -- NetMaster
--
have you enabled IPv6 on something today...?

SpaceNet AG                      Vorstand: Sebastian v. Bomhard, Michael Emmer
Joseph-Dollinger-Bogen 14        Aufsichtsratsvors.: A. Grundner-Culemann
D-80807 Muenchen                 HRB: 136055 (AG Muenchen)
Tel: +49 (0)89/32356-444         USt-IdNr.: DE813185279
[routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
Dear colleagues,

This afternoon, between 13:00 UTC and 14:10 UTC, rrdp.ripe.net was unavailable. During this period, a significant fraction of relying party instances attempting to fall back to rsync://rpki.ripe.net could not retrieve objects due to capacity constraints.

At approximately 13:00 UTC, the RPKI team attempted to move the DNS records for rrdp.ripe.net out of the ripe.net zone file into a separate include file. We made this change to prepare for implementing an automated failover between the CDNs.

This resulted in an outage of the RRDP service, caused by an issue in the ripe.net zone file. The file contains several $ORIGIN directives, but they are not reset properly when a block ends. The consequence is that later relative names in the zone file accidentally get the incorrect origin applied to them, and it is easy to miss this if the $ORIGIN directive appears much earlier in the file.

To prevent such DNS issues in the future, all the blocks will be moved out of the main zone file into separate include files, because $ORIGIN directives in them do not persist beyond the end of the file.

Also, earlier today, our monitoring broke due to a change in the Prometheus configuration file. This reduced our visibility into the outage and meant no alerts were sent until monitoring recovered.

A third contributing factor was that a secondary monitoring system, which monitors the RPKI Prometheus infrastructure, did not alert because the web interface returned an HTTP 200 despite the broken configuration.

A final factor was that the capacity of rsync://rpki.ripe.net is limited. Only part of the relying party instances that attempted to fall back could update from rsync. This prevented the remaining relying party instances from retrieving new objects.

Full timeline:
* 07:04 UTC: broken alert configuration committed
* 08:46 UTC: broken alert configuration applied, breaking monitoring
* 13:02 UTC: DNS change (effectively removing rrdp.ripe.net from zone) applied
* 13:44 UTC: alert configuration reverted
* 14:10 UTC: DNS configuration recovered
* 14:25 UTC: rsync connection rate back at baseline level

During the period where rrdp.ripe.net was not available, many relying party instances started falling back to rsync. In the partial data available, we observed a median rsync connection duration of 300 seconds and a 99th percentile of 1660 seconds, with ~55% of rsync connections disconnecting with an error code. Based on this preliminary data, we conclude that this indicates an underlying I/O limitation in our NFS setup. We will investigate this further.

During the outage, our rsync servers returned 5043 “max connection reached” errors to 2307 unique IP addresses.

We have applied one mitigation (linting of alert configuration). We are also working on improving our external monitoring without a dependency on on-premise infrastructure.

Kind regards,

Ties
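The $ORIGIN behaviour described in the RFO can be illustrated with a small sketch. It is a deliberately simplified reader of zone-file owner names, not a real parser: it ignores multi-line records, owner inheritance from indented lines and relative $ORIGIN arguments. The zone contents, names and addresses are hypothetical (documentation ranges and example domains), not taken from the ripe.net zone.

```python
from typing import List, Tuple


def qualify_owner_names(zone_text: str, zone_origin: str) -> List[Tuple[str, str]]:
    """Return (fully qualified owner name, rest of record) for each record line.

    The current origin starts as zone_origin and changes whenever a $ORIGIN
    directive is seen. It stays in effect until the next $ORIGIN directive,
    which is the behaviour that makes a missing reset easy to overlook.
    """
    origin = zone_origin
    records = []
    for raw in zone_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        if fields[0].upper() == "$ORIGIN":
            origin = fields[1]  # assumed to be absolute in this sketch
            continue
        owner, rest = fields[0], " ".join(fields[1:])
        if owner == "@":
            fqdn = origin
        elif owner.endswith("."):
            fqdn = owner  # already absolute
        else:
            fqdn = f"{owner}.{origin}"  # relative name: append current origin
        records.append((fqdn, rest))
    return records


# Hypothetical zone: the $ORIGIN block is never reset before the last record.
ZONE = """\
www      IN A     192.0.2.10
$ORIGIN sub.example.net.
host1    IN A     192.0.2.20
; the block ends here, but the origin is never reset
rrdp     IN CNAME cdn.example.org.
"""

records = qualify_owner_names(ZONE, "example.net.")
for fqdn, rest in records:
    print(fqdn, rest)
```

In this sketch the last record ends up as `rrdp.sub.example.net.` rather than `rrdp.example.net.`, so the intended name silently disappears from the zone. A lint step that prints the fully qualified names before and after a change would make such a shift visible.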
Re: [routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?
Hi Ties,

Thank you for the quick reply.

On Wed, Feb 16, 2022 at 03:32:06PM +0100, Ties de Kock wrote:
> Ouch. Fallback to rsync due to a DNS misconfiguration (which should
> have recovered).

Thanks for the confirmation. Indeed, my monitors seem to have returned to 'all clear'.

> There are multiple instances behind a load-balancer. The current
> storage is on NFS which has a performance limitation - it peaked at
> about 80K operations/second (2m average).

Welp! That's a lot of IO. Sharing from my own experience with a tiny publication point: I estimate there are about 4,000 RPs deployed on the Internet. Assuming their synchronisation attempts are evenly distributed across the hour, a naive calculation suggests that a new client will attempt to connect roughly once per second.

> We will follow up with a more detailed post-mortem.

Much appreciated!

Kind regards,

Job
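The naive calculation above can be written out as follows. The 4,000 RPs and the hourly interval are the estimates from the message; the step from arrival rate to concurrent connections uses Little's law, and the 300 second figure is the median connection duration reported in the RFO elsewhere in this thread, which was measured during the outage. Treat the result as a rough illustration rather than a capacity model.

```python
def connection_attempts_per_second(relying_parties: int, sync_interval_seconds: int) -> float:
    """Average arrival rate if each RP syncs once per interval, evenly spread."""
    return relying_parties / sync_interval_seconds


def expected_concurrent_connections(arrival_rate: float, mean_duration_seconds: float) -> float:
    """Little's law: average number in the system = arrival rate * time in system."""
    return arrival_rate * mean_duration_seconds


# Estimates from the thread: ~4,000 RPs, one synchronisation attempt per hour.
rate = connection_attempts_per_second(4000, 3600)

# Assumption: use the 300 s median from the RFO as a stand-in for the mean duration.
concurrent = expected_concurrent_connections(rate, 300)

print(f"{rate:.2f} attempts/second, about {concurrent:.0f} concurrent connections")
```

Under these assumptions the arrival rate is a little over one attempt per second, and with connections lasting minutes rather than seconds the concurrent count reaches the hundreds.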
Re: [routing-wg] Open-sourcing of the RIPE NCC’s RPKI core software
Hi,

On Thu, Feb 10, 2022 at 6:35 PM Shane Kerr wrote:
>
> I'm a little disappointed that you didn't choose a copyleft style
> license, like with the RIPE Atlas Software Probe, which uses GPLv3. That
> would help ensure that the work of the RIPE NCC employees is not used by
> a proprietary product or service by a company unwilling or unable to
> share their changes. Probably in the presentation at RIPE 84 there will
> be a bit of explanation about the choice of using a license so easy to
> convert back to closed source.

There are issues regardless of what license you pick, and remember that there are proprietary HSM dependencies used, which could potentially cause issues with GPLv3 for the NCC itself as far as I understand it.

-Cynthia
Re: [routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?
Hi Job.

> On 16 Feb 2022, at 15:05, Job Snijders via routing-wg wrote:
>
> Hi all,
>
> I noticed the RIPE NCC RRDP service (https://rrdp.ripe.net/) became
> unreachable at 2022-02-16 13:34:10 UTC+0 (and still is down).

Ouch. Fallback to rsync due to a DNS misconfiguration (which should have recovered).

> This RRDP outage event should not pose an issue for most RPKI
> validators, because most RPKI cache implementations (which follow best
> practices) will attempt to try to synchronize via RSYNC, in case RRDP is
> unavailable.
>
> However, it seems RIPE NCC adjusted the default rsyncd settings and
> lowered the concurrent connection count from 200 (which already is too
> low for RPKI Repository Servers) to 150?
>
>    $ rsync --no-motd -rt rsync://rpki.ripe.net/repository/
>    @ERROR: max connections (150) reached -- try again later
>    rsync error: error starting client-server protocol (code 5) at main.c(1666) [Receiver=3.1.2]
>
> I'm not familiar with the RIPE RPKI RSYNC service architecture, so the
> above error could be misleading: perhaps there is a loadbalancer
> distributing TCP sessions across multiple backends, each backend
> configured to serve up to 150 clients? Or perhaps there is a single
> rsyncd instance (in which case 150 definitely is too low).

We have described our rsync infrastructure extensively in earlier messages (e.g. [0]). There are multiple instances behind a load-balancer. The current storage is on NFS which has a performance limitation - it peaked at about 80K operations/second (2m average).

We will follow up with a more detailed post-mortem.

Kind regards,

Ties

[0]: https://www.ripe.net/ripe/mail/archives/routing-wg/2021-June/004351.html
Re: [routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?
On Wed, Feb 16, 2022 at 03:05:30PM +0100, Job Snijders wrote:
> However, it seems RIPE NCC adjusted the default rsyncd settings and
> lowered the concurrent connection count from 200 (which already is too
> low for RPKI Repository Servers) to 150?

Small correction: I appear to be confused about 200 being the default; according to the documentation, the default is 'unlimited'.

Kind regards,

Job
[routing-wg] rsync://rpki.ripe.net rsyncd limits set too low?
Hi all,

I noticed the RIPE NCC RRDP service (https://rrdp.ripe.net/) became unreachable at 2022-02-16 13:34:10 UTC+0 (and still is down).

This RRDP outage event should not pose an issue for most RPKI validators, because most RPKI cache implementations (which follow best practices) will attempt to synchronize via RSYNC in case RRDP is unavailable.

However, it seems RIPE NCC adjusted the default rsyncd settings and lowered the concurrent connection count from 200 (which already is too low for RPKI Repository Servers) to 150?

    $ rsync --no-motd -rt rsync://rpki.ripe.net/repository/
    @ERROR: max connections (150) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1666) [Receiver=3.1.2]

I'm not familiar with the RIPE RPKI RSYNC service architecture, so the above error could be misleading: perhaps there is a loadbalancer distributing TCP sessions across multiple backends, each backend configured to serve up to 150 clients? Or perhaps there is a single rsyncd instance (in which case 150 definitely is too low).

Is the RIPE NCC RPKI RSYNC service underprovisioned? If yes, why?

Kind regards,

Job
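For anyone wanting to detect this condition in their own monitoring, here is a minimal sketch that recognises the "max connections" error in captured rsync output and computes a capped exponential retry schedule. The function names, the 30 second base and the 600 second cap are assumptions for illustration; they are not taken from any validator or from RIPE NCC guidance.

```python
import re
from typing import List, Optional

# Matches the daemon error line quoted in the message above.
_MAX_CONN_RE = re.compile(r"@ERROR: max connections \((\d+)\) reached")


def parse_max_connections(output: str) -> Optional[int]:
    """Return the advertised connection limit, or None if the error is absent."""
    match = _MAX_CONN_RE.search(output)
    return int(match.group(1)) if match else None


def backoff_delays(attempts: int, base_seconds: float = 30.0,
                   cap_seconds: float = 600.0) -> List[float]:
    """Capped exponential back-off: base, 2*base, 4*base, ... up to the cap."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(attempts)]


sample = (
    "@ERROR: max connections (150) reached -- try again later\n"
    "rsync error: error starting client-server protocol (code 5) "
    "at main.c(1666) [Receiver=3.1.2]\n"
)

limit = parse_max_connections(sample)
delays = backoff_delays(6)
print(limit, delays)
```

A real client would likely add random jitter to each delay so that many relying parties rejected at the same moment do not all retry in lockstep.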