Re: Better description of what happened

2021-10-06 Thread Hugo Slabbert
> > Do we actually know this wrt the tools referred to in "the total loss of > DNS broke many of the tools we’d normally use to investigate and resolve > outages like this."? Those tools aren't necessarily located in any of > the remote data centers, and some of them might even refer to resources

Re: Better description of what happened

2021-10-06 Thread Tom Beecher
I mean, at the end of the day they likely designed these systems to be able to handle one or more datacenters being disconnected from the world, and considered a scenario of ALL their datacenters being disconnected from the world so unlikely they chose not to solve for it. Works great, until it

Re: Better description of what happened

2021-10-06 Thread Bjørn Mork
Tom Beecher writes:
> Even if the external announcements were not withdrawn, and the edge DNS servers could provide stale answers, the IPs those answers provided wouldn't have actually been reachable
Do we actually know this wrt the tools referred to in "the total loss of DNS broke many

Re: Better description of what happened

2021-10-06 Thread PJ Capelli via NANOG
I probably still have my US Robotics 14.4 in the basement, but it's been a while since I've had access to a POTS line it would work on ... :)
pj capelli pjcape...@pm.me "Never to get lost, is not living" - Rebecca Solnit

Re: Better description of what happened

2021-10-06 Thread Tom Beecher
By what they have said publicly, the initial trigger point was that all of their datacenters were disconnected from their internal backbone, thus unreachable. Once that occurs, nothing else really matters. Even if the external announcements were not withdrawn, and the edge DNS servers could

Re: Better description of what happened

2021-10-06 Thread Curtis Maurand
On 10/5/21 5:51 AM, scott wrote: On 10/5/21 8:39 PM, Michael Thomas wrote: This bit posted by Randy might get lost in the other thread, but it appears that their DNS withdraws BGP routes for prefixes that they can't reach or are flaky it seems. Apparently that goes for the prefixes that

Re: Better description of what happened

2021-10-05 Thread Hugo Slabbert
Had some chats with other folks: Arguably you could change the nameserver isolation check failure action to be "depref your exports" rather than "yank it all". Basically, set up a tiered setup so the boxes passing those additional health checks and that should have correct entries would be your
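A rough sketch of the "depref your exports" versus "yank it all" idea, for illustration only. Everything here is an assumption rather than anything Facebook has described: the names Health, Action, export_action and answering_sites are hypothetical, and the two booleans stand in for whatever health checks the name servers really run. It models only which sites still attract DNS traffic, not real BGP.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ADVERTISE = "advertise"  # announce the anycast prefix at normal preference
    DEPREF = "depref"        # keep announcing, but make the route less preferred
    WITHDRAW = "withdraw"    # pull the announcement entirely


@dataclass
class Health:
    dns_process_ok: bool      # hypothetical check: local name server answers queries
    backbone_reachable: bool  # hypothetical check: site can reach the data centers


def export_action(h: Health, depref_on_isolation: bool) -> Action:
    """Decide what one name server site does with its anycast announcement."""
    if not h.dns_process_ok:
        # A server that cannot answer at all should never attract traffic.
        return Action.WITHDRAW
    if h.backbone_reachable:
        return Action.ADVERTISE
    # Isolation check failed: either "yank it all" or only depref.
    return Action.DEPREF if depref_on_isolation else Action.WITHDRAW


def answering_sites(fleet: dict, depref_on_isolation: bool) -> list:
    """Sites that would receive queries: preferred tier first, depref tier as fallback."""
    actions = {name: export_action(h, depref_on_isolation) for name, h in fleet.items()}
    preferred = [n for n, a in actions.items() if a is Action.ADVERTISE]
    if preferred:
        return preferred
    return [n for n, a in actions.items() if a is Action.DEPREF]


if __name__ == "__main__":
    # Every site isolated at once, as in the scenario discussed in this thread.
    fleet = {"site-a": Health(True, False), "site-b": Health(True, False)}
    print(answering_sites(fleet, depref_on_isolation=False))  # nothing left announcing
    print(answering_sites(fleet, depref_on_isolation=True))   # isolated sites still answer
```

With the tiered variant, healthy sites win whenever at least one exists, and the isolated sites only take traffic when nothing better is announced. As noted elsewhere in the thread, still answering DNS does not by itself make the addresses in those answers reachable.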

Re: Better description of what happened

2021-10-05 Thread Michael Thomas
On 10/5/21 3:09 PM, Andy Brezinsky wrote: It's a few years old, but Facebook has talked a little bit about their DNS infrastructure before.  Here's a little clip that talks about Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073 From their outage report, it sounds like their authoritative

Re: Better description of what happened

2021-10-05 Thread Andy Brezinsky
It's a few years old, but Facebook has talked a little bit about their DNS infrastructure before.  Here's a little clip that talks about Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073 From their outage report, it sounds like their authoritative DNS servers withdraw their anycast

Re: Better description of what happened

2021-10-05 Thread scott
On 10/5/21 8:39 PM, Michael Thomas wrote: This bit posted by Randy might get lost in the other thread, but it appears that their DNS withdraws BGP routes for prefixes that they can't reach or are flaky it seems. Apparently that goes for the prefixes that the name servers are on too. This

Better description of what happened

2021-10-05 Thread Michael Thomas
This bit posted by Randy might get lost in the other thread, but it appears that their DNS withdraws BGP routes for prefixes that they can't reach or that are flaky. Apparently that goes for the prefixes that the name servers are on too. This caused internal outages too as it seems they