Re: Facebook post-mortems...

2021-10-05 Thread Jeff Tantsura
129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 15:40 UTC. Cheers, Jeff > On Oct 4, 2021, at 22:45, William Herrin wrote: > > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: >>
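As a quick illustration of how the three prefixes listed above relate to one another, here is a minimal sketch using Python's standard `ipaddress` module. The sample address is a hypothetical one picked from inside the range for illustration; it is not taken from the message.

```python
import ipaddress

# The three prefixes quoted in the message above.
withdrawn = [ipaddress.ip_network(p) for p in (
    "129.134.30.0/23", "129.134.30.0/24", "129.134.31.0/24")]

def covering(address, prefixes):
    """Return the prefixes that contain `address`, most specific first."""
    ip = ipaddress.ip_address(address)
    matches = [p for p in prefixes if ip in p]
    return sorted(matches, key=lambda p: p.prefixlen, reverse=True)

# Hypothetical sample address, for illustration only.
sample = "129.134.30.12"
print([str(p) for p in covering(sample, withdrawn)])

# Both /24s are halves of the /23, so withdrawing all three leaves
# no covering route for any address in the range.
print(all(p.subnet_of(withdrawn[0]) for p in withdrawn[1:]))
```

Because the /23 and both of its /24 halves are on the list, there is no less specific route from this set left to fall back on once all three are withdrawn.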

Re: IRR for IX peers

2021-10-05 Thread Mark Tinka
On 10/5/21 09:29, Łukasz Bromirski wrote: …like a, say, „single pane of glass”? ;) Oh dear Lord :-)... Mark.

Re: facebook outage

2021-10-05 Thread Niels Bakker
* jllee9...@gmail.com (John Lee) [Tue 05 Oct 2021, 01:06 CEST]: I was seeing NXDOMAIN errors, so I wonder if they had a DNS outage of some sort?? Were you using host(1)? Please don't; use dig(1) instead. As far as I know, NXDOMAINs were at no point being returned, but due to the

Re: Facebook post-mortems...

2021-10-05 Thread Carsten Bormann
On 5. Oct 2021, at 07:42, William Herrin wrote: > > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: >> They have a monkey patch subsystem. Lol. > > Yes, actually, they do. They use Chef extensively to configure > operating systems. Chef is written in Ruby. Ruby has something called >

Re: IRR for IX peers

2021-10-05 Thread Łukasz Bromirski
…like a, say, „single pane of glass”? ;) -- ./ > On 5 Oct 2021, at 06:25, Mark Tinka wrote: > >> On 10/4/21 21:55, Nick Hilliard wrote: >> >> Nearly 30 years on, this is still the state of the art. > > Not unlike an NMS... still can't walk into a shop and just buy one that >

Re: Facebook post-mortems...

2021-10-05 Thread Ryan Landry
Niels, you are correct about my initial tweet, which I updated in later tweets to clarify with a hat tip to Will Hargrave as thanks for seeking more detail. Cheers, Ryan On Tue, Oct 5, 2021 at 08:24 Niels Bakker wrote: > * telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]: >

RE: Facebook post-mortems...

2021-10-05 Thread Kain, Becki (.)
Why ever would you have a card reader on your external facing network, if that was really the case why they couldn't get in to fix it? -Original Message- From: NANOG On Behalf Of Patrick W. Gilmore Sent: Monday, October 04, 2021 10:53 PM To: North American Operators' Group Subject: Re:

BGP communities, was: Re: Facebook post-mortems... - Update!

2021-10-05 Thread Jay Hennigan
On 10/5/21 09:49, Warren Kumari wrote: Can someone explain to me, preferably in baby words, why so many providers view information like https://as37100.net/?bgp as secret/proprietary? I've interacted with numerous providers who require an NDA or pinky-swear to get a

Re: Facebook post-mortems...

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:47 PM Miles Fidelman wrote: > jcur...@istaff.org wrote: > > Fairly abstract - Facebook Engineering - > https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr >

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
1. If you have some DNS working, you can point it at a static “we are down and we know it” page much sooner. 2. Good catch and you’re right that it would have reduced the planetary impact: fewer calls to the help desk and fewer device reboots. It would have given visibility on what’s

Re: Facebook post-mortems... - Update!

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 9:56 AM Mark Tinka wrote: > > > On 10/5/21 15:40, Mark Tinka wrote: > > > > > I don't disagree with you one bit. It's for that exact reason that we > > built: > > > > https://as37100.net/ > > > > ... not for us, but specifically for other random network operators > >

Re: Disaster Recovery Process

2021-10-05 Thread Jeff Shultz
7. Make sure any access controlled rooms have physical keys that are available at need - and aren't secured by the same access control that they are to circumvent. . 8. Don't make your access control dependent on internet access - always have something on the local network it can fall back to.

Re: Disaster Recovery Process

2021-10-05 Thread Niels Bakker
* deles...@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]: World broke. Crazy $$ per hour down time. Doors open with a fire axe. Please stop spreading fake news. https://twitter.com/MikeIsaac/status/1445196576956162050 |need to issue a correction: the team dispatched to the Facebook

Re: Facebook post-mortems...

2021-10-05 Thread Bjørn Mork
Jean St-Laurent via NANOG writes: > Let's check how these big companies are spreading their NS's. > > $ dig +short facebook.com NS > d.ns.facebook.com. > b.ns.facebook.com. > c.ns.facebook.com. > a.ns.facebook.com. > > $ dig +short google.com NS > ns1.google.com. > ns4.google.com. >

Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
On 10/4/21 10:42 PM, William Herrin wrote: On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: They have a monkey patch subsystem. Lol. Yes, actually, they do. They use Chef extensively to configure operating systems. Chef is written in Ruby. Ruby has something called Monkey Patches. This

Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
On 10/5/21 12:17 AM, Carsten Bormann wrote: On 5. Oct 2021, at 07:42, William Herrin wrote: On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: They have a monkey patch subsystem. Lol. Yes, actually, they do. They use Chef extensively to configure operating systems. Chef is written in

Re: Facebook post-mortems...

2021-10-05 Thread Miles Fidelman
jcur...@istaff.org wrote: Fairly abstract - Facebook Engineering - https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr Also, Cloudflare’s take on the

Re: Disaster Recovery Process

2021-10-05 Thread Jamie Dahl
The NIMS/ICS system works very well for issues like this. I utilize ICS regularly in my Search and Rescue world, and the last two companies I worked for utilize(d) it extensively during outages. It allows folks from various different disciplines, roles and backgrounds to come in, and provide

Re: Facebook post-mortems... - Update!

2021-10-05 Thread Randy Bush
> Can someone explain to me, preferably in baby words, why so many providers > view information like https://as37100.net/?bgp as secret/proprietary? it shows we're important

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 16:59, Matthew Kaufman wrote: Disagree for two reasons: 1. If you have some DNS working, you can point it at a static “we are down and we know it” page much sooner. Isn't that what Twirra is for, nowadays :-)... 2. If you have convinced the entire world to install tracking

Re: Facebook post-mortems...

2021-10-05 Thread Ryan Brooks
> On Oct 5, 2021, at 10:32 AM, Jean St-Laurent via NANOG > wrote: > > If you have some DNS working, you can point it at a static “we are down and > we know it” page much sooner, At the scale of facebook that seems extremely difficult to pull off w/o most of their architecture online.

Re: Disaster Recovery Process

2021-10-05 Thread jim deleskie
World broke. Crazy $$ per hour down time. Doors open with a fire axe. Glass breaks super easy too and much less expensive than adding 15 min to failure. -jim On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz, wrote: > 7. Make sure any access controlled rooms have physical keys that are >

Re: Disaster Recovery Process

2021-10-05 Thread jim deleskie
I don't see posting in a DR process thread about thinking to use alternative entry methods to locked doors and spreading false information. If do well. Mail filters are simple. -jim On Tue., Oct. 5, 2021, 7:35 p.m. Niels Bakker, wrote: > * deles...@gmail.com (jim deleskie) [Tue 05 Oct 2021,

Re: Facebook post-mortems...

2021-10-05 Thread Joe Maimon
Mark Tinka wrote: So I'm not worried about DNS stability when split across multiple physical entities. I'm talking about the actual services being hosted on a single network that goes bye-bye like what we saw yesterday. All the DNS resolution means diddly, even if it tells us that DNS

Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker
Ryan, thanks for sharing your data, it's unfortunate that it was seemingly misinterpreted by a few souls. * ryan.lan...@gmail.com (Ryan Landry) [Tue 05 Oct 2021, 17:52 CEST]: Niels, you are correct about my initial tweet, which I updated in later tweets to clarify with a hat tip to Will

Re: Disaster Recovery Process

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:07 PM Jeff Shultz wrote: > 7. Make sure any access controlled rooms have physical keys that are > available at need - and aren't secured by the same access control that they > are to circumvent. . > 8. Don't make your access control dependent on internet access - always

Re: massive facebook outage presently

2021-10-05 Thread David Andrzejewski
I find it hilarious and ironic that their CTO had to use a competitor’s platform to confirm their outage. - dave > On Oct 4, 2021, at 16:45, Hank Nussbacher wrote: > > On 04/10/2021 22:05, Jason Kuehl wrote: > > BGP related: > https://twitter.com/SGgrc/status/1445116435731296256 > as also

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
> > My speculative guess would be that OOB access to a few outbound-facing > routers per DC does not help much if a configuration error withdraws the > infrastructure prefixes down to the rack level while dedicated OOB to > each RSW would be prohibitive. > If your OOB has any dependence on the

Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:22:09PM +0200, Mark Tinka wrote: > > > On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: > > >Maybe withdrawing those routes to their NS could have been mitigated by > >having NS in separate entities. > > Well, doesn't really matter if you can resolve the A//MX

Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta
Carsten Bormann wrote: While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of choice in certain cases) in its toolkit that is generally called “monkey-patching”, I think Michael was actually thinking about the “chaos monkey”,

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 15:04, Joe Greco wrote: You don't think at least 10,000 helpdesk requests about Facebook being down were sent yesterday? That and Jane + Thando likely re-installing all their apps and iOS/Android on their phones, and rebooting them 300 times in the hopes that Facebook and

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
I agree that resolving a non-routable address doesn’t bring you a working service. I thought a few networks were still reachable like their MX or some DRP networks. Thanks for the update Jean From: Tom Beecher Sent: October 5, 2021 8:33 AM To: Jean St-Laurent Cc: Jeff Tantsura ; William

Re: Disaster Recovery Process

2021-10-05 Thread Jared Mauch
> On Oct 5, 2021, at 10:05 AM, Karl Auer wrote: > > On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote: >> A few reminders for people: >> [excellent list snipped] > > I'd add one "soft" list item: > > - in your emergency plan, have one or two people nominated who are VERY > high up in the

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Maybe withdrawing those routes to their NS could have been mitigated by having NS in separate entities. Let's check how these big companies are spreading their NS's. $ dig +short facebook.com NS d.ns.facebook.com. b.ns.facebook.com. c.ns.facebook.com. a.ns.facebook.com. $ dig +short google.com

Re: massive facebook outage presently

2021-10-05 Thread PJ Capelli via NANOG
Seems unlikely that FB internal controls would allow such a backdoor ... "Never to get lost, is not living" - Rebecca Solnit Sent with ProtonMail Secure Email. ‐‐‐ Original Message ‐‐‐ On Monday, October 4th, 2021 at 4:12 PM, Baldur Norddahl wrote: > On Mon, 4 Oct 2021 at 21:58,

Re: massive facebook outage presently

2021-10-05 Thread Jorge Amodio
How come such a large operation does not have out-of-band access in case of emergencies??? Somebody's getting fired! -J On Mon, Oct 4, 2021 at 3:51 PM Aaron C. de Bruyn via NANOG wrote: > It looks like it might take a while according to a news reporter's tweet: > > "Was just on phone

Re: Facebook post-mortems...

2021-10-05 Thread Hauke Lampe
On 05.10.21 07:22, Hank Nussbacher wrote: > Thanks for the posting.  How come they couldn't access their routers via > their OOB access? My speculative guess would be that OOB access to a few outbound-facing routers per DC does not help much if a configuration error withdraws the infrastructure

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 08:55, a...@nethead.de wrote: Rumour is that when the FB route prefixes had been withdrawn their door authentication system stopped working and they could not get back into the building or server room :) Assuming there is any truth to that, guess we can't cancel the hard

Disaster Recovery Process

2021-10-05 Thread Jared Mauch
> On Oct 4, 2021, at 4:53 PM, Jorge Amodio wrote: > > How come such a large operation does not have out-of-band access in case > of emergencies??? > > I mentioned to someone yesterday that most OOB systems _are_ the internet. It doesn’t always seem like you need things like modems

Re: Facebook post-mortems...

2021-10-05 Thread Lou D
Facebook stopped announcing the vast majority of their IP space to the DFZ during this. This is where I would like to learn more about the outage. Direct Peering FB connections saw a drop in networks (about a dozen), and one of the networks covered their C and D nameservers, but the block for A and

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 14:52, Joe Greco wrote: That's not quite true. It still gives much better clue as to what is going on; if a host resolves to an IP but isn't pingable/traceroutable, that is something that many more techy people will understand than if the domain is simply unresolvable. Not

Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:57:42PM +0200, Mark Tinka wrote: > > > On 10/5/21 14:52, Joe Greco wrote: > > >That's not quite true. It still gives much better clue as to what is > >going on; if a host resolves to an IP but isn't pingable/traceroutable, > >that is something that many more techy

Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker
* telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]: Facebook stopped announcing the vast majority of their IP space to the DFZ during this. People keep repeating this but I don't think it's true. It's probably based on this tweet:

Re: Facebook post-mortems...

2021-10-05 Thread Justin Keller
Per o comments, the linked Facebook outage was from around 5/15/21 On Mon, Oct 4, 2021 at 9:08 PM Rubens Kuhl wrote: > > The FB one seems to be from a previous event. Downtime doesn't match, > visible flaw effects don't either. > > > Rubens > > > On Mon, Oct 4, 2021 at 9:59 PM wrote: > > > >

RE: massive facebook outage presently

2021-10-05 Thread Jean St-Laurent via NANOG
I don't understand how this would have helped yesterday. From what is public so far, they really painted themselves into a corner with no way out. A classic, but at epic scale. They will learn and improve for sure, but I don't understand how "firmware default to your own network" would have help

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
If your NS are in 2 separate entities, you could still resolve your MX/A//NS. Look how Amazon is doing it. dig +short amazon.com NS ns4.p31.dynect.net. ns3.p31.dynect.net. ns1.p31.dynect.net. ns2.p31.dynect.net. pdns6.ultradns.co.uk. pdns1.ultradns.net. They use dyn DNS from Oracle and
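The comparison above can be sketched offline by grouping the nameserver names by the domain they sit under. This is only a rough heuristic under stated assumptions: the name lists are copied from the `dig` output quoted in this thread (the Amazon list as far as it is shown), the `TWO_PART_SUFFIXES` set and both function names are hypothetical, and a shared domain says nothing about where the servers are actually routed.

```python
from collections import defaultdict

# Illustrative only: a real check would use a full public-suffix list.
TWO_PART_SUFFIXES = {"co.uk"}

def operator_domain(ns_name):
    """Naively reduce an NS hostname to the domain it is registered under."""
    labels = ns_name.rstrip(".").lower().split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_PART_SUFFIXES else 2
    return ".".join(labels[-keep:])

def group_by_operator(ns_names):
    """Map each registered domain to the NS hostnames under it."""
    groups = defaultdict(list)
    for name in ns_names:
        groups[operator_domain(name)].append(name)
    return dict(groups)

# Names as quoted from the dig output in this thread.
facebook = ["d.ns.facebook.com.", "b.ns.facebook.com.",
            "c.ns.facebook.com.", "a.ns.facebook.com."]
amazon = ["ns4.p31.dynect.net.", "ns3.p31.dynect.net.",
          "ns1.p31.dynect.net.", "ns2.p31.dynect.net.",
          "pdns6.ultradns.co.uk.", "pdns1.ultradns.net."]

for label, names in (("facebook.com", facebook), ("amazon.com", amazon)):
    groups = group_by_operator(names)
    print(label, "->", len(groups), "distinct NS domain(s):", sorted(groups))
```

Note that two different domains can still belong to one operator (the two ultradns names here), so the count is an upper bound on operator diversity, not a measurement of it.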

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: Maybe withdrawing those routes to their NS could have been mitigated by having NS in separate entities. Well, doesn't really matter if you can resolve the A//MX records, but you can't connect to the network that is hosting the

Re: Disaster Recovery Process

2021-10-05 Thread Karl Auer
On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote: > A few reminders for people: > [excellent list snipped] I'd add one "soft" list item: - in your emergency plan, have one or two people nominated who are VERY high up in the organisation. Their lines need to be open to the decisionmakers in

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
> > People keep repeating this but I don't think it's true. > My comment is solely sourced on my direct observations on my network, maybe 30-45 minutes in. Everything except a few /24s disappeared from DFZ providers, but I still heard those prefixes from direct peerings. There was no

Re: Facebook post-mortems...

2021-10-05 Thread Matthew Kaufman
On Tue, Oct 5, 2021 at 5:44 AM Mark Tinka wrote: > > > On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: > > > Maybe withdrawing those routes to their NS could have been mitigated by > having NS in separate entities. > > Well, doesn't really matter if you can resolve the A//MX records, >

Re: massive facebook outage presently

2021-10-05 Thread Glenn Kelley
This is why you should have Routers that are Firmware Defaulted to your own network.  ALWAYS. Be it Calix or even a Mikrotik which you have set up with Netboot - having these default to your own setup is REALLY a game changer. Without it - you are rolling trucks or at minimum taking heavy call

Re: Facebook post-mortems... - Update!

2021-10-05 Thread Mark Tinka
On 10/5/21 15:40, Mark Tinka wrote: I don't disagree with you one bit. It's for that exact reason that we built:     https://as37100.net/ ... not for us, but specifically for other random network operators around the world whom we may never get to drink a crate of wine with. I have

Re: Disaster Recovery Process

2021-10-05 Thread Sean Donelan
On Wed, 6 Oct 2021, Karl Auer wrote: I'd add one "soft" list item: - in your emergency plan, have one or two people nominated who are VERY high up in the organisation. Their lines need to be open to the decisionmakers in the emergency team(s). Their job is to put the fear of a vengeful god into

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Does anyone have info whether this network 69.171.240.0/20 was reachable during the outage. Jean From: NANOG On Behalf Of Tom Beecher Sent: October 5, 2021 10:30 AM To: NANOG Subject: Re: Facebook post-mortems... People keep repeating this but I don't think it's true. My comment

Re: Facebook post-mortems...

2021-10-05 Thread Callahan Warlick
I think that was from an outage in 2010: https://engineering.fb.com/2010/09/23/uncategorized/more-details-on-today-s-outage/ On Mon, Oct 4, 2021 at 6:19 PM Jay Hennigan wrote: > On 10/4/21 17:58, jcur...@istaff.org wrote: > > Fairly abstract - Facebook Engineering - > > >

Re: Facebook post-mortems...

2021-10-05 Thread av
On 10/5/21 1:22 PM, Hank Nussbacher wrote: Thanks for the posting.  How come they couldn't access their routers via their OOB access? Rumour is that when the FB route prefixes had been withdrawn their door authentication system stopped working and they could not get back into the building or

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
> > Maybe withdrawing those routes to their NS could have been mitigated by > having NS in separate entities. > Assuming they had such a thing in place , it would not have helped. Facebook stopped announcing the vast majority of their IP space to the DFZ during this. So even they did have an

Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher
On 05/10/2021 13:17, Hauke Lampe wrote: On 05.10.21 07:22, Hank Nussbacher wrote: Thanks for the posting.  How come they couldn't access their routers via their OOB access? My speculative guess would be that OOB access to a few outbound-facing routers per DC does not help much if a

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 14:58, Jean St-Laurent wrote: If your NS are in 2 separate entities, you could still resolve your MX/A//NS. Look how Amazon is doing it. dig +short amazon.com NS ns4.p31.dynect.net. ns3.p31.dynect.net. ns1.p31.dynect.net. ns2.p31.dynect.net. pdns6.ultradns.co.uk.

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
As of now, their MX is hosted on 69.171.251.251 Was this network still announced yesterday in the DFZ during the outage? 69.171.224.0/19 69.171.240.0/20 Jean From: Jean St-Laurent Sent: October 5, 2021 9:50 AM To: 'Tom Beecher' Cc: 'Jeff Tantsura' ; 'William Herrin' ; 'NANOG'

Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 03:40:39PM +0200, Mark Tinka wrote: > Yes, total nightmare yesterday, but sure that 9,999 of the helpdesk > tickets had nothing to do with DNS. They likely all were - "Your > Internet is down, just fix it; we don't wanna know". Unrealistic user expectations are not the

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 16:49, Joe Greco wrote: Unrealistic user expectations are not the point. Users can demand whatever unrealistic claptrap they wish to. The user's expectations, today, are always going to be unrealistic, especially when they are able to enjoy a half-decent service

RE: HBO Max Contact

2021-10-05 Thread Travis Garrison
We have just run into this issue. We contacted Digital Elements and they let us know the issue is with the Windscribe VPN service. Windscribe will randomly select client IP addresses and use that as the host IP. Of course, when they do that, it gets our IP addresses blocked. We have never

Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
Actually for card readers, the offline verification nature of certificates is probably a nice property. But client certs pose all sorts of other problems like their scalability, ease of making changes (roles, etc), and other kinds of considerations that make you want to fetch more information

Re: Facebook post-mortems...

2021-10-05 Thread Randy Monroe via NANOG
Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas wrote: > > On 10/5/21 12:17 AM, Carsten Bormann wrote: > > On 5. Oct 2021, at 07:42, William Herrin wrote: > >> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote:

Better description of what happened

2021-10-05 Thread Michael Thomas
This bit posted by Randy might get lost in the other thread, but it appears that their DNS withdraws BGP routes for prefixes that they can't reach or that are flaky. Apparently that goes for the prefixes that the name servers are on too. This caused internal outages too as it seems they

Re: Facebook post-mortems...

2021-10-05 Thread Matthew Petach
On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.) wrote: > Why ever would have a card reader on your external facing network, if that > was really the case why they couldn't get in to fix it? > Let's hypothesize for a moment. Let's suppose you've decided that certificate-based authentication is

Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta
Randy Monroe via NANOG wrote: Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ So, what was lost was internal connectivity between data centers. That Facebook uses a very short expiration period for zone data is a separate issue. As long as name servers with

Re: Better description of what happened

2021-10-05 Thread Michael Thomas
On 10/5/21 3:09 PM, Andy Brezinsky wrote: It's a few years old, but Facebook has talked a little bit about their DNS infrastructure before.  Here's a little clip that talks about Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073 From their outage report, it sounds like their authoritative

Re: Better description of what happened

2021-10-05 Thread scott
On 10/5/21 8:39 PM, Michael Thomas wrote: This bit posted by Randy might get lost in the other thread, but it appears that their DNS withdraws BGP routes for prefixes that they can't reach or are flaky it seems. Apparently that goes for the prefixes that the name servers are on too. This

Re: Better description of what happened

2021-10-05 Thread Andy Brezinsky
It's a few years old, but Facebook has talked a little bit about their DNS infrastructure before.  Here's a little clip that talks about Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073 From their outage report, it sounds like their authoritative DNS servers withdraw their anycast

Re: Better description of what happened

2021-10-05 Thread Hugo Slabbert
Had some chats with other folks: Arguably you could change the nameserver isolation check failure action to be "depref your exports" rather than "yank it all". Basically, set up a tiered setup so the boxes passing those additional health checks and that should have correct entries would be your
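The "depref rather than yank" idea above can be sketched as a small decision function. This is a hedged illustration, not Facebook's actual logic: the `Action` enum, `export_action`, and both health-check inputs are hypothetical names, and how a depreferenced announcement is realised in BGP (for example AS-path prepending or a lower-preference community) is left out.

```python
from enum import Enum

class Action(Enum):
    ANNOUNCE = "announce"              # normal, preferred announcement
    DEPREF = "announce-depreferenced"  # still reachable, but less preferred
    WITHDRAW = "withdraw"              # stop announcing the anycast prefix

def export_action(dns_process_ok, backbone_reachable):
    """Pick an export action for one nameserver site (hypothetical policy)."""
    if not dns_process_ok:
        # The box cannot answer queries at all, so withdrawing is safe.
        return Action.WITHDRAW
    if not backbone_reachable:
        # Isolated but still serving: keep a fallback path alive.
        return Action.DEPREF
    return Action.ANNOUNCE

# If every site loses the backbone at once, all of them depref
# instead of all of them withdrawing.
sites = {"site-a": (True, False), "site-b": (True, False)}
actions = {name: export_action(*state) for name, state in sites.items()}
print({name: action.value for name, action in actions.items()})
print(any(a is not Action.WITHDRAW for a in actions.values()))
```

The design point is that a check failing everywhere simultaneously degrades to equally depreferenced announcements, so the prefix stays reachable, while a single isolated site still loses traffic to healthier ones.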

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/6/21 06:51, Hank Nussbacher wrote: - "During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network" Can anyone guess as to

Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher
On 05/10/2021 21:11, Randy Monroe via NANOG wrote: Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ Let's try to break down this "engineering" blog posting: - "During one of these routine maintenance jobs, a command was issued with the intention to assess the