Re: BGP communities, was: Re: Facebook post-mortems... - Update!

2021-10-07 Thread Ross Tajvar
There are also a bunch at http://bgp.community (linked to the source where possible instead of keeping a stale copy). On Tue, Oct 5, 2021, 1:17 PM Jay Hennigan wrote: > On 10/5/21 09:49, Warren Kumari wrote: > > > Can someone explain to me, preferably in baby words, why so many > > providers

Re: Facebook post-mortems...

2021-10-06 Thread Bjørn Mork
Masataka Ohta writes: > Bjørn Mork wrote: > >> Removing all DNS servers at the same time is never a good idea, even in >> the situation where you believe they are all failing. > > As I wrote: > > : That facebook use very short expiration period for zone > : data is a separate issue. > > that is a

Re: Facebook post-mortems...

2021-10-06 Thread Masataka Ohta
Bjørn Mork wrote: Removing all DNS servers at the same time is never a good idea, even in the situation where you believe they are all failing. As I wrote: : That facebook use very short expiration period for zone : data is a separate issue. that is a separate issue. > This is a very hard

Re: Facebook post-mortems...

2021-10-06 Thread Masataka Ohta
Hank Nussbacher wrote: - "it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we'd normally use to investigate and resolve outages like this. Our primary and out-of-band

Re: Facebook post-mortems...

2021-10-06 Thread Bjørn Mork
Masataka Ohta writes: > As long as name servers with expired zone data won't serve > request from outside of facebook, whether BGP routes to the > name servers are announced or not is unimportant. I am not convinced this is true. You'd normally serve some semi-static content, especially wrt

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/6/21 06:51, Hank Nussbacher wrote: - "During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network" Can anyone guess as to

Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher
On 05/10/2021 21:11, Randy Monroe via NANOG wrote: Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ Let's try to break down this "engineering" blog posting: - "During one of these routine maintenance jobs, a command was issued with the intention to assess the

Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta
Randy Monroe via NANOG wrote: Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ So, what was lost was internal connectivity between data centers. That facebook use very short expiration period for zone data is a separate issue. As long as name servers with

Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
Actually for card readers, the offline verification nature of certificates is probably a nice property. But client certs pose all sorts of other problems like their scalability, ease of making changes (roles, etc), and other kinds of considerations that make you want to fetch more information

Re: Facebook post-mortems...

2021-10-05 Thread Matthew Petach
On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.) wrote: > Why ever would have a card reader on your external facing network, if that > was really the case why they couldn't get in to fix it? > Let's hypothesize for a moment. Let's suppose you've decided that certificate-based authentication is

Re: Facebook post-mortems...

2021-10-05 Thread Randy Monroe via NANOG
Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas wrote: > > On 10/5/21 12:17 AM, Carsten Bormann wrote: > > On 5. Oct 2021, at 07:42, William Herrin wrote: > >> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote:

Re: Facebook post-mortems...

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:47 PM Miles Fidelman wrote: > jcur...@istaff.org wrote: > > Fairly abstract - Facebook Engineering - > https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr >

Re: Facebook post-mortems... - Update!

2021-10-05 Thread Randy Bush
> Can someone explain to me, preferably in baby words, why so many providers > view information like https://as37100.net/?bgp as secret/proprietary? it shows we're important

Re: Facebook post-mortems...

2021-10-05 Thread Miles Fidelman
jcur...@istaff.org wrote: Fairly abstract - Facebook Engineering - https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr Also, Cloudflare’s take on the

Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
On 10/5/21 12:17 AM, Carsten Bormann wrote: On 5. Oct 2021, at 07:42, William Herrin wrote: On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: They have a monkey patch subsystem. Lol. Yes, actually, they do. They use Chef extensively to configure operating systems. Chef is written in

Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
On 10/4/21 10:42 PM, William Herrin wrote: On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: They have a monkey patch subsystem. Lol. Yes, actually, they do. They use Chef extensively to configure operating systems. Chef is written in Ruby. Ruby has something called Monkey Patches. This

BGP communities, was: Re: Facebook post-mortems... - Update!

2021-10-05 Thread Jay Hennigan
On 10/5/21 09:49, Warren Kumari wrote: Can someone explain to me, preferably in baby words, why so many providers view information like https://as37100.net/?bgp as secret/proprietary? I've interacted with numerous providers who require an NDA or pinky-swear to get a

Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker
Ryan, thanks for sharing your data, it's unfortunate that it was seemingly misinterpreted by a few souls. * ryan.lan...@gmail.com (Ryan Landry) [Tue 05 Oct 2021, 17:52 CEST]: Niels, you are correct about my initial tweet, which I updated in later tweets to clarify with a hat tip to Will

Re: Facebook post-mortems...

2021-10-05 Thread Ryan Brooks
> On Oct 5, 2021, at 10:32 AM, Jean St-Laurent via NANOG > wrote: > > If you have some DNS working, you can point it at a static “we are down and > we know it” page much sooner, At the scale of facebook that seems extremely difficult to pull off w/o most of their architecture online.

Re: Facebook post-mortems... - Update!

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 9:56 AM Mark Tinka wrote: > > > On 10/5/21 15:40, Mark Tinka wrote: > > > > > I don't disagree with you one bit. It's for that exact reason that we > > built: > > > > https://as37100.net/ > > > > ... not for us, but specifically for other random network operators > >

Re: Facebook post-mortems...

2021-10-05 Thread Joe Maimon
Mark Tinka wrote: So I'm not worried about DNS stability when split across multiple physical entities. I'm talking about the actual services being hosted on a single network that goes bye-bye like what we saw yesterday. All the DNS resolution means diddly, even if it tells us that DNS

RE: Facebook post-mortems...

2021-10-05 Thread Kain, Becki (.)
Update about the October 4th outage https://clicktime.symantec.com/3X9y1HrhXV7HkUEoMWnXtR67Vc?u=https%3A%2F%2Fengineering.fb.com%2F2021

Re: Facebook post-mortems...

2021-10-05 Thread Ryan Landry
Niels, you are correct about my initial tweet, which I updated in later tweets to clarify with a hat tip to Will Hargrave as thanks for seeking more detail. Cheers, Ryan On Tue, Oct 5, 2021 at 08:24 Niels Bakker wrote: > * telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]: >

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
happening. It seems to be really resilient in today’s world, a business needs their NS in at least 2 different entities like amazon.com is doing. Jean From: NANOG On Behalf Of Matthew Kaufman Sent: October 5, 2021 10:59 AM To: Mark Tinka Cc: nanog@nanog.org Subject: Re: Facebook post

Re: Facebook post-mortems...

2021-10-05 Thread Bjørn Mork
Jean St-Laurent via NANOG writes: > Let's check how these big companies are spreading their NS's. > > $ dig +short facebook.com NS > d.ns.facebook.com. > b.ns.facebook.com. > c.ns.facebook.com. > a.ns.facebook.com. > > $ dig +short google.com NS > ns1.google.com. > ns4.google.com. >

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 16:59, Matthew Kaufman wrote: Disagree for two reasons: 1. If you have some DNS working, you can point it at a static “we are down and we know it” page much sooner. Isn't that what Twirra is for, nowadays :-)... 2. If you have convinced the entire world to install tracking

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 16:49, Joe Greco wrote: Unrealistic user expectations are not the point. Users can demand whatever unrealistic claptrap they wish to. The user's expectations, today, are always going to be unrealistic, especially when they are able to enjoy a half-decent service

Re: Facebook post-mortems...

2021-10-05 Thread Matthew Kaufman
On Tue, Oct 5, 2021 at 5:44 AM Mark Tinka wrote: > > > On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: > > > Maybe withdrawing those routes to their NS could have been mitigated by > having NS in separate entities. > > Well, doesn't really matter if you can resolve the A/AAAA/MX records, >

Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 03:40:39PM +0200, Mark Tinka wrote: > Yes, total nightmare yesterday, but sure that 9,999 of the helpdesk > tickets had nothing to do with DNS. They likely all were - "Your > Internet is down, just fix it; we don't wanna know". Unrealistic user expectations are not the

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Does anyone have info whether this network 69.171.240.0/20 was reachable during the outage. Jean From: NANOG On Behalf Of Tom Beecher Sent: October 5, 2021 10:30 AM To: NANOG Subject: Re: Facebook post-mortems... People keep repeating this but I don't think it's true. My comment

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
> > People keep repeating this but I don't think it's true. > My comment is solely sourced on my direct observations on my network, maybe 30-45 minutes in. Everything except a few /24s disappeared from DFZ providers, but I still heard those prefixes from direct peerings. There was no

Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker
* telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]: Facebook stopped announcing the vast majority of their IP space to the DFZ during this. People keep repeating this but I don't think it's true. It's probably based on this tweet:

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
I agree that resolving a non-routable address doesn’t bring you a working service. I thought a few networks were still reachable, like their MX or some DRP networks. Thanks for the update. Jean From: Tom Beecher <beec...@beecher.cc>

Re: Facebook post-mortems... - Update!

2021-10-05 Thread Mark Tinka
On 10/5/21 15:40, Mark Tinka wrote: I don't disagree with you one bit. It's for that exact reason that we built:     https://as37100.net/ ... not for us, but specifically for other random network operators around the world whom we may never get to drink a crate of wine with. I have

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Maybe withdrawing those routes to their NS could have been mitigated by having NS in separate entities. Assuming they had such a thing in place, it would not have helped. Facebook stopped announcing the vast majority of their IP

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 15:04, Joe Greco wrote: You don't think at least 10,000 helpdesk requests about Facebook being down were sent yesterday? That and Jane + Thando likely re-installing all their apps and iOS/Android on their phones, and rebooting them 300 times in the hopes that Facebook and

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 14:58, Jean St-Laurent wrote: If your NS are in 2 separate entities, you could still resolve your MX/A/AAAA/NS. Look how Amazon is doing it. dig +short amazon.com NS ns4.p31.dynect.net. ns3.p31.dynect.net. ns1.p31.dynect.net. ns2.p31.dynect.net. pdns6.ultradns.co.uk.
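The "separate entities" comparison in this sub-thread can be roughed out in a few lines of Python. This is only a sketch: the helper names operator_key and group_by_operator are made up for illustration, and the grouping rule (drop the leftmost label of each NS name) is a crude assumption. Different names do not prove independent networks, and the Amazon list below is just the part visible in the truncated snippet above.

```python
from collections import defaultdict


def operator_key(ns_name: str) -> str:
    """Heuristic only: drop the leftmost label of an NS hostname.

    'ns4.p31.dynect.net.' -> 'p31.dynect.net'
    'a.ns.facebook.com.'  -> 'ns.facebook.com'
    """
    labels = ns_name.rstrip(".").lower().split(".")
    return ".".join(labels[1:])


def group_by_operator(ns_names):
    """Group NS hostnames by the heuristic key above."""
    groups = defaultdict(list)
    for name in ns_names:
        groups[operator_key(name)].append(name)
    return dict(groups)


# NS names as quoted in the thread (the amazon.com list is truncated there).
amazon_ns = [
    "ns4.p31.dynect.net.",
    "ns3.p31.dynect.net.",
    "ns1.p31.dynect.net.",
    "ns2.p31.dynect.net.",
    "pdns6.ultradns.co.uk.",
]
facebook_ns = [
    "d.ns.facebook.com.",
    "b.ns.facebook.com.",
    "c.ns.facebook.com.",
    "a.ns.facebook.com.",
]

for label, ns_set in (("amazon.com", amazon_ns), ("facebook.com", facebook_ns)):
    groups = group_by_operator(ns_set)
    print(label, "->", len(groups), "apparent NS group(s):", sorted(groups))
```

As others point out further up the thread, a second NS group only keeps the names resolvable; it does nothing for reachability if the hosting network itself has withdrawn its routes.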

Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:57:42PM +0200, Mark Tinka wrote: > > > On 10/5/21 14:52, Joe Greco wrote: > > >That's not quite true. It still gives much better clue as to what is > >going on; if a host resolves to an IP but isn't pingable/traceroutable, > >that is something that many more techy

Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta
Carsten Bormann wrote: While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of choice in certain cases) in its toolkit that is generally called “monkey-patching”, I think Michael was actually thinking about the “chaos monkey”,

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
different entities for DNS is not financially viable? Jean -Original Message- From: NANOG On Behalf Of Mark Tinka Sent: October 5, 2021 8:22 AM To: nanog@nanog.org Subject: Re: Facebook post-mortems... On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: > Maybe withdrawing th

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 14:52, Joe Greco wrote: That's not quite true. It still gives much better clue as to what is going on; if a host resolves to an IP but isn't pingable/traceroutable, that is something that many more techy people will understand than if the domain is simply unresolvable. Not

Re: Facebook post-mortems...

2021-10-05 Thread Lou D
.org. >> ns-1984.awsdns-56.co.uk. >> ns-659.awsdns-18.net. >> ns-81.awsdns-10.com. >> >> Amazon and Netflix seem to not keep their eggs in the same basket. From >> a first look, they seem more resilient than facebook.com, google.com and >> apple.com >> >>

Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher
On 05/10/2021 13:17, Hauke Lampe wrote: On 05.10.21 07:22, Hank Nussbacher wrote: Thanks for the posting.  How come they couldn't access their routers via their OOB access? My speculative guess would be that OOB access to a few outbound-facing routers per DC does not help much if a

Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:22:09PM +0200, Mark Tinka wrote: > > > On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: > > >Maybe withdrawing those routes to their NS could have been mitigated by > >having NS in separate entities. > > Well, doesn't really matter if you can resolve the A/AAAA/MX

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
> > My speculative guess would be that OOB access to a few outbound-facing > routers per DC does not help much if a configuration error withdraws the > infrastructure prefixes down to the rack level while dedicated OOB to > each RSW would be prohibitive. > If your OOB has any dependence on the

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 08:55, a...@nethead.de wrote: Rumour is that when the FB route prefixes had been withdrawn their door authentication system stopped working and they could not get back into the building or server room :) Assuming there is any truth to that, guess we can't cancel the hard

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
first look, they seem more resilient than facebook.com, google.com and > apple.com > > Jean > > -Original Message- > From: NANOG On Behalf Of Jeff > Tantsura > Sent: October 5, 2021 2:18 AM > To: William Herrin > Cc: nanog@nanog.org > Subject: Re: Faceboo

Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka
On 10/5/21 14:08, Jean St-Laurent via NANOG wrote: Maybe withdrawing those routes to their NS could have been mitigated by having NS in separate entities. Well, doesn't really matter if you can resolve the A/AAAA/MX records, but you can't connect to the network that is hosting the

Re: Facebook post-mortems...

2021-10-05 Thread Hauke Lampe
On 05.10.21 07:22, Hank Nussbacher wrote: > Thanks for the posting.  How come they couldn't access their routers via > their OOB access? My speculative guess would be that OOB access to a few outbound-facing routers per DC does not help much if a configuration error withdraws the infrastructure

Re: Facebook post-mortems...

2021-10-05 Thread av
On 10/5/21 1:22 PM, Hank Nussbacher wrote: Thanks for the posting.  How come they couldn't access their routers via their OOB access? Rumour is that when the FB route prefixes had been withdrawn their door authentication system stopped working and they could not get back into the building or

Re: Facebook post-mortems...

2021-10-05 Thread Callahan Warlick
I think that was from an outage in 2010: https://engineering.fb.com/2010/09/23/uncategorized/more-details-on-today-s-outage/ On Mon, Oct 4, 2021 at 6:19 PM Jay Hennigan wrote: > On 10/4/21 17:58, jcur...@istaff.org wrote: > > Fairly abstract - Facebook Engineering - > > >

Re: Facebook post-mortems...

2021-10-05 Thread Justin Keller
Per o comments, the linked Facebook outage was from around 5/15/21 On Mon, Oct 4, 2021 at 9:08 PM Rubens Kuhl wrote: > > The FB one seems to be from a previous event. Downtime doesn't match, > visible flaw effects don't either. > > > Rubens > > > On Mon, Oct 4, 2021 at 9:59 PM wrote: > > > >

RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
-Original Message- From: NANOG On Behalf Of Jeff Tantsura Sent: October 5, 2021 2:18 AM To: William Herrin Cc: nanog@nanog.org Subject: Re: Facebook post-mortems... 129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering all 4 nameservers (a-d) were withdrawn from all

Re: Facebook post-mortems...

2021-10-05 Thread Carsten Bormann
On 5. Oct 2021, at 07:42, William Herrin wrote: > > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: >> They have a monkey patch subsystem. Lol. > > Yes, actually, they do. They use Chef extensively to configure > operating systems. Chef is written in Ruby. Ruby has something called >

Re: Facebook post-mortems...

2021-10-05 Thread Jeff Tantsura
129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 15:40 UTC. Cheers, Jeff > On Oct 4, 2021, at 22:45, William Herrin wrote: > > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: >>
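For anyone who wants to sanity-check how the three prefixes listed here relate, a small sketch with Python's standard ipaddress module follows. The variable and function names are illustrative only; the prefixes are the ones quoted in the message. It shows that the two /24s are more-specifics of the /23 and together tile it exactly, so with all three withdrawn nothing in this set is left to cover that address range.

```python
import ipaddress

# Prefixes as listed in the message above.
covering = ipaddress.ip_network("129.134.30.0/23")
specifics = [
    ipaddress.ip_network("129.134.30.0/24"),
    ipaddress.ip_network("129.134.31.0/24"),
]


def covers_all(aggregate, more_specifics):
    """True if every more-specific prefix sits inside the aggregate."""
    return all(p.subnet_of(aggregate) for p in more_specifics)


# collapse_addresses merges adjacent, aligned networks into the
# shortest equivalent list of prefixes.
collapsed = list(ipaddress.collapse_addresses(specifics))

print("more-specifics inside the /23:", covers_all(covering, specifics))
print("the two /24s collapse to:", [str(n) for n in collapsed])
```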

Re: Facebook post-mortems...

2021-10-04 Thread William Herrin
On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas wrote: > They have a monkey patch subsystem. Lol. Yes, actually, they do. They use Chef extensively to configure operating systems. Chef is written in Ruby. Ruby has something called Monkey Patches. This is where at an arbitrary location in the code

Re: Facebook post-mortems...

2021-10-04 Thread Hank Nussbacher
On 05/10/2021 05:53, Patrick W. Gilmore wrote: Update about the October 4th outage https://engineering.fb.com/2021/10/04/networking-traffic/outage/ Thanks for the posting. How come they couldn't access their routers via their OOB access? -Hank

Re: Facebook post-mortems...

2021-10-04 Thread Patrick W. Gilmore
Update about the October 4th outage https://engineering.fb.com/2021/10/04/networking-traffic/outage/ -- TTFN, patrick > On Oct 4, 2021, at 9:25 PM, Mel Beckman wrote: > > The CF post mortem looks sensible, and a good summary of what we all saw from > the outside with BGP routes being

Re: Facebook post-mortems...

2021-10-04 Thread Mel Beckman
The CF post mortem looks sensible, and a good summary of what we all saw from the outside with BGP routes being withdrawn. Given the fragility of BGP, this could still end up being a malicious attack. -mel via cell > On Oct 4, 2021, at 6:19 PM, Jay Hennigan wrote: > > On 10/4/21 17:58,

Re: update - Re: Facebook post-mortems...

2021-10-04 Thread Rabbi Rob Thomas
>> Fairly abstract - Facebook Engineering - >> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr >> > > My bad - might be best to ignore the above

Re: update - Re: Facebook post-mortems...

2021-10-04 Thread Michael Thomas
On 10/4/21 6:07 PM, jcur...@istaff.org wrote: On 4 Oct 2021, at 8:58 PM, jcur...@istaff.org wrote: Fairly abstract - Facebook Engineering - https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr

Re: Facebook post-mortems...

2021-10-04 Thread Jay Hennigan
On 10/4/21 17:58, jcur...@istaff.org wrote: Fairly abstract - Facebook Engineering - https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr I believe

Re: Facebook post-mortems...

2021-10-04 Thread Michael Thomas
On 10/4/21 5:58 PM, jcur...@istaff.org wrote: Fairly abstract - Facebook Engineering - https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr Also,

update - Re: Facebook post-mortems...

2021-10-04 Thread jcurran
On 4 Oct 2021, at 8:58 PM, jcur...@istaff.org wrote: > > Fairly abstract - Facebook Engineering - > https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr > >

Re: Facebook post-mortems...

2021-10-04 Thread Rubens Kuhl
The FB one seems to be from a previous event. Downtime doesn't match, visible flaw effects don't either. Rubens On Mon, Oct 4, 2021 at 9:59 PM wrote: > > Fairly abstract - Facebook Engineering - >