Re: BGP communities, was: Re: Facebook post-mortems... - Update!

2021-10-07 Thread Ross Tajvar
There are also a bunch at http://bgp.community (linked to the source where
possible instead of keeping a stale copy).

On Tue, Oct 5, 2021, 1:17 PM Jay Hennigan  wrote:

> On 10/5/21 09:49, Warren Kumari wrote:
>
> > Can someone explain to me, preferably in baby words, why so many
> > providers view information like https://as37100.net/?bgp
> >  as secret/proprietary?
> > I've interacted with numerous providers who require an NDA or
> > pinky-swear to get a list of their communities -- is this really just 1:
> > security through obscurity, 2: an artifact of the culture of not
> > sharing, 3: an attempt to seem cool by making you jump through hoops to
> > prove your worthiness, 4: some weird 'mah competitors won't be able to
> > figure out my secret sauce without knowing that 17 means Asia, or 5:
> > something else?
>
> Not sure of the rationale for keeping them secret, but at least one
> aggregated source of dozens of them exists and has been around for a
> long time. https://onestep.net/communities/
>
> --
> Jay Hennigan - j...@west.net
> Network Engineering - CCIE #7880
> 503 897-8550 - WB6RDV
>


Re: Facebook post-mortems...

2021-10-06 Thread Bjørn Mork
Masataka Ohta  writes:
> Bjørn Mork wrote:
>
>> Removing all DNS servers at the same time is never a good idea, even in
>> the situation where you believe they are all failing.
>
> As I wrote:
>
> : That facebook uses a very short expiration period for zone
> : data is a separate issue.
>
> that is a separate issue.

Sorry, I don't understand what you're getting at.  The TTL is not an
issue. An infinite TTL won't save you if all authoritative servers are
unreachable.  It will just make things worse in almost every other error
scenario.

The only solution to the problem of unreachable authoritative DNS
servers is:  Don't do that.

>> This is a very hard problem to solve.
>
> If that is their policy, it is just a policy to enforce and not
> a problem to solve.

The policy is there to solve a real problem.

Serving stale data from a single disconnected anycast site is a problem.
A disconnected site is unmanaged and must make autonomous decisions.
That pre-programmed decision is "just policy".  Should you withdraw the
DNS routes or not?  Serve stale or risk meltdown?

I still don't think there is an easy and obviously correct answer.  But
they do of course need to add a safety net or two if they continue with
the "meltdown" policy.


Bjørn


Re: Facebook post-mortems...

2021-10-06 Thread Masataka Ohta

Bjørn Mork wrote:


Removing all DNS servers at the same time is never a good idea, even in
the situation where you believe they are all failing.


As I wrote:

: That facebook uses a very short expiration period for zone
: data is a separate issue.

that is a separate issue.

> This is a very hard problem to solve.

If that is their policy, it is just a policy to enforce and not
a problem to solve.

Masataka Ohta


Re: Facebook post-mortems...

2021-10-06 Thread Masataka Ohta

Hank Nussbacher wrote:


- "it was not possible to access our data centers through our normal
 means because their networks were down, and second, the total loss
of DNS broke many of the internal tools we'd normally use to
investigate and resolve outages like this.  Our primary and
out-of-band network access was down..."

Does this mean that FB acknowledges that the loss of DNS broke their
OOB access?


It means FB still does not yet understand what happened.

Lack of BGP announcements does not mean "total loss". Name
servers should still be accessible to internal tools.

But withdrawing the route (for BGP and, maybe, IGP) of a failing anycast
server is bad engineering, seemingly derived from the commonly seen
misunderstanding that anycast could provide redundancy.

Redundancy of DNS is maintained by multiple (unicast or anycast)
name servers with different addresses, for which the withdrawal of a
failing route is an unnecessary complication.

Masataka Ohta


Re: Facebook post-mortems...

2021-10-06 Thread Bjørn Mork
Masataka Ohta  writes:

> As long as name servers with expired zone data won't serve
> requests from outside of facebook, whether BGP routes to the
> name servers are announced or not is unimportant.

I am not convinced this is true.  You'd normally serve some semi-static
content, especially wrt stuff you need yourself to manage your network.
Removing all DNS servers at the same time is never a good idea, even in
the situation where you believe they are all failing.

The problem is of course that you can't let the servers take the
decision to withdraw from anycast if you want to prevent this
catastrophe.  The servers have no knowledge of the rest of the network.
They only know that they've lost contact with it.  So they all make the
same stupid decision.

But if the servers can't withdraw, then they will serve stale content if
the data center loses backbone access. And with a large enough network
then that is probably something which happens on a regular basis.

This is a very hard problem to solve.
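
In practice the autonomous decision each site makes is rarely more
sophisticated than a loop like the sketch below. This is purely
illustrative: the prefix, the probe targets and the ExaBGP-style
announce/withdraw API are my assumptions, not anything Facebook has
described.

#!/bin/sh
# Hypothetical anycast health check for one DNS node (sketch only).
# Assumes a routing daemon such as ExaBGP reads this script's stdout
# and turns "announce"/"withdraw" lines into BGP updates.
PREFIX="192.0.2.53/32"       # example anycast service address
PROBE="198.51.100.1"         # example backbone aggregation router

while sleep 10; do
    # The only question a site can ask autonomously: "can I still see
    # the backbone, and is my local resolver answering?"
    if ping -c1 -W2 "$PROBE" >/dev/null 2>&1 &&
       dig +time=2 +tries=1 @127.0.0.1 example.com SOA >/dev/null 2>&1
    then
        echo "announce route $PREFIX next-hop self"
    else
        echo "withdraw route $PREFIX next-hop self"
    fi
done

The loop has no view of the wider network, which is why every site can
reach the same conclusion at the same moment.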

Thanks a lot to facebook for making the detailed explanation available
to the public.  I'm crossing my fingers hoping they follow up with
details about the solutions they come up with.  The problem affects any
critical anycast DNS service. And it doesn't have to be as big as
facebook to be locally critical to an enterprise, ISP or whatever.



Bjørn


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/6/21 06:51, Hank Nussbacher wrote:



- "During one of these routine maintenance jobs, a command was issued 
with the intention to assess the availability of global backbone 
capacity, which unintentionally took down all the connections in our 
backbone network"


Can anyone guess as to what command FB issued that would cause them to 
withdraw all those prefixes?


Hard to say, as it seems that the command was innocent enough, perhaps 
running a batch of other sub-commands to check port status, bandwidth 
utilization, MPLS-TE values, etc. However, it sounds like an unforeseen 
bug in the command ran other things, or the cascade of how the 
sub-commands were run caused unforeseen problems.


We shall guess this one forever, as I doubt Facebook will go into that 
much detail.


What I can tell you is that all the major content providers spend a lot 
of time, money and effort in automating both capacity planning, as well 
as capacity auditing. It's a bit more complex for them, because their 
variables aren't just links and utilization, but also locations, fibre 
availability, fibre pricing, capacity lease pricing, the presence of 
carrier-neutral data centres, the presence of exchange points, current 
vendor equipment models and pricing, projection of future fibre and 
capacity pricing, etc.


It's a totally different world from normal ISP-land.




- "it was not possible to access our data centers through our normal 
means because their networks were down, and second, the total loss of 
DNS broke many of the internal tools we’d normally use to investigate 
and resolve outages like this.  Our primary and out-of-band network 
access was down..."


Does this mean that FB acknowledges that the loss of DNS broke their 
OOB access?


I need to put my thinking cap on, but not sure whether running DNS in 
the IGP would have been better in this instance.


We run our Anycast DNS network in our IGP, mainly to always guarantee 
latency-based routing, but also to ensure that the failure of a 
higher-level protocol like BGP does not disconnect internal access that 
is needed for troubleshooting and repair. Given that the IGP is a much 
lower-level routing protocol, it's less likely (though not impossible) 
that it would go down together with BGP.


In the past, we have, indeed, had BGP issues during which we were able to 
maintain DNS access internally because the IGP was unaffected.
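
To make that concrete, the usual pattern is to hang the anycast address 
off a loopback, let the IGP redistribute connected routes, and have a 
local health check add or remove the address. A minimal sketch of that 
idea, with made-up names and addresses, purely for illustration:

#!/bin/sh
# Illustrative only: tie an IGP-advertised anycast address to the health
# of the local DNS daemon. Assumes the IGP redistributes connected
# routes, so removing the address withdraws it from the IGP.
ANYCAST="192.0.2.10/32"      # example anycast resolver address
DEV="lo"

if dig +time=2 +tries=1 @192.0.2.10 example.net SOA >/dev/null 2>&1; then
    ip addr show dev "$DEV" | grep -q "192.0.2.10/32" ||
        ip addr add "$ANYCAST" dev "$DEV"
else
    ip addr del "$ANYCAST" dev "$DEV" 2>/dev/null
fi

BGP never enters the picture, which is the point.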


The final statement from that report is interesting:

    "From here on out, our job is to strengthen our testing,
    drills, and overall resilience to make sure events like this
    happen as rarely as possible."

... which, in my rudimentary translation, means that:

    "There are no guarantees that our automation software will not
    poop cows again, but we hope that when that does happen, we
    shall be able to send our guys out to site much more quickly."

... which, to be fair, is totally understandable. These automation 
tools, especially in large networks such as BigContent, are 
significantly more fragile the more complex they get, and the more batch 
tasks they need to perform on various parts of a network of this size 
and scope. It's a pity these automation tools are all homegrown, and 
can't be bought "pre-packaged and pre-approved to never fail" from IT 
Software Store down the road. But it's the only way for networks of this 
capacity to operate, and the risk they always sit with for being that large.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher

On 05/10/2021 21:11, Randy Monroe via NANOG wrote:
Updated: 
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ 


Let's try to break down this "engineering" blog posting:

- "During one of these routine maintenance jobs, a command was issued 
with the intention to assess the availability of global backbone 
capacity, which unintentionally took down all the connections in our 
backbone network"


Can anyone guess as to what command FB issued that would cause them to 
withdraw all those prefixes?


- "it was not possible to access our data centers through our normal 
means because their networks were down, and second, the total loss of 
DNS broke many of the internal tools we’d normally use to investigate 
and resolve outages like this.  Our primary and out-of-band network 
access was down..."


Does this mean that FB acknowledges that the loss of DNS broke their OOB 
access?


-Hank


Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta

Randy Monroe via NANOG wrote:


Updated:
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/


So, what was lost was internal connectivity between data centers.

That facebook uses a very short expiration period for zone
data is a separate issue.

As long as name servers with expired zone data won't serve
requests from outside of facebook, whether BGP routes to the
name servers are announced or not is unimportant.

Masataka Ohta


Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
Actually for card readers, the offline verification nature of 
certificates is probably a nice property. But client certs pose all 
sorts of other problems like their scalability, ease of making changes 
(roles, etc), and other kinds of considerations that make you want to 
fetch more information online... which completely negates the advantages 
of offline verification. Just the CRL problem would probably sink you 
since when you fire an employee you want access to be cut off immediately.


The other thing that would scare me in general with expecting offline 
verification is that the *reason* it's being used (working offline) might 
get forgotten, and back come the online dependencies while nobody is looking.


BTW: you don't need to reach the trust anchor, though you almost 
certainly need to run OCSP or something like it if you have client certs.


Mike

On 10/5/21 1:34 PM, Matthew Petach wrote:



On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.) > wrote:


Why would you ever have a card reader on your external-facing network?
If that was really the case, why couldn't they get in to fix it?


Let's hypothesize for a moment.

Let's suppose you've decided that certificate-based
authentication is the cat's meow, and so you've got
dot1x authentication on every network port in your
corporate environment, all your users are authenticated
via certificates, all properly signed all the way up the
chain to the root trust anchor.

Life is good.

But then you have a bad network day.  Suddenly,
you can't talk to upstream registries/registrars,
you can't reach the trust anchor for your certificates,
and you discover that all the laptops plugged into
your network switches are failing to validate their
authenticity; sure, you're on the network, but you're
in a guest vlan, with no access.  Your user credentials
aren't able to be validated, so you're stuck with the
base level of access, which doesn't let you into the
OOB network.

Turns out your card readers were all counting on
dot1x authentication to get them into the right vlan
as well, and with the network buggered up, the
switches can't validate *their* certificates either,
so the door badge card readers just flash their
LEDs impotently when you wave your badge at
them.

Remember, one attribute of certificates is that they are
designated as valid for a particular domain, or set of
subdomains with a wildcard; that is, an authenticator needs
to know where the certificate is being presented to know if
it is valid within that scope or not.   You can do that scope
validation through several different mechanisms,
such as through a chain of trust to a certificate authority,
or through DNSSEC with DANE--but fundamentally,
all certificates have a scope within which they are valid,
and a means to identify in which scope they are being
used.  And whether your certificate chain of trust is
being determined by certificate authorities or DANE,
they all require that trust to be validated by something
other than the client and server alone--which generally
makes them dependent on some level of external
network connectivity being present in order to properly
function.   [yes, yes, we can have a side discussion about
having every authentication server self-sign certificates
as its own CA, and thus eliminate external network
connectivity dependencies--but that's an administrative
nightmare that I don't think any large organization would
sign up for.]

So, all of the client certificates and authorization servers
we're talking about exist on your internal network, but they
all counted on reachability to your infrastructure
servers in order to properly authenticate and grant
access to devices and people.  If your BGP update
made your infrastructure servers, such as DNS servers,
become unreachable, then suddenly you might well
find yourself locked out both physically and logically
from your own network.

Again, this is purely hypothetical, but it's one scenario
in which a routing-level "oops" could end up causing
physical-entry denial, as well as logical network access
level denial, without actually having those authentication
systems on external facing networks.

Certificate-based authentication is scalable and cool, but
it's really important to think about even generally "that'll
never happen" failure scenarios when deploying it into
critical systems.  It's always good to have the "break glass
in case of emergency" network that doesn't rely on dot1x,
that works without DNS, without NTP, without RADIUS,
or any other external system, with a binder with printouts
of the IP addresses of all your really critical servers and
routers in it which gets updated a few times a year, so that
when the SHTF, a person sitting at a laptop plugged into
that network with the binder next to them can get into the
emergency-only local account on each router to fix things.

And yes, you want every command that local emergency-only
user types into a router to be logged, because someone

Re: Facebook post-mortems...

2021-10-05 Thread Matthew Petach
On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.)  wrote:

> Why would you ever have a card reader on your external-facing network? If
> that was really the case, why couldn't they get in to fix it?
>

Let's hypothesize for a moment.

Let's suppose you've decided that certificate-based
authentication is the cat's meow, and so you've got
dot1x authentication on every network port in your
corporate environment, all your users are authenticated
via certificates, all properly signed all the way up the
chain to the root trust anchor.

Life is good.

But then you have a bad network day.  Suddenly,
you can't talk to upstream registries/registrars,
you can't reach the trust anchor for your certificates,
and you discover that all the laptops plugged into
your network switches are failing to validate their
authenticity; sure, you're on the network, but you're
in a guest vlan, with no access.  Your user credentials
aren't able to be validated, so you're stuck with the
base level of access, which doesn't let you into the
OOB network.

Turns out your card readers were all counting on
dot1x authentication to get them into the right vlan
as well, and with the network buggered up, the
switches can't validate *their* certificates either,
so the door badge card readers just flash their
LEDs impotently when you wave your badge at
them.

Remember, one attribute of certificates is that they are
designated as valid for a particular domain, or set of
subdomains with a wildcard; that is, an authenticator needs
to know where the certificate is being presented to know if
it is valid within that scope or not.   You can do that scope
validation through several different mechanisms,
such as through a chain of trust to a certificate authority,
or through DNSSEC with DANE--but fundamentally,
all certificates have a scope within which they are valid,
and a means to identify in which scope they are being
used.  And whether your certificate chain of trust is
being determined by certificate authorities or DANE,
they all require that trust to be validated by something
other than the client and server alone--which generally
makes them dependent on some level of external
network connectivity being present in order to properly
function.   [yes, yes, we can have a side discussion about
having every authentication server self-sign certificates
as its own CA, and thus eliminate external network
connectivity dependencies--but that's an administrative
nightmare that I don't think any large organization would
sign up for.]

So, all of the client certificates and authorization servers
we're talking about exist on your internal network, but they
all counted on reachability to your infrastructure
servers in order to properly authenticate and grant
access to devices and people.  If your BGP update
made your infrastructure servers, such as DNS servers,
become unreachable, then suddenly you might well
find yourself locked out both physically and logically
from your own network.

Again, this is purely hypothetical, but it's one scenario
in which a routing-level "oops" could end up causing
physical-entry denial, as well as logical network access
level denial, without actually having those authentication
systems on external facing networks.

Certificate-based authentication is scalable and cool, but
it's really important to think about even generally "that'll
never happen" failure scenarios when deploying it into
critical systems.  It's always good to have the "break glass
in case of emergency" network that doesn't rely on dot1x,
that works without DNS, without NTP, without RADIUS,
or any other external system, with a binder with printouts
of the IP addresses of all your really critical servers and
routers in it which gets updated a few times a year, so that
when the SHTF, a person sitting at a laptop plugged into
that network with the binder next to them can get into the
emergency-only local account on each router to fix things.

And yes, you want every command that local emergency-only
user types into a router to be logged, because someone
wanting to create mischief in your network is going to aim
for that account access if they can get it; so watch it like a
hawk, and the only time it had better be accessed and used
is when the big red panic button has already been hit, and
the executives are huddled around speakerphones wanting
to know just how fast you can get things working again.  ^_^;

I know nothing of the incident in question.  But sitting at home,
hypothesizing about ways in which things could go wrong, this
is one of the reasons why I still configure static emergency
accounts on network devices, even with centrally administered
account systems, and why there's always a set of "no dot1x"
ports that work to get into the OOB/management network even
when everything else has gone toes-up.   :)

So--that's one way in which an outage like this could have
locked people out of buildings.   ^_^;

Thanks!

Matt
[ready for the deluge of people pointing out I've 

Re: Facebook post-mortems...

2021-10-05 Thread Randy Monroe via NANOG
Updated:
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas  wrote:

>
> On 10/5/21 12:17 AM, Carsten Bormann wrote:
> > On 5. Oct 2021, at 07:42, William Herrin  wrote:
> >> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
> >>> They have a monkey patch subsystem. Lol.
> >> Yes, actually, they do. They use Chef extensively to configure
> >> operating systems. Chef is written in Ruby. Ruby has something called
> >> Monkey Patches.
> > While Ruby indeed has a chain-saw (read: powerful, dangerous, still the
> tool of choice in certain cases) in its toolkit that is generally called
> “monkey-patching”, I think Michael was actually thinking about the “chaos
> monkey”,
> > https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
> > https://netflix.github.io/chaosmonkey/
>
> No, chaos monkey is a purposeful thing to induce corner case errors so
> they can be fixed. The earlier outage involved a config sanitizer that
> screwed up and then pushed it out. I can't get my head around why
> anybody thought that was a good idea vs rejecting it and making somebody
> fix the config.
>
> Mike
>
>
>

-- 

Randy Monroe

Network Engineering

[image: Uber] 


Re: Facebook post-mortems...

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:47 PM Miles Fidelman 
wrote:

> jcur...@istaff.org wrote:
>
> Fairly abstract - Facebook Engineering -
> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
> 
>
> Also, Cloudflare’s take on the outage -
> https://blog.cloudflare.com/october-2021-facebook-outage/
>
> FYI,
> /John
>
> This may be a dumb question, but does this suggest that Facebook publishes
> rather short TTLs for their DNS records?  Otherwise, why would an internal
> failure make them unreachable so quickly?
>

Looks like 60 seconds:

$  dig +norec star-mini.c10r.facebook.com. @d.ns.c10r.facebook.com.

; <<>> DiG 9.10.6 <<>> +norec star-mini.c10r.facebook.com. @
d.ns.c10r.facebook.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25582
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;star-mini.c10r.facebook.com. IN A

;; ANSWER SECTION:
star-mini.c10r.facebook.com. 60 IN A 157.240.229.35

;; Query time: 42 msec
;; SERVER: 185.89.219.11#53(185.89.219.11)
;; WHEN: Tue Oct 05 14:01:06 EDT 2021
;; MSG SIZE  rcvd: 72



... and cue the "Bwahahhaha! If *I* ran Facebook I'd make the TTL be [2
sec|30sec|5min|1h|6h+3sec|1day|6months|maxint32]" threads

Choosing the TTL is a balancing act between stability, agility, load,
politeness, renewal latency, etc -- but I'm sure NANOG can boil it down to
"They did it wrong!..."

W


> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why.  ... unknown
>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Facebook post-mortems... - Update!

2021-10-05 Thread Randy Bush
> Can someone explain to me, preferably in baby words, why so many providers
> view information like https://as37100.net/?bgp as secret/proprietary?

it shows we're important


Re: Facebook post-mortems...

2021-10-05 Thread Miles Fidelman

jcur...@istaff.org wrote:
Fairly abstract - Facebook Engineering - 
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr 



Also, Cloudflare’s take on the outage - 
https://blog.cloudflare.com/october-2021-facebook-outage/


FYI,
/John

This may be a dumb question, but does this suggest that Facebook 
publishes rather short TTLs for their DNS records?  Otherwise, why would 
an internal failure make them unreachable so quickly?


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.   Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why.  ... unknown



Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas



On 10/5/21 12:17 AM, Carsten Bormann wrote:

On 5. Oct 2021, at 07:42, William Herrin  wrote:

On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:

They have a monkey patch subsystem. Lol.

Yes, actually, they do. They use Chef extensively to configure
operating systems. Chef is written in Ruby. Ruby has something called
Monkey Patches.

While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of 
choice in certain cases) in its toolkit that is generally called 
“monkey-patching”, I think Michael was actually thinking about the “chaos 
monkey”,
https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
https://netflix.github.io/chaosmonkey/


No, chaos monkey is a purposeful thing to induce corner case errors so 
they can be fixed. The earlier outage involved a config sanitizer that 
screwed up and then pushed it out. I can't get my head around why 
anybody thought that was a good idea vs rejecting it and making somebody 
fix the config.


Mike




Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas



On 10/4/21 10:42 PM, William Herrin wrote:

On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:

They have a monkey patch subsystem. Lol.

Yes, actually, they do. They use Chef extensively to configure
operating systems. Chef is written in Ruby. Ruby has something called
Monkey Patches. This is where at an arbitrary location in the code you
re-open an object defined elsewhere and change its methods.

Chef doesn't always do the right thing. You tell Chef to remove an RPM
and it does. Even if it has to remove half the operating system to
satisfy the dependencies. If you want it to do something reasonable,
say throw an error because you didn't actually tell it to remove half
the operating system, you have a choice: spin up a fork of chef with a
couple patches to the chef-rpm interaction or just monkey-patch it in
one of your chef recipes.


Just because a language allows monkey patching doesn't mean that you 
should use it. In that particular outage they said that they fix up 
errant looking config files rather than throw an error and make somebody 
fix it. That is an extremely bad practice and frankly looks like amateur 
hour to me.


Mike



BGP communities, was: Re: Facebook post-mortems... - Update!

2021-10-05 Thread Jay Hennigan

On 10/5/21 09:49, Warren Kumari wrote:

Can someone explain to me, preferably in baby words, why so many 
providers view information like https://as37100.net/?bgp 
 as secret/proprietary?
I've interacted with numerous providers who require an NDA or 
pinky-swear to get a list of their communities -- is this really just 1: 
security through obscurity, 2: an artifact of the culture of not 
sharing, 3: an attempt to seem cool by making you jump through hoops to 
prove your worthiness, 4: some weird 'mah competitors won't be able to 
figure out my secret sauce without knowing that 17 means Asia, or 5: 
something else?


Not sure of the rationale for keeping them secret, but at least one 
aggregated source of dozens of them exists and has been around for a 
long time. https://onestep.net/communities/


--
Jay Hennigan - j...@west.net
Network Engineering - CCIE #7880
503 897-8550 - WB6RDV


Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker
Ryan, thanks for sharing your data, it's unfortunate that it was 
seemingly misinterpreted by a few souls.



* ryan.lan...@gmail.com (Ryan Landry) [Tue 05 Oct 2021, 17:52 CEST]:
Niels, you are correct about my initial tweet, which I updated in 
later tweets to clarify with a hat tip to Will Hargrave as thanks 
for seeking more detail.




Re: Facebook post-mortems...

2021-10-05 Thread Ryan Brooks


> On Oct 5, 2021, at 10:32 AM, Jean St-Laurent via NANOG  
> wrote:
> 
> If you have some DNS working, you can point it at a static “we are down and 
> we know it” page much sooner,

At the scale of facebook that seems extremely difficult to pull off w/o most of 
their architecture online.  Imagine trying to terminate >billion sessions.

When they started to come back up and had their "We're sorry" page up- even 
their static png couldn't make it onto the wire.



Re: Facebook post-mortems... - Update!

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 9:56 AM Mark Tinka  wrote:

>
>
> On 10/5/21 15:40, Mark Tinka wrote:
>
> >
> > I don't disagree with you one bit. It's for that exact reason that we
> > built:
> >
> > https://as37100.net/
> >
> > ... not for us, but specifically for other random network operators
> > around the world whom we may never get to drink a crate of wine with.
>

Can someone explain to me, preferably in baby words, why so many providers
view information like https://as37100.net/?bgp as secret/proprietary?
I've interacted with numerous providers who require an NDA or pinky-swear
to get a list of their communities -- is this really just 1: security
through obscurity, 2: an artifact of the culture of not sharing, 3: an
attempt to seem cool by making you jump through hoops to prove your
worthiness, 4: some weird 'mah competitors won't be able to figure out my
secret sauce without knowing that 17 means Asia, or 5: something else?

Yes, some providers do publish these (usually on the website equivalent of
a locked filing cabinet stuck in a disused lavatory with a sign on the door
saying ‘Beware of the Leopard.”), and PeeringDB has definitely helped, but
I still don't understand many providers stance on this...

W




> >
> > I have to say that it has likely cut e-mails to our NOC as well as
> > overall pain in half, if not more.
>
> What I forgot to add, however, is that unlike Facebook, we aren't a
> major content provider. So we don't have a need to parallel our DNS
> resiliency with our service resiliency, in terms of 3rd party
> infrastructure. If our network were to melt, we'll already be getting it
> from our eyeballs.
>
> If we had content of note that was useful to, say, a handful-billion
> people around the world, we'd give some thought - however complex - to
> having critical services running on 3rd party infrastructure.
>
> Mark.
>


-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Facebook post-mortems...

2021-10-05 Thread Joe Maimon




Mark Tinka wrote:


So I'm not worried about DNS stability when split across multiple 
physical entities.


I'm talking about the actual services being hosted on a single network 
that goes bye-bye like what we saw yesterday.


All the DNS resolution means diddly, even if it tells us that DNS is 
not the issue.


Mark.


You could put up a temp page or two. Like, the internet is not down, we 
are just having a bad day. Bear with us for a bit. Go outside and enjoy 
nature for the next few hours.


But more importantly, internal infrastructure domains, containing router 
names, bootstraps, tools, utilities, physical access control, config 
repositories, network documentation, oob-network names (who remembers 
those?), oob-email, oob communications (messenger, conferences, voip), 
etc.


Doesn't even have to be globally registered. An external DNS server in the 
resolver list of all tech laptops, slaving the zone.
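
And verify it actually works before you need it. A quick sketch (example 
zone and addresses only) that compares SOA serials between the internal 
primary and the off-net secondary:

ZONE="infra.example.net"
MASTER="10.0.0.53"        # internal hidden primary (example)
SECONDARY="192.0.2.2"     # external server slaving the zone (example)

m=$(dig +short @"$MASTER" "$ZONE" SOA | awk '{print $3}')
s=$(dig +short @"$SECONDARY" "$ZONE" SOA | awk '{print $3}')
echo "primary serial: $m   secondary serial: $s"
[ "$m" = "$s" ] || echo "WARNING: secondary is stale or unreachable"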


Rapid response requires certain amenities, or, as we can see, you're 
talking about hours just to get started.


Also, the oob-network needs to be used regularly or it will be 
essentially unusable when actually needed, due to bit rot (accumulation 
of unnoticed and unresolved issues) and lack of muscle memory.


It should be standard practice to deploy all new equipment from the 
oob-network servicing it. Install things how you want to be able to 
repair them.


Joe


RE: Facebook post-mortems...

2021-10-05 Thread Kain, Becki (.)
Why would you ever have a card reader on your external-facing network? If that 
was really the case, why couldn't they get in to fix it?


-Original Message-
From: NANOG  On Behalf Of Patrick W. 
Gilmore
Sent: Monday, October 04, 2021 10:53 PM
To: North American Operators' Group 
Subject: Re: Facebook post-mortems...



Update about the October 4th outage

https://engineering.fb.com/2021/10/04/networking-traffic/outage/

--
TTFN,
patrick

> On Oct 4, 2021, at 9:25 PM, Mel Beckman  wrote:
>
> The CF post mortem looks sensible, and a good summary of what we all saw from 
> the outside with BGP routes being withdrawn.
>
> Given the fragility of BGP, this could still end up being a malicious attack.
>
> -mel via cell
>
>> On Oct 4, 2021, at 6:19 PM, Jay Hennigan  wrote:
>>
>> On 10/4/21 17:58, jcur...@istaff.org wrote:
>>> Fairly abstract - Facebook Engineering - 
>>> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D&path=%2Fnotes%2Fnote%2F&_rdr
>>
>> I believe that the above link refers to a previous outage. The duration of 
>> the outage doesn't match today's, the technical explanation doesn't align 
>> very well, and many of the comments reference earlier dates.
>>
>>> Also, Cloudflare’s take on the outage - 
>>> https://blog.cloudflare.com/october-2021-facebook-outage/
>>
>> This appears to indeed reference today's event.
>>
>> --
>> Jay Hennigan - j...@west.net
>> Network Engineering - CCIE #7880
>> 503 897-8550 - WB6RDV


Re: Facebook post-mortems...

2021-10-05 Thread Ryan Landry
Niels, you are correct about my initial tweet, which I updated in later
tweets to clarify with a hat tip to Will Hargrave as thanks for seeking
more detail.

Cheers,
Ryan

On Tue, Oct 5, 2021 at 08:24 Niels Bakker  wrote:

> * telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
> >Facebook stopped announcing the vast majority of their IP space to
> >the DFZ during this.
>
> People keep repeating this but I don't think it's true.
>
> It's probably based on this tweet:
> https://twitter.com/ryan505/status/1445118376339140618
>
> but that's an aggregate adding up prefix counts from many sessions.
> The total number of hosts covered by those announcements didn't vary
> by nearly as much, since to a significant extent it were more specifics
> (/24) of larger prefixes (e.g. /17) that disappeared, while those /17s
> stayed.
>
> (There were no covering prefixes for WhatsApp's NS addresses so those
> were completely unreachable from the DFZ.)
>
>
> -- Niels.
>


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
1. If you have some DNS working, you can point it at a static “we are down 
and we know it” page much sooner.

Good catch, and you’re right that it would have reduced the planetary impact: 
fewer calls to help-desks and fewer device reboots. It would have given 
visibility on what’s happening.

 

It seems that to be really resilient in today’s world, a business needs its NS 
in at least 2 different entities, like amazon.com is doing.

 

Jean

 

From: NANOG  On Behalf Of Matthew 
Kaufman
Sent: October 5, 2021 10:59 AM
To: Mark Tinka 
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

 

 

 

On Tue, Oct 5, 2021 at 5:44 AM Mark Tinka <mark@tinka.africa> wrote:



On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by 
> having NS in separate entities.

Well, doesn't really matter if you can resolve the A//MX records, 
but you can't connect to the network that is hosting the services.

 

Disagree for two reasons:

 

1. If you have some DNS working, you can point it at a static “we are down and 
we know it” page much sooner.

 

2. If you have convinced the entire world to install tracking pixels on their 
web pages that all need your IP address, it is rude to the rest of the world’s 
DNS to not be able to always provide a prompt (and cacheable) response.



Re: Facebook post-mortems...

2021-10-05 Thread Bjørn Mork
Jean St-Laurent via NANOG  writes:

> Let's check how these big companies are spreading their NS's.
>
> $ dig +short facebook.com NS
> d.ns.facebook.com.
> b.ns.facebook.com.
> c.ns.facebook.com.
> a.ns.facebook.com.
>
> $ dig +short google.com NS
> ns1.google.com.
> ns4.google.com.
> ns2.google.com.
> ns3.google.com.
>
> $ dig +short apple.com NS
> a.ns.apple.com.
> b.ns.apple.com.
> c.ns.apple.com.
> d.ns.apple.com.
>
> $ dig +short amazon.com NS
> ns4.p31.dynect.net.
> ns3.p31.dynect.net.
> ns1.p31.dynect.net.
> ns2.p31.dynect.net.
> pdns6.ultradns.co.uk.
> pdns1.ultradns.net.
>
> $ dig +short netflix.com NS
> ns-1372.awsdns-43.org.
> ns-1984.awsdns-56.co.uk.
> ns-659.awsdns-18.net.
> ns-81.awsdns-10.com.

Just to state the obvious: Names are irrelevant. Addresses are not.

These names are just place holders for the glue in the parent zone
anyway.  If you look behind the names you'll find that Apple spread
their servers between two ASes. So they are not as vulnerable as Google
and Facebook.
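
If you want to check that yourself, one quick way is to resolve the NS
names and map each address to its origin AS, e.g. with the public Team
Cymru IP-to-ASN whois service (a rough sketch, not a polished tool):

for ns in a.ns.apple.com ns1.google.com a.ns.facebook.com; do
    for ip in $(dig +short "$ns" A); do
        printf '%s %s -> ' "$ns" "$ip"
        # Team Cymru whois maps an IP to its origin AS; tail skips the header
        whois -h whois.cymru.com " -v $ip" | tail -1
    done
done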


Bjørn


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka



On 10/5/21 16:59, Matthew Kaufman wrote:



Disagree for two reasons:

1. If you have some DNS working, you can point it at a static “we are 
down and we know it” page much sooner.


Isn't that what Twirra is for, nowadays :-)...




2. If you have convinced the entire world to install tracking pixels 
on their web pages that all need your IP address, it is rude to the 
rest of the world’s DNS to not be able to always provide a prompt (and 
cacheable) response.


Agreed, but I know many an exec that signs the capex cheques who may 
find "rude" not a noteworthy discussion point when we submit the budget.


Not saying I think being rude is cool, but there is a reason we are 
here, now, today.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 16:49, Joe Greco wrote:


Unrealistic user expectations are not the point.  Users can demand
whatever unrealistic claptrap they wish to.


The user's expectations, today, are always going to be unrealistic, 
especially when they are able to enjoy a half-decent service free-of-charge.


The bar has moved. Nothing we can do about it but adapt.




The point is that there are a lot of helpdesk staff at a lot of
organizations who are responsible for responding to these issues.
When Facebook or Microsoft or Amazon take a dump, you get a storm
of requests.  This is a storm of requests not just to one helpdesk,
but to MANY helpdesks, across a wide number of organizations, and
this means that you have thousands of people trying to investigate
what has happened.


We are in agreement.

And it's no coincidence that the Facebook's of the world rely almost 
100% on non-human contact to give their users support. So that leaves 
us, infrastructure, in the firing line to pick up the slack for a lack 
of warm-body access to BigContent.




It is very common for large companies to forget (or not care) that
their technical failures impact not just their users, but also
external support organizations.


Not just large companies, but I believe all companies... and worse, not 
at ground level where folk on lists like these tend to keep in touch, 
but higher up where the money decisions are made, and where caring about 
your footprint on other Internet settlers whom you may never meet matters.


You and I can bash our heads till they come home, but if the folk that 
need to say "Yes" to $$$ needed to help external parties troubleshoot 
better don't get it, then perhaps starting a NOG or some such is our 
best bet.




I totally get your disdain and indifference towards end users in these
instances; for the average end user, yes, it indeed makes no difference
if DNS works or not.


On the contrary, I looove customers. I wasn't into them, say, 12 
years ago, but since I began to understand that users will respond to 
empathy and value, I fell in love with them. They drive my entire 
thought-process and decision-making.


This is why I keep saying, "Users don't care about how we build the 
Internet", and they shouldn't. And I support that.


BigContent get it, and for better or worse, they are the ones who've set 
the bar higher than what most network operators are happy with.


Infrastructure still doesn't get it, and we are seeing the effects of 
that play out around the world, with the recent SK Broadband/Netflix 
debacle being the latest barbershop gossip.




However, some of those end users do have a point of contact up the
chain.  This could be their ISP support, or a company helpdesk, and
most of these are tasked with taking an issue like this to some sort
of resolution.  What I'm talking about here is that it is easier to
debug and make a determination that there is an IP connectivity issue
when DNS works.  If DNS isn't working, then you get into a bunch of
stuff where you need to do things like determine if maybe it is some
sort of DNSSEC issue, or other arcane and obscure issues, which tends
to be beyond what front line helpdesk is capable of.


We are in agreement.



These issues often cost companies real time and money to figure out.
It is unlikely that Facebook is going to compensate them for this, so
this brings me back around to the point that it's preferable to have
DNS working when you have a BGP problem, because this is ultimately
easier for people to test and reach a reasonable determination that
the problem is on Facebook's side quickly and easily.


We are in agreement.

So let's see if Facebook can fix the scope of their DNS architecture, 
and whether others can learn from it. I know I have... even though we 
provide friendly secondary for a bunch of folk we are friends with, we 
haven't done the same for our own networks... all our stuff sits on just 
our network - granted in many different countries, but still, one AS.


It's been nagging at the back of my mind for yonks, but yesterday was 
the nudge I needed to get this organized; so off I go.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Matthew Kaufman
On Tue, Oct 5, 2021 at 5:44 AM Mark Tinka  wrote:

>
>
> On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:
>
> > Maybe withdrawing those routes to their NS could have been mitigated by
> having NS in separate entities.
>
> Well, doesn't really matter if you can resolve the A//MX records,
> but you can't connect to the network that is hosting the services.


Disagree for two reasons:

1. If you have some DNS working, you can point it at a static “we are down
and we know it” page much sooner.

2. If you have convinced the entire world to install tracking pixels on
their web pages that all need your IP address, it is rude to the rest of
the world’s DNS to not be able to always provide a prompt (and cacheable)
response.

>


Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 03:40:39PM +0200, Mark Tinka wrote:
> Yes, total nightmare yesterday, but sure that 9,999 of the helpdesk 
> tickets had nothing to do with DNS. They likely all were - "Your 
> Internet is down, just fix it; we don't wanna know".

Unrealistic user expectations are not the point.  Users can demand
whatever unrealistic claptrap they wish to. 

The point is that there are a lot of helpdesk staff at a lot of
organizations who are responsible for responding to these issues.
When Facebook or Microsoft or Amazon take a dump, you get a storm
of requests.  This is a storm of requests not just to one helpdesk,
but to MANY helpdesks, across a wide number of organizations, and
this means that you have thousands of people trying to investigate
what has happened.

It is very common for large companies to forget (or not care) that
their technical failures impact not just their users, but also
external support organizations.

I totally get your disdain and indifference towards end users in these
instances; for the average end user, yes, it indeed makes no difference
if DNS works or not.

However, some of those end users do have a point of contact up the
chain.  This could be their ISP support, or a company helpdesk, and
most of these are tasked with taking an issue like this to some sort
of resolution.  What I'm talking about here is that it is easier to
debug and make a determination that there is an IP connectivity issue
when DNS works.  If DNS isn't working, then you get into a bunch of
stuff where you need to do things like determine if maybe it is some
sort of DNSSEC issue, or other arcane and obscure issues, which tends
to be beyond what front line helpdesk is capable of.

These issues often cost companies real time and money to figure out.
It is unlikely that Facebook is going to compensate them for this, so
this brings me back around to the point that it's preferable to have
DNS working when you have a BGP problem, because this is ultimately
easier for people to test and reach a reasonable determination that
the problem is on Facebook's side quickly and easily.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"The strain of anti-intellectualism has been a constant thread winding its way
through our political and cultural life, nurtured by the false notion that
democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Does anyone have info on whether this network 69.171.240.0/20 was reachable 
during the outage?

 

Jean

 

From: NANOG  On Behalf Of Tom Beecher
Sent: October 5, 2021 10:30 AM
To: NANOG 
Subject: Re: Facebook post-mortems...

 

People keep repeating this but I don't think it's true.

 

My comment is solely sourced on my direct observations on my network, maybe 
30-45 minutes in. 

 

Everything except a few /24s disappeared from DFZ providers, but I still heard 
those prefixes from direct peerings. There was no disaggregation that I saw, 
just the big stuff gone. This was consistent over 5 continents from my 
viewpoints.

 

Others may have seen different things at different times. I do not run an 
eyeball so I had no need to continually monitor.  

 

On Tue, Oct 5, 2021 at 10:22 AM Niels Bakker <na...@bakker.net> wrote:

* telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
>Facebook stopped announcing the vast majority of their IP space to 
>the DFZ during this.

People keep repeating this but I don't think it's true.

It's probably based on this tweet: 
https://twitter.com/ryan505/status/1445118376339140618

but that's an aggregate adding up prefix counts from many sessions. 
The total number of hosts covered by those announcements didn't vary 
by nearly as much, since to a significant extent it were more specifics 
(/24) of larger prefixes (e.g. /17) that disappeared, while those /17s 
stayed.

(There were no covering prefixes for WhatsApp's NS addresses so those 
were completely unreachable from the DFZ.)


-- Niels.



Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
>
> People keep repeating this but I don't think it's true.
>

My comment is solely sourced on my direct observations on my network, maybe
30-45 minutes in.

Everything except a few /24s disappeared from DFZ providers, but I still
heard those prefixes from direct peerings. There was no disaggregation that
I saw, just the big stuff gone. This was consistent over 5 continents from
my viewpoints.

Others may have seen different things at different times. I do not run an
eyeball so I had no need to continually monitor.

On Tue, Oct 5, 2021 at 10:22 AM Niels Bakker  wrote:

> * telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
> >Facebook stopped announcing the vast majority of their IP space to
> >the DFZ during this.
>
> People keep repeating this but I don't think it's true.
>
> It's probably based on this tweet:
> https://twitter.com/ryan505/status/1445118376339140618
>
> but that's an aggregate adding up prefix counts from many sessions.
> The total number of hosts covered by those announcements didn't vary
> by nearly as much, since to a significant extent it were more specifics
> (/24) of larger prefixes (e.g. /17) that disappeared, while those /17s
> stayed.
>
> (There were no covering prefixes for WhatsApp's NS addresses so those
> were completely unreachable from the DFZ.)
>
>
> -- Niels.
>


Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker

* telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
Facebook stopped announcing the vast majority of their IP space to 
the DFZ during this.


People keep repeating this but I don't think it's true.

It's probably based on this tweet: 
https://twitter.com/ryan505/status/1445118376339140618


but that's an aggregate adding up prefix counts from many sessions. 
The total number of hosts covered by those announcements didn't vary 
by nearly as much, since to a significant extent it were more specifics 
(/24) of larger prefixes (e.g. /17) that disappeared, while those /17s 
stayed.


(There were no covering prefixes for WhatsApp's NS addresses so those 
were completely unreachable from the DFZ.)



-- Niels.


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
As of now, their MX is hosted on 69.171.251.251

 

Was this network still announced yesterday in the DFZ during the outage? 

69.171.224.0/19 

69.171.240.0/20
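
One way to look this up after the fact is the public RIPEstat Data API 
(a sketch using the routing-history endpoint; adjust the resource as 
needed, and treat its output rather than this mail as the answer):

curl -s "https://stat.ripe.net/data/routing-history/data.json?resource=69.171.240.0/20" |
    python3 -m json.tool | less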

 

Jean

 

From: Jean St-Laurent  
Sent: October 5, 2021 9:50 AM
To: 'Tom Beecher' 
Cc: 'Jeff Tantsura' ; 'William Herrin' 
; 'NANOG' 
Subject: RE: Facebook post-mortems...

 

I agree that resolving a non-routable address doesn’t bring you a working service.

 

I thought a few networks were still reachable like their MX or some DRP 
networks.

 

Thanks for the update

Jean

 

From: Tom Beecher <beec...@beecher.cc>
Sent: October 5, 2021 8:33 AM
To: Jean St-Laurent <j...@ddostest.me>
Cc: Jeff Tantsura <jefftant.i...@gmail.com>; William Herrin <b...@herrin.us>; 
NANOG <nanog@nanog.org>
Subject: Re: Facebook post-mortems...

 

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

 

Assuming they had such a thing in place , it would not have helped. 

 

Facebook stopped announcing the vast majority of their IP space to the DFZ 
during this. So even they did have an offnet DNS server that could have 
provided answers to clients, those same clients probably wouldn't have been 
able to connect to the IPs returned anyways. 

 

If you are running your own auths like they are, you likely view your public 
network reachability as almost bulletproof and that it will never disappear. 
Which is probably true most of the time. Until yesterday happens and the 9's in 
your reliability percentage change to 7's. 

 

On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG <nanog@nanog.org> wrote:

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

Let's check how these big companies are spreading their NS's.

$ dig +short facebook.com NS
d.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
a.ns.facebook.com.

$ dig +short google.com NS
ns1.google.com.
ns4.google.com.
ns2.google.com.
ns3.google.com.

$ dig +short apple.com NS
a.ns.apple.com.
b.ns.apple.com.
c.ns.apple.com.
d.ns.apple.com.

$ dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

$ dig +short netflix.com NS
ns-1372.awsdns-43.org.
ns-1984.awsdns-56.co.uk.
ns-659.awsdns-18.net.
ns-81.awsdns-10.com.

Amazon and Netflix seem not to keep their eggs in the same basket. At 
first look, they seem more resilient than facebook.com, google.com and 
apple.com.

Jean

-Original Message-----
From: NANOG <ddostest...@nanog.org> On Behalf Of Jeff Tantsura
Sent: October 5, 2021 2:18 AM
To: William Herrin <b...@herrin.us>
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes 
covering all 4 nameservers (a-d) were withdrawn from all FB peering at 
approximately 15:40 UTC.

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin <b...@herrin.us> wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas <m...@mtcc.com> wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure 
> operating systems. Chef is written in Ruby. Ruby has something called 
> Monkey Patches. This is where at an arbitrary location in the code you 
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM 
> and it does. Even if it has to remove half the operating system to 
> satisfy the dependencies. If you want it to do something reasonable, 
> say throw an error because you didn't actually tell it to remove half 
> the operating system, you have a choice: spin up a fork of chef with a 
> couple patches to the chef-rpm interaction or just monkey-patch it in 
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/



Re: Facebook post-mortems... - Update!

2021-10-05 Thread Mark Tinka




On 10/5/21 15:40, Mark Tinka wrote:



I don't disagree with you one bit. It's for that exact reason that we 
built:


    https://as37100.net/

... not for us, but specifically for other random network operators 
around the world whom we may never get to drink a crate of wine with.


I have to say that it has likely cut e-mails to our NOC as well as 
overall pain in half, if not more.


What I forgot to add, however, is that unlike Facebook, we aren't a 
major content provider. So we don't have a need to parallel our DNS 
resiliency with our service resiliency, in terms of 3rd party 
infrastructure. If our network were to melt, we'll already be getting it 
from our eyeballs.


If we had content of note that was useful to, say, a handful-billion 
people around the world, we'd give some thought - however complex - to 
having critical services running on 3rd party infrastructure.


Mark.


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
I agree that resolving a non-routable address doesn’t bring you a working service.

 

I thought a few networks were still reachable like their MX or some DRP 
networks.

 

Thanks for the update

Jean

 

From: Tom Beecher  
Sent: October 5, 2021 8:33 AM
To: Jean St-Laurent 
Cc: Jeff Tantsura ; William Herrin ; 
NANOG 
Subject: Re: Facebook post-mortems...

 

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

 

Assuming they had such a thing in place, it would not have helped. 

 

Facebook stopped announcing the vast majority of their IP space to the DFZ 
during this. So even if they did have an offnet DNS server that could have 
provided answers to clients, those same clients probably wouldn't have been 
able to connect to the IPs returned anyways. 

 

If you are running your own auths like they are, you likely view your public 
network reachability as almost bulletproof and that it will never disappear. 
Which is probably true most of the time. Until yesterday happens and the 9's in 
your reliability percentage change to 7's. 

 

On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG <nanog@nanog.org> wrote:

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

Let's check how these big companies are spreading their NS's.

$ dig +short facebook.com NS
d.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
a.ns.facebook.com.

$ dig +short google.com NS
ns1.google.com.
ns4.google.com.
ns2.google.com.
ns3.google.com.

$ dig +short apple.com NS
a.ns.apple.com.
b.ns.apple.com.
c.ns.apple.com.
d.ns.apple.com.

$ dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

$ dig +short netflix.com NS
ns-1372.awsdns-43.org.
ns-1984.awsdns-56.co.uk.
ns-659.awsdns-18.net.
ns-81.awsdns-10.com.

Amazon and Netflix seem to not keep their eggs in the same basket. From a
first look, they seem more resilient than facebook.com, google.com and apple.com.

Jean

-Original Message-
From: NANOG <ddostest...@nanog.org> On Behalf Of Jeff Tantsura
Sent: October 5, 2021 2:18 AM
To: William Herrin <b...@herrin.us>
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes
covering all 4 nameservers (a-d) were withdrawn from all FB peering at
approximately 15:40 UTC.

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin <b...@herrin.us> wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas <m...@mtcc.com> wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure 
> operating systems. Chef is written in Ruby. Ruby has something called 
> Monkey Patches. This is where at an arbitrary location in the code you 
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM 
> and it does. Even if it has to remove half the operating system to 
> satisfy the dependencies. If you want it to do something reasonable, 
> say throw an error because you didn't actually tell it to remove half 
> the operating system, you have a choice: spin up a fork of chef with a 
> couple patches to the chef-rpm interaction or just monkey-patch it in 
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/



Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 15:04, Joe Greco wrote:



You don't think at least 10,000 helpdesk requests about Facebook being
down were sent yesterday?


That and Jane + Thando likely re-installing all their apps and 
iOS/Android on their phones, and rebooting them 300 times in the hopes 
that Facebook and WhatsApp would work.


Yes, total nightmare yesterday, but I'm sure that 9,999 of the helpdesk 
tickets had nothing to do with DNS. They likely all were - "Your 
Internet is down, just fix it; we don't wanna know".




There's something to be said for building these things to be resilient
in a manner that isn't just convenient internally, but also externally
to those people that network operators sometimes forget also support
their network issues indirectly.


I don't disagree with you one bit. It's for that exact reason that we built:

    https://as37100.net/

... not for us, but specifically for other random network operators 
around the world whom we may never get to drink a crate of wine with.


I have to say that it has likely cut e-mails to our NOC as well as 
overall pain in half, if not more.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 14:58, Jean St-Laurent wrote:

If your NS are in 2 separate entities, you could still resolve your 
MX/A/AAAA/NS.

Look how Amazon is doing it.

dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

They use Dyn DNS from Oracle and UltraDNS: two very strong networks of
anycast DNS servers.

Amazon would not have been impacted like Facebook yesterday, unless UltraDNS
and Oracle have their DNS servers hosted in Amazon infra. I doubt that Oracle
has DNS hosted in Amazon, but it's possible.

Probably the management overhead to use 2 different entities for DNS is not 
financially viable?


So I'm not worried about DNS stability when split across multiple 
physical entities.


I'm talking about the actual services being hosted on a single network 
that goes bye-bye like what we saw yesterday.


All the DNS resolution means diddly, even if it tells us that DNS is not 
the issue.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:57:42PM +0200, Mark Tinka wrote:
> 
> 
> On 10/5/21 14:52, Joe Greco wrote:
> 
> >That's not quite true.  It still gives a much better clue as to what is
> >going on; if a host resolves to an IP but isn't pingable/traceroutable,
> >that is something that many more techy people will understand than if
> >the domain is simply unresolvable.  Not everyone has the skill set and
> >knowledge of DNS to understand how to track down what nameservers
> >Facebook is supposed to have, and how to debug names not resolving.
> >There are lots of helpdesk people who are not expert in every topic.
> >
> >Having DNS doesn't magically get you service back, of course, but it
> >leaves a better story behind than simply vanishing from the network.
> 
> That's great for you and me who believe in and like troubleshooting.
> 
> Jane and Thando who just want their Instagram timeline feed couldn't 
> care less about DNS working but network access is down. To them, it's 
> broken, despite your state-of-the-art global DNS architecture.

You don't think at least 10,000 helpdesk requests about Facebook being
down were sent yesterday?

There's something to be said for building these things to be resilient
in a manner that isn't just convenient internally, but also externally
to those people that network operators sometimes forget also support
their network issues indirectly.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"The strain of anti-intellectualism has been a constant thread winding its way
through our political and cultural life, nurtured by the false notion that
democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov


Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta

Carsten Bormann wrote:


While Ruby indeed has a chain-saw (read: powerful, dangerous, still
the tool of choice in certain cases) in its toolkit that is generally
called “monkey-patching”, I think Michael was actually thinking about
the “chaos monkey”, 
https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey 
https://netflix.github.io/chaosmonkey/


That was a Netflix invention, but see also 
https://en.wikipedia.org/wiki/Chaos_engineering#Facebook_Storm


It seems to me that so called chaos engineering assumes cosmic
internet environment, though, in good old days, we were aware
that the Internet is the source of chaos.

Masataka Ohta


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
If your NS are in 2 separate entities, you could still resolve your 
MX/A/AAAA/NS.

Look how Amazon is doing it.

dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

They use Dyn DNS from Oracle and UltraDNS: two very strong networks of
anycast DNS servers.

Amazon would not have been impacted like Facebook yesterday, unless UltraDNS
and Oracle have their DNS servers hosted in Amazon infra. I doubt that Oracle
has DNS hosted in Amazon, but it's possible.

Probably the management overhead to use 2 different entities for DNS is not 
financially viable?

Jean

-Original Message-
From: NANOG  On Behalf Of Mark Tinka
Sent: October 5, 2021 8:22 AM
To: nanog@nanog.org
Subject: Re: Facebook post-mortems...



On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by 
> having NS in separate entities.

Well, doesn't really matter if you can resolve the A/AAAA/MX records, but you 
can't connect to the network that is hosting the services.

At any rate, having 3rd party DNS hosting for your domain is always a good 
thing to have. But in reality, it only hits the spot if the service is also 
available on a 3rd party network, otherwise, you keep DNS up, but get no 
service.

Mark.




Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 14:52, Joe Greco wrote:


That's not quite true.  It still gives a much better clue as to what is
going on; if a host resolves to an IP but isn't pingable/traceroutable,
that is something that many more techy people will understand than if
the domain is simply unresolvable.  Not everyone has the skill set and
knowledge of DNS to understand how to track down what nameservers
Facebook is supposed to have, and how to debug names not resolving.
There are lots of helpdesk people who are not expert in every topic.

Having DNS doesn't magically get you service back, of course, but it
leaves a better story behind than simply vanishing from the network.


That's great for you and me who believe in and like troubleshooting.

Jane and Thando who just want their Instagram timeline feed couldn't 
care less about DNS working but network access is down. To them, it's 
broken, despite your state-of-the-art global DNS architecture.


I'm also yet to find any DNS operator who makes deploying 3rd party 
resiliency to give other random network operators in the wild 
troubleshooting joy their #1 priority for doing so :-).


On the real though, I'm all for as much useful redundancy as we can get 
away with. But given just how much we rely on the web for basic life 
these days, we need to do better about making actual services as 
resilient as we can (and have) the DNS.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Lou D
Facebook stopped announcing the vast majority of their IP space to the DFZ
during this.

This is where I would like to learn more about the outage. Direct peering
FB connections saw a drop in about a dozen networks, and one of those
networks covered their C and D nameservers, but the block for the A and B
nameservers remained advertised yet simply not responsive.
I imagine the dropped blocks could have prevented internal responses, but
from my perspective it is a surprise that all of these issues would stem from that.

On Tue, Oct 5, 2021 at 8:48 AM Tom Beecher  wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by
>> having NS in separate entities.
>>
>
> Assuming they had such a thing in place, it would not have helped.
>
> Facebook stopped announcing the vast majority of their IP space to the DFZ
> during this. So even if they did have an offnet DNS server that could have
> provided answers to clients, those same clients probably wouldn't have been
> able to connect to the IPs returned anyways.
>
> If you are running your own auths like they are, you likely view your
> public network reachability as almost bulletproof and that it will never
> disappear. Which is probably true most of the time. Until yesterday happens
> and the 9's in your reliability percentage change to 7's.
>
> On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG 
> wrote:
>
>> Maybe withdrawing those routes to their NS could have been mitigated by
>> having NS in separate entities.
>>
>> Let's check how these big companies are spreading their NS's.
>>
>> $ dig +short facebook.com NS
>> d.ns.facebook.com.
>> b.ns.facebook.com.
>> c.ns.facebook.com.
>> a.ns.facebook.com.
>>
>> $ dig +short google.com NS
>> ns1.google.com.
>> ns4.google.com.
>> ns2.google.com.
>> ns3.google.com.
>>
>> $ dig +short apple.com NS
>> a.ns.apple.com.
>> b.ns.apple.com.
>> c.ns.apple.com.
>> d.ns.apple.com.
>>
>> $ dig +short amazon.com NS
>> ns4.p31.dynect.net.
>> ns3.p31.dynect.net.
>> ns1.p31.dynect.net.
>> ns2.p31.dynect.net.
>> pdns6.ultradns.co.uk.
>> pdns1.ultradns.net.
>>
>> $ dig +short netflix.com NS
>> ns-1372.awsdns-43.org.
>> ns-1984.awsdns-56.co.uk.
>> ns-659.awsdns-18.net.
>> ns-81.awsdns-10.com.
>>
>> Amazon and Netflix seem to not keep their eggs in the same basket. From
>> a first look, they seem more resilient than facebook.com, google.com and
>> apple.com
>>
>> Jean
>>
>> -Original Message-
>> From: NANOG  On Behalf Of Jeff
>> Tantsura
>> Sent: October 5, 2021 2:18 AM
>> To: William Herrin 
>> Cc: nanog@nanog.org
>> Subject: Re: Facebook post-mortems...
>>
>> 129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes
>> covering all 4 nameservers (a-d) were withdrawn from all FB peering at
>> approximately 15:40 UTC.
>>
>> Cheers,
>> Jeff
>>
>> > On Oct 4, 2021, at 22:45, William Herrin  wrote:
>> >
>> > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> >> They have a monkey patch subsystem. Lol.
>> >
>> > Yes, actually, they do. They use Chef extensively to configure
>> > operating systems. Chef is written in Ruby. Ruby has something called
>> > Monkey Patches. This is where at an arbitrary location in the code you
>> > re-open an object defined elsewhere and change its methods.
>> >
>> > Chef doesn't always do the right thing. You tell Chef to remove an RPM
>> > and it does. Even if it has to remove half the operating system to
>> > satisfy the dependencies. If you want it to do something reasonable,
>> > say throw an error because you didn't actually tell it to remove half
>> > the operating system, you have a choice: spin up a fork of chef with a
>> > couple patches to the chef-rpm interaction or just monkey-patch it in
>> > one of your chef recipes.
>> >
>> > Regards,
>> > Bill Herrin
>> >
>> > --
>> > William Herrin
>> > b...@herrin.us
>> > https://bill.herrin.us/
>>
>>


Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher

On 05/10/2021 13:17, Hauke Lampe wrote:

On 05.10.21 07:22, Hank Nussbacher wrote:


Thanks for the posting.  How come they couldn't access their routers via
their OOB access?


My speculative guess would be that OOB access to a few outbound-facing
routers per DC does not help much if a configuration error withdraws the
infrastructure prefixes down to the rack level while dedicated OOB to
each RSW would be prohibitive.

https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf



Thanks for sharing that article.  But OOB access involves exactly that - 
Out Of Band - meaning one doesn't depend on any infrastructure prefixes 
or DFZ announced prefixes.  OOB access is usually via a local ADSL or 
wireless modem connected to the BFR.  The article does not discuss OOB 
at all.


Regards,
Hank


Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:22:09PM +0200, Mark Tinka wrote:
> 
> 
> On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:
> 
> >Maybe withdrawing those routes to their NS could have been mitigated by 
> >having NS in separate entities.
> 
> Well, doesn't really matter if you can resolve the A/AAAA/MX records, 
> but you can't connect to the network that is hosting the services.
> 
> At any rate, having 3rd party DNS hosting for your domain is always a 
> good thing to have. But in reality, it only hits the spot if the service 
> is also available on a 3rd party network, otherwise, you keep DNS up, 
> but get no service.

That's not quite true.  It still gives a much better clue as to what is
going on; if a host resolves to an IP but isn't pingable/traceroutable,
that is something that many more techy people will understand than if
the domain is simply unresolvable.  Not everyone has the skill set and
knowledge of DNS to understand how to track down what nameservers 
Facebook is supposed to have, and how to debug names not resolving.
There are lots of helpdesk people who are not expert in every topic.

Having DNS doesn't magically get you service back, of course, but it
leaves a better story behind than simply vanishing from the network.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"The strain of anti-intellectualism has been a constant thread winding its way
through our political and cultural life, nurtured by the false notion that
democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov
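
A minimal Ruby sketch of that distinction (stdlib only; the host and the
tcp/443 probe are illustrative choices, not a statement about how Facebook
should be tested): it reports separately whether a name resolves and whether
the resolved address answers at all, which is roughly the difference a
helpdesk can observe from the outside.

# Separate "does it resolve?" from "is it reachable?".
require 'resolv'
require 'socket'

host = 'www.facebook.com'   # illustrative target
begin
  addr = Resolv.getaddress(host)
  puts "#{host} resolves to #{addr}"
  begin
    # Plain TCP connect to port 443 with a short timeout.
    Socket.tcp(addr, 443, connect_timeout: 3) { puts "#{addr} answers on tcp/443" }
  rescue SystemCallError => e
    puts "#{addr} resolved, but is not reachable: #{e.class}"
  end
rescue Resolv::ResolvError
  puts "#{host} does not resolve at all"
end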


Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
>
> My speculative guess would be that OOB access to a few outbound-facing
> routers per DC does not help much if a configuration error withdraws the
> infrastructure prefixes down to the rack level while dedicated OOB to
> each RSW would be prohibitive.
>

If your OOB has any dependence on the inband side, it's not OOB.

It's not complicated to have a completely independent OOB infra , even at
scale.

On Tue, Oct 5, 2021 at 8:40 AM Hauke Lampe  wrote:

> On 05.10.21 07:22, Hank Nussbacher wrote:
>
> > Thanks for the posting.  How come they couldn't access their routers via
> > their OOB access?
>
> My speculative guess would be that OOB access to a few outbound-facing
> routers per DC does not help much if a configuration error withdraws the
> infrastructure prefixes down to the rack level while dedicated OOB to
> each RSW would be prohibitive.
>
>
> https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf
>


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 08:55, a...@nethead.de wrote:



Rumour is that when the FB route prefixes had been withdrawn their 
door authentication system stopped working and they could not get back 
into the building or server room :)


Assuming there is any truth to that, guess we can't cancel the hard 
lines yet :-).


#EverythingoIP

Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
>
> Maybe withdrawing those routes to their NS could have been mitigated by
> having NS in separate entities.
>

Assuming they had such a thing in place, it would not have helped.

Facebook stopped announcing the vast majority of their IP space to the DFZ
during this. So even if they did have an offnet DNS server that could have
provided answers to clients, those same clients probably wouldn't have been
able to connect to the IPs returned anyways.

If you are running your own auths like they are, you likely view your
public network reachability as almost bulletproof and that it will never
disappear. Which is probably true most of the time. Until yesterday happens
and the 9's in your reliability percentage change to 7's.

On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG 
wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by
> having NS in separate entities.
>
> Let's check how these big companies are spreading their NS's.
>
> $ dig +short facebook.com NS
> d.ns.facebook.com.
> b.ns.facebook.com.
> c.ns.facebook.com.
> a.ns.facebook.com.
>
> $ dig +short google.com NS
> ns1.google.com.
> ns4.google.com.
> ns2.google.com.
> ns3.google.com.
>
> $ dig +short apple.com NS
> a.ns.apple.com.
> b.ns.apple.com.
> c.ns.apple.com.
> d.ns.apple.com.
>
> $ dig +short amazon.com NS
> ns4.p31.dynect.net.
> ns3.p31.dynect.net.
> ns1.p31.dynect.net.
> ns2.p31.dynect.net.
> pdns6.ultradns.co.uk.
> pdns1.ultradns.net.
>
> $ dig +short netflix.com NS
> ns-1372.awsdns-43.org.
> ns-1984.awsdns-56.co.uk.
> ns-659.awsdns-18.net.
> ns-81.awsdns-10.com.
>
> Amazon and Netflix seem to not keep their eggs in the same basket. From a
> first look, they seem more resilient than facebook.com, google.com and
> apple.com
>
> Jean
>
> -Original Message-
> From: NANOG  On Behalf Of Jeff
> Tantsura
> Sent: October 5, 2021 2:18 AM
> To: William Herrin 
> Cc: nanog@nanog.org
> Subject: Re: Facebook post-mortems...
>
> 129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes
> covering all 4 nameservers (a-d) were withdrawn from all FB peering at
> approximately 15:40 UTC.
>
> Cheers,
> Jeff
>
> > On Oct 4, 2021, at 22:45, William Herrin  wrote:
> >
> > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
> >> They have a monkey patch subsystem. Lol.
> >
> > Yes, actually, they do. They use Chef extensively to configure
> > operating systems. Chef is written in Ruby. Ruby has something called
> > Monkey Patches. This is where at an arbitrary location in the code you
> > re-open an object defined elsewhere and change its methods.
> >
> > Chef doesn't always do the right thing. You tell Chef to remove an RPM
> > and it does. Even if it has to remove half the operating system to
> > satisfy the dependencies. If you want it to do something reasonable,
> > say throw an error because you didn't actually tell it to remove half
> > the operating system, you have a choice: spin up a fork of chef with a
> > couple patches to the chef-rpm interaction or just monkey-patch it in
> > one of your chef recipes.
> >
> > Regards,
> > Bill Herrin
> >
> > --
> > William Herrin
> > b...@herrin.us
> > https://bill.herrin.us/
>
>


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:


Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.


Well, doesn't really matter if you can resolve the A/AAAA/MX records, 
but you can't connect to the network that is hosting the services.


At any rate, having 3rd party DNS hosting for your domain is always a 
good thing to have. But in reality, it only hits the spot if the service 
is also available on a 3rd party network, otherwise, you keep DNS up, 
but get no service.


Mark.



Re: Facebook post-mortems...

2021-10-05 Thread Hauke Lampe
On 05.10.21 07:22, Hank Nussbacher wrote:

> Thanks for the posting.  How come they couldn't access their routers via
> their OOB access?

My speculative guess would be that OOB access to a few outbound-facing
routers per DC does not help much if a configuration error withdraws the
infrastructure prefixes down to the rack level while dedicated OOB to
each RSW would be prohibitive.

https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf


Re: Facebook post-mortems...

2021-10-05 Thread av

On 10/5/21 1:22 PM, Hank Nussbacher wrote:
Thanks for the posting.  How come they couldn't access their routers via 
their OOB access?


Rumour is that when the FB route prefixes had been withdrawn their door 
authentication system stopped working and they could not get back into 
the building or server room :)





Re: Facebook post-mortems...

2021-10-05 Thread Callahan Warlick
I think that was from an outage in 2010:
https://engineering.fb.com/2010/09/23/uncategorized/more-details-on-today-s-outage/



On Mon, Oct 4, 2021 at 6:19 PM Jay Hennigan  wrote:

> On 10/4/21 17:58, jcur...@istaff.org wrote:
> > Fairly abstract - Facebook Engineering -
> >
> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
> >  "note_id":10158791436142200}=/notes/note/&_rdr>
>
> I believe that the above link refers to a previous outage. The duration
> of the outage doesn't match today's, the technical explanation doesn't
> align very well, and many of the comments reference earlier dates.
>
> > Also, Cloudflare’s take on the outage -
> > https://blog.cloudflare.com/october-2021-facebook-outage/
> > 
>
> This appears to indeed reference today's event.
>
> --
> Jay Hennigan - j...@west.net
> Network Engineering - CCIE #7880
> 503 897-8550 - WB6RDV
>


Re: Facebook post-mortems...

2021-10-05 Thread Justin Keller
Per other comments, the linked Facebook outage was from around 5/15/21

On Mon, Oct 4, 2021 at 9:08 PM Rubens Kuhl  wrote:
>
> The FB one seems to be from a previous event. Downtime doesn't match,
> visible flaw effects don't either.
>
>
> Rubens
>
>
> On Mon, Oct 4, 2021 at 9:59 PM  wrote:
> >
> > Fairly abstract - Facebook Engineering - 
> > https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
> >
> > Also, Cloudflare’s take on the outage - 
> > https://blog.cloudflare.com/october-2021-facebook-outage/
> >
> > FYI,
> > /John
> >


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

Let's check how these big companies are spreading their NS's.

$ dig +short facebook.com NS
d.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
a.ns.facebook.com.

$ dig +short google.com NS
ns1.google.com.
ns4.google.com.
ns2.google.com.
ns3.google.com.

$ dig +short apple.com NS
a.ns.apple.com.
b.ns.apple.com.
c.ns.apple.com.
d.ns.apple.com.

$ dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

$ dig +short netflix.com NS
ns-1372.awsdns-43.org.
ns-1984.awsdns-56.co.uk.
ns-659.awsdns-18.net.
ns-81.awsdns-10.com.

Amazon and Netflix seem to not keep their eggs in the same basket. From a 
first look, they seem more resilient than facebook.com, google.com and apple.com

Jean
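
A minimal Ruby sketch of the same check (stdlib Resolv only; grouping
nameservers by their last two labels as a rough stand-in for "provider" is an
assumption, and the domain list is just the one above):

# List NS records per zone and count distinct "provider" domains.
require 'resolv'

%w[facebook.com google.com apple.com amazon.com netflix.com].each do |zone|
  ns = Resolv::DNS.open do |dns|
    dns.getresources(zone, Resolv::DNS::Resource::IN::NS).map { |r| r.name.to_s }
  end
  providers = ns.map { |name| name.split('.').last(2).join('.') }.uniq
  puts "#{zone}: #{ns.size} NS across #{providers.size} provider domain(s): #{providers.join(', ')}"
end

A zone whose nameservers all fall under one provider domain has its eggs in
one basket, at least at the delegation level.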

-Original Message-
From: NANOG  On Behalf Of Jeff 
Tantsura
Sent: October 5, 2021 2:18 AM
To: William Herrin 
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering 
all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 
15:40 UTC.

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure 
> operating systems. Chef is written in Ruby. Ruby has something called 
> Monkey Patches. This is where at an arbitrary location in the code you 
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM 
> and it does. Even if it has to remove half the operating system to 
> satisfy the dependencies. If you want it to do something reasonable, 
> say throw an error because you didn't actually tell it to remove half 
> the operating system, you have a choice: spin up a fork of chef with a 
> couple patches to the chef-rpm interaction or just monkey-patch it in 
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/



Re: Facebook post-mortems...

2021-10-05 Thread Carsten Bormann
On 5. Oct 2021, at 07:42, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure
> operating systems. Chef is written in Ruby. Ruby has something called
> Monkey Patches. 

While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of 
choice in certain cases) in its toolkit that is generally called 
“monkey-patching”, I think Michael was actually thinking about the “chaos 
monkey”, 
https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
https://netflix.github.io/chaosmonkey/

That was a Netflix invention, but see also
https://en.wikipedia.org/wiki/Chaos_engineering#Facebook_Storm

Grüße, Carsten



Re: Facebook post-mortems...

2021-10-05 Thread Jeff Tantsura
129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering 
all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 
15:40 UTC.

Cheers,
Jeff
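
For anyone mapping addresses onto those prefixes, a small Ruby sketch (stdlib
IPAddr; the address used is just an illustrative one inside that range, not
asserted to be a real Facebook nameserver):

# Check which of the withdrawn prefixes cover a given address.
require 'ipaddr'

prefixes = %w[129.134.30.0/23 129.134.30.0/24 129.134.31.0/24]
addr = IPAddr.new('129.134.30.12')   # illustrative address only

prefixes.each do |p|
  verdict = IPAddr.new(p).include?(addr) ? 'covers' : 'does not cover'
  puts "#{p} #{verdict} #{addr}"
end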

> On Oct 4, 2021, at 22:45, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure
> operating systems. Chef is written in Ruby. Ruby has something called
> Monkey Patches. This is where at an arbitrary location in the code you
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM
> and it does. Even if it has to remove half the operating system to
> satisfy the dependencies. If you want it to do something reasonable,
> say throw an error because you didn't actually tell it to remove half
> the operating system, you have a choice: spin up a fork of chef with a
> couple patches to the chef-rpm interaction or just monkey-patch it in
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> -- 
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/


Re: Facebook post-mortems...

2021-10-04 Thread William Herrin
On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
> They have a monkey patch subsystem. Lol.

Yes, actually, they do. They use Chef extensively to configure
operating systems. Chef is written in Ruby. Ruby has something called
Monkey Patches. This is where at an arbitrary location in the code you
re-open an object defined elsewhere and change its methods.

Chef doesn't always do the right thing. You tell Chef to remove an RPM
and it does. Even if it has to remove half the operating system to
satisfy the dependencies. If you want it to do something reasonable,
say throw an error because you didn't actually tell it to remove half
the operating system, you have a choice: spin up a fork of chef with a
couple patches to the chef-rpm interaction or just monkey-patch it in
one of your chef recipes.

Regards,
Bill Herrin

-- 
William Herrin
b...@herrin.us
https://bill.herrin.us/
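
A generic sketch of what Ruby monkey-patching looks like (the Package class
here is made up for illustration; it is not Chef's actual API or Facebook's
code): re-open a class defined elsewhere and replace one of its methods.

# Pretend this class comes from a library you don't control.
class Package
  def remove
    "removing package and whatever depends on it"
  end
end

# Later, in your own code, re-open the same class and override the method.
class Package
  def remove
    raise "refusing to remove: dependency impact not confirmed"
  end
end

begin
  Package.new.remove
rescue RuntimeError => e
  puts e.message   # => refusing to remove: dependency impact not confirmed
end

The same mechanism is what makes it possible to change a library's behaviour
from inside a recipe without maintaining a fork.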


Re: Facebook post-mortems...

2021-10-04 Thread Hank Nussbacher

On 05/10/2021 05:53, Patrick W. Gilmore wrote:

Update about the October 4th outage

https://engineering.fb.com/2021/10/04/networking-traffic/outage/



Thanks for the posting.  How come they couldn't access their routers via 
their OOB access?


-Hank


Re: Facebook post-mortems...

2021-10-04 Thread Patrick W. Gilmore
Update about the October 4th outage

https://engineering.fb.com/2021/10/04/networking-traffic/outage/

-- 
TTFN,
patrick

> On Oct 4, 2021, at 9:25 PM, Mel Beckman  wrote:
> 
> The CF post mortem looks sensible, and a good summary of what we all saw from 
> the outside with BGP routes being withdrawn. 
> 
> Given the fragility of BGP, this could still end up being a malicious attack. 
> 
> -mel via cell
> 
>> On Oct 4, 2021, at 6:19 PM, Jay Hennigan  wrote:
>> 
>> On 10/4/21 17:58, jcur...@istaff.org wrote:
>>> Fairly abstract - Facebook Engineering - 
>>> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
>>>  
>>> 
>> 
>> I believe that the above link refers to a previous outage. The duration of 
>> the outage doesn't match today's, the technical explanation doesn't align 
>> very well, and many of the comments reference earlier dates.
>> 
>>> Also, Cloudflare’s take on the outage - 
>>> https://blog.cloudflare.com/october-2021-facebook-outage/ 
>>> 
>> 
>> This appears to indeed reference today's event.
>> 
>> -- 
>> Jay Hennigan - j...@west.net
>> Network Engineering - CCIE #7880
>> 503 897-8550 - WB6RDV



Re: Facebook post-mortems...

2021-10-04 Thread Mel Beckman
The CF post mortem looks sensible, and a good summary of what we all saw from 
the outside with BGP routes being withdrawn. 

Given the fragility of BGP, this could still end up being a malicious attack. 

-mel via cell

> On Oct 4, 2021, at 6:19 PM, Jay Hennigan  wrote:
> 
> On 10/4/21 17:58, jcur...@istaff.org wrote:
>> Fairly abstract - Facebook Engineering - 
>> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
>>  
>> 
> 
> I believe that the above link refers to a previous outage. The duration of 
> the outage doesn't match today's, the technical explanation doesn't align 
> very well, and many of the comments reference earlier dates.
> 
>> Also, Cloudflare’s take on the outage - 
>> https://blog.cloudflare.com/october-2021-facebook-outage/ 
>> 
> 
> This appears to indeed reference today's event.
> 
> -- 
> Jay Hennigan - j...@west.net
> Network Engineering - CCIE #7880
> 503 897-8550 - WB6RDV


Re: update - Re: Facebook post-mortems...

2021-10-04 Thread Rabbi Rob Thomas

>> Fairly abstract - Facebook Engineering -
>> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
>> 
> 
> My bad - might be best to ignore the above post as it is a
> unconfirmed/undated post-mortem that may reference a different event. 

If I'm reading the source correctly, the timestamp inside is for 09 SEP
2021 14:22:49 GMT (Unix time 1631197369).  Then again, I may not be
reading it correctly.  :)
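
For anyone double-checking that conversion, a Ruby one-liner agrees:

# Convert the embedded Unix timestamp to UTC.
puts Time.at(1631197369).utc   # => 2021-09-09 14:22:49 UTC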


-- 
Rabbi Rob Thomas   Team Cymru
   "It is easy to believe in freedom of speech for those with whom we
agree." - Leo McKern



OpenPGP_signature
Description: OpenPGP digital signature


Re: update - Re: Facebook post-mortems...

2021-10-04 Thread Michael Thomas


On 10/4/21 6:07 PM, jcur...@istaff.org wrote:
On 4 Oct 2021, at 8:58 PM, jcur...@istaff.org wrote:


Fairly abstract - Facebook Engineering - 
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr 



My bad - might be best to ignore the above post as it is an 
unconfirmed/undated post-mortem that may reference a different event.


One of the replies says it's from February, so yeah.

Mike




Re: Facebook post-mortems...

2021-10-04 Thread Jay Hennigan

On 10/4/21 17:58, jcur...@istaff.org wrote:
Fairly abstract - Facebook Engineering - 
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr 



I believe that the above link refers to a previous outage. The duration 
of the outage doesn't match today's, the technical explanation doesn't 
align very well, and many of the comments reference earlier dates.


Also, Cloudflare’s take on the outage - 
https://blog.cloudflare.com/october-2021-facebook-outage/ 



This appears to indeed reference today's event.

--
Jay Hennigan - j...@west.net
Network Engineering - CCIE #7880
503 897-8550 - WB6RDV


Re: Facebook post-mortems...

2021-10-04 Thread Michael Thomas


On 10/4/21 5:58 PM, jcur...@istaff.org wrote:
Fairly abstract - Facebook Engineering - 
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr 



Also, Cloudflare’s take on the outage - 
https://blog.cloudflare.com/october-2021-facebook-outage/ 





They have a monkey patch subsystem. Lol.

Mike



update - Re: Facebook post-mortems...

2021-10-04 Thread jcurran
On 4 Oct 2021, at 8:58 PM, jcur...@istaff.org wrote:
> 
> Fairly abstract - Facebook Engineering - 
> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
>  
> 
My bad - might be best to ignore the above post as it is an unconfirmed/undated 
post-mortem that may reference a different event. 

> Also, Cloudflare’s take on the outage - 
> https://blog.cloudflare.com/october-2021-facebook-outage/ 
> 
The Cloudflare writeup looks quite solid (loss of network -> no DNS servers -> 
major issue) 
/John



Re: Facebook post-mortems...

2021-10-04 Thread Rubens Kuhl
The FB one seems to be from a previous event. Downtime doesn't match,
visible flaw effects don't either.


Rubens


On Mon, Oct 4, 2021 at 9:59 PM  wrote:
>
> Fairly abstract - Facebook Engineering - 
> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
>
> Also, Cloudflare’s take on the outage - 
> https://blog.cloudflare.com/october-2021-facebook-outage/
>
> FYI,
> /John
>


Facebook post-mortems...

2021-10-04 Thread jcurran
Fairly abstract - Facebook Engineering - 
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
 


Also, Cloudflare’s take on the outage - 
https://blog.cloudflare.com/october-2021-facebook-outage/ 


FYI,
/John