Re: CloudFlare Issues?

2020-07-19 Thread vidister via NANOG
All three resolvers of the big German hoster Hetzner went offline together with 
1.1.1.1 and were down for another hour after Cloudflare was back up.
I hope it was a coincidence and they are not forwarding their requests to 
1.1.1.1.

https://www.hetzner-status.de/

- vidister

> On 18. Jul 2020, at 02:46, John Von Essen  wrote:
> 
> Did anyone see any collateral damage from this outside of Cloudflare? 
> Specifically Azure?
> 
> I manage a very large site in Azure, and at the exact same time of the 
> Cloudflare incident we saw a spike in traffic (like a DDoS or Bot), then 
> followed by unusual hardware resource anomalies. We’re globally spread in 
> Azure, but we only saw this in the US and Brazil.
> 
> Very coincidental, but possible.
> 
> 
> -John
> 
>> On Jul 17, 2020, at 5:33 PM, Aaron C. de Bruyn via NANOG  
>> wrote:
>> 
>> More digging shows high latency to CloudFlare DNS servers from Comcast in 
>> Washington and Oregon as well as a few other providers (Charter, ToledoTel), 
>> etc...
>> 
>> Sites that do resolve using other DNS servers but are hosted on CloudFlare 
>> aren't loading.
>> Sites that use CloudFlare for their DNS aren't resolving either.
>> traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
>> 
>> 1  _gateway (192.168.42.254)  0.185 ms  0.109 ms  0.117 ms
>> 2  pppoe-gw-208-70-52.toledotel.com (208.70.52.1)  1.896 ms  1.881 ms  1.903 
>> ms
>> 3  tuk-edge-13.inet.qwest.net (198.233.244.225)  4.158 ms  4.082 ms  4.071 ms
>> 4  sea-brdr-03.inet.qwest.net (67.14.41.154)  8.976 ms  8.949 ms  8.903 ms
>> 5  * * *
>> 6  ae-1-51.ear2.Seattle1.Level3.net (4.69.203.173)  4.494 ms  4.350 ms  
>> 4.311 ms
>> 7  4.53.154.10 (4.53.154.10)  77.622 ms  103.323 ms  103.240 ms
>> 8  * * *
>> 9  * * *
>> 10  * * *
>> 11  * * *
>> 12  * * *
>> 13  one.one.one.one (1.1.1.1)  87.515 ms * *
>> 
>> -A
>> 
>> On Fri, Jul 17, 2020 at 2:18 PM Aaron C. de Bruyn  wrote:
>> Anyone seeing Cloudflare DNS outages or site issues?
>> 
>> Affecting a bunch of sites in Washington and Oregon.
>> 
>> -A
> 



Re: CloudFlare Issues?

2020-07-19 Thread Rafael Possamai
Noticed high latency on some smokeping instances from about 16:10 until 16:35 
(central time). One of the worst variances was from ~20ms to upwards of 100ms 
RTT.

Re: CloudFlare Issues?

2020-07-19 Thread Brendan Carlson
We're peered with them and are having issues resolving some domains via
Cloudflare right now.

On Fri, Jul 17, 2020 at 2:44 PM Aaron C. de Bruyn via NANOG 
wrote:

> More digging shows high latency to CloudFlare DNS servers from Comcast in
> Washington and Oregon as well as a few other providers (Charter,
> ToledoTel), etc...
>
> Sites that do resolve using other DNS servers but are hosted on CloudFlare
> aren't loading.
> Sites that use CloudFlare for their DNS aren't resolving either.
> traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
>
>  1  _gateway (192.168.42.254)  0.185 ms  0.109 ms  0.117 ms
>  2  pppoe-gw-208-70-52.toledotel.com (208.70.52.1)  1.896 ms  1.881 ms
>  1.903 ms
>  3  tuk-edge-13.inet.qwest.net (198.233.244.225)  4.158 ms  4.082 ms
>  4.071 ms
>  4  sea-brdr-03.inet.qwest.net (67.14.41.154)  8.976 ms  8.949 ms  8.903
> ms
>  5  * * *
>  6  ae-1-51.ear2.Seattle1.Level3.net (4.69.203.173)  4.494 ms  4.350 ms
>  4.311 ms
>  7  4.53.154.10 (4.53.154.10)  77.622 ms  103.323 ms  103.240 ms
>  8  * * *
>  9  * * *
> 10  * * *
> 11  * * *
> 12  * * *
> 13  one.one.one.one (1.1.1.1)  87.515 ms * *
>
> -A
>
> On Fri, Jul 17, 2020 at 2:18 PM Aaron C. de Bruyn 
> wrote:
>
>> Anyone seeing Cloudflare DNS outages or site issues?
>>
>> Affecting a bunch of sites in Washington and Oregon.
>>
>> -A
>>
>

-- 


http://www.bcarlsonmedia.com
@brendancarlson 
+1 (626) 921-6503


Re: CloudFlare Issues?

2020-07-17 Thread John Von Essen
Did anyone see any collateral damage from this outside of Cloudflare? 
Specifically Azure?

I manage a very large site in Azure, and at the exact same time of the 
Cloudflare incident we saw a spike in traffic (like a DDoS or Bot), then 
followed by unusual hardware resource anomalies. We’re globally spread in 
Azure, but we only saw this in the US and Brazil.

Very coincidental, but possible.


-John

> On Jul 17, 2020, at 5:33 PM, Aaron C. de Bruyn via NANOG  
> wrote:
> 
> More digging shows high latency to CloudFlare DNS servers from Comcast in 
> Washington and Oregon as well as a few other providers (Charter, ToledoTel), 
> etc...
> 
> Sites that do resolve using other DNS servers but are hosted on CloudFlare 
> aren't loading.
> Sites that use CloudFlare for their DNS aren't resolving either.
> traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
> 
>  1  _gateway (192.168.42.254)  0.185 ms  0.109 ms  0.117 ms
>  2  pppoe-gw-208-70-52.toledotel.com 
>  (208.70.52.1)  1.896 ms  1.881 ms  
> 1.903 ms
>  3  tuk-edge-13.inet.qwest.net  
> (198.233.244.225)  4.158 ms  4.082 ms  4.071 ms
>  4  sea-brdr-03.inet.qwest.net  
> (67.14.41.154)  8.976 ms  8.949 ms  8.903 ms
>  5  * * *
>  6  ae-1-51.ear2.Seattle1.Level3.net 
>  (4.69.203.173)  4.494 ms  4.350 ms 
>  4.311 ms
>  7  4.53.154.10 (4.53.154.10)  77.622 ms  103.323 ms  103.240 ms
>  8  * * *
>  9  * * *
> 10  * * *
> 11  * * *
> 12  * * *
> 13  one.one.one.one (1.1.1.1)  87.515 ms * *
> 
> -A
> 
> On Fri, Jul 17, 2020 at 2:18 PM Aaron C. de Bruyn  > wrote:
> Anyone seeing Cloudflare DNS outages or site issues?
> 
> Affecting a bunch of sites in Washington and Oregon.
> 
> -A





Re: CloudFlare Issues?

2020-07-17 Thread Chris Adams
Once upon a time, Peter Kristolaitis  said:
> Cloudflare's status page acknowledged a recursive DNS issue as of a
> few minutes ago.  Lots of reports of problems on the Outages list
> and Reddit.

It was not just recursive - authoritative DNS on Cloudflare servers also
did not respond.
-- 
Chris Adams 
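
Chris's distinction is easy to probe directly: ask Cloudflare's recursive resolver
(1.1.1.1) and one of Cloudflare's authoritative nameservers separately, and time both.
Below is a minimal, illustrative Python sketch using only the standard library; the
authoritative hostname ns3.cloudflare.com is an assumed example (use whatever NS the
zone you care about actually lists), and the hand-rolled query is deliberately crude.

import socket
import struct
import time

def dns_query(server, qname, recursion_desired=True, timeout=3.0):
    """Send one A-record query over UDP; return (rcode, elapsed_seconds)."""
    flags = 0x0100 if recursion_desired else 0x0000          # RD bit only
    header = struct.pack("!HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    question = b"".join(struct.pack("!B", len(label)) + label.encode()
                        for label in qname.split(".")) + b"\x00"
    question += struct.pack("!HH", 1, 1)                      # QTYPE=A, QCLASS=IN
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    start = time.monotonic()
    try:
        sock.sendto(header + question, (server, 53))
        data, _ = sock.recvfrom(512)
        return data[3] & 0x0F, time.monotonic() - start       # RCODE, latency
    except socket.timeout:
        return None, time.monotonic() - start
    finally:
        sock.close()

# Recursive lookup through Cloudflare's public resolver...
print("recursive via 1.1.1.1:", dns_query("1.1.1.1", "www.cloudflare.com"))
# ...versus a direct, non-recursive query to an (assumed) authoritative server.
# Note this gethostbyname() call itself depends on the local resolver working.
auth_ip = socket.gethostbyname("ns3.cloudflare.com")
print("authoritative direct:", dns_query(auth_ip, "www.cloudflare.com",
                                         recursion_desired=False))

If the first query times out while the second answers quickly, only the recursive
service is in trouble; if both fail, the authoritative side (or the path to it) is
down as well, which matches what Chris saw.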


Re: CloudFlare Issues?

2020-07-17 Thread Justin Paine via NANOG
The team is working on it.

_
*Justin Paine*
Head of Trust & Safety
PGP: BBAA 6BCE 3305 7FD6 6452 7115 57B6 0114 DE0B 314D

101 Townsend St., San Francisco, CA 94107



On Fri, Jul 17, 2020 at 2:53 PM  wrote:

> Chris Grundemann wrote on 7/17/2020 2:38 PM:
>
> Looks like there may be something big up (read: down) at CloudFlare, but
> their status page is not reporting anything yet.
>
> Am I crazy? Or just time to give up on the internet for this week?
>
> --
> @ChrisGrundemann
> http://chrisgrundemann.com
>
> Status page just updated: Edge network and resolver issues.
>
> We had noticed something was up on our network as well w/ IPv6 name
> resolution timing out for some sites.
>
>
>


CloudFlare Issues?

2020-07-17 Thread Aaron C. de Bruyn via NANOG
Anyone seeing Cloudflare DNS outages or site issues?

Affecting a bunch of sites in Washington and Oregon.

-A


Re: CloudFlare Issues?

2020-07-17 Thread Coy Hile



> On Jul 17, 2020, at 5:38 PM, Chris Grundemann  wrote:
> 
> Looks like there may be something big up (read: down) at CloudFlare, but 
> their status page is not reporting anything yet.
> 
> Am I crazy? Or just time to give up on the internet for this week?
> 
> 

You’re not crazy. I’m seeing the same behavior (still unable to get back into a 
few Discord servers) from Comcast in the Philly area. 

--
Coy Hile
coy.h...@coyhile.com






Re: CloudFlare Issues?

2020-07-17 Thread Aaron C. de Bruyn via NANOG
CloudFlare updated their status page and confirmed the issue:

https://www.cloudflarestatus.com/

-A

On Fri, Jul 17, 2020 at 2:33 PM Aaron C. de Bruyn 
wrote:

> More digging shows high latency to CloudFlare DNS servers from Comcast in
> Washington and Oregon as well as a few other providers (Charter,
> ToledoTel), etc...
>
> Sites that do resolve using other DNS servers but are hosted on CloudFlare
> aren't loading.
> Sites that use CloudFlare for their DNS aren't resolving either.
> traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
>
>  1  _gateway (192.168.42.254)  0.185 ms  0.109 ms  0.117 ms
>  2  pppoe-gw-208-70-52.toledotel.com (208.70.52.1)  1.896 ms  1.881 ms
>  1.903 ms
>  3  tuk-edge-13.inet.qwest.net (198.233.244.225)  4.158 ms  4.082 ms
>  4.071 ms
>  4  sea-brdr-03.inet.qwest.net (67.14.41.154)  8.976 ms  8.949 ms  8.903
> ms
>  5  * * *
>  6  ae-1-51.ear2.Seattle1.Level3.net (4.69.203.173)  4.494 ms  4.350 ms
>  4.311 ms
>  7  4.53.154.10 (4.53.154.10)  77.622 ms  103.323 ms  103.240 ms
>  8  * * *
>  9  * * *
> 10  * * *
> 11  * * *
> 12  * * *
> 13  one.one.one.one (1.1.1.1)  87.515 ms * *
>
> -A
>
> On Fri, Jul 17, 2020 at 2:18 PM Aaron C. de Bruyn 
> wrote:
>
>> Anyone seeing Cloudflare DNS outages or site issues?
>>
>> Affecting a bunch of sites in Washington and Oregon.
>>
>> -A
>>
>


Re: CloudFlare Issues?

2020-07-17 Thread Dave Phelps
From cloudflarestatus.com:

Cloudflare Network and Resolver Issues

*Investigating* - Cloudflare is investigating issues with Cloudflare
Resolver and our edge network in certain locations.

Customers using Cloudflare services in certain regions are impacted as
requests might fail and/or errors may be displayed.

Data Centers impacted include: SJC, DFW, SEA, LAX, ORD, IAD, EWR, ATL, LHR,
AMS, FRA, CDG

On Fri, Jul 17, 2020 at 4:44 PM Dave Phelps  wrote:

> Cloudlflare's status page shows they are investigating an issue. Discord's
> status page also shows Cloudflare has an issue. Most people aren't making
> the Cloudflare connection yet and reporting many other services down
> instead.
>
> On Fri, Jul 17, 2020 at 4:40 PM Chris Grundemann 
> wrote:
>
>> Looks like there may be something big up (read: down) at CloudFlare, but
>> their status page is not reporting anything yet.
>>
>> Am I crazy? Or just time to give up on the internet for this week?
>>
>> --
>> @ChrisGrundemann
>> http://chrisgrundemann.com
>>
>


Re: CloudFlare Issues?

2020-07-17 Thread Dave Phelps
Cloudlflare's status page shows they are investigating an issue. Discord's
status page also shows Cloudflare has an issue. Most people aren't making
the Cloudflare connection yet and reporting many other services down
instead.

On Fri, Jul 17, 2020 at 4:40 PM Chris Grundemann 
wrote:

> Looks like there may be something big up (read: down) at CloudFlare, but
> their status page is not reporting anything yet.
>
> Am I crazy? Or just time to give up on the internet for this week?
>
> --
> @ChrisGrundemann
> http://chrisgrundemann.com
>


Re: CloudFlare Issues?

2020-07-17 Thread Peter Kristolaitis
Cloudflare's status page acknowledged a recursive DNS issue as of a few 
minutes ago.  Lots of reports of problems on the Outages list and Reddit.


From their status page:

*Investigating* - Cloudflare is investigating issues with Cloudflare 
Resolver and our edge network in certain locations.


Customers using Cloudflare services in certain regions are impacted as 
requests might fail and/or errors may be displayed.


Data Centers impacted include: SJC, DFW, SEA, LAX, ORD, IAD, EWR, ATL, 
LHR, AMS, FRA, CDG

Jul 17, 21:37 UTC


On 2020-07-17 5:38 p.m., Chris Grundemann wrote:
Looks like there may be something big up (read: down) at CloudFlare, 
but their status page is not reporting anything yet.


Am I crazy? Or just time to give up on the internet for this week?

--
@ChrisGrundemann
http://chrisgrundemann.com


RE: CloudFlare Issues?

2020-07-17 Thread Spencer Coplin
My sites appear to be normal. Maybe it’s time for happy hour?

Thank you,
Spencer

From: NANOG  On Behalf Of 
Chris Grundemann
Sent: Friday, July 17, 2020 4:39 PM
To: NANOG list 
Subject: CloudFlare Issues?


Looks like there may be something big up (read: down) at CloudFlare, but their 
status page is not reporting anything yet.

Am I crazy? Or just time to give up on the internet for this week?

--
@ChrisGrundemann
http://chrisgrundemann.com


Re: CloudFlare Issues?

2020-07-17 Thread blakangel

Chris Grundemann wrote on 7/17/2020 2:38 PM:

Looks like there may be something big up (read: down) at CloudFlare, 
but their status page is not reporting anything yet.


Am I crazy? Or just time to give up on the internet for this week?

--
@ChrisGrundemann
http://chrisgrundemann.com

Status page just updated: Edge network and resolver issues.

We had noticed something was up on our network as well w/ IPv6 name 
resolution timing out for some sites.





RE: CloudFlare Issues?

2020-07-17 Thread Kody Vicknair
https://www.cloudflarestatus.com/


From: NANOG  On Behalf Of 
Chris Grundemann
Sent: Friday, July 17, 2020 4:39 PM
To: NANOG list 
Subject: CloudFlare Issues?

Looks like there may be something big up (read: down) at CloudFlare, but their 
status page is not reporting anything yet.

Am I crazy? Or just time to give up on the internet for this week?

--
@ChrisGrundemann
http://chrisgrundemann.com




Re: CloudFlare Issues?

2020-07-17 Thread Rob McEwen
I think they were down for about 30 or so minutes, but came back up 
right about the time you hit the send button

--Rob McEwen

On 7/17/2020 5:38 PM, Chris Grundemann wrote:
Looks like there may be something big up (read: down) at CloudFlare, 
but their status page is not reporting anything yet.


Am I crazy? Or just time to give up on the internet for this week?

--
@ChrisGrundemann
http://chrisgrundemann.com



--
Rob McEwen
invaluement




Re: CloudFlare Issues?

2020-07-17 Thread Aaron C. de Bruyn via NANOG
More digging shows high latency to CloudFlare DNS servers from Comcast in
Washington and Oregon as well as a few other providers (Charter,
ToledoTel), etc...

Sites that do resolve using other DNS servers but are hosted on CloudFlare
aren't loading.
Sites that use CloudFlare for their DNS aren't resolving either.
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets

 1  _gateway (192.168.42.254)  0.185 ms  0.109 ms  0.117 ms
 2  pppoe-gw-208-70-52.toledotel.com (208.70.52.1)  1.896 ms  1.881 ms
 1.903 ms
 3  tuk-edge-13.inet.qwest.net (198.233.244.225)  4.158 ms  4.082 ms  4.071
ms
 4  sea-brdr-03.inet.qwest.net (67.14.41.154)  8.976 ms  8.949 ms  8.903 ms
 5  * * *
 6  ae-1-51.ear2.Seattle1.Level3.net (4.69.203.173)  4.494 ms  4.350 ms
 4.311 ms
 7  4.53.154.10 (4.53.154.10)  77.622 ms  103.323 ms  103.240 ms
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  one.one.one.one (1.1.1.1)  87.515 ms * *

-A

On Fri, Jul 17, 2020 at 2:18 PM Aaron C. de Bruyn 
wrote:

> Anyone seeing Cloudflare DNS outages or site issues?
>
> Affecting a bunch of sites in Washington and Oregon.
>
> -A
>


CloudFlare Issues?

2020-07-17 Thread Chris Grundemann
Looks like there may be something big up (read: down) at CloudFlare, but
their status page is not reporting anything yet.

Am I crazy? Or just time to give up on the internet for this week?

-- 
@ChrisGrundemann
http://chrisgrundemann.com


Re: CloudFlare issues?

2019-07-07 Thread Mark Tinka



On 6/Jul/19 23:44, Matt Corallo wrote:
> On my test net I take ROA_INVALIDs and convert them to unreachables with
> a low preference (ie so that any upstreams taking only the shorter path
> will be selected, but so that such packets will never be routed).
>
> Obviously this isn't a well-supported operation, but I'm curious what
> people think of such an approach? If you really want to treat
> ROA_INVALID as "this is probably a hijack", you don't really want to be
> sending the hijacker traffic.

If a prefix's RPKI state is Invalid, drop it! Simple.

In most cases, it's a mistake due to a mis-configuration and/or a lack
of deep understanding of RPKI. In fewer cases, it's an actual hijack.

Either way, dropping the Invalid routes keeps the BGP clean and quickly
encourages the originating network to get things fixed.

As you point out, RPKI state validation is locally-significant, with
protection extending to downstream customers only. So for this to really
work, it needs critical mass. One, two, three, four or five networks
implementing ROV and dropping Invalids does not a secure BGP make.

Mark.
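
For anyone newer to ROV, the policy Mark describes ("drop Invalids, use Valid and
NotFound") boils down to very little logic. Here is a minimal, illustrative Python
sketch of RFC 6811-style origin validation, with a made-up ROA and prefixes purely
for demonstration; it is not any vendor's implementation.

import ipaddress
from collections import namedtuple

ROA = namedtuple("ROA", "prefix max_length asn")

# Hypothetical ROA set, for illustration only.
ROAS = [ROA(ipaddress.ip_network("192.0.2.0/24"), 24, 64500)]

def rov_state(prefix, origin_as):
    """Return 'valid', 'invalid' or 'notfound' for a (prefix, origin AS) pair."""
    net = ipaddress.ip_network(prefix)
    covering = [r for r in ROAS
                if net.version == r.prefix.version and net.subnet_of(r.prefix)]
    if not covering:
        return "notfound"
    if any(r.asn == origin_as and net.prefixlen <= r.max_length for r in covering):
        return "valid"
    return "invalid"

def accept(prefix, origin_as):
    # The policy described above: drop Invalids, accept Valid and NotFound.
    return rov_state(prefix, origin_as) != "invalid"

print(rov_state("192.0.2.0/24", 64500))     # valid
print(rov_state("192.0.2.0/25", 64500))     # invalid (exceeds maxLength)
print(rov_state("198.51.100.0/24", 64500))  # notfound
print(accept("192.0.2.0/25", 64500))        # False -> dropped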


Re: CloudFlare issues?

2019-07-07 Thread Mark Tinka



On 6/Jul/19 22:05, Brett Frankenberger wrote:

> These were more-specifics, though.  So if you drop all the
> more-specifics as failing ROV, then you end up following the valid
> shorter prefix to the destination.

I can't quite recall which Cloudflare prefixes were impacted. If you
have a sniff at https://bgp.he.net/AS13335#_prefixes and
https://bgp.he.net/AS13335#_prefixes6 you will see that Cloudflare have
a larger portion of their IPv6 prefixes ROA'd than the IPv4 ones. If you
remember which Cloudflare prefixes were affected by the Verizon debacle,
we can have a closer look.


>   Quite possibly that points at the
> upstream which sent you the more-specific which you rejected, at which
> point your packets end up same going to the same place they would have
> gone if you had accepted the invalid more-specific.

But that's my point... we did not have the chance to drop any of the
affected Cloudflare prefixes because we do not use the ARIN TAL.

That means that we are currently ignoring the RPKI value of Cloudflare's
prefixes that are under ARIN.

Also, AFAICT, none of our current upstreams are doing ROV. You can see
that list here:

    https://bgp.he.net/AS37100#_graph4

Mark.
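
For spot-checking the validation state of individual prefixes (for example the
Cloudflare ones discussed here), RIPEstat exposes an "rpki-validation" data call. A
small sketch follows; the URL, parameters and response field names are assumptions
based on the public RIPEstat documentation, so adjust if the API differs.

import json
import urllib.parse
import urllib.request

def rpki_state(origin_asn, prefix):
    """Query RIPEstat for the RPKI validation state of a prefix/origin pair."""
    qs = urllib.parse.urlencode({"resource": origin_asn, "prefix": prefix})
    url = "https://stat.ripe.net/data/rpki-validation/data.json?" + qs
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("data", {})

if __name__ == "__main__":
    # Example: a well-known Cloudflare prefix originated by AS13335.
    data = rpki_state("AS13335", "1.1.1.0/24")
    # Field names ("status", "validating_roas") are assumed; print defensively.
    print(data.get("status"), data.get("validating_roas"))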



Re: CloudFlare issues?

2019-07-06 Thread Matt Corallo
Oops, I mean with a script which removes such routes if there is an
encompassing route which a different upstream takes, as obviously the
more-specific would otherwise still win.

Matt

On 7/6/19 5:44 PM, Matt Corallo wrote:
> On my test net I take ROA_INVALIDs and convert them to unreachables with
> a low preference (ie so that any upstreams taking only the shorter path
> will be selected, but so that such packets will never be routed).
> 
> Obviously this isn't a well-supported operation, but I'm curious what
> people think of such an approach? If you really want to treat
> ROA_INVALID as "this is probably a hijack", you don't really want to be
> sending the hijacker traffic.
> 
> Of course if upstreams are rejecting ROA_INVALID you can still have the
> same problem one network away, but its an interesting result for
> testing, especially since it rejects a bunch of crap in China where CT
> has reassigned prefixes with covering ROAs to customers who re-announce
> on their own ASN (which appears to be common).
> 
> Matt
> 
> On 7/6/19 4:05 PM, Brett Frankenberger wrote:
>> On Thu, Jul 04, 2019 at 11:46:05AM +0200, Mark Tinka wrote:
>>> I finally thought about this after I got off my beer high :-).
>>>
>>> Some of our customers complained about losing access to Cloudflare's
>>> resources during the Verizon debacle. Since we are doing ROV and
>>> dropping Invalids, this should not have happened, given most of
>>> Cloudflare's IPv4 and IPv6 routes are ROA'd.
>>
>> These were more-specifics, though.  So if you drop all the
>> more-specifics as failing ROV, then you end up following the valid
>> shorter prefix to the destination.  Quite possibly that points at the
>> upstream which sent you the more-specific which you rejected, at which
>> point your packets end up going to the same place they would have 
>> gone if you had accepted the invalid more-specific.
>>
>> Two potential issues here:  First, if you don't have an upstream who
>> is also rejecting the invalid routes, then anywhere you send the
>> packets, they're going to follow the more-specific.  Second, even if
>> you do have an upstream that is rejecting the invalid routes, ROV won't
>> cause you to prefer the less-specific from an upstream that is
>> rejecting the invalid routes over a less-specific from an upstream that
>> is accepting the invalid routes.
>>
>> For example:
>>if upstream A sends you:
>>   10.0.0.0/16 valid
>>and upstream B sends you
>>   10.0.0.0/16 valid
>>   10.0.0.0/17 invalid
>>   10.0.128.0/17 invalid
>> you want to send the packet to A.  But ROV won't cause that, and if
>> upstream B is selected by your BGP decision criteria (path length,
>> etc.), your packets will ultimately follow the more-specific.
>>
>> (Of course, the problem can occur more than one network away.  Even
>> if you do send to upstream A, there's no guarantee that A's
>> less-specifics aren't pointed at another network that does have the
>> more-specifics.  But at least you give them a fighting chance by
>> sending them to A.)
>>
>>  -- Brett
>>


Re: CloudFlare issues?

2019-07-06 Thread Matt Corallo
On my test net I take ROA_INVALIDs and convert them to unreachables with
a low preference (ie so that any upstreams taking only the shorter path
will be selected, but so that such packets will never be routed).

Obviously this isn't a well-supported operation, but I'm curious what
people think of such an approach? If you really want to treat
ROA_INVALID as "this is probably a hijack", you don't really want to be
sending the hijacker traffic.

Of course if upstreams are rejecting ROA_INVALID you can still have the
same problem one network away, but its an interesting result for
testing, especially since it rejects a bunch of crap in China where CT
has reassigned prefixes with covering ROAs to customers who re-announce
on their own ASN (which appears to be common).

Matt

On 7/6/19 4:05 PM, Brett Frankenberger wrote:
> On Thu, Jul 04, 2019 at 11:46:05AM +0200, Mark Tinka wrote:
>> I finally thought about this after I got off my beer high :-).
>>
>> Some of our customers complained about losing access to Cloudflare's
>> resources during the Verizon debacle. Since we are doing ROV and
>> dropping Invalids, this should not have happened, given most of
>> Cloudflare's IPv4 and IPv6 routes are ROA'd.
> 
> These were more-specifics, though.  So if you drop all the
> more-specifics as failing ROV, then you end up following the valid
> shorter prefix to the destination.  Quite possibly that points at the
> upstream which sent you the more-specific which you rejected, at which
> point your packets end up going to the same place they would have
> gone if you had accepted the invalid more-specific.
> 
> Two potential issues here:  First, if you don't have an upstream who
> is also rejecting the invalid routes, then anywhere you send the
> packets, they're going to follow the more-specific.  Second, even if
> you do have an upstream that is rejecting the invalid routes, ROV won't
> cause you to prefer the less-specific from an upstream that is
> rejecting the invalid routes over a less-specific from an upstream that
> is accepting the invalid routes.
> 
> For example:
>if upstream A sends you:
>   10.0.0.0/16 valid
>and upstream B sends you
>   10.0.0.0/16 valid
>   10.0.0.0/17 invalid
>   10.0.128.0/17 invalid
> you want to send the packet to A.  But ROV won't cause that, and if
> upstream B is selected by your BGP decision criteria (path length,
> etc.), your packets will ultimately follow the more-specific.
> 
> (Of course, the problem can occur more than one network away.  Even
> if you do send to upstream A, there's no guarantee that A's
> less-specifics aren't pointed at another network that does have the
> more-specifics.  But at least you give them a fighting chance by
> sending them to A.)
> 
>  -- Brett
> 
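
A toy model of the approach Matt describes above: keep RPKI-invalid routes only as
low-preference unreachables, and discard them entirely when a different upstream
supplies an encompassing route, so the covering prefix can win without traffic ever
being forwarded toward the possibly hijacked more-specific. Illustrative Python only;
in practice this would be router policy plus the external script he mentions.

import ipaddress
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    upstream: str
    rpki: str               # 'valid' | 'invalid' | 'notfound'
    local_pref: int = 100
    unreachable: bool = False

def transform(routes):
    out = []
    for r in routes:
        if r.rpki == "invalid":
            # Step 1: keep it, but as a low-preference blackhole ("unreachable").
            r = Route(r.prefix, r.upstream, r.rpki, local_pref=10, unreachable=True)
            # Step 2: drop it if another upstream gives us an encompassing route.
            net = ipaddress.ip_network(r.prefix)
            covered_elsewhere = any(
                o.upstream != r.upstream and o.rpki != "invalid"
                and net.version == ipaddress.ip_network(o.prefix).version
                and net.subnet_of(ipaddress.ip_network(o.prefix))
                for o in routes)
            if covered_elsewhere:
                continue
        out.append(r)
    return out

rib = [Route("10.0.0.0/16", "upstream-A", "valid"),
       Route("10.0.0.0/17", "upstream-B", "invalid"),   # dropped: A covers it
       Route("10.1.0.0/17", "upstream-B", "invalid")]   # kept, but as a blackhole
for r in transform(rib):
    print(r)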


Re: CloudFlare issues?

2019-07-06 Thread Brett Frankenberger
On Thu, Jul 04, 2019 at 11:46:05AM +0200, Mark Tinka wrote:
> I finally thought about this after I got off my beer high :-).
> 
> Some of our customers complained about losing access to Cloudflare's
> resources during the Verizon debacle. Since we are doing ROV and
> dropping Invalids, this should not have happened, given most of
> Cloudflare's IPv4 and IPv6 routes are ROA'd.

These were more-specifics, though.  So if you drop all the
more-specifics as failing ROV, then you end up following the valid
shorter prefix to the destination.  Quite possibly that points at the
upstream which sent you the more-specific which you rejected, at which
point your packets end up going to the same place they would have
gone if you had accepted the invalid more-specific.

Two potential issues here:  First, if you don't have an upstream who
is also rejecting the invalid routes, then anywhere you send the
packets, they're going to follow the more-specific.  Second, even if
you do have an upstream that is rejecting the invalid routes, ROV won't
cause you to prefer the less-specific from an upstream that is
rejecting the invalid routes over a less-specific from an upstream that
is accepting the invalid routes.

For example:
   if upstream A sends you:
   10.0.0.0/16 valid
   and upstream B sends you
  10.0.0.0/16 valid
  10.0.0.0/17 invalid
  10.0.128.0/17 invalid
you want to send the packet to A.  But ROV won't cause that, and if
upstream B is selected by your BGP decision criteria (path length,
etc.), your packets will ultimately follow the more-specific.

(Of course, the problem can occur more than one network away.  Even
if you do send to upstream A, there's no guarantee that A's
less-specifics aren't pointed at another network that does have the
more-specifics.  But at least you give them a fighting chance by
sending them to A.)

 -- Brett
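
Brett's point is really just longest-prefix-match arithmetic, and a few lines of
illustrative Python make it concrete with his 10.0.0.0/16 example: rejecting the
invalid /17s does not stop ordinary BGP tie-breaking from handing the covering /16
(and therefore the traffic) to the upstream that still carries them. The "shortest
AS path wins" rule below is a crude stand-in for the full decision process.

import ipaddress

# (prefix, upstream, AS-path length, RPKI state)
received = [("10.0.0.0/16",   "A", 3, "valid"),
            ("10.0.0.0/16",   "B", 2, "valid"),
            ("10.0.0.0/17",   "B", 2, "invalid"),
            ("10.0.128.0/17", "B", 2, "invalid")]

# Step 1: ROV policy - reject Invalids.
accepted = [r for r in received if r[3] != "invalid"]

# Step 2: per prefix, pick the best path (shorter AS path wins in this toy model).
best = {}
for prefix, upstream, pathlen, _ in accepted:
    if prefix not in best or pathlen < best[prefix][1]:
        best[prefix] = (upstream, pathlen)

# Step 3: longest-prefix match for a destination address.
def lookup(dst):
    addr = ipaddress.ip_address(dst)
    candidates = [p for p in best if addr in ipaddress.ip_network(p)]
    if not candidates:
        return None
    return best[max(candidates, key=lambda p: ipaddress.ip_network(p).prefixlen)][0]

print(lookup("10.0.1.1"))   # -> 'B': the less-specific still points at the upstream
                            # that is carrying the invalid more-specifics.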


Re: CloudFlare issues?

2019-07-05 Thread i3D.net - Martijn Schmidt via NANOG
Hey Sandy,

At this time i3D.net is not able to fully implement RPKI for technical 
reasons: there are still some Brocade routers in our network which don't 
support it. We are making very good progress migrating the entire 
network over to Juniper routers which do support RPKI, and we will 
certainly deploy ROV when that is done, but with upwards of 40 
default-free backbone routers spread over six continents it's not a 
logistically trivial task.

That being said, a network doesn't need to use ROV to benefit from the 
routing security afforded by the RPKI protocol. Nearly all of the 
prefixes originated by AS49544 have been covered by RPKI ROAs for 
several years now. Those networks which have already deployed ROV are 
inoculated against route hijacks of i3D.net's IP space in scenarios 
where the bad paths would be marked as RPKI invalid. Considering that 
i3D.net was founded in The Netherlands and that a significant amount of 
our enterprise customers have businesses which are focused on the Dutch 
market, the fact that two of the major eyeball networks in the country 
(that'd be KPN & XS4ALL) are using ROV is already a huge win for 
everyone involved.

And, let's not forget that the degree of protection afforded by this 
relatively passive participation in RPKI is directly proportional to the 
use of a non-ARIN TAL. Real-world example: Mark Tinka's remark 
concerning Seacom's connection to Cloudflare's IP space being affected 
by the hijack due to the ARIN TAL problem, despite both involved parties 
fully deploying RPKI by both signing ROAs and implementing ROV.

Best regards,
Martijn

On 7/5/19 8:46 PM, Sandra Murphy wrote:
> Martijn - i3D.net is not in the list Job posted yesterday of RPKI ROV 
> deployment.  Your message below hints that you may be using RPKI.  Are you 
> doing ROV?  (You may be in the “hundreds of others” category.)
>
> —Sandy
>
> Begin forwarded message:
>
> From: Job Snijders 
> Subject: Re: CloudFlare issues?
> Date: July 4, 2019 at 11:33:57 AM EDT
> To: Francois Lecavalier 
> Cc: "nanog@nanog.org" 
>
> I believe at this point in time it is safe to accept valid and unknown
> (combined with an IRR filter), and reject RPKI invalid BGP announcements
> at your EBGP borders. Large examples of other organisations who already
are rejecting invalid announcements are AT&T, Nordunet, DE-CIX, YYCIX,
XS4ALL, MSK-IX, INEX, France-IX, Seacom, Workonline, KPN International,
> and hundreds of others.
>
>
>
>> On Jul 4, 2019, at 5:56 AM, i3D.net - Martijn Schmidt via NANOG 
>>  wrote:
>>
>> So that means it's time for everyone to migrate their ARIN resources to a 
>> sane RIR that does allow normal access to and redistribution of its RPKI 
>> TAL? ;-)
>>
>> The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were 
>> both major reasons for us to bring our ARIN IPv4 address space to RIPE. 
>> Unfortunately we had to renumber our handful of IPv6 customers because ARIN 
>> doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.
>>
>> Therefore, Cloudflare folks - when are you transferring your resources away 
>> from ARIN? :D
>>
>> Best regards,
>> Martijn
>>
>> On 7/4/19 11:46 AM, Mark Tinka wrote:
>>> I finally thought about this after I got off my beer high :-).
>>>
>>> Some of our customers complained about losing access to Cloudflare's 
>>> resources during the Verizon debacle. Since we are doing ROV and dropping 
>>> Invalids, this should not have happened, given most of Cloudflare's IPv4 
>>> and IPv6 routes are ROA'd.
>>>
>>> However, since we are not using the ARIN TAL (for known reasons), this 
>>> explains why this also broke for us.
>>>
>>> Back to beer now :-)...
>>>
>>> Mark.



Re: CloudFlare issues?

2019-07-05 Thread Sandra Murphy
Martijn - i3D.net is not in the list Job posted yesterday of RPKI ROV 
deployment.  Your message below hints that you may be using RPKI.  Are you 
doing ROV?  (You may be in the “hundreds of others” category.)

—Sandy

Begin forwarded message:

From: Job Snijders 
Subject: Re: CloudFlare issues?
Date: July 4, 2019 at 11:33:57 AM EDT
To: Francois Lecavalier 
Cc: "nanog@nanog.org" 

I believe at this point in time it is safe to accept valid and unknown
(combined with an IRR filter), and reject RPKI invalid BGP announcements
at your EBGP borders. Large examples of other organisations who already
are rejecting invalid announcements are AT&T, Nordunet, DE-CIX, YYCIX,
XS4ALL, MSK-IX, INEX, France-IX, Seacom, Workonline, KPN International,
and hundreds of others.



> On Jul 4, 2019, at 5:56 AM, i3D.net - Martijn Schmidt via NANOG 
>  wrote:
> 
> So that means it's time for everyone to migrate their ARIN resources to a 
> sane RIR that does allow normal access to and redistribution of its RPKI TAL? 
> ;-)
> 
> The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were 
> both major reasons for us to bring our ARIN IPv4 address space to RIPE. 
> Unfortunately we had to renumber our handful of IPv6 customers because ARIN 
> doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.
> 
> Therefore, Cloudflare folks - when are you transferring your resources away 
> from ARIN? :D
> 
> Best regards,
> Martijn
> 
> On 7/4/19 11:46 AM, Mark Tinka wrote:
>> I finally thought about this after I got off my beer high :-).
>> 
>> Some of our customers complained about losing access to Cloudflare's 
>> resources during the Verizon debacle. Since we are doing ROV and dropping 
>> Invalids, this should not have happened, given most of Cloudflare's IPv4 and 
>> IPv6 routes are ROA'd.
>> 
>> However, since we are not using the ARIN TAL (for known reasons), this 
>> explains why this also broke for us.
>> 
>> Back to beer now :-)...
>> 
>> Mark.
> 



Re: CloudFlare issues?

2019-07-04 Thread Job Snijders
> Anyway, you can now enjoy https://rpki.net/s/rpki-test even more! :-)

my apologies, I fumbled the ball on typing in that URL, I intended to
point here: https://www.ripe.net/s/rpki-test


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka



On 4/Jul/19 20:46, Francois Lecavalier wrote:

> It's been close to 3 hours now since I dropped them - radio silence.
>
> Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
> implement, validate and troubleshoot.

Well done! Congrats!

Mark.






Re: CloudFlare issues?

2019-07-04 Thread Job Snijders
On Thu, Jul 4, 2019 at 8:46 PM Francois Lecavalier
 wrote:

> It's been close to 3 hours now since I dropped them - radio silence.

I am going to assume that "radio silence" for you means that your
network is fully functional and none of your customers have raised
issues! :-)

> Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
> implement, validate and troubleshoot.

Thank you for sharing your report. I believe it is good to share rpki
stories with each other, not just to celebrate the deployment of an
exciting technology, but also to help provide debugging information
ahead of time should there be issues between provider A and B due to a
ROA misconfiguration. Announcing to the public that one has deployed
RPKI - in this stage of the lifecycle of the tech - probably is a
productive action to consider.

Anyway, you can now enjoy https://rpki.net/s/rpki-test even more! :-)

Kind regards,

Job


Re: CloudFlare issues?

2019-07-04 Thread Ben Maddison via NANOG
Welcome to the club!



From: Francois Lecavalier 
Sent: Thursday, July 4, 2019 8:46:46 PM
To: Ben Maddison; j...@ntt.net
Cc: nanog@nanog.org
Subject: RE: CloudFlare issues?

>> At this point in time I think the ideal deployment model is to perform
>> the validation within your administrative domain and run your own
>> validators.

>+1

We'll definitely look into this shortly.  I don't want to leave a 
security measure in the hands of a third party, but with my team being so busy it 
was a quick temp fix.

> The larger challenge has been related to vendor implementation choices and 
> bugs, particularly on ios-xe. Happy to go into more detail if anyone is 
> interested.

We are on Juniper MX204's at the edge and they have been solid for the last 60 
weeks - we ran into a long list of bugs on other platforms but not on these.

So I had about 4200 routes marked as invalid.  After looking at a sample of 
them it looks like most of them have a valid ROA with an improper mask length - 
so there is ultimately a route to these prefixes and at worst it would result in 
"suboptimal" routing - or should I say: the remote network can't control its 
route propagation anymore.  In most cases they are stub networks with a single 
/24 reassigned from the upstream provider.  I have no traffic going directly to 
these networks and I don't expect any to go there anytime soon.

It's been close to 3 hours now since I dropped them - radio silence.

Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
implement, validate and troubleshoot.

-Original Message-
From: Ben Maddison 
Sent: Thursday, July 4, 2019 11:51 AM
To: j...@ntt.net; Francois Lecavalier 
Cc: nanog@nanog.org
Subject: [External] Re: CloudFlare issues?

Hi Francois,

On Thu, 2019-07-04 at 17:33 +0200, Job Snijders wrote:
> Dear Francois,
>
> On Thu, Jul 04, 2019 at 03:22:23PM +, Francois Lecavalier wrote:
> >
> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.

+1

>
> > But I also have a question for all the ROA folks out there.  So far
> > we are not taking any action other than lowering the local-pref - we
> > want to make sure this is stable before we start denying prefixes.
> > So the question, is it safe as of this date to : 1.Accept valid, 2.
> > Accept unknown, 3. Reject invalid?  Have any large network who
> > implemented it dealt with unreachable destinations?  I'm wondering
> > as I haven't found any blog mentioning anything in this regard and
> > Cloudflare docs only show examples for valid and invalid, but nothing
> > for unknown.
>
We have been dropping Invalids since April, and have had only a
(single-digit) handful of support requests related to those becoming 
unreachable.

The larger challenge has been related to vendor implementation choices and 
bugs, particularly on ios-xe. Happy to go into more detail if anyone is 
interested.

I would recommend *not* taking any policy action that distinguishes Valid from 
Unknown. If you find that you have routes for the same prefix/len with both 
statuses, then that is a bug and/or misconfiguration which you could turn into 
a loop by taking policy action on that difference.

Cheers,

Ben


RE: CloudFlare issues?

2019-07-04 Thread Francois Lecavalier
>> At this point in time I think the ideal deployment model is to perform
>> the validation within your administrative domain and run your own
>> validators.

>+1

We'll definitely look into this shortly.  I don't want to leave a 
security measure in the hands of a third party, but with my team being so busy it 
was a quick temp fix.

> The larger challenge has been related to vendor implementation choices and 
> bugs, particularly on ios-xe. Happy to go into more detail if anyone is 
> interested.

We are on Juniper MX204's at the edge and they have been solid for the last 60 
weeks - we ran into a long list of bugs on other platforms but not on these.

So I had about 4200 routes marked as invalid.  After looking at a sample of 
them it looks like most of them have a valid ROA with an improper mask length - 
so there is ultimately a route to these prefixes and at worst it would result in 
"suboptimal" routing - or should I say: the remote network can't control its 
route propagation anymore.  In most cases they are stub networks with a single 
/24 reassigned from the upstream provider.  I have no traffic going directly to 
these networks and I don't expect any to go there anytime soon.

It's been close to 3 hours now since I dropped them - radio silence.

Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
implement, validate and troubleshoot.

-Original Message-
From: Ben Maddison 
Sent: Thursday, July 4, 2019 11:51 AM
To: j...@ntt.net; Francois Lecavalier 
Cc: nanog@nanog.org
Subject: [External] Re: CloudFlare issues?

Hi Francois,

On Thu, 2019-07-04 at 17:33 +0200, Job Snijders wrote:
> Dear Francois,
>
> On Thu, Jul 04, 2019 at 03:22:23PM +, Francois Lecavalier wrote:
> >
> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.

+1

>
> > But I also have a question for all the ROA folks out there.  So far
> > we are not taking any action other than lowering the local-pref - we
> > want to make sure this is stable before we start denying prefixes.
> > So the question, is it safe as of this date to : 1.Accept valid, 2.
> > Accept unknown, 3. Reject invalid?  Have any large network who
> > implemented it dealt with unreachable destinations?  I'm wondering
> > as I haven't found any blog mentioning anything in this regard and
> > > Cloudflare docs only show examples for valid and invalid, but nothing
> > for unknown.
>
We have been dropping Invalids since April, and have had only a
(single-digit) handful of support requests related to those becoming 
unreachable.

The larger challenge has been related to vendor implementation choices and 
bugs, particularly on ios-xe. Happy to go into more detail if anyone is 
interested.

I would recommend *not* taking any policy action that distinguishes Valid from 
Unknown. If you find that you have routes for the same prefix/len with both 
statuses, then that is a bug and/or misconfiguration which you could turn into 
a loop by taking policy action on that difference.

Cheers,

Ben
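
Francois' observation in the message above - that most Invalids carry a ROA for the
right origin but a prefix longer than the ROA's maxLength allows - can be checked
mechanically. Below is an illustrative helper using a hypothetical ROA shape; it is
not tied to any particular validator, it just separates the two failure reasons.

import ipaddress
from collections import namedtuple

ROA = namedtuple("ROA", "prefix max_length asn")

def invalid_reason(roas, prefix, origin_as):
    """Classify a route against a ROA set, naming why it is invalid if it is."""
    net = ipaddress.ip_network(prefix)
    covering = [r for r in roas
                if net.version == r.prefix.version and net.subnet_of(r.prefix)]
    if not covering:
        return "notfound"
    if any(r.asn == origin_as and net.prefixlen <= r.max_length for r in covering):
        return "valid"
    if any(r.asn == origin_as for r in covering):
        return "invalid: maxLength exceeded"   # right origin, prefix too specific
    return "invalid: origin AS mismatch"

roas = [ROA(ipaddress.ip_network("203.0.113.0/24"), 24, 64500)]
print(invalid_reason(roas, "203.0.113.0/25", 64500))  # invalid: maxLength exceeded
print(invalid_reason(roas, "203.0.113.0/24", 64501))  # invalid: origin AS mismatch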


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka



On 4/Jul/19 17:50, Ben Maddison via NANOG wrote:

> We have been dropping Invalids since April, and have had only a
> (single-digit) handful of support requests related to those becoming
> unreachable.

We've had 2 cases where customers could not reach a prefix. Both were
mistakes (as we've found most Invalid routes to be), which were promptly
fixed.

One of them was where a cloud provider decided to originate a longer
prefix on behalf of their content-producing customer, using their own AS
as opposed to the one the customer had used to create the ROA for the
covering block.

Mark.


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka



On 4/Jul/19 17:33, Job Snijders wrote:

> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.

In essence, this is also my thought process.

I think Cloudflare are very well-intentioned in making it as painless as
possible to support other operators to get RPKI deployed (and more power
to them for going to such lengths to do so), but you have to determine
whether you are willing to let a service such as this run outside of your
domain.

Every year, someone asks me whether I'd be willing to outsource my route
reflector VNF's to AWS/Azure/e.t.c. My answer to that falls within the
realms of handling RPKI for your network :-).

Mark.


Re: CloudFlare issues?

2019-07-04 Thread Nick Hilliard

Francois Lecavalier wrote on 04/07/2019 16:22:
My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject 
invalid shouldn’t break anything.


Accepting valid ROAs is a better idea after checking that the source AS 
is legitimate from the peer.


Nick


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka


On 4/Jul/19 17:22, Francois Lecavalier wrote:

>  
>
> Following that Verizon debacle I got onboard with ROV, after a couple
> research I stopped my choice on the ….drum roll…. CloudFlare GoRTR
> (https://github.com/cloudflare/gortr).  If you trust them enough they
> provide an updated JSON every 15 minutes of the global RIR aggregate. 
> I’ll see down the road if we’ll fetch them ourselves but at least it
> got us up and running in less than an hour.  It was also easy for us
> to deploy as the routers and the servers are on the same PoP directly
> connected, so we don’t need the whole encryption recipe they provide
> for mass distribution.
>

Funny you should mention this... I was speaking with Tom today during an
RPKI talk he gave at MyNOG, about whether we'd be willing to trust their
RTR streams.

But, I'm glad you found a quick solution to get you up and running.
Welcome to the club.


>  
>
> But I also have a question for all the ROA folks out there.  So far we
> are not taking any action other than lowering the local-pref – we want
> to make sure this is stable before we start denying prefixes.  So the
> question, is it safe as of this date to : 1.Accept valid, 2. Accept
> unknown, 3. Reject invalid?  Have any large network who implemented it
> dealt with unreachable destinations?  I’m wondering as I haven’t found
> any blog mentioning anything in this regard and Cloudflare docs only
> show examples for valid and invalid, but nothing for unknown.
>
>  
>
> My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject
> invalid shouldn’t break anything.
>

Well, Valid and NotFound states implicitly mean that the routes can be
used for routing/forwarding. In that case, the only policy we create and
apply is against Invalid routes, which is to DROP them.

Mark.


Re: CloudFlare issues?

2019-07-04 Thread Ben Maddison via NANOG
Hi Francois,

On Thu, 2019-07-04 at 17:33 +0200, Job Snijders wrote:
> Dear Francois,
> 
> On Thu, Jul 04, 2019 at 03:22:23PM +, Francois Lecavalier wrote:
> > 
> At this point in time I think the ideal deployment model is to
> perform
> the validation within your administrative domain and run your own
> validators. 

+1

> 
> > But I also have a question for all the ROA folks out there.  So far
> > we
> > are not taking any action other than lowering the local-pref - we
> > want
> > to make sure this is stable before we start denying prefixes.  So
> > the
> > question, is it safe as of this date to : 1.Accept valid, 2. Accept
> > unknown, 3. Reject invalid?  Have any large network who implemented
> > it
> > dealt with unreachable destinations?  I'm wondering as I haven't
> > found
> > any blog mentioning anything in this regard and Cloudflare docs only
> > show examples for valid and invalid, but nothing for unknown.
> 
We have been dropping Invalids since April, and have had only a
(single-digit) handful of support requests related to those becoming
unreachable.

The larger challenge has been related to vendor implementation choices
and bugs, particularly on ios-xe. Happy to go into more detail if
anyone is interested.

I would recommend *not* taking any policy action that distinguishes
Valid from Unknown. If you find that you have routes for the same
prefix/len with both statuses, then that is a bug and/or
misconfiguration which you could turn into a loop by taking policy
action on that difference.

Cheers,

Ben


Re: CloudFlare issues?

2019-07-04 Thread Job Snijders
Dear Francois,

On Thu, Jul 04, 2019 at 03:22:23PM +, Francois Lecavalier wrote:
> Following that Verizon debacle I got onboard with ROV, after a couple
> research I stopped my choice on the drum roll CloudFlare GoRTR
> (https://github.com/cloudflare/gortr).  If you trust them enough they
> provide an updated JSON every 15 minutes of the global RIR aggregate.

At this point in time I think the ideal deployment model is to perform
the validation within your administrative domain and run your own
validators. You can combine routinator with gortr, or use cloudflare's
octorpki software https://github.com/cloudflare/cfrpki

> I'll see down the road if we'll fetch them ourselves but at least it
> got us up and running in less than an hour.  It was also easy for us
> to deploy as the routers and the servers are on the same PoP directly
> connected, so we don't need the whole encryption recipe they provide
> for mass distribution.

yeah, that is true!

> But I also have a question for all the ROA folks out there.  So far we
> are not taking any action other than lowering the local-pref - we want
> to make sure this is stable before we start denying prefixes.  So the
> question, is it safe as of this date to : 1.Accept valid, 2. Accept
> unknown, 3. Reject invalid?  Have any large network who implemented it
> dealt with unreachable destinations?  I'm wondering as I haven't found
> any blog mentioning anything in this regard and Cloudflare docs only
> show examples for valid and invalid, but nothing for unknown.

I believe at this point in time it is safe to accept valid and unknown
(combined with an IRR filter), and reject RPKI invalid BGP announcements
at your EBGP borders. Large examples of other organisations who already
are rejecting invalid announcements are AT&T, Nordunet, DE-CIX, YYCIX,
XS4ALL, MSK-IX, INEX, France-IX, Seacom, Workonline, KPN International,
and hundreds of others.

You can run an analysis yourself to see how traffic would be impacted in
your network using pmacct or Kentik, see this post for more info:
https://mailman.nanog.org/pipermail/nanog/2019-February/099522.html

> My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject
> invalid shouldn't break anything.

Correct! Let us know how it went :-)

Kind regards,

Job


Re: CloudFlare issues?

2019-07-04 Thread Francois Lecavalier
Hi Mark,

Following that Verizon debacle I got onboard with ROV, after a couple research 
I stopped my choice on the drum roll CloudFlare GoRTR 
(https://github.com/cloudflare/gortr).  If you trust them enough they provide 
an updated JSON every 15 minutes of the global RIR aggregate.  I'll see down 
the road if we'll fetch them ourselves but at least it got us up and running in 
less than an hour.  It was also easy for us to deploy as the routers and the 
servers are on the same PoP directly connected, so we don't need the whole 
encryption recipe they provide for mass distribution.

But I also have a question for all the ROA folks out there.  So far we are not 
taking any action other than lowering the local-pref - we want to make sure 
this is stable before we start denying prefixes.  So the question, is it safe 
as of this date to : 1.Accept valid, 2. Accept unknown, 3. Reject invalid?  
Have any large network who implemented it dealt with unreachable destinations?  
I'm wondering as I haven't found any blog mentioning anything in this regard 
and Cloudflare docs only show examples for valid and invalid, but nothing for 
unknown.

My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject invalid 
shouldn't break anything.

Thanks,

-Francois
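
For anyone who wants to peek at the same data GoRTR consumes, a small sketch that
pulls Cloudflare's published JSON and lists the ROAs covering a prefix. The URL is
GoRTR's documented default cache, and the field names ("roas", "prefix", "maxLength",
"asn") are assumptions about that export's schema, so adjust if it differs; the file
is large, so expect the download to take a moment.

import ipaddress
import json
import urllib.request

URL = "https://rpki.cloudflare.com/rpki.json"   # GoRTR's default cache (assumed)

def covering_roas(prefix):
    """Yield ROA entries from the export that cover the given prefix."""
    with urllib.request.urlopen(URL, timeout=60) as resp:
        export = json.load(resp)
    roas = export.get("roas", [])
    if isinstance(roas, dict):          # tolerate a schema that nests the list
        roas = roas.get("roas", [])
    net = ipaddress.ip_network(prefix)
    for roa in roas:
        roa_net = ipaddress.ip_network(roa["prefix"])
        if net.version == roa_net.version and net.subnet_of(roa_net):
            yield roa

if __name__ == "__main__":
    for roa in covering_roas("1.1.1.0/24"):
        print(roa.get("asn"), roa["prefix"], "maxLength", roa.get("maxLength"))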


Re: CloudFlare issues?

2019-07-04 Thread i3D.net - Martijn Schmidt via NANOG
So that means it's time for everyone to migrate their ARIN resources to a sane 
RIR that does allow normal access to and redistribution of its RPKI TAL? ;-)

The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were 
both major reasons for us to bring our ARIN IPv4 address space to RIPE. 
Unfortunately we had to renumber our handful of IPv6 customers because ARIN 
doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.

Therefore, Cloudflare folks - when are you transferring your resources away 
from ARIN? :D

Best regards,
Martijn

On 7/4/19 11:46 AM, Mark Tinka wrote:
I finally thought about this after I got off my beer high :-).

Some of our customers complained about losing access to Cloudflare's resources 
during the Verizon debacle. Since we are doing ROV and dropping Invalids, this 
should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are 
ROA'd.

However, since we are not using the ARIN TAL (for known reasons), this explains 
why this also broke for us.

Back to beer now :-)...

Mark.



Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka
I finally thought about this after I got off my beer high :-).

Some of our customers complained about losing access to Cloudflare's
resources during the Verizon debacle. Since we are doing ROV and
dropping Invalids, this should not have happened, given most of
Cloudflare's IPv4 and IPv6 routes are ROA'd.

However, since we are not using the ARIN TAL (for known reasons), this
explains why this also broke for us.

Back to beer now :-)...

Mark.


Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Randy Bush
>> perhaps the good side of this saga is that it may be an inflection
>> point
> I doubt it.
> The greyer my hair gets, the crankier I get.

i suspect i am a bit ahead of you there

but i used to think that the public would never become aware of privacy
issues.  snowden bumped that ball and tim cook spiked it.  and it is
getting more and more air time.

randy


Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Sean Donelan

On Tue, 25 Jun 2019, Randy Bush wrote:

perhaps the good side of this saga is that it may be an inflection point


I doubt it.

The greyer my hair gets, the crankier I get.




Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Randy Bush
perhaps the good side of this saga is that it may be an inflection point

randy


Re: CloudFlare issues?

2019-06-25 Thread Randy Bush
>> Respectfully, I believe Cloudflare’s public comments today have been
>> a real disservice. This blog post, and your CEO on Twitter today,
>> took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re
>> not.
> 
> I presume that seeing a CF blog post isn’t regular for you. :-).

never seen such a thing :)

amidst all this conjecturbation and blame casting, have any of the
parties *directly* involved, i.e. 701 and their customer, issued any
sort of post mortem from which we might learn?

randy


Re: BGP filtering study resources (Was: CloudFlare issues?)

2019-06-25 Thread Alex Band
For further community-driven RPKI information there is:

https://rpki.readthedocs.io/ 

Along with an FAQ:

https://rpki.readthedocs.io/en/latest/about/faq.html

Cheers,

-Alex

> On 25 Jun 2019, at 17:55, BATTLES, TIM  wrote:
> 
> https://www.nccoe.nist.gov/projects/building-blocks/secure-inter-domain-routing
>  
> Timothy A Battles
> Chief Security Office
> 314-280-4578
> tb2...@att.com
> 12976 Hollenberg Dr
> Bridgeton, MO 63044
>  
>  
>  
> From: NANOG  On Behalf Of Tom Beecher
> Sent: Tuesday, June 25, 2019 9:42 AM
> To: Job Snijders 
> Cc: NANOG 
> Subject: Re: BGP filtering study resources (Was: CloudFlare issues?)
>  
> Job also enjoys having his ID checked. Can we get a best practices link added 
> to the list for that?
>  
> On Tue, Jun 25, 2019 at 10:27 AM Job Snijders  wrote:
> Dear Stephen,
> 
> On Tue, Jun 25, 2019 at 07:04:12AM -0700, Stephen Satchell wrote:
> > On 6/25/19 2:25 AM, Katie Holly wrote:
> > > Disclaimer: As much as I dislike Cloudflare (I used to complain
> > > about them a lot on Twitter), this is something I am absolutely
> > > agreeing with them. Verizon failed to do the most basic of network
> > > security, and it will happen again, and again, and again...
> > 
> > I used to be a quality control engineer in my career, so I have a
> > question to ask from the perspective of a QC guy:  what is the Best
> > Practice for minimizing, if not totally preventing, this sort of
> > problem?  Is there a "cookbook" answer to this?
> > 
> > (I only run edge networks now, and don't have BGP to worry about.  If
> > my current $dayjob goes away -- they all do -- I might have to get
> > back into the BGP game, so this is not an idle query.)
> > 
> > Somehow "just be careful and clueful" isn't the right answer.
> 
> Here are some resources which maybe can serve as a starting point for
> anyone interested in the problem space:
> 
> presentation: Architecting robust routing policies
> pdf: 
> https://ripe77.ripe.net/presentations/59-RIPE77_Snijders_Routing_Policy_Architecture.pdf
> video: 
> https://ripe77.ripe.net/archive/video/Job_Snijders-B._BGP_Policy_Update-20181017-140440.mp4
> 
> presentation: Practical Everyday BGP filtering "Peerlocking"
> pdf: http://instituut.net/~job/NANOG67_NTT_peerlocking_JobSnijders.pdf
> video: https://www.youtube.com/watch?v=CSLpWBrHy10
> 
> RFC 8212 ("EBGP default deny") and why we should ask our vendors like
> Cisco IOS, IOS XE, NX-OS, Juniper, Arista, Brocade, etc... to be
> compliant with this RFC:
> slides 2-14: 
> http://largebgpcommunities.net/presentations/ITNOG3-Job_Snijders_Recent_BGP_Innovations.pdf
> skip to the rfc8212 part: https://youtu.be/V6Wsq66-f40?t=854
> compliance tracker: http://github.com/bgp/RFC8212
> 
> The NLNOG Day in Fall 2018 has a wealth of RPKI related presentations
> and testimonies: https://nlnog.net/nlnog-day-2018/
> 
> Finally, there is the NLNOG BGP Filter Guide: http://bgpfilterguide.nlnog.net/
> If you spot errors or have suggestions, please submit them via github
> https://github.com/nlnog/bgpfilterguide
> 
> Please let me or the group know should you require further information,
> I love talking about this topic ;-)
> 
> Kind regards,
> 
> Job



RE: BGP filtering study resources (Was: CloudFlare issues?)

2019-06-25 Thread BATTLES, TIM
https://www.nccoe.nist.gov/projects/building-blocks/secure-inter-domain-routing

Timothy A Battles
Chief Security Office
314-280-4578
tb2...@att.com
12976 Hollenberg Dr
Bridgeton, MO 63044

The information contained in this e-mail, including any attachment(s), is 
intended solely for use by the named addressee(s).  If you are not the intended 
recipient, or a person designated as responsible for delivering such messages 
to the intended recipient, you are not authorized to disclose, copy, distribute 
or retain this message, in whole or in part, without written authorization from 
the sender.  This e-mail may contain proprietary, confidential or privileged 
information. If you have received this message in error, please notify the 
sender immediately.


From: NANOG  On Behalf Of Tom Beecher
Sent: Tuesday, June 25, 2019 9:42 AM
To: Job Snijders 
Cc: NANOG 
Subject: Re: BGP filtering study resources (Was: CloudFlare issues?)

Job also enjoys having his ID checked. Can we get a best practices link added 
to the list for that?

On Tue, Jun 25, 2019 at 10:27 AM Job Snijders  wrote:
Dear Stephen,

On Tue, Jun 25, 2019 at 07:04:12AM -0700, Stephen Satchell wrote:
> On 6/25/19 2:25 AM, Katie Holly wrote:
> > Disclaimer: As much as I dislike Cloudflare (I used to complain
> > about them a lot on Twitter), this is something I am absolutely
> > agreeing with them. Verizon failed to do the most basic of network
> > security, and it will happen again, and again, and again...
>
> I used to be a quality control engineer in my career, so I have a
> question to ask from the perspective of a QC guy:  what is the Best
> Practice for minimizing, if not totally preventing, this sort of
> problem?  Is there a "cookbook" answer to this?
>
> (I only run edge networks now, and don't have BGP to worry about.  If
> my current $dayjob goes away -- they all do -- I might have to get
> back into the BGP game, so this is not an idle query.)
>
> Somehow "just be careful and clueful" isn't the right answer.

Here are some resources which maybe can serve as a starting point for
anyone interested in the problem space:

presentation: Architecting robust routing policies
pdf: 
https://ripe77.ripe.net/presentations/59-RIPE77_Snijders_Routing_Policy_Architecture.pdf
video: 
https://ripe77.ripe.net/archive/video/Job_Snijders-B._BGP_Policy_Update-20181017-140440.mp4

presentation: Practical Everyday BGP filtering "Peerlocking"
pdf: http://instituut.net/~job/NANOG67_NTT_peerlocking_JobSnijders.pdf
video: https://www.youtube.com/watch?v=CSLpWBrHy10

RFC 8212 ("EBGP default deny") and why we should ask our vendors like
Cisco IOS, IOS XE, NX-OS, Juniper, Arista, Brocade, etc... to be
compliant with this RFC:
slides 2-14: 
http://largebgpcommunities.net/presentations/ITNOG3-Job_Snijders_Recent_BGP_Innovations.pdf
skip to the rfc8212 part: https://youtu.be/V6Wsq66-f40?t=854
compliance tracker: http://github.com/bgp/RFC8212

The NLNOG Day in Fall 2018 has a wealth of RPKI related presentations
and testimonies: https://nlnog.net/nlnog-day-2018/

Re: BGP filtering study resources (Was: CloudFlare issues?)

2019-06-25 Thread Tom Beecher
Job also enjoys having his ID checked. Can we get a best practices link
added to the list for that?

On Tue, Jun 25, 2019 at 10:27 AM Job Snijders  wrote:

> Dear Stephen,
>
> On Tue, Jun 25, 2019 at 07:04:12AM -0700, Stephen Satchell wrote:
> > On 6/25/19 2:25 AM, Katie Holly wrote:
> > > Disclaimer: As much as I dislike Cloudflare (I used to complain
> > > about them a lot on Twitter), this is something I am absolutely
> > > agreeing with them. Verizon failed to do the most basic of network
> > > security, and it will happen again, and again, and again...
> >
> > I used to be a quality control engineer in my career, so I have a
> > question to ask from the perspective of a QC guy:  what is the Best
> > Practice for minimizing, if not totally preventing, this sort of
> > problem?  Is there a "cookbook" answer to this?
> >
> > (I only run edge networks now, and don't have BGP to worry about.  If
> > my current $dayjob goes away -- they all do -- I might have to get
> > back into the BGP game, so this is not an idle query.)
> >
> > Somehow "just be careful and clueful" isn't the right answer.
>
> Here are some resources which maybe can serve as a starting point for
> anyone interested in the problem space:
>
> presentation: Architecting robust routing policies
> pdf:
> https://ripe77.ripe.net/presentations/59-RIPE77_Snijders_Routing_Policy_Architecture.pdf
> video:
> https://ripe77.ripe.net/archive/video/Job_Snijders-B._BGP_Policy_Update-20181017-140440.mp4
>
> presentation: Practical Everyday BGP filtering "Peerlocking"
> pdf: http://instituut.net/~job/NANOG67_NTT_peerlocking_JobSnijders.pdf
> video: https://www.youtube.com/watch?v=CSLpWBrHy10
>
> RFC 8212 ("EBGP default deny") and why we should ask our vendors like
> Cisco IOS, IOS XE, NX-OS, Juniper, Arista, Brocade, etc... to be
> compliant with this RFC:
> slides 2-14:
> http://largebgpcommunities.net/presentations/ITNOG3-Job_Snijders_Recent_BGP_Innovations.pdf
> skip to the rfc8212 part: https://youtu.be/V6Wsq66-f40?t=854
> compliance tracker: http://github.com/bgp/RFC8212
>
> The NLNOG Day in Fall 2018 has a wealth of RPKI related presentations
> and testimonies: https://nlnog.net/nlnog-day-2018/
>
> Finally, there is the NLNOG BGP Filter Guide:
> http://bgpfilterguide.nlnog.net/
> If you spot errors or have suggestions, please submit them via github
> https://github.com/nlnog/bgpfilterguide
>
> Please let me or the group know should you require further information,
> I love talking about this topic ;-)
>
> Kind regards,
>
> Job
>


Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Christopher Morrow
(thanks, btw, again)

On Tue, Jun 25, 2019 at 8:33 AM Patrick W. Gilmore  wrote:
> It is not like 701 is causing problems every week, or even every year. If you 
> think this one incident proves they are ‘morons’, you are only showing you 
> are neither experienced nor mature enough to make that judgement.
>

I would be shocked if 701 is no longer filtering customers by default.
I know they weren't filtering 'peers'.

it seems like the particular case yesterday was a missed customer
prefix-list :( which is sad, but happens.
the japan incident seems to be the other type, I'd guess.

-chris


BGP filtering study resources (Was: CloudFlare issues?)

2019-06-25 Thread Job Snijders
Dear Stephen,

On Tue, Jun 25, 2019 at 07:04:12AM -0700, Stephen Satchell wrote:
> On 6/25/19 2:25 AM, Katie Holly wrote:
> > Disclaimer: As much as I dislike Cloudflare (I used to complain
> > about them a lot on Twitter), this is something I am absolutely
> > agreeing with them. Verizon failed to do the most basic of network
> > security, and it will happen again, and again, and again...
> 
> I used to be a quality control engineer in my career, so I have a
> question to ask from the perspective of a QC guy:  what is the Best
> Practice for minimizing, if not totally preventing, this sort of
> problem?  Is there a "cookbook" answer to this?
> 
> (I only run edge networks now, and don't have BGP to worry about.  If
> my current $dayjob goes away -- they all do -- I might have to get
> back into the BGP game, so this is not an idle query.)
> 
> Somehow "just be careful and clueful" isn't the right answer.

Here are some resources which maybe can serve as a starting point for
anyone interested in the problem space:

presentation: Architecting robust routing policies
pdf: 
https://ripe77.ripe.net/presentations/59-RIPE77_Snijders_Routing_Policy_Architecture.pdf
video: 
https://ripe77.ripe.net/archive/video/Job_Snijders-B._BGP_Policy_Update-20181017-140440.mp4

presentation: Practical Everyday BGP filtering "Peerlocking"
pdf: http://instituut.net/~job/NANOG67_NTT_peerlocking_JobSnijders.pdf
video: https://www.youtube.com/watch?v=CSLpWBrHy10

RFC 8212 ("EBGP default deny") and why we should ask our vendors like
Cisco IOS, IOS XE, NX-OS, Juniper, Arista, Brocade, etc... to be
compliant with this RFC:
slides 2-14: 
http://largebgpcommunities.net/presentations/ITNOG3-Job_Snijders_Recent_BGP_Innovations.pdf
skip to the rfc8212 part: https://youtu.be/V6Wsq66-f40?t=854
compliance tracker: http://github.com/bgp/RFC8212

The NLNOG Day in Fall 2018 has a wealth of RPKI related presentations
and testimonies: https://nlnog.net/nlnog-day-2018/

Finally, there is the NLNOG BGP Filter Guide: http://bgpfilterguide.nlnog.net/
If you spot errors or have suggestions, please submit them via github
https://github.com/nlnog/bgpfilterguide

Please let me or the group know should you require further information,
I love talking about this topic ;-)

Kind regards,

Job


Re: CloudFlare issues?

2019-06-25 Thread Ca By
On Tue, Jun 25, 2019 at 7:06 AM Stephen Satchell  wrote:

> On 6/25/19 2:25 AM, Katie Holly wrote:
> > Disclaimer: As much as I dislike Cloudflare (I used to complain about
> > them a lot on Twitter), this is something I am absolutely agreeing with
> > them. Verizon failed to do the most basic of network security, and it
> > will happen again, and again, and again...
>
> I used to be a quality control engineer in my career, so I have a
> question to ask from the perspective of a QC guy:  what is the Best
> Practice for minimizing, if not totally preventing, this sort of
> problem?  Is there a "cookbook" answer to this?
>
> (I only run edge networks now, and don't have BGP to worry about.  If my
> current $dayjob goes away -- they all do -- I might have to get back
> into the BGP game, so this is not an idle query.)
>
> Somehow "just be careful and clueful" isn't the right answer.


1. Know what to expect — create policy to enforce routes and paths that you
expect, knowing sometimes this may be very broad

2. Enforce what you expect — drop routes and session that do not conform

3.  Use all the internal tools in series as layers of defense —
as-path-list with regex, ip prefix lists, max-routes — they work in series
and all must match. Shoving everything into a route-map is not best,
because what happens when that policy breaks?  Good to have layers.

4. Use irr, rpki, and alarming as external ecosystem tools.

5. Dont run noction or ios, unsafe defaults.

6. When on the phone with your peer, verbally check to make sure they
double check their policy.  Dont assume.
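
A minimal sketch of that "layers in series" idea, in Python rather than a
router's policy language (which is where this actually lives): a route is
accepted only if it clears the prefix-list, the as-path check, and the
per-session maximum-prefix limit. The prefixes, ASNs, regex and limit below
are all hypothetical.

import ipaddress
import re

# layer 1: prefix-list -- the blocks this customer is allowed to announce
ALLOWED_PREFIXES = [ipaddress.ip_network(p)
                    for p in ("192.0.2.0/24", "198.51.100.0/22")]
# layer 2: as-path check -- the customer AS and its registered downstreams only
AS_PATH_RE = re.compile(r"^64500( 6450[1-9])*$")
# layer 3: maximum-prefix limit on the session
MAX_PREFIXES = 50

def prefix_ok(prefix: str) -> bool:
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(allowed) and net.prefixlen <= 24
               for allowed in ALLOWED_PREFIXES)

def as_path_ok(as_path: str) -> bool:
    return AS_PATH_RE.match(as_path) is not None

def session_ok(accepted_so_far: int) -> bool:
    return accepted_so_far < MAX_PREFIXES

def accept(route: dict, accepted_so_far: int) -> bool:
    # every layer must pass; failing any single one rejects the route
    return (prefix_ok(route["prefix"])
            and as_path_ok(route["as_path"])
            and session_ok(accepted_so_far))

print(accept({"prefix": "192.0.2.0/24", "as_path": "64500"}, 10))    # True
print(accept({"prefix": "203.0.113.0/24", "as_path": "64500"}, 10))  # False: not in the prefix-list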





>


Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Mark Tinka



On 25/Jun/19 14:59, Adam Kennedy via NANOG wrote:

>
>
> I believe there probably is a happy medium we can all meet, sort of
> our own ISP DMZ, where we can help one another in the simple mistakes
> or cut each other some slack in those difficult times. I like to think
> NANOG is that place.

Isn't that the point of NOG's, and why we rack so many air miles each
year trying to meet each other and break bread (or something) while
checking the Competition Hats at the door?

Mark.


Re: CloudFlare issues?

2019-06-25 Thread Aftab Siddiqui
Hi Stephen,


> I used to be a quality control engineer in my career, so I have a
> question to ask from the perspective of a QC guy:  what is the Best
> Practice for minimizing, if not totally preventing, this sort of
> problem?  Is there a "cookbook" answer to this?
>

As suggested by Job in the thread above,

- deploy RPKI based BGP Origin validation (with invalid == reject)
- apply maximum prefix limits on all EBGP sessions
- ask your router vendor to comply with RFC 8212 ('default deny')
- turn off your 'BGP optimizers' --> You actually don't need that at
all. I survived without any optimizer.

Also, read RFC7454 and join MANRS :)
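
As a rough sketch of what "invalid == reject" means at the origin-validation
step (not a real validator -- on a router this decision is fed by an RPKI
validator over RTR; the ROAs/VRPs and prefixes below are made up):

import ipaddress

# VRPs as (prefix, max_length, origin_asn) -- hypothetical data
VRPS = [
    (ipaddress.ip_network("203.0.113.0/24"), 24, 64500),
    (ipaddress.ip_network("198.51.100.0/22"), 24, 64501),
]

def origin_validation(prefix: str, origin_asn: int) -> str:
    net = ipaddress.ip_network(prefix)
    covering = [v for v in VRPS if net.subnet_of(v[0])]
    if not covering:
        return "not-found"
    for vrp_prefix, max_length, vrp_asn in covering:
        if origin_asn == vrp_asn and net.prefixlen <= max_length:
            return "valid"
    return "invalid"  # covered by a ROA, but wrong origin or too specific

# with "invalid == reject", the leaked more-specifics get dropped:
print(origin_validation("198.51.100.0/23", 64501))  # valid
print(origin_validation("198.51.100.0/23", 64999))  # invalid -> reject
print(origin_validation("192.0.2.0/24", 64500))     # not-found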

Regards,
Aftab Siddiqui


Re: CloudFlare issues?

2019-06-25 Thread Stephen Satchell
On 6/25/19 2:25 AM, Katie Holly wrote:
> Disclaimer: As much as I dislike Cloudflare (I used to complain about
> them a lot on Twitter), this is something I am absolutely agreeing with
> them. Verizon failed to do the most basic of network security, and it
> will happen again, and again, and again...

I used to be a quality control engineer in my career, so I have a
question to ask from the perspective of a QC guy:  what is the Best
Practice for minimizing, if not totally preventing, this sort of
problem?  Is there a "cookbook" answer to this?

(I only run edge networks now, and don't have BGP to worry about.  If my
current $dayjob goes away -- they all do -- I might have to get back
into the BGP game, so this is not an idle query.)

Somehow "just be careful and clueful" isn't the right answer.


Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Adam Kennedy via NANOG


Now with that out of the way...  The mentality of everyone working together
for a Better Internet (tm) is sort of a mantra of WISPA and WISPs in
general. It is a mantra that has puzzled me and perplexed my own feelings
as a network engineer. Do I want a better overall experience for my users
and customers? Absolutely. Do I strive to make our network the best...
pause... in the world? Definitely. Should I do the same to help a
neighboring ISP, a competitor? This is where I scratch my head. You would
absolutely think that we would all want a better overall Internet. One that
we can depend on in times of need. One that we can be proud of. But we are
driven, unfortunately, by our C-level execs to shun the competition and do
whatever we can to get a leg up on everyone else. While this is good for
the bottom line it is not exactly a healthy mentality to pit everyone
against each other. It causes animosity between providers and we end up
blaming each other for something simple and then claim they are stupid. A
mistake that may be easy to make, a mistake that we have probably made
ourselves a few times, perhaps a mistake we can learn to shrug off.

I believe there probably is a happy medium we can all meet, sort of our own
ISP DMZ, where we can help one another in the simple mistakes or cut each
other some slack in those difficult times. I like to think NANOG is that
place.

--

Adam Kennedy, Network & Systems Engineer

adamkenn...@watchcomm.net

*Watch Communications*

(866) 586-1518






On Tue, Jun 25, 2019 at 8:50 AM Matthew Walster  wrote:

>
>
> On Tue, 25 Jun 2019, 14:31 Patrick W. Gilmore,  wrote:
>
>> I must be old. All I can think is Kids These Days, and maybe Get Off My
>> BGP, er Lawn.
>>
>
> Maybe they ought to [puts on shades] mind their MANRS.
>
> M (scuttling away)
>
>>


Re: Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Matthew Walster
On Tue, 25 Jun 2019, 14:31 Patrick W. Gilmore,  wrote:

> I must be old. All I can think is Kids These Days, and maybe Get Off My
> BGP, er Lawn.
>

Maybe they ought to [puts on shades] mind their MANRS.

M (scuttling away)

>


Are network operators morons? [was: CloudFlare issues?]

2019-06-25 Thread Patrick W. Gilmore
[Removing the attribution, because many people have made statements like this 
over the last day - or year. Just selecting this one as a succinct and recent 
example to illustrate the point.]

>> This blog post, and your CEO on Twitter today, took every opportunity to say 
>> “DAMN THOSE MORONS AT 701!”.
> Damn those morons at 701, period.

I must be old. All I can think is Kids These Days, and maybe Get Off My BGP, er 
Lawn.

Any company running a large, highly complex infrastructure is going to make 
mistakes. Period.

It is not like 701 is causing problems every week, or even every year. If you 
think this one incident proves they are ‘morons’, you are only showing you are 
neither experienced nor mature enough to make that judgement.

To be clear, they may well be morons. I no longer know many people architecting 
and operating 701’s backbone, so I cannot tell you first-hand how smart they 
are. Maybe they are stupid, but exceptionally lucky. However, the facts at hand 
do not support your blanket assertion, and making it does not speak well of you.

OTOH, I do have first-hand experience with previous CF blog posts, and to say 
they spin things in their favor is being generous. But then, it’s a blog post, 
i.e. Marketing. What else would you expect?


I know it is anathema to the ethos of the network engineers & architects to 
work together instead of hurling insults, but it would probably result in a 
better Internet. And isn’t that what we all (supposedly) want?

-- 
TTFN,
patrick



Re: CloudFlare issues?

2019-06-25 Thread Tom Beecher
Verizon Business / Enterprise is the access network, aka 701/2/3.

Verizon Media Group is the CDNs/Media side. Digital Media Services (
Edgecast ) , Yahoo, AOL. 15133 / 10310 / 1668. ( The entity formerly named
Oath, created when Yahoo was acquired. )

On Tue, Jun 25, 2019 at 06:54 Hank Nussbacher  wrote:

> On 25/06/2019 08:17, Christopher Morrow wrote:
> > On Tue, Jun 25, 2019 at 12:49 AM Hank Nussbacher 
> wrote:
> >> On 25/06/2019 03:03, Tom Beecher wrote:
> >>> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do
> >>> not work on 701.  My comments are my own opinions only.
> >>>
> >>> Respectfully, I believe Cloudflare’s public comments today have been a
> >>> real disservice. This blog post, and your CEO on Twitter today, took
> >>> every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
> >>>
> >>>
> >> Perhaps suggest to VZ management to use their blog:
> >> https://www.verizondigitalmedia.com/blog/
> > #coughwrongvz
> >
> > I think anyway - you probably mean:
> > https://enterprise.verizon.com/
> This post is unrelated to Verizon Enterprise?
>
> https://www.verizondigitalmedia.com/blog/2019/06/exponential-global-growth-at-75-tbps/
>
> -Hank
> >
> > GoodLuck! I think it's 3 clicks to: "www22.verizon.com" which gets
> > even moar fun!
> > The NOC used to answer if you called: +1-800-900-0241
> > which is in their whois records...
> >
> >> to contradict what CF blogged about?
> >>
> >> -Hank
> >>
>
>


Re: CloudFlare issues?

2019-06-25 Thread Hank Nussbacher

On 25/06/2019 08:17, Christopher Morrow wrote:

On Tue, Jun 25, 2019 at 12:49 AM Hank Nussbacher  wrote:

On 25/06/2019 03:03, Tom Beecher wrote:

Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do
not work on 701.  My comments are my own opinions only.

Respectfully, I believe Cloudflare’s public comments today have been a
real disservice. This blog post, and your CEO on Twitter today, took
every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.



Perhaps suggest to VZ management to use their blog:
https://www.verizondigitalmedia.com/blog/

#coughwrongvz

I think anyway - you probably mean:
https://enterprise.verizon.com/

This post is unrelated to Verizon Enterprise?
https://www.verizondigitalmedia.com/blog/2019/06/exponential-global-growth-at-75-tbps/

-Hank


GoodLuck! I think it's 3 clicks to: "www22.verizon.com" which gets
even moar fun!
The NOC used to answer if you called: +1-800-900-0241
which is in their whois records...


to contrandict what CF blogged about?

-Hank





Re: CloudFlare issues?

2019-06-25 Thread Rich Kulawiec
On Mon, Jun 24, 2019 at 09:39:13PM -0400, Ross Tajvar wrote:
> A technical one - see below from CF's blog post:
> "It is unfortunate that while we tried both e-mail and phone calls to reach
> out to Verizon, at the time of writing this article (over 8 hours after the
> incident), we have not heard back from them, nor are we aware of them
> taking action to resolve the issue."

Which is why an operation the size of Verizon should be able to manage
the trivial task of monitoring its RFC 2142 role addresses 24x7 with a
response time measured in minutes.   And not just Verizon: every large
operation should be doing the same.  There is no excuse for failure
to implement this rudimentary operational practice.

[ And let me add that a very good way to deal with mail sent to those
addresses is to use procmail to pre-sort based on who it's from.  Every
time a message is received from a new source, a new procmail rule should
be added to classify it appropriately.  Over time, this makes it very
easy to identify traffic from clueful people vs. traffic from idiots,
and thus to readily discern what needs to be triaged first. ]
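
A rough Python equivalent of that pre-sorting idea, for anyone not using
procmail; the addresses, folder names and Maildir path are made up, and a
real setup would refile the messages rather than just print a classification.

import mailbox
import email.utils

KNOWN_CLUEFUL = {"peering@example.net", "noc@example.org"}

def triage(maildir_path: str) -> None:
    # walk the role-address mailbox and classify each message by sender
    inbox = mailbox.Maildir(maildir_path)
    for _key, msg in inbox.iteritems():
        sender = email.utils.parseaddr(msg.get("From", ""))[1].lower()
        folder = "priority" if sender in KNOWN_CLUEFUL else "triage-later"
        print(f"{folder}: {sender} -- {msg.get('Subject', '(no subject)')}")

# triage("/var/mail/role-noc")  # hypothetical Maildir path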

---rsk


Re: CloudFlare issues?

2019-06-25 Thread Katie Holly

> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work 
> on 701.  My comments are my own opinions only.

Disclaimer: As much as I dislike Cloudflare (I used to complain about them a 
lot on Twitter), this is something I am absolutely agreeing with them. Verizon 
failed to do the most basic of network security, and it will happen again, and 
again, and again...


> This blog post, and your CEO on Twitter today, took every opportunity to say 
> “DAMN THOSE MORONS AT 701!”.

Damn those morons at 701, period.


> But do we know they didn’t?

They didn't, otherwise yesterday's LSE could have been prevented.


> Do we know it was there and just setup wrong?

If it virtually exists but does not work, it does not exist.


> Did another change at another time break what was there?

What's not there can not be changed. Keep in mind, another well-known route 
leak happened back in 2017 when Google leaked routes towards Verizon and 
Verizon silently accepted and propagated all of them without filtering. 
Probably nothing has changed since then.


> Shouldn’t we be working on facts?

They have stated the facts.


> to take a harder stance on the BGP optimizer that generated the bogus routes

The BGP optimizer was only the trigger for this event, the actual 
mis-configuration happened between 396531 and 701. IDGAF if 396531 or one of 
their peers uses a BGP optimizer, 701 should have filtered those out, but they 
decided to not do that instead.


> You’re right to use this as a lever to push for proper filtering, RPKI, best 
> practices.

Yes, and 701 should follow those "best practices".

Point being, I have been doing network stuff for around 10 years and started doing BGP and 
internet routing related stuff only around three years ago and even _I_ can 
follow best practices. And if I have the knowledge about those things and can 
follow best practices, Verizon SURELY has enough resources to do so as well!

On 6/25/19 2:03 AM, Tom Beecher wrote:

Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work 
on 701.  My comments are my own opinions only.

Respectfully, I believe Cloudflare’s public comments today have been a real 
disservice. This blog post, and your CEO on Twitter today, took every 
opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.

You are 100% right that 701 should have had some sort of protection mechanism 
in place to prevent this. But do we know they didn’t? Do we know it was there 
and just setup wrong? Did another change at another time break what was there? 
I used 701 many  jobs ago and they absolutely had filtering in place; it saved 
my bacon when I screwed up once and started readvertising a full table from a 
2nd provider. They smacked my session down an I got a nice call about it.

You guys have repeatedly accused them of being dumb without even speaking to 
anyone yet from the sounds of it. Shouldn’t we be working on facts?

Should they have been easier to reach once an issue was detected? Probably. 
They’re certainly not the first vendor to have a slow response time though. 
Seems like when an APAC carrier takes 18 hours to get back to us, we write it 
off as the cost of doing business.

It also would have been nice, in my opinion, to take a harder stance on the BGP 
optimizer that generated he bogus routes, and the steel company that failed BGP 
101 and just gladly reannounced one upstream to another. 701 is culpable for 
their mistakes, but there doesn’t seem like there is much appetite to shame the 
other contributors.

You’re right to use this as a lever to push for proper filtering , RPKI, best 
practices. I’m 100% behind that. We can all be a hell of a lot better at what 
we do. This stuff happens more than it should, but less than it could.

But this industry is one big ass glass house. What’s that thing about stones 
again?

On Mon, Jun 24, 2019 at 18:06 Justin Paine via NANOG  wrote:

FYI for the group -- we just published this: 
https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/


_
*Justin Paine*
Director of Trust & Safety
PGP: BBAA 6BCE 3305 7FD6 6452 7115 57B6 0114 DE0B 314D
101 Townsend St., San Francisco, CA 94107 





On Mon, Jun 24, 2019 at 2:25 PM Mark Tinka  wrote:



On 24/Jun/19 18:09, Pavel Lunin wrote:

 >
 > Hehe, I haven't seen this text before. Can't agree more.
 >
 > Get your tie back on Job, nobody listened again.
 >
 > More seriously, I see no difference between prefix hijacking and the
 > so called bgp optimisation based on completely fake announces on
 > behalf of other people.
 >
 > If ever your upstream or any other party who your company pays money
 > to does this dirty thing, now it's just 

Re: CloudFlare issues?

2019-06-24 Thread Christopher Morrow
On Tue, Jun 25, 2019 at 12:49 AM Hank Nussbacher  wrote:
>
> On 25/06/2019 03:03, Tom Beecher wrote:
> > Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do
> > not work on 701.  My comments are my own opinions only.
> >
> > Respectfully, I believe Cloudflare’s public comments today have been a
> > real disservice. This blog post, and your CEO on Twitter today, took
> > every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
> >
> >
> Perhaps suggest to VZ management to use their blog:
> https://www.verizondigitalmedia.com/blog/

#coughwrongvz

I think anyway - you probably mean:
https://enterprise.verizon.com/

GoodLuck! I think it's 3 clicks to: "www22.verizon.com" which gets
even moar fun!
The NOC used to answer if you called: +1-800-900-0241
which is in their whois records...

> to contradict what CF blogged about?
>
> -Hank
>


Re: CloudFlare issues?

2019-06-24 Thread Hank Nussbacher

On 25/06/2019 03:03, Tom Beecher wrote:
Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do 
not work on 701.  My comments are my own opinions only.


Respectfully, I believe Cloudflare’s public comments today have been a 
real disservice. This blog post, and your CEO on Twitter today, took 
every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.




Perhaps suggest to VZ management to use their blog:
https://www.verizondigitalmedia.com/blog/
to contradict what CF blogged about?

-Hank



Re: CloudFlare issues?

2019-06-24 Thread Jared Mauch



> On Jun 24, 2019, at 9:39 PM, Ross Tajvar  wrote:
> 
> 
> On Mon, Jun 24, 2019 at 9:01 PM Jared Mauch  wrote:
> >
> > > On Jun 24, 2019, at 8:50 PM, Ross Tajvar  wrote:
> > >
> > > Maybe I'm in the minority here, but I have higher standards for a T1 than 
> > > any of the other players involved. Clearly several entities failed to do 
> > > what they should have done, but Verizon is not a small or inexperienced 
> > > operation. Taking 8+ hours to respond to a critical operational problem 
> > > is what stood out to me as unacceptable.
> >
> > Are you talking about a press response or a technical one?  The impacts I 
> > saw were for around 2h or so based on monitoring I’ve had up since 2007.  
> > Not great but far from the worst as Tom mentioned.  I’ve seen people cease 
> > to announce IP space we reclaimed from them for months (or years) because 
> > of stale config.  I’ve also seen routes come back from the dead because 
> > they were pinned to an interface that was down for 2 years but never fully 
> > cleaned up.  (Then the telco looped the circuit, interface came up, route 
> > in table, announced globally — bad day all around).
> >
> 
> A technical one - see below from CF's blog post:
> "It is unfortunate that while we tried both e-mail and phone calls to reach 
> out to Verizon, at the time of writing this article (over 8 hours after the 
> incident), we have not heard back from them, nor are we aware of them taking 
> action to resolve the issue.”

I don’t know if CF is a customer (or not) of VZ, but it’s likely easy enough to 
find with a looking glass somewhere, but they were perhaps a few of the 20k 
prefixes impacted (as reported by others).

We have heard from them and not a lot of the other people, but most of them 
likely don’t do business with VZ directly.  I’m not sure VZ is going to contact 
them all or has the capability to respond to them all (or respond to 
non-customers except via a press release).

> > > And really - does it matter if the protection *was* there but something 
> > > broke it? I don't think it does. Ultimately, Verizon failed implement 
> > > correct protections on their network. And then failed to respond when it 
> > > became a problem.
> >
> > I think it does matter.  As I said in my other reply, people do things like 
> > drop ACLs to debug.  Perhaps that’s unsafe, but it is something you do to 
> > debug.  Not knowing what happened, I dunno.  It is also 2019 so I hold 
> > networks to a higher standard than I did in 2009 or 1999.
> >
> 
> Dropping an ACL is fine, but then you have to clean it up when you're done. 
> Your customers don't care that you almost didn't have an outage because you 
> almost did your job right. Yeah, there's a difference between not following 
> policy and not having a policy, but neither one is acceptable behavior from a 
> T1 imo. If it's that easy to cause an outage by not following policy, then I 
> argue that the policy should be better, or something should be better - 
> monitoring, automation, sanity checks. etc. There are lots of ways to solve 
> that problem. And in 2019 I really think there's no excuse for a T1 not to be 
> doing that kind of thing.

I don’t know about the outage (other than what I observed).  I offered some 
suggestions for people to help prevent it from happening, so I’ll leave it 
there.  We all make mistakes, I’ve been part of many and I’m sure that list 
isn’t yet complete.

- Jared

Re: CloudFlare issues?

2019-06-24 Thread Ross Tajvar
On Mon, Jun 24, 2019 at 9:01 PM Jared Mauch  wrote:
>
> > On Jun 24, 2019, at 8:50 PM, Ross Tajvar  wrote:
> >
> > Maybe I'm in the minority here, but I have higher standards for a T1
than any of the other players involved. Clearly several entities failed to
do what they should have done, but Verizon is not a small or inexperienced
operation. Taking 8+ hours to respond to a critical operational problem is
what stood out to me as unacceptable.
>
> Are you talking about a press response or a technical one?  The impacts I
saw were for around 2h or so based on monitoring I’ve had up since 2007.
Not great but far from the worst as Tom mentioned.  I’ve seen people cease
to announce IP space we reclaimed from them for months (or years) because
of stale config.  I’ve also seen routes come back from the dead because
they were pinned to an interface that was down for 2 years but never fully
cleaned up.  (Then the telco looped the circuit, interface came up, route
in table, announced globally — bad day all around).
>

A technical one - see below from CF's blog post:
"It is unfortunate that while we tried both e-mail and phone calls to reach
out to Verizon, at the time of writing this article (over 8 hours after the
incident), we have not heard back from them, nor are we aware of them
taking action to resolve the issue."

> > And really - does it matter if the protection *was* there but something
broke it? I don't think it does. Ultimately, Verizon failed implement
correct protections on their network. And then failed to respond when it
became a problem.
>
> I think it does matter.  As I said in my other reply, people do things
like drop ACLs to debug.  Perhaps that’s unsafe, but it is something you do
to debug.  Not knowing what happened, I dunno.  It is also 2019 so I hold
networks to a higher standard than I did in 2009 or 1999.
>

Dropping an ACL is fine, but then you have to clean it up when you're done.
Your customers don't care that you *almost* didn't have an outage because
you *almost* did your job right. Yeah, there's a difference between not
following policy and not having a policy, but neither one is acceptable
behavior from a T1 imo. If it's that easy to cause an outage by not
following policy, then I argue that the policy should be better, or *something
*should be better - monitoring, automation, sanity checks. etc. There are
lots of ways to solve that problem. And in 2019 I really think there's no
excuse for a T1 not to be doing that kind of thing.

> - Jared


Re: CloudFlare issues?

2019-06-24 Thread Jared Mauch



> On Jun 24, 2019, at 8:50 PM, Ross Tajvar  wrote:
> 
> Maybe I'm in the minority here, but I have higher standards for a T1 than any 
> of the other players involved. Clearly several entities failed to do what 
> they should have done, but Verizon is not a small or inexperienced operation. 
> Taking 8+ hours to respond to a critical operational problem is what stood 
> out to me as unacceptable.

Are you talking about a press response or a technical one?  The impacts I saw 
were for around 2h or so based on monitoring I’ve had up since 2007.  Not great 
but far from the worst as Tom mentioned.  I’ve seen people cease to announce IP 
space we reclaimed from them for months (or years) because of stale config.  
I’ve also seen routes come back from the dead because they were pinned to an 
interface that was down for 2 years but never fully cleaned up.  (Then the 
telco looped the circuit, interface came up, route in table, announced globally 
— bad day all around).

> And really - does it matter if the protection *was* there but something broke 
> it? I don't think it does. Ultimately, Verizon failed implement correct 
> protections on their network. And then failed to respond when it became a 
> problem.

I think it does matter.  As I said in my other reply, people do things like 
drop ACLs to debug.  Perhaps that’s unsafe, but it is something you do to 
debug.  Not knowing what happened, I dunno.  It is also 2019 so I hold networks 
to a higher standard than I did in 2009 or 1999.

- Jared

Re: CloudFlare issues?

2019-06-24 Thread Jared Mauch



> On Jun 24, 2019, at 8:03 PM, Tom Beecher  wrote:
> 
> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work 
> on 701.  My comments are my own opinions only. 
> 
> Respectfully, I believe Cloudflare’s public comments today have been a real 
> disservice. This blog post, and your CEO on Twitter today, took every 
> opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not. 

I presume that seeing a CF blog post isn’t regular for you. :-). — please read 
on

> You are 100% right that 701 should have had some sort of protection mechanism 
> in place to prevent this. But do we know they didn’t? Do we know it was there 
> and just setup wrong? Did another change at another time break what was 
> there? I used 701 many  jobs ago and they absolutely had filtering in place; 
> it saved my bacon when I screwed up once and started readvertising a full 
> table from a 2nd provider. They smacked my session down an I got a nice call 
> about it. 
> 
> You guys have repeatedly accused them of being dumb without even speaking to 
> anyone yet from the sounds of it. Shouldn’t we be working on facts? 
> 
> Should they have been easier to reach once an issue was detected? Probably. 
> They’re certainly not the first vendor to have a slow response time though. 
> Seems like when an APAC carrier takes 18 hours to get back to us, we write it 
> off as the cost of doing business. 
> 
> It also would have been nice, in my opinion, to take a harder stance on the 
> BGP optimizer that generated he bogus routes, and the steel company that 
> failed BGP 101 and just gladly reannounced one upstream to another. 701 is 
> culpable for their mistakes, but there doesn’t seem like there is much 
> appetite to shame the other contributors. 
> 
> You’re right to use this as a lever to push for proper filtering , RPKI, best 
> practices. I’m 100% behind that. We can all be a hell of a lot better at what 
> we do. This stuff happens more than it should, but less than it could. 
> 
> But this industry is one big ass glass house. What’s that thing about stones 
> again? 

I’m careful to not talk about the people impacted.  There were a lot of people 
impacted, roughly 3-4% of the IP space was impacted today and I personally 
heard from more providers than can be counted on a single hand about their 
impact.

Not everyone is going to write about their business impact in public.  I’m not 
authorized to speak for my employer about any impacts that we may have had (for 
example) but if there was impact to 3-4% of IP space, statistically speaking 
there’s always a chance someone was impacted.

I do agree about the glass house thing.  There’s a lot of blame to go around, 
and today I’ve been quoting “go read _normal accidents_” to people.  It’s 
because sufficiently complex systems tend to have complex failures where 
numerous safety systems or controls were bypassed.  Those of us with more than 
a few days of experience likely know what some of them are, we also don’t know 
if those safety systems were disabled as part of debugging by one or more 
parties.  Who hasn’t dropped an ACL to debug why it isn’t working, or if that 
fixed the problem?

I don’t know what happened, but I sure know the symptoms and sets of fixes that 
the industry should apply and enforce.  I have been communicating some of them 
in public and many of them in private today, including offering help to other 
operators with how to implement some of the fixes.

It’s a bad day when someone changes your /16 to two /17’s and sends them out 
regardless of if the packets flow through or not.  These things aren’t new, nor 
do I expect things to be significantly better tomorrow either.  I know people 
at VZ and suspect once they woke up they did something about it.  I also know 
how hard it is to contact someone you don’t have a business relationship with.  
A number of the larger providers have no way for a non-customer to phone, 
message or open a ticket online about problems they may have.  Who knows, their 
ticket system may be in the cloud and was also impacted.

What I do know is that if 3-4% of the home/structures were flooded or 
temporarily unusable because of some form of disaster or evacuation, people 
would be proposing better engineering methods or inspection techniques for 
these structures.

If you are a small network and just point default, there is nothing for you to 
see here and nothing that you can do.  If you speak BGP with your upstream, you 
can filter out some of the bad routes.  You perhaps know that 1239, 3356 and 
others should only be seen directly from a network like 701 and can apply 
filters of this sort to prevent from accepting those more specifics.  I don’t 
believe it’s just 174 that the routes went to, but they were one of the 
networks aside from 701 where I saw paths from today.
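
A toy sketch of that kind of filter ("peerlock"-style): reject a route whose
AS_PATH contains a large transit network unless it arrived on a session where
you expect to see that network. The session names and the expected-ASN sets
below are hypothetical; in practice this lives in the router's policy
language, usually generated by a script.

# the big networks we never expect to learn through a small peer or customer
TIER1_ASNS = {701, 1239, 3356}

# which of those ASNs may legitimately appear behind each EBGP session
EXPECTED_BEHIND = {
    "transit-AS701": {701, 1239, 3356},  # our transit may carry the others
    "peer-AS64500": set(),               # a leaf peer should never send Tier-1 paths
}

def peerlock_ok(session: str, as_path: list) -> bool:
    seen_tier1 = TIER1_ASNS.intersection(as_path)
    return seen_tier1 <= EXPECTED_BEHIND.get(session, set())

print(peerlock_ok("transit-AS701", [701, 3356, 64511]))  # True
print(peerlock_ok("peer-AS64500", [64500, 701, 64511]))  # False -> reject the leaked path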

(Now the part where you as a 3rd party to this event can help!)

If you peer, build some pre-flight and post-flight scripts to check how 
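
In that spirit, a minimal sketch of a pre-flight/post-flight comparison,
assuming you already have a way to dump accepted-prefix counts per BGP
session into a JSON file before and after a change; the file format and the
20% alert threshold are made up for illustration.

import json
import sys

ALERT_THRESHOLD = 0.20  # flag any session whose accepted-prefix count moved by more than 20%

def load_counts(path: str) -> dict:
    # expected format: {"peer-name-or-ip": accepted_prefix_count, ...}
    with open(path) as f:
        return json.load(f)

def compare(pre: dict, post: dict) -> list:
    findings = []
    for peer in sorted(set(pre) | set(post)):
        before, after = pre.get(peer, 0), post.get(peer, 0)
        if before == 0 and after == 0:
            continue
        delta = abs(after - before) / max(before, 1)
        if delta > ALERT_THRESHOLD:
            findings.append(f"{peer}: {before} -> {after} accepted prefixes")
    return findings

if __name__ == "__main__":
    issues = compare(load_counts(sys.argv[1]), load_counts(sys.argv[2]))
    for line in issues:
        print("CHECK:", line)
    sys.exit(1 if issues else 0)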

Re: CloudFlare issues?

2019-06-24 Thread Ross Tajvar
Maybe I'm in the minority here, but I have higher standards for a T1 than
any of the other players involved. Clearly several entities failed to do
what they should have done, but Verizon is not a small or inexperienced
operation. Taking 8+ hours to respond to a critical operational problem is
what stood out to me as unacceptable.

And really - does it matter if the protection *was* there but something
broke it? I don't think it does. Ultimately, Verizon failed to implement
correct protections on their network. And then failed to respond when it
became a problem.

On Mon, Jun 24, 2019, 8:06 PM Tom Beecher  wrote:

> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not
> work on 701.  My comments are my own opinions only.
>
> Respectfully, I believe Cloudflare’s public comments today have been a
> real disservice. This blog post, and your CEO on Twitter today, took every
> opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
>
> You are 100% right that 701 should have had some sort of protection
> mechanism in place to prevent this. But do we know they didn’t? Do we know
> it was there and just setup wrong? Did another change at another time break
> what was there? I used 701 many  jobs ago and they absolutely had filtering
> in place; it saved my bacon when I screwed up once and started
> readvertising a full table from a 2nd provider. They smacked my session
> down an I got a nice call about it.
>
> You guys have repeatedly accused them of being dumb without even speaking
> to anyone yet from the sounds of it. Shouldn’t we be working on facts?
>
> Should they have been easier to reach once an issue was detected?
> Probably. They’re certainly not the first vendor to have a slow response
> time though. Seems like when an APAC carrier takes 18 hours to get back to
> us, we write it off as the cost of doing business.
>
> It also would have been nice, in my opinion, to take a harder stance on
> the BGP optimizer that generated he bogus routes, and the steel company
> that failed BGP 101 and just gladly reannounced one upstream to another.
> 701 is culpable for their mistakes, but there doesn’t seem like there is
> much appetite to shame the other contributors.
>
> You’re right to use this as a lever to push for proper filtering , RPKI,
> best practices. I’m 100% behind that. We can all be a hell of a lot better
> at what we do. This stuff happens more than it should, but less than it
> could.
>
> But this industry is one big ass glass house. What’s that thing about
> stones again?
>
> On Mon, Jun 24, 2019 at 18:06 Justin Paine via NANOG 
> wrote:
>
>> FYI for the group -- we just published this:
>> https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/
>>
>>
>> _
>> *Justin Paine*
>> Director of Trust & Safety
>> PGP: BBAA 6BCE 3305 7FD6 6452 7115 57B6 0114 DE0B 314D
>> 101 Townsend St., San Francisco, CA 94107
>> 
>>
>>
>>
>> On Mon, Jun 24, 2019 at 2:25 PM Mark Tinka  wrote:
>>
>>>
>>>
>>> On 24/Jun/19 18:09, Pavel Lunin wrote:
>>>
>>> >
>>> > Hehe, I haven't seen this text before. Can't agree more.
>>> >
>>> > Get your tie back on Job, nobody listened again.
>>> >
>>> > More seriously, I see no difference between prefix hijacking and the
>>> > so called bgp optimisation based on completely fake announces on
>>> > behalf of other people.
>>> >
>>> > If ever your upstream or any other party who your company pays money
>>> > to does this dirty thing, now it's just the right moment to go explain
>>> > them that you consider this dangerous for your business and are
>>> > looking for better partners among those who know how to run internet
>>> > without breaking it.
>>>
>>> We struggled with a number of networks using these over eBGP sessions
>>> they had with networks that shared their routing data with BGPmon. It
>>> sent off all sorts of alarms, and troubleshooting it was hard when a
>>> network thinks you are de-aggregating massively, and yet you know you
>>> aren't.
>>>
>>> Each case took nearly 3 weeks to figure out.
>>>
>>> BGP optimizers are the bane of my existence.
>>>
>>> Mark.
>>>
>>>


Re: CloudFlare issues?

2019-06-24 Thread James Jun
On Mon, Jun 24, 2019 at 08:03:26PM -0400, Tom Beecher wrote:
> 
> You are 100% right that 701 should have had some sort of protection
> mechanism in place to prevent this. But do we know they didn't? Do we know
> it was there and just setup wrong? Did another change at another time break
> what was there? I used 701 many  jobs ago and they absolutely had filtering
> in place; it saved my bacon when I screwed up once and started
> readvertising a full table from a 2nd provider. They smacked my session
> down and I got a nice call about it.

In my past (and current) dealings with AS701, I do agree that they have 
generally
been good about filtering customer sessions and running a tight ship.  But, 
manual
config changes being what they are, I suppose an honest mistake or oversight 
issue
had occurred at 701 today that made them contribute significantly to today's 
outage.


> 
> It also would have been nice, in my opinion, to take a harder stance on the
> BGP optimizer that generated the bogus routes, and the steel company that
> failed BGP 101 and just gladly reannounced one upstream to another. 701 is
> culpable for their mistakes, but there doesn't seem like there is much
> appetite to shame the other contributors.

I think the biggest question to be asked here -- why the hell is a BGP optimizer
(Noction in this case) injecting fake more specifics to steer traffic?  And why 
did a
regional provider providing IP transit (DQE), use such a dangerous 
accident-waiting-to-
happen tool in their network, especially when they have other ASNs taking 
transit
feeds from them, with all these fake man-in-the-middle routes being injected?

I get that BGP optimizers can have some use cases, but IMO, in most of the 
situations,
(especially if you are a network provider selling transit and taking peering 
yourself)
a well crafted routing policy and interconnection strategy eliminates the need 
for 
implementing flawed route selection optimizers in your network.

The notion of BGP Optimizer generating fake more specifics is absurd, and is 
definitely
not a tool that is designed to "fail -> safe".  Instead of failing safe, it has 
failed
epically and catastrophically today.  I remember long time ago, when Internap 
used
to sell their FCP product, Internap SE were advising the customer to make 
appropriate
adjustments to local-preference to prefer the FCP generated routes to ensure 
optimal
selection.  That is a much more sane design choice, than injecting 
man-in-the-middle
attacks and relying on customers to prevent a disaster.
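
To see why the injected more-specifics win no matter what local-preference
says: forwarding is longest-prefix match first, and BGP attributes only break
ties between routes for the *same* prefix. A small sketch with made-up
prefixes:

import ipaddress

FIB = {
    ipaddress.ip_network("203.0.113.0/24"): "injected more-specific (via the leak)",
    ipaddress.ip_network("203.0.112.0/20"): "legitimate aggregate (direct peering)",
}

def lookup(destination: str) -> str:
    addr = ipaddress.ip_address(destination)
    matches = [net for net in FIB if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # longest match wins
    return FIB[best]

print(lookup("203.0.113.10"))  # -> injected more-specific (via the leak)
print(lookup("203.0.112.10"))  # -> legitimate aggregate (direct peering)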

Any time I have a sit down with any engineer who "outsources" responsibility of 
maintaining robustness principle onto their customer, it makes me want to puke.

James


Re: CloudFlare issues?

2019-06-24 Thread Scott Weeks


--- beec...@beecher.cc wrote:
From: Tom Beecher 

:: Shouldn’t we be working on facts?

Nah, this is NANOG...  >;-)



:: But this industry is one big ass glass house. What’s that 
:: thing about stones again?

We all have broken windows?


:)
scott

Re: CloudFlare issues?

2019-06-24 Thread Tom Beecher
Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not
work on 701.  My comments are my own opinions only.

Respectfully, I believe Cloudflare’s public comments today have been a real
disservice. This blog post, and your CEO on Twitter today, took every
opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.

You are 100% right that 701 should have had some sort of protection
mechanism in place to prevent this. But do we know they didn’t? Do we know
it was there and just setup wrong? Did another change at another time break
what was there? I used 701 many  jobs ago and they absolutely had filtering
in place; it saved my bacon when I screwed up once and started
readvertising a full table from a 2nd provider. They smacked my session
down an I got a nice call about it.

You guys have repeatedly accused them of being dumb without even speaking
to anyone yet from the sounds of it. Shouldn’t we be working on facts?

Should they have been easier to reach once an issue was detected? Probably.
They’re certainly not the first vendor to have a slow response time though.
Seems like when an APAC carrier takes 18 hours to get back to us, we write
it off as the cost of doing business.

It also would have been nice, in my opinion, to take a harder stance on the
BGP optimizer that generated the bogus routes, and the steel company that
failed BGP 101 and just gladly reannounced one upstream to another. 701 is
culpable for their mistakes, but there doesn’t seem like there is much
appetite to shame the other contributors.

You’re right to use this as a lever to push for proper filtering , RPKI,
best practices. I’m 100% behind that. We can all be a hell of a lot better
at what we do. This stuff happens more than it should, but less than it
could.

But this industry is one big ass glass house. What’s that thing about
stones again?

On Mon, Jun 24, 2019 at 18:06 Justin Paine via NANOG 
wrote:

> FYI for the group -- we just published this:
> https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/
>
>
> _
> *Justin Paine*
> Director of Trust & Safety
> PGP: BBAA 6BCE 3305 7FD6 6452 7115 57B6 0114 DE0B 314D
> 101 Townsend St., San Francisco, CA 94107
> 
>
>
>
> On Mon, Jun 24, 2019 at 2:25 PM Mark Tinka  wrote:
>
>>
>>
>> On 24/Jun/19 18:09, Pavel Lunin wrote:
>>
>> >
>> > Hehe, I haven't seen this text before. Can't agree more.
>> >
>> > Get your tie back on Job, nobody listened again.
>> >
>> > More seriously, I see no difference between prefix hijacking and the
>> > so called bgp optimisation based on completely fake announces on
>> > behalf of other people.
>> >
>> > If ever your upstream or any other party who your company pays money
>> > to does this dirty thing, now it's just the right moment to go explain
>> > them that you consider this dangerous for your business and are
>> > looking for better partners among those who know how to run internet
>> > without breaking it.
>>
>> We struggled with a number of networks using these over eBGP sessions
>> they had with networks that shared their routing data with BGPmon. It
>> sent off all sorts of alarms, and troubleshooting it was hard when a
>> network thinks you are de-aggregating massively, and yet you know you
>> aren't.
>>
>> Each case took nearly 3 weeks to figure out.
>>
>> BGP optimizers are the bane of my existence.
>>
>> Mark.
>>
>>


Re: CloudFlare issues?

2019-06-24 Thread Justin Paine via NANOG
FYI for the group -- we just published this:
https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/


_
*Justin Paine*
Director of Trust & Safety
PGP: BBAA 6BCE 3305 7FD6 6452 7115 57B6 0114 DE0B 314D
101 Townsend St., San Francisco, CA 94107



On Mon, Jun 24, 2019 at 2:25 PM Mark Tinka  wrote:

>
>
> On 24/Jun/19 18:09, Pavel Lunin wrote:
>
> >
> > Hehe, I haven't seen this text before. Can't agree more.
> >
> > Get your tie back on Job, nobody listened again.
> >
> > More seriously, I see no difference between prefix hijacking and the
> > so called bgp optimisation based on completely fake announces on
> > behalf of other people.
> >
> > If ever your upstream or any other party who your company pays money
> > to does this dirty thing, now it's just the right moment to go explain
> > them that you consider this dangerous for your business and are
> > looking for better partners among those who know how to run internet
> > without breaking it.
>
> We struggled with a number of networks using these over eBGP sessions
> they had with networks that shared their routing data with BGPmon. It
> sent off all sorts of alarms, and troubleshooting it was hard when a
> network thinks you are de-aggregating massively, and yet you know you
> aren't.
>
> Each case took nearly 3 weeks to figure out.
>
> BGP optimizers are the bane of my existence.
>
> Mark.
>
>


Re: CloudFlare issues?

2019-06-24 Thread Mark Tinka



On 24/Jun/19 18:09, Pavel Lunin wrote:

>
> Hehe, I haven't seen this text before. Can't agree more.
>
> Get your tie back on Job, nobody listened again.
>
> More seriously, I see no difference between prefix hijacking and the
> so called bgp optimisation based on completely fake announces on
> behalf of other people.
>
> If ever your upstream or any other party who your company pays money
> to does this dirty thing, now it's just the right moment to go explain
> them that you consider this dangerous for your business and are
> looking for better partners among those who know how to run internet
> without breaking it.

We struggled with a number of networks using these over eBGP sessions
they had with networks that shared their routing data with BGPmon. It
sent off all sorts of alarms, and troubleshooting it was hard when a
network thinks you are de-aggregating massively, and yet you know you
aren't.

Each case took nearly 3 weeks to figure out.

BGP optimizers are the bane of my existence.

Mark.



Re: CloudFlare issues?

2019-06-24 Thread Fredrik Korsbäck
On 2019-06-24 20:16, Mark Tinka wrote:
> 
> 
> On 24/Jun/19 16:11, Job Snijders wrote:
> 
>>
>> - deploy RPKI based BGP Origin validation (with invalid == reject)
>> - apply maximum prefix limits on all EBGP sessions
>> - ask your router vendor to comply with RFC 8212 ('default deny')
>> - turn off your 'BGP optimizers'
> 
> I cannot over-emphasize the above, especially the BGP optimizers.
> 
> Mark.
> 

+1

https://honestnetworker.net/2019/06/24/leaking-your-optimized-routes-to-stub-networks-that-then-leak-it-to-a-tier1-transit-that-doesnt-filter/



-- 
hugge



Re: CloudFlare issues?

2019-06-24 Thread Mark Tinka



On 24/Jun/19 16:11, Job Snijders wrote:

>
> - deploy RPKI based BGP Origin validation (with invalid == reject)
> - apply maximum prefix limits on all EBGP sessions
> - ask your router vendor to comply with RFC 8212 ('default deny')
> - turn off your 'BGP optimizers'

I cannot over-emphasize the above, especially the BGP optimizers.

Mark.


Re: CloudFlare issues?

2019-06-24 Thread Jaden Roberts
From https://www.cloudflarestatus.com/:

Identified - We have identified a possible route leak impacting some Cloudflare 
IP ranges and are working with the network involved to resolve this.
Jun 24, 11:36 UTC

Seeing issues in Australia too for some sites that are routing through 
Cloudflare.


Jaden Roberts
Senior Network Engineer
4 Amy Close, Wyong, NSW 2259
Need assistance? We are here 24/7 +61 2 8115 
On June 24, 2019, 9:06 PM GMT+10, daknob@gmail.com wrote:

Yes, traffic from Greek networks is routed through NYC 
(alter.net), and previously it had a 60% packet loss. Now 
it’s still via NYC, but no packet loss. This happens in GR-IX Athens, not GR-IX 
Thessaloniki, but the problem definitely exists.

Antonis

On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:

Hello are there any issues with CloudFlare services now?

Dmitry Sherman
dmi...@interhost.net
Interhost Networks Ltd
Web: http://www.interhost.co.il
fb: https://www.facebook.com/InterhostIL
Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157




Re: CloudFlare issues?

2019-06-24 Thread Pavel Lunin
> I'd like to point everyone to an op-ed I wrote on the topic of "BGP
> optimizers": https://seclists.org/nanog/2017/Aug/318

Hehe, I haven't seen this text before. Can't agree more.

Get your tie back on, Job; nobody listened again.

More seriously, I see no difference between prefix hijacking and the so-called
BGP optimisation based on completely fake announcements on behalf of other people.

If your upstream, or any other party your company pays money to, does this
dirty thing, now is just the right moment to go explain to them that you
consider this dangerous for your business and are looking for better partners
among those who know how to run the Internet without breaking it.

Re: CloudFlare issues?

2019-06-24 Thread Max Tulyev

On 24.06.19 19:04, Matthew Walster wrote:



> On Mon, 24 Jun 2019, 16:28 Max Tulyev wrote:
>
>> 1. Why did Cloudflare not immediately announce all their address space
>> by /24s? This can put the service up instantly for almost all places.
>
> Probably RPKI, and that being a really bad idea that takes a long time to
> configure across every device, especially when you're dealing with an
> anycast network.

A good idea is to prepare it, and the provisioning tools, beforehand ;)

>> 2. Why did almost all carriers not filter the leak on their side, but
>> instead wait for "a better weather on Mars" for several hours?
>
> Probably most did not notice immediately, or trusted their fellow large
> carrier peers to fix the matter faster than their own change control
> process would accept such a drastic change that had not been fully
> analysed and identified. The duration was actually quite low, on a human
> scale...


They did not notice a lot of "I can't access ..." calls? Really?
OK, then another question: how much time from when those calls start until
"people who know BGP know about it" is acceptable?
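A small sketch of the /24 trade-off in the exchange above (the aggregate
prefix and the ROA maxLength below are invented, not Cloudflare's real
numbers): deaggregating a covering prefix into /24s is trivial to script, but
under a ROA whose maxLength is stricter than /24, every one of those
announcements would be RPKI-invalid and dropped by anyone rejecting invalids.

# A small sketch of the /24 deaggregation trade-off discussed above; the
# aggregate prefix and the ROA maxLength are made-up examples.
import ipaddress

aggregate = ipaddress.ip_network("10.32.0.0/13")   # hypothetical covering prefix
roa_max_length = 20                                # hypothetical ROA maxLength

more_specifics = list(aggregate.subnets(new_prefix=24))
print(len(more_specifics))                         # 2048 extra announcements
# Every /24 exceeds the ROA maxLength, so strict validators would reject them.
print(all(net.prefixlen > roa_max_length for net in more_specifics))   # True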


Re: CloudFlare issues?

2019-06-24 Thread Christopher Morrow
On Mon, Jun 24, 2019 at 10:41 AM Filip Hruska  wrote:
>
> Verizon is the one who should've noticed something was amiss and dropped
> their customer's BGP session.
> They also should have had filters and prefix count limits in place,
> which would have prevented this whole disaster.
>

Oddly, VZ used to be quite good about filtering customer sessions :(
There ARE cases where "customer says they may announce X" and that
doesn't happen along the expected path :( For instance, they end up
announcing a path through their other transit to a prefix in the
permitted list on the VZ side :(  It doesn't seem plausible that that
is what was happening here though; I don't expect the Duquesne folk to
have customer paths to (for instance) Savi Moebel in Germany...

there are some pretty fun as-paths in the set of ~25k prefixes leaked
(that routeviews saw).


Re: CloudFlare issues?

2019-06-24 Thread Filip Hruska
Verizon is the one who should've noticed something was amiss and dropped 
their customer's BGP session.
They also should have had filters and prefix count limits in place, 
which would have prevented this whole disaster.


As to why any of that didn't happen, who actually knows.

Regards,
Filip

On 6/24/19 4:28 PM, Max Tulyev wrote:
Why did almost all carriers not filter the leak on their side, but
instead wait for "a better weather on Mars" for several hours?


--
Filip Hruska
Linux System Administrator
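A toy sketch of the prefix count limit Filip refers to (the limit and the
prefixes below are invented; real implementations live inside the BGP speaker
itself): once a session carries more prefixes than agreed, tear it down rather
than absorb a full-table leak.

# Toy illustration of a maximum-prefix limit; the limit and prefixes are invented.
MAX_PREFIXES = 50                      # hypothetical limit for this customer session

def accept_announcement(session_prefixes, prefix, limit=MAX_PREFIXES):
    """Track announced prefixes; signal a session teardown when over the limit."""
    session_prefixes.add(prefix)
    if len(session_prefixes) > limit:
        print(f"max-prefix exceeded ({len(session_prefixes)} > {limit}); "
              "tearing down the session")
        return False
    return True

session = set()
for i in range(60):                    # simulate a leak of many unexpected prefixes
    if not accept_announcement(session, f"10.{i}.0.0/16"):
        break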



Re: CloudFlare issues?

2019-06-24 Thread Max Tulyev

Hi All,

here in Ukraine we got an impact as well!

I have two questions:

1. Why did Cloudflare not immediately announce all their address space
by /24s? This can put the service up instantly for almost all places.


2. Why did almost all carriers not filter the leak on their side, but
instead wait for "a better weather on Mars" for several hours?


On 24.06.19 13:55, Dmitry Sherman wrote:

Hello are there any issues with CloudFlare services now?

Dmitry Sherman
dmi...@interhost.net
Interhost Networks Ltd
Web: http://www.interhost.co.il
fb: https://www.facebook.com/InterhostIL
Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157




Re: CloudFlare issues?

2019-06-24 Thread Andree Toonk
This is what it looks like happened:

There was a large-scale BGP 'leak' incident causing about 20k prefixes
for 2,400 networks (ASNs) to be rerouted through AS396531 (a steel plant)
and then on to its transit provider: Verizon (AS701).
Start time: 10:34:21 (UTC). End time: 12:37 (UTC).
All AS paths had the following in common:
701 396531 33154


33154 (DQECOM) is an ISP providing transit to 396531.
396531 is, by the looks of it, a steel plant, dual-homed to 701 and 33154.
701 is Verizon, which by the looks of it accepted all BGP announcements
from 396531.

What appears to have happened is that 33154 (via its BGP optimizer) generated
more-specific routes, which were propagated to 396531, which then sent them to
Verizon and voila... there is the full leak at work.
(DQECOM runs a BGP optimizer (https://www.noction.com/clients/dqe),
thanks Job for pointing that out, more below.)

As a result, traffic for 20k prefixes or so was now rerouted through
Verizon and 396531 (the steel plant).

We've seen numerous incidents like this in the past.
Lessons learned:
1) if you do use a BGP optimizer, please FILTER!
2) Verizon... filter your customers, please!


Since the BGP optimizer introduces new more specific routes, a lot of
traffic for high traffic destinations would have been rerouted through
that path, which would have been congested, causing the outages.
There were many Cloudflare prefixes affected, but also folks like
Amazon, Akamai, Facebook, Apple, Linode, etc.

Here's one example for Amazon CloudFront: 52.84.32.0/22. Normally
announced as 52.84.32.0/21, but during the incident as a /22 (remember,
more specifics always win):
https://stat.ripe.net/52.84.32.0%2F22#tabId=routing_bgplay.ignoreReannouncements=false_bgplay.resource=52.84.32.0/22_bgplay.starttime=1561337999_bgplay.endtime=1561377599_bgplay.rrcs=0,1,2,5,6,7,10,11,13,14,15,16,18,20_bgplay.instant=null_bgplay.type=bgp

RPKI would have worked here (assuming you're strict with the max length)!
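To illustrate the "more specifics always win" point, here is a toy
longest-prefix-match lookup (the RIB entries and path labels below are
illustrative, not real routing data):

# Toy illustration: forwarding follows the longest matching prefix, so a
# leaked /22 pulls traffic away from the legitimate /21. Entries are made up.
import ipaddress

rib = {
    ipaddress.ip_network("52.84.32.0/21"): "legitimate path",
    ipaddress.ip_network("52.84.32.0/22"): "leaked path via 701 396531 33154",
}

def lookup(destination):
    dst = ipaddress.ip_address(destination)
    covering = [net for net in rib if dst in net]
    best = max(covering, key=lambda net: net.prefixlen)   # longest prefix wins
    return rib[best]

print(lookup("52.84.33.1"))   # -> "leaked path via 701 396531 33154"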


Cheers
 Andree


My secret spy satellite informs me that Dmitry Sherman wrote On
2019-06-24, 3:55 AM:
> Hello are there any issues with CloudFlare services now?
>
> Dmitry Sherman
> dmi...@interhost.net
> Interhost Networks Ltd
> Web: http://www.interhost.co.il
> fb: https://www.facebook.com/InterhostIL
> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
>



Re: CloudFlare issues?

2019-06-24 Thread Job Snijders
On Mon, Jun 24, 2019 at 08:18:27AM -0400, Tom Paseka via NANOG wrote:
> a Verizon downstream BGP customer is leaking the full table, and some
> more specifics from us and many other providers.

It appears that one of the implicated ASNs, AS 33154 ("DQE Communications
LLC"), is listed as a customer on Noction's website:
https://www.noction.com/clients/dqe

I suspect AS 33154's customer AS 396531 turned up a new circuit with
Verizon, but didn't have routing policies to prevent sending routes from
33154 to 701 and vice versa, or their router didn't have support for RFC
8212.
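A rough sketch of the kind of routing policy being described here, under the
usual customer/peer/upstream export convention (the relationships and the
customer ASN below are assumptions for illustration, not 396531's actual
configuration): routes learned from one upstream must never be re-exported to
another upstream or to a peer, while customer-learned routes may go anywhere.

# Rough sketch of a relationship-based export policy; ASNs and relationships
# below are illustrative only.
CUSTOMER, PEER, UPSTREAM = "customer", "peer", "upstream"

# Hypothetical view from the leaking edge AS: both 33154 and 701 are upstreams,
# and 64511 stands in for a made-up downstream customer.
relationship = {33154: UPSTREAM, 701: UPSTREAM, 64511: CUSTOMER}

def may_export(learned_from, export_to):
    """Gao-Rexford style export rule: customer routes go to everyone;
    routes learned from an upstream or peer go to customers only."""
    if relationship[learned_from] == CUSTOMER:
        return True
    return relationship[export_to] == CUSTOMER

print(may_export(33154, 701))    # False: upstream-learned route to another upstream
print(may_export(64511, 701))    # True: customer route may be announced upstream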

I'd like to point everyone to an op-ed I wrote on the topic of "BGP
optimizers": https://seclists.org/nanog/2017/Aug/318

So in summary, I believe the following happened:

- 33154 generated fake more-specifics, which are not visible in the DFZ
- 33154 announced those fake more-specifics to at least one customer (396531)
- this customer (396531) propagated them to another upstream provider (701)
- it appears that 701 did not apply sufficient prefix filtering, or a
maximum-prefix limit

While it is easy to point at the alleged BGP optimizer as the root
cause, I do think we have now observed a cascading catastrophic failure
in both process and technology. Here are some recommendations that all
of us can apply, and that may have helped dampen the negative effects:

- deploy RPKI based BGP Origin validation (with invalid == reject)
- apply maximum prefix limits on all EBGP sessions
- ask your router vendor to comply with RFC 8212 ('default deny')
- turn off your 'BGP optimizers'

I suspect we, collectively, suffered significant financial damage in
this incident.

Kind regards,

Job
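To make the first recommendation concrete, here is a minimal sketch of the
RFC 6811 origin-validation outcome against a toy in-memory VRP table (the ROA
contents below are illustrative, not the real ROA for this prefix; production
routers learn VRPs from a validator over RTR):

# Minimal sketch of RPKI origin validation (RFC 6811 semantics); the VRP
# below is illustrative only.
import ipaddress

# Validated ROA Payloads: (prefix, maxLength, origin ASN)
vrps = [(ipaddress.ip_network("52.84.0.0/15"), 21, 16509)]

def validate(prefix, origin_asn):
    route = ipaddress.ip_network(prefix)
    covering = [v for v in vrps if route.subnet_of(v[0])]
    if not covering:
        return "not-found"                       # no ROA covers this route
    for net, max_len, asn in covering:
        if origin_asn == asn and route.prefixlen <= max_len:
            return "valid"
    return "invalid"                             # covered, but origin or length mismatch

# With "invalid == reject", a leaked more-specific beyond maxLength is dropped:
print(validate("52.84.32.0/22", 16509))   # -> "invalid" (exceeds maxLength 21)
print(validate("52.84.32.0/21", 16509))   # -> "valid"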


Re: CloudFlare issues?

2019-06-24 Thread Robbie Trencheny
This is my final update, I’m going back to bed, wake me up when the
internet is working again.

https://news.ycombinator.com/item?id=20262316

——

1230 UTC update We are working with networks around the world and are
observing network routes for Google and AWS being leaked as well.

On Mon, Jun 24, 2019 at 05:20 Robbie Trencheny  wrote:

> *1204 UTC update* This leak is more widespread than just Cloudflare.
>
> *1208 UTC update* Amazon Web Services is now reporting an external
> networking problem.
>
> On Mon, Jun 24, 2019 at 05:18 Tom Paseka  wrote:
>
>> a Verizon downstream BGP customer is leaking the full table, and some
>> more specifics from us and many other providers.
>>
>> On Mon, Jun 24, 2019 at 7:56 AM Robbie Trencheny  wrote:
>>
>>> *1147 UTC update* Staring at internal graphs, it looks like global traffic
>>> is now at 97% of expected, so the impact is lessening.
>>>
>>> On Mon, Jun 24, 2019 at 04:51 Dovid Bender  wrote:
>>>
 We are seeing issues as well getting to HE. The traffic is going via
 Alter.



 On Mon, Jun 24, 2019 at 7:48 AM Robbie Trencheny  wrote:

> From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:
>
> This appears to be a routing problem with Level3. All our systems are
> running normally but traffic isn't getting to us for a portion of our
> domains.
>
> 1128 UTC update Looks like we're dealing with a route leak and we're
> talking directly with the leaker and Level3 at the moment.
>
> 1131 UTC update Just to be clear this isn't affecting all our traffic
> or all our domains or all countries. A portion of traffic isn't hitting
> Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.
>
> 1134 UTC update We are now certain we are dealing with a route leak.
>
> On Mon, Jun 24, 2019 at 04:04 Antonios Chariton 
> wrote:
>
>> Yes, traffic from Greek networks is routed through NYC (alter.net),
>> and previously it had a 60% packet loss. Now it’s still via NYC, but no
>> packet loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but 
>> the
>> problem definitely exists.
>>
>> Antonis
>>
>>
>> On 24 Jun 2019, at 13:55, Dmitry Sherman 
>> wrote:
>>
>> Hello are there any issues with CloudFlare services now?
>>
>> Dmitry Sherman
>> dmi...@interhost.net
>> Interhost Networks Ltd
>> Web: http://www.interhost.co.il
>> fb: https://www.facebook.com/InterhostIL
>> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
>>
>>
>> --
> --
> Robbie Trencheny (@robbie )
> 925-884-3728
> robbie.io
>
 --
>>> --
>>> Robbie Trencheny (@robbie )
>>> 925-884-3728
>>> robbie.io
>>>
>> --
> --
> Robbie Trencheny (@robbie )
> 925-884-3728
> robbie.io
>
-- 
--
Robbie Trencheny (@robbie )
925-884-3728
robbie.io


Re: CloudFlare issues?

2019-06-24 Thread Robbie Trencheny
*1204 UTC update* This leak is more widespread than just Cloudflare.

*1208 UTC update* Amazon Web Services is now reporting an external networking
problem.

On Mon, Jun 24, 2019 at 05:18 Tom Paseka  wrote:

> a Verizon downstream BGP customer is leaking the full table, and some
> more specifics from us and many other providers.
>
> On Mon, Jun 24, 2019 at 7:56 AM Robbie Trencheny  wrote:
>
>> *1147 UTC update* Staring at internal graphs, it looks like global traffic
>> is now at 97% of expected, so the impact is lessening.
>>
>> On Mon, Jun 24, 2019 at 04:51 Dovid Bender  wrote:
>>
>>> We are seeing issues as well getting to HE. The traffic is going via
>>> Alter.
>>>
>>>
>>>
>>> On Mon, Jun 24, 2019 at 7:48 AM Robbie Trencheny  wrote:
>>>
 From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:

 This appears to be a routing problem with Level3. All our systems are
 running normally but traffic isn't getting to us for a portion of our
 domains.

 1128 UTC update Looks like we're dealing with a route leak and we're
 talking directly with the leaker and Level3 at the moment.

 1131 UTC update Just to be clear this isn't affecting all our traffic
 or all our domains or all countries. A portion of traffic isn't hitting
 Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.

 1134 UTC update We are now certain we are dealing with a route leak.

 On Mon, Jun 24, 2019 at 04:04 Antonios Chariton 
 wrote:

> Yes, traffic from Greek networks is routed through NYC (alter.net),
> and previously it had a 60% packet loss. Now it’s still via NYC, but no
> packet loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the
> problem definitely exists.
>
> Antonis
>
>
> On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:
>
> Hello are there any issues with CloudFlare services now?
>
> Dmitry Sherman
> dmi...@interhost.net
> Interhost Networks Ltd
> Web: http://www.interhost.co.il
> fb: https://www.facebook.com/InterhostIL
> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
>
>
> --
 --
 Robbie Trencheny (@robbie )
 925-884-3728
 robbie.io

>>> --
>> --
>> Robbie Trencheny (@robbie )
>> 925-884-3728
>> robbie.io
>>
> --
--
Robbie Trencheny (@robbie )
925-884-3728
robbie.io


Re: CloudFlare issues?

2019-06-24 Thread Tom Paseka via NANOG
a Verizon downstream BGP customer is leaking the full table, and some more
specifics from us and many other providers.

On Mon, Jun 24, 2019 at 7:56 AM Robbie Trencheny  wrote:

> *1147 UTC update* Staring at internal graphs, it looks like global traffic
> is now at 97% of expected, so the impact is lessening.
>
> On Mon, Jun 24, 2019 at 04:51 Dovid Bender  wrote:
>
>> We are seeing issues as well getting to HE. The traffic is going via
>> Alter.
>>
>>
>>
>> On Mon, Jun 24, 2019 at 7:48 AM Robbie Trencheny  wrote:
>>
>>> From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:
>>>
>>> This appears to be a routing problem with Level3. All our systems are
>>> running normally but traffic isn't getting to us for a portion of our
>>> domains.
>>>
>>> 1128 UTC update Looks like we're dealing with a route leak and we're
>>> talking directly with the leaker and Level3 at the moment.
>>>
>>> 1131 UTC update Just to be clear this isn't affecting all our traffic or
>>> all our domains or all countries. A portion of traffic isn't hitting
>>> Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.
>>>
>>> 1134 UTC update We are now certain we are dealing with a route leak.
>>>
>>> On Mon, Jun 24, 2019 at 04:04 Antonios Chariton 
>>> wrote:
>>>
 Yes, traffic from Greek networks is routed through NYC (alter.net),
 and previously it had a 60% packet loss. Now it’s still via NYC, but no
 packet loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the
 problem definitely exists.

 Antonis


 On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:

 Hello are there any issues with CloudFlare services now?

 Dmitry Sherman
 dmi...@interhost.net
 Interhost Networks Ltd
 Web: http://www.interhost.co.il
 fb: https://www.facebook.com/InterhostIL
 Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157


 --
>>> --
>>> Robbie Trencheny (@robbie )
>>> 925-884-3728
>>> robbie.io
>>>
>> --
> --
> Robbie Trencheny (@robbie )
> 925-884-3728
> robbie.io
>


Re: CloudFlare issues?

2019-06-24 Thread Robbie Trencheny
*1147 UTC update* Staring at internal graphs, it looks like global traffic is
now at 97% of expected, so the impact is lessening.

On Mon, Jun 24, 2019 at 04:51 Dovid Bender  wrote:

> We are seeing issues as well getting to HE. The traffic is going via Alter.
>
>
>
> On Mon, Jun 24, 2019 at 7:48 AM Robbie Trencheny  wrote:
>
>> From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:
>>
>> This appears to be a routing problem with Level3. All our systems are
>> running normally but traffic isn't getting to us for a portion of our
>> domains.
>>
>> 1128 UTC update Looks like we're dealing with a route leak and we're
>> talking directly with the leaker and Level3 at the moment.
>>
>> 1131 UTC update Just to be clear this isn't affecting all our traffic or
>> all our domains or all countries. A portion of traffic isn't hitting
>> Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.
>>
>> 1134 UTC update We are now certain we are dealing with a route leak.
>>
>> On Mon, Jun 24, 2019 at 04:04 Antonios Chariton 
>> wrote:
>>
>>> Yes, traffic from Greek networks is routed through NYC (alter.net), and
>>> previously it had a 60% packet loss. Now it’s still via NYC, but no packet
>>> loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem
>>> definitely exists.
>>>
>>> Antonis
>>>
>>>
>>> On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:
>>>
>>> Hello are there any issues with CloudFlare services now?
>>>
>>> Dmitry Sherman
>>> dmi...@interhost.net
>>> Interhost Networks Ltd
>>> Web: http://www.interhost.co.il
>>> fb: https://www.facebook.com/InterhostIL
>>> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
>>>
>>>
>>> --
>> --
>> Robbie Trencheny (@robbie )
>> 925-884-3728
>> robbie.io
>>
> --
--
Robbie Trencheny (@robbie )
925-884-3728
robbie.io


Re: CloudFlare issues?

2019-06-24 Thread Dovid Bender
We are seeing issues as well getting to HE. The traffic is going via Alter.



On Mon, Jun 24, 2019 at 7:48 AM Robbie Trencheny  wrote:

> From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:
>
> This appears to be a routing problem with Level3. All our systems are
> running normally but traffic isn't getting to us for a portion of our
> domains.
>
> 1128 UTC update Looks like we're dealing with a route leak and we're
> talking directly with the leaker and Level3 at the moment.
>
> 1131 UTC update Just to be clear this isn't affecting all our traffic or
> all our domains or all countries. A portion of traffic isn't hitting
> Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.
>
> 1134 UTC update We are now certain we are dealing with a route leak.
>
> On Mon, Jun 24, 2019 at 04:04 Antonios Chariton 
> wrote:
>
>> Yes, traffic from Greek networks is routed through NYC (alter.net), and
>> previously it had a 60% packet loss. Now it’s still via NYC, but no packet
>> loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem
>> definitely exists.
>>
>> Antonis
>>
>>
>> On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:
>>
>> Hello are there any issues with CloudFlare services now?
>>
>> Dmitry Sherman
>> dmi...@interhost.net
>> Interhost Networks Ltd
>> Web: http://www.interhost.co.il
>> fb: https://www.facebook.com/InterhostIL
>> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
>>
>>
>> --
> --
> Robbie Trencheny (@robbie )
> 925-884-3728
> robbie.io
>


Re: CloudFlare issues?

2019-06-24 Thread Robbie Trencheny
From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:

This appears to be a routing problem with Level3. All our systems are
running normally but traffic isn't getting to us for a portion of our
domains.

1128 UTC update Looks like we're dealing with a route leak and we're
talking directly with the leaker and Level3 at the moment.

1131 UTC update Just to be clear this isn't affecting all our traffic or
all our domains or all countries. A portion of traffic isn't hitting
Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.

1134 UTC update We are now certain we are dealing with a route leak.

On Mon, Jun 24, 2019 at 04:04 Antonios Chariton 
wrote:

> Yes, traffic from Greek networks is routed through NYC (alter.net), and
> previously it had a 60% packet loss. Now it’s still via NYC, but no packet
> loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem
> definitely exists.
>
> Antonis
>
>
> On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:
>
> Hello are there any issues with CloudFlare services now?
>
> Dmitry Sherman
> dmi...@interhost.net
> Interhost Networks Ltd
> Web: http://www.interhost.co.il
> fb: https://www.facebook.com/InterhostIL
> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
>
>
> --
--
Robbie Trencheny (@robbie )
925-884-3728
robbie.io


Re: CloudFlare issues?

2019-06-24 Thread James Jun
On Mon, Jun 24, 2019 at 02:03:47PM +0300, Antonios Chariton wrote:
> Yes, traffic from Greek networks is routed through NYC (alter.net), and
> previously it had a 60% packet loss. Now it's still via NYC, but no packet
> loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem
> definitely exists.
>

It seems Verizon has stopped filtering a downstream customer, or filtering 
broke.

Time to implement peer-locking path filters for those using VZ as a paid peer.

   Network          Next Hop            Metric LocPrf Weight Path
*  2.18.64.0/24     137.39.3.55                             0 701 396531 33154 174 6057 i
*  2.19.251.0/24    137.39.3.55                             0 701 396531 33154 174 6057 i
*  2.22.24.0/23     137.39.3.55                             0 701 396531 33154 174 6057 i
*  2.22.26.0/23     137.39.3.55                             0 701 396531 33154 174 6057 i
*  2.22.28.0/24     137.39.3.55                             0 701 396531 33154 174 6057 i
*  2.24.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.24.0.0/13      202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.25.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.26.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.27.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.28.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.29.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.30.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.31.0.0/16      137.39.3.55                             0 701 396531 33154 3356 12576 i
*                   202.232.0.2                             0 2497 701 396531 33154 3356 12576 i
*  2.56.16.0/22     137.39.3.55                             0 701 396531 33154 1239 9009 i
*  2.56.150.0/24    137.39.3.55                             0 701 396531 33154 1239 9009 i
*  2.57.48.0/22     137.39.3.55                             0 701 396531 33154 174 50782 i
*  2.58.47.0/24     137.39.3.55                             0 701 396531 33154 1239 9009 i
*  2.59.0.0/23      137.39.3.55                             0 701 396531 33154 1239 9009 i
*  2.59.244.0/22    137.39.3.55                             0 701 396531 33154 3356 29119 i
*  2.148.0.0/14     137.39.3.55                             0 701 396531 33154 3356 2119 i
*  3.5.128.0/24     137.39.3.55                             0 701 396531 33154 3356 16509 i
*  3.5.128.0/22     137.39.3.55                             0 701 396531 33154 3356 16509 i
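A rough sketch of the peer-lock idea mentioned above (the protected-ASN set
and the policy are illustrative, not anyone's production filter): a protected
Tier 1 ASN should only ever appear as the direct neighbor, never deeper in the
AS_PATH behind somebody else.

# Rough sketch of a peer-lock style check; the protected set is illustrative.
PROTECTED = frozenset({701, 1239, 2914, 3257, 3356, 6453})

def peer_lock_ok(as_path):
    """Reject the route if any protected ASN appears beyond the first hop."""
    return not any(asn in PROTECTED for asn in as_path[1:])

# One of the leaked AS_PATHs from the table above: 3356 shows up behind a stub.
leaked_path = [701, 396531, 33154, 3356, 12576]
print(peer_lock_ok(leaked_path))        # -> False: drop it
print(peer_lock_ok([701, 396531]))      # -> True: a plain customer route via 701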


Re: CloudFlare issues?

2019-06-24 Thread Antonios Chariton
Yes, traffic from Greek networks is routed through NYC (alter.net), and
previously it had a 60% packet loss. Now it’s still via NYC, but no packet
loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem
definitely exists.

Antonis 

> On 24 Jun 2019, at 13:55, Dmitry Sherman  wrote:
> 
> Hello are there any issues with CloudFlare services now?
> 
> Dmitry Sherman
> dmi...@interhost.net 
> Interhost Networks Ltd
> Web: http://www.interhost.co.il
> fb: https://www.facebook.com/InterhostIL
> Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
> 



CloudFlare issues?

2019-06-24 Thread Dmitry Sherman
Hello are there any issues with CloudFlare services now?

Dmitry Sherman
dmi...@interhost.net
Interhost Networks Ltd
Web: http://www.interhost.co.il
fb: https://www.facebook.com/InterhostIL
Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157