Re: CloudFlare issues?

2019-07-04 Thread Job Snijders
> Anyway, you can now enjoy https://rpki.net/s/rpki-test even more! :-)

My apologies, I fumbled that URL. I intended to point here:
https://www.ripe.net/s/rpki-test


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka



On 4/Jul/19 20:46, Francois Lecavalier wrote:

> It's been close to 3 hours now since I dropped them - radio silence.
>
> Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
> implement, validate and troubleshoot.

Well done! Congrats!

Mark.




Re: CloudFlare issues?

2019-07-04 Thread Job Snijders
On Thu, Jul 4, 2019 at 8:46 PM Francois Lecavalier wrote:

> It's been close to 3 hours now since I dropped them - radio silence.

I am going to assume that "radio silence" for you means that your
network is fully functional and none of your customers have raised
issues! :-)

> Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
> implement, validate and troubleshoot.

Thank you for sharing your report. I believe it is good to share RPKI
stories with each other, not just to celebrate the deployment of an
exciting technology, but also to help provide debugging information
ahead of time should there be issues between provider A and B due to a
ROA misconfiguration. Announcing to the public that one has deployed
RPKI, at this stage of the technology's lifecycle, is probably a
productive action to consider.

Anyway, you can now enjoy https://rpki.net/s/rpki-test even more! :-)

Kind regards,

Job


Re: CloudFlare issues?

2019-07-04 Thread Ben Maddison via NANOG
Welcome to the club!



RE: CloudFlare issues?

2019-07-04 Thread Francois Lecavalier
>> At this point in time I think the ideal deployment model is to perform
>> the validation within your administrative domain and run your own
>> validators.

>+1

We'll definitely look into this shortly.  I don't want to leave a security 
measure in the hands of a third party, but with my team being so busy it was a 
quick temporary fix.

> The larger challenge has been related to vendor implementation choices and 
> bugs, particularly on ios-xe. Happy to go into more detail if anyone is 
> interested.

We are on Juniper MX204s at the edge and they have been solid for the last 60 
weeks - we ran into a long list of bugs on other platforms but not on these.

So I had about 4200 routes marked as invalid.  After looking at a sample of 
them, it looks like most are covered by a ROA but are announced with a mask 
length that the ROA does not permit - so there is ultimately still a route to 
these prefixes, and at worst dropping them would result in "suboptimal" 
routing - or should I say: the remote network can't control its route 
propagation anymore.  In most cases they are stub networks with a single /24 
reassigned from the upstream provider.  I have no traffic going directly to 
these networks and I don't expect any to go there anytime soon.
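
To illustrate the "improper mask length" case, here is a minimal sketch of 
route origin validation along the lines of RFC 6811. The ROA, prefixes and AS 
numbers are made up for the example, and the function is not taken from any 
validator or router implementation.

```python
import ipaddress
from typing import NamedTuple, Sequence


class Roa(NamedTuple):
    prefix: str      # covering prefix authorised by the ROA
    max_length: int  # longest mask length the origin may announce
    asn: int         # authorised origin AS


def rov_state(route: str, origin_asn: int, roas: Sequence[Roa]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    net = ipaddress.ip_network(route)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Only ROAs of the same address family that cover the route matter.
        if roa_net.version != net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"


# Hypothetical data: a /16 ROA that does not allow more-specifics.
roas = [Roa("10.10.0.0/16", 16, 64496)]
print(rov_state("10.10.0.0/16", 64496, roas))  # valid
print(rov_state("10.10.5.0/24", 64496, roas))  # invalid (mask too long)
print(rov_state("10.20.0.0/16", 64497, roas))  # not-found
```

In this sketch the /24 is rejected while the covering /16 stays valid, which 
is why traffic can still follow the less-specific route.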

It's been close to 3 hours now since I dropped them - radio silence.

Whoever fears implementing RPKI/ROA/ROV, simply don't.  It's very easy to 
implement, validate and troubleshoot.

This e-mail may be privileged and/or confidential, and the sender does not 
waive any related rights and obligations. Any distribution, use or copying of 
this e-mail or the information it contains by other than an intended recipient 
is unauthorized. If you received this e-mail in error, please advise me (by 
return e-mail or otherwise) immediately. Ce courrier électronique est 
confidentiel et protégé. L'expéditeur ne renonce pas aux droits et obligations 
qui s'y rapportent. Toute diffusion, utilisation ou copie de ce message ou des 
renseignements qu'il contient par une personne autre que le (les) 
destinataire(s) désigné(s) est interdite. Si vous recevez ce courrier 
électronique par erreur, veuillez m'en aviser immédiatement, par retour de 
courrier électronique ou par un autre moyen.


Real-world MPLS P/LSR experience on BCM T3 (X5/X7) vs T2+

2019-07-04 Thread Jason Lixfeld
Hey all,

In the role of an MPLS P/LSR, I’m curious if there have been any gotchas (or 
fixes) revealed with BCM T3 vs. T2+.  I remember reading somewhere some years 
ago that there were oddities on the T2+ that I’d like to believe have been 
addressed on T3, but does anyone have any real-world experience with T3 in an 
MPLS core?  (IS-IS, LDP, rLFA, 2-3 labels wide, likely 
SR/Seamless/BGP-LU/whatever down the road)

I’m sure J, C, A, etc. all have their own challenges wrapping their code around 
the APIs, so would be curious to hear anything anyone has to share along those 
lines as well.

Thanks in advance.

Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka



On 4/Jul/19 17:50, Ben Maddison via NANOG wrote:

> We have been dropping Invalids since April, and have had only a
> (single-digit) handful of support requests related to those becoming
> unreachable.

We've had 2 cases where customers could not reach a prefix. Both were
mistakes (as we've found most Invalid routes to be), which were promptly
fixed.

One of them was where a cloud provider decided to originate a longer
prefix on behalf of their content-producing customer, using their own AS
as opposed to the one the customer had used to create the ROA for the
covering block.

Mark.


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka



On 4/Jul/19 17:33, Job Snijders wrote:

> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.

In essence, this is also my thought process.

I think Cloudflare are very well-intentioned in making it as painless as
possible for other operators to get RPKI deployed (and more power
to them for going to such lengths to do so), but you have to determine
whether you are willing to let a service such as this run outside of your
domain.

Every year, someone asks me whether I'd be willing to outsource my route
reflector VNFs to AWS/Azure/etc. My answer to that falls within the
same realm as handling RPKI for your network :-).

Mark.


Re: CloudFlare issues?

2019-07-04 Thread Nick Hilliard

Francois Lecavalier wrote on 04/07/2019 16:22:
> My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject 
> invalid shouldn’t break anything.


Accepting valid ROAs is a better idea after checking that the source AS 
is legitimate from the peer.


Nick


Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka


On 4/Jul/19 17:22, Francois Lecavalier wrote:

>  
>
> Following that Verizon debacle I got onboard with ROV, after a couple
> research I stopped my choice on the ….drum roll…. CloudFlare GoRTR
> (https://github.com/cloudflare/gortr).  If you trust them enough they
> provide an updated JSON every 15 minutes of the global RIR aggregate. 
> I’ll see down the road if we’ll fetch them ourselves but at least it
> got us up and running in less than an hour.  It was also easy for us
> to deploy as the routers and the servers are on the same PoP directly
> connected, so we don’t need the whole encryption recipe they provide
> for mass distribution.
>

Funny you should mention this... I was speaking with Tom today during an
RPKI talk he gave at MyNOG, about whether we'd be willing to trust their
RTR streams.

But, I'm glad you found a quick solution to get you up and running.
Welcome to the club.


>  
>
> But I also have a question for all the ROA folks out there.  So far we
> are not taking any action other than lowering the local-pref – we want
> to make sure this is stable before we start denying prefixes.  So the
> question, is it safe as of this date to : 1.Accept valid, 2. Accept
> unknown, 3. Reject invalid?  Have any large network who implemented it
> dealt with unreachable destinations?  I’m wondering as I haven’t found
> any blog mentioning anything in this regard and ClouFlare docs only
> shows example for valid and invalid, but nothing for unknown.
>
>  
>
> My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject
> invalid shouldn’t break anything.
>

Well, a Valid and NotFound state implicitly mean that the routes can be
used for routing/forwarding. In that case, the only policy we create and
apply is against Invalid routes, which is to DROP them.

Mark.


Re: CloudFlare issues?

2019-07-04 Thread Ben Maddison via NANOG
Hi Francois,

On Thu, 2019-07-04 at 17:33 +0200, Job Snijders wrote:
> Dear Francois,
> 
> On Thu, Jul 04, 2019 at 03:22:23PM +, Francois Lecavalier wrote:
> > 
> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.

+1

>
> > But I also have a question for all the ROA folks out there.  So far
> > we are not taking any action other than lowering the local-pref - we
> > want to make sure this is stable before we start denying prefixes.
> > So the question, is it safe as of this date to : 1.Accept valid, 2.
> > Accept unknown, 3. Reject invalid?  Have any large network who
> > implemented it dealt with unreachable destinations?  I'm wondering
> > as I haven't found any blog mentioning anything in this regard and
> > ClouFlare docs only shows example for valid and invalid, but nothing
> > for unknown.
>
We have been dropping Invalids since April, and have had only a
(single-digit) handful of support requests related to those becoming
unreachable.

The larger challenge has been related to vendor implementation choices
and bugs, particularly on ios-xe. Happy to go into more detail if
anyone is interested.

I would recommend *not* taking any policy action that distinguishes
Valid from Unknown. If you find that you have routes for the same
prefix/len with both statuses, then that is a bug and/or
misconfiguration which you could turn into a loop by taking policy
action on that difference.
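
The recommendation above can be sketched as a tiny import-policy function.
This is only an illustration under assumptions: the state names and the
local-pref value are hypothetical and do not correspond to any vendor's
policy language.

```python
from typing import Optional

DEFAULT_LOCAL_PREF = 100  # hypothetical value, same for Valid and Unknown


def import_policy(rov_state: str) -> Optional[int]:
    """Return the local-pref to apply, or None if the route is rejected."""
    if rov_state == "invalid":
        return None  # drop Invalids at the border
    if rov_state in ("valid", "not-found"):
        # Deliberately no distinction between Valid and Unknown.
        return DEFAULT_LOCAL_PREF
    raise ValueError("unexpected validation state: " + rov_state)


print(import_policy("valid"))      # 100
print(import_policy("not-found"))  # 100
print(import_policy("invalid"))    # None
```

Because both accepted states get identical attributes, two paths for the
same prefix can never be ranked differently on validation state alone.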

Cheers,

Ben


Re: CloudFlare issues?

2019-07-04 Thread Job Snijders
Dear Francois,

On Thu, Jul 04, 2019 at 03:22:23PM +, Francois Lecavalier wrote:
> Following that Verizon debacle I got onboard with ROV, after a couple
> research I stopped my choice on the drum roll CloudFlare GoRTR
> (https://github.com/cloudflare/gortr).  If you trust them enough they
> provide an updated JSON every 15 minutes of the global RIR aggregate.

At this point in time I think the ideal deployment model is to perform
the validation within your administrative domain and run your own
validators. You can combine routinator with gortr, or use cloudflare's
octorpki software https://github.com/cloudflare/cfrpki

> I'll see down the road if we'll fetch them ourselves but at least it
> got us up and running in less than an hour.  It was also easy for us
> to deploy as the routers and the servers are on the same PoP directly
> connected, so we don't need the whole encryption recipe they provide
> for mass distribution.

yeah, that is true!

> But I also have a question for all the ROA folks out there.  So far we
> are not taking any action other than lowering the local-pref - we want
> to make sure this is stable before we start denying prefixes.  So the
> question, is it safe as of this date to : 1.Accept valid, 2. Accept
> unknown, 3. Reject invalid?  Have any large network who implemented it
> dealt with unreachable destinations?  I'm wondering as I haven't found
> any blog mentioning anything in this regard and ClouFlare docs only
> shows example for valid and invalid, but nothing for unknown.

I believe at this point in time it is safe to accept valid and unknown
(combined with an IRR filter), and reject RPKI invalid BGP announcements
at your EBGP borders. Large examples of other organisations who already
are rejecting invalid announcements are AT&T, Nordunet, DE-CIX, YYCIX,
XS4ALL, MSK-IX, INEX, France-IX, SEACOM, Workonline, KPN International,
and hundreds of others.

You can run an analysis yourself to see how traffic would be impacted in
your network using pmacct or Kentik, see this post for more info:
https://mailman.nanog.org/pipermail/nanog/2019-February/099522.html
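
As a rough idea of what such an analysis computes, here is a minimal
sketch. It assumes the flow records have already been annotated with a
validation state; the record layout, the prefixes and the byte counts are
hypothetical and are not the output format of pmacct or Kentik.

```python
import ipaddress
from collections import defaultdict


def summarize(flows, rib):
    """Sum bytes per validation state.

    flows: iterable of (dst_prefix, rov_state, bytes)
    rib:   list of (prefix, rov_state) for routes currently received
    Traffic towards an invalid route is split by whether a covering
    route that is not invalid would still carry it after a drop.
    """
    totals = defaultdict(int)
    for prefix, state, nbytes in flows:
        if state != "invalid":
            totals[state] += nbytes
            continue
        net = ipaddress.ip_network(prefix)
        has_fallback = any(
            other_state != "invalid"
            and ipaddress.ip_network(other).version == net.version
            and net.subnet_of(ipaddress.ip_network(other))
            for other, other_state in rib
        )
        key = "invalid-with-fallback" if has_fallback else "invalid-unreachable"
        totals[key] += nbytes
    return dict(totals)


rib = [
    ("10.10.0.0/16", "valid"),
    ("10.10.5.0/24", "invalid"),
    ("172.16.1.0/24", "invalid"),
]
flows = [
    ("10.10.0.0/16", "valid", 900),
    ("10.10.5.0/24", "invalid", 100),
    ("172.16.1.0/24", "invalid", 50),
]
print(summarize(flows, rib))
```

Only the "invalid-unreachable" bucket represents traffic that would lose
reachability once invalids are rejected.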

> My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject
> invalid shouldn't break anything.

Correct! Let us know how it went :-)

Kind regards,

Job


Re: CloudFlare issues?

2019-07-04 Thread Francois Lecavalier
Hi Mark,

Following that Verizon debacle I got on board with ROV. After a bit of 
research I settled on... drum roll... Cloudflare GoRTR 
(https://github.com/cloudflare/gortr).  If you trust them enough, they provide 
an updated JSON every 15 minutes of the global RIR aggregate.  I'll see down 
the road if we'll fetch the ROAs ourselves but at least it got us up and 
running in less than an hour.  It was also easy for us to deploy as the routers 
and the servers are in the same PoP, directly connected, so we don't need the 
whole encryption recipe they provide for mass distribution.
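
For readers who have not looked at such an export, here is a minimal sketch 
of loading one. The layout (a top-level "roas" list whose entries carry 
"prefix", "maxLength" and "asn") is an assumption for this example, and the 
sample data is made up; check the actual file before relying on it.

```python
import json

# Hypothetical sample in the assumed export layout.
SAMPLE = """
{
  "roas": [
    {"prefix": "10.10.0.0/16", "maxLength": 16, "asn": "AS64496"},
    {"prefix": "2001:db8::/32", "maxLength": 48, "asn": 64497}
  ]
}
"""


def load_roas(text):
    """Return a list of (prefix, max_length, asn) tuples."""
    entries = []
    for item in json.loads(text)["roas"]:
        asn = item["asn"]
        # Accept both "AS64496" strings and plain integers.
        if isinstance(asn, str):
            asn = int(asn[2:] if asn.upper().startswith("AS") else asn)
        entries.append((item["prefix"], int(item["maxLength"]), asn))
    return entries


for entry in load_roas(SAMPLE):
    print(entry)
```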

But I also have a question for all the ROA folks out there.  So far we are not 
taking any action other than lowering the local-pref - we want to make sure 
this is stable before we start denying prefixes.  So the question: is it safe 
as of this date to 1. accept valid, 2. accept unknown, 3. reject invalid?  
Has any large network that implemented it had to deal with unreachable 
destinations?  I'm wondering as I haven't found any blog mentioning anything in 
this regard, and the Cloudflare docs only show examples for valid and invalid, 
but nothing for unknown.

My assumption is that 1.Accept valid, 2. Accept unknown, 3. Reject invalid 
shouldn't break anything.

Thanks,

-Francois


Re: CloudFlare issues?

2019-07-04 Thread i3D.net - Martijn Schmidt via NANOG
So that means it's time for everyone to migrate their ARIN resources to a sane 
RIR that does allow normal access to and redistribution of its RPKI TAL? ;-)

The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were 
both major reasons for us to bring our ARIN IPv4 address space to RIPE. 
Unfortunately we had to renumber our handful of IPv6 customers because ARIN 
doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.

Therefore, Cloudflare folks - when are you transferring your resources away 
from ARIN? :D

Best regards,
Martijn

On 7/4/19 11:46 AM, Mark Tinka wrote:
> I finally thought about this after I got off my beer high :-).
>
> Some of our customers complained about losing access to Cloudflare's resources 
> during the Verizon debacle. Since we are doing ROV and dropping Invalids, this 
> should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are 
> ROA'd.
>
> However, since we are not using the ARIN TAL (for known reasons), this explains 
> why this also broke for us.
>
> Back to beer now :-)...
>
> Mark.



Re: CloudFlare issues?

2019-07-04 Thread Mark Tinka
I finally thought about this after I got off my beer high :-).

Some of our customers complained about losing access to Cloudflare's
resources during the Verizon debacle. Since we are doing ROV and
dropping Invalids, this should not have happened, given most of
Cloudflare's IPv4 and IPv6 routes are ROA'd.

However, since we are not using the ARIN TAL (for known reasons), this
explains why this also broke for us.
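
A minimal sketch of why a missing TAL matters, as I read the explanation
above: without the ROAs from one trust anchor, a leaked more-specific is
no longer Invalid but merely NotFound, so a "drop Invalids" policy lets
it through. All prefixes, AS numbers and the TAL grouping below are
hypothetical.

```python
import ipaddress

# Hypothetical ROAs, grouped by the trust anchor they were learned through.
ROAS_BY_TAL = {
    "ripe": [("10.10.0.0/16", 16, 64496)],
    "arin": [("172.16.0.0/20", 20, 64500)],
}


def state_with_tals(route, origin_asn, enabled_tals):
    """Validate a route using only ROAs from the enabled trust anchors."""
    net = ipaddress.ip_network(route)
    covered = False
    for tal in enabled_tals:
        for prefix, max_length, asn in ROAS_BY_TAL.get(tal, []):
            roa_net = ipaddress.ip_network(prefix)
            if roa_net.version != net.version or not net.subnet_of(roa_net):
                continue
            covered = True
            if asn == origin_asn and net.prefixlen <= max_length:
                return "valid"
    return "invalid" if covered else "not-found"


# A leaked more-specific of the block registered under the "arin" TAL.
leaked = ("172.16.4.0/24", 64500)
print(state_with_tals(*leaked, enabled_tals=["ripe", "arin"]))  # invalid
print(state_with_tals(*leaked, enabled_tals=["ripe"]))          # not-found
```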

Back to beer now :-)...

Mark.