RE: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-11 Thread Jakob Heitz (jheitz)
When you simply bring down an eBGP session, withdraws will propagate throughout
the network.
Soon after, the alternate routes will propagate. In the interim, some routers
will lose connectivity.
This problem is solved by graceful shutdown.
That only works for planned shutdowns, though.
The interim can last many minutes because of the advertisement-interval
(MRAI timer).
A possible way to reduce this interim from minutes to seconds is to
set the MRAI timer to 0 on all routers. A potential problem with that is that
any BGP instability in the network will then cause some serious flapping.
Another alternative is to use BGP add-path (RFC 7911) to distribute backup
routes.
This avoids the MRAI problem, but requires more memory on routers.
It also works for accidental shutdowns.
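
To make the MRAI arithmetic concrete, here is a rough back-of-the-envelope
sketch in Python. The hop count and the per-hop processing delay are
illustrative assumptions, not measurements, and the function name is
hypothetical. The 30-second value is the default that RFC 4271 suggests for
eBGP; the model simply assumes that each AS hop may hold an update for up to
one full MRAI interval, and it ignores jitter and path hunting, which can add
further rounds.

```python
def worst_case_propagation_seconds(as_hops: int,
                                   mrai_seconds: float,
                                   per_hop_processing_seconds: float = 0.5) -> float:
    """Rough upper bound on the time for one update to cross `as_hops` ASes.

    Simplifying assumption: every hop may hold the update for a full MRAI
    interval before re-advertising it, plus a fixed processing delay.
    """
    return as_hops * (mrai_seconds + per_hop_processing_seconds)


if __name__ == "__main__":
    # Compare the suggested eBGP default with a shorter timer and with MRAI 0.
    for mrai in (30.0, 5.0, 0.0):
        bound = worst_case_propagation_seconds(as_hops=6, mrai_seconds=mrai)
        print(f"MRAI {mrai:4.0f}s: up to {bound:.0f}s across 6 AS hops")
```

Under these assumptions a single update needs up to about three minutes with
the default timer and only a few seconds with MRAI 0, which matches the
minutes-versus-seconds contrast above.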

Thanks,
Jakob.


> -----Original Message-----
> From: Jakob Heitz (jheitz)
> Sent: Tuesday, January 10, 2017 11:52 AM
> To: nanog@nanog.org; 'baldur.nordd...@gmail.com' 
> Subject: RE: Soliciting your opinions on Internet routing: A survey on BGP 
> convergence
> 
> Hi Baldur,
> 
> Have you tried graceful shutdown?
> You need redundant links, but not to the same transit.
> https://tools.ietf.org/html/draft-ietf-grow-bgp-gshut-06
> This draft is expired, but it is actually implemented by several vendors.
> 
> I implemented this.
> http://www.slideshare.net/bduvivie/bgp-graceful-shutdown-ios-xr
> I added an option to configure AS-path prepends in case the gshut community 
> was not supported by peers.
> 
> Thanks,
> Jakob.
> 
> 
> > Date: Tue, 10 Jan 2017 03:51:04 +0100
> > From: Baldur Norddahl 
> >
> > Hello
> >
> > I find that the type of outage that affects our network the most is
> > neither of the two options you describe. As is probably typical for
> > smaller networks, we do not have redundant uplinks to all of our
> > transits. If a transit link goes, for example because we had to reboot a
> > router, traffic is supposed to reroute to the remaining transit links.
> > Internally our network handles this fairly fast for egress traffic.
> >
> > However the problem is the ingress traffic - it can be 5 to 15 minutes
> > before everything has settled down. This is the time before everyone
> > else on the internet has processed that they will have to switch to your
> > alternate transit.
> >
> > The only solution I know of is to have redundant links to all transits.
> > Going forward I will make sure we have this because it is a huge
> > disadvantage not being able to take a router out of service without
> > causing downtime for all users. Not to mention that a router crash or
> > link failure that should have taken seconds at most to reroute, but
> > instead causes at least 5 minutes of unstable internet.
> >
> > Regards,
> >
> > Baldur


Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Laurent Vanbever
Hi Joel,

> On 10 Jan 2017, at 06:51, joel jaeggli  wrote:
> 
> On 1/9/17 2:56 PM, Laurent Vanbever wrote:
>> Hi NANOG,
>> 
>> We often read that the Internet (i.e. BGP) is "slow to converge". But how
>> slow is it really? Do you care anyway? And can we (researchers) do anything
>> about it?
>> Please help us out to find out by answering our short anonymous survey
>> (<10 minutes).
>> 
>> Survey URL: https://goo.gl/forms/JZd2CK0EFpCk0c272 
>> 
>> 
>> 
>> ** Background:
>> 
>> While existing fast-reroute mechanisms enable sub-second convergence upon
>> local outages (planned or not), they do not apply to remote outages
>> happening further away from your AS as their detection and protection
>> mechanisms only work locally.
>> 
>> Remote outages therefore mandate a "BGP-only" convergence which tends to be
>> slow, as long streams of BGP UPDATEs (containing up to 100,000s of them) must
>> be propagated router-by-router. Our initial measurements indicate that it can
>> take state-of-the-art BGP routers dozens of seconds to process and propagate
>> these large streams of BGP UPDATEs. During this time, traffic for important
>> destinations can be lost.
> 
> One of the phenomena that is relatively easy to observe by withdrawing a
> prefix entirely is the convergence towards longer and longer AS paths
> until the route disappears entirely. That is, providers that are further
> away will keep advertising the route, and in the interim their
> neighbors will ingest the available path until they too process
> the withdraw. It can take a comically long time (like 5 minutes) to see
> the prefix ultimately disappear from the internet. When withdrawing a
> prefix from a peer with which you have a single adjacency, this can
> easily happen in miniature.

Thanks! Yes, definitely. This relates to the issue Baldur was raising in which
a less-preferred prefix (or no prefix at all in your case) has to take over
from a more preferred one. That case is definitely bad for BGP convergence.

Our survey/study is more geared towards cases where there is diversity
available (alternate paths are there and at least partially visible). We are
especially interested in finding out whether, even when you take all the
precautionary measures required by the book, long BGP convergence can still
bite you and… whether we can do anything about it.


Laurent

PS: 

Thanks so much to the 21 operators who have answered already! If you haven’t
done so already, please help us find out about troublesome BGP convergence by
answering our short anonymous survey (<10 minutes):
https://goo.gl/forms/JZd2CK0EFpCk0c272 




Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Laurent Vanbever
Dear Baldur,

> I find that the type of outage that affects our network the most is neither 
> of the two options you describe. As is probably typical for smaller networks, 
> we do not have redundant uplinks to all of our transits. If a transit link 
> goes, for example because we had to reboot a router, traffic is supposed to 
> reroute to the remaining transit links. Internally our network handles this 
> fairly fast for egress traffic.
> 
> However the problem is the ingress traffic - it can be 5 to 15 minutes before 
> everything has settled down. This is the time before everyone else on the 
> internet has processed that they will have to switch to your alternate 
> transit.

Thanks a lot for your input. Indeed, that case is a bit special. I’d say it is
a kind of remote outage that remote ASes experience towards your prefix and, as
such, requires a “BGP-only” convergence. I guess if your prefixes going via
alternate transit are not visible at all prior to the switch (and I guess not),
this is a kind of “extreme” convergence where routes have to be
withdrawn/updated Internet-wide. This reminds me of the paper by Craig Labovitz
et al.
(http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-5-2.pdf),
which I think classifies these events as Tlong (“An active route with a short
ASPath is implicitly replaced with a new route possessing a longer ASPath. This
represents both a route failure and failover”). And indeed, these are the
second slowest, with only the Internet-wide withdraw of a prefix being slower.
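
For readers who have not seen that taxonomy, here is a minimal Python sketch
of how a route change could be labelled with the paper's four event types
(Tup, Tdown, Tshort, Tlong). The function name, the "other" and "none" labels,
and the AS numbers (taken from the documentation range) are assumptions made
for illustration; the paper itself only defines the four categories.

```python
from typing import Optional, Sequence


def classify_event(old_path: Optional[Sequence[int]],
                   new_path: Optional[Sequence[int]]) -> str:
    """Label a route change by comparing the old and new AS paths.

    `None` means "no route". Replacements with an equal-length path fall
    outside the four categories and are reported as "other" here.
    """
    if old_path is None and new_path is None:
        return "none"      # nothing before, nothing after
    if old_path is None:
        return "Tup"       # previously unavailable route is announced
    if new_path is None:
        return "Tdown"     # previously available route is withdrawn
    if len(new_path) > len(old_path):
        return "Tlong"     # short path implicitly replaced by a longer one
    if len(new_path) < len(old_path):
        return "Tshort"    # long path implicitly replaced by a shorter one
    return "other"


if __name__ == "__main__":
    primary = [64500, 64496]          # via the preferred transit
    backup = [64501, 64502, 64496]    # via the alternate transit
    print(classify_event(primary, backup))   # failover to the longer path
    print(classify_event(primary, None))     # prefix withdrawn everywhere
```

In these terms, losing the primary transit looks like a Tlong event to remote
ASes, while withdrawing the prefix entirely is a Tdown.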

You’re right that our survey targets more the case in which large bursts of
UPDATEs/WITHDRAWs are exchanged. I guess a parallel case to the one you mention
could be that your prime transit performs a planned maintenance (or experiences
a failure) that triggers WITHDRAWs for your prefixes.

> The only solution I know of is to have redundant links to all transits. Going 
> forward I will make sure we have this because it is a huge disadvantage not 
> being able to take a router out of service without causing downtime for all 
> users. Not to mention that a router crash or link failure that should have 
> taken seconds at most to reroute, but instead causes at least 5 minutes of 
> unstable internet.

Maybe you could advertise better routes (i.e., with shorter AS-PATHs/longer
prefixes) via the alternate transit prior to the takedown? Ideally, if you
could somehow make your primary transit switch to an alternate transit
prior to the maintenance (maybe with a special community?), you could
completely avoid a disruption. This would go in the direction of minimizing
the number of WITHDRAWs in favor of UPDATEs. But, of course, this would only
work in the case of planned maintenance.

We would definitely welcome more input on the convergence issue you face!

Best,
Laurent

Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Mike Jones
On 10 January 2017 at 19:58, Job Snijders  wrote:
> On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
>> If a transit link goes, for example because we had to reboot a router,
>> traffic is supposed to reroute to the remaining transit links.
>> Internally our network handles this fairly fast for egress traffic.
>>
>> However the problem is the ingress traffic - it can be 5 to 15 minutes
>> before everything has settled down. This is the time before everyone
>> else on the internet has processed that they will have to switch to
>> your alternate transit.
>>
>> The only solution I know of is to have redundant links to all transits.
>
> Alternatively, if you reboot a router, perhaps you could first shutdown
> the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
> away (should be visible in your NMS stats), and then proceed with the
> maintenance?
>
> Of course this only works for planned reboots, not surprise reboots.
>
> Kind regards,
>
> Job

If I tear down my eBGP sessions the upstream router withdraws the
route and the traffic just stops. Are your upstreams propagating
withdraws without actually updating their own routing tables?

I believe the simple explanation of the problem can be seen by firing
up an inbound mtr from a distant network then withdrawing the route
from the path it is taking. It should show either destination
unreachable or a routing loop which "retreats" (under the right
circumstances I have observed it distinctly move 1 hop at a time)
until it finds an alternate path.

My observed convergence times for a single withdraw are however in the
sub-10 second range, to get all the networks in the original path
pointing at a new one. My view on the problem is that if you are
failing over frequently enough for a customer to notice and report it,
you have bigger problems than convergence times.

- Mike Jones


Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Jared Mauch

> On Jan 10, 2017, at 3:14 PM, Hugo Slabbert  wrote:
> 
> 
> On Tue 2017-Jan-10 20:58:02 +0100, Job Snijders  wrote:
> 
>> On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
>>> If a transit link goes, for example because we had to reboot a router,
>>> traffic is supposed to reroute to the remaining transit links.
>>> Internally our network handles this fairly fast for egress traffic.
>>> 
>>> However the problem is the ingress traffic - it can be 5 to 15 minutes
>>> before everything has settled down. This is the time before everyone
>>> else on the internet has processed that they will have to switch to
>>> your alternate transit.
>>> 
>>> The only solution I know of is to have redundant links to all transits.
>> 
>> Alternatively, if you reboot a router, perhaps you could first shutdown
>> the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
>> away (should be visible in your NMS stats), and then proceed with the
>> maintenance?
>> 
>> Of course this only works for planned reboots, not surprise reboots.
> 
> ...or link failures.

One other comment:

There has been a long history of poorly behaving BGP stacks that would
take quite some time to hunt through the paths.  While this can still
occur for people with near-ancient software and hardware still in use,
many of the modern software/hardware options enable things like BGP-PIC
(in your survey) by default.

Many of the options you document as best practices, like path MTU discovery,
are well-known fixes for networks, as is using jumbo MTU internally to
obtain 9k+ MSS for high-performance TCP.  Vendors have not always chosen to
enable the TCP options by default the way they have for protocol features
(e.g. BGP-PIC) and, as in Jakob’s response, tout other solutions instead of
fixing the TCP stack first.

Many of these performance fixes were documented in 2002 and are considered best
practices by many networks, but because the knobs are obscure they may not be
widely deployed, or are seen as risky to configure.  (We had a
vendor panic when we discovered a bug in their TCP-SACK code; they were
almost frozen in not fixing the code because touching TCP felt dangerous
and there was an inadequate testing culture around something seen as ‘stable’.)

Here’s the presentation from IETF 53; I don’t see it in the proceedings handily:

http://morse.colorado.edu/~epperson/courses/routing-protocols/handouts/bgp_scalability_IETF.ppt

- Jared



Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Hugo Slabbert


On Tue 2017-Jan-10 20:58:02 +0100, Job Snijders  wrote:

> On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
>> If a transit link goes, for example because we had to reboot a router,
>> traffic is supposed to reroute to the remaining transit links.
>> Internally our network handles this fairly fast for egress traffic.
>>
>> However the problem is the ingress traffic - it can be 5 to 15 minutes
>> before everything has settled down. This is the time before everyone
>> else on the internet has processed that they will have to switch to
>> your alternate transit.
>>
>> The only solution I know of is to have redundant links to all transits.
>
> Alternatively, if you reboot a router, perhaps you could first shut down
> the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
> away (should be visible in your NMS stats), and then proceed with the
> maintenance?
>
> Of course this only works for planned reboots, not surprise reboots.

...or link failures.

> Kind regards,
>
> Job


--
Hugo Slabbert   | email, xmpp/jabber: h...@slabnet.com
pgp key: B178313E   | also on Signal




Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Job Snijders
On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
> If a transit link goes, for example because we had to reboot a router,
> traffic is supposed to reroute to the remaining transit links.
> Internally our network handles this fairly fast for egress traffic.
>
> However the problem is the ingress traffic - it can be 5 to 15 minutes
> before everything has settled down. This is the time before everyone
> else on the internet has processed that they will have to switch to
> your alternate transit.
>
> The only solution I know of is to have redundant links to all transits.

Alternatively, if you reboot a router, perhaps you could first shut down
the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
away (should be visible in your NMS stats), and then proceed with the
maintenance?

Of course this only works for planned reboots, not surprise reboots.
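
The "wait for the traffic to drain" step can be scripted. Below is a minimal
Python sketch, assuming you already have some way to read the current ingress
rate on the link from your NMS. That reader is passed in as the hypothetical
`read_bps` callable; the threshold, timeout and polling interval are
placeholder values, not recommendations.

```python
import time
from typing import Callable


def wait_for_drain(read_bps: Callable[[], float],
                   threshold_bps: float,
                   timeout_s: float = 600.0,
                   poll_s: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll the traffic rate until it is at or below `threshold_bps`.

    Returns True once the link has drained, False if `timeout_s` elapses
    first. `sleep` and `clock` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if read_bps() <= threshold_bps:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Demo with canned samples and a no-op sleep, so nothing actually waits.
    samples = iter([9e8, 4e8, 2e7, 1e6])
    drained = wait_for_drain(lambda: next(samples), threshold_bps=5e6,
                             sleep=lambda seconds: None)
    print("drained" if drained else "timed out; investigate before rebooting")
```

In practice you would shut down the eBGP sessions, call something like this
with a reader that queries your own counters, and only proceed with the
maintenance once it returns True.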

Kind regards,

Job


RE: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-10 Thread Jakob Heitz (jheitz)
Hi Baldur,

Have you tried graceful shutdown?
You need redundant links, but not to the same transit.
https://tools.ietf.org/html/draft-ietf-grow-bgp-gshut-06
This draft is expired, but it is actually implemented by several vendors.

I implemented this.
http://www.slideshare.net/bduvivie/bgp-graceful-shutdown-ios-xr
I added an option to configure AS-path prepends in case the gshut community was 
not supported by peers.
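
As an illustration of the two mechanisms mentioned here, the following Python
sketch models the receiver-side handling of the gshut community and the
sender-side AS-path prepend fallback. It is a toy model, not the IOS-XR
implementation: the `Route` class, the function names and the two-step
best-path comparison are all hypothetical, the prefix and AS numbers come from
the documentation ranges, and the community value 65535:0 is the one later
standardized as GRACEFUL_SHUTDOWN in RFC 8326, which is an assumption relative
to the draft linked above.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterable, Tuple

GSHUT = (65535, 0)  # assumption: GRACEFUL_SHUTDOWN value from RFC 8326


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]
    local_pref: int = 100
    communities: FrozenSet[Tuple[int, int]] = frozenset()


def apply_gshut_inbound(route: Route) -> Route:
    """Receiver side: make a gshut-tagged route the least preferred one."""
    if GSHUT in route.communities:
        return replace(route, local_pref=0)
    return route


def prepend(as_path: Tuple[int, ...], local_as: int, count: int) -> Tuple[int, ...]:
    """Sender-side fallback: lengthen the AS path by `count` extra copies."""
    return (local_as,) * count + as_path


def best(routes: Iterable[Route]) -> Route:
    """Toy best-path: highest local-pref first, then shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


if __name__ == "__main__":
    primary = Route("192.0.2.0/24", (64500, 64496))
    backup = Route("192.0.2.0/24", (64501, 64502, 64496))
    print(best([primary, backup]).as_path)   # primary wins before maintenance

    # The peer honours the community: traffic moves before the session drops.
    tagged = replace(primary, communities=frozenset({GSHUT}))
    print(best([apply_gshut_inbound(tagged), backup]).as_path)

    # The peer ignores the community: fall back to prepending our own AS.
    padded = Route("192.0.2.0/24", (64500,) + prepend((64496,), 64496, 3))
    print(best([padded, backup]).as_path)
```

Either way the backup path becomes the preferred one while the old path is
still usable, so neighbors switch with an UPDATE rather than losing the route
and hunting for an alternative.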

Thanks,
Jakob.


> Date: Tue, 10 Jan 2017 03:51:04 +0100
> From: Baldur Norddahl 
> 
> Hello
> 
> I find that the type of outage that affects our network the most is
> neither of the two options you describe. As is probably typical for
> smaller networks, we do not have redundant uplinks to all of our
> transits. If a transit link goes, for example because we had to reboot a
> router, traffic is supposed to reroute to the remaining transit links.
> Internally our network handles this fairly fast for egress traffic.
> 
> However the problem is the ingress traffic - it can be 5 to 15 minutes
> before everything has settled down. This is the time before everyone
> else on the internet has processed that they will have to switch to your
> alternate transit.
> 
> The only solution I know of is to have redundant links to all transits.
> Going forward I will make sure we have this because it is a huge
> disadvantage not being able to take a router out of service without
> causing downtime for all users. Not to mention that a router crash or
> link failure that should have taken seconds at most to reroute, but
> instead causes at least 5 minutes of unstable internet.
> 
> Regards,
> 
> Baldur


Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-09 Thread joel jaeggli
On 1/9/17 2:56 PM, Laurent Vanbever wrote:
> Hi NANOG,
> 
> We often read that the Internet (i.e. BGP) is "slow to converge". But how slow
> is it really? Do you care anyway? And can we (researchers) do anything about
> it?
> Please help us out to find out by answering our short anonymous survey
> (<10 minutes).
> 
> Survey URL: https://goo.gl/forms/JZd2CK0EFpCk0c272 
> 
> 
> 
> ** Background:
> 
> While existing fast-reroute mechanisms enable sub-second convergence upon 
> local outages (planned or not), they do not apply to remote outages happening 
> further away from your AS as their detection and protection mechanisms only 
> work locally.
> 
> Remote outages therefore mandate a "BGP-only" convergence which tends to be
> slow, as long streams of BGP UPDATEs (containing up to 100,000s of them) must
> be propagated router-by-router. Our initial measurements indicate that it can
> take state-of-the-art BGP routers dozens of seconds to process and propagate
> these large streams of BGP UPDATEs. During this time, traffic for important
> destinations can be lost.

One of the phenomena that is relatively easy to observe by withdrawing a
prefix entirely is the convergence towards longer and longer AS paths
until the route disappears entirely. That is, providers that are further
away will keep advertising the route, and in the interim their
neighbors will ingest the available path until they too process
the withdraw. It can take a comically long time (like 5 minutes) to see
the prefix ultimately disappear from the internet. When withdrawing a
prefix from a peer with which you have a single adjacency, this can
easily happen in miniature.
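
The "longer and longer AS paths" effect can be reproduced with a toy
path-vector simulation. The Python sketch below is a deliberately simplified
model: it assumes synchronous advertisement rounds, shortest-path preference
with no policy, no MRAI timer, and a tiny made-up topology (origin 0 attached
only to node 1, with nodes 1, 2 and 3 fully meshed). All names are
hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Path = Tuple[int, ...]
Rib = Dict[int, Dict[int, Optional[Path]]]


def best_path(node: int, rib_in: Rib, origin: int, originating: bool) -> Optional[Path]:
    """Shortest loop-free path heard from any neighbor (None if no route)."""
    if node == origin:
        return () if originating else None
    candidates = [p for p in rib_in[node].values() if p is not None and node not in p]
    return min(candidates, key=lambda p: (len(p), p)) if candidates else None


def step(neighbors: Dict[int, List[int]], rib_in: Rib, origin: int, originating: bool) -> Rib:
    """One synchronous round: every node advertises its current best path."""
    adverts: Dict[int, Optional[Path]] = {}
    for node in neighbors:
        chosen = best_path(node, rib_in, origin, originating)
        adverts[node] = None if chosen is None else (node,) + chosen
    return {n: {m: adverts[m] for m in neighbors[n]} for n in neighbors}


def simulate_withdraw(neighbors: Dict[int, List[int]], origin: int,
                      max_rounds: int = 100) -> List[Dict[int, Optional[Path]]]:
    """Converge with the prefix announced, then withdraw it.

    Returns the best path of every non-origin node after each round that
    follows the withdraw, stopping once nobody has a route left.
    """
    rib_in: Rib = {n: {m: None for m in neighbors[n]} for n in neighbors}
    for _ in range(max_rounds):              # phase 1: announce and converge
        updated = step(neighbors, rib_in, origin, True)
        if updated == rib_in:
            break
        rib_in = updated
    history = []
    for _ in range(max_rounds):              # phase 2: withdraw and watch
        rib_in = step(neighbors, rib_in, origin, False)
        bests = {n: best_path(n, rib_in, origin, False) for n in neighbors if n != origin}
        history.append(bests)
        if all(path is None for path in bests.values()):
            break
    return history


if __name__ == "__main__":
    topology = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
    for round_no, bests in enumerate(simulate_withdraw(topology, origin=0), start=1):
        print(round_no, bests)
```

In this small example node 2 first keeps its path via node 1, then switches to
the longer, already dead path via node 3, and only then loses the route. With
a larger mesh and a real MRAI timer, each of those rounds can cost tens of
seconds.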

> 
> ** This survey:
> 
> This survey aims at evaluating the impact of slow BGP convergence on
> operational practices. We expect the findings to increase the understanding of
> the perceived BGP convergence in the Internet, which could then help
> researchers to design better fast-reroute mechanisms.
> 
> We expect the questionnaire to be filled out by network operators whose job
> relates to BGP operations. It has a total of 17 questions and should take
> less than 10 minutes to answer. The survey and the collected data are
> anonymous (so please do *not* include information that may help to identify
> you or your organization).
> All questions are optional, so if you don't like a question or don't know the
> answer, please skip it.
> 
> A summary of the aggregate results will be published as a part of a scientific
> article later this year.
> 
> Thank you so much in advance, and we look forward to reading your responses!
> 
> 
> Laurent Vanbever (ETH Zürich, Switzerland)
> 
> 
> PS: It goes without saying that we would also be extremely grateful if you
> could forward this email to any operator you might know who may not read NANOG.
> 






Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

2017-01-09 Thread Baldur Norddahl

Hello

I find that the type of outage that affects our network the most is 
neither of the two options you describe. As is probably typical for 
smaller networks, we do not have redundant uplinks to all of our 
transits. If a transit link goes, for example because we had to reboot a 
router, traffic is supposed to reroute to the remaining transit links. 
Internally our network handles this fairly fast for egress traffic.


However the problem is the ingress traffic - it can be 5 to 15 minutes 
before everything has settled down. This is the time before everyone 
else on the internet has processed that they will have to switch to your 
alternate transit.


The only solution I know of is to have redundant links to all transits.
Going forward I will make sure we have this because it is a huge
disadvantage not being able to take a router out of service without
causing downtime for all users. Not to mention a router crash or
link failure, which should take seconds at most to reroute but
instead causes at least 5 minutes of unstable internet.


Regards,

Baldur


On 09/01/2017 at 23:56, Laurent Vanbever wrote:

> Hi NANOG,
>
> We often read that the Internet (i.e. BGP) is "slow to converge". But how slow
> is it really? Do you care anyway? And can we (researchers) do anything about it?
> Please help us find out by answering our short anonymous survey
> (<10 minutes).
>
> Survey URL: https://goo.gl/forms/JZd2CK0EFpCk0c272
>
>
> ** Background:
>
> While existing fast-reroute mechanisms enable sub-second convergence upon
> local outages (planned or not), they do not apply to remote outages happening
> further away from your AS as their detection and protection mechanisms only
> work locally.
>
> Remote outages therefore mandate a "BGP-only" convergence which tends to be
> slow, as long streams of BGP UPDATEs (containing up to 100,000s of them) must
> be propagated router-by-router. Our initial measurements indicate that it can
> take state-of-the-art BGP routers dozens of seconds to process and propagate
> these large streams of BGP UPDATEs. During this time, traffic for important
> destinations can be lost.
>
>
> ** This survey:
>
> This survey aims at evaluating the impact of slow BGP convergence on
> operational practices. We expect the findings to increase the understanding of
> the perceived BGP convergence in the Internet, which could then help
> researchers to design better fast-reroute mechanisms.
>
> We expect the questionnaire to be filled out by network operators whose job
> relates to BGP operations. It has a total of 17 questions and should take
> less than 10 minutes to answer. The survey and the collected data are
> anonymous (so please do *not* include information that may help to identify
> you or your organization).
> All questions are optional, so if you don't like a question or don't know the
> answer, please skip it.
>
> A summary of the aggregate results will be published as a part of a scientific
> article later this year.
>
> Thank you so much in advance, and we look forward to reading your responses!
>
>
> Laurent Vanbever (ETH Zürich, Switzerland)
>
>
> PS: It goes without saying that we would also be extremely grateful if you
> could forward this email to any operator you might know who may not read NANOG.