RE: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-14 Thread Evan Moore
While this is all true, and I'm always willing to forgive honest errors 
accompanied by sincere admissions, my recent Level 3 experience (beginning 
prior to the 12th) has strongly biased me toward the third option Niels meant: 
serious lack of clue.  I've had multiple tickets open over several weeks 
inquiring why Level 3 is announcing several /24s out of 8/8 to peers, and I 
keep getting told it's my fault.  Supposedly my tickets have gone upstream to 
higher levels, but nothing changes and the answers I get are wrong.

If anyone with clue at Level 3 would like to redeem my faith, please get in 
touch.

ERM

Evan R Moore
Network Engineer and Bitwrangler
Sovernet Communications
emo...@sover.net


-----Original Message-----
From: NANOG [mailto:nanog-boun...@nanog.org] On Behalf Of Jared Mauch
Sent: Sunday, June 14, 2015 10:38 AM
To: Stephen Satchell
Cc: nanog@nanog.org
Subject: Re: Open letter to Level3 concerning the global routing issues on June 
12th

There are lots of options from failure to follow procedure to software defect 
amongst others. We are all human, except for my coworker the Troy-bot-3000. 
Even well intentioned and motivated people have bad things happen to them. 

What I look for in these incidents is what can be learned and improved upon. 

If you are motivated about the routing manifesto please join the mailing list. 

Thanks,

Jared Mauch

> On Jun 14, 2015, at 10:33 AM, Stephen Satchell  wrote:
> 
>> On 06/14/2015 07:06 AM, Niels Bakker wrote:
>> * raf...@gav.ufsc.br (Rafael Possamai) [Sun 14 Jun 2015, 04:54 CEST]:
>>> This was either an isolated incident or they really don't care much.
>> 
>> Have you considered the third option?
> 
> Third option?


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-14 Thread Jared Mauch
There are lots of options from failure to follow procedure to software defect 
amongst others. We are all human, except for my coworker the Troy-bot-3000. 
Even well intentioned and motivated people have bad things happen to them. 

What I look for in these incidents is what can be learned and improved upon. 

If you are motivated about the routing manifesto please join the mailing list. 

Thanks,

Jared Mauch

> On Jun 14, 2015, at 10:33 AM, Stephen Satchell  wrote:
> 
>> On 06/14/2015 07:06 AM, Niels Bakker wrote:
>> * raf...@gav.ufsc.br (Rafael Possamai) [Sun 14 Jun 2015, 04:54 CEST]:
>>> This was either an isolated incident or they really don't care much.
>> 
>> Have you considered the third option?
> 
> Third option?


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-14 Thread Stephen Satchell

On 06/14/2015 07:06 AM, Niels Bakker wrote:
> * raf...@gav.ufsc.br (Rafael Possamai) [Sun 14 Jun 2015, 04:54 CEST]:
>> This was either an isolated incident or they really don't care much.
>
> Have you considered the third option?

Third option?



Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-14 Thread Niels Bakker

* raf...@gav.ufsc.br (Rafael Possamai) [Sun 14 Jun 2015, 04:54 CEST]:
> A lot of these things are for show only.. Like a big corporation
> donating to non-profits and sponsoring "feel good" events.

Donating costs actual money, unlike putting a statement on a webpage.

> This was either an isolated incident or they really don't care much.

Have you considered the third option?


-- Niels.


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Rafael Possamai
A lot of these things are for show only.. Like a big corporation donating
to non-profits and sponsoring "feel good" events. You can see that a lot of
these same businesses also lobby Washington like crazy, so there you go...
This was either an isolated incident or they really don't care much.

On Sat, Jun 13, 2015 at 1:54 PM, Hank Nussbacher 
wrote:

> At 17:32 12/06/2015 +0200, Martin Millnert wrote:
>
> Interesting that Level3 is a member of http://www.routingmanifesto.org/
>
> or see
>
>
> http://www.internetsociety.org/news/network-operators-around-world-demonstrate-their-commitment-secure-and-resilient-internet
>
> to quote Level3
> "As one of the most connected Internet providers in the world, security of
> the Internet is top-of-mind at Level 3 Communications. We are dedicated to
> supporting and protecting the Internet ecosystem and work each day to
> safeguard customers' critical communications. The Internet is a shared
> responsibility, and only through these important collaborative efforts can
> we continue to ensure the protection of this collective infrastructure."
>
> -Hank
>
>
>  Dear Level3,
>>
>> The Internet is a cooperative effort, and it works well only when its
>> participants take constructive actions to address errors and remedy
>> problems.
>> Your position as a major Internet Carrier bestows upon you a certain
>> degree of responsibility for the correct operation of the Internet all
>> across (and beyond) the planet. You have many customers. Customers will
>> always occasionally make mistakes. You as a major Internet Carrier have
>> a responsibility to limit, not amplify, your customers' mistakes.
>> Other major carriers implement technical measures that severely limit
>> the damage from customer mistakes and keep it from having global impact.
>> Other major carriers also implement operational procedures in addition
>> to technical measures.
>> In combination, these measures drastically reduce the outage-hours as a
>> result of customer configuration errors.
>>
>> At 08:44 UTC on Friday 12th of June, one of your transit customers,
>> Telekom Malaysia (AS4788) began announcing the full Internet table back
>> to you, which you accepted and propagated to your peers and customers,
>> causing global outages for close to 3 hours.
>> [ https://twitter.com/DynResearch/status/609340592036970496 ]
>> During this 3 hour window, it appears (from your own service outage
>> reports) that you did nothing to stop the global Internet outage, but
>> that Telekom Malaysia themselves eventually resolved it. This lack of
>> action on your end, and your disregard for the correct operation of the
>> global Internet is astonishing. These mistakes do not need to happen.
>> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
>> Internet. You accepted multiple hundred thousand prefixes from them - a
>> max prefix setting would have severely limited the damage. We expect
>> that these are your practices as well, but they failed. When they do, it
>> should not take ~3 hours to shut down the session(s).
>>
>> Many operators, in despair, turned down their peering sessions with you
>> once it was clear you were causing the outages and no immediate fix was
>> in sight. This improved the situation for some - but not all did. Had
>> you deployed proper IRR-filtering to filter the bad announcements the
>> impact would've been far less critical.
>>
>> As a direct consequence of your ~3 hours of inaction, as a local
>> example, Swedish payment terminals were experiencing problems all over
>> the country. The Swedish economy was directly affected by your inaction.
>> There were queues when I was buying lunch! Imagine the food rage. The
>> situation was probably similar at other places around the globe where
>> people were awake.
>>
>> Operators around the planet are curious:
>>   - Did Level3 not detect or understand that it was causing global
>> Internet outages for ~3 hours?
>>   - If Level3 did in fact detect or understand it was causing global
>> Internet outages, why did it not properly and immediately remedy the
>> situation?
>>   - What is Level3 going to do to address these questions and begin work
>> on restoring its credibility as a carrier?
>>
>> We all understand that mistakes do happen (in applying customer
>> interface templates, etc.). However the Internet is all too pervasive in
>> everyday life today for anything but swift action by carriers to remedy
>> breakage after the fact. It is absolutely not sufficient to let a
>> customer spend 3 hours to detect and fix a situation like this one. It
>> is unacceptable that no swift action was taken on your end to limit the
>> global routing issues you caused.
>>
>> Sincerely,
>> Martin Millnert
>> Member of Internet Community - no carrier / ISP affiliation.
>>
>
>


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Hank Nussbacher

At 17:32 12/06/2015 +0200, Martin Millnert wrote:

Interesting that Level3 is a member of http://www.routingmanifesto.org/

or see

http://www.internetsociety.org/news/network-operators-around-world-demonstrate-their-commitment-secure-and-resilient-internet

to quote Level3
"As one of the most connected Internet providers in the world, security of 
the Internet is top-of-mind at Level 3 Communications. We are dedicated to 
supporting and protecting the Internet ecosystem and work each day to 
safeguard customers' critical communications. The Internet is a shared 
responsibility, and only through these important collaborative efforts can 
we continue to ensure the protection of this collective infrastructure."


-Hank


Dear Level3,

The Internet is a cooperative effort, and it works well only when its
participants take constructive actions to address errors and remedy
problems.
Your position as a major Internet Carrier bestows upon you a certain
degree of responsibility for the correct operation of the Internet all
across (and beyond) the planet. You have many customers. Customers will
always occasionally make mistakes. You as a major Internet Carrier have
a responsibility to limit, not amplify, your customers' mistakes.
Other major carriers implement technical measures that severely limit
the damage from customer mistakes and keep it from having global impact.
Other major carriers also implement operational procedures in addition
to technical measures.
In combination, these measures drastically reduce the outage-hours as a
result of customer configuration errors.

At 08:44 UTC on Friday 12th of June, one of your transit customers,
Telekom Malaysia (AS4788) began announcing the full Internet table back
to you, which you accepted and propagated to your peers and customers,
causing global outages for close to 3 hours.
[ https://twitter.com/DynResearch/status/609340592036970496 ]
During this 3 hour window, it appears (from your own service outage
reports) that you did nothing to stop the global Internet outage, but
that Telekom Malaysia themselves eventually resolved it. This lack of
action on your end, and your disregard for the correct operation of the
global Internet is astonishing. These mistakes do not need to happen.
AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
Internet. You accepted multiple hundred thousand prefixes from them - a
max prefix setting would have severely limited the damage. We expect
that these are your practices as well, but they failed. When they do, it
should not take ~3 hours to shut down the session(s).

Many operators, in despair, turned down their peering sessions with you
once it was clear you were causing the outages and no immediate fix was
in sight. This improved the situation for some - but not all did. Had
you deployed proper IRR-filtering to filter the bad announcements the
impact would've been far less critical.

As a direct consequence of your ~3 hours of inaction, as a local
example, Swedish payment terminals were experiencing problems all over
the country. The Swedish economy was directly affected by your inaction.
There were queues when I was buying lunch! Imagine the food rage. The
situation was probably similar at other places around the globe where
people were awake.

Operators around the planet are curious:
  - Did Level3 not detect or understand that it was causing global
Internet outages for ~3 hours?
  - If Level3 did in fact detect or understand it was causing global
Internet outages, why did it not properly and immediately remedy the
situation?
  - What is Level3 going to do to address these questions and begin work
on restoring its credibility as a carrier?

We all understand that mistakes do happen (in applying customer
interface templates, etc.). However the Internet is all too pervasive in
everyday life today for anything but swift action by carriers to remedy
breakage after the fact. It is absolutely not sufficient to let a
customer spend 3 hours to detect and fix a situation like this one. It
is unacceptable that no swift action was taken on your end to limit the
global routing issues you caused.

Sincerely,
Martin Millnert
Member of Internet Community - no carrier / ISP affiliation.




Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Rubens Kuhl
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers,
> Telekom Malaysia (AS4788) began announcing the full Internet table back
> to you, which you accepted and propagated to your peers and customers,
> causing global outages for close to 3 hours.
>

One thing of note is that the AS paths were really not short, so some kind of
local preference must have been in place. Although it's usual to apply local
preference to transit customers, it's probably wise to do so only for
prefixes that belong to the customer or are registered in IRRs. So, if someone
does not want to filter prefixes from customers, they could at least avoid
applying a higher preference to all such prefixes. Focus on the known prefixes
and let AS path length sort out those weird paths.
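
To make the suggestion concrete, here is a minimal sketch in Python. It is not anyone's production policy: the local preference values, the ASNs, the prefixes (all from documentation ranges) and the function names are assumptions for illustration, and real BGP best-path selection has more steps than the two compared here.

```python
import ipaddress

CUSTOMER_LOCAL_PREF = 200  # hypothetical value for known customer prefixes
DEFAULT_LOCAL_PREF = 100   # hypothetical value for everything else

# Hypothetical list of prefixes registered for the customer (e.g. from an IRR).
KNOWN_CUSTOMER_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]


def local_pref(prefix: str) -> int:
    """Raise local preference only for prefixes known to belong to the customer."""
    net = ipaddress.ip_network(prefix)
    known = any(
        net.version == k.version and net.subnet_of(k)
        for k in KNOWN_CUSTOMER_PREFIXES
    )
    return CUSTOMER_LOCAL_PREF if known else DEFAULT_LOCAL_PREF


def better(route_a: dict, route_b: dict) -> dict:
    """Simplified selection: higher local preference, then shorter AS path."""
    return max(route_a, route_b,
               key=lambda r: (r["local_pref"], -len(r["as_path"])))


# A leaked prefix arrives from the customer with a long path and from a peer
# with a short one. It is not a known customer prefix, so neither copy gets a
# boost and the shorter path wins.
leaked = "192.0.2.0/24"
via_customer = {"via": "customer", "as_path": [64500, 64501, 64502, 64503],
                "local_pref": local_pref(leaked)}
via_peer = {"via": "peer", "as_path": [64510, 64511],
            "local_pref": DEFAULT_LOCAL_PREF}
print(better(via_customer, via_peer)["via"])  # peer
```

With a blanket customer preference of 200, the long leaked path would have won instead.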


Rubens


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Gavin Henry
> Actually I had pretty good experiences with Level3, as for years they have
> been able to use IRR filters to update your prefix list automatically. I
> remember that Level3 was one of the first carriers to enable that feature
> and several years afterwards there were still global networks (tier1) that
> could only do static prefix-lists.
>

It is weird, as they don't take new announcements without there being an
inetnum object entry in an IRR.

Maybe that's just for us small guys?

http://www.surevoip.co.uk (AS199659)


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Justin M. Streiner

On Sat, 13 Jun 2015, Mark Tinka wrote:


> For peering and customers, we set a default prefix limit value for IPv4
> and IPv6. We only change this if the peer/customer informs us that they
> will announce a lot more than what we've configured. We add some % to
> cover for "sudden" growth, but not too much to impact the network.
>
> For customers, we add prefix lists and AS_PATH filters as mandatory.
>
> I'm sure others do the same. It would be good if we all did.
>
> I know the largest transit providers tend to be more relaxed for various
> reasons. Some rely on filters generated by IRR entries, others don't.
>
> A lot more work is needed, indeed. It's not 2008 anymore...


At my previous job (regional ISP with a decent amount of BGP-speaking 
downstream customers), we did prefix and AS-PATH filtering on all customer 
sessions.  The only thing lacking at that time (1997-2004) was a decent 
way to automate changes - everything was pretty manual.  That said, it 
kept issues caused by customers leaking routes back to us down to pretty 
much nil.


jms


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Rafael Possamai
Something about Malaysia, first the airplanes... now BGP leaks?

On Fri, Jun 12, 2015 at 10:32 AM, Martin Millnert 
wrote:

> Dear Level3,
>
> The Internet is a cooperative effort, and it works well only when its
> participants take constructive actions to address errors and remedy
> problems.
> Your position as a major Internet Carrier bestows upon you a certain
> degree of responsibility for the correct operation of the Internet all
> across (and beyond) the planet. You have many customers. Customers will
> always occasionally make mistakes. You as a major Internet Carrier have
> a responsibility to limit, not amplify, your customers' mistakes.
> Other major carriers implement technical measures that severely limit
> the damage from customer mistakes and keep it from having global impact.
> Other major carriers also implement operational procedures in addition
> to technical measures.
> In combination, these measures drastically reduce the outage-hours as a
> result of customer configuration errors.
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers,
> Telekom Malaysia (AS4788) began announcing the full Internet table back
> to you, which you accepted and propagated to your peers and customers,
> causing global outages for close to 3 hours.
> [ https://twitter.com/DynResearch/status/609340592036970496 ]
> During this 3 hour window, it appears (from your own service outage
> reports) that you did nothing to stop the global Internet outage, but
> that Telekom Malaysia themselves eventually resolved it. This lack of
> action on your end, and your disregard for the correct operation of the
> global Internet is astonishing. These mistakes do not need to happen.
> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
> Internet. You accepted multiple hundred thousand prefixes from them - a
> max prefix setting would have severely limited the damage. We expect
> that these are your practices as well, but they failed. When they do, it
> should not take ~3 hours to shut down the session(s).
>
> Many operators, in despair, turned down their peering sessions with you
> once it was clear you were causing the outages and no immediate fix was
> in sight. This improved the situation for some - but not all did. Had
> you deployed proper IRR-filtering to filter the bad announcements the
> impact would've been far less critical.
>
> As a direct consequence of your ~3 hours of inaction, as a local
> example, Swedish payment terminals were experiencing problems all over
> the country. The Swedish economy was directly affected by your inaction.
> There were queues when I was buying lunch! Imagine the food rage. The
> situation was probably similar at other places around the globe where
> people were awake.
>
> Operators around the planet are curious:
>   - Did Level3 not detect or understand that it was causing global
> Internet outages for ~3 hours?
>   - If Level3 did in fact detect or understand it was causing global
> Internet outages, why did it not properly and immediately remedy the
> situation?
>   - What is Level3 going to do to address these questions and begin work
> on restoring its credibility as a carrier?
>
> We all understand that mistakes do happen (in applying customer
> interface templates, etc.). However the Internet is all too pervasive in
> everyday life today for anything but swift action by carriers to remedy
> breakage after the fact. It is absolutely not sufficient to let a
> customer spend 3 hours to detect and fix a situation like this one. It
> is unacceptable that no swift action was taken on your end to limit the
> global routing issues you caused.
>
> Sincerely,
> Martin Millnert
> Member of Internet Community - no carrier / ISP affiliation.
>


RE: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Jürgen Jaritsch
The Level3 automatic prefix update feature has been broken for 8-10 months and
they are unable to fix it. I can provide ~10 ticket IDs with several discussions
about the broken feature. We have to open a ticket with them for every new
prefix we want to announce ...


Jürgen Jaritsch
Head of Network & Infrastructure

ANEXIA Internetdienstleistungs GmbH

Telefon: +43-5-0556-300
Telefax: +43-5-0556-500

E-Mail: j...@anexia.at
Web: http://www.anexia.at

Anschrift Hauptsitz Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt
Geschäftsführer: Alexander Windbichler
Firmenbuch: FN 289918a | Gerichtsstand: Klagenfurt | UID-Nummer: AT U63216601


-----Original Message-----
From: Grzegorz Janoszka [grzeg...@janoszka.pl]
Received: Samstag, 13 Juni 2015, 13:51
To: nanog@nanog.org [nanog@nanog.org]
Subject: Re: Open letter to Level3 concerning the global routing issues on June 
12th

On 2015-06-13 12:34, Mark Tinka wrote:
> I know the largest transit providers tend to be more relaxed for various
> reasons. Some rely on filters generated by IRR entries, others don't.

Actually I had pretty good experiences with Level3, as for years they have
been able to use IRR filters to update your prefix list automatically.
I remember that Level3 was one of the first carriers to enable that
feature and several years afterwards there were still global networks
(tier1) that could only do static prefix-lists.

--
Grzegorz Janoszka


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Grzegorz Janoszka

On 2015-06-13 12:34, Mark Tinka wrote:

> I know the largest transit providers tend to be more relaxed for various
> reasons. Some rely on filters generated by IRR entries, others don't.


Actually I had pretty good experiences with Level3, as for years they have
been able to use IRR filters to update your prefix list automatically.
I remember that Level3 was one of the first carriers to enable that 
feature and several years afterwards there were still global networks 
(tier1) that could only do static prefix-lists.


--
Grzegorz Janoszka


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Roland Dobbins

On 13 Jun 2015, at 17:34, Mark Tinka wrote:

> A lot more work is needed, indeed. It's not 2008 anymore...

Nor 1997:



;>

---
Roland Dobbins 


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-13 Thread Mark Tinka


On 12/Jun/15 19:12, Job Snijders wrote:
>  
>
> The simplest protection mechanism of all: maximum prefix limits. If you
> turn up a peer or customer, confirm with them how many routes you should
> expect, add 15% and configure that. 

For peering and customers, we set a default prefix limit value for IPv4
and IPv6. We only change this if the peer/customer informs us that they
will announce a lot more than what we've configured. We add some % to
cover for "sudden" growth, but not too much to impact the network.

For customers, we add prefix lists and AS_PATH filters as mandatory.

I'm sure others do the same. It would be good if we all did.

I know the largest transit providers tend to be more relaxed for various
reasons. Some rely on filters generated by IRR entries, others don't.

A lot more work is needed, indeed. It's not 2008 anymore...

Mark.


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread Jared Mauch

> On Jun 12, 2015, at 1:40 PM, jim deleskie  wrote:
> 
> Todd,
> 
>  One of my few work "regrets" is we were not able to move this forward.
> There was/is lots of value in it.

There are many of us trying to tilt at these topics in various ways.

I know that at $dayjob we try to keep things clean, monitor what’s going on
etc..

I’m happy to dump any ASN into my leak detector stuff here that wants
it:

http://puck.nether.net/bgp/leakinfo.cgi

it only looks for one type of thing, but with “the cloud” it’s much easier
to toss feeds and compute at these things than 10-20 years ago.

I’m always disappointed to find that people just “give up” at a certain
scale in trying to filter things.

I blame many of the vendors for not having the will to fix their BGP
implementations to advertise no routes to a new peer without policy.

I blame vendors for failing to train/test people on filtering routes
as part of their *IE certification.  If you’re an internet expert you
don’t make these errors, or don’t have them occur for such a long duration.

I blame vendors for selling route optimization devices that translate a
regular BGP feed into a garbage feed that can cause global pollution.

Many people don’t understand their IP routing “supply chain” so lines of people
waiting to pay because you can’t swipe your card is the fault of many
people, including the people without cash to cover their food bills.

I can rant all day about this amongst other things.  What have you done
today to improve your routing security?

- Jared



Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread Jared Mauch

> On Jun 12, 2015, at 1:36 PM, Todd Underwood  wrote:
> 
> it's probably far better for everyone in such a situation to simply never
> post anything.  :-/

Yeah it was a bad move trying to equate those two and causes the exact impact
you expect.

:(

- Jared


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread jim deleskie
Todd,

  One of my few work "regrets" is we were not able to move this forward.
There was/is lots of value in it.

Agreed on the posting.

-jim

On Fri, Jun 12, 2015 at 2:36 PM, Todd Underwood  wrote:

> i remember that presentation!
>
> https://www.nanog.org/meetings/abstract?id=459
>
> :-)
>
> On Fri, Jun 12, 2015 at 11:53 AM, jim deleskie  wrote:
>
>> People from big telecom should never reply to mailing lists from work
>> addresses unless specifically allowed, which I suspect TATA doesn't
>> either, based on some direct, but old knowledge :)
>>
>
> indeed, people from big companies who post on mailing lists at all will be
> called out as official representatives of their company no matter what
> address they use, from recent experience.
>
> it's probably far better for everyone in such a situation to simply never
> post anything.  :-/
>
> t
>


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread Todd Underwood
i remember that presentation!

https://www.nanog.org/meetings/abstract?id=459

:-)

On Fri, Jun 12, 2015 at 11:53 AM, jim deleskie  wrote:

> People from big telecom should never reply to mailing lists from work
> addresses unless specifically allowed, which I suspect TATA doesn't either,
> based on some direct, but old knowledge :)
>

indeed, people from big companies who post on mailing lists at all will be
called out as official representatives of their company no matter what
address they use, from recent experience.

it's probably far better for everyone in such a situation to simply never
post anything.  :-/

t


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread Job Snijders
On Fri, Jun 12, 2015 at 12:53:13PM -0300, jim deleskie wrote:
> Filtering has been a community issue since my days @ MCI being AS3561,
> often discussed, not often enough acted on. I suspect the topic has come up
> at every "large" NSP I've worked at.  Frequently someone complains it's
> "hard" to fix, or router X makes it hard to fix, or customer Y won't agree,
> and not enough people stand up to force a fix of the issues.  I did a preso
> on it (while working at TATA) with some other "smart folks" but for all
> the usual reasons it died on the vine.

Next time around put up more of a fight? :-)

In all seriousness, not all hope is lost: even on the crappiest
platforms, an operator can do better than nothing with little effort.

The simplest protection mechanism of all: maximum prefix limits. If you
turn up a peer or customer, confirm with them how many routes you should
expect, add 15% and configure that. 
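
As a rough sketch of that rule of thumb: on a real router this would be a per-neighbor maximum-prefix setting, not a script, and the function names below are made up for illustration. For a customer that normally announces about 1900 prefixes, the arithmetic looks like this:

```python
def max_prefix_limit(expected_routes: int, headroom_percent: int = 15) -> int:
    """Expected route count plus headroom, rounded up (integer math only)."""
    extra = -(-expected_routes * headroom_percent // 100)  # ceiling division
    return expected_routes + extra


def over_limit(received_routes: int, limit: int) -> bool:
    """True when the neighbor has sent more routes than the configured limit."""
    return received_routes > limit


limit = max_prefix_limit(1900)
print(limit)                       # 2185
print(over_limit(1900, limit))     # False
print(over_limit(300_000, limit))  # True (300,000 is an illustrative leak size)
```

A session torn down at the limit carries at most a couple of thousand routes, not a full table.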

In this day and age AS_PATH filters are still underutilized. If you
apply them on egress, they are a very easy way to prevent sending routes
from your upstream to your peers, or accepting your upstream's routes
from peers/customers.
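
The logic of such a filter is small enough to sketch in Python. This is only an illustration of the idea, not router configuration: the ASNs come from the documentation range, and the set of upstreams and the function names are assumptions.

```python
from typing import Iterable

# Hypothetical upstream (transit provider) ASNs, from the documentation range.
UPSTREAM_ASNS = frozenset({64496, 64497})


def path_via_upstream(as_path: Iterable[int]) -> bool:
    """True if any AS in the path is one of our upstreams."""
    return any(asn in UPSTREAM_ASNS for asn in as_path)


def may_send_to_peer(as_path: Iterable[int]) -> bool:
    """Egress rule: routes that passed through an upstream are not sent to peers."""
    return not path_via_upstream(as_path)


def may_accept_from_customer(as_path: Iterable[int]) -> bool:
    """Ingress rule: a customer should never be our path to an upstream."""
    return not path_via_upstream(as_path)


customer_route = [64500, 64501]       # customer and its own downstream
leaked_route = [64500, 64496, 64510]  # customer re-announcing an upstream's route

print(may_accept_from_customer(customer_route))  # True
print(may_accept_from_customer(leaked_route))    # False
print(may_send_to_peer(leaked_route))            # False
```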

Vote with your wallet, talk to your vendors about how to make your life
easier. One example: ask Cisco to implement
https://tools.cisco.com/bugsearch/bug/CSCuq14541 ("Add "bgp enforce
ebgp-outbound-policy" knob to prevent route leaks" - this is a PR asking
that if a new neighbor is configured you don't immediately send all
routes & accept everything).

There are actively maintained open source tools such as bgpq3 which can
help you generate filters to apply on your customer sessions: it takes 2
seconds to generate an effective IOS prefix-list for 4788:

Vurt:~ job$ time bgpq3 -h rr.ntt.net -A AS-TMNET-CUSTOMERS | wc -l
6884
real    0m1.947s
(source: https://github.com/snar/bgpq3 - can output in BIRD, XR,
IOS, JunOS or JSON syntax)
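
What a generated prefix filter does with an incoming announcement can be sketched in a few lines of Python. This does not parse bgpq3 output: the allowed prefixes are hypothetical documentation ranges standing in for a generated list, and the /24 maximum length is an assumed policy.

```python
import ipaddress

# Hypothetical stand-in for a prefix list generated from IRR data.
ALLOWED = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def accepted(prefix: str, max_len: int = 24) -> bool:
    """Accept a prefix only if it is covered by the list and not too specific."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_len:
        return False
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in ALLOWED
    )


print(accepted("203.0.113.0/24"))    # True: registered prefix
print(accepted("192.0.2.0/24"))      # False: not in the list
print(accepted("203.0.113.128/25"))  # False: more specific than /24
```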

Today there are plenty of networks which use the above techniques
successfully on a variety of devices. 

Kind regards,

Job


Re: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread jim deleskie
People from big telecom should never reply to mailing lists from work
addresses unless specifically allowed, which I suspect TATA doesn't either,
based on some direct, but old knowledge :)

Filtering has been a community issue since my days @ MCI being AS3561,
often discussed, not often enough acted on. I suspect the topic has come up
at every "large" NSP I've worked at.  Frequently someone complains it's
"hard" to fix, or router X makes it hard to fix, or customer Y won't agree,
and not enough people stand up to force a fix of the issues.  I did a preso
on it (while working at TATA) with some other "smart folks" but for all
the usual reasons it died on the vine.  I don't blame (3) for this but our
community as a whole.  Many "people/networks" have to not do the "right
thing(tm)" for a failure like this to happen.


-jim

On Fri, Jun 12, 2015 at 12:43 PM, Utkarsh Gosain <
utkarsh.gos...@tatacommunications.com> wrote:

> Hi Martin
> I am not a spokesperson on behalf of L3 but I have worked for big telcos
> my whole career and my recommendation is to raise a trouble ticket if
> anyone on the forum is their customer and is affected.
> I don’t think engineers at the NOC are authorized to reply to forums at any
> of the major telcos, especially regarding outages, unless someone raises a
> trouble ticket and seeks an RCA of the issue one on one with them.
>
>
> Utkarsh Gosain
> Global Acc Director
> Tata Communications
>
>
> -----Original Message-----
> From: NANOG [mailto:nanog-boun...@nanog.org] On Behalf Of Martin Millnert
> Sent: Friday, June 12, 2015 11:33 AM
> To: NANOG
> Subject: Open letter to Level3 concerning the global routing issues on
> June 12th
>
> Dear Level3,
>
> The Internet is a cooperative effort, and it works well only when its
> participants take constructive actions to address errors and remedy
> problems.
> Your position as a major Internet Carrier bestows upon you a certain
> degree of responsibility for the correct operation of the Internet all
> across (and beyond) the planet. You have many customers. Customers will
> always occasionally make mistakes. You as a major Internet Carrier have a
> responsibility to limit, not amplify, your customers' mistakes.
> Other major carriers implement technical measures that severely limit the
> damage from customer mistakes and keep it from having global impact.
> Other major carriers also implement operational procedures in addition to
> technical measures.
> In combination, these measures drastically reduce the outage-hours as a
> result of customer configuration errors.
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers,
> Telekom Malaysia (AS4788) began announcing the full Internet table back to
> you, which you accepted and propagated to your peers and customers, causing
> global outages for close to 3 hours.
> [ https://twitter.com/DynResearch/status/609340592036970496 ] During this
> 3 hour window, it appears (from your own service outage
> reports) that you did nothing to stop the global Internet outage, but that
> Telekom Malaysia themselves eventually resolved it. This lack of action on
> your end, and your disregard for the correct operation of the global
> Internet is astonishing. These mistakes do not need to happen.
> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
> Internet. You accepted multiple hundred thousand prefixes from them - a max
> prefix setting would have severely limited the damage. We expect that these
> are your practices as well, but they failed. When they do, it should not
> take ~3 hours to shut down the session(s).
>
> Many operators, in despair, turned down their peering sessions with you
> once it was clear you were causing the outages and no immediate fix was in
> sight. This improved the situation for some, but not for all. Had you
> deployed proper IRR-filtering to filter the bad announcements the impact
> would've been far less critical.
>
> As a direct consequence of your ~3 hours of inaction, as a local example,
> Swedish payment terminals were experiencing problems all over the country.
> The Swedish economy was directly affected by your inaction.
> There were queues when I was buying lunch! Imagine the food rage. The
> situation was probably similar at other places around the globe where
> people were awake.
>
> Operators around the planet are curious:
>   - Did Level3 not detect or understand that it was causing global
> Internet outages for ~3 hours?
>   - If Level3 did in fact detect or understand it was causing global
> Internet outages, why did it not properly and immediately remedy the
> situation?
>   - What is Level3 going to do to address these questions and begin wo

RE: Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread Utkarsh Gosain
Hi Martin,
I am not a spokesperson for L3, but I have worked for big telcos my whole 
career, and my recommendation is to raise a trouble ticket if anyone on the 
forum is their customer and is affected.
I don’t think NOC engineers at any of the major telcos are authorized to reply 
on forums, especially regarding outages, unless someone raises a trouble ticket 
and seeks an RCA of the issue one-on-one with them.


Utkarsh Gosain
Global Acc Director 
Tata Communications



Open letter to Level3 concerning the global routing issues on June 12th

2015-06-12 Thread Martin Millnert
Dear Level3,

The Internet is a cooperative effort, and it works well only when its
participants take constructive actions to address errors and remedy
problems.
Your position as a major Internet Carrier bestows upon you a certain
degree of responsibility for the correct operation of the Internet all
across (and beyond) the planet. You have many customers. Customers will
always occasionally make mistakes. You as a major Internet Carrier have
a responsibility to limit, not amplify, your customers' mistakes.
Other major carriers implement technical measures that severely limit
the damage from customer mistakes and keep it from having global impact.
Other major carriers also implement operational procedures in addition
to technical measures.
In combination, these measures drastically reduce the outage-hours as a
result of customer configuration errors.

At 08:44 UTC on Friday 12th of June, one of your transit customers,
Telekom Malaysia (AS4788) began announcing the full Internet table back
to you, which you accepted and propagated to your peers and customers,
causing global outages for close to 3 hours.
[ https://twitter.com/DynResearch/status/609340592036970496 ]
During this 3-hour window, it appears (from your own service outage
reports) that you did nothing to stop the global Internet outage, but
that Telekom Malaysia themselves eventually resolved it. This lack of
action on your end, and your disregard for the correct operation of the
global Internet, is astonishing. These mistakes do not need to happen.
AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
Internet. You accepted several hundred thousand prefixes from them - a
max prefix setting would have severely limited the damage. We expect
that these are your practices as well, but they failed. When they do, it
should not take ~3 hours to shut down the session(s).
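
To illustrate the max-prefix point, here is a minimal Python sketch of
how such a limit behaves. It is a toy model, not any vendor's
implementation: the BgpSession class, the feed and make_prefixes helpers,
and the limit of 2500 (some headroom over the ~1900 prefixes mentioned
above) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one customer BGP session (hypothetical, not a vendor API)."""
    peer_asn: int
    max_prefixes: int  # assumed limit, chosen by the operator
    accepted: set = field(default_factory=set)
    is_up: bool = True

    def receive(self, prefix: str) -> bool:
        """Accept a prefix unless that would exceed the limit.

        Returns True if the prefix is (or already was) accepted. When the
        limit would be exceeded, the session is torn down and everything
        learned from it is withdrawn.
        """
        if not self.is_up:
            return False
        if prefix in self.accepted:
            return True
        if len(self.accepted) >= self.max_prefixes:
            self.is_up = False
            self.accepted.clear()
            return False
        self.accepted.add(prefix)
        return True


def make_prefixes(n: int) -> list:
    """Generate n distinct placeholder /24 prefixes (valid for n <= 65536)."""
    return [f"10.{i // 256}.{i % 256}.0/24" for i in range(n)]


def feed(session: BgpSession, announcements: list) -> BgpSession:
    """Deliver announcements in order, stopping once the session drops."""
    for prefix in announcements:
        if not session.receive(prefix):
            break
    return session


# A normal day: ~1900 prefixes stay under the assumed limit of 2500.
normal = feed(BgpSession(peer_asn=4788, max_prefixes=2500), make_prefixes(1900))

# A leak: the session drops as soon as the limit would be exceeded.
leak = feed(BgpSession(peer_asn=4788, max_prefixes=2500), make_prefixes(5000))
```

In this sketch the leaking session is shut down after 2500 prefixes
rather than carrying hundreds of thousands, so the damage is bounded
without anyone having to notice and intervene.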

Many operators, in despair, turned down their peering sessions with you
once it was clear you were causing the outages and no immediate fix was
in sight. This improved the situation for some, but not for all. Had
you deployed proper IRR-filtering to filter the bad announcements the
impact would've been far less critical.
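
The IRR-filtering idea can be sketched in a similar way. The example
below assumes the customer's registered routes are already available as
a plain list; a real deployment would derive them from IRR route
objects. The build_filter helper, the max_len of 24, and the
documentation prefixes are assumptions for illustration only, and the
sketch covers IPv4 only.

```python
import ipaddress


def build_filter(registered, max_len=24):
    """Return a function that decides whether an announcement is permitted.

    `registered` stands in for the prefixes a customer has registered;
    here it is a hand-written list, not a real IRR query.
    """
    nets = [ipaddress.ip_network(p) for p in registered]

    def permits(announced: str) -> bool:
        net = ipaddress.ip_network(announced)
        # Reject overly specific announcements outright.
        if net.prefixlen > max_len:
            return False
        # Accept only prefixes that fall inside a registered prefix.
        return any(net.version == r.version and net.subnet_of(r) for r in nets)

    return permits


# Documentation prefixes used as stand-ins for a customer's registered space.
permits = build_filter(["198.51.100.0/24", "203.0.113.0/24"])

own_route = permits("203.0.113.0/24")       # inside registered space
leaked_route = permits("8.8.8.0/24")        # someone else's prefix
too_specific = permits("198.51.100.128/25")  # longer than max_len
```

With a filter of this kind on the customer session, a re-announced full
table would be rejected prefix by prefix, since almost none of it falls
inside the customer's registered space.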

As a direct consequence of your ~3 hours of inaction, as a local
example, Swedish payment terminals were experiencing problems all over
the country. The Swedish economy was directly affected by your inaction.
There were queues when I was buying lunch! Imagine the food rage. The
situation was probably similar at other places around the globe where
people were awake.

Operators around the planet are curious:
  - Did Level3 not detect or understand that it was causing global
Internet outages for ~3 hours?
  - If Level3 did in fact detect or understand it was causing global
Internet outages, why did it not properly and immediately remedy the
situation?
  - What is Level3 going to do to address these questions and begin work
on restoring its credibility as a carrier?

We all understand that mistakes do happen (in applying customer
interface templates, etc.). However, the Internet is all too pervasive in
everyday life today for anything but swift action by carriers to remedy
breakage after the fact. It is absolutely not sufficient to let a
customer spend 3 hours to detect and fix a situation like this one. It
is unacceptable that no swift action was taken on your end to limit the
global routing issues you caused.

Sincerely,
Martin Millnert
Member of Internet Community - no carrier / ISP affiliation. 

