Re: Service Provider NetFlow Collectors

2018-12-31 Thread Avi Freedman
We do have a minimum for commercial service that's more like $1500/mo, but we 
are coming out with a free tier in Q1 with lower retention (among other deltas, 
but including full slice-and-dice flow analytics + BGP, which it sounded like 
Erik might be looking for).

Feel free to ping me if anyone would like to help us test the free tier in 
January.

Thanks,

Avi Freedman
CEO, Kentik

> Doesn't Kentik cost like $2000 a month minimum?
> 
> 
> On Mon, Dec 31, 2018 at 11:57 AM Matthew Crocker 
> wrote:
> 
> >  +1 Kentik as well,  DDoS, RTBH, Netflow.  Cloud based so I don't have to
> > worry about it.
> >
> > On 12/31/18, 11:37 AM, "NANOG on behalf of Bryan Holloway" <
> > nanog-boun...@nanog.org on behalf of br...@shout.net> wrote:
> >
> > +1 Kentik ...
> >
> > We've been using their DDoS/RTBH mitigation with good success.
> >
> >
> > On 12/31/18 3:52 AM, Eric Lindsjö wrote:
> > > Hi,
> > >
> > > > We use kentik and we're very happy. Works great, tons of new features
> > > > coming along all the time. Going to start looking into ddos detection
> > > > and mitigation soon.
> > >
> > > Would recommend.
> > >
> > > Kind regards,
> > > Eric Lindsjö
> > >
> > >
> > > On 12/31/2018 04:29 AM, Erik Sundberg wrote:
> > >>
> > >> Hi Nanog….
> > >>
> > >> We are looking at replacing our Netflow collector. I am wondering what
> > >> other service providers are using to collect netflow data off their
> > >> Core and Edge Routers. Pros/Cons… What to watch out for… any info would
> > >> help.
> > >>
> > >> We are mainly looking to analyze the netflow data. Bonus if it does
> > >> ddos detection and mitigation.
> > >>
> > >> We are looking at
> > >>
> > >> ManageEngine Netflow Analyzer
> > >>
> > >> PRTG
> > >>
> > >> Plixer – Scrutinizer
> > >>
> > >> PeakFlow
> > >>
> > >> Kentik
> > >>
> > >> Solarwinds NTA
> > >>
> > >> Thanks in advance…
> > >>
> > >> Erik
> > >>
> > >>
> > >>
> > 
> > >>
> > >> CONFIDENTIALITY NOTICE: This e-mail transmission, and any documents,
> > >> files or previous e-mail messages attached to it may contain
> > >> confidential information that is legally privileged. If you are not
> > >> the intended recipient, or a person responsible for delivering it to
> > >> the intended recipient, you are hereby notified that any disclosure,
> > >> copying, distribution or use of any of the information contained in or
> > >> attached to this transmission is STRICTLY PROHIBITED. If you have
> > >> received this transmission in error please notify the sender
> > >> immediately by replying to this e-mail. You must destroy the original
> > >> transmission and its attachments without reading or saving in any
> > >> manner. Thank you.
> > >
> >
> >
> >


Re: CenturyLink RCA?

2018-12-31 Thread Lee
On 12/31/18, Keith Medcalf  wrote:
>> It could have been worse:
>>   https://www.cio.com.au/article/65115/all_systems_down/
>
> "Make network changes only between 2am and 5am on weekends."
>
> Wow.  Just wow.

yeah.  out of all the possible lessons they could have learned..

>  I suppose the IT types are considerably different than
> Process Operations.  Our rule is to only make changes scheduled at 09:00 (or
> no later than will permit a complete backout and restore by 15:00) Local
> Time on "Full Staff" day that is not immediately preceded or followed by a
> reduced staff day, holiday, or weekend-day.

Do you get paid differently based on time of day?  I used to be at a
place where they were drifting into a 'no changes until midnight' mode
except for one group; the rumor I heard was they got overtime pay
after 6PM which is why they got to do all their changes during the
day.

Lee


Re: IP Dslams

2018-12-31 Thread Jason Baugher
Most of my experience is with Calix C7 and E7 DSL, fan of both. Recently 
learning the Adtran TA5000, not impressed. Hardware may be solid, but 
management is ugly and painful.





-------- Original message --------
From: Erik Sundberg 
Date: 12/31/18 1:32 PM (GMT-06:00)
To: Nick Edwards 
Cc: nanog@nanog.org
Subject: RE: IP Dslams

I haven’t used any of these…

Check out Adtran Total Access 5000 Platform…. Used by a lot of EoC / EoDS1 
carriers


Google: Ethernet Extender DSLAM
https://enableit.com/rackmount-extender/


From: NANOG  On Behalf Of Nick Edwards
Sent: Friday, December 28, 2018 7:36 PM
To: nanog@nanog.org
Subject: IP Dslams

Howdy,
We have a requirement for an aged care facility to provide voice and data. We 
have the voice worked out, but for data, WiFi is out of the question, so we are 
looking for IP-DSLAMs, preferably a system that is all-in-one or self-contained, 
as in one that contains its own BBRAS/LNS/PPP server/RADIUS and has a property 
management API, or even just a web page manager where an admin can add new 
residents when they arrive or delete them when they depart. I know these used 
to be available many years ago, but that vendor has, like many, vanished. The 
only requirement is for ADSL2+; we prefer units with either 48 ports or 
multiples thereof (192 etc.) and filtered voice-out ports (telco50/RJ21 etc.).
If anyone knows of such units, we would appreciate some details on them, 
brand/model and suppliers if known, etc. We can try to get our google-fu back 
if we have some steering :)
Thank Y'all
(resent - original never made it to the list for some gremlin reason)




Jason Baugher, Network Operations Manager
405 Emminga Road | PO Box 217 | Golden, IL 62339-0217
P:(217) 696-4411 | F:(217) 696-4811 | www.adams.net

The information contained in this email message is PRIVILEGED AND CONFIDENTIAL, 
and is intended for the use of the addressee and no one else. If you are not 
the intended recipient, please do not read, distribute, reproduce or use this 
email message (or the attachments) and notify the sender of the mistaken 
transmission. Thank you.


Re: Service Provider NetFlow Collectors

2018-12-31 Thread Colton Conor
Doesn't Kentik cost like $2000 a month minimum?


On Mon, Dec 31, 2018 at 11:57 AM Matthew Crocker 
wrote:

>  +1 Kentik as well,  DDoS, RTBH, Netflow.  Cloud based so I don't have to
> worry about it.
>
> On 12/31/18, 11:37 AM, "NANOG on behalf of Bryan Holloway" <
> nanog-boun...@nanog.org on behalf of br...@shout.net> wrote:
>
> +1 Kentik ...
>
> We've been using their DDoS/RTBH mitigation with good success.
>
>
> On 12/31/18 3:52 AM, Eric Lindsjö wrote:
> > Hi,
> >
> > We use kentik and we're very happy. Works great, tons of new features
> > coming along all the time. Going to start looking into ddos detection
> > and mitigation soon.
> >
> > Would recommend.
> >
> > Kind regards,
> > Eric Lindsjö
> >
> >
> > On 12/31/2018 04:29 AM, Erik Sundberg wrote:
> >>
> >> Hi Nanog….
> >>
> >> We are looking at replacing our Netflow collector. I am wondering what
> >> other service providers are using to collect netflow data off their
> >> Core and Edge Routers. Pros/Cons… What to watch out for… any info would
> >> help.
> >>
> >> We are mainly looking to analyze the netflow data. Bonus if it does
> >> ddos detection and mitigation.
> >>
> >> We are looking at
> >>
> >> ManageEngine Netflow Analyzer
> >>
> >> PRTG
> >>
> >> Plixer – Scrutinizer
> >>
> >> PeakFlow
> >>
> >> Kentik
> >>
> >> Solarwinds NTA
> >>
> >> Thanks in advance…
> >>
> >> Erik
> >>
> >>
> >>
> 
> >>
> >
>
>
>


Re: IP Dslams

2018-12-31 Thread Colton Conor
Carl,

What did you select to replace your MX BNG?

To Nick, we use Adtran Total Access 5000s today. They work fine, but if I
were doing a new install I would go with Calix and their newer lines that have
SDN BNG functions. Calix just has better CPE to go along with it, but their
CPE is G.fast and Ethernet only.

Why only ADSL2+?

What are you doing for voice?

Do you have access to Coax cable? If so I would do a small 32x10 CMTS with
cable modem. Much cheaper and future proof.

On Mon, Dec 31, 2018 at 3:47 PM Carl Peterson 
wrote:

> I'd consider breaking down the two functions.
> Set up your customer connections using ADSL, Ethernet, etc. and put each
> unit in the building on its own CVLAN.  This should never change even when
> the subscribers in the unit change.  This way you can configure it once and
> never touch it again.  I'd use Calix G.fast but I have no idea what your
> budget/wiring looks like and I'm not sure where their E3-48 and E5-48 are
> in general availability.
>
> Then hand the SVLAN with all the CVLANs off to the BNG and authenticate
> the circuits using IPoE.  Waystream has an ASR6000 switch with BNG
> functionalities (I've never used it, just came across it when looking for
> other options to replace my MX BNG).
>
> On Mon, Dec 31, 2018 at 1:15 PM Nick Edwards 
> wrote:
>
>> Howdy,
>> We have a requirement for an aged care facility to provide voice and
>> data. We have the voice worked out, but for data, WiFi is out of the
>> question, so we are looking for IP-DSLAMs, preferably a system that is
>> all-in-one or self-contained, as in one that contains its own
>> BBRAS/LNS/PPP server/RADIUS and has a property management API, or even
>> just a web page manager where an admin can add new residents when they
>> arrive or delete them when they depart. I know these used to be available
>> many years ago, but that vendor has, like many, vanished. The only
>> requirement is for ADSL2+; we prefer units with either 48 ports or
>> multiples thereof (192 etc.) and filtered voice-out ports (telco50/RJ21
>> etc.).
>>
>> If anyone knows of such units, we would appreciate some details on them,
>> brand/model and suppliers if known, etc. We can try to get our google-fu
>> back if we have some steering :)
>>
>> Thank Y'all
>>
>> (resent - original never made it to the list for some gremlin reason)
>>
>


Re: IP Dslams

2018-12-31 Thread Carl Peterson
I'd consider breaking down the two functions.
Set up your customer connections using ADSL, Ethernet, etc. and put each unit
in the building on its own CVLAN.  This should never change even when the
subscribers in the unit change.  This way you can configure it once and
never touch it again.  I'd use Calix G.fast but I have no idea what your
budget/wiring looks like and I'm not sure where their E3-48 and E5-48 are
in general availability.

Then hand the SVLAN with all the CVLANs off to the BNG and authenticate the
circuits using IPoE.  Waystream has an ASR6000 switch with BNG
functionalities (I've never used it, just came across it when looking for
other options to replace my MX BNG).
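
To illustrate the per-unit CVLAN idea, here is a minimal Python sketch, assuming
a single SVLAN and sequential CVLAN numbering. The function name assign_cvlans,
the unit labels and the VLAN numbers are all hypothetical and not taken from any
vendor tooling; the only point is that the mapping is keyed on the unit, not on
the subscriber.

```python
def assign_cvlans(units, svlan=100, first_cvlan=101):
    """Map each building unit to a fixed (SVLAN, CVLAN) pair.

    The unit labels, SVLAN 100 and starting CVLAN 101 are illustrative
    assumptions. Valid 802.1Q VLAN IDs run from 1 to 4094.
    """
    if first_cvlan < 1 or first_cvlan + len(units) - 1 > 4094:
        raise ValueError("CVLAN range falls outside 1..4094")
    # The pair depends only on the unit's position in the list, so a
    # change of resident never requires touching the access config.
    return {unit: (svlan, first_cvlan + i) for i, unit in enumerate(units)}


if __name__ == "__main__":
    for unit, tags in assign_cvlans(["A1", "A2", "A3"]).items():
        print(unit, tags)
```

Who is currently in a unit would then live only on the BNG/IPoE side, which is
what lets the access side be configured once and left alone.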

On Mon, Dec 31, 2018 at 1:15 PM Nick Edwards 
wrote:

> Howdy,
> We have a requirement for an aged care facility to provide voice and data.
> We have the voice worked out, but for data, WiFi is out of the question, so
> we are looking for IP-DSLAMs, preferably a system that is all-in-one or
> self-contained, as in one that contains its own BBRAS/LNS/PPP server/RADIUS
> and has a property management API, or even just a web page manager where an
> admin can add new residents when they arrive or delete them when they
> depart. I know these used to be available many years ago, but that vendor
> has, like many, vanished. The only requirement is for ADSL2+; we prefer
> units with either 48 ports or multiples thereof (192 etc.) and filtered
> voice-out ports (telco50/RJ21 etc.).
>
> If anyone knows of such units, we would appreciate some details on them,
> brand/model and suppliers if known, etc. We can try to get our google-fu
> back if we have some steering :)
>
> Thank Y'all
>
> (resent - original never made it to the list for some gremlin reason)
>


Re: CenturyLink RCA?

2018-12-31 Thread William Herrin
On Mon, Dec 31, 2018 at 7:24 AM Naslund, Steve  wrote:
> Bad design if that’s the case, that would be a huge subnet.

According to the notes at the URL Saku shared, they suffered a cascade
failure from which they needed the equipment vendor's help to recover.
That indicates at least two grave design errors:

1. Vendor monoculture is a single point of failure. Same equipment
running the same software triggers the same bug. It all kabooms at
once. Different vendors running different implementations have
compatibility issues but when one has a bug it's much less likely to
take down all the rest.

2. Failure to implement system boundaries. When you automate systems
it's important to restrict the reach of that automation. Whether it's
a regional boundary or independent backbones, a critical system like
this one should be structurally segmented so that malfunctioning
automation can bring down only one piece of it.
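
As a rough illustration of point 2, the following Python sketch batches an
automated change so that no single batch reaches beyond one region. The device
names, the region labels and the plan_rollout helper are hypothetical; this is
one possible way to bound the reach of automation, not a description of how
any carrier actually does it.

```python
from collections import defaultdict


def plan_rollout(devices, max_regions_per_batch=1):
    """Group (device, region) pairs into batches bounded by region count.

    A malfunctioning push can then affect at most the regions in one
    batch before a human or a health check gets a chance to stop it.
    """
    by_region = defaultdict(list)
    for name, region in devices:
        by_region[region].append(name)

    regions = sorted(by_region)
    batches = []
    for start in range(0, len(regions), max_regions_per_batch):
        batch = []
        for region in regions[start:start + max_regions_per_batch]:
            batch.extend(by_region[region])
        batches.append(batch)
    return batches


if __name__ == "__main__":
    inventory = [("mux-a", "east"), ("mux-b", "west"), ("mux-c", "east")]
    for number, batch in enumerate(plan_rollout(inventory), start=1):
        print(f"batch {number}: {batch}")
```

The same idea applies whether the boundary is a region or an independent
backbone: the automation simply refuses to treat the whole network as one unit.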

Regards,
Bill Herrin



> However even if that was the case, you would not need to replace
> hardware in multiple places.  You might have to reset it but not
> replace it.  Also being an ILEC it seems hard to believe how long
> their dispatches to their own central office took.  It might have
> taken awhile to locate the original problem but they should have been
> able to send a corrective procedure to CO personnel who are a lot
> closer to the equipment.  In my region (Northern Illinois) we can
> typically get access to a CO in under 30 minutes 24/7.  They are
> essentially smart hands technicians that can reseat or replace line
> cards.
>
> > 2.  Do we believe that an OOB management card was able to generate so much 
> > traffic as to bring down the optical switching?  Very doubtful which means 
> > that the systems were actually broken due to trying to PROCESS the "invalid 
> > >frames".  Seems like very poor control plane management if the system is 
> > attempting to process invalid data and bringing down the forwarding plane.
>
> >L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
> >However it can be argued that optical network should fail up in absence of 
> >control-plane, IP network has to fail down.
>
> Most of the optical muxes I have worked with will run without any management 
> card or control plane at all.  Usually the line cards keep forwarding 
> according to the existing configuration even in the absence of all management 
> functions.  It would help if we knew what gear this was.  True optical muxes 
> do not require much care and feeding once they have a configuration loaded.  
> If they are truly dependent on that control plane, then it needs to be 
> redundant enough with watch dogs to reset them if they become non responsive 
> and they need policers and rate limiter on their interfaces.  Seems they 
> would be vulnerable to a DoS if a bad
> BPDU can wipe them out.
>
> > 3.  In the cited document it was stated that the offending packet did not 
> > have source or destination information.  If so, how did it get propagated 
> > throughout the network?
>
> >BPDU
>
> Maybe, it would be strange that it was invalid but valid enough to continue 
> forwarding.  In any case loss of the management network should not interrupt 
> forwarding.  I also would not be happy with an optical network that relies on 
> spanning tree to remain operational.
>
> > My guess at the time and my current opinion (which has no real factual 
> > basis, just years of experience) is that a bad software package was 
> > propagated through their network.
>
> >Lot of possible reasons, I choose to believe what they've communicated is 
> >what the writer of the communication thought that happened, but as they 
> >likely are not SME it's broken radio communication. BCAST storm on L2 DCN 
> >would plausibly fit the very ambiguous reason offered and is something 
> >people actually are doing.
>
> My biggest problem with their explanation is the replacement of line cards in 
> multiple cities.  The only way that happens is when bad code gets pushed to 
> them.  If it took them that long to fix an L2 broadcast storm, something is 
> seriously wrong with their engineering.  Resetting the management interfaces 
> should be sufficient once the offending line card is removed.  That is why I 
> think this was a software update failure or a configuration push.  Either 
> way, they should be jumping up and down on their vendor as to why this caused 
> such large scale effects.



--
William Herrin  her...@dirtside.com  b...@herrin.us
Dirtside Systems . Web: 


Re: CenturyLink RCA?

2018-12-31 Thread William Herrin
On Mon, Dec 31, 2018 at 12:31 PM Keith Medcalf  wrote:
> > It could have been worse:
> >   https://www.cio.com.au/article/65115/all_systems_down/
>
> "Make network changes only between 2am and 5am on weekends."
>
> Wow.  Just wow.  I suppose the IT types are considerably different
> than Process Operations.  Our rule is to only make changes
> scheduled at 09:00 (or no later than will permit a complete backout
> and restore by 15:00) Local Time on "Full Staff" day that is not
> immediately preceded or followed by a reduced staff day,
> holiday, or weekend-day.

It depends on your system architecture. If you've built your
redundancy well so that you have a continuously maintainable system
then you do the work during normal staffing and only when followed by
days when folks will be around to notice and fix any mistakes.

If you require a disruptive maintenance window then you schedule it
for minimum usage times instead.


Other conclusions from the article are dubious as well:

* Retire legacy network gear faster and create overall life cycle
management for networking gear.

Retire equipment when it ceases to be cost-effective, not merely
because it was manufactured too many years ago. Just don't forget to
factor risk in to the cost.

* Document all changes, including keeping up-to-date physical and
logical network diagrams.

"Good intentions never work, you need good mechanisms to make anything
happen." - Jeff Bezos
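
Going back to the point about retiring gear on cost rather than age, here is a
small Python sketch of what "factor risk in to the cost" could look like as
arithmetic. Every number and name here (annual_cost, should_retire, the dollar
figures and the failure probability) is a made-up assumption for illustration.

```python
def annual_cost(maintenance, failure_probability, outage_cost):
    """Yearly cost of keeping a device: upkeep plus expected outage loss."""
    return maintenance + failure_probability * outage_cost


def should_retire(maintenance, failure_probability, outage_cost,
                  replacement_annual_cost):
    """Retire only when keeping the old gear costs more than replacing it."""
    keep = annual_cost(maintenance, failure_probability, outage_cost)
    return keep > replacement_annual_cost


if __name__ == "__main__":
    # Old switch: cheap to maintain, but a 25% yearly chance of a costly outage.
    print(annual_cost(5000, 0.25, 200000))
    print(should_retire(5000, 0.25, 200000, replacement_annual_cost=40000))
```

With the risk term left out, the old box looks far cheaper than its
replacement; with it included, the comparison can flip, regardless of the
manufacturing date.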

Regards,
Bill Herrin


-- 
William Herrin  her...@dirtside.com  b...@herrin.us
Dirtside Systems . Web: 


RE: CenturyLink RCA?

2018-12-31 Thread Keith Medcalf
> It could have been worse:
>   https://www.cio.com.au/article/65115/all_systems_down/

"Make network changes only between 2am and 5am on weekends."

Wow.  Just wow.  I suppose the IT types are considerably different than Process 
Operations.  Our rule is to only make changes scheduled at 09:00 (or no later 
than will permit a complete backout and restore by 15:00) Local Time on "Full 
Staff" day that is not immediately preceded or followed by a reduced staff day, 
holiday, or weekend-day.
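
For anyone wanting to automate a check of that rule, a minimal Python sketch
follows. It covers only the day-selection part (a full-staff day not adjacent
to a reduced-staff day, holiday or weekend day), not the 09:00 to 15:00
backout window. The function names and the example holiday set are
assumptions for illustration.

```python
from datetime import date, timedelta


def is_full_staff_day(day, reduced_days):
    """A weekday that is not a listed holiday or reduced-staff day."""
    return day.weekday() < 5 and day not in reduced_days


def is_change_day(day, reduced_days=frozenset()):
    """True when the day, the day before and the day after are all full-staff."""
    return all(
        is_full_staff_day(day + timedelta(days=offset), reduced_days)
        for offset in (-1, 0, 1)
    )


if __name__ == "__main__":
    holidays = {date(2019, 1, 1)}
    for d in (date(2019, 1, 2), date(2019, 1, 3), date(2019, 1, 4)):
        print(d, is_change_day(d, holidays))
```

In a normal week this leaves Tuesday through Thursday, and fewer days around
holidays.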

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.






Re: IP Dslams

2018-12-31 Thread Paul Stewart
+1 for Adtran TA5000 .. we use them, my former employer uses them with great 
success.  There’s also the Calix series of gear that is quite good too …



From: NANOG  on behalf of Erik Sundberg 

Date: Monday, December 31, 2018 at 2:31 PM
To: Nick Edwards 
Cc: "nanog@nanog.org" 
Subject: RE: IP Dslams

I haven’t used any of these…

Check out Adtran Total Access 5000 Platform…. Used by a lot of EoC / EoDS1 
carriers


Google: Ethernet Extender DSLAM
https://enableit.com/rackmount-extender/


From: NANOG  On Behalf Of Nick Edwards
Sent: Friday, December 28, 2018 7:36 PM
To: nanog@nanog.org
Subject: IP Dslams

Howdy,
We have a requirement for an aged care facility to provide voice and data. We 
have the voice worked out, but for data, WiFi is out of the question, so we are 
looking for IP-DSLAMs, preferably a system that is all-in-one or self-contained, 
as in one that contains its own BBRAS/LNS/PPP server/RADIUS and has a property 
management API, or even just a web page manager where an admin can add new 
residents when they arrive or delete them when they depart. I know these used 
to be available many years ago, but that vendor has, like many, vanished. The 
only requirement is for ADSL2+; we prefer units with either 48 ports or 
multiples thereof (192 etc.) and filtered voice-out ports (telco50/RJ21 etc.).
If anyone knows of such units, we would appreciate some details on them, 
brand/model and suppliers if known, etc. We can try to get our google-fu back 
if we have some steering :)
Thank Y'all
(resent - original never made it to the list for some gremlin reason)






Re: Larry Roberts, RIP.

2018-12-31 Thread Mel Beckman
Such irony that Roberts’ NYTimes article is behind a paywall :)

Here’s a more informative, much more entertaining, and totally free article:

https://www.i-programmer.info/news/82-heritage/12414-internet-pioneer-lawrence-roberts-dies-aged-81.html


On Dec 30, 2018, at 7:59 PM, Dobbins, Roland <roland.dobb...@netscout.com> wrote:

Roland Dobbins <roland.dobb...@netscout.com>



RE: CenturyLink RCA?

2018-12-31 Thread Naslund, Steve
A note for the guys hanging on to those POTS lines… It won’t really help.  One 
of our sites in Dubuque, Iowa had ten CenturyLink PRIs (they are the LEC there) 
homed off of a 5ESS switch.  These were all unable to process calls during the 
CenturyLink problem.  The ISDN messaging returned indicated that the CL phone 
switch had no routes.  This tells me that either their inter-switch trunking or 
SS7 network or both are being transported over the same optical network as the 
Internet services.  So, even if your local line is POTS or traditional TDM it 
won’t matter if all of their transport is dependent on the IP world.

Looking at the Reddit comments on the Infinera devices being a problem, that 
makes more sense because that device blurs the line between optical mux and IP 
enabled devices with its Ethernet mapping functions.  One advantage of the pure 
optical mux is that it does not need, care about, or understand L2 and L3 
network protocols and is largely unaffected by those layers.  Converging more 
network layers into one device exposes it to more potential bugs.  
Convergence can easily lead to more single points of failure, and the traffic 
capacity of these devices kind of encourages carriers to put more stuff in one 
basket than they traditionally did.  I understand the motivation to build a 
single high speed IP centric backbone but it makes everything dependent on that 
backbone.
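
To make the single-basket point concrete, here is a small Python sketch that
looks for transport elements shared by every service. The service names and
dependency lists are invented for illustration (nothing here is known about
CenturyLink's actual topology), and shared_dependencies is a hypothetical
helper.

```python
def shared_dependencies(services):
    """Return the elements that every listed service depends on.

    Anything in the result is a common point of failure for all of them.
    """
    dependency_sets = [set(deps) for deps in services.values()]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


if __name__ == "__main__":
    services = {
        "internet": ["optical-backbone", "ip-core"],
        "voice-trunks": ["optical-backbone", "class5-switch"],
        "ss7": ["optical-backbone"],
    }
    print(shared_dependencies(services))
```

If the result is non-empty, "separate" services such as POTS and Internet are
only separate at the edge.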

Steven Naslund
Chicago IL



Re: Disney+ CDN

2018-12-31 Thread Brian R
I would guess they are using the Hulu platform as the backend for their 
streaming services going forward.  They are now the primary stakeholders in 
Hulu (purchase of Fox).  I don't know if they do cache servers however.

Brian


From: NANOG  on behalf of Aaron Graves 

Sent: Saturday, December 29, 2018 5:21 PM
To: nanog@nanog.org
Subject: Disney+ CDN

Anyone know what Disney is planning on doing for streaming content distribution 
once they leave Netflix?  Would be nice if they'd provide an on-prem cache 
server.

AG


RE: IP Dslams

2018-12-31 Thread Erik Sundberg
I haven’t used any of these…

Check out Adtran Total Access 5000 Platform…. Used by a lot of EoC / EoDS1 
carriers


Google: Ethernet Extender DSLAM
https://enableit.com/rackmount-extender/


From: NANOG  On Behalf Of Nick Edwards
Sent: Friday, December 28, 2018 7:36 PM
To: nanog@nanog.org
Subject: IP Dslams

Howdy,
We have a requirement for an aged care facility to provide voice and data. We 
have the voice worked out, but for data, WiFi is out of the question, so we are 
looking for IP-DSLAMs, preferably a system that is all-in-one or self-contained, 
as in one that contains its own BBRAS/LNS/PPP server/RADIUS and has a property 
management API, or even just a web page manager where an admin can add new 
residents when they arrive or delete them when they depart. I know these used 
to be available many years ago, but that vendor has, like many, vanished. The 
only requirement is for ADSL2+; we prefer units with either 48 ports or 
multiples thereof (192 etc.) and filtered voice-out ports (telco50/RJ21 etc.).
If anyone knows of such units, we would appreciate some details on them, 
brand/model and suppliers if known, etc. We can try to get our google-fu back 
if we have some steering :)
Thank Y'all
(resent - original never made it to the list for some gremlin reason)





Re: CenturyLink RCA?

2018-12-31 Thread Eric Loos
This seems entirely plausible: DWDM amplifiers and lasers are a complex analog 
system, and they need OOB to align. 

--
Eric

> On 31 Dec 2018, at 16:06, Saku Ytti  wrote:
> 
> Hey Steve,
> 
> I will continue to speculate, as that's all we have.
> 
>> 1.  Are you telling me that several line cards failed in multiple cities in 
>> the same way at the same time?  Don't think so unless the same software 
>> fault was propagated to all of them.  If the problem was that they needed to 
>> be reset, couldn't that be accomplished by simply reseating them?
> 
> L2 DCN/OOB, whole network shares single broadcast domain
> 
>> 2.  Do we believe that an OOB management card was able to generate so much 
>> traffic as to bring down the optical switching?  Very doubtful which means 
>> that the systems were actually broken due to trying to PROCESS the "invalid 
>> frames".  Seems like very poor control plane management if the system is 
>> attempting to process invalid data and bringing down the forwarding plane.
> 
> L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
> However it can be argued that optical network should fail up in absence
> of control-plane, IP network has to fail down.
> 
>> 3.  In the cited document it was stated that the offending packet did not 
>> have source or destination information.  If so, how did it get propagated 
>> throughout the network?
> 
> BPDU
> 
>> My guess at the time and my current opinion (which has no real factual 
>> basis, just years of experience) is that a bad software package was 
>> propagated through their network.
> 
> Lot of possible reasons, I choose to believe what they've communicated
> is what the writer of the communication thought that happened, but as
> they likely are not SME it's broken radio communication. BCAST storm
> on L2 DCN would plausibly fit the very ambiguous reason offered and is
> something people actually are doing.
> 
> -- 
>  ++ytti


Larry Roberts, RIP.

2018-12-31 Thread Dobbins, Roland





Roland Dobbins 


Re: CenturyLink RCA?

2018-12-31 Thread Töma Gavrichenkov
There's a Reddit user claiming he works at CL who says the cause was some
faulty Infinera DTN-X instances.

https://www.reddit.com/r/centurylink/comments/aa2qa4/comment/ecovgab

(dunno though why the user posted that to Reddit and not here)

On 30 Dec 2018 at 20:19, Saku Ytti wrote:

> Hey John,
>
> Your criticism is warranted, but would also be addressed by
> explanation DCN/OOB being the source of the problem.
>
> At any rate, I am looking forward to stop speculating and start
> reading post-mortem written by someone who knows how networks work.
>
> On Sun, 30 Dec 2018 at 18:28, John Von Essen  wrote:
> >
> > One thing that is troubling when reading that URL is that it appears
> > several steps of restoration required teams to go onsite for local login,
> > etc. Granted, to troubleshoot hardware you need to be physically present
> > to pop a line card in and out, but CTL/LVL3 should have full out-of-band
> > console and power control to all core devices; we shouldn't be waiting for
> > someone to drive to a location to get console or do power cycling. And I
> > would imagine the first step to a lot of the troubleshooting was power
> > cycling and local console logs.
> >
> >
> > -John
> >
> >
> >
> > On 12/30/18 10:42 AM, Mike Hammett wrote:
> >
> > It's technical enough so that laypeople immediately lose interest, yet
> > completely useless to anyone that works with this stuff.
> >
> >
> >
> > -
> > Mike Hammett
> > Intelligent Computing Solutions
> > http://www.ics-il.com
> >
> > Midwest-IX
> > http://www.midwest-ix.com
> >
> > 
> > From: "Saku Ytti" 
> > To: "nanog list" 
> > Sent: Sunday, December 30, 2018 7:42:49 AM
> > Subject: CenturyLink RCA?
> >
> > Apologies for the URL, I do not know an official source and I do not
> > share the URL's sentiment.
> > https://fuckingcenturylink.com/
> >
> > Can someone translate this to IP engineer? What did actually happen?
> > From my own history, I rarely recognise the problem I fixed from
> > reading the public RCA. I hope CenturyLink will do better.
> >
> > Best guess so far that I've heard is
> >
> > a) CenturyLink runs global L2 DCN/OOB
> > b) there was HW fault which caused L2 loop (perhaps HW dropped BPDU,
> > I've had this failure mode)
> > c) DCN had direct access to control-plane, and L2 congested
> > control-plane resources causing it to deprovision waves
> >
> > Now of course this is entirely speculation, but intended to show what
> > type of explanation is acceptable and can be used to fix things.
> > Hopefully CenturyLink does come out with IP-engineering readable
> > explanation, so that we may use it as leverage to support work in our
> > own domains to remove such risks.
> >
> > a) do not run L2 DCN/OOB
> > b) do not connect MGMT ETH (it is unprotected access to control-plane,
> > it cannot be protected by CoPP/lo0 filter/LPTS etc.)
> > c) do add in your RFP scoring item for proper OOB port (Like Cisco CMP)
> > d) do fail optical network up
> >
> > --
> >   ++ytti
> >
>
>
> --
>   ++ytti
>


Re: CenturyLink RCA?

2018-12-31 Thread Joe Carroll
Technical obscurity...  managed perception.

On Sun, Dec 30, 2018 at 10:43 Mike Hammett  wrote:

> It's technical enough so that laypeople immediately lose interest, yet
> completely useless to anyone that works with this stuff.
>
>
>
> -
> Mike Hammett
> Intelligent Computing Solutions
> http://www.ics-il.com
>
> Midwest-IX
> http://www.midwest-ix.com
>
> --
> *From: *"Saku Ytti" 
> *To: *"nanog list" 
> *Sent: *Sunday, December 30, 2018 7:42:49 AM
> *Subject: *CenturyLink RCA?
>
> Apologies for the URL, I do not know official source and I do not
> share the URLs sentiment.
> https://fuckingcenturylink.com/
>
> Can someone translate this to IP engineer? What did actually happen?
> From my own history, I rarely recognise the problem I fixed from
> reading the public RCA. I hope CenturyLink will do better.
>
> Best guess so far that I've heard is
>
> a) CenturyLink runs global L2 DCN/OOB
> b) there was HW fault which caused L2 loop (perhaps HW dropped BPDU,
> I've had this failure mode)
> c) DCN had direct access to control-plane, and L2 congested
> control-plane resources causing it to deprovision waves
>
> Now of course this is entirely speculation, but intended to show what
> type of explanation is acceptable and can be used to fix things.
> Hopefully CenturyLink does come out with IP-engineering readable
> explanation, so that we may use it as leverage to support work in our
> own domains to remove such risks.
>
> a) do not run L2 DCN/OOB
> b) do not connect MGMT ETH (it is unprotected access to control-plane,
> it  cannot be protected by CoPP/lo0 filter/LPTS ec)
> c) do add in your RFP scoring item for proper OOB port (Like Cisco CMP)
> d) do fail optical network up
>
>
> --
>   ++ytti
>
>


Disney+ CDN

2018-12-31 Thread Aaron Graves
Anyone know what Disney is planning on doing for streaming content
distribution once they leave Netflix?  Would be nice if they'd provide an
on-prem cache server.

AG


IP Dslams

2018-12-31 Thread Nick Edwards
Howdy,
We have a requirement to provide voice and data to an aged care facility. We
have the voice worked out, but for data WiFi is out of the question, so we are
looking for IP DSLAMs, preferably an all-in-one or self-contained system that
includes its own BBRAS/LNS/PPP server/RADIUS and offers a property management
API, or even just a web page where an admin can add new residents when they
arrive and delete them when they depart. I know these used to be available
many years ago, but that vendor has, like many, vanished. The only requirement
is ADSL2+; we prefer units with either 48 ports or multiples thereof (192 etc.)
and filtered voice-out ports (telco50/RJ21 etc.).

If anyone knows of such units, we would appreciate some details on them
(brand/model, suppliers if known, etc.); we can try to get our google-fu back
if we have some steering :)

Thank Y'all

(resent - original never made it to the list for some gremlin reason)


Re: CenturyLink RCA?

2018-12-31 Thread Lee
On 12/31/18, Aaron1  wrote:
> Yeah, could have been one of those...gone from bad to worse things like Dave
> mentioned... initial problem and course of action perhaps led to a worse
> problem.
>
> I’ve had DWDM issues that have taken down multiple locations far apart from
> each other due to how the transport guys hauled stuff
>
> A few years back I had about 15 routers all reboot suddenly... they were all
> far apart from each other, turned out to be one of the dual bgp sessions to
> rr cluster flapped and all 15 routers crash rebooted.
>
> But ~50 hours of downtime !?

It could have been worse:
  https://www.cio.com.au/article/65115/all_systems_down/


Re: CenturyLink RCA?

2018-12-31 Thread Aaron1
Yeah, could have been one of those...gone from bad to worse things like Dave 
mentioned... initial problem and course of action perhaps led to a worse 
problem.

I’ve had DWDM issues that have taken down multiple locations far apart from 
each other due to how the transport guys hauled stuff 

A few years back I had about 15 routers all reboot suddenly... they were all 
far apart from each other, turned out to be one of the dual bgp sessions to rr 
cluster flapped and all 15 routers crash rebooted.

But ~50 hours of downtime !? 

Aaron

> On Dec 31, 2018, at 11:41 AM, Dave Temkin  wrote:
> 
>> On Mon, Dec 31, 2018 at 11:33 AM Naslund, Steve  wrote:
> 
>> They shouldn’t need OOB to operate existing lambdas just to configure new 
>> ones.  One possibility is that the management interface also handles master 
>> timing which would be a really bad idea but possible (should be redundant 
>> and it should be able to free run for a reasonable amount of time).  The 
>> main issue exposed is that obviously the management interface is critical 
>> and is not redundant enough.  That is if we believe the OOB explanation in 
>> the first place (which by the way is obviously not OOB since it wiped out 
>> the in band network when it failed).
>> 
>>  
>> 
>> Steven Naslund
>> 
>> Chicago IL
>> 
>>  
>> 
>  
> A theory, and only a theory, is that they decided to, in order to 
> troubleshoot a much smaller problem (OOB/etc.), deploy an optical 
> configuration change that, when faced with inaccessibility to multiple nodes, 
> ended up causing a significant inconsistency in their optical network, 
> wreaking havoc on all sorts of other systems. With the OOB network already in 
> chaos, card reseats were required to stabilize things on that network and 
> then they could rebuild the optical network from a fully reachable state.
> 
> Again, only a theory.
> 
> -Dave
> 
>  
>>  
>> 
>> >This seems entirely plausible given that DWDM amplifiers and lasers being a 
>> >complex analog system, they need OOB to align. 
>> 
>> >--
>> 
>> >Eric
>> 
>> 
>> 


Re: CenturyLink

2018-12-31 Thread Saku Ytti
Hey Matthew,

> There isn't a specific regulation on free-running GPS, just "due diligence". 
> I work at a algorithmic program trading company (and have been for 20 years). 
> We have a high ROI, the cost differential for the rubidium OC versus having 
> to drop everything to conform to regulatory requirements due to a short GPS 
> outage, makes this a no-brainer.

Thanks, this makes sense to me.

CAPEX on networking, systems, etc. does not matter to the bottom line, so
there is no point in figuring out whether the thing NEEDS to be better:
objectively it is better, and the cost is irrelevant.

-- 
  ++ytti


Re: Service Provider NetFlow Collectors

2018-12-31 Thread Matthew Crocker
 +1 Kentik as well,  DDoS, RTBH, Netflow.  Cloud based so I don't have to worry 
about it.

On 12/31/18, 11:37 AM, "NANOG on behalf of Bryan Holloway" 
 wrote:

+1 Kentik ...

We've been using their DDoS/RTBH mitigation with good success.


On 12/31/18 3:52 AM, Eric Lindsjö wrote:
> Hi,
> 
> We use kentik and we're very happy. Works great, tons of new features 
> coming along all the time. Going to start looking into ddos detection 
> and mitigation soon.
> 
> Would recommend.
> 
> Kind regards,
> Eric Lindsjö
> 
> 
> On 12/31/2018 04:29 AM, Erik Sundberg wrote:
>>
>> Hi Nanog….
>>
>> We are looking at replacing our Netflow collector. I am wonder what 
>> other service providers are using to collect netflow data off their 
>> Core and Edge Routers. Pros/Cons… What to watch out for any info would 
>> help.
>>
>> We are mainly looking to analyze the netflow data. Bonus if it does 
>> ddos detection and mitigation.
>>
>> We are looking at
>>
>> ManageEngine Netflow Analyzer
>>
>> PRTG
>>
>> Plixer – Scrutinizer
>>
>> PeakFlow
>>
>> Kentik
>>
>> Solarwinds NTA
>>
>> Thanks in advance…
>>
>> Erik
>>
>>
>> 
>>
> 




Re: CenturyLink RCA?

2018-12-31 Thread Dave Temkin
On Mon, Dec 31, 2018 at 11:33 AM Naslund, Steve 
wrote:

> They shouldn’t need OOB to operate existing lambdas just to configure new
> ones.  One possibility is that the management interface also handles master
> timing which would be a really bad idea but possible (should be redundant
> and it should be able to free run for a reasonable amount of time).  The
> main issue exposed is that obviously the management interface is critical
> and is not redundant enough.  That is if we believe the OOB explanation in
> the first place (which by the way is obviously not OOB since it wiped out
> the in band network when it failed).
>
>
>
> Steven Naslund
>
> Chicago IL
>
>
>

A theory, and only a theory, is that they decided to, in order to
troubleshoot a much smaller problem (OOB/etc.), deploy an optical
configuration change that, when faced with inaccessibility to multiple
nodes, ended up causing a significant inconsistency in their optical
network, wreaking havoc on all sorts of other systems. With the OOB network
already in chaos, card reseats were required to stabilize things on that
network and then they could rebuild the optical network from a fully
reachable state.

Again, only a theory.

-Dave



>
>
> >This seems entirely plausible given that DWDM amplifiers and lasers
> being a complex analog system, they need OOB to align.
>
> >--
>
> >Eric
>
>
>
>


RE: Service Provider NetFlow Collectors

2018-12-31 Thread Romeo Czumbil
I personally recommend Kentik.
We mainly got it for DDoS detection, which so far has been 100% reliable for us.
Now we also use it for other traffic analysis.
Query is extremely fast.
Support is also fantastic. If you're looking for a feature that they may not 
have, just ask...



From: NANOG  On Behalf Of Erik Sundberg
Sent: Sunday, December 30, 2018 10:29 PM
To: nanog@nanog.org
Subject: Service Provider NetFlow Collectors

Hi Nanog

We are looking at replacing our Netflow collector. I am wonder what other 
service providers are using to collect netflow data off their Core and Edge 
Routers. Pros/Cons... What to watch out for any info would help.

We are mainly looking to analyze the netflow data. Bonus if it does ddos 
detection and mitigation.

We are looking at
ManageEngine Netflow Analyzer
PRTG
Plixer - Scrutinizer
PeakFlow
Kentik
Solarwinds NTA


Thanks in advance...

Erik






Re: Service Provider NetFlow Collectors

2018-12-31 Thread Mike Hammett
I just recently rolled out Elastiflow. Lots of great information. 




- 
Mike Hammett 
Intelligent Computing Solutions 
http://www.ics-il.com 

Midwest-IX 
http://www.midwest-ix.com 

- Original Message -

From: "Michel 'ic' Luczak"  
To: "Erik Sundberg"  
Cc: nanog@nanog.org 
Sent: Monday, December 31, 2018 3:40:40 AM 
Subject: Re: Service Provider NetFlow Collectors 

Don’t underestimate good old ELK 
https://www.elastic.co/guide/en/logstash/current/netflow-module.html 
+ https://github.com/robcowart/elastiflow 


BR, ic 





On 31 Dec 2018, at 04:29, Erik Sundberg < esundb...@nitelusa.com > wrote: 



Hi Nanog…. 

We are looking at replacing our Netflow collector. I am wonder what other 
service providers are using to collect netflow data off their Core and Edge 
Routers. Pros/Cons… What to watch out for any info would help. 

We are mainly looking to analyze the netflow data. Bonus if it does ddos 
detection and mitigation. 

We are looking at 
ManageEngine Netflow Analyzer 
PRTG 
Plixer – Scrutinizer 
PeakFlow 
Kentik 
Solarwinds NTA 


Thanks in advance… 

Erik 








Re: Service Provider NetFlow Collectors

2018-12-31 Thread Bryan Holloway

+1 Kentik ...

We've been using their DDoS/RTBH mitigation with good success.


On 12/31/18 3:52 AM, Eric Lindsjö wrote:

Hi,

We use kentik and we're very happy. Works great, tons of new features 
coming along all the time. Going to start looking into ddos detection 
and mitigation soon.


Would recommend.

Kind regards,
Eric Lindsjö


On 12/31/2018 04:29 AM, Erik Sundberg wrote:


Hi Nanog….

We are looking at replacing our Netflow collector. I am wonder what 
other service providers are using to collect netflow data off their 
Core and Edge Routers. Pros/Cons… What to watch out for any info would 
help.


We are mainly looking to analyze the netflow data. Bonus if it does 
ddos detection and mitigation.


We are looking at

ManageEngine Netflow Analyzer

PRTG

Plixer – Scrutinizer

PeakFlow

Kentik

Solarwinds NTA

Thanks in advance…

Erik








RE: CenturyLink RCA?

2018-12-31 Thread Naslund, Steve
They shouldn’t need OOB to operate existing lambdas, just to configure new 
ones.  One possibility is that the management interface also handles master 
timing, which would be a really bad idea but is possible (it should be 
redundant, and it should be able to free run for a reasonable amount of time).  
The main issue exposed is that the management interface is obviously critical 
and is not redundant enough.  That is, if we believe the OOB explanation in the 
first place (which, by the way, is obviously not OOB, since it wiped out the 
in-band network when it failed).

Steven Naslund
Chicago IL


>This seems entirely plausible given that DWDM amplifiers and lasers being a 
>complex analog system, they need OOB to align.
>--
>Eric




RE: CenturyLink RCA?

2018-12-31 Thread Naslund, Steve
I agree 100%.  Now they need to figure out why bricking the management network 
stopped forwarding on the optical side.

>(Forgive my top posting, not on my desktop as I’m out of town)

Steven Naslund
Chicago IL
>
>Wild guess, based on my own experience as a NOC admin/head of operations at a 
>large ISP - they have an automated deployment system for new firmware for a 
>(mission critical) piece of backbone hardware.
>
>They may have tested said firmware on a chassis with cards that did not 
>exactly match the hardware they had in actual deployment (ie: card was older 
>hw revision in deployed hardware), and while it worked fine there, it 
>proceeded shit the bed in the production.
>
>Or, they missed a mandatory low level hardware firmware upgrade that has to be 
>applied separately before the other main upgrade.
>
>Kinda picturing in my mind that they staged all the updates, set a timer, 
>staggered reboot, and after the first hit the fan, they couldn’t stop the rest 
>as it fell apart as each upgraded unit fell on its own sword on reboot.
>
>I’ve been bit by the ‘this card revision is not supported under this 
>platform/release’ bug more often then I’d like to admit.
>
>And, yes, my eyes did start to get glossy and hazy the more I read their 
>explanation as well.  It’s exactly the kind of useless post I’d write when I 
>want to get (stupid) people off my back about a problem.





RE: CenturyLink RCA?

2018-12-31 Thread Naslund, Steve
See my comments in line.

Steve

>Hey Steve,

>I will continue to speculate, as that's all we have.

> 1.  Are you telling me that several line cards failed in multiple cities in 
> the same way at the same time?  Don't think so unless the same software fault 
> was propagated to all of them.  If the problem was that they needed to be 
> reset, couldn't that be accomplished by simply reseating them?

>L2 DCN/OOB, whole network shares single broadcast domain. 

Bad design if that’s the case, that would be a huge subnet.  However even if 
that was the case, you would not need to replace hardware in multiple places.  
You might have to reset it but not replace it.  Also being an ILEC it seems 
hard to believe how long their dispatches to their own central office took.  It 
might have taken awhile to locate the original problem but they should have 
been able to send a corrective procedure to CO personnel who are a lot closer 
to the equipment.  In my region (Northern Illinois) we can typically get access 
to a CO in under 30 minutes 24/7.  They are essentially smart hands technicians 
that can reseat or replace line cards.

> 2.  Do we believe that an OOB management card was able to generate so much 
> traffic as to bring down the optical switching?  Very doubtful which means 
> that the systems were actually broken due to trying to PROCESS the "invalid 
> frames".  Seems like very poor control plane management if the system is 
> attempting to process invalid data and bringing down the forwarding plane.

>L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
>However I can be argued that optical network should fail up in absence of 
>control-plane, IP network has to fail down.

Most of the optical muxes I have worked with will run without any management 
card or control plane at all.  Usually the line cards keep forwarding according 
to the existing configuration even in the absence of all management functions.  
It would help if we knew what gear this was.  True optical muxes do not require 
much care and feeding once they have a configuration loaded.  If they are truly 
dependent on that control plane, then it needs to be redundant enough, with 
watchdogs to reset them if they become unresponsive, and they need policers 
and rate limiters on their interfaces.  It seems they would be vulnerable to a 
DoS if a bad BPDU can wipe them out.
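
As a rough illustration of the policer idea above, here is a minimal token-bucket sketch in Python. It is not any vendor's implementation; the class name, the rate and burst figures, and the storm simulation are all assumptions made up for the example. The point is only that a policer bounds what reaches the control plane no matter how hard the management segment is flooded.

```python
from dataclasses import dataclass


@dataclass
class TokenBucketPolicer:
    """Single-rate policer: admit a frame only if a token is available.

    Hypothetical helper for illustration, not a real device API.
    """
    rate_pps: float          # sustained frames per second allowed toward the CPU
    burst: float             # bucket depth, i.e. the largest tolerated burst
    tokens: float = 0.0
    last_seen: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous frame, capped at burst.
        elapsed = now - self.last_seen
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True    # frame is punted to the control plane
        return False       # frame is dropped before it can load the CPU


def simulate_storm(policer: TokenBucketPolicer, frames: int, duration_s: float) -> int:
    """Offer `frames` evenly spaced frames over `duration_s`; return how many pass."""
    step = duration_s / frames
    return sum(policer.allow(i * step) for i in range(frames))


if __name__ == "__main__":
    policer = TokenBucketPolicer(rate_pps=100, burst=50)
    passed = simulate_storm(policer, frames=100_000, duration_s=1.0)
    print(f"offered 100000 frames, admitted {passed}")
```

With these assumed numbers the admitted count can never exceed burst + rate x duration (50 + 100 frames here), however many frames the loop offers.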

> 3.  In the cited document it was stated that the offending packet did not 
> have source or destination information.  If so, how did it get propagated 
> throughout the network?

>BPDU

Maybe, but it would be strange for it to be invalid yet valid enough to keep 
being forwarded.  In any case, loss of the management network should not 
interrupt forwarding.  I also would not be happy with an optical network that 
relies on spanning tree to remain operational.

> My guess at the time and my current opinion (which has no real factual basis, 
> just years of experience) is that a bad software package was propagated 
> through their network.

>Lot of possible reasons, I choose to believe what they've communicated is what 
>the writer of the communication thought that happened, but as they likely are 
>not SME it's broken radio communication. BCAST storm on L2 DCN would 
>plausibly fit the very ambiguous reason offered and is something people 
>actually are doing.

My biggest problem with their explanation is the replacement of line cards in 
multiple cities.  The only way that happens is when bad code gets pushed to 
them.  If it took them that long to fix an L2 broadcast storm, something is 
seriously wrong with their engineering.  Resetting the management interfaces 
should be sufficient once the offending line card is removed.  That is why I 
think this was a software update failure or a configuration push.  Either way, 
they should be jumping up and down on their vendor as to why this caused such 
large scale effects.


Re: CenturyLink RCA?

2018-12-31 Thread Brielle
(Forgive my top posting, not on my desktop as I’m out of town)

Wild guess, based on my own experience as a NOC admin/head of operations at a 
large ISP - they have an automated deployment system for new firmware for a 
(mission critical) piece of backbone hardware.

They may have tested said firmware on a chassis with cards that did not exactly 
match the hardware they had in actual deployment (i.e. the card was an older hw 
revision in the deployed hardware), and while it worked fine there, it proceeded 
to shit the bed in production.

Or, they missed a mandatory low level hardware firmware upgrade that has to be 
applied separately before the other main upgrade.

Kinda picturing in my mind that they staged all the updates, set a timer, 
staggered reboot, and after the first hit the fan, they couldn’t stop the rest 
as it fell apart as each upgraded unit fell on its own sword on reboot.
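
To make the "couldn't stop the rest" failure mode concrete, here is a minimal Python sketch of the alternative: a rollout that upgrades one node at a time and halts as soon as a node fails its health check. Everything in it is hypothetical (the function name, the node names, the `upgrade` and `healthy` callbacks); it says nothing about how CenturyLink or any vendor actually deploys firmware.

```python
from typing import Callable, Iterable, List, Tuple


def staged_rollout(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Upgrade nodes one at a time; stop at the first node that fails its check.

    Returns (upgraded, untouched). A failed node is counted in `upgraded`
    because it has already taken the new firmware.
    """
    pending = list(nodes)
    upgraded: List[str] = []
    while pending:
        node = pending.pop(0)
        upgrade(node)
        upgraded.append(node)
        if not healthy(node):
            break  # halt: the remaining nodes keep the old, working firmware
    return upgraded, pending


if __name__ == "__main__":
    # Hypothetical node names; "den-2" stands in for the unsupported card revision.
    bad_revision = {"den-2"}
    done, skipped = staged_rollout(
        ["chi-1", "den-2", "nyc-3", "sea-4"],
        upgrade=lambda node: None,
        healthy=lambda node: node not in bad_revision,
    )
    print("upgraded:", done)
    print("left alone:", skipped)
```

The design choice is that the health check gates each next step, rather than a timer firing regardless of what happened to the previous unit.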

I’ve been bit by the ‘this card revision is not supported under this 
platform/release’ bug more often than I’d like to admit.

And, yes, my eyes did start to get glossy and hazy the more I read their 
explanation as well.  It’s exactly the kind of useless post I’d write when I 
want to get (stupid) people off my back about a problem.

Sent from my iPad

> On Dec 31, 2018, at 7:53 AM, Naslund, Steve  wrote:
> 
> Not buying this explanation for a number of reasons :
> 
> 1.  Are you telling me that several line cards failed in multiple cities in 
> the same way at the same time?  Don't think so unless the same software fault 
> was propagated to all of them.  If the problem was that they needed to be 
> reset, couldn't that be accomplished by simply reseating them?
> 
> 2.  Do we believe that an OOB management card was able to generate so much 
> traffic as to bring down the optical switching?  Very doubtful which means 
> that the systems were actually broken due to trying to PROCESS the "invalid 
> frames".  Seems like very poor control plane management if the system is 
> attempting to process invalid data and bringing down the forwarding plane.
> 
> 3.  In the cited document it was stated that the offending packet did not 
> have source or destination information.  If so, how did it get propagated 
> throughout the network?
> 
> My guess at the time and my current opinion (which has no real factual basis, 
> just years of experience) is that a bad software package was propagated 
> through their network.
> 
> Steven Naslund
> Chicago IL
> 
>> 
>> One thing that is troubling when reading that URL is that it appears several 
>> steps of restoration required teams to go onsite for local login, etc.,. 
>> Granted, to troubleshoot hardware you need to be physically present to pop a 
>> line card in and out, but CTL/LVL3 should have full out-of-band console and 
>> power control to all core devices, we shouldn't be waiting for someone to 
>> drive to a location to get console or do power cycling. And I would imagine 
>> the first step to alot of the troubleshooting was power cycling and local 
>> console logs.
>> 
>> 
>> -John
>> 
>> 
>> 
>> On 12/30/18 10:42 AM, Mike Hammett wrote:
>> 
>> It's technical enough so that laypeople immediately lose interest, yet 
>> completely useless to anyone that works with this stuff.
>> 
>> 
>> 
>> -
>> Mike Hammett
>> Intelligent Computing Solutions
>> http://www.ics-il.com
>> 
>> Midwest-IX
>> http://www.midwest-ix.com
>> 
>> 
>> From: "Saku Ytti" 
>> To: "nanog list" 
>> Sent: Sunday, December 30, 2018 7:42:49 AM
>> Subject: CenturyLink RCA?
>> 
>> Apologies for the URL, I do not know official source and I do not 
>> share the URLs sentiment.
>> https://fuckingcenturylink.com/
>> 
>> Can someone translate this to IP engineer? What did actually happen?
>> From my own history, I rarely recognise the problem I fixed from 
>> reading the public RCA. I hope CenturyLink will do better.
>> 
>> Best guess so far that I've heard is
>> 
>> a) CenturyLink runs global L2 DCN/OOB
>> b) there was HW fault which caused L2 loop (perhaps HW dropped BPDU, 
>> I've had this failure mode)
>> c) DCN had direct access to control-plane, and L2 congested 
>> control-plane resources causing it to deprovision waves
>> 
>> Now of course this is entirely speculation, but intended to show what 
>> type of explanation is acceptable and can be used to fix things.
>> Hopefully CenturyLink does come out with IP-engineering readable 
>> explanation, so that we may use it as leverage to support work in our 
>> own domains to remove such risks.
>> 
>> a) do not run L2 DCN/OOB
>> b) do not connect MGMT ETH (it is unprotected access to control-plane, 
>> it  cannot be protected by CoPP/lo0 filter/LPTS ec)
>> c) do add in your RFP scoring item for proper OOB port (Like Cisco 
>> CMP)
>> d) do fail optical network up
>> 
>> --
>>  ++ytti
>> 
> 
> 
> --
>  ++ytti



Re: CenturyLink RCA?

2018-12-31 Thread Saku Ytti
Hey Steve,

I will continue to speculate, as that's all we have.

> 1.  Are you telling me that several line cards failed in multiple cities in 
> the same way at the same time?  Don't think so unless the same software fault 
> was propagated to all of them.  If the problem was that they needed to be 
> reset, couldn't that be accomplished by simply reseating them?

L2 DCN/OOB, whole network shares single broadcast domain

> 2.  Do we believe that an OOB management card was able to generate so much 
> traffic as to bring down the optical switching?  Very doubtful which means 
> that the systems were actually broken due to trying to PROCESS the "invalid 
> frames".  Seems like very poor control plane management if the system is 
> attempting to process invalid data and bringing down the forwarding plane.

L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
However, it can be argued that an optical network should fail up in the
absence of control-plane, while an IP network has to fail down.
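
A toy Python sketch of why an L2 loop produces that much trash: Ethernet frames carry no TTL, so with nothing blocking ports every switch keeps re-flooding a broadcast out of every port except the one it arrived on. The topology, switch names and function below are assumptions for illustration only, not a model of any real DCN.

```python
from typing import Dict, List, Tuple


def flood_copies(links: Dict[str, List[str]], origin: str, hops: int) -> List[int]:
    """Return the number of frame copies in flight after each hop (hops >= 1).

    Every switch floods a received broadcast to all neighbours except the one
    it came from; nothing blocks ports or ages frames out.
    """
    # Each in-flight copy is tracked as (sender, receiver).
    in_flight: List[Tuple[str, str]] = [(origin, peer) for peer in links[origin]]
    counts = [len(in_flight)]
    for _ in range(hops - 1):
        in_flight = [
            (receiver, peer)
            for sender, receiver in in_flight
            for peer in links[receiver]
            if peer != sender
        ]
        counts.append(len(in_flight))
    return counts


if __name__ == "__main__":
    # Hypothetical four-switch full mesh with no loop protection.
    mesh = {s: [p for p in "ABCD" if p != s] for s in "ABCD"}
    print(flood_copies(mesh, "A", 5))
    # A plain three-switch ring: copies do not multiply, but they never die.
    ring = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
    print(flood_copies(ring, "A", 5))
```

In the mesh the copy count doubles every hop; in the ring it stays constant but circulates indefinitely. Either way a single frame becomes a sustained load on every attached management port.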

> 3.  In the cited document it was stated that the offending packet did not 
> have source or destination information.  If so, how did it get propagated 
> throughout the network?

BPDU

> My guess at the time and my current opinion (which has no real factual basis, 
> just years of experience) is that a bad software package was propagated 
> through their network.

There are a lot of possible reasons. I choose to believe that what they've
communicated is what the writer of the communication thought happened, but
as they are likely not an SME, it's a game of broken telephone. A BCAST storm
on an L2 DCN would plausibly fit the very ambiguous reason offered and is
something people are actually doing.

-- 
  ++ytti


RE: CenturyLink RCA?

2018-12-31 Thread Naslund, Steve
Not buying this explanation for a number of reasons:

1.  Are you telling me that several line cards failed in multiple cities in the 
same way at the same time?  Don't think so unless the same software fault was 
propagated to all of them.  If the problem was that they needed to be reset, 
couldn't that be accomplished by simply reseating them?

2.  Do we believe that an OOB management card was able to generate so much 
traffic as to bring down the optical switching?  Very doubtful which means that 
the systems were actually broken due to trying to PROCESS the "invalid frames". 
 Seems like very poor control plane management if the system is attempting to 
process invalid data and bringing down the forwarding plane.

3.  In the cited document it was stated that the offending packet did not have 
source or destination information.  If so, how did it get propagated throughout 
the network?

My guess at the time and my current opinion (which has no real factual basis, 
just years of experience) is that a bad software package was propagated through 
their network.

Steven Naslund
Chicago IL

>
> One thing that is troubling when reading that URL is that it appears several 
> steps of restoration required teams to go onsite for local login, etc.,. 
> Granted, to troubleshoot hardware you need to be physically present to pop a 
> line card in and out, but CTL/LVL3 should have full out-of-band console and 
> power control to all core devices, we shouldn't be waiting for someone to 
> drive to a location to get console or do power cycling. And I would imagine 
> the first step to alot of the troubleshooting was power cycling and local 
> console logs.
>
>
> -John
>
>
>
> On 12/30/18 10:42 AM, Mike Hammett wrote:
>
> It's technical enough so that laypeople immediately lose interest, yet 
> completely useless to anyone that works with this stuff.
>
>
>
> -
> Mike Hammett
> Intelligent Computing Solutions
> http://www.ics-il.com
>
> Midwest-IX
> http://www.midwest-ix.com
>
> 
> From: "Saku Ytti" 
> To: "nanog list" 
> Sent: Sunday, December 30, 2018 7:42:49 AM
> Subject: CenturyLink RCA?
>
> Apologies for the URL, I do not know official source and I do not 
> share the URLs sentiment.
> https://fuckingcenturylink.com/
>
> Can someone translate this to IP engineer? What did actually happen?
> From my own history, I rarely recognise the problem I fixed from 
> reading the public RCA. I hope CenturyLink will do better.
>
> Best guess so far that I've heard is
>
> a) CenturyLink runs global L2 DCN/OOB
> b) there was HW fault which caused L2 loop (perhaps HW dropped BPDU, 
> I've had this failure mode)
> c) DCN had direct access to control-plane, and L2 congested 
> control-plane resources causing it to deprovision waves
>
> Now of course this is entirely speculation, but intended to show what 
> type of explanation is acceptable and can be used to fix things.
> Hopefully CenturyLink does come out with IP-engineering readable 
> explanation, so that we may use it as leverage to support work in our 
> own domains to remove such risks.
>
> a) do not run L2 DCN/OOB
> b) do not connect MGMT ETH (it is unprotected access to control-plane, 
> it  cannot be protected by CoPP/lo0 filter/LPTS ec)
> c) do add in your RFP scoring item for proper OOB port (Like Cisco 
> CMP)
> d) do fail optical network up
>
> --
>   ++ytti
>


--
  ++ytti


Re: Service Provider NetFlow Collectors

2018-12-31 Thread Karsten Elfenbein
Another tool worth looking into is Traffic Sentinel from inMon.

Karsten

Am Mo., 31. Dez. 2018 um 04:31 Uhr schrieb Erik Sundberg
:
>
> Hi Nanog….
>
>
>
> We are looking at replacing our Netflow collector. I am wonder what other 
> service providers are using to collect netflow data off their Core and Edge 
> Routers. Pros/Cons… What to watch out for any info would help.
>
>
>
> We are mainly looking to analyze the netflow data. Bonus if it does ddos 
> detection and mitigation.
>
>
>
> We are looking at
>
> ManageEngine Netflow Analyzer
>
> PRTG
>
> Plixer – Scrutinizer
>
> PeakFlow
>
> Kentik
>
> Solarwinds NTA
>
>
>
>
>
> Thanks in advance…
>
>
>
> Erik
>
>
>
>
> 
>


Re: Service Provider NetFlow Collectors

2018-12-31 Thread Jörg Kost

Hi,

I am always peeking at this OSS project for new installations

https://github.com/VerizonDigital/vflow

- but I have not tried it out myself so far.

Jörg

On 31 Dec 2018, at 4:29, Erik Sundberg wrote:


Hi Nanog

We are looking at replacing our Netflow collector. I am wondering what
other service providers are using to collect netflow data off their
Core and Edge Routers. Pros/Cons... What to watch out for; any info
would help.


We are mainly looking to analyze the netflow data. Bonus if it does 
ddos detection and mitigation.


We are looking at
ManageEngine Netflow Analyzer
PRTG
Plixer - Scrutinizer
PeakFlow
Kentik
Solarwinds NTA


Thanks in advance...

Erik


RE: CenturyLink

2018-12-31 Thread Matthew Huff
There isn't a specific regulation on free-running GPS, just "due diligence". I
work at an algorithmic program trading company (and have for 20 years). We
have a high ROI, so the cost differential for the rubidium OC, versus having to
drop everything to conform to regulatory requirements due to a short GPS
outage, makes this a no-brainer.


Matthew Huff | 1 Manhattanville Rd 
Director of Operations   | Purchase, NY 10577
OTA Management LLC   | Phone: 914-460-4039


-Original Message-
From: NANOG [mailto:nanog-boun...@nanog.org] On Behalf Of Saku Ytti
Sent: Monday, December 31, 2018 3:28 AM
To: Gary E. Miller 
Cc: nanog@nanog.org
Subject: Re: CenturyLink

Hey Gary,

On Mon, 31 Dec 2018 at 05:02, Gary E. Miller  wrote:

> The Rb frequency reference will be two or three orders of magnitude 
> more stable than an expensive ovenized crystal.

Perhaps, but not supported by this:
https://www.meinbergglobal.com/english/specs/gpsopt.htm

For the tl;dr folk, crystal drifts +-4.5us per day, Rb +-1.1us (both seem like
unsatisfactorily high numbers to me, i.e. you don't want to be free-running 24h
with Rb). Luckily today we have GPS, Glonass, BeiDou, Galileo and a couple of
smaller ones, so there should be a somewhat reasonable amount of redundancy.
Unsure which commercially available NTP or PTP master clocks support all four.

But I of course readily accept Rb is objectively more accurate than crystal,
I'm just curious where it matters and I'm curious which regulation applies, who
falls under the regulation, and what specifically the regulation requires
about free-running accuracy.

--
  ++ytti


Re: Service Provider NetFlow Collectors

2018-12-31 Thread Eric Lindsjö

Hi,

We use kentik and we're very happy. Works great, tons of new features 
coming along all the time. Going to start looking into ddos detection 
and mitigation soon.


Would recommend.

Kind regards,
Eric Lindsjö


On 12/31/2018 04:29 AM, Erik Sundberg wrote:


Hi Nanog….

We are looking at replacing our Netflow collector. I am wondering what
other service providers are using to collect netflow data off their
Core and Edge Routers. Pros/Cons… What to watch out for; any info would
help.


We are mainly looking to analyze the netflow data. Bonus if it does 
ddos detection and mitigation.


We are looking at

ManageEngine Netflow Analyzer

PRTG

Plixer – Scrutinizer

PeakFlow

Kentik

Solarwinds NTA

Thanks in advance…

Erik








Re: Service Provider NetFlow Collectors

2018-12-31 Thread Michel 'ic' Luczak
Don’t underestimate good old ELK
https://www.elastic.co/guide/en/logstash/current/netflow-module.html 

+ https://github.com/robcowart/elastiflow 


BR, ic

> On 31 Dec 2018, at 04:29, Erik Sundberg  wrote:
> 
> Hi Nanog….
>  
> We are looking at replacing our Netflow collector. I am wondering what other
> service providers are using to collect netflow data off their Core and Edge
> Routers. Pros/Cons… What to watch out for; any info would help.
>  
> We are mainly looking to analyze the netflow data. Bonus if it does ddos 
> detection and mitigation.
>  
> We are looking at
> ManageEngine Netflow Analyzer
> PRTG
> Plixer – Scrutinizer
> PeakFlow
> Kentik
> Solarwinds NTA
>  
>  
> Thanks in advance…
>  
> Erik
>  
> 
> 



Re: CenturyLink

2018-12-31 Thread Saku Ytti
Hey Gary,

On Mon, 31 Dec 2018 at 05:02, Gary E. Miller  wrote:

> The Rb frequency reference will be two or three orders of magnitude
> more stable than an expensive ovenized crystal.

Perhaps, but not supported by this:
https://www.meinbergglobal.com/english/specs/gpsopt.htm

For the tl;dr folk, crystal drifts +-4.5us per day, Rb +-1.1us (both
seem like unsatisfactorily high numbers to me, i.e. you don't want to
be free-running 24h with Rb). Luckily today we have GPS, Glonass,
BeiDou, Galileo and a couple of smaller ones, so there should be a
somewhat reasonable amount of redundancy. Unsure which commercially
available NTP or PTP master clocks support all four.
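
To put the per-day drift figures quoted above in perspective, here is a
minimal sketch that converts them into holdover time for a given error
budget. It assumes drift accumulates linearly at the worst-case rate (real
oscillators also age and react to temperature), and the 1 us budget and the
names `holdover_hours` and `drift_after` are hypothetical, chosen only for
illustration; they are not taken from any regulation or from the Meinberg
page.

```python
# Worst-case free-running drift, in microseconds per day, as quoted in the
# message above (crystal +-4.5us, Rb +-1.1us).
DRIFT_US_PER_DAY = {"crystal": 4.5, "rubidium": 1.1}


def drift_after(drift_us_per_day: float, hours: float) -> float:
    """Accumulated worst-case error in microseconds after `hours` of
    free-running, ASSUMING purely linear drift (a simplification)."""
    return drift_us_per_day * hours / 24.0


def holdover_hours(drift_us_per_day: float, budget_us: float) -> float:
    """Hours of free-running until the linear worst-case drift reaches
    `budget_us` microseconds. `budget_us` is a hypothetical tolerance."""
    return 24.0 * budget_us / drift_us_per_day


if __name__ == "__main__":
    budget_us = 1.0  # hypothetical error budget, for illustration only
    for name, rate in DRIFT_US_PER_DAY.items():
        hours = holdover_hours(rate, budget_us)
        print(f"{name}: {hours:.1f} h until {budget_us} us of drift")
```

Under those assumptions the crystal uses up a 1 us budget in a little over
5 hours and Rb in just under 22 hours, which is consistent with the remark
that free-running for 24h is uncomfortable even with Rb.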

But I of course readily accept Rb is objectively more accurate than
crystal, I'm just curious where it matters and I'm curious which
regulation applies, who falls under the regulation, and what
specifically the regulation requires about free-running accuracy.

-- 
  ++ytti