RE: Level3 worldwide emergency upgrade?

2013-02-10 Thread Simon Allard
I think you might find it's this issue.

PSN-2013-01-823

Junos: Crafted TCP packet can lead to kernel crash




-Original Message-
From: Matthew Petach [mailto:mpet...@netflight.com] 
Sent: Thursday, 7 February 2013 7:23 a.m.
To: Jonathan Towne
Cc: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?

On Wed, Feb 6, 2013 at 5:10 AM, Jonathan Towne jto...@slic.com wrote:
 On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
 # The question should be more along the lines of, why aren't you multihomed 
 in a way that would make a 30 minute outage (which is inevitable) irrelevant 
 to you?

 The fun part of this emergency maintenance in the northeast USA was 
 that even folks who are multihomed felt it: Level3 managed to do this 
 in a way that kept BGP sessions up but killed the ability to actually 
 pass traffic.  I'm not sure what they did that caused this, or whether 
 anyone but northeast folks were affected by it, but it sure was neat 
 to be effectively blackholed in and out of one of your provided circuits for 
 a while.


I recommend you grab
http://kestrel3.netflight.com/2013.02.05-NANOG57-day2-afternoon-session.txt

and search for PR8361907

Richard did a very good lightning talk about why Juniper boxes will bring up 
BGP but blackhole traffic for 30 minutes to over an hour, depending on number 
of BGP sessions it is handling.

His recommendation--if you don't like it, go tell Juniper to fix that bug.

Matt







RE: Level3 worldwide emergency upgrade?

2013-02-10 Thread Simon Allard
Sorry, should rephrase.

The reason for the upgrade is PSN-2013-01-823 (PR 839412)

The reason for the BGP blackhole is, as you point out, PR8361907


-Original Message-
From: Simon Allard [mailto:simon.all...@team.orcon.net.nz] 
Sent: Monday, 11 February 2013 2:48 p.m.
To: Matthew Petach; Jonathan Towne
Cc: nanog@nanog.org
Subject: RE: Level3 worldwide emergency upgrade?

I think you might find it's this issue.

PSN-2013-01-823

Junos: Crafted TCP packet can lead to kernel crash




-Original Message-
From: Matthew Petach [mailto:mpet...@netflight.com]
Sent: Thursday, 7 February 2013 7:23 a.m.
To: Jonathan Towne
Cc: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?

On Wed, Feb 6, 2013 at 5:10 AM, Jonathan Towne jto...@slic.com wrote:
 On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
 # The question should be more along the lines of, why aren't you multihomed 
 in a way that would make a 30 minute outage (which is inevitable) irrelevant 
 to you?

 The fun part of this emergency maintenance in the northeast USA was 
 that even folks who are multihomed felt it: Level3 managed to do this 
 in a way that kept BGP sessions up but killed the ability to actually 
 pass traffic.  I'm not sure what they did that caused this, or whether 
 anyone but northeast folks were affected by it, but it sure was neat 
 to be effectively blackholed in and out of one of your provided circuits for 
 a while.


I recommend you grab
http://kestrel3.netflight.com/2013.02.05-NANOG57-day2-afternoon-session.txt

and search for PR8361907

Richard did a very good lightning talk about why Juniper boxes will bring up 
BGP but blackhole traffic for 30 minutes to over an hour, depending on number 
of BGP sessions it is handling.

His recommendation--if you don't like it, go tell Juniper to fix that bug.

Matt











Re: Level3 worldwide emergency upgrade?

2013-02-07 Thread Jeff Tantsura
Good times indeed...

Regards,
Jeff

On Feb 7, 2013, at 2:09, Brett Watson br...@the-watsons.org wrote:

 Hell, we used to not have to bother notifying customers of anything, we just 
 fixed the problem. Reminds me of a story I've probably shared in the past. 
 
 1995, IETF in Dallas. The big ISP I worked for at the time got tripped up 
 on a 24-day IS-IS timer bug (maybe all of them at the time did, I don't 
 recall)  where all adjacencies reset at once. That's like, entire network 
 down. Working with our engineering team in the *terminal* lab mind you, and 
 Ravi Chandra (then at Cisco) we reloaded the entire network of routers with 
 new code from Cisco once they'd fixed the bug. I seem to remember this being 
 my first exposure to Tony Li's infamous line, ... Confidence Level: boots in 
 the lab.
 
 Good times.
 
 -b
 
 
 On Feb 6, 2013, at 5:41 PM, Brandt, Ralph wrote:
 
 David. I am on an evening shift and am just now reading this thread.   
 
 I was almost tempted to write an explanation that would have had
 identical content with yours based simply on Level3 doing something and
 keeping the information close.  
 
 Responsible Vendors do not try to hide what is being done unless it is
 an Op Sec issue and I have never seen Level3 act with less than
 responsibility so it had to be Op Sec. 
 
 When it is that, it is best if the remainder of us sit quietly on the
 sidelines.
 
 Ralph Brandt
 
 
 -Original Message-
 From: Siegel, David [mailto:david.sie...@level3.com] 
 Sent: Wednesday, February 06, 2013 12:01 PM
 To: 'Ray Wong'; nanog@nanog.org
 Subject: RE: Level3 worldwide emergency upgrade?
 
 Hi Ray,
 
 This topic reminds me of yesterday's discussion in the conference around
 getting some BCOPs drafted.  It would be useful to confirm my own view
 of the BCOP around communicating security issues.  My understanding for
 the best practice is to limit knowledge distribution of security related
 problems both before and after the patches are deployed.  You limit
 knowledge before the patch is deployed to prevent yourself from being
 exploited, but you also limit knowledge afterwards in order to limit
 potential damage to others (customers, competitors...the Internet at
 large).  You also do not want to announce that you will be deploying a
 security patch until you have a fix in hand and know when you will
 deploy it (typically, next available maintenance window unless the cat
 is out of the bag and danger is real and imminent).
 
 As a service provider, you should stay on top of security alerts from
 your vendors so that you can make your own decision about what action is
 required.  I would not recommend relying on service provider maintenance
 bulletins or public operations mailing lists for obtaining this type of
 information.  There is some information that can cause more harm than
 good if it is distributed in the wrong way and information relating to
 security vulnerabilities definitely falls into that category.
 
 Dave
 
 -Original Message-
 From: Ray Wong [mailto:r...@rayw.net] 
 Sent: Wednesday, February 06, 2013 9:16 AM
 To: nanog@nanog.org
 Subject: Re: Level3 worldwide emergency upgrade?
 
 
 OK, having had that first cup of coffee, I can say perhaps the main
 reason I was wondering is I've gotten used to Level3 always being on top
 of things (and admittedly, rarely communicating). They've reached the
 top by often being a black box of reliability, so it's (perhaps
 unrealistically) surprising to see them caught by surprise. Anything
 that pushes them into scramble mode causes me to lose a little sleep
 anyway. The alternative to what they did seems likely for at least a few
 providers who'll NOT manage to fix things in time, so I may well be
 looking at longer outages from other providers, and need to issue
 guidance to others on what to do if/when other links go down for periods
 long enough that all the cost-bounding monitoring alarms start to scream
 even louder.
 
 I was also grumpy at myself for having not noticed advance
 communication, which I still don't seem to have, though since I
 outsourced my email to bigG, I've noticed I'm more likely to miss
 things. Perhaps giving up maintaining that massive set of procmail rules
 has cost me a bit more edge.
 
 Related, of course, just because you design/run your network to tolerate
 some issues doesn't mean you can also budget to be in support contract
 as well. :) Knowing more about the exploit/fix might mean trying to find
 a way to get free upgrades to some kit to prevent more localized attacks
 to other types of gear, as well, though in this case it's all about
 Juniper PR839412 then, so vendor specific, it seems?
 
 There are probably more reasons to wish for more info, too. There's
 still more of them (exploiters/attackers) than there are those of us
 trying to keep things running smoothly and transparently, so anything
 that smells of OMG new exploit found! also triggers my desire to share
 information

Re: Level3 worldwide emergency upgrade?

2013-02-07 Thread Brian Landers
The Juniper PR in question is actually 836197.


On Wed, Feb 6, 2013 at 10:22 AM, Matthew Petach mpet...@netflight.com wrote:

 On Wed, Feb 6, 2013 at 5:10 AM, Jonathan Towne jto...@slic.com wrote:
  On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
  # The question should be more along the lines of, why aren't you
 multihomed in a way that would make a 30 minute outage (which is
 inevitable) irrelevant to you?
 
  The fun part of this emergency maintenance in the northeast USA was that
 even
  folks who are multihomed felt it: Level3 managed to do this in a way that
  kept BGP sessions up but killed the ability to actually pass traffic.
  I'm not
  sure what they did that caused this, or whether anyone but northeast
 folks
  were affected by it, but it sure was neat to be effectively blackholed
 in and
  out of one of your provided circuits for a while.


 I recommend you grab
 http://kestrel3.netflight.com/2013.02.05-NANOG57-day2-afternoon-session.txt

 and search for PR8361907

 Richard did a very good lightning talk about why
 Juniper boxes will bring up BGP but blackhole
 traffic for 30 minutes to over an hour, depending
 on number of BGP sessions it is handling.

 His recommendation--if you don't like it, go tell
 Juniper to fix that bug.

 Matt




-- 
Brian C Landers
http://www.packetslave.com/
CCIE #23115 (RS + Security)


RE: Level3 worldwide emergency upgrade?

2013-02-07 Thread Siegel, David
I remember being glued to my workstation for 10 straight hours due to an OSPF 
bug that took down the whole of net99's network.

I was pretty proud of our size at the time...about 30Mbps at peak.  Times are 
different and so are expectations.  :-)

Dave


-Original Message-
From: Brett Watson [mailto:br...@the-watsons.org] 
Sent: Wednesday, February 06, 2013 6:07 PM
To: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?

Hell, we used to not have to bother notifying customers of anything, we just 
fixed the problem. Reminds me of a story I've probably shared in the past. 

1995, IETF in Dallas. The big ISP I worked for at the time got tripped up on 
a 24-day IS-IS timer bug (maybe all of them at the time did, I don't recall)  
where all adjacencies reset at once. That's like, entire network down. Working 
with our engineering team in the *terminal* lab mind you, and Ravi Chandra 
(then at Cisco) we reloaded the entire network of routers with new code from 
Cisco once they'd fixed the bug. I seem to remember this being my first 
exposure to Tony Li's infamous line, ... Confidence Level: boots in the lab.

Good times.

-b


On Feb 6, 2013, at 5:41 PM, Brandt, Ralph wrote:

 David. I am on an evening shift and am just now reading this thread.   
 
 I was almost tempted to write an explanation that would have had 
 identical content with yours based simply on Level3 doing something 
 and keeping the information close.
 
 Responsible Vendors do not try to hide what is being done unless it is 
 an Op Sec issue and I have never seen Level3 act with less than 
 responsibility so it had to be Op Sec.
 
 When it is that, it is best if the remainder of us sit quietly on the 
 sidelines.
 
 Ralph Brandt
 
 
 -Original Message-
 From: Siegel, David [mailto:david.sie...@level3.com]
 Sent: Wednesday, February 06, 2013 12:01 PM
 To: 'Ray Wong'; nanog@nanog.org
 Subject: RE: Level3 worldwide emergency upgrade?
 
 Hi Ray,
 
 This topic reminds me of yesterday's discussion in the conference 
 around getting some BCOPs drafted.  It would be useful to confirm my 
 own view of the BCOP around communicating security issues.  My 
 understanding for the best practice is to limit knowledge distribution 
 of security related problems both before and after the patches are 
 deployed.  You limit knowledge before the patch is deployed to prevent 
 yourself from being exploited, but you also limit knowledge afterwards 
 in order to limit potential damage to others (customers, 
 competitors...the Internet at large).  You also do not want to 
 announce that you will be deploying a security patch until you have a 
 fix in hand and know when you will deploy it (typically, next 
 available maintenance window unless the cat is out of the bag and danger is 
 real and imminent).
 
 As a service provider, you should stay on top of security alerts from 
 your vendors so that you can make your own decision about what action 
 is required.  I would not recommend relying on service provider 
 maintenance bulletins or public operations mailing lists for obtaining 
 this type of information.  There is some information that can cause 
 more harm than good if it is distributed in the wrong way and 
 information relating to security vulnerabilities definitely falls into that 
 category.
 
 Dave
 
 -Original Message-
 From: Ray Wong [mailto:r...@rayw.net]
 Sent: Wednesday, February 06, 2013 9:16 AM
 To: nanog@nanog.org
 Subject: Re: Level3 worldwide emergency upgrade?
 
 
 
 OK, having had that first cup of coffee, I can say perhaps the main 
 reason I was wondering is I've gotten used to Level3 always being on 
 top of things (and admittedly, rarely communicating). They've reached 
 the top by often being a black box of reliability, so it's (perhaps
 unrealistically) surprising to see them caught by surprise. Anything 
 that pushes them into scramble mode causes me to lose a little sleep 
 anyway. The alternative to what they did seems likely for at least a 
 few providers who'll NOT manage to fix things in time, so I may well 
 be looking at longer outages from other providers, and need to issue 
 guidance to others on what to do if/when other links go down for 
 periods long enough that all the cost-bounding monitoring alarms start 
 to scream even louder.
 
 I was also grumpy at myself for having not noticed advance 
 communication, which I still don't seem to have, though since I 
 outsourced my email to bigG, I've noticed I'm more likely to miss 
 things. Perhaps giving up maintaining that massive set of procmail 
 rules has cost me a bit more edge.
 
 Related, of course, just because you design/run your network to 
 tolerate some issues doesn't mean you can also budget to be in support 
 contract as well. :) Knowing more about the exploit/fix might mean 
 trying to find a way to get free upgrades to some kit to prevent more 
 localized attacks to other types of gear, as well, though in this case 
 it's all

Re: Level3 worldwide emergency upgrade?

2013-02-07 Thread Dorian Kim
No one had hit the ISIS bug before the IETF enforced maintenance freeze because 
no one in their right mind would be running three week old code back then. I 
don't think things have changed that much. ;)

-dorian

On Feb 7, 2013, at 4:19 PM, Siegel, David wrote:

 I remember being glued to my workstation for 10 straight hours due to an OSPF 
 bug that took down the whole of net99's network.
 
 I was pretty proud of our size at the time...about 30Mbps at peak.  Times are 
 different and so are expectations.  :-)
 
 Dave
 
 
 -Original Message-
 From: Brett Watson [mailto:br...@the-watsons.org] 
 Sent: Wednesday, February 06, 2013 6:07 PM
 To: nanog@nanog.org
 Subject: Re: Level3 worldwide emergency upgrade?
 
 Hell, we used to not have to bother notifying customers of anything, we just 
 fixed the problem. Reminds me of a story I've probably shared in the past. 
 
 1995, IETF in Dallas. The big ISP I worked for at the time got tripped up 
 on a 24-day IS-IS timer bug (maybe all of them at the time did, I don't 
 recall)  where all adjacencies reset at once. That's like, entire network 
 down. Working with our engineering team in the *terminal* lab mind you, and 
 Ravi Chandra (then at Cisco) we reloaded the entire network of routers with 
 new code from Cisco once they'd fixed the bug. I seem to remember this being 
 my first exposure to Tony Li's infamous line, ... Confidence Level: boots in 
 the lab.
 
 Good times.
 
 -b
 
 
 On Feb 6, 2013, at 5:41 PM, Brandt, Ralph wrote:
 
 David. I am on an evening shift and am just now reading this thread.   
 
 I was almost tempted to write an explanation that would have had 
 identical content with yours based simply on Level3 doing something 
 and keeping the information close.
 
 Responsible Vendors do not try to hide what is being done unless it is 
 an Op Sec issue and I have never seen Level3 act with less than 
 responsibility so it had to be Op Sec.
 
 When it is that, it is best if the remainder of us sit quietly on the 
 sidelines.
 
 Ralph Brandt
 
 
 -Original Message-
 From: Siegel, David [mailto:david.sie...@level3.com]
 Sent: Wednesday, February 06, 2013 12:01 PM
 To: 'Ray Wong'; nanog@nanog.org
 Subject: RE: Level3 worldwide emergency upgrade?
 
 Hi Ray,
 
 This topic reminds me of yesterday's discussion in the conference 
 around getting some BCOPs drafted.  It would be useful to confirm my 
 own view of the BCOP around communicating security issues.  My 
 understanding for the best practice is to limit knowledge distribution 
 of security related problems both before and after the patches are 
 deployed.  You limit knowledge before the patch is deployed to prevent 
 yourself from being exploited, but you also limit knowledge afterwards 
 in order to limit potential damage to others (customers, 
 competitors...the Internet at large).  You also do not want to 
 announce that you will be deploying a security patch until you have a 
 fix in hand and know when you will deploy it (typically, next 
 available maintenance window unless the cat is out of the bag and danger is 
 real and imminent).
 
 As a service provider, you should stay on top of security alerts from 
 your vendors so that you can make your own decision about what action 
 is required.  I would not recommend relying on service provider 
 maintenance bulletins or public operations mailing lists for obtaining 
 this type of information.  There is some information that can cause 
 more harm than good if it is distributed in the wrong way and 
 information relating to security vulnerabilities definitely falls into that 
 category.
 
 Dave
 
 -Original Message-
 From: Ray Wong [mailto:r...@rayw.net]
 Sent: Wednesday, February 06, 2013 9:16 AM
 To: nanog@nanog.org
 Subject: Re: Level3 worldwide emergency upgrade?
 
 
 
 OK, having had that first cup of coffee, I can say perhaps the main 
 reason I was wondering is I've gotten used to Level3 always being on 
 top of things (and admittedly, rarely communicating). They've reached 
 the top by often being a black box of reliability, so it's (perhaps
 unrealistically) surprising to see them caught by surprise. Anything 
 that pushes them into scramble mode causes me to lose a little sleep 
 anyway. The alternative to what they did seems likely for at least a 
 few providers who'll NOT manage to fix things in time, so I may well 
 be looking at longer outages from other providers, and need to issue 
 guidance to others on what to do if/when other links go down for 
 periods long enough that all the cost-bounding monitoring alarms start 
 to scream even louder.
 
 I was also grumpy at myself for having not noticed advance 
 communication, which I still don't seem to have, though since I 
 outsourced my email to bigG, I've noticed I'm more likely to miss 
 things. Perhaps giving up maintaining that massive set of procmail 
 rules has cost me a bit more edge.
 
 Related, of course, just because you design/run your network

Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread JP Viljoen
On 06 Feb 2013, at 11:58 AM, Ray Wong r...@rayw.net wrote:
 Does anyone have details on tonight's apparent worldwide emergency
 router upgrade? All I managed to get out of the portal was 30 minutes,
 Service Affecting (no kidding?) and the NOC line gave me the
 recording about it and disconnected me.

Nothing confirmed from my side, but the general guess I saw was that it was 
Juniper-related.

-J


Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Stephane Bortzmeyer
On Wed, Feb 06, 2013 at 01:04:40PM +0200,
 JP Viljoen froztb...@froztbyte.net wrote 
 a message of 10 lines which said:

 the general guess I saw was that it was Juniper-related.

Juniper Technical Bulletin PSN-2013-01-823, probably?



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread james jones
ugh!

On Wed, Feb 6, 2013 at 6:04 AM, JP Viljoen froztb...@froztbyte.net wrote:

 On 06 Feb 2013, at 11:58 AM, Ray Wong r...@rayw.net wrote:
  Does anyone have details on tonight's apparent worldwide emergency
  router upgrade? All I managed to get out of the portal was 30 minutes,
  Service Affecting (no kidding?) and the NOC line gave me the
  recording about it and disconnected me.

 Nothing confirmed from my side, but the general guess I saw was that it
 was Juniper-related.

 -J



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Jason Biel
That is the general guess.

On Wed, Feb 6, 2013 at 5:11 AM, Stephane Bortzmeyer bortzme...@nic.fr wrote:

 On Wed, Feb 06, 2013 at 01:04:40PM +0200,
  JP Viljoen froztb...@froztbyte.net wrote
  a message of 10 lines which said:

  the general guess I saw was that it was Juniper-related.

 Juniper Technical Bulletin PSN-2013-01-823, probably?




-- 
Jason


Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Bret Palsson
I just received this email from level3



Summary

Level 3 Communications will perform a mandatory network upgrade that will be 
service impacting and will impact devices in multiple locations. We are 
upgrading the code on portions of the global network to increase stability for 
the overall network. During this maintenance activity customers may be impacted 
for approximately 30 minutes.

Updates

This window of this maintenance has completed successfully.


On Feb 6, 2013, at 4:21 AM, Jason Biel ja...@biel-tech.com wrote:

 That is the general guess.
 
 On Wed, Feb 6, 2013 at 5:11 AM, Stephane Bortzmeyer bortzme...@nic.fr wrote:
 
 On Wed, Feb 06, 2013 at 01:04:40PM +0200,
 JP Viljoen froztb...@froztbyte.net wrote
 a message of 10 lines which said:
 
 the general guess I saw was that it was Juniper-related.
 
 Juniper Technical Bulletin PSN-2013-01-823, probably?
 
 
 
 
 -- 
 Jason




Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Peter Ehiwe
Also received same ...

On Wed, Feb 6, 2013 at 10:58 AM, Ray Wong r...@rayw.net wrote:

 Does anyone have details on tonight's apparent worldwide emergency
 router upgrade? All I managed to get out of the portal was 30 minutes,
 Service Affecting (no kidding?) and the NOC line gave me the
 recording about it and disconnected me.

 -R




-- 
Warm Regards

Peter(CCIE 23782).


Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Jared Mauch

On Feb 6, 2013, at 6:38 AM, Peter Ehiwe petereh...@gmail.com wrote:

 Also received same ...
 
 On Wed, Feb 6, 2013 at 10:58 AM, Ray Wong r...@rayw.net wrote:
 
 Does anyone have details on tonight's apparent worldwide emergency
 router upgrade? All I managed to get out of the portal was 30 minutes,
 Service Affecting (no kidding?) and the NOC line gave me the
 recording about it and disconnected me.

So, I'm wondering what is shocking that someone may have to push out some sort 
of upgrade either urgently or periodically that is so impacting and causes 
these emails on the list.

There seems to be some sort of psychological event happening in addition to the 
technological one.

In the past I've had to push out software fixes urgently due to various 
reasons, either being a software thing like the PSN or some weird 
hardware+software interaction that causes bad things to happen.

Would you rather your ISP not maintain their devices?  Are the consequences so 
bad of a 30 minute outage that your business is severely impacted?

- Jared





RE: Level3 worldwide emergency upgrade?

2013-02-06 Thread Alex Rubenstein
 Would you rather your ISP not maintain their devices?  Are the 
 consequences so bad of a 30 minute outage that your business
 is severely impacted?
 
 - Jared

You had me up until that line.

That should be expanded a little ...

First, I'd say, yes - many businesses would be severely impacted and may even 
have consequential issues if they had to sustain a 30 minute outage. Suppose 
for a moment they couldn't process money machines transactions for 30 minutes; 
or Netflix couldn't serve content for 30 minutes; or youporn was offline for 30 
minutes.

The question should be more along the lines of, why aren't you multihomed in a 
way that would make a 30 minute outage (which is inevitable) irrelevant to you?





Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Jonathan Towne
On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
# The question should be more along the lines of, why aren't you multihomed in 
a way that would make a 30 minute outage (which is inevitable) irrelevant to 
you?

The fun part of this emergency maintenance in the northeast USA was that even
folks who are multihomed felt it: Level3 managed to do this in a way that
kept BGP sessions up but killed the ability to actually pass traffic.  I'm not
sure what they did that caused this, or whether anyone but northeast folks
were affected by it, but it sure was neat to be effectively blackholed in and
out of one of your provided circuits for a while.

Also, in the northeast, they managed to make it quite a bit more than a 30min
outage for many people; they even slid hours outside of their advertised
emergency window.

I do applaud them for what I can only assume was a *massive* undertaking:
emergency upgrading that many routers in such a short period of time.

-- Jonathan Towne
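
[Editor's sketch, not part of the original message.] The failure mode described above, a BGP session that stays established while the circuit passes no traffic, is why failover that watches only session state can miss a blackhole. A minimal illustration in Python, assuming you already collect the session state and some data-plane probe counts per circuit; the names CircuitStatus and classify and the 50% loss threshold are hypothetical, not taken from any vendor tool:

```python
from dataclasses import dataclass


@dataclass
class CircuitStatus:
    """Hypothetical per-circuit health snapshot."""
    bgp_established: bool   # control plane: is the BGP session up?
    probes_sent: int        # data plane: probes sent through this circuit
    probes_answered: int    # data plane: probes that got a reply


def classify(status: CircuitStatus, max_loss: float = 0.5) -> str:
    """Combine control-plane and data-plane evidence into one verdict."""
    if not status.bgp_established:
        return "down"
    if status.probes_sent == 0:
        # No data-plane evidence at all; do not assume the circuit forwards.
        return "unknown"
    loss = 1 - status.probes_answered / status.probes_sent
    # Session up but most probes lost: the case described in this thread.
    return "blackholed" if loss > max_loss else "healthy"
```

A circuit reported as CircuitStatus(True, 10, 0) classifies as "blackholed", which a check on session state alone would have reported as fine. How the probes are sent, and to which targets, is left open here and depends on your own monitoring.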



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Jared Mauch

On Feb 6, 2013, at 7:57 AM, Alex Rubenstein a...@corp.nac.net wrote:

 Would you rather your ISP not maintain their devices?  Are the 
 consequences so bad of a 30 minute outage that your business
 is severely impacted?
 
 - Jared
 
 You had me up until that line.
 
 That should be expanded a little ...
 
 First, I'd say, yes - many businesses would be severely impacted and may even 
 have consequential issues if they had to sustain a 30 minute outage. Suppose 
 for a moment they couldn't process money machines transactions for 30 
 minutes; or Netflix couldn't serve content for 30 minutes; or youporn was 
 offline for 30 minutes.
 
 The question should be more along the lines of, why aren't you multihomed in 
 a way that would make a 30 minute outage (which is inevitable) irrelevant to 
 you?

Yeah, perhaps not as elegantly worded as I would have hoped, but there are many 
reasons things go down.  Just one of those elements is the internet part, 
there's also transport, power, and other elements that combine to make this 
complex system called the internet.  If you N+N or N+1 your power, perhaps 
something similar for your connectivity is important.  Or you just plan to be 
down/broken periodically for 30 minutes and have a plan to cover that.

The building where our NOC is located sometimes gets evacuated.  Having a plan 
for that is important.  During one visit, there was a small fire in the 
building (or so we were told).  Certainly an unexpected event that disrupted us 
for ~30 minutes.  

The handling and response of these events certainly is important.  I do want to 
understand why and how it's so bad so if there are things as a SP in the 
community we can improve upon we can do that.

That's my real goal, not poking at people who are single homed and down.

- Jared


RE: Level3 worldwide emergency upgrade?

2013-02-06 Thread Alex Rubenstein
 Yeah, perhaps not as elegantly worded as I would have hoped, but there are
 many reasons things go down.  Just one of those elements is the internet
 part, there's also transport, power, and other elements that combine to
 make this complex system called the internet.  If you N+N or N+1 your
 power, perhaps something similar for your connectivity is important.  Or you
 just plan to be down/broken periodically for 30 minutes and have a plan to
 cover that.

Agreed.

 The building where our NOC is located sometimes gets evacuated.  Having a
 plan for that is important.  During one visit, there was a small fire in the
 building (or so we were told).  Certainly an unexpected event that disrupted
 us for ~30 minutes.

And, if it is important to you, you will have N+N NOC's - ie, more than one, 
and different buildings, cities, or countries, depending on your requirement.


 The handling and response of these events certainly is important.  I do want
 to understand why and how it's so bad so if there are things as a SP in the
 community we can improve upon we can do that.

I suspect, as I touched previously, the most noise will come from the people 
who are the least realistic, and least prepared. Personally, I live with the 
expectation that whatever it is (power, fiber, transport, ISP, highways, fuel 
delivery, etc.) will at some point be broken, degraded, or otherwise 
unavailable, and you have to plan accordingly. 

Personally (and I speak for NAC) I/we don't care, really, if any upstream IP 
provider breaks; we have made appropriate plans to work around that in an 
automated fashion. Hope that answers your more general question.









Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Andrew Sullivan
On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:
 
 So, I'm wondering what is shocking that someone may have to push out some 
 sort of upgrade either urgently or periodically that is so impacting and 
 causes these emails on the list.
 

My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on.  Emergency work for five hours and 30 minutes
disconnection that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).

Regards,

A

-- 
Andrew Sullivan
Dyn, Inc.
asulli...@dyn.com
v: +1 603 663 0448



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Ray Wong
On Wed, Feb 6, 2013 at 7:10 AM, Andrew Sullivan asulli...@dyn.com wrote:
 On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:

 So, I'm wondering what is shocking that someone may have to push out some 
 sort of upgrade either urgently or periodically that is so impacting and 
 causes these emails on the list.


 My impression is mostly that people are left feeling uncomfortable by
 a massive upgrade of this sort with so little communication about why
 and so on.  Emergency work for five hours and 30 minutes
 disconnection that turns out to take longer than 30 minutes of
 disconnection probably ought to come with some explanation (at least
 after the fact).


Especially in the wake of the one they already did recently. It's unsettling
to receive little communication, and even multihomed, there's always
the question of being pushed into overages around other providers.

Yes, short notice maintenance does happen. Better communication
happens much less often.

I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using. I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the
network I do maintain.


-R



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread joel jaeggli

On 2/6/13 7:43 AM, Ray Wong wrote:

On Wed, Feb 6, 2013 at 7:10 AM, Andrew Sullivan asulli...@dyn.com wrote:

On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:

So, I'm wondering what is shocking that someone may have to push out some sort 
of upgrade either urgently or periodically that is so impacting and causes 
these emails on the list.


My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on.  Emergency work for five hours and 30 minutes
disconnection that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).


Especially in the wake they already recently did one. It's unsettling
to receive little communication, and even multihomed, there's always
the question of being pushed into overages around other providers.

Yes, short notice maintenance does happen. Better communication
happens much less often.
I received advance (24 hours) notification of maintenance over the last 
two days on circuits ranging in size from 100Mb/s to 10Gb/s in about a 
dozen locations. I assumed there would be further disruption as devices 
I'm not directly connected to were touched.


I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using.
All your other providers using that vendor have been scrambling for 
about a week as well. Junos devices should be upgraded.

  I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the
network I do maintain.


-R






Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Ray Wong


OK, having had that first cup of coffee, I can say perhaps the main
reason I was wondering is I've gotten used to Level3 always being on
top of things (and admittedly, rarely communicating). They've reached
the top by often being a black box of reliability, so it's (perhaps
unrealistically) surprising to see them caught by surprise. Anything
that pushes them into scramble mode causes me to lose a little sleep
anyway. The alternative to what they did seems likely for at least a
few providers who'll NOT manage to fix things in time, so I may well
be looking at longer outages from other providers, and need to issue
guidance to others on what to do if/when other links go down for
periods long enough that all the cost-bounding monitoring alarms start
to scream even louder.

I was also grumpy at myself for having not noticed advance
communication, which I still don't seem to have, though since I
outsourced my email to bigG, I've noticed I'm more likely to miss
things. Perhaps giving up maintaining that massive set of procmail
rules has cost me a bit more edge.

Related, of course, just because you design/run your network to
tolerate some issues doesn't mean you can also budget to be in support
contract as well. :) Knowing more about the exploit/fix might mean
trying to find a way to get free upgrades to some kit to prevent more
localized attacks to other types of gear, as well, though in this case
it's all about Juniper PR839412 then, so vendor specific, it seems?

There are probably more reasons to wish for more info, too. There's
still more of them (exploiters/attackers) than there are those of us
trying to keep things running smoothly and transparently, so anything
that smells of "OMG new exploit found!" also triggers my desire to
share information. The network bad guys share information far more
quickly and effectively than we do, it often seems.

-R



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread PC
Given the issue was announced a week ago, I'm surprised they didn't provide
some sort of emergency notification prior to the upgrade.  However, I
certainly understand their immediate desire to deploy this update.  I don't
think it's as bad as the BGP one from not too long ago, in that exploit code is
not yet publicly available to my knowledge, but it certainly won't take
long.


On Wed, Feb 6, 2013 at 9:04 AM, joel jaeggli joe...@bogus.com wrote:

 On 2/6/13 7:43 AM, Ray Wong wrote:

 On Wed, Feb 6, 2013 at 7:10 AM, Andrew Sullivan asulli...@dyn.com
 wrote:

 On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:

 So, I'm wondering what is shocking that someone may have to push out
 some sort of upgrade either urgently or periodically that is so impacting
 and causes these emails on the list.

  My impression is mostly that people are left feeling uncomfortable by
 a massive upgrade of this sort with so little communication about why
 and so on.  Emergency work for five hours and 30 minutes
 disconnection that turns out to take longer than 30 minutes of
 disconnection probably ought to come with some explanation (at least
 after the fact).

  Especially in the wake they already recently did one. It's unsettling
 to receive little communication, and even multihomed, there's always
 the question of being pushed into overages around other providers.

 Yes, short notice maintenance does happen. Better communication
 happens much less often.

 I received advance (24 hours) notification of maintenance over the last
 two days on circuits ranging in size from 100Mb/s to 10Gb/s in about a
 dozen locations. I assumed there would be further disruption as devices I'm
 not directly connected to were touched.


 I was more looking for details, i.e. the sort of problem this is, as
 it probably also means all my *other* providers are going to be
 scrambling in the next few days/weeks/months, depending on what gear
 they're all using.

 All your other providers using that vendor have been scrambling for about
 a week as well. Junos devices should be upgraded.

I'm out of the global infrastructure game myself
 for a few years currently, but I still have to think ahead to the
 network I do maintain.


 -R






Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Justin M. Streiner

On Wed, 6 Feb 2013, Ray Wong wrote:


My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on.  Emergency work for five hours and 30 minutes
disconnection that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).


I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using. I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the
network I do maintain.


If Level3 is pushing this upgrade because of a security vulnerability, 
like the recent Juniper PSN, any public notification will likely be 
tersely worded out of necessity.


You might be able to get more details by contacting your account team, but 
it's highly unlikely that you'll see the level of detail you're looking 
for in a public communication.  That's not a knock against Level3, and 
most other carriers will likely be equally tight-lipped on the details.


jms



RE: Level3 worldwide emergency upgrade?

2013-02-06 Thread Siegel, David
Hi Ray,

This topic reminds me of yesterday's discussion in the conference around 
getting some BCOPs drafted.  It would be useful to confirm my own view of the 
BCOP around communicating security issues.  My understanding for the best 
practice is to limit knowledge distribution of security related problems both 
before and after the patches are deployed.  You limit knowledge before the 
patch is deployed to prevent yourself from being exploited, but you also limit 
knowledge afterwards in order to limit potential damage to others (customers, 
competitors...the Internet at large).  You also do not want to announce that 
you will be deploying a security patch until you have a fix in hand and know 
when you will deploy it (typically, next available maintenance window unless 
the cat is out of the bag and danger is real and imminent).

As a service provider, you should stay on top of security alerts from your 
vendors so that you can make your own decision about what action is required.  
I would not recommend relying on service provider maintenance bulletins or 
public operations mailing lists for obtaining this type of information.  There 
is some information that can cause more harm than good if it is distributed in 
the wrong way and information relating to security vulnerabilities definitely 
falls into that category.

Dave

-Original Message-
From: Ray Wong [mailto:r...@rayw.net] 
Sent: Wednesday, February 06, 2013 9:16 AM
To: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?



OK, having had that first cup of coffee, I can say perhaps the main reason I 
was wondering is I've gotten used to Level3 always being on top of things (and 
admittedly, rarely communicating). They've reached the top by often being a 
black box of reliability, so it's (perhaps
unrealistically) surprising to see them caught by surprise. Anything that 
pushes them into scramble mode causes me to lose a little sleep anyway. The 
alternative to what they did seems likely for at least a few providers who'll 
NOT manage to fix things in time, so I may well be looking at longer outages 
from other providers, and need to issue guidance to others on what to do 
if/when other links go down for periods long enough that all the cost-bounding 
monitoring alarms start to scream even louder.

I was also grumpy at myself for having not noticed advance communication, which 
I still don't seem to have, though since I outsourced my email to bigG, I've 
noticed I'm more likely to miss things. Perhaps giving up maintaining that 
massive set of procmail rules has cost me a bit more edge.

Related, of course, just because you design/run your network to tolerate some 
issues doesn't mean you can also budget to be in support contract as well. :) 
Knowing more about the exploit/fix might mean trying to find a way to get free 
upgrades to some kit to prevent more localized attacks to other types of gear, 
as well, though in this case it's all about Juniper PR839412 then, so vendor 
specific, it seems?

There are probably more reasons to wish for more info, too. There's still more 
of them (exploiters/attackers) than there are those of us trying to keep things 
running smoothly and transparently, so anything that smells of OMG new exploit 
found! also triggers my desire to share information. The network bad guys 
share information far more quickly and effectively than we do, it often seems.

-R




Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread joel jaeggli

On 2/6/13 8:34 AM, Justin M. Streiner wrote:

On Wed, 6 Feb 2013, Ray Wong wrote:


My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on.  Emergency work for five hours and 30 minutes
disconnection that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).


I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using. I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the
network I do maintain.


If Level3 is pushing this upgrade because of a security vulnerability, 
like the recent Juniper PSN, any public notification will likely be 
tersely worded out of necessity.



The one that motivated us to upgrade is:

PR839412

I assume that applies to most people with interest in running current 
junos. My imagination is pretty good so that got my attention.
You might be able to get more details by contacting your account team, 
but it's highly unlikely that you'll see the level of detail you're 
looking for in a public communication. That's not a knock against 
Level3, and most other carriers will likely be equally tight-lipped on 
the details.


jms






Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Matthew Petach
On Wed, Feb 6, 2013 at 5:10 AM, Jonathan Towne jto...@slic.com wrote:
 On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
 # The question should be more along the lines of, why aren't you multihomed 
 in a way that would make a 30 minute outage (which is inevitable) irrelevant 
 to you?

 The fun part of this emergency maintenance in the northeast USA was that even
 folks who are multihomed felt it: Level3 managed to do this in a way that
 kept BGP sessions up but killed the ability to actually pass traffic.  I'm not
 sure what they did that caused this, or whether anyone but northeast folks
 were affected by it, but it sure was neat to be effectively blackholed in and
 out of one of your provided circuits for a while.


I recommend you grab
http://kestrel3.netflight.com/2013.02.05-NANOG57-day2-afternoon-session.txt

and search for PR8361907

Richard did a very good lightning talk about why
Juniper boxes will bring up BGP but blackhole
traffic for 30 minutes to over an hour, depending
on number of BGP sessions it is handling.

His recommendation--if you don't like it, go tell
Juniper to fix that bug.

Matt
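
The failure described above (BGP session established, traffic blackholed) is why a session-state check alone is not enough to decide whether a link is usable. As a rough sketch only, assuming you already have some way to learn the session state and to count probe replies across the circuit, the decision might look like this. The names ProbeResult, link_usable and max_loss are hypothetical and not from any vendor tool or from the talk:

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of data-plane probes sent across one circuit (hypothetical type)."""
    sent: int
    received: int


def link_usable(bgp_established: bool, probe: ProbeResult, max_loss: float = 0.5) -> bool:
    """Treat a link as usable only if the control plane AND the data plane look healthy.

    bgp_established: session state as reported by the router (assumed input).
    probe: counts from whatever end-to-end probing is in place (assumed input).
    max_loss: tolerated probe loss fraction; 0.5 is an arbitrary illustration.
    """
    if not bgp_established or probe.sent == 0:
        return False
    loss = 1 - probe.received / probe.sent
    return loss <= max_loss


if __name__ == "__main__":
    # Session is up but nothing is forwarded: the blackhole case from this thread.
    print(link_usable(True, ProbeResult(sent=10, received=0)))
    # Session is up and probes get through.
    print(link_usable(True, ProbeResult(sent=10, received=10)))
```

How the result is acted on (withdrawing routes, shutting the session, alerting) is a local policy decision and is not shown here.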



Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Alexander Maassen
On Wed, 2013-02-06 at 07:57 -0500, Alex Rubenstein wrote:
  Would you rather your ISP not maintain their devices?  Are the 
  consequences so bad of a 30 minute outage that your business
  is severely impacted?
  
  - Jared
 
 You had me up until that line.
 
 That should be expanded a little ...
 
 First, I'd say, yes - many businesses would be severely impacted and may even 
 have consequential issues if they had to sustain a 30 minute outage. Suppose 
 for a moment they couldn't process money machines transactions for 30 
 minutes; or Netflix couldn't serve content for 30 minutes; or youporn was 
 offline for 30 minutes.
 
 The question should be more along the lines of, why aren't you multihomed in 
 a way that would make a 30 minute outage (which is inevitable) irrelevant to 
 you?
 
 
 

Multihomed, or simply redundantly equipped to switch over faster?





Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Florian Weimer
* Andrew Sullivan:

 My impression is mostly that people are left feeling uncomfortable
 by a massive upgrade of this sort with so little communication about
 why and so on.

That's a side effect of Juniper's notification policy.  Perhaps
someone should take them at their word ("Security patches and
advisories are freely available from our web site.") and post this
stuff publicly, so that everybody feels rightly scared and complains
less about these disruptions.



RE: Level3 worldwide emergency upgrade?

2013-02-06 Thread Brandt, Ralph
David. I am on an evening shift and am just now reading this thread.   

I was almost tempted to write an explanation that would have had
identical content with yours based simply on Level3 doing something and
keeping the information close.  

Responsible Vendors do not try to hide what is being done unless it is
an Op Sec issue and I have never seen Level3 act with less than
responsibility so it had to be Op Sec. 

When it is that, it is best if the remainder of us sit quietly on the
sidelines.

Ralph Brandt


-Original Message-
From: Siegel, David [mailto:david.sie...@level3.com] 
Sent: Wednesday, February 06, 2013 12:01 PM
To: 'Ray Wong'; nanog@nanog.org
Subject: RE: Level3 worldwide emergency upgrade?

Hi Ray,

This topic reminds me of yesterday's discussion in the conference around
getting some BCOPs drafted.  It would be useful to confirm my own view
of the BCOP around communicating security issues.  My understanding for
the best practice is to limit knowledge distribution of security related
problems both before and after the patches are deployed.  You limit
knowledge before the patch is deployed to prevent yourself from being
exploited, but you also limit knowledge afterwards in order to limit
potential damage to others (customers, competitors...the Internet at
large).  You also do not want to announce that you will be deploying a
security patch until you have a fix in hand and know when you will
deploy it (typically, next available maintenance window unless the cat
is out of the bag and danger is real and imminent).

As a service provider, you should stay on top of security alerts from
your vendors so that you can make your own decision about what action is
required.  I would not recommend relying on service provider maintenance
bulletins or public operations mailing lists for obtaining this type of
information.  There is some information that can cause more harm than
good if it is distributed in the wrong way and information relating to
security vulnerabilities definitely falls into that category.

Dave

-Original Message-
From: Ray Wong [mailto:r...@rayw.net] 
Sent: Wednesday, February 06, 2013 9:16 AM
To: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?



OK, having had that first cup of coffee, I can say perhaps the main
reason I was wondering is I've gotten used to Level3 always being on top
of things (and admittedly, rarely communicating). They've reached the
top by often being a black box of reliability, so it's (perhaps
unrealistically) surprising to see them caught by surprise. Anything
that pushes them into scramble mode causes me to lose a little sleep
anyway. The alternative to what they did seems likely for at least a few
providers who'll NOT manage to fix things in time, so I may well be
looking at longer outages from other providers, and need to issue
guidance to others on what to do if/when other links go down for periods
long enough that all the cost-bounding monitoring alarms start to scream
even louder.

I was also grumpy at myself for having not noticed advance
communication, which I still don't seem to have, though since I
outsourced my email to bigG, I've noticed I'm more likely to miss
things. Perhaps giving up maintaining that massive set of procmail rules
has cost me a bit more edge.

Related, of course, just because you design/run your network to tolerate
some issues doesn't mean you can also budget to be in support contract
as well. :) Knowing more about the exploit/fix might mean trying to find
a way to get free upgrades to some kit to prevent more localized attacks
to other types of gear, as well, though in this case it's all about
Juniper PR839412 then, so vendor specific, it seems?

There are probably more reasons to wish for more info, too. There's
still more of them (exploiters/attackers) than there are those of us
trying to keep things running smoothly and transparently, so anything
that smells of OMG new exploit found! also triggers my desire to share
information. The network bad guys share information far more quickly and
effectively than we do, it often seems.

-R





Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread joel jaeggli

On 2/6/13 4:41 PM, Brandt, Ralph wrote:

David. I am on an evening shift and am just now reading this thread.

I was almost tempted to write an explanation that would have had
identical content with yours based simply on Level3 doing something and
keeping the information close.

Responsible Vendors do not try to hide what is being done unless it is
an Op Sec issue and I have never seen Level3 act with less than
responsibility so it had to be Op Sec.

When it is that, it is best if the remainder of us sit quietly on the
sidelines.
To be clear, the existence of the PR has been known publicly and the 
software releases that address it have been available for a week now.


Everyone who has potential exposure should be addressing the issue, and 
soon.

Ralph Brandt


-Original Message-
From: Siegel, David [mailto:david.sie...@level3.com]
Sent: Wednesday, February 06, 2013 12:01 PM
To: 'Ray Wong'; nanog@nanog.org
Subject: RE: Level3 worldwide emergency upgrade?

Hi Ray,

This topic reminds me of yesterday's discussion in the conference around
getting some BCOPs drafted.  It would be useful to confirm my own view
of the BCOP around communicating security issues.  My understanding for
the best practice is to limit knowledge distribution of security related
problems both before and after the patches are deployed.  You limit
knowledge before the patch is deployed to prevent yourself from being
exploited, but you also limit knowledge afterwards in order to limit
potential damage to others (customers, competitors...the Internet at
large).  You also do not want to announce that you will be deploying a
security patch until you have a fix in hand and know when you will
deploy it (typically, next available maintenance window unless the cat
is out of the bag and danger is real and imminent).

As a service provider, you should stay on top of security alerts from
your vendors so that you can make your own decision about what action is
required.  I would not recommend relying on service provider maintenance
bulletins or public operations mailing lists for obtaining this type of
information.  There is some information that can cause more harm than
good if it is distributed in the wrong way and information relating to
security vulnerabilities definitely falls into that category.

Dave

-Original Message-
From: Ray Wong [mailto:r...@rayw.net]
Sent: Wednesday, February 06, 2013 9:16 AM
To: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?

OK, having had that first cup of coffee, I can say perhaps the main
reason I was wondering is I've gotten used to Level3 always being on top
of things (and admittedly, rarely communicating). They've reached the
top by often being a black box of reliability, so it's (perhaps
unrealistically) surprising to see them caught by surprise. Anything
that pushes them into scramble mode causes me to lose a little sleep
anyway. The alternative to what they did seems likely for at least a few
providers who'll NOT manage to fix things in time, so I may well be
looking at longer outages from other providers, and need to issue
guidance to others on what to do if/when other links go down for periods
long enough that all the cost-bounding monitoring alarms start to scream
even louder.

I was also grumpy at myself for having not noticed advance
communication, which I still don't seem to have, though since I
outsourced my email to bigG, I've noticed I'm more likely to miss
things. Perhaps giving up maintaining that massive set of procmail rules
has cost me a bit more edge.

Related, of course, just because you design/run your network to tolerate
some issues doesn't mean you can also budget to be in support contract
as well. :) Knowing more about the exploit/fix might mean trying to find
a way to get free upgrades to some kit to prevent more localized attacks
to other types of gear, as well, though in this case it's all about
Juniper PR839412 then, so vendor specific, it seems?

There are probably more reasons to wish for more info, too. There's
still more of them (exploiters/attackers) than there are those of us
trying to keep things running smoothly and transparently, so anything
that smells of OMG new exploit found! also triggers my desire to share
information. The network bad guys share information far more quickly and
effectively than we do, it often seems.

-R









Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread Brett Watson
Hell, we used to not have to bother notifying customers of anything, we just 
fixed the problem. Reminds me of a story I've probably shared in the past. 

1995, IETF in Dallas. The big ISP I worked for at the time got tripped up on 
a 24-day IS-IS timer bug (maybe all of them at the time did, I don't recall)  
where all adjacencies reset at once. That's like, entire network down. Working 
with our engineering team in the *terminal* lab mind you, and Ravi Chandra 
(then at Cisco) we reloaded the entire network of routers with new code from 
Cisco once they'd fixed the bug. I seem to remember this being my first 
exposure to Tony Li's infamous line, ... Confidence Level: boots in the lab.

Good times.

-b


On Feb 6, 2013, at 5:41 PM, Brandt, Ralph wrote:

 David. I am on an evening shift and am just now reading this thread.   
 
 I was almost tempted to write an explanation that would have had
 identical content with yours based simply on Level3 doing something and
 keeping the information close.  
 
 Responsible Vendors do not try to hide what is being done unless it is
 an Op Sec issue and I have never seen Level3 act with less than
 responsibility so it had to be Op Sec. 
 
 When it is that, it is best if the remainder of us sit quietly on the
 sidelines.
 
 Ralph Brandt
 
 
 -Original Message-
 From: Siegel, David [mailto:david.sie...@level3.com] 
 Sent: Wednesday, February 06, 2013 12:01 PM
 To: 'Ray Wong'; nanog@nanog.org
 Subject: RE: Level3 worldwide emergency upgrade?
 
 Hi Ray,
 
 This topic reminds me of yesterday's discussion in the conference around
 getting some BCOPs drafted.  It would be useful to confirm my own view
 of the BCOP around communicating security issues.  My understanding for
 the best practice is to limit knowledge distribution of security related
 problems both before and after the patches are deployed.  You limit
 knowledge before the patch is deployed to prevent yourself from being
 exploited, but you also limit knowledge afterwards in order to limit
 potential damage to others (customers, competitors...the Internet at
 large).  You also do not want to announce that you will be deploying a
 security patch until you have a fix in hand and know when you will
 deploy it (typically, next available maintenance window unless the cat
 is out of the bag and danger is real and imminent).
 
 As a service provider, you should stay on top of security alerts from
 your vendors so that you can make your own decision about what action is
 required.  I would not recommend relying on service provider maintenance
 bulletins or public operations mailing lists for obtaining this type of
 information.  There is some information that can cause more harm than
 good if it is distributed in the wrong way and information relating to
 security vulnerabilities definitely falls into that category.
 
 Dave
 
 -Original Message-
 From: Ray Wong [mailto:r...@rayw.net] 
 Sent: Wednesday, February 06, 2013 9:16 AM
 To: nanog@nanog.org
 Subject: Re: Level3 worldwide emergency upgrade?
 
 
 
 OK, having had that first cup of coffee, I can say perhaps the main
 reason I was wondering is I've gotten used to Level3 always being on top
 of things (and admittedly, rarely communicating). They've reached the
 top by often being a black box of reliability, so it's (perhaps
 unrealistically) surprising to see them caught by surprise. Anything
 that pushes them into scramble mode causes me to lose a little sleep
 anyway. The alternative to what they did seems likely for at least a few
 providers who'll NOT manage to fix things in time, so I may well be
 looking at longer outages from other providers, and need to issue
 guidance to others on what to do if/when other links go down for periods
 long enough that all the cost-bounding monitoring alarms start to scream
 even louder.
 
 I was also grumpy at myself for having not noticed advance
 communication, which I still don't seem to have, though since I
 outsourced my email to bigG, I've noticed I'm more likely to miss
 things. Perhaps giving up maintaining that massive set of procmail rules
 has cost me a bit more edge.
 
 Related, of course, just because you design/run your network to tolerate
 some issues doesn't mean you can also budget to be in support contract
 as well. :) Knowing more about the exploit/fix might mean trying to find
 a way to get free upgrades to some kit to prevent more localized attacks
 to other types of gear, as well, though in this case it's all about
 Juniper PR839412 then, so vendor specific, it seems?
 
 There are probably more reasons to wish for more info, too. There's
 still more of them (exploiters/attackers) than there are those of us
 trying to keep things running smoothly and transparently, so anything
 that smells of OMG new exploit found! also triggers my desire to share
 information. The network bad guys share information far more quickly and
 effectively than we do, it often seems.
 
 -R
 
 
 




Re: Level3 worldwide emergency upgrade?

2013-02-06 Thread bmanning
 
ah - those were the days of glory... :)



On Wed, Feb 06, 2013 at 06:06:39PM -0700, Brett Watson wrote:
 Hell, we used to not have to bother notifying customers of anything, we just 
 fixed the problem. Reminds me of a story I've probably shared in the past. 
 
 1995, IETF in Dallas. The big ISP I worked for at the time got tripped up 
 on a 24-day IS-IS timer bug (maybe all of them at the time did, I don't 
 recall)  where all adjacencies reset at once. That's like, entire network 
 down. Working with our engineering team in the *terminal* lab mind you, and 
 Ravi Chandra (then at Cisco) we reloaded the entire network of routers with 
 new code from Cisco once they'd fixed the bug. I seem to remember this being 
 my first exposure to Tony Li's infamous line, ... Confidence Level: boots in 
 the lab.
 
 Good times.
 
 -b
 
 
 On Feb 6, 2013, at 5:41 PM, Brandt, Ralph wrote:
 
  David. I am on an evening shift and am just now reading this thread.   
  
  I was almost tempted to write an explanation that would have had
  identical content with yours based simply on Level3 doing something and
  keeping the information close.  
  
  Responsible Vendors do not try to hide what is being done unless it is
  an Op Sec issue and I have never seen Level3 act with less than
  responsibility so it had to be Op Sec. 
  
  When it is that, it is best if the remainder of us sit quietly on the
  sidelines.
  
  Ralph Brandt
  
  
  -Original Message-
  From: Siegel, David [mailto:david.sie...@level3.com] 
  Sent: Wednesday, February 06, 2013 12:01 PM
  To: 'Ray Wong'; nanog@nanog.org
  Subject: RE: Level3 worldwide emergency upgrade?
  
  Hi Ray,
  
  This topic reminds me of yesterday's discussion in the conference around
  getting some BCOPs drafted.  It would be useful to confirm my own view
  of the BCOP around communicating security issues.  My understanding for
  the best practice is to limit knowledge distribution of security related
  problems both before and after the patches are deployed.  You limit
  knowledge before the patch is deployed to prevent yourself from being
  exploited, but you also limit knowledge afterwards in order to limit
  potential damage to others (customers, competitors...the Internet at
  large).  You also do not want to announce that you will be deploying a
  security patch until you have a fix in hand and know when you will
  deploy it (typically, next available maintenance window unless the cat
  is out of the bag and danger is real and imminent).
  
  As a service provider, you should stay on top of security alerts from
  your vendors so that you can make your own decision about what action is
  required.  I would not recommend relying on service provider maintenance
  bulletins or public operations mailing lists for obtaining this type of
  information.  There is some information that can cause more harm than
  good if it is distributed in the wrong way and information relating to
  security vulnerabilities definitely falls into that category.
  
  Dave
  
  -Original Message-
  From: Ray Wong [mailto:r...@rayw.net] 
  Sent: Wednesday, February 06, 2013 9:16 AM
  To: nanog@nanog.org
  Subject: Re: Level3 worldwide emergency upgrade?
  
  
  
  OK, having had that first cup of coffee, I can say perhaps the main
  reason I was wondering is I've gotten used to Level3 always being on top
  of things (and admittedly, rarely communicating). They've reached the
  top by often being a black box of reliability, so it's (perhaps
  unrealistically) surprising to see them caught by surprise. Anything
  that pushes them into scramble mode causes me to lose a little sleep
  anyway. The alternative to what they did seems likely for at least a few
  providers who'll NOT manage to fix things in time, so I may well be
  looking at longer outages from other providers, and need to issue
  guidance to others on what to do if/when other links go down for periods
  long enough that all the cost-bounding monitoring alarms start to scream
  even louder.
  
  I was also grumpy at myself for having not noticed advance
  communication, which I still don't seem to have, though since I
  outsourced my email to bigG, I've noticed I'm more likely to miss
  things. Perhaps giving up maintaining that massive set of procmail rules
  has cost me a bit more edge.
  
  Related, of course, just because you design/run your network to tolerate
  some issues doesn't mean you can also budget for a support contract.
  :) Knowing more about the exploit/fix might mean trying to find
  a way to get free upgrades to some kit to prevent more localized attacks
  to other types of gear, as well, though in this case it's all about
  Juniper PR839412 then, so vendor specific, it seems?
  
  There are probably more reasons to wish for more info, too. There's
  still more of them (exploiters/attackers) than there are those of us
  trying to keep things running smoothly and transparently, so anything