RE: Converged Networks Threat (Was: Level3 Outage)
> From where i'm sitting, I see a number of potentially dangerous trends
> that could result in some quite catastrophic failures of networks. No,
> i'm not predicting that the internet will end in 8^H7 days or anything
> like that. I think the Level3 outage as seen from the outside is a
> clear case that single providers will continue to have their own
> network failures for some time to come. (I just hope daily it's not my
> employer's network ;-) )

I don't agree with this 'the sky is falling' perspective, and we've seen these discussions over and over. Survivability was and continues to be a design goal of anything we do here. It was from the first days and it's true to this day. When you implement a critical service, you need to do due diligence on whether the path chosen meets the needs.

> Now the question of Emergency Services is being posed here but also in
> parallel by a number of other people at the FCC. We've seen the E911
> recommendation come out regarding VoIP calls. How long until a simple
> power failure results in the inability to place calls?

There are specific requirements (read: gov't regulations) to implement E911 with a number of redundancy options, typically calling for things like triple-path redundancy. While I have worked on E911 infrastructure in the past, I'm not aware of an exhaustive analysis for E911 over IP, and I don't see a reason off the top of my head why you can't do the same thing on IP. Sure, it requires careful planning. But what critical service doesn't? What are you asking for? More gov't regulation?

> While my friends that are local VFD do still have the traditional
> pager service with towers, etc... how long until the T1's that are
> used for dial-in or speaking to the towers are moved to some sort of
> IP based system? The global economy seems to be going this direction
> with varying degrees of caution.
>
> I'm concerned, but not worried.. the network will survive..

What's your point then?
:) There's no panacea for poor implementation. That's why knowledge and experience are important in network design, and their importance is directly linked to the defined critical need of the service implemented. Sorry, just angst for me here. No visible life.

Thanks
Christian

* "The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential, proprietary, and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from all computers."
RE: Converged Networks Threat (Was: Level3 Outage)
> If events are not properly triggered back upstream (ie: adjacencies
> stay up, bgp remains fairly stable) and you end up dumping a lot of
> traffic on the floor, it's sometimes a bit more difficult to diagnose
> than loss of light on a physical path.
>
> On the sunny side, I see this improving over time. Software bugs will
> be squashed. Poorly designed networks will be reconfigured to better
> handle these situations.

But this happens everywhere, every day, regardless of the underlying technology.
Re: Converged Networks Threat (Was: Level3 Outage)
On Thu, Feb 26, 2004 at 11:28:09AM +, [EMAIL PROTECTED] wrote:

> > Wouldn't it be great if routers had the equivalent of 'User mode
> > Linux', each process handling a service, isolated and protected
> > from each other. The physical router would be nothing more than a
> > generic kernel handling resource allocation. Each virtual router
> > would have access to x amount of resources and will either halt,
> > sleep, crash when it exhausts those resources for a given time
> > slice.
>
> This is possible today. Build your own routers using the right
> microkernel, OSKIT and the Click Modular Router software and you can
> have this. When we restrict ourselves only to router packages from
> major vendors then we are doomed to using outdated technology at
> inflated prices.

Tell you what Michael, build me some of those, have it pass my labs and I'll give you millions in business. Deal?

Let me draw it out here:
Step 1: Buy box
Step 2: Install Click Modular Router Software
Step 3: Profit

/vijay
Re: Converged Networks Threat (Was: Level3 Outage)
>> This is possible today. Build your own routers using the right
>> microkernel, OSKIT and the Click Modular Router software and you can
>> have this. When we restrict ourselves only to router packages from
>> major vendors then we are doomed to using outdated technology at
>> inflated prices.

> Tell you what Michael, build me some of those, have it pass my labs
> and I'll give you millions in business. Deal?

The problem with your lab is that you have too many millions to give. In order to win those millions, people would have to prove that their box is at least as good as C and J in the core of the largest Internet backbones in the world. That is an awfully big hurdle, and attempting it costs so much money that anyone would be a fool to try it unless they already had millions in the bank.

History shows that if you can build a mousetrap that is technically better than anything on the market, your best route for success is to sell it into niche markets where the customer appreciates the technical advances that you can provide and is willing to pay for those technical advances. I don't think that describes the larger Internet provider networks.

--Michael Dillon
Re: Converged Networks Threat (Was: Level3 Outage)
On Thu, 26 Feb 2004 14:48:55 GMT, [EMAIL PROTECTED] said:

> History shows that if you can build a mousetrap that is technically
> better than anything on the market, your best route for success is
> to sell it into niche markets where the customer appreciates the
> technical advances that you can provide and is willing to pay for
> those technical advances. I don't think that describes the larger
> Internet provider networks.

So your target market is those mom&pop ISPs that *dont* buy their Ciscos from eBay? :)
Re: Converged Networks Threat (Was: Level3 Outage)
On Thu, Feb 26, 2004 at 02:48:55PM +, [EMAIL PROTECTED] wrote:

> >> This is possible today. Build your own routers using
> >> the right microkernel, OSKIT and the Click Modular Router
> >> software and you can have this. When we restrict ourselves
> >> only to router packages from major vendors then we are
> >> doomed to using outdated technology at inflated prices.
>
> > Tell you what Michael, build me some of those, have it pass my labs
> > and I'll give you millions in business. Deal?
>
> The problem with your lab is that you have too many millions
> to give. In order to win those millions people would have to prove
> that their box is at least as good as C and J in the core of the
> largest Internet backbones in the world. That is an awfully big

Let me try this one more time. From the top. You said:

begin quote
software and you can have this. When we restrict ourselves
only to router packages from major vendors then we are
doomed to using outdated technology at inflated prices.
end quote

So now we have:

> to give. In order to win those millions people would have to prove
> that their box is at least as good as C and J in the core of the

So the outdated technology at inflated prices is too high a hurdle to pass for the magic Click Modular Software router, the one that is allegedly NOT antiquated and is not using outdated technology? But somehow it still cannot function in a core?

> History shows that if you can build a mousetrap that is technically
> better than anything on the market, your best route for success is

Thought it went "build a better mousetrap and the world will beat a path to your door", etc etc etc.

> to sell it into niche markets where the customer appreciates the
> technical advances that you can provide and is willing to pay for
> those technical advances. I don't think that describes the larger
> Internet provider networks.

How would you know this? Historically, the cutting edge technology has always gone into the large cores first, because they are the ones pushing the bleeding edge in terms of capacity, power, and routing.

/vijay
Re: Converged Networks Threat (Was: Level3 Outage)
--- vijay gill <[EMAIL PROTECTED]> wrote:

> How would you know this? Historically, the cutting edge technology
> has always gone into the large cores first because they are the ones
> pushing the bleeding edge in terms of capacity, power, and routing.
>
> /vijay

I'm not sure that I'd agree with that statement: most of the large providers with whom I'm familiar tend to be relatively conservative with regard to new technology deployments, for a couple of reasons:

1) their backbones currently "work" - changing them into something which may or may not "work better" is a non-trivial operation, and risks the network.

2) they have an installed base of customers who are living with existing functionality - this goes back to reason 1 - unless there is money to be made, nobody wants to deploy anything.

3) It makes more sense to deploy a new box at the edge, and eventually permit it to migrate to the core after it's been thoroughly proven - the IP model has features living on the edges of the network, while capacity lives in the core. If you have 3 high-cap boxes in the core, it's probably easier to add a fourth than it is to rip the three out and replace them with two higher-cap boxes.

4) existing management infrastructure permits the management of existing boxes - it's easier to deploy an all-new network than it is to upgrade from one technology/platform to another.

-David Barak
-Fully RFC 1925 Compliant

__
Do you Yahoo!?
Get better spam protection with Yahoo! Mail.
http://antispam.yahoo.com/tools
Re: Converged Networks Threat (Was: Level3 Outage)
On Thu, Feb 26, 2004 at 10:05:03AM -0800, David Barak wrote:

> I'm not sure that I'd agree with that statement: most of the large
> providers with whom I'm familiar tend to be relatively conservative
> with regard to new technology deployments, for a couple of reasons:
>
> 1) their backbones currently "work" - changing them into something
> which may or may not "work better" is a non-trivial operation, and
> risks the network.

This is perhaps current. Check back to see large deployments:

GSR - Sprint/UUNET
GRF - UUNET
Juniper - UUNET/CWUSA

In all of the above cases, those were the large ISPs that forced development of the boxes. Most of the smaller "cutting edge" networks are still running 7513s. The GSR was invented because the 7513s were running out of PPS. CEF was designed to support offloading the RP.

> 2) they have an installed base of customers who are living with
> existing functionality - this goes back to reason 1 - unless there
> is money to be made, nobody wants to deploy anything.
>
> 3) It makes more sense to deploy a new box at the edge, and
> eventually permit it to migrate to the core after it's been
> thoroughly proven - the IP model has features living on the edges of
> the network, while capacity lives in the core. If you have 3
> high-cap boxes in the core, it's probably easier to add a fourth
> than it is to rip the three out and replace them with two higher-cap
> boxes.

The core has expanded to the edge, not the other way around. The aggregate backplane bandwidth requirements tend to drive core box evolution first, while the edge box normally has to deal with high touch features and port multiplexing. These of course are becoming more and more specialized over time.

> 4) existing management infrastructure permits the management of
> existing boxes - it's easier to deploy an all-new network than it is
> to upgrade from one technology/platform to another.

Only if you are willing to write off your entire capital investment. No one is willing to do that today.

/vijay
Re: Converged Networks Threat (Was: Level3 Outage)
>> 1) their backbones currently "work" - changing them
>> into something which may or may not "work better" is a
>> non-trivial operation, and risks the network.

i would disagree. their backbones tend to reach scaling problems, hence the need for bleeding/leading edge technologies. that's been my experience in three past-large networks.

> This is perhaps current. Check back to see large deployments
> GSR - sprint/UUNET
> GRF - uunet
> Juniper - UUNET/CWUSA

indeed, and going back even further:

is-is, 7000 and the original SSE - mci/sprint
vip and netflow - genuity (the original)/probably many others

-b
Re: Converged Networks Threat (Was: Level3 Outage)
vijay gill wrote:

> CEF was designed to support offloading the RP.

Not really. There existed distributed fast switching before dCEF came along. It might still exist. CEF was developed to address the issue of route cache insertion and purging. The unnecessarily painful 60 second interval new-destination stall was widely documented before CEF got widespread use. The "fast switching" approach was also particularly painful when DDoS attacks occurred.

Pete
Re: Converged Networks Threat (Was: Level3 Outage)
On Thu, Feb 26, 2004 at 09:32:07PM +0200, Petri Helenius wrote:

> along. It might still exist. CEF was developed to address the issue
> of route cache insertion and purging. The unnecessarily painful 60
> second interval new destination stall was widely documented before
> CEF got widespread use. The "fast switching" approach was also
> particularly painful when DDoS attacks occurred.

Thanks for the correction. I clearly was not paying enough attention when composing.

/vijay
Re: Converged Networks Threat (Was: Level3 Outage)
--- vijay gill <[EMAIL PROTECTED]> wrote:

> In all of the above cases, those were the large isps that forced
> development of the boxes. Most of the smaller "cutting edge"
> networks are still running 7513s.

Hmm - what I was getting at was that the big ISPs for the most part still have a whole lot of 7513s running around (figuratively), while if I were building a new network from the ground up, I'd be unlikely to use them.

> GSR was invented because the 7513s were running out of PPS.
> CEF was designed to support offloading the RP.
>
> The core has expanded to the edge, not the other way around.
> The aggregate backplane bandwidth requirements tend to drive core
> box evolution first while the edge box normally has to deal with
> high touch features and port multiplexing. These of course are
> becoming more and more specialized over time.

I agree, from a capacity perspective: the GSR began life as a core router because it supported big pipes. It's only recently that it's had anywhere near the number of features which the 7500 has (and there are still a whole lot of specialized features which it doesn't have). From a feature deployment approach, new boxes come in at the edge (think of the deployment of the 7500 itself: it was an IP front-end for ATM networks).

> Only if you are willing to write off your entire capital investment.
> No one is willing to do that today.

That is EXACTLY my point: as companies are unwilling to write off an investment, they MUST keep supporting the old stuff. Once they're supporting the old stuff of vendor X, that provides an incentive to get more new stuff from vendor X, if the management platform is the same. For instance, if I've got a Marconi ATM network, I'm unlikely to buy new Cisco ATM gear, unless I'm either building a parallel network, or am looking for an edge front-end to offer new features. However, if I were building a new ATM network today, I would do a bake-off between the vendors and see which one met my needs best.

-David Barak
-fully RFC 1925 compliant-
Re: Converged Networks Threat (Was: Level3 Outage)
> History shows that if you can build a mousetrap that is technically
> better than anything on the market, your best route for success is
> to sell it into niche markets where the customer appreciates the
> technical advances that you can provide and is willing to pay for
> those technical advances. I don't think that describes the larger
> Internet provider networks.

and this has been so well shown by the blazing successes of bay networks, avici, what-its-name that burst into flames in everyone's labs, ...

watch out for flying pigs

randy
Re: Converged Networks Threat (Was: Level3 Outage)
> and this has been so well shown by the blazing successes of bay
> networks, avici, what-its-name that burst into flames in everyone's
> labs, ...

That's a very good point. Part of building a router that works (at least learning from J's example) is hiring away the most important talent from your competition. Though it could also be said that the companies that hired that same talent away from J have not met the same success, yet.

Deepak
Re: Converged Networks Threat (Was: Level3 Outage)
> Wouldn't it be great if routers had the equivalent of 'User mode
> Linux', each process handling a service, isolated and protected from
> each other. The physical router would be nothing more than a generic
> kernel handling resource allocation. Each virtual router would have
> access to x amount of resources and will either halt, sleep, crash
> when it exhausts those resources for a given time slice.

This is possible today. Build your own routers using the right microkernel, OSKit and the Click Modular Router software and you can have this. When we restrict ourselves only to router packages from major vendors then we are doomed to using outdated technology at inflated prices.

--Michael Dillon
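For what it's worth, the Click approach really is just a handful of declarations: a router is a graph of elements wired together in a small configuration language. A minimal sketch along those lines, bridging packets between two interfaces; the element names (FromDevice, Counter, Queue, ToDevice) are standard Click elements, but the interface names and queue size are illustrative assumptions, not a tested production config:

```
// Minimal Click sketch: forward packets between two NICs,
// counting them on the way through. Each element is an isolated
// packet-processing stage, composed by the '->' connections.
FromDevice(eth0) -> Counter -> Queue(1024) -> ToDevice(eth1);
FromDevice(eth1) -> Counter -> Queue(1024) -> ToDevice(eth0);
```

Whether a graph like this survives a vendor lab's torture tests is, of course, exactly the argument in this thread.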
Re: Converged Networks Threat (Was: Level3 Outage)
Convergence, and our "lust" to throw TDM/ATM infrastructure in the garbage, is an area very near and dear to my heart. I apologize if I am being a bit redundant here... but from our perspective, we are an ISP that is under a lot of pressure to deploy a VoIP solution. I just don't think we can... It's just not reliable enough yet. Period.

In a TDM environment the end node switch is incredibly reliable. I can't ever remember in my 30 years on this earth when the end node my telephone was connected to was EVER down, not once, not EVER. A circuit switched environment gives us inherent admission control (if there are not enough tandem/interswitch trunks we just get a fast busy). This allows them to guarantee end-to-end quality. The one problem is that if any of the tandems along the path my call is connected through get nuked off the face of the earth, I am completely off the air.

In an IP (packet based) environment, theoretically routing protocols can reroute my call while it is in progress if a catastrophic event occurs, like the entire NE losing power. The inherent problem with IP is that it has no admission control, and that its fundamentally resilient design was to make sure that the "core" of the network knew nothing about the flows within, so that it _could_ survive a failure. This design goal is the problem when trying to guarantee end-to-end quality of service. Without admission control, we can pack it full so that nothing works. Variable length frames mean that we have little idea of what is coming down the pipe next. This can all be solved by massively overbuilding our network. Other than the occasional DoS against an area of the network, outages caused by overuse are relatively rare.

The big problem is the end node hardware in IP networks. Routers crash ALL the time; it is actually a joke. Yes, theoretically a user could have 3 separate connections to the Internet and use their VoIP phone and be happy, but that is not the case. They buy Internet service from one place, and that is aggregated in the same building as that TDM end node in the voice world (usually). That aggregation (access) layer is the single biggest vulnerability in both worlds. It just does not fail in the TDM world like it does in the IP world. We need to find ways to make that work better in the IP world so it can be as reliable as the TDM world. I realize that we (the public) are asking IP hardware vendors for new features far faster than they can be released reliably... but surely we can find ways to fail it over more effectively than it does now...

Dan.
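The inherent admission control of circuit switching described above can actually be quantified: the classic Erlang B formula gives the probability that a new call hits the "fast busy" for a given offered load and trunk count, which is exactly the knob the TDM world engineers against. A minimal sketch in Python; the traffic and trunk numbers below are illustrative assumptions, not figures from this thread:

```python
def erlang_b(traffic_erlangs: float, trunks: int) -> float:
    """Call-blocking probability for an offered load (in erlangs) on a
    trunk group, via the standard iterative Erlang B recurrence."""
    b = 1.0  # with zero trunks, every call is blocked
    for k in range(1, trunks + 1):
        b = (traffic_erlangs * b) / (k + traffic_erlangs * b)
    return b

# Illustrative (assumed) numbers: 10 erlangs of offered voice load.
# Adding trunks drives the fast-busy probability down sharply.
for trunks in (10, 15, 20):
    print(f"{trunks} trunks -> blocking {erlang_b(10.0, trunks):.4f}")
```

An IP network with no admission control has no equivalent knob: extra offered load degrades everyone instead of busying out the marginal call.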
Re: Converged Networks Threat (Was: Level3 Outage)
David Meyer wrote:

> No doubt. However, the problem is: What constitutes "unnecessary
> system complexity"? A designed system's robustness comes in part
> from its complexity. So it's not that complexity is inherently bad;
> rather, it is just that you wind up with extreme sensitivity to
> outlying events, which is exhibited by catastrophic cascading
> failures if you push a system's complexity past some point; these
> are the so-called "robust yet fragile" systems (think NE power
> outage).

I think you hit the nail on the head. I view complexity as a diminishing returns play. When you increase complexity, the increase benefits a decreasing percentage of the users.

A way to manage complexity is splitting large systems into smaller pieces and trying to make the pieces independent enough to survive a failure of a neighboring piece. This approach exists at least in the marketing materials of many telecommunications equipment vendors. The question then becomes, "what good is a backbone router without a BGP process?" So far I haven't seen a router with a disposable entity on an interface or peer basis, so that if the BGP speaker to 10.1.1.1 crashes, the system would still be able to maintain its relationship to 10.2.2.2.

Obviously the point of single device availability becomes moot if we can figure out a way to route/switch around the failed device quickly enough. Today we don't even have a generic IP layer liveness protocol, so by default packets will be blackholed for a definite duration until a routing protocol starts to miss its hello packets. (I'm aware of work towards this goal.)

In summary, I feel systems should be designed to run independently in all failure modes. If you lose 1-n neighbors, the system should be self-sufficient, figuring out the situation near-immediately and continuing to work while negotiating with neighbors about the overall picture.

Pete
RE: Converged Networks Threat (Was: Level3 Outage)
On Wed, 2004-02-25 at 20:16, Bora Akyol wrote:

> This train of thought works well for only accidental failures;
> unfortunately, if you have an adversary that is bent on disturbing
> communications and damaging the critical infrastructure of a
> country, physical fate sharing makes things less robust than they
> need to be. By the way, no disagreement from me on any of the points
> you make. Keeping it simple and robust is definitely a good first
> step. Having diverse paths in the fiber infrastructure is also
> necessary.

I don't think fate sharing prevents us from having diverse paths, since this is where redundancy comes in. Even if all services run over the same fibre paths, there isn't any problem as long as there's a sufficient number of alternative paths in case any of the paths goes down.

Cheers,
--
Erik Haagsman
Network Architect
We Dare BV
tel: +31.10.7507008
fax: +31.10.7507005
http://www.we-dare.nl
Re: Converged Networks Threat (Was: Level3 Outage)
> Yesterday we witnessed a large scale failure that has yet to be
> attributed to configuration, software, or hardware; however one need
> look no further than the 168.0.0.0/6 thread, or the GBLX customer
> who leaked several tens of thousands of their peers' routes to GBLX
> shortly

This should be rewritten "Or GBLX, who LET one of their customers leak several tens of thousands of their peers' routes...". I'm sorry, but a network should be able to protect itself from its users and customers. BGP filters are not that hard to figure out, and per-peer prefix limits should be part of every config. Don't trust the guy at the other end of the pipe to do the right thing.

-Matt
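The two defenses named above can be sketched in a few lines of Cisco IOS-style configuration: an inbound prefix-list that permits only the customer's registered space, plus a maximum-prefix backstop that tears the session down if the filter is ever bypassed. The neighbor address, AS numbers, prefixes, and limits below are illustrative assumptions, not values from this incident:

```
! Permit only the customer's registered block (and reasonable
! deaggregates of it); deny everything else.
ip prefix-list CUST1-IN seq 5 permit 192.0.2.0/24 le 28
ip prefix-list CUST1-IN seq 100 deny 0.0.0.0/0 le 32
!
router bgp 64500
 neighbor 198.51.100.1 remote-as 64501
 neighbor 198.51.100.1 prefix-list CUST1-IN in
 ! Belt and suspenders: drop the session at 100 prefixes,
 ! warning at 90% of that.
 neighbor 198.51.100.1 maximum-prefix 100 90
```

Either mechanism alone would have contained a full-table leak; together they also catch the case where the filter is fat-fingered.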
Re: Converged Networks Threat (Was: Level3 Outage)
Petri,

>> I think it has been proven a few times that physical fate sharing
>> is only a minor contributor to the total connectivity availability,
>> while system complexity, mostly controlled by software written and
>> operated by imperfect humans, contributes a major share to
>> end-to-end availability.

Yes, and at the very least that would seem to match our intuition and experience.

>> From this, it can be deduced that reducing unnecessary system
>> complexity and shortening the strings of pearls that make up the
>> system contribute to better availability and resiliency of the
>> system. Diversity works both ways in this equation. It lessens the
>> probability of the same failure hitting the majority of your boxes,
>> but at the same time increases the knowledge needed to understand
>> and maintain the whole system.

No doubt. However, the problem is: What constitutes "unnecessary system complexity"? A designed system's robustness comes in part from its complexity. So it's not that complexity is inherently bad; rather, it is just that you wind up with extreme sensitivity to outlying events, which is exhibited by catastrophic cascading failures if you push a system's complexity past some point; these are the so-called "robust yet fragile" systems (think NE power outage).

BTW, the extreme-sensitivity-to-outlying-events/catastrophic-cascading-failures property is a signature of a class of dynamic systems of which we believe the Internet is an example; unfortunately, the machinery we currently have (in dynamical systems theory) isn't yet mature enough to provide us with engineering rules.

>> I would vote for the KISS principle if in doubt.

Truly. See RFC 3439 and/or http://www.1-4-5.net/~dmm/complexity_and_the_internet. I also said a few words about this topic at NANOG 26, where we had a panel on this topic (my slides are at http://www.maoz.com/~dmm/NANOG26/complexity_panel).

Dave
RE: Converged Networks Threat (Was: Level3 Outage)
> I think it has been proven a few times that physical fate sharing is
> only a minor contributor to the total connectivity availability
> while system complexity mostly controlled by software written and
> operated by imperfect humans contribute a major share to end-to-end
> availability.
>
> From this, it can be deduced that reducing unnecessary system
> complexity and shortening the strings of pearls that make up the
> system contribute to better availability and resiliency of the
> system. Diversity works both ways in this equation. It lessens the
> probability of the same failure hitting the majority of your boxes
> but at the same time increases the knowledge needed to understand
> and maintain the whole system.
>
> I would vote for the KISS principle if in doubt.

Hi Pete,

This train of thought works well for only accidental failures; unfortunately, if you have an adversary that is bent on disturbing communications and damaging the critical infrastructure of a country, physical fate sharing makes things less robust than they need to be. By the way, no disagreement from me on any of the points you make. Keeping it simple and robust is definitely a good first step. Having diverse paths in the fiber infrastructure is also necessary.

Regards,
Bora
Re: Converged Networks Threat (Was: Level3 Outage)
On Wed, 2004-02-25 at 13:34, David Meyer wrote:

> Is it that sharing fate in the switching fabric (as opposed to say,
> in the transport fabric, or even conduit) reduces the resiliency of
> a given service (in this case FR/ATM/TDM), and as such poses the
> "danger" you describe?

Our vendors will tell us that the IP routing fabrics of today are indeed quite reliable and resistant to failure, and they may be right when it comes to hardware MTBF. However, the IP network relies a great deal more on shared/inter-domain, real-time configuration (BGP) than do any traditional telecommunications networks utilizing the tried and true technologies referenced above.

Yesterday we witnessed a large scale failure that has yet to be attributed to configuration, software, or hardware; however one need look no further than the 168.0.0.0/6 thread, or the GBLX customer who leaked several tens of thousands of their peers' routes to GBLX shortly before the Level(3) event, to show that configuration-induced failures in the Internet reach much further than in traditional TDM or single-vendor PVC networks. The single point of failure we all share is our reliance on a correct BGP table, populated by our peers and transit providers, and kept free of errors by those same operators.

--
JSW
Re: Converged Networks Threat (Was: Level3 Outage)
[EMAIL PROTECTED] (Jared Mauch) writes:

> ...
> I keep hearing of Frame-Relay and ATM signaling that is going to
> happen in large providers' MPLS cores. That's right, your "safe" TDM
> based services will be transported over someone's IP backbone first.

One of my DS3/DS1 vendors recently told me of a plan to use MPLS for part of the route inside their switching center. I said "not with my circuits you won't". Once they understood that I was willing to take my business elsewhere, or simply do without, they decided that an M13 was worth having after all. My advice is: walk softly, but carry a big stick.

When we all say "everything over IP", that means teaching more devices how to speak 802.11 or other packet-based access protocols, rather than giving them ATM or F/R or dialup modem circuitry. It does *not* mean simulating an ISO-L1 or ISO-L2 "circuit" using an ISO-L3 "network". (Ick.)

--
Paul Vixie
Re: Converged Networks Threat (Was: Level3 Outage)
Jared Mauch wrote:

> On the sunny side, I see this improving over time. Software bugs
> will be squashed. Poorly designed networks will be reconfigured to
> better handle these situations.

The trend running against these points is the added features and complexity in the software due to market requirements. So while the box you got two years ago might have fewer bugs today, there are more attractive new devices with new bugs in both the old and the new features. People seem to be quite convinced that if you put more features into a box, people will pay more for it.

On your second point, it seems that most network protocols are converging towards port TCP/80. So unless network performance and availability degrade really badly, most users are indifferent; the 1st level helpdesk at their provider tells them that "at times the internet might be slow", and they usually are quite happy and understanding with that answer because they don't know that it could be better. So outside the Fortune 500 and some clueful individuals, where is the market for a non-poorly-designed, bug-free "Internet"?

Pete
RE: Converged Networks Threat (Was: Level3 Outage)
From Jared: > I keep hearing of Frame-Relay and ATM signaling that is > going to happen in large providers' MPLS cores. That's right, > your "safe" TDM based services will be transported over > someone's IP backbone first. > This means if they don't protect their IP network, the TDM > services could fail. These types of CES services are not > just limited to Frame and ATM. > (Did anyone with frame/atm/vpn services from Level3 > experience the same outage?)

We use Level3 for IP transit and transport (both DS-3 and Ethernet over MPLS (via Martini)) all over the country. As with everyone else, we saw the problems with the transit traffic out of SJC and ATL. However, our transport services were not affected at all by the problems. In fact, I just ended up sending my Level3-SJC bound traffic to LAX via Level3, which was going through the same equipment as the transit traffic that was having problems.

From Pete: > From this, it can be deduced that reducing unnecessary > system complexity and shortening the strings of pearls that > make up the system contribute to better availability and > resiliency of the system. Diversity works both ways in this > equation. It lessens the probability of the same failure hitting > the majority of your boxes but at the same time increases the > knowledge needed to understand and maintain the whole system. > > I would vote for the KISS principle if in doubt.

I agree. Granted, the string of pearls is always going to be pretty long, but there is definitely a trend from what I have seen with customers to make the string longer than it needs to be. -Sean

Sean P. Crandall VP Engineering Operations MegaPath Networks Inc. 6691 Owens Drive Pleasanton, CA 94588 (925) 201-2530 (office) (925) 201-2550 (fax)
Re: Converged Networks Threat (Was: Level3 Outage)
Is it that sharing fate in the switching fabric (as opposed to say, in the transport fabric, or even conduit) reduces the resiliency of a given service (in this case FR/ATM/TDM), and as such poses the "danger" you describe?

Sharing fate in the physical layer (multiple fibers in the same conduit) or transport layer (multiple services on the same SONET) has clear and well defined resource limits. A GigE running down a piece of fiber will NEVER jump over to the ATM network fiber and wipe it out. Same goes for SONET. An STS1 is an STS1 and will never eat up an OC-48 no matter how much traffic. Clear, well defined resource requirements with well defined protection between resources.

Shared fate in the switching fabric won't be as stable until routers (the switching fabric) can allocate and manage resources in a clear and defined way. If the resources are being overcommitted, the fabric must be able to handle the full burden of resource requests while still managing to provide appropriate resource limits to services. QoS plays a part in managing the resources of a given link, but what manages the resources a service can consume in the fabric itself (CPU, memory, bandwidth)?

With proper traffic engineering you can build/overbuild the network to handle 'normal' traffic with a great deal of reliability. The switch fabric and/or network itself must be able to protect itself from the abnormal: limiting memory/CPU consumption of a flapping BGP peer so you still have enough resources to handle the AToM traffic, which is given a higher priority. Let the BGP peers fail, let the Internet traffic drop, to save the high priority traffic and the MPLS glue traffic and keep the core operational.

Wouldn't it be great if routers had the equivalent of 'User-mode Linux', each process handling a service, isolated and protected from the others? The physical router would be nothing more than a generic kernel handling resource allocation.
Each virtual router would have access to x amount of resources and would either halt, sleep, or crash when it exhausts those resources for a given time slice. I don't know of any method in the current router offerings to limit a VRF to x% of CPU and y% of memory. -Matt

Is this an accurate characterization of your point? If so, why should sharing fate in the switching fabric necessarily reduce the resiliency of those services that share that fabric (i.e., why should this be so)? I have some ideas, but I'm interested in what ideas other folks have. Thanks, Dave
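[Editorial aside: Matt's "User-mode Linux for routers" idea above can be sketched in a few lines. This is a hypothetical toy model, not any vendor's API; the class names, quota numbers, and halt-on-exhaustion policy are all assumptions made for illustration.]

```python
# Toy model of a router "kernel" that treats each virtual router (VRF)
# like a unix user: each gets a CPU share per time slice and a memory
# limit, and exhausting either halts only that instance.

class QuotaExceeded(Exception):
    pass

class VirtualRouter:
    def __init__(self, name, cpu_share, mem_limit):
        self.name = name
        self.cpu_share = cpu_share   # fraction of fabric CPU per time slice
        self.mem_limit = mem_limit   # units of table/queue memory
        self.cpu_used = 0.0
        self.mem_used = 0
        self.halted = False

    def charge(self, cpu, mem):
        """Account work against this instance's quota; halt it rather
        than let it starve its neighbors."""
        if self.halted:
            raise QuotaExceeded(self.name)
        self.cpu_used += cpu
        self.mem_used += mem
        if self.cpu_used > self.cpu_share or self.mem_used > self.mem_limit:
            self.halted = True       # only this instance fails
            raise QuotaExceeded(self.name)

class Fabric:
    def __init__(self):
        self.instances = {}

    def add(self, name, cpu_share, mem_limit):
        vr = VirtualRouter(name, cpu_share, mem_limit)
        self.instances[name] = vr
        return vr

    def end_slice(self):
        """New time slice: CPU accounting resets; halted instances stay down."""
        for vr in self.instances.values():
            vr.cpu_used = 0.0
```

The point of the sketch is the failure domain: an "internet" instance chewed up by BGP churn halts when it blows its quota, while an "atom" instance on the same fabric keeps forwarding.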
Re: Converged Networks Threat (Was: Level3 Outage)
On Wed, Feb 25, 2004 at 10:34:55AM -0800, David Meyer wrote: > Jared, > > >> > Is your concern that carrying FR/ATM/TDM over a packet > >> > core (IP or MPLS or ..) will, via some mechanism, reduce > >> > the resilience of those services, of the packet core, > >> > of both, or something else? > >> > >>I'm saying that if a network had a FR/ATM/TDM failure in > >> the past it would be limited to just the FR/ATM/TDM network. > >> (well, aside from any IP circuits that are riding that FR/ATM/TDM > >> network). We're now seeing the change from the TDM based > >> network being the underlying network to the "IP/MPLS Core" > >> being this underlying network. > >> > >>What it means is that a failure of the IP portion of the network > >> that disrupts the underlying MPLS/GMPLS/whatnot core that is now > >> transporting these FR/ATM/TDM services, does pose a risk. Is the risk > >> greater than in the past, relying on the TDM/WDM network? I think that > >> there could be some more spectacular network failures to come. Overall > >> I think people will learn from these to make the resulting networks > >> more reliable. (eg: there has been a lot learned as a result of the > >> NE power outage last year). > > I think folks can almost certainly agree that when you > share fate, well, you share fate. But maybe there is > something else here. Many of these services have always > shared fate at the transport level; that is, in most > cases, I didn't have a separate fiber plant/DWDM > infrastructure for FR/ATM/TDM, IP, Service X, etc, so > fate was already being/has always been shared in the > transport infrastructure. > > So maybe try this question: > > Is it that sharing fate in the switching fabric (as > opposed to say, in the transport fabric, or even > conduit) reduces the resiliency of a given service (in > this case FR/ATM/TDM), and as such poses the "danger" > you describe?
I think the threat is that the switching fabric and forwarding plane can be disrupted by more things than exist in a pure TDM based network. This isn't to say that the packet (or even label) network isn't the "future" of these services; it's just that today there are some interesting problems that still exist as the technology continues to mature.

> Is this an accurate characterization of your point? If > so, why should sharing fate in the switching fabric > necessarily reduce the resiliency of those services > that share that fabric (i.e., why should this be so)? I > have some ideas, but I'm interested in what ideas other > folks have.

I believe that there still exist a number of cases where the switching fabric can get out of sync with the control plane. If events are not properly triggered back upstream (ie: adjacencies stay up, bgp remains fairly stable) and you end up dumping a lot of traffic on the floor, it's sometimes a bit more difficult to diagnose than loss of light on a physical path. On the sunny side, I see this improving over time. Software bugs will be squashed. Poorly designed networks will be reconfigured to better handle these situations. - jared -- Jared Mauch | pgp key available via finger from [EMAIL PROTECTED] clue++; | http://puck.nether.net/~jared/ My statements are only mine.
Re: Converged Networks Threat (Was: Level3 Outage)
David Meyer wrote: Is this an accurate characterization of your point? If so, why should sharing fate in the switching fabric necessarily reduce the resiliency of those services that share that fabric (i.e., why should this be so)? I have some ideas, but I'm interested in what ideas other folks have.

I think it has been proven a few times that physical fate sharing is only a minor contributor to total connectivity availability, while system complexity, mostly controlled by software written and operated by imperfect humans, contributes a major share to end-to-end availability. From this, it can be deduced that reducing unnecessary system complexity and shortening the strings of pearls that make up the system contribute to better availability and resiliency of the system. Diversity works both ways in this equation. It lessens the probability of the same failure hitting the majority of your boxes but at the same time increases the knowledge needed to understand and maintain the whole system. I would vote for the KISS principle if in doubt. Pete
Re: Converged Networks Threat (Was: Level3 Outage)
I'm saying that if a network had a FR/ATM/TDM failure in the past it would be limited to just the FR/ATM/TDM network. (well, aside from any IP circuits that are riding that FR/ATM/TDM network). We're now seeing the change from the TDM based network being the underlying network to the "IP/MPLS Core" being this underlying network. What it means is that a failure of the IP portion of the network that disrupts the underlying MPLS/GMPLS/whatnot core that is now transporting these FR/ATM/TDM services, does pose a risk. Is the risk greater than in the past, relying on the TDM/WDM network? I think that there could be some more spectacular network failures to come. Overall I think people will learn from these to make the resulting networks more reliable. (eg: there has been a lot learned as a result of the NE power outage last year).

Internet traffic should run over an IP/MPLS core in a separate session (VRF, virtual context, whatever...) so the MPLS core never sees the full BGP routing information of the Internet. So long as router vendors can provide proper protection between routing instances, so one virtual router can't consume all memory/CPU, the MPLS core should be pretty stable. The core MPLS network and control plane should be completely separate from regular traffic and much less complex for any given carrier. VoIP, Internet, EoM, AToM, FRoM, TDMoM should all run in separate sessions, all isolated from each other. A router should act like a unix machine, treating each MPLS/VRF session as a separate user, isolating and protecting users from each other, providing resource allocation and limits. I'm not sure of the effectiveness of current generation routers, but it should be coming down the line.

That said, the IP/MPLS core should be more stable than traditional TDM networks; the Internet itself may not stabilize, but that shouldn't affect the core. What happened at L3 was an internet outage, which in theory shouldn't affect the MPLS core.
Think back 10 years, when it was common for a unix binary to wipe out a machine by consuming all resources (fork bombs, anyone?). Unix machines have come a long way since then. Routers need to follow the same progression. What is the routing equivalent of 'while (1) { fork(); }'? Currently it is massive BGP flapping that chews resources. A good router should be immune to that, and can be with proper resource management. -Matt
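[Editorial aside: one concrete instance of the "proper resource management" Matt asks for already exists in routing: route flap damping. A minimal sketch in the spirit of RFC 2439 follows; each flap adds a fixed penalty, the penalty decays exponentially with a half-life, and a route whose penalty crosses the suppress threshold is ignored until it decays below the reuse threshold. The parameter values are common illustrative defaults, not any particular vendor's.]

```python
# Simplified route flap damping (RFC 2439 style). A persistently
# flapping route is suppressed instead of being allowed to consume
# CPU on every withdraw/announce cycle.

class FlapDamper:
    PENALTY = 1000       # added per flap
    SUPPRESS = 2000      # suppress when penalty reaches this
    REUSE = 750          # un-suppress when penalty decays below this
    HALF_LIFE = 900.0    # seconds for the penalty to halve

    def __init__(self):
        self.penalty = 0.0
        self.last = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the accumulated penalty since last update.
        self.penalty *= 0.5 ** ((now - self.last) / self.HALF_LIFE)
        self.last = now
        if self.suppressed and self.penalty < self.REUSE:
            self.suppressed = False

    def flap(self, now):
        """Record one withdraw/announce cycle seen at time `now`."""
        self._decay(now)
        self.penalty += self.PENALTY
        if self.penalty >= self.SUPPRESS:
            self.suppressed = True

    def usable(self, now):
        """Is this route currently eligible for best-path selection?"""
        self._decay(now)
        return not self.suppressed
```

Three flaps in quick succession push the penalty past the suppress threshold; the route only becomes usable again after the penalty has had a couple of half-lives to decay.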
Re: Converged Networks Threat (Was: Level3 Outage)
Jared, >> >Is your concern that carrying FR/ATM/TDM over a packet >> >core (IP or MPLS or ..) will, via some mechanism, reduce >> >the resilience of those services, of the packet core, >> >of both, or something else? >> >> I'm saying that if a network had a FR/ATM/TDM failure in >> the past it would be limited to just the FR/ATM/TDM network. >> (well, aside from any IP circuits that are riding that FR/ATM/TDM >> network). We're now seeing the change from the TDM based >> network being the underlying network to the "IP/MPLS Core" >> being this underlying network. >> >> What it means is that a failure of the IP portion of the network >> that disrupts the underlying MPLS/GMPLS/whatnot core that is now >> transporting these FR/ATM/TDM services, does pose a risk. Is the risk >> greater than in the past, relying on the TDM/WDM network? I think that >> there could be some more spectacular network failures to come. Overall >> I think people will learn from these to make the resulting networks >> more reliable. (eg: there has been a lot learned as a result of the >> NE power outage last year).

I think folks can almost certainly agree that when you share fate, well, you share fate. But maybe there is something else here. Many of these services have always shared fate at the transport level; that is, in most cases, I didn't have a separate fiber plant/DWDM infrastructure for FR/ATM/TDM, IP, Service X, etc., so fate was already being/has always been shared in the transport infrastructure.

So maybe try this question: Is it that sharing fate in the switching fabric (as opposed to say, in the transport fabric, or even conduit) reduces the resiliency of a given service (in this case FR/ATM/TDM), and as such poses the "danger" you describe? Is this an accurate characterization of your point? If so, why should sharing fate in the switching fabric necessarily reduce the resiliency of those services that share that fabric (i.e., why should this be so)?
I have some ideas, but I'm interested in what ideas other folks have. Thanks, Dave
Re: Converged Networks Threat (Was: Level3 Outage)
In message <[EMAIL PROTECTED]>, Jared Mauch writes: > > (I know this is treading on a few "what if" scenarios, but it could >actually mean a lot if we convert to a mostly IP world as I see the trend). > I think your analysis is dead-on. --Steve Bellovin, http://www.research.att.com/~smb
Re: Converged Networks Threat (Was: Level3 Outage)
At 10:52 AM 2/25/2004, you wrote: > ...recommendation come out regarding VoIP calls. How long until a simple power failure results in the inability to place calls?

We're already at that point. If the power goes out at home, I'd have to grab a flashlight and go hunting for a regular ol' POTS-powered phone. Or use the cell phone (as I did when Bubba had a few too many to drink one night recently and took out a power transformer). But I do have a few old regular phones. How many people don't?

Interactive Intelligence, Artisoft and many others are selling business phone systems that run entirely on a "server" that may or may not be connected to a UPS of sufficient capacity to keep the server running during an extended outage. These systems are frequently handling a PRI instead of POTS lines, so there's no backup when the UPS dies. Once the "phone server" goes down, no phone service. VoIP services have the same problem. Lights go out, and that whiz-bang handy-dandy VoIP phone doesn't work, either.

Sure, we're talking about the end user, not the core/backbone. But the answer to the question, strictly speaking, is that a simple power outage can result in many people being unable to make a simple phone call (or at best, relying on their cell phones... assuming the generator fired up at their nearest cell site when the lights went out).
Re: Converged Networks Threat (Was: Level3 Outage)
On Wed, Feb 25, 2004 at 09:44:51AM -0800, David Meyer wrote: > Jared, > > >>I keep hearing of Frame-Relay and ATM signaling that is going > >> to happen in large providers' MPLS cores. That's right, your "safe" TDM > >> based services will be transported over someone's IP backbone first. > >> This means if they don't protect their IP network, the TDM services could > >> fail. These types of CES services are not just limited to Frame and ATM. > >> (Did anyone with frame/atm/vpn services from Level3 experience the > >> same outage?) > > Is your concern that carrying FR/ATM/TDM over a packet > core (IP or MPLS or ..) will, via some mechanism, reduce > the resilience of those services, of the packet core, > of both, or something else?

I'm saying that if a network had a FR/ATM/TDM failure in the past it would be limited to just the FR/ATM/TDM network. (well, aside from any IP circuits that are riding that FR/ATM/TDM network). We're now seeing the change from the TDM based network being the underlying network to the "IP/MPLS Core" being this underlying network. What it means is that a failure of the IP portion of the network that disrupts the underlying MPLS/GMPLS/whatnot core that is now transporting these FR/ATM/TDM services, does pose a risk. Is the risk greater than in the past, relying on the TDM/WDM network? I think that there could be some more spectacular network failures to come. Overall I think people will learn from these to make the resulting networks more reliable. (eg: there has been a lot learned as a result of the NE power outage last year).

> >>We're at (or already past) the dangerous point of network > >> convergence. While I suspect that nobody directly died as a result of > >> the recent outage, the trend to link together hospitals, doctors > >> and other agencies via the Internet and a series of VPN clients continues > >> to grow. 
(I say this knowing how important the internet is to > >> the medical community, reading x-rays and other data scans at > >> home for the oncall is quite common). > > Again, I'm unclear as to what constitutes "the dangerous > point of network convergence", or for that matter, what > constitutes convergence (I'm sure we have close to a > common understanding, but it's worth making that > explicit). In any event, can you be more explicit about > what you mean here?

Transporting FR/ATM/TDM/Voice over the IP/MPLS core, as well as some of the technology shifts (VoIP, Voice over Cable, etc..), is removing some of the resilience from the end-user network that existed in the past. I think that most companies that offer frame-relay and also have an IP network are looking at moving their frame-relay onto their IP network. (I could be wrong here, clearly.) This means that overall we need to continue to provide a more reliable IP network than in the past. It is critically important.

I think that Pete Templin is right to question people's statements that "nobody died because of a network outage". While I think that the answer is likely No, will that be the case in 2-3 years as Qwest, SBC, Verizon, and others move to a more native VoIP infrastructure? A failure within their IP network could result in some emergency calling (eg: 911) not working. While there are alternate means of calling for help (cell phone, etc..) that may not rely upon the same network elements that have failed, some people would consider a 60 second delay as you switch contact methods too long and an excessive risk to someone's health.

I think it bolsters the case for personal emergency preparedness, but also for spending more time looking at the services you purchase. If you are relying on a private frame-relay circuit as backup for your VPN over the public internet, knowing whether that circuit is switched over an IP network becomes more important. 
(I know this is treading on a few "what if" scenarios, but it could actually mean a lot if we convert to a mostly IP world as I see the trend). - jared -- Jared Mauch | pgp key available via finger from [EMAIL PROTECTED] clue++; | http://puck.nether.net/~jared/ My statements are only mine.
Re: Converged Networks Threat (Was: Level3 Outage)
Jared, >> I keep hearing of Frame-Relay and ATM signaling that is going >> to happen in large providers' MPLS cores. That's right, your "safe" TDM >> based services will be transported over >> someone's IP backbone first. >> This means if they don't protect their IP network, the TDM services could >> fail. These types of CES services are not just limited to Frame and ATM. >> (Did anyone with frame/atm/vpn services from Level3 experience the >> same outage?)

Is your concern that carrying FR/ATM/TDM over a packet core (IP or MPLS or ..) will, via some mechanism, reduce the resilience of those services, of the packet core, of both, or something else?

>> We're at (or already past) the dangerous point of network >> convergence. While I suspect that nobody directly died as a result of >> the recent outage, the trend to link together hospitals, doctors >> and other agencies via the Internet and a series of VPN clients continues >> to grow. (I say this knowing how important the internet is to >> the medical community, reading x-rays and other data scans at >> home for the oncall is quite common).

Again, I'm unclear as to what constitutes "the dangerous point of network convergence", or for that matter, what constitutes convergence (I'm sure we have close to a common understanding, but it's worth making that explicit). In any event, can you be more explicit about what you mean here? Thanks, Dave
Converged Networks Threat (Was: Level3 Outage)
Ok. I can't sit by here while people speculate about the possible problems of a network outage. I think that most everyone here reading NANOG realizes that the Internet is becoming more and more central to daily life, even for those that are not connected to the internet.

From where I'm sitting, I see a number of potentially dangerous trends that could result in some quite catastrophic failures of networks. No, I'm not predicting that the internet will end in 8^H7 days or anything like that. I think the Level3 outage as seen from the outside is a clear case that single providers will continue to have their own network failures for some time to come. (I just hope daily it's not my employer's network ;-) )

So, we're sitting here at the crossroads, where VoIP is "coming of age". Vonage, 8x8 and others are blazing a path that the rest of the providers are now beginning to gun for. We've already read in press releases and articles in the past year how providers in Canada and the US are moving to VoIP transport within their long-distance networks.

I keep hearing of Frame-Relay and ATM signaling that is going to happen in large providers' MPLS cores. That's right, your "safe" TDM based services will be transported over someone's IP backbone first. This means if they don't protect their IP network, the TDM services could fail. These types of CES services are not just limited to Frame and ATM. (Did anyone with frame/atm/vpn services from Level3 experience the same outage?)

Now the question of Emergency Services is being posed here, but also in parallel by a number of other people at the FCC. We've seen the E911 recommendation come out regarding VoIP calls. How long until a simple power failure results in the inability to place calls?

Now, I'm not trying to pick on Level3 at all. The trend I outline here is very real. The reliance on the Internet for critical communications is a trend that continues. 
Look at how it was used on 9/11 for communications when cell and land based telephony networks were crippled. The internet has become a very critical part of all of our lives (some more than others), with banks using VPNs to link their ATMs back into their corporate networks, as well as the number of people that use it for just plain "just in time" bill payment and other things. I can literally cancel my home phone line and cell phone and communicate solely with my internet connection, performing all my bill payments without any paperwork. I can even file my taxes online.

We're at (or already past) the dangerous point of network convergence. While I suspect that nobody directly died as a result of the recent outage, the trend to link together hospitals, doctors and other agencies via the Internet and a series of VPN clients continues to grow. (I say this knowing how important the internet is to the medical community; reading x-rays and other data scans at home while on call is quite common.)

While my friends that are in the local VFD do still have the traditional pager service with towers, etc., how long until the T1's that are used for dial-in or speaking to the towers are moved to some sort of IP based system? The global economy seems to be going this direction with varying degrees of caution.

I'm concerned, but not worried... the network will survive.

- Jared

On Wed, Feb 25, 2004 at 09:17:30AM -0600, Pete Templin wrote: > If an IP-based system lets you see the status of the 23 hospitals in San > Antonio graphically, perhaps overlaid with near-real-time traffic > conditions, I'd rather use it as primary and telephone as secondary. > > Counting on it? No. Gaining usability from it? You betcha. > > Brian Knoblauch wrote: > > > If you're counting on IP (a "best attempt" protocol) for critical > >data, you've got a serious design flaw in your system... 
> > > >-Original Message- > >From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of > >Pete > >Templin > >Sent: Wednesday, February 25, 2004 9:10 > >To: Colin Neeson > >Cc: [EMAIL PROTECTED] > >Subject: Re: Level 3 statement concerning 2/23 events (nothing to see, move > >along) > > > > > > > > > >Are you sure no one died as a result? My hobby is volunteering as a > >firefighter and EMT. If Level3's network sits between a dispatch center > >or mobile data terminal and a key resource, it could be a factor > >(hospital status website, hazardous materials action guide, VoIP link > >that didn't reroute because the control plane was happy but the > >forwarding plane was sad, etc.). > > > >And if the problem could happen to another network tomorrow but could be > >prevented or patched, wouldn't inquiring minds want to know? Your life > >might be more interesting when the fit hits the shan if you have the > >same vulnerability. > > > >Colin Neeson wrote: > > > > > >>Because, in the the grand scale scheme of things, it's real