Re: Link capacity upgrade threshold
> So, in summary: Your dropped packet counters are the ones to be looking at as a measure of goodput, more than your utilization counters.

Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless.

When you're only aware of the RX side though, in the absence of an equivalent to BECN, what's the best way to track this? Do any of the Ethernet OAM standards expose this data?

Similarly, could anyone share experiences with transit link upgrades to accommodate bursts? In the past, any requests to transit providers have been answered w/ the need for significant increases to 95%ile commits. While this makes sense from a sales perspective, there's a strong (but insufficient) engineering argument against it.
Re: Link capacity upgrade threshold
On Tue, 1 Sep 2009, Kevin Graham wrote:

> Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless.

If you're dropping packets, you're already over the cliff. Our job as an ISP is to forward the packets our customers send to us; how is that compatible with upgrading links only when they're so full that you're not only buffering but actually DROPPING packets?

-- 
Mikael Abrahamsson    email: swm...@swm.pp.se
Re: Link capacity upgrade threshold
Mikael Abrahamsson wrote:

> If you're dropping packets, you're already over the cliff. Our job as an ISP is to forward the packets our customers send to us; how is that compatible with upgrading links only when they're so full that you're not only buffering but actually DROPPING packets?

Many ISPs don't even watch the drop rate. Packets can easily start dropping long before you reach an 80% average mark, or may not drop until 90% utilization. Dropped packets are a safety-net measurement, detecting that coinciding bursts have filled the buffers. One could argue that setting up QoS with larger buffers and monitoring the buffer usage is better than waiting for the drop.

Jack
Re: Link capacity upgrade threshold
On Wed, Sep 02, 2009 at 08:39:20AM +0200, Mikael Abrahamsson wrote:

> On Tue, 1 Sep 2009, Kevin Graham wrote:
> > Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless.
> If you're dropping packets, you're already over the cliff. Our job as an ISP is to forward the packets our customers send to us; how is that compatible with upgrading links only when they're so full that you're not only buffering but actually DROPPING packets?

By all means watch your traffic utilization and plan your upgrades in a timely fashion, but watching for dropped packets can help reveal unexpected issues, such as all of those routers out there that don't actually do line rate depending on your particular traffic profile or pattern of traffic between ports.

Personally I find the whole argument over what % of utilization should trigger an upgrade to be little more than a giant exercise in penis waving. People throw out all kinds of numbers, 80%, 60%, 50%, 40%, I've even seen someone claim 25%, but in the end I find more value in a quick reaction time to ANY unexpected event than I do in adhering to some arbitrary rule about when to upgrade. I'd rather see someone leave their links 80% full but have enough spare parts and a competent enough operations staff that they can turn around an upgrade in a matter of hours, than see them upgrade something unnecessarily at 40% and then not be able to react to an unplanned issue on a different port.

And honestly, I've peered with a lot of networks who claim to do preemptive upgrades at numbers like 40%, but I've never actually seen it happen. In fact, the relationship between the marketing claim of upgrading at x percentage and the number of weeks you have to run the port congested before the other side gets budgetary approval to sign the PO for the optics that they need to do the upgrade but don't own seems to be inversely proportional. :)

-- 
Richard A Steenbergen r...@e-gerbil.net http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
RE: Link capacity upgrade threshold
What SNMP MIB records drops? I poked around for a short time, and I'm thinking that generically the drops fall into the errors counter. Hopefully that's not the case.

Frank

-----Original Message-----
From: Kevin Graham [mailto:kgra...@industrial-marshmallow.com]
Sent: Wednesday, September 02, 2009 1:32 AM
To: Bill Woodcock; nanog
Subject: Re: Link capacity upgrade threshold

> Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless. [...]
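For what it's worth, IF-MIB (RFC 2863) does keep drops separate from errors: ifInDiscards/ifOutDiscards (.1.3.6.1.2.1.2.2.1.13 and .19) count packets discarded without any error being detected, typically to reclaim buffer space, while ifInErrors/ifOutErrors (.14 and .20) count genuinely bad packets. A minimal polling sketch using pysnmp follows; the hostname, community string, and ifIndex are placeholders:

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    def drop_counters(host, community, if_index):
        """Fetch discard (buffer drop) and error counters for one interface;
        IF-MIB reports them as distinct objects."""
        oids = [ObjectType(ObjectIdentity('IF-MIB', name, if_index))
                for name in ('ifOutDiscards', 'ifOutErrors',
                             'ifInDiscards', 'ifInErrors')]
        err_ind, err_stat, _, var_binds = next(getCmd(
            SnmpEngine(), CommunityData(community),
            UdpTransportTarget((host, 161)), ContextData(), *oids))
        if err_ind or err_stat:
            raise RuntimeError(str(err_ind or err_stat))
        return {name.prettyPrint(): int(value) for name, value in var_binds}

    # Counters are cumulative, so poll twice and difference the readings:
    # print(drop_counters('router.example.net', 'public', 1))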
Re: Link capacity upgrade threshold
On Sun, 30 Aug 2009, Nick Hilliard wrote:

> In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly.

Or some enterprising vendor could start recording utilisation stats?

regards,
-- 
Paul Jakma p...@jakma.org Key ID: 64A2FF6A
Fortune: Try to value useful qualities in one who loves you.
Re: Link capacity upgrade threshold
On Tue, Sep 01, 2009 at 11:55:45AM +0100, Paul Jakma wrote:

> On Sun, 30 Aug 2009, Nick Hilliard wrote:
> > In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly.
> Or some enterprising vendor could start recording utilisation stats?

do any router vendors provide something akin to hardware latches to keep track of highest buffer fill levels? poll as frequently/infrequently as you like...

-- 
Aaron J. Grier | Not your ordinary poofy goof. | agr...@poofygoof.com
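To put a number on how short that timescale is, here is a rough calculation; the buffer size and rates below are assumed example figures, not any particular platform:

    def fill_time_ms(buffer_bytes, inflow_bps, drain_bps):
        """Time for an egress buffer to fill once offered load exceeds the
        drain rate; it never fills while the buffer is draining."""
        excess_bps = inflow_bps - drain_bps
        if excess_bps <= 0:
            return float("inf")
        return buffer_bytes * 8 / excess_bps * 1000.0

    # Assumed figures: 1 MB of port buffer on a 1GE egress, 2 Gbps offered.
    print(fill_time_ms(1_000_000, 2e9, 1e9))  # -> 8.0 ms

Eight milliseconds is thousands of times shorter than even a 30-second poll interval, which is why a high-water latch, or the drop counters themselves, end up being the only practical record of the event.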
RE: Link capacity upgrade threshold
Another approach to collecting buffer utilization is to infer such utilization from other variables. Active measurement of round-trip times (RTT), packet loss, and jitter on a link-by-link basis is a reliable way of inferring the interface queuing which leads to packet loss. A link that runs with good values on all 3 measures (low RTT, little or no packet loss, low jitter with small inter-packet arrival variation) can be deemed not a candidate for bandwidth upgrades. The key to active measurement is random measurement of the links so as to catch the bursts. The BRIX active measurement product (now owned by EXFO) is a good active measurement tool which randomizes its probes so as to, over time, collect a randomized sample of link behavior.

-----Original Message-----
From: Aaron J. Grier [mailto:agr...@poofygoof.com]
Sent: Tuesday, September 01, 2009 12:19 PM
To: nanog@nanog.org
Subject: Re: Link capacity upgrade threshold

> do any router vendors provide something akin to hardware latches to keep track of highest buffer fill levels? poll as frequently/infrequently as you like... [...]
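The randomized-probe idea can be experimented with without a commercial product. A rough sketch of the approach: probes sent at exponentially distributed intervals (Poisson sampling, as recommended by RFC 2330) toward an assumed UDP echo responder (RFC 862); the host and port here are placeholders:

    import random
    import socket
    import struct
    import time

    def probe(host, port=7, count=100, mean_gap=0.2):
        """Measure RTT, loss, and a crude jitter figure with randomized
        inter-probe gaps so bursts aren't systematically missed."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(1.0)
        rtts, lost = [], 0
        for seq in range(count):
            t0 = time.time()
            sock.sendto(struct.pack("!Id", seq, t0), (host, port))
            try:
                sock.recvfrom(64)
                rtts.append((time.time() - t0) * 1000.0)
            except socket.timeout:
                lost += 1
            # Exponential gaps with the given mean: Poisson sampling.
            time.sleep(random.expovariate(1.0 / mean_gap))
        jitter = max(rtts) - min(rtts) if rtts else None
        return {"sent": count, "lost": lost, "jitter_ms": jitter,
                "rtt_ms_avg": sum(rtts) / len(rtts) if rtts else None}

    # print(probe("far-end-router.example.net"))  # placeholder target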
RE: Link capacity upgrade threshold
> do any router vendors provide something akin to hardware latches to keep track of highest buffer fill levels? poll as frequently/infrequently as you like...

Without getting into each permutation of a device's architecture, aren't buffer fills really just buffer drops? There are means to determine this. Lots of vendors have configurable buffer pools for inter-device traffic levels that record high water levels as well.

Deepak Jain
AiNET
Re: Link capacity upgrade threshold
Holmes, David A wrote:

> runs with good values on all 3 measures (low RTT, little or no packet loss, low jitter with small inter-packet arrival variation) can be deemed not a candidate for bandwidth upgrades. The key to active

Sounds great, unless you don't own the router on the other side of the link, which may be subject to ICMP filtering, have a loaded RE, etc. If you pass the traffic through the routers to a reliable server, you'll be monitoring multiple links/routers and not just a single one.

Jack
Re: Link capacity upgrade threshold
On Sun, 30 Aug 2009, Randy Bush wrote:

> > If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.
> s/80/60/
> the normal snmp and other averaging methods *really* miss the bursts.

Agreed. Internet traffic is very bursty. If you care about your customer experience, upgrade at the 60-65% level, especially if an interface towards a customer is similar in bandwidth to your backbone links...

Best Regards,
Janos Mohacsi
Re: Link capacity upgrade threshold
> If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.

s/80/60/

the normal snmp and other averaging methods *really* miss the bursts.

randy
Re: Link capacity upgrade threshold
On 30/08/2009 13:04, Randy Bush wrote:

> the normal snmp and other averaging methods *really* miss the bursts.

Definitely. For fun and giggles, I recently turned on 30 second polling on some kit and it turned up all sorts of interesting peculiarities that were completely blotted out in a 5 minute average.

In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly.

There's a lot to the saying that QoS really means Quantity of Service, because quality of service only ever becomes a problem if there is a shortfall in quantity.

Nick
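To put numbers on how much a 5-minute average can hide, a quick illustration (all figures assumed for the example):

    # A 1GE link saturated for 30 seconds within one 5-minute poll window:
    line_rate = 1e9     # bits/sec
    burst_s = 30        # seconds spent at 100% utilization
    window_s = 300      # 5-minute SNMP polling interval
    avg = line_rate * burst_s / window_s
    print(f"{avg / 1e6:.0f} Mbps reported ({avg / line_rate:.0%} utilization)")
    # -> 100 Mbps / 10%: half a minute of congestion vanishes into the graph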
Re: Link capacity upgrade threshold
Nick Hilliard wrote:

> Definitely. For fun and giggles, I recently turned on 30 second polling on some kit and it turned up all sorts of interesting peculiarities that were completely blotted out in a 5 minute average.

Would RMON History and Alarms help? I've always considered rolling them out to some of my kit to catch microbursts.

Poggs
Re: Link capacity upgrade threshold
If talking about just max capacity, I would agree with most of the statements that 80+% is in the right range, likely with a very fine line as to when you actually start seeing a performance impact. Operationally, at least in our network, I'd never run anything at that level. Providers that are redundant for each other don't normally operate above 40-45%, in order to accommodate a failure. Other links that have a backup, but don't actively load share, normally run up to about 60-70% before being upgraded. By the time the upgrade is complete, it could be close to 80%.

Tom Sands
Rackspace Hosting

William Herrin wrote:

> On Sat, Aug 29, 2009 at 11:50 PM, devang patel devan...@gmail.com wrote:
> > I just wanted to know what the link capacity upgrade threshold is, in terms of % of link utilization. Just to get an idea...
> If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95%, it's time to finish the upgrade. [...]
Re: Link capacity upgrade threshold
Date: Sun, 30 Aug 2009 21:04:15 +0900
From: Randy Bush ra...@psg.com

> > If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.
> s/80/60/
> the normal snmp and other averaging methods *really* miss the bursts.

s/60/40/

If you need to carry large TCP flows, say 2 Gbps on a 10GE, dropping even a single packet due to congestion is unacceptable. Even with fast recovery, the average transmission rate will take a noticeable dip on every drop, and even a drop rate under 1% will slow the flow dramatically.

The point is, what is acceptable for one traffic profile may be unacceptable for another. Mail and web browsing are generally unaffected by light congestion. Other applications are not so forgiving.

-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: ober...@es.net  Phone: +1 510 486-8634
Key fingerprint: 059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751
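Kevin's point can be made quantitative with the well-known Mathis et al. (1997) steady-state approximation for Reno-style TCP, BW ≈ (MSS/RTT) * (C/sqrt(p)) with C ≈ 1.22. A quick check; the MSS and RTT here are assumed example values:

    from math import sqrt

    def mathis_bw_bps(mss_bytes, rtt_s, loss_rate):
        """Mathis et al. approximation of steady-state TCP throughput:
        BW ~= (MSS/RTT) * (C / sqrt(p)), C ~= 1.22 for Reno-style TCP."""
        return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss_rate))

    # A single 1460-byte-MSS flow at 50 ms RTT:
    for p in (1e-3, 1e-5, 1e-7):
        print(f"loss {p:.0e}: ~{mathis_bw_bps(1460, 0.050, p) / 1e6:,.0f} Mbps")
    # -> ~9, ~90, ~900 Mbps: sustaining 2 Gbps needs a loss rate near 2e-8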
Re: Link capacity upgrade threshold
What system were you using to monitor link usage?

Shane

On Aug 30, 2009, at 8:26 AM, Nick Hilliard wrote:

> On 30/08/2009 13:04, Randy Bush wrote:
> > the normal snmp and other averaging methods *really* miss the bursts.
> Definitely. For fun and giggles, I recently turned on 30 second polling on some kit and it turned up all sorts of interesting peculiarities that were completely blotted out in a 5 minute average. [...]
Re: Link capacity upgrade threshold
On 30/08/2009 17:53, Shane Ronan wrote:

> What system were you using to monitor link usage?

mrtg

Nick
Re: Link capacity upgrade threshold
On Aug 30, 2009, at 1:23 AM, Mikael Abrahamsson wrote:

> On Sun, 30 Aug 2009, William Herrin wrote:
> > If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95%, it's time to finish the upgrade.
> I now see why people at the IETF spoke as if core network congestion were something natural. If your MRTG graph is showing 95% load in 5 minute average, you're most likely congesting/buffering at some time during that 5 minute interval. Whether this is acceptable in your network (it's not in mine) is up to you. Also, a gig link on a Cisco will do approx 93-94% of imix of a gig in the values presented via SNMP (around 930-940 megabit/s as seen in show int) before it's full, because of IFG, ethernet header overhead etc.

I've heard this said many times. I've also seen 'sho int' say 950,000,000 bits/sec and not see packets get dropped. I was under the impression show int showed -every- byte leaving the interface. I could make an argument that IFG would not be included, but things like ethernet headers better be. Does this change between IOS revisions, or hardware, or is it old info, or ... what?

-- 
TTFN,
patrick

P.S. I agree that without perfect conditions (e.g. using an Ixia to test link speeds), you should upgrade WAY before 90-something percent. Microbursts are real, and buffer space is small these days. I'm just asking what the counters -actually- show.

> So personally, I consider a gig link in desperate need of upgrade when it's showing around 850-880 megs of traffic in mrtg.
> -- 
> Mikael Abrahamsson    email: swm...@swm.pp.se
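For what it's worth, Mikael's 93-94% figure is what you'd expect from counters that include the Ethernet header and FCS but not the 8-byte preamble or the 12-byte inter-frame gap. A back-of-envelope check using a simple 7:4:1 imix (an assumed mix; real traffic and per-platform counting behavior vary, as the next message notes):

    # Bytes on the wire per frame that a header+FCS counter never sees:
    PREAMBLE, IFG = 8, 12
    OVERHEAD = PREAMBLE + IFG  # 20 bytes invisible per frame

    # Simple 7:4:1 imix (assumed); sizes include Ethernet header + FCS.
    imix = [(64, 7), (594, 4), (1518, 1)]
    counted = sum(size * n for size, n in imix)
    wire = sum((size + OVERHEAD) * n for size, n in imix)
    print(f"counters show {counted / wire:.1%} of wire rate at saturation")
    # -> ~94.8%, i.e. a truly full gig reads roughly 930-950 Mbps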
Re: Link capacity upgrade threshold
> > > If your 95th percentile utilization is at 80% capacity...
> > s/80/60/
> s/60/40/

I would suggest that the reason each of you has a different number is that there's a different best number for each case. Looking for any single number to fit all cases, rather than understanding the underlying process, is unlikely to yield good results.

First, different people have different requirements. Some people need lowest possible cost, some people need lowest cost per volume of bits delivered, some people need lowest cost per burst capacity, some need low latency, some need low jitter, some want good customer service, some want flexible payment terms, and undoubtedly there are a thousand other possible qualities.

Second, this is a binary digital network. It's never 80% full, it's never 60% full, and it's never 40% full. It's always exactly 100% full or exactly 0% full. If SNMP tells you that you've moved 800 megabits in a second on a one-gigabit pipe, then, modulo any bad implementations of SNMP, your pipe was 100% full for eight-tenths of that second. SNMP does not hide anything. Applying any percentile function to your data, on the other hand, does hide data. Specifically, it discards all of your data except a single point, irreversibly. So if you want to know anything about your network, you won't be looking at percentiles.

Having your circuit be 100% full is a good thing, presuming you're paying for it and the traffic has some value to you. Having it be 100% full as much of the time as possible is a good thing, because that gives you a high ratio of value to cost. Dropping packets, on the other hand, is likely to be a bad thing, both because each packet putatively had value, and because many dropped packets are likely to be resent, and a resent packet is one you've paid for twice, one that has precluded the sending of another new, paid-for packet in that timeframe. The cost of not dropping packets is not having buffers overflow, and the cost of not having buffers overflow is either having deep buffers, which means high latency, or having customers with a predictable flow of traffic. Which brings me to item three.

In my experience, the single biggest contributor to buffer overflow is having in-feeding (or downstream customer) circuits whose burst capacity is too close to that of the out-feeding (or upstream transit) circuits. Let's say that your outbound circuit is a gigabit, you have two inbound circuits that are a gigabit each and run at 100% utilization 10% of the time each, and you have a megabit of buffer memory allocated to the outbound circuit. 1% of the time, both of the inbound circuits will be at 100% utilization simultaneously. When that's happening, you'll have data flowing in at the rate of two gigabits per second, which will fill the buffer in about a millisecond, if it persists. And, just like Rosencrantz and Guildenstern flipping coins, such a run will inevitably persist longer than you'd desire, frequently enough.

On the other hand, if you have twenty inbound circuits of 100 megabits each, which are transmitting at 100% of capacity 10% of the time each, you're looking at exactly the same amount of data, however it arrives _much more predictably_: a full 2-gigabit inflow would require all twenty to burst at once, which essentially never happens, and even an inflow exceeding the gigabit of outbound capacity (eleven or more bursting together) would occur only about 0.00007% of the time, rather than 1% of the time. And such an event would also be proportionally unlikely to persist for the longer periods of time necessary to overflow the buffer.
Thus Kevin's ESnet customers, who are much more likely to be 10gb or 40gb downstream circuits feeding into his 40gb upstream circuits, are much more likely to overflow buffers than a consumer Internet provider who's feeding 1mb circuits into a gigabit circuit, even if the aggregation ratio of the latter is hundreds of times higher.

So, in summary: Your dropped packet counters are the ones to be looking at as a measure of goodput, more than your utilization counters. And keep the size of your aggregation pipes as much bigger than the size of the pipes you aggregate into them as you can afford to.

As always, my apologies to those of you for whom this is unnecessarily remedial, for using NANOG bandwidth and a portion of your Sunday morning.

-Bill
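Bill's two scenarios can be checked with a quick binomial model, assuming the feeders burst independently (which real traffic only approximates):

    from math import comb

    def p_overload(n, p_active, feeder_bps, egress_bps):
        """Probability the aggregate offered load exceeds egress capacity,
        with n feeders each independently at 100% a fraction p_active of
        the time."""
        need = int(egress_bps // feeder_bps) + 1  # feeders needed to exceed egress
        return sum(comb(n, k) * p_active**k * (1 - p_active)**(n - k)
                   for k in range(need, n + 1))

    # Two 1G feeders into a 1G egress: overload whenever both burst at once.
    print(p_overload(2, 0.10, 1e9, 1e9))     # 0.01, i.e. 1% of the time
    # Twenty 100M feeders into the same egress: needs 11+ bursting together.
    print(p_overload(20, 0.10, 100e6, 1e9))  # ~7e-7, about 0.00007%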
Re: Link capacity upgrade threshold
On Sun, Aug 30, 2009 at 01:03:35PM -0400, Patrick W. Gilmore wrote:

> > Also, a gig link on a Cisco will do approx 93-94% of imix of a gig in the values presented via SNMP (around 930-940 megabit/s as seen in show int) before it's full, because of IFG, ethernet header overhead etc.
> I've heard this said many times. I've also seen 'sho int' say 950,000,000 bits/sec and not see packets get dropped. I was under the impression show int showed -every- byte leaving the interface. I could make an argument that IFG would not be included, but things like ethernet headers better be. Does this change between IOS revisions, or hardware, or is it old info, or ... what?

Actually Cisco does count layer 2 header overhead in its snmp and show int results; it is Juniper who does not (for most platforms at any rate) due to their hw architecture. I did some tests regarding this a while back on j-nsp; you'll see different results for different platforms and depending on whether you're looking at the tx or rx. Also you'll see different results for vlan overhead and the like, which can further complicate things.

That said, show int is an epic disaster for a significantly large percentage of the time. I've seen more bugs and false readings on that thing than I can possibly count, so you really shouldn't rely on it for rate readings. The problem is extra special bad on SVIs, where you might see a reading that is 20% high or low from reality at any given second, even on modern code. I'm not aware of any major issues detecting drops though, so you should at least be able to detect them when they happen (which isn't always at line rate). If you're on a 6500/7600 platform running anything SXF+, try "show platform hardware capacity interface" to look for interfaces with lots of drops globally.

-- 
Richard A Steenbergen r...@e-gerbil.net http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
RE: Link capacity upgrade threshold
> > > If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.
> > s/80/60/
> > the normal snmp and other averaging methods *really* miss the bursts.
> s/60/40/

What is this upgrade thing you all speak of? When your links become saturated, shouldn't you solve the problem by deploying DPI-based application-discriminatory throttling and start double-dipping your customers? After all, it's their fault for using up more bandwidth than your flawed business model told you they would use. (If you're not familiar with Bell Canada, it's OK if you don't get the joke.)
Re: Link capacity upgrade threshold
I consider a circuit nearing capacity at 80-85%. Depending on the circuit, we start the process of increasing capacity around 70%. There are almost always telco issues, in-building issues, not enough physical ports on the provider end, and other such things that slow you down.

Justin

From: devang patel devan...@gmail.com
Date: Sat, 29 Aug 2009 21:50:41 -0600
To: nanog@nanog.org
Subject: Link capacity upgrade threshold

> Hi All,
> I just wanted to know what the link capacity upgrade threshold is, in terms of % of link utilization. Just to get an idea...
> thanks,
> Devang Patel
Re: Link capacity upgrade threshold
On Sat, Aug 29, 2009 at 11:50 PM, devang patel devan...@gmail.com wrote:

> I just wanted to know what the link capacity upgrade threshold is, in terms of % of link utilization. Just to get an idea...

If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95%, it's time to finish the upgrade. If your average or median utilizations are at 80% capacity, then as often as not it's time for your boss to fire you and replace you with someone who can do the job.

Slight variations depending on the resource. Use absolute peak instead of 95th percentile for modem bank utilization -- under normal circumstances a modem bank should never ring busy. And a gig-e can run a little closer to the edge (percentage-wise) before folks notice slowness than a T1 can.

Regards,
Bill Herrin

-- 
William D. Herrin her...@dirtside.com b...@herrin.us
3005 Crane Dr. .. Web: http://bill.herrin.us/
Falls Church, VA 22042-3004
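For anyone applying the 95th-percentile rule mechanically, the usual convention is to sort the interval samples and discard the top 5%. A minimal sketch:

    def percentile_95(samples):
        """Highest sample remaining after the top 5% are discarded
        (one common billing/utilization convention)."""
        ordered = sorted(samples)
        return ordered[max(0, int(len(ordered) * 0.95) - 1)]

    # e.g., a week of 5-minute readings in bits/sec (2016 samples):
    # if percentile_95(samples) > 0.80 * capacity_bps: start planning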
Re: Link capacity upgrade threshold
On Sun, 30 Aug 2009, William Herrin wrote:

> If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95%, it's time to finish the upgrade.

I now see why people at the IETF spoke as if core network congestion were something natural. If your MRTG graph is showing 95% load in 5 minute average, you're most likely congesting/buffering at some time during that 5 minute interval. Whether this is acceptable in your network (it's not in mine) is up to you.

Also, a gig link on a Cisco will do approx 93-94% of imix of a gig in the values presented via SNMP (around 930-940 megabit/s as seen in show int) before it's full, because of IFG, ethernet header overhead etc. So personally, I consider a gig link in desperate need of upgrade when it's showing around 850-880 megs of traffic in mrtg.

-- 
Mikael Abrahamsson    email: swm...@swm.pp.se