Re: Amazon diagnosis
On 04/29/2011 12:35 PM, Joly MacFie wrote: http://aws.amazon.com/message/65648/ So, in a nutshell, Amazon had a single point of failure which touched off this entire incident. I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one. Good job by the AWS team, however; I am sure your new procedures and processes will receive a shakeout again, and it will be interesting to see how that goes. I bet there will be more to learn along this road for us all. Mike-
Re: Amazon diagnosis
- Original Message - From: Mike mike-na...@tiedyenetworks.com On 04/29/2011 12:35 PM, Joly MacFie wrote: http://aws.amazon.com/message/65648/ So, in a nutshell, Amazon had a single point of failure which touched off this entire incident. I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one. Well, in fairness to Amazon, let's ask this: did the failure occur *behind a component interface they advertise as Reliable*? Either way, was it possible for a single customer to avoid that possible failure, and at what cost in expansion of scope and money? Cheers, -- jra
Re: Amazon diagnosis
On 5/1/2011 2:07 PM, Mike wrote: I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one. Sure they can, but as a thought exercise, full 2N redundancy is difficult on a small scale for anything web-facing. I've seen a very simple implementation for a website requiring five 9's that consumed over $50k in equipment, and this wasn't even geographically diverse. I have to believe that scaling up the concept of doing it right results in exponential cost increases. To illustrate the problem, I'll give you the first step in the thought exercise: first find two datacenters with diverse carriers that aren't on the same regional power grid (as we learned in the 2003 power outage (IIRC), New York and DC won't work, nor will Ohio, so you need redundant teams to cover a very remote site).
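The arithmetic behind that thought exercise is worth spelling out. A minimal sketch in Python, where the per-site availability figure is assumed purely for illustration:

    # Why grid/carrier diversity matters: redundancy only multiplies
    # availability when the failures are actually independent.
    single_site = 0.999                      # assumed: one site gives three 9's

    # Two independent sites are both down only when each fails at once:
    two_sites = 1 - (1 - single_site) ** 2
    print(f"one site : {single_site:.5%}")   # 99.90000%
    print(f"two sites: {two_sites:.5%}")     # 99.99990% -- nearly six 9's

    # A shared power grid, carrier, or ops team correlates the failures
    # and quietly pushes the exponent back toward 1, which is the point
    # of the 2003-outage example above.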
RE: Amazon diagnosis
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one. Good job by the AWS team, however; I am sure your new procedures and processes will receive a shakeout again, and it will be interesting to see how that goes. I bet there will be more to learn along this road for us all. Mike- From my reading of what happened, it looks like they didn't have a single point of failure but ended up routing around their own redundancy. They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network. Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network over to the secondary. This resulted in the secondary network reaching saturation and becoming unusable. There isn't anything that can be done to mitigate human error. You can TRY, but as history shows us, it all boils down to the human who implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing. In this case it is my opinion that Amazon should not have considered their secondary network to be a true secondary if it was not capable of handling the traffic. A completely broken network might have been an easier failure mode to handle than a saturated network (high packet loss, but the network is still there). This looks like it was a procedural error and not an architectural problem. They seem to have had standby capability on the primary network and, from the way I read their statement, did not use it.
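George's point that the secondary lacked the capacity to be a true secondary suggests one mechanical safeguard: a pre-flight capacity check on any traffic shift. A hedged sketch follows; the function names and capacity figures are hypothetical, not AWS tooling:

    # Hypothetical pre-flight check before a traffic shift; all names
    # and numbers are illustrative, not Amazon's internals.

    def safe_to_shift(offered_gbps: float, target_capacity_gbps: float,
                      headroom: float = 0.8) -> bool:
        """Refuse any failover that would saturate the target network."""
        return offered_gbps <= target_capacity_gbps * headroom

    primary_load = 400.0        # assumed offered load on the primary, Gbps
    secondary_capacity = 100.0  # assumed capacity of the secondary, Gbps

    if not safe_to_shift(primary_load, secondary_capacity):
        # The failure mode described above: the full primary load landed
        # on a secondary that could not carry it. A guard aborts instead.
        raise SystemExit("target network lacks capacity; aborting shift")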
Re: Amazon diagnosis
On 5/1/2011 9:29 AM, Jeff Wheeler wrote: On Sun, May 1, 2011 at 2:18 PM, Andrew Kirch trel...@trelane.net wrote: Sure they can, but as a thought exercise, full 2N redundancy is difficult on a small scale for anything web-facing. I've seen a very simple implementation for a website requiring five 9's that consumed over $50k in equipment, and this wasn't even geographically diverse. I have What it really boils down to is this: if application developers are doing their jobs, a given service can be easy and inexpensive to distribute to unrelated systems/networks without a huge infrastructure expense. If they are not, you end up spending a lot of money on infrastructure to make up for code, databases, and APIs which were not designed with this in mind. These same developers who do not design and implement services with diversity and redundancy in mind will fare little better with AWS than on any other platform. Look at Reddit, for example. This is an application/service which is utterly trivial to implement in a cheap, distributed manner, yet they have failed to do so for years, and suffer repeated, long-duration outages as a result. They probably buy a lot more AWS services than would otherwise be needed, and truly have a more complex infrastructure than such a simple service should. IT managers would do well to understand that a few smart programmers, who understand how all their tools (web servers, databases, filesystems, load-balancers, etc.) actually work, can often do more to keep infrastructure cost under control, and improve the reliability of services, than any other investment in IT resources. If you want a perfect example of this, consider Netflix. Their infrastructure runs on AWS and we didn't see any downtime with them throughout the entire affair. One of the interesting things they've done to try and enforce reliability of services is an in-house service called Chaos Monkey, whose sole purpose is to randomly kill instances and services inside the infrastructure. Courtesy of Chaos Monkey and the defensive programming it enforces, no component hard-depends on another, and you will always get at least some form of the service. For example, if the recommendation engine dies, the application is smart enough to catch that and instead return a list of the most popular movies, and so on. There is an interesting blog post from their Director of Engineering about what they learned in their migration to AWS, including using less chatty APIs to reduce the impact of typical AWS latency: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html Paul
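The fallback Paul describes is the classic graceful-degradation pattern. A minimal sketch, with every name hypothetical rather than Netflix's actual code:

    import random

    def personalized(user_id: str) -> list[str]:
        # Stand-in for the recommendation engine; fails at random,
        # like an instance Chaos Monkey just terminated.
        if random.random() < 0.5:
            raise ConnectionError("recommendation engine unreachable")
        return [f"pick-for-{user_id}-1", f"pick-for-{user_id}-2"]

    def most_popular() -> list[str]:
        # Static, always-available fallback list.
        return ["popular-1", "popular-2"]

    def recommendations_row(user_id: str) -> list[str]:
        # Degrade instead of failing: the page renders either way.
        try:
            return personalized(user_id)
        except ConnectionError:
            return most_popular()

    print(recommendations_row("alice"))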
Re: Amazon diagnosis
On Sun, May 01, 2011 at 12:50:37PM -0700, George Bonser wrote: From my reading of what happened, it looks like they didn't have a single point of failure but ended up routing around their own redundancy. They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network. Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network over to the secondary. This resulted in the secondary network reaching saturation and becoming unusable. There isn't anything that can be done to mitigate human error. You can TRY, but as history shows us, it all boils down to the human who implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing. [ ... ] This looks like it was a procedural error and not an architectural problem. They seem to have had standby capability on the primary network and, from the way I read their statement, did not use it. The procedural error was putting all the traffic on the secondary network. They promptly recognized that error and fixed it. It's certainly true that you can't eliminate human error. The architectural problem is that they had insufficient error recovery capability. Initially, the system was trying to use a network that was too small; that situation lasted for some number of minutes; it's no surprise that the system couldn't operate under those conditions, and that isn't an indictment of the architecture. However, after they put it back on a network that wasn't too small, the service stayed down/degraded for many, many hours. That's an architectural problem. (And a very common one. Error recovery is hard and tedious and, more often than not, not done well.) Procedural error isn't the only way to get into that boat. If the wrong pair of redundant equipment in their primary network had failed simultaneously, they'd likely have found themselves in the same boat: a short outage caused by a risk they accepted (loss of a redundant hardware pair), followed by a long outage (after they restored the network) caused by insufficient recovery capability. Their writeup suggests they fully understand these issues and are doing the right thing by seeking to have better recovery capability. They spent one sentence saying they'll look at their procedures to reduce the risk of a similar procedural error in the future, and then spent paragraphs on what they are going to do to have better recovery should something like this occur in the future. (One additional comment, for whoever posted that Netflix had a better architecture and wasn't impacted by this outage. It might well be that Netflix does have a better architecture and that might be why they weren't impacted... but there's also the possibility that they just run in a different region. Lots of entities with poor architecture running on AWS survived this outage just fine, simply by not being in the region that had the problem.) -- Brett
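Brett's point that the system stayed degraded long after the network came back has a well-known mechanism behind it: if every degraded node retries or re-mirrors at once, the recovery traffic becomes its own overload. Below is a sketch of one standard mitigation, admission control with jittered backoff; it illustrates the idea only and is not Amazon's EBS logic:

    import random

    MAX_CONCURRENT = 10        # assumed global recovery budget

    def admit(node_id: int, in_flight: set[int]) -> float | None:
        # Admit a node into recovery, or tell it to back off.
        # Returns None if admitted, else a delay in seconds.
        if len(in_flight) < MAX_CONCURRENT:
            in_flight.add(node_id)
            return None
        # Budget exhausted: randomized delay so the retries do not
        # re-synchronize into another herd.
        return random.uniform(1.0, 30.0)

    in_flight: set[int] = set()
    for node in range(25):
        delay = admit(node, in_flight)
        if delay is not None:
            print(f"node {node}: retry in {delay:.1f}s")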
Re: Amazon diagnosis
Date: Sun, 01 May 2011 11:07:56 -0700 From: Mike mike-na...@tiedyenetworks.com To: nanog@nanog.org Subject: Re: Amazon diagnosis On 04/29/2011 12:35 PM, Joly MacFie wrote: http://aws.amazon.com/message/65648/ So, in a nutshell, Amazon had a single point of failure which touched off this entire incident. I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one. This was a classic case of _O'Brien's_Law_ in action, which states, rather pithily: Murphy... was an OPTIMIST!!
Re: Amazon diagnosis
On Sun, 01 May 2011 11:07:56 PDT, Mike said: I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one. For starters, you almost always screw up and have one NOC full of chuckle-headed banana eaters. And if you have two NOCs, that implies one entity deciding which one takes lead on a problem. ;)
RE: Amazon diagnosis
Subject: RE: Amazon diagnosis Date: Sun, 1 May 2011 12:50:37 -0700 From: George Bonser gbon...@seven.com They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network. Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network over to the secondary. This resulted in the secondary network reaching saturation and becoming unusable. There isn't anything that can be done to mitigate human error. You can TRY, but as history shows us, it all boils down to the human who implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing. ... This looks like it was a procedural error and not an architectural problem. A sage sayeth sooth: For any 'fool-proof' system, there exists a *sufficiently*determined* fool capable of breaking it. It would seem that the validity of that has just been re-confirmed. wry grin It is worthy of note that it is considerably harder to protect against accidental stupidity than it is to protect against intentional malice. ('malice' is _much_ more predictable, in general. wry grin)
Re: Amazon diagnosis
On Fri, Apr 29, 2011 at 2:35 PM, Joly MacFie j...@punkcast.com wrote: http://aws.amazon.com/message/65648/ http://storagemojo.com/2011/04/29/amazons-ebs-outage/ Stefan Mititelu
Multitenant FWs
Hi, What do you guys recommend for multitenant firewalls with support for 1,000+ users/contexts? I have looked at Centrinet's Accessmanager and the Barracuda NG Firewall. Any other players/products? Many thanks in advance for the input,
RE: Multitenant FWs
Palo Alto Networks builds some nice gear. From: David Oramas [david.ora...@aptel.com.au] Sent: Sunday, May 01, 2011 8:42 PM To: nanog@nanog.org Subject: Multitenant FWs Hi, What do you guys recommend for multitenant firewalls with support for 1,000+ users/contexts? I have looked at Centrinet's Accessmanager and the Barracuda NG Firewall. Any other players/products? Many thanks in advance for the input,
RE: Multitenant FWs
-Original Message- From: David Oramas [mailto:david.ora...@aptel.com.au] Sent: Sunday, May 01, 2011 9:42 PM To: nanog@nanog.org Subject: Multitenant FWs Hi, What do you guys recommend for multitenant firewalls with support for 1,000+ users/contexts? I have looked at Centrinet's Accessmanager and the Barracuda NG Firewall. Any other players/products? Many thanks in advance for the input, When I worked on building out Verizon's Network Based Firewall solution many years ago, I chose the Juniper NS-5400 platform due to its multitenancy capabilities and its ability to support literally thousands of virtual firewall contexts, and many times that number of users. This decision was made after an exhaustive analysis of competing solutions from Checkpoint, Cisco, and Juniper. Juniper's SRX line of products might be a good fit, but it currently doesn't have full Logical System support, which would certainly be a requirement for any multi-tenant offering. However, Logical System support is on the roadmap, so you might want to look into this depending on your timeframe for deployment. As the other list member pointed out, Palo Alto does make some really nice gear and I have been really impressed with their application-layer firewalling capability (application identification, web firewalling, etc.); however, I was suitably unimpressed with their multitenant capability and think you might be hard pressed to extend such an offering to more than one customer using such a device. Stefan Fouant
Re: Multitenant FWs
On Sun, May 1, 2011 at 11:05 PM, Stefan Fouant sfou...@shortestpathfirst.net wrote: -Original Message- From: David Oramas [mailto:david.ora...@aptel.com.au] Sent: Sunday, May 01, 2011 9:42 PM To: nanog@nanog.org Subject: Multitenant FWs Hi, What do you guys recommend for multitenant firewalls with support for 1,000+ users/contexts? I have looked at Centrinet's Accessmanager and the Barracuda NG Firewall. Any other players/products? Many thanks in advance for the input, One thing to keep in mind is that, as near as I can tell, no vendor (not a single one) has actual hard limits configurable for each tenant firewall instance. So, one tenant can use all of the 'firewall rule' resources, or all of the 'route memory'... leaving other instances flailing :( In my mind, unless you have very loose SLAs or are highly overprovisioned... until vendors address this basic problem, this model is a failure. When I worked on building out Verizon's Network Based Firewall solution many years ago, I chose the Juniper NS-5400 platform due to its multitenancy capabilities and its ability to support literally thousands of virtual firewall contexts, and many times that number of users. This decision was made after an yup.. too bad no actual customers showed up :( (well, not any in real numbers... though not due to the tech on the FW side, nor the engineering work) As the other list member pointed out, Palo Alto does make some really nice gear and I have been really impressed with their application-layer firewalling capability (application identification, web firewalling, etc.); however, I was suitably unimpressed with their multitenant capability and think you might be hard pressed to extend such an offering to more than one customer using such a device. No support for actual limits on resources, eh? :( Nothing on at least:
- memory dedicated to a tenant
- routing resources
- packet processing resources
- inspection rule resources
- bandwidth/throughput
- management operations
(I'm sure I left some off, but the above would be an excellent thing to see vendors support with hard limits THAT I CAN CONFIGURE!!) -chris
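A toy model of the configurable per-tenant hard limits Chris is asking for might look like the sketch below; it is purely illustrative and quotes no vendor's API:

    from dataclasses import dataclass, field

    @dataclass
    class TenantLimits:
        # Hypothetical per-tenant quotas matching the list above.
        max_rules: int       # inspection / firewall rule entries
        max_routes: int      # route memory
        max_sessions: int    # packet-processing state
        max_mbps: int        # bandwidth budget

    @dataclass
    class TenantContext:
        name: str
        limits: TenantLimits
        rules: list = field(default_factory=list)

        def add_rule(self, rule: str) -> None:
            # Hard limit: one tenant cannot starve its neighbors of
            # rule memory, however many rules it tries to install.
            if len(self.rules) >= self.limits.max_rules:
                raise ResourceWarning(f"{self.name}: rule quota exhausted")
            self.rules.append(rule)

    tenant = TenantContext("cust-a", TenantLimits(2, 1000, 50000, 100))
    tenant.add_rule("permit tcp any any eq 443")
    tenant.add_rule("deny ip any any")
    # A third rule trips the quota instead of eating shared memory:
    # tenant.add_rule("permit udp any any eq 53")  # -> ResourceWarning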
RE: Multitenant FWs
-Original Message- From: christopher.mor...@gmail.com [mailto:christopher.mor...@gmail.com] On Behalf Of Christopher Morrow One thing to keep in mind is that, as near as I can tell, no vendor (not a single one) has actual hard limits configurable for each tenant firewall instance. So, one tenant can use all of the 'firewall rule' resources, or all of the 'route memory'... leaving other instances flailing :( Ahem, actually ScreenOS does support just such a thing through the use of resource profiles - with these you can limit the amount of CPU, sessions, policies, MIPs and DIPs (used for NAT), and other user-defined objects such as address book entries, etc. that each VSYS can consume. This was one of the primary drivers behind our decision to utilize the NS-5400 for Verizon's NBFW (you remember that place, right Chris? heh) Stefan Fouant
Re: Multitenant FWs
On Mon, May 2, 2011 at 12:20 AM, Stefan Fouant sfou...@shortestpathfirst.net wrote: -Original Message- From: christopher.mor...@gmail.com [mailto:christopher.mor...@gmail.com] On Behalf Of Christopher Morrow One thing to keep in mind is that, as near as I can tell, no vendor (not a single one) has actual hard limits configurable for each tenant firewall instance. So, one tenant can use all of the 'firewall rule' resources, or all of the 'route memory'... leaving other instances flailing :( Ahem, actually ScreenOS does support just such a thing through the use of resource profiles - with these you can limit the amount of CPU, sessions, policies, MIPs and DIPs (used for NAT), and other user-defined objects such as address book entries, etc. that each VSYS can consume. This was one of the good to know... I wonder how well it isolates. primary drivers behind our decision to utilize the NS-5400 for Verizon's NBFW (you remember that place, right Chris? heh) i do, occasionally via the twitching :) Stefan Fouant
RE: Multitenant FWs
-Original Message- From: christopher.mor...@gmail.com [mailto:christopher.mor...@gmail.com] On Behalf Of Christopher Morrow Ahem, actually ScreenOS does support just such a thing through the use of resource profiles - with these you can limit the amount of CPU, sessions, policies, MIPs and DIPs (used for NAT), and other user-defined objects such as address book entries, etc. that each VSYS can consume. This was one of the good to know... I wonder how well it isolates. Ask the Vz marketing folks... oh, wait, 1 customer isn't really enough to demonstrate how well it isolates after all, I guess ;) primary drivers behind our decision to utilize the NS-5400 for Verizon's NBFW (you remember that place, right Chris? heh) i do, occasionally via the twitching :) Hehe... Stefan Fouant