Re: FYI Netflix is down
On Mon, Jul 9, 2012 at 10:20 AM, Dave Hart daveh...@gmail.com wrote:

We continue to investigate why these connections were timing out during connect, rather than quickly determining that there was no route to the unavailable hosts and failing quickly.

potential translation: We continue to shoot ourselves in the foot by filtering all ICMP without understanding the implications.

Sorry to mention my favorite hardware vendor again, but that is what I liked about using F5 BigIP as load-balancing devices... They did layer-7 URL checking to see if the service was really responding (instead of just pinging or opening a connection to the IP). We performed tests that would do a complete LDAP-over-SSL query to verify a directory server could actually look up a person. If it failed to answer within a certain time frame, it was taken out of rotation. I do not know if that was ever implemented in production, but we did verify it worked.

On the "software in the hardware can fail" point, my only defense is that you do redundant testing of the watcher devices, and have enough of them to vote misbehaving ones out of service. Oh, and it is best if the global load-balancing hardware/software is located somewhere other than the data centers being monitored.

-- steve pirk
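The layer-7 check steve describes (drive a real application transaction, time-box it, and pull the member from rotation on failure) can be sketched in a few lines. This is an illustrative sketch only, not F5 configuration: the hosts, ports, and request/expect bytes are hypothetical, and a real BigIP monitor would speak LDAP over SSL rather than the generic TCP probe shown here.

```python
import socket

def l7_health_check(host, port, request, expect, timeout=2.0):
    """Open a TCP connection, send an application-level request, and
    verify the response arrives within a deadline -- the layer-7 idea
    above, as opposed to merely pinging the IP or completing a handshake."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(request)
            reply = s.recv(4096)
            return expect in reply
    except OSError:
        # No answer (or no connection) within the deadline:
        # the member is taken out of rotation.
        return False

def healthy_members(pool, request, expect):
    """Keep only the pool members that pass the layer-7 probe."""
    return [(h, p) for (h, p) in pool if l7_health_check(h, p, request, expect)]
```

The point of checking `expect` in the reply, rather than merely completing the TCP handshake, is that a listener can keep accepting connections long after the application behind it has stopped answering queries.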
Re: FYI Netflix is down
Steve at pirk, I fail to grasp the concept in your argument. You do realise, do you not, that your $ black boxes from your favourite brand-name vendor have software running inside of them? Case in point: in the recent LINX issues it wasn't the hardware that gave them the headaches, but the software running on it sure did!

I am a big believer in using hardware to load balance data centers, and not leave it up to software in the data center which might fail.
Re: FYI Netflix is down
Hi,

Well, depending on your black box, your mileage will vary. Their wide use of ASICs eliminates a lot of the headaches of pure software implementations: buffering, timing, expected results, etc. Their real software only represents a small part of the device and is mostly relegated to management and some L4-to-L7 handling. So yes, ASIC/FPGA devices have software, but their results and behavior are predictable, and the system is more stable because of it.

PS: Yes, CAM lockout and bad RAM are still a pita for them.

In short: it is quite a thing to say that, because everything can be categorized as software, someone's point is invalid.

- Alain Hebert aheb...@pubnix.net PubNIX Inc. 50 boul. St-Charles P.O. Box 26770 Beaconsfield, Quebec H9W 6G7 Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443
Re: FYI Netflix is down
On Mon, 09 Jul 2012 08:07:14 -0400, Alain Hebert said:

Their wide use of ASICs eliminates a lot of the headache of pure software implementation.

And gets you, in return, the headaches of buggy hardware, where bug-fixing is just a bit harder than loading the new release. ;)
Re: FYI Netflix is down
On Sun, Jul 8, 2012 at 8:27 PM, steve pirk [egrep] st...@pirk.com wrote:

I am pretty sure Netflix and others were trying to do it right, as they all had graceful fail-over to a secondary AWS zone defined. It looks to me like Amazon uses DNS round-robin to load balance the zones, because they mention returning a list of addresses for DNS queries, and that explains the failure of the services to shunt over to other zones in their postmortem.

There are also bugs from the Netflix side uncovered by the AWS outage: Lessons Netflix Learned from the AWS Storm http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

For an infrastructure this large, no matter whether you are running your own datacenter or using the cloud, it is certain that the code is not bug-free. And another thing: if everything is too automated, then failure in one component can trigger bugs in areas that no one has ever thought of...

Rayson

== Open Grid Scheduler - The Official Open Source Grid Engine http://gridscheduler.sourceforge.net/
Re: FYI Netflix is down
On Mon, Jul 9, 2012 at 15:50 UTC, Rayson Ho wrote:

There are also bugs from the Netflix side uncovered by the AWS outage: Lessons Netflix Learned from the AWS Storm http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

We continue to investigate why these connections were timing out during connect, rather than quickly determining that there was no route to the unavailable hosts and failing quickly.

potential translation: We continue to shoot ourselves in the foot by filtering all ICMP without understanding the implications.

Cheers, Dave Hart
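Dave's ICMP point deserves spelling out. If intermediate devices filter ICMP destination-unreachable messages, a connecting host never hears "no route" and simply retransmits SYNs until the kernel's timer gives up, which can take minutes. The usual application-side defense is to bound the connect yourself. A hedged sketch (the endpoint is hypothetical; this shows the technique, not Netflix's code):

```python
import socket
import time

def connect_with_deadline(host, port, timeout=1.0):
    """Attempt a TCP connect, but bound the wait ourselves.
    When ICMP unreachables are filtered, the kernel never learns the
    destination is gone and the SYN just keeps retransmitting; an
    explicit timeout converts that silent hang into a fast failure.
    Returns (connected, seconds_elapsed)."""
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

With a half-second deadline, a dead host costs half a second instead of the kernel's multi-minute SYN retransmission budget.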
Re: FYI Netflix is down
On Tue, Jul 3, 2012 at 1:00 PM, Ryan Malayter malay...@gmail.com wrote:

Doing it the right way makes the cloud far less cost-effective and far less agile. Once you get it all set up just so, change becomes very difficult. All the monitoring and fail-over/fail-back operations are generally application-specific and provider-specific, so there's a lot of lock-in. Tools like RightScale are a step in the right direction, but don't really touch the application layer. You also have to worry about the availability of yet another provider!

I am pretty sure Netflix and others were trying to do it right, as they all had graceful fail-over to a secondary AWS zone defined. It looks to me like Amazon uses DNS round-robin to load balance the zones, because they mention returning a list of addresses for DNS queries, and that explains the failure of the services to shunt over to other zones in their postmortem.

Elastic Load Balancers (ELBs) allow web traffic directed at a single IP address to be spread across many EC2 instances. They are a tool for high availability, as traffic to a single end-point can be handled by many redundant servers. ELBs live in individual Availability Zones and front EC2 instances in those same zones or in other Availability Zones. ELBs can also be deployed in multiple Availability Zones. In this configuration, each Availability Zone's end-point will have a separate IP address. A single Domain Name will point to all of the end-points' IP addresses. When a client, such as a web browser, queries DNS with a Domain Name, it receives the IP address ("A") records of all of the ELBs in random order. While some clients only process a single IP address, many (such as newer versions of web browsers) will retry the subsequent IP addresses if they fail to connect to the first. A large number of non-browser clients only operate with a single IP address.

During the disruption this past Friday night, the control plane (which encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an ELB, and remove traffic from ELBs) began performing traffic shifts to account for the loss of load balancers in the affected Availability Zone. As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn't seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete. http://aws.amazon.com/message/67457/

*In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there's an outage.*

*That's what was supposed to happen at Netflix Friday night. But it didn't work out that way. According to Twitter messages from Netflix Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick Branson, it looks like an Amazon Elastic Load Balancing service, designed to spread Netflix's processing loads across data centers, failed during the outage. Without that ELB service working properly, the Netflix and Pinterest services hosted by Amazon crashed.* http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/

I am a big believer in using hardware to load balance data centers, and not leave it up to software in the data center which might fail.

Speaking of services like RightScale, Google announced Compute Engine at Google I/O this year. BuildFax was an early adopter, and they gave it great reviews... http://www.youtube.com/watch?v=LCjSJ778tGU It looks like Google has entered the VPS market. 'bout time... ;-] http://cloud.google.com/products/compute-engine.html

--steve pirk
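The postmortem's observation that many non-browser clients stop at the first A record suggests a straightforward client-side fix: iterate over every address DNS returns. A sketch of that retry loop (the hostname would be whatever DNS name fronts your load balancers; nothing here is AWS-specific):

```python
import socket

def connect_any(hostname, port, timeout=1.0):
    """Resolve all address records for a name and try each in turn --
    the behavior the AWS postmortem says newer browsers have and many
    non-browser clients lack. Returns a connected socket."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    last_err = None
    for family, stype, proto, _canon, addr in infos:
        try:
            s = socket.socket(family, stype, proto)
            s.settimeout(timeout)
            s.connect(addr)
            return s  # first end-point that answers wins
        except OSError as e:
            last_err = e  # dead end-point: fall through to the next record
    raise last_err or OSError("no addresses for " + hostname)
```

A client written this way rides out the loss of one zone's end-point automatically, because DNS already handed it the survivors.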
Re: FYI Netflix is down
On Jul 8, 2012, at 7:27 PM, steve pirk [egrep] st...@pirk.com wrote:

I am pretty sure Netflix and others were trying to do it right, as they all had graceful fail-over to a secondary AWS zone defined.

Having a single company as an infrastructure supplier is not trying to do it right from an engineering OR business perspective. It's lazy. No matter how many availability zones the vendor claims.
RE: FYI Netflix is down
-----Original Message-----

I imagine Netflix is mature enough to track this data as you suggest, and that's why they use AWS - downtime isn't a big deal for their business unless it gets really, really bad.

There is another possibility that is probably much more widespread amongst AWS (and other cloud) customers. Here is the scenario: You are a small, hungry startup. No capital for servers. Cloud seems great. Then, big growth hits! Cloud seems even better - you may have the capital now, thanks to friendly VC/public investment/private equity, but you don't have the time to catch up. So, keep using cloud. Then, the now mid-sized company discovers one day that their use of the cloud is no longer economical, if it ever was. They are big enough to use dedicated hardware in a colocation or wholesale datacenter solution, with blended transit from some upstreams. But the cost to transition out of cloud is big, too. So, they might go with a hybrid strategy, at least for a few years.

This happens all the time. Not saying Netflix is doing this, but lots of other folks are. It's a trap that's easy to fall into. Especially with rapid growth.

- Dan
Re: FYI Netflix is down
On Jul 6, 2012, at 1:50 PM, Dan Golding wrote:

This happens all the time. Not saying Netflix is doing this, but lots of other folks are. It's a trap that's easy to fall into. Especially with

Netflix did the reverse. They moved *to* Amazon, so they could do NoOps.
Re: FYI Netflix is down
Tell that to people in the third world without utilities.

On Jul 3, 2012 8:32 PM, Randy Bush ra...@psg.com wrote:

Also, I don't think there is an acceptable level of downtime for water.

coming soon to a planet near you

randy
Re: FYI Netflix is down
Tell that to people in the third world without utilities.

Also, I don't think there is an acceptable level of downtime for water.

coming soon to a planet near you

i work there regularly. the typical nanog kiddie does not.

randy
Re: FYI Netflix is down
On Jul 2, 2012, at 7:19 PM, Rodrick Brown rodrick.br...@gmail.com wrote:

People are acting as if Netflix is part of some critical service they stream movies for Christ sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base just like cable, electricity and running water I suffer a few hours of losses each year from those services it suck yes, is it the end of the world no..

Actually calculating - understanding - the cost of downtime, and what variations on that exist over time, are keys to reliability engineering. But if you plan to cover X failure scenarios and only cover X/2 failure scenarios due to implementation glitches, you goofed. The right answer may be relax and accept the downtime, and it may be spend $10 million to avoid most of these. If you haven't thought it through and quantified it, do so...

George William Herbert

Sent from my iPhone
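George's "quantify it" advice reduces to comparing the expected loss against the mitigation cost. A toy calculation, with entirely made-up numbers to illustrate the reasoning, not anyone's actual figures:

```python
def redundancy_worth_it(mitigation_cost, outage_prob, outage_cost):
    """Compare the cost of extra redundancy against the expected loss
    it would prevent. Worth buying only when the expected loss exceeds
    the price of avoiding it."""
    expected_loss = outage_prob * outage_cost
    return mitigation_cost < expected_loss

# Spending $10M to avoid a 5%-per-year chance of a $50M outage does not
# pencil out (expected loss $2.5M); against a $300M outage it does
# (expected loss $15M).
assert redundancy_worth_it(10e6, 0.05, 50e6) is False
assert redundancy_worth_it(10e6, 0.05, 300e6) is True
```

A real model would also weight reputational churn and the distribution of outage durations, but even this single-expected-value comparison beats not quantifying at all.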
RE: FYI Netflix is down
-----Original Message-----
From: James Downs [mailto:e...@egon.cc]

On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:

People are acting as if Netflix is part of some critical service they stream movies for Christ sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base just like cable, electricity and running water I suffer a few hours of losses each year from those services it suck yes, is it the end of the world no..

You missed the point. And very publicly missed the point, too. The Netflix issues led to a large discussion of downtime, testing, and fault tolerance that has been very useful for the community and could lead to some good content for NANOG conferences (/pokes PC).

For Netflix (and all other similar services) downtime is money and money is downtime. There is a quantifiable cost for customer acquisition and a quantifiable churn during each minute of downtime. Mature organizations actually calculate and track this. The trick is to ensure that you have balanced the cost of greater redundancy vs the cost of churn/customer acquisition. If you are spending too much on redundancy, it's as big a mistake as spending too little.

Also, I don't think there is an acceptable level of downtime for water.

Neither do water utilities.

- Dan
RE: FYI Netflix is down
James Downs wrote:

For Netflix (and all other similar services) downtime is money and money is downtime. There is a quantifiable cost for customer acquisition and a quantifiable churn during each minute of downtime. Mature organizations actually calculate and track this. The trick is to ensure that you have balanced the cost of greater redundancy vs the cost of churn/customer acquisition. If you are spending too much on redundancy, it's as big a mistake as spending too little.

Actually, for Netflix, so long as downtime is infrequent or short enough that users don't cancel, it actually saves them money. They're not paying royalties for movies being streamed during downtime, but they're still collecting their $8/month. There is no meaningful SLA for the end user to my knowledge.

I imagine the threshold for *any* user churn based on downtime is very high for Netflix. So long as they are about as good as cable/satellite TV in terms of uptime, Netflix will do fine. You would have to get into 98% uptime or lower before people would really start getting irritated enough to cancel. Of course, multiple short outages would be more painful than a few longer ones from a customer's perspective.

I imagine Netflix is mature enough to track this data as you suggest, and that's why they use AWS - downtime isn't a big deal for their business unless it gets really, really bad.
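Ryan's 98% figure is easy to put in concrete terms. Uptime percentages translate into allowed downtime with nothing more than a one-line formula (assuming a 365-day year):

```python
def downtime_allowed_hours(uptime_pct, period_hours=24 * 365):
    """Hours of downtime per year permitted by a given uptime percentage."""
    return period_hours * (1 - uptime_pct / 100.0)

# 98% uptime -- the "irritation threshold" above -- is a lot of dead air:
print(round(downtime_allowed_hours(98)))           # about 175 hours/year
print(round(downtime_allowed_hours(99.9), 1))      # about 8.8 hours/year
print(round(downtime_allowed_hours(99.99) * 60))   # about 53 minutes/year
```

So a service can be down more than a week per year and still clear the 98% bar, which is why that bar is so forgiving for a consumer streaming service.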
Re: FYI Netflix is down
On Jul 3, 2012, at 6:11 AM, Dan Golding wrote:

Also, I don't think there is an acceptable level of downtime for water.

Neither do water utilities.

I remember a certain conversation I had with a web developer. We were talking about zero-downtime releases. He thought it was acceptable if the website went down for 15 minutes, because people will just come back. Naturally, he was not as forgiving about the idea that his bank might think the same way, or that I might provide DB or server uptimes with that kind of reliability. Downtime will kill some companies, and not others. Twitter certainly survived their fail-whale period. But then, no one pays for Twitter.

-j
Re: FYI Netflix is down
On Jul 3, 2012, at 9:11 AM, Dan Golding dgold...@ragingwire.com wrote:

You missed the point. And very publicly missed the point, too. The Netflix issues led to a large discussion of downtime, testing, and fault tolerance that has been very useful for the community and could lead to some good content for NANOG conferences (/pokes PC).

I totally got the point, and the last bit of my post was just tongue in cheek. As I stated in my original response, it's very unrealistic to plan for every possible failure scenario given the constraints most businesses face when implementing BCP today. I doubt Amazon gave much thought to multiple site outages and clients not being able to dynamically redeploy their engines because of inaccessibility from ELB.
Re: FYI Netflix is down
On Jul 3, 2012, at 10:58 AM, Ryan Malayter malay...@gmail.com wrote:

I imagine Netflix is mature enough to track this data as you suggest, and that's why they use AWS - downtime isn't a big deal for their business unless it gets really, really bad.

My thoughts exactly!
Re: FYI Netflix is down
On Tue, 3 Jul 2012, Rodrick Brown wrote:

face when implementing BCP today. I doubt Amazon gave much thought to multiple site outages and clients not being able to dynamically redeploy their engines because of inaccessibility from ELB.

Considering there's a grand total of -one- tool in the entire AWS toolkit that supports working across multiple regions at all sanely (that would be ec2-migrate-bundle, btw), I'd agree. Amazon has put nearly zero thought into multiple site outages or how their customer base could leverage the multiple sites (regions) operated by AWS.

-- david raistrick dr...@icantclick.org http://www.netmeister.org/news/learn2quote.html
Re: FYI Netflix is down
On Mon, 2 Jul 2012, Greg D. Moore wrote:

As for pulling the plug to test stuff: I recall a demo at NetApp in the early 00's. They were talking about their fault tolerance and how great it was. So I walked up to their demo array and said, So, it shouldn't be a problem if I pulled this drive right here? Before I could, the salesperson or tech guy (can't remember which) told me to stop. He didn't want to risk it.

Lightweight. Your story reminded me of this Sun ZFS demo: http://www.youtube.com/watch?v=QGIwg6ye1gE

-- Jon Lewis, MCP :) | I route
Senior Network Engineer | therefore you are
Atlantic Net | http://www.lewis.org/~jlewis/pgp for PGP public key
Re: FYI Netflix is down
On Mon, 2 Jul 2012, david raistrick wrote:

On Mon, 2 Jul 2012, James Downs wrote:

back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic and no back-plane meant they were unable to reconfigure it to route around the problem.

Someone needs to define back-plane/control-plane in this case. (and what wasn't working)

Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by Amazon's internal toolkits that support ELB (and RDS). Those (HTTP-accessed) API interfaces were unavailable for a good portion of the outages.

It seems like if you're going to outsource your mission-critical infrastructure to the cloud, you should probably pick at least two unrelated cloud providers and, if at all possible, not outsource the systems that balance/direct traffic... and if you're really serious about it, have at least two of these set up at different facilities, such that if the primary goes offline, the secondary takes over. If a cloud provider fails, you redirect to another.

-- Jon Lewis, MCP :) | I route
Senior Network Engineer | therefore you are
Atlantic Net | http://www.lewis.org/~jlewis/pgp for PGP public key
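Jon's primary/secondary redirect logic is, at its core, a priority-ordered health walk. A minimal sketch of that decision (the provider names and the health-probe function are hypothetical placeholders):

```python
def pick_provider(providers, is_healthy):
    """Return the first provider in priority order that passes its
    health probe -- primary first, secondary only if the primary fails.
    `is_healthy` is whatever external probe you trust (and, per the
    thread, it should run from somewhere outside both providers)."""
    for name in providers:
        if is_healthy(name):
            return name
    return None  # total outage: nothing left to redirect to

# With the primary cloud down, traffic is redirected to the secondary:
status = {"cloud-A": False, "cloud-B": True}
assert pick_provider(["cloud-A", "cloud-B"], status.get) == "cloud-B"
```

The hard part in practice is not this loop but making the redirect mechanism itself (DNS updates, anycast, or a global load balancer) independent of the providers being monitored.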
Re: FYI Netflix is down
On 6/29/12 8:22 PM, Joe Blanchard wrote:

Seems that they are unreachable at the moment. Called and there's a recorded message stating they are aware of an issue, no details.

I didn't see anyone post this yet, so here's Amazon's summary of events: http://aws.amazon.com/message/67457/
Re: FYI Netflix is down
----- Original Message -----
From: Steven Bellovin s...@cs.columbia.edu
Subject: Re: FYI Netflix is down

On Jul 2, 2012, at 3:43 PM, Greg D. Moore wrote:

At 03:08 PM 7/2/2012, George Herbert wrote: If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow.

Strong second to that suggestion.

Quite unfortunately, that book appears not to be in Safari's library. Does anyone here know anyone at Safari?

Cheers, -- jra

-- Jay R. Ashworth Baylink j...@baylink.com Designer The Things I Think RFC 2100 Ashworth Associates http://baylink.pitas.com 2000 Land Rover DII St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
Re: FYI Netflix is down
On Jul 3, 2012, at 10:38 AM, Jay Ashworth j...@baylink.com wrote:

Quite unfortunately, that book appears not to be in Safari's library. Does anyone here know anyone at Safari?

Not the Safari division, but ORA, yes, others at my company do. Will forward the suggestion.

George William Herbert

Sent from my iPhone
Re: FYI Netflix is down
Jon Lewis wrote:

It seems like if you're going to outsource your mission critical infrastructure to cloud you should probably pick at least 2 unrelated cloud providers and if at all possible, not outsource the systems that balance/direct traffic...and if you're really serious about it, have at least two of these setup at different facilities such that if the primary goes offline, the secondary takes over. If a cloud provider fails, you redirect to another.

Really, you need at least three independent providers: one primary (A), one backup (B), and one witness to monitor the others for failure. The witness site can of course be low-powered, as it is not in the data plane of the applications, but just participates in the control plane. In the event of a loss of communication, the majority clique wins, and the isolated environments shut themselves down. This is of course how any sane clustering setup has protected against split-brain scenarios for decades.

Doing it the right way makes the cloud far less cost-effective and far less agile. Once you get it all set up just so, change becomes very difficult. All the monitoring and fail-over/fail-back operations are generally application-specific and provider-specific, so there's a lot of lock-in. Tools like RightScale are a step in the right direction, but don't really touch the application layer. You also have to worry about the availability of yet another provider!

-- RPM
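Ryan's majority-clique rule can be made precise: a site keeps serving only while it, plus the peers it can still reach, form a strict majority of the whole cluster. A sketch of that vote, with hypothetical site names (real clustering stacks add fencing, leases, and tie-breaking on top of this core test):

```python
def should_stay_up(reachable_peers, all_sites):
    """A site keeps serving only while it plus the peers it can reach
    form a strict majority of the cluster. Isolated minorities shut
    themselves down -- the witness pattern's defense against split brain."""
    visible = 1 + len(reachable_peers)  # the site itself plus visible peers
    return 2 * visible > len(all_sites)

sites = ["primary-A", "backup-B", "witness-C"]
# A and the witness can see each other; B is partitioned off on its own.
assert should_stay_up(["witness-C"], sites) is True   # A serves: 2 of 3
assert should_stay_up([], sites) is False             # B stops: 1 of 3
```

With only two sites, neither partition can ever hold a strict majority alone, which is exactly why the low-powered third witness earns its keep.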
Re: FYI Netflix is down
Also, I don't think there is an acceptable level of downtime for water.

coming soon to a planet near you

randy
RE: FYI Netflix is down
-----Original Message-----
From: Todd Underwood [mailto:toddun...@gmail.com]

scott, This was not a cascading failure. It was a simple power outage

Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably different, but it illustrates the complexity of a data center failure):

Utility power failed.
First backup generator failed (shut down due to a faulty fan).
Second backup generator failed (breaker coordination problem resulting in a faulty trip of a breaker).

In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case also clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it.

- Dan

Cascading failures involve interdependencies among components.

Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then fails itself as a result (in some form or other, but frequently different to the mode of the original failure).

indeed. and that is an interdependency among components. in particular, it is a capacity interdependency.

Whilst the Amazon outage might have been a simple power outage, it's likely that at least some of the website outages caused were a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than just the Amazon outage caused.

i think you over-estimate these websites. most of them simply have no redundancy (and obviously have no tested, effective redundancy) and were simply hoping that amazon didn't really go down that much. hope is not the best strategy, as it turns out. i suspect that randy is right though: many of these businesses do not promise perfect uptime and can survive these kinds of failures with little loss to business or reputation. twitter has branded its early failures with a whale that not only didn't hurt it but helped endear the service to millions. when your service fits these criteria, why would you bother doing the complicated systems and application engineering necessary to actually have functional redundancy? it simply isn't worth it.

t

Scott
Re: FYI Netflix is down
Actually, it was a very complex power outage. [...] Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it.

ok, i give in. at some level of granularity everything is a cascading failure (since molecules collide and the world is an infinite chain of causation in which human free will is merely a myth /Spinoza). of course, this use of 'cascading' is vacuous and not useful anymore since it applies to nearly every failure, but i'll go along with it.

from the perspective of a datacenter power engineer, this was a cascading failure of a small number of components. from the perspective of every datacenter customer: this was a power failure. from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility. from the perspective of nanog mailing list readers: this was an interesting opportunity to speculate about failures about which we have no data (as usual!).

can we all agree on those facts? :-)

t
Re: FYI Netflix is down
While I was working for a wireless telecom company, our primary datacenter was knocked off the power grid due to weather. The generators kicked on and everything was fine, until one generator was struck by lightning and that same strike fried the control panel on the second one. Since the second generator had no control panel, we had no means of monitoring it for temp, fuel, input voltage (when it came back), output voltage, or surge protection, or ultimately whether the generator spiked to full voltage due to a regulator failure. Needless to say, we had to shut the second generator down for safety reasons. While in the military I saw many generators struck by lightning as well. I'm not saying Amazon was not at fault here, but I can see where this is possible, and it happens more frequently than one might think. I hate to play devil's advocate here, but you as the customer should always have backups to your backups, and practice these fail-overs on a regular basis. Otherwise you are the fault here, no one else... -- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 11:01 AM, Dan Golding wrote: -Original Message- From: Todd Underwood [mailto:toddun...@gmail.com] scott, This was not a cascading failure. It was a simple power outage Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably different, but it illustrates the complexity of a data center failure) Utility Power Failed First Backup Generator Failed (shut down due to a faulty fan) Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker) In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case also clearly involved people. 
There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it. - Dan Cascading failures involve interdependencies among components. Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then fails itself as a result (in some form or other, but frequently different to the mode of the original failure). indeed. and that is an interdependency among components. in particular, it is a capacity interdependency. Whilst the Amazon outage might have been a simple power outage, it's likely that at least some of the website outages caused were a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than the Amazon outage alone. i think you over-estimate these websites. most of them simply have no redundancy (and obviously have no tested, effective redundancy) and were simply hoping that amazon didn't really go down that much. hope is not the best strategy, as it turns out. i suspect that randy is right though: many of these businesses do not promise perfect uptime and can survive these kinds of failures with little loss to business or reputation. twitter has branded its early failures with a whale that not only didn't hurt it but helped endear the service to millions. when your service fits these criteria, why would you bother doing the complicated systems and application engineering necessary to actually have functional redundancy? 
it simply isn't worth it. t Scott
Re: FYI Netflix is down
In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote: from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility. I want to emphasize _and test_. Work on an infrastructure which is redundant and designed to provide 100% uptime (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported. I used to work with a guy who had a simple test for these things, and if I were a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker. Then he would wait, to see how long before a technician came to fix it. If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming things acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage. I've seen too many companies whose tests are planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers. TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant. -- Leo Bicknell - bickn...@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/ pgpb37PwndunF.pgp Description: PGP signature
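The monthly walk-the-floor test Leo describes boils down to two timestamps: when something was broken, and when the team noticed. A minimal sketch of that grading logic (the candidate list, the SLO, and the function names here are invented for illustration, not from any system in this thread):

```python
import random

def pick_victim(candidates, rng=random):
    """The 'walk out on the floor and break something' step:
    choose one component to break, uniformly at random."""
    return rng.choice(candidates)

def grade_detection(broken_at, detected_at, slo_seconds=300):
    """Grade the team on time-to-detection: pass if the outage was
    noticed within the SLO, fail otherwise. Times are in seconds."""
    elapsed = detected_at - broken_at
    return ("pass" if elapsed <= slo_seconds else "fail", elapsed)
```

Customer-visible impact during the drill would be graded separately, since a clean failover should hide the break from customers entirely.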
Re: FYI Netflix is down
On Mon, 2 Jul 2012, Leo Bicknell wrote: I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the you mean like this? http://techblog.netflix.com/2011/07/netflix-simian-army.html -- david raistrick http://www.netmeister.org/news/learn2quote.html dr...@icantclick.org
Re: FYI Netflix is down
In a message written on Mon, Jul 02, 2012 at 12:13:22PM -0400, david raistrick wrote: you mean like this? http://techblog.netflix.com/2011/07/netflix-simian-army.html Yes, Netflix seems to get it, and I think their Simian Army is a great QA tool. However, it is not a complete testing system; I have never seen them talk about testing non-software components, and I hope they do that as well. As we saw in the previous Amazon outage, part of the problem was a circuit breaker configuration. -- Leo Bicknell - bickn...@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/ pgpznsW9y4T3n.pgp Description: PGP signature
Re: FYI Netflix is down
On Mon, 2 Jul 2012, Leo Bicknell wrote: http://techblog.netflix.com/2011/07/netflix-simian-army.html Yes, Netflix seems to get it, and I think their Simian Army is a great QA tool. However, it is not a complete testing system, I have never seen them talk about testing non-software components, and I hope they do that as well. As we saw in the previous Amazon outage, part of the problem was a circuit breaker configuration. When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to). I suppose they could introduce artificial network latency/loss @ each instance - and could add testing around what happens when amazon's API disappears (as was the case friday). Beyond that, the rest of it is up to the hardware provider (Amazon, in this case). ..david (who also relies on outsourced hardware these days) -- david raistrick http://www.netmeister.org/news/learn2quote.html dr...@icantclick.org
Re: FYI Netflix is down
This is an excellent example of how tests should be run; unfortunately far too many places don't do this... -- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 12:09 PM, Leo Bicknell wrote: In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote: from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility. I want to emphasize _and test_. Work on an infrastructure which is redundant and designed to provide 100% uptime (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported. I used to work with a guy who had a simple test for these things, and if I were a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker. Then he would wait, to see how long before a technician came to fix it. If these activities were service impacting to customers the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults the team was graded on how quickly they detected and corrected the outage. I've seen too many companies whose tests are planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers. TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.
Re: FYI Netflix is down
The problem is that large-scale tests take a lot of time and planning. For it to be done right, you really need a dedicated DR team. -Grant On Mon, Jul 2, 2012 at 11:31 AM, AP NANOG na...@armoredpackets.com wrote: This is an excellent example of how tests should be run; unfortunately far too many places don't do this... -- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 12:09 PM, Leo Bicknell wrote: In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote: from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility. I want to emphasize _and test_. Work on an infrastructure which is redundant and designed to provide 100% uptime (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported. I used to work with a guy who had a simple test for these things, and if I were a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker. Then he would wait, to see how long before a technician came to fix it. If these activities were service impacting to customers the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults the team was graded on how quickly they detected and corrected the outage. I've seen too many companies whose tests are planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers. TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.
Re: FYI Netflix is down
In a message written on Mon, Jul 02, 2012 at 12:23:57PM -0400, david raistrick wrote: When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to). Find a provider with a similar methodology. Perhaps Netflix never conducts a power test, but their colo vendor would perform such testing. If no colo providers exist that share their values on testing, that may be a sign that outsourcing it isn't the right answer... -- Leo Bicknell - bickn...@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/ pgpzGpzGDjIwI.pgp Description: PGP signature
Re: FYI Netflix is down
On Jul 2, 2012 10:53 AM, Leo Bicknell bickn...@ufp.org wrote: In a message written on Mon, Jul 02, 2012 at 12:23:57PM -0400, david raistrick wrote: When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to). Find a provider with a similar methodology. Perhaps Netflix never conducts a power test, but their colo vendor would perform such testing. If no colo providers exist that share their values on testing, that may be a sign that outsourcing it isn't the right answer... -- Leo Bicknell - bickn...@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/ I suggest using RAIC: Redundant Array of Inexpensive Clouds. Make your chaos animal go after sites and regions instead of individual VMs. CB
Re: FYI Netflix is down
On Jul 2, 2012, at 9:23 AM, david raistrick wrote: When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to). We all know what netflix *says* they do, but they *did* have an outage. -j
Re: FYI Netflix is down
On 2 July 2012 19:20, Cameron Byrne cb.li...@gmail.com wrote: Make your chaos animal go after sites and regions instead of individual VMs. CB From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html Create More Failures Currently, Netflix uses a service called Chaos Monkey (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html) to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't, however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service Chaos Gorilla. It would seem the Gorilla hasn't quite matured. Tony
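As a toy illustration of the idea the blog post describes - a service whose only job is to kill other services at random - here is a minimal sketch. The instance names, states, and kill probability are invented for the example; Netflix's actual implementation is not described in this thread beyond the quote above.

```python
import random

def chaos_monkey_pass(instances, kill_probability=0.5, rng=random):
    """One pass of a toy Chaos Monkey: every running instance has an
    independent chance of being terminated. Teams whose services survive
    this constant low-level failure need no manual intervention to recover."""
    killed = []
    for name, state in instances.items():
        if state == "running" and rng.random() < kill_probability:
            instances[name] = "terminated"
            killed.append(name)
    return killed
```

A "Chaos Gorilla" differs only in granularity: instead of iterating over instances, it would mark an entire availability zone's worth of them terminated at once.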
Re: FYI Netflix is down
On 07/02/2012 08:53 AM, Tony McCrory wrote: On 2 July 2012 19:20, Cameron Byrne cb.li...@gmail.com wrote: Make your chaos animal go after sites and regions instead of individual VMs. CB From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html Create More Failures Currently, Netflix uses a service called Chaos Monkey (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html) to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't, however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service Chaos Gorilla. It would seem the Gorilla hasn't quite matured. Tony From conversations with Adrian Cockcroft this weekend it wasn't the result of Chaos Gorilla or Chaos Monkey failing to prepare them adequately. All their automated stuff worked perfectly; the infrastructure tried to self-heal. The problem was that yet again Amazon's back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic, and no back-plane meant they were unable to reconfigure it to route around the problem. Paul
RE: FYI Netflix is down
-Original Message- From: Leo Bicknell [mailto:bickn...@ufp.org] I want to emphasize _and test_. [snip] I used to work with a guy who had a simple test for these things, and if I were a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker. *DING DING* - we have a winner! In a previous life, I used to spend a lot of time in other people's data centers. The key question to ask was how often they pulled the plug - i.e. disconnected utility power without having backup generators running. Simulating an actual failure. That goes for pulling out an Ethernet cord, unplugging a server, or flipping a breaker. It's all the same. The problem is that if you don't do this for a while, you get SCARED of doing it, and you stop doing it. The longer you go without, the scarier it gets, to the point where you will never do it, because you have no idea what will happen, other than that you'll probably get fired. This is called horrible engineering management, and is very common. The other problem, of course, is that people design under the assumption that everything will always work, and that failure modes, when they occur, are predictable and fall into a narrow set. Multiple failure modes? Not tested. Failure modes including operator error? Never tested. When was the last time you had a drill? - Dan Then he would wait, to see how long before a technician came to fix it. If these activities were service impacting to customers the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults the team was graded on how quickly they detected and corrected the outage. I've seen too many companies whose tests are planned months in advance, and who exclude the parts they think aren't up to scratch from the test. 
Then an event occurs, and they fail, and take down customers. TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant. -- Leo Bicknell - bickn...@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
Re: FYI Netflix is down
I believe in my dictionary Chaos Gorilla translates into Time To Go Home, with a rough definition of Everything just crapped out - The world is ending; but then again I may have that incorrect :-) -- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 2:59 PM, Paul Graydon wrote: On 07/02/2012 08:53 AM, Tony McCrory wrote: On 2 July 2012 19:20, Cameron Byrne cb.li...@gmail.com wrote: Make your chaos animal go after sites and regions instead of individual VMs. CB From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html Create More Failures Currently, Netflix uses a service called Chaos Monkey (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html) to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't, however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service Chaos Gorilla. It would seem the Gorilla hasn't quite matured. Tony From conversations with Adrian Cockcroft this weekend it wasn't the result of Chaos Gorilla or Chaos Monkey failing to prepare them adequately. All their automated stuff worked perfectly; the infrastructure tried to self-heal. The problem was that yet again Amazon's back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic, and no back-plane meant they were unable to reconfigure it to route around the problem. Paul
Re: FYI Netflix is down
Good band name. Chaos Gorilla -- --- Joly MacFie 218 565 9365 Skype:punkcast WWWhatsup NYC - http://wwwhatsup.com http://pinstand.com - http://punkcast.com VP (Admin) - ISOC-NY - http://isoc-ny.org -- -
Re: FYI Netflix is down
At 03:08 PM 7/2/2012, George Herbert wrote: If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. The "it can't happen" is almost guaranteed to happen. ;-) And when it does, it'll often interact in ways we can't predict or sometimes even understand. As for pulling the plug to test stuff: I recall a demo at NetApp in the early 00's. They were talking about their fault tolerance and how great it was. So I walked up to their demo array and said, "So, it shouldn't be a problem if I pulled this drive right here?" Before I could, the salesperson or tech guy (I can't remember which) told me to stop. He didn't want to risk it. That right there said loads about their confidence in their own system. Late reply, but: On Sat, Jun 30, 2012 at 12:30 AM, Lynda shr...@deaddrop.org wrote: ... Second, and more important. I *was* a computer science guy in a past life, and this is nonsense. You can have astonishingly large software projects that just continue to run smoothly, day in, day out, and they don't hit the news, because they don't break. There are data centers that don't hit the news, in precisely the same way. I really need to write the book on IT reliability I keep meaning to. There's reliability - backwards-looking, statistical, which can be 100% for a given service or datacenter - and then there's dependability, forwards-predicted outage risk, which people often *assert* equals the prior reliability record, but in reality you often have a number of latent failures (and latent cascade paths) that you do not understand, did not identify previously, and are not aware of. I've had or had to respond to over a billion dollars of cumulative IT disaster loss over my consulting career so far; I have NEVER seen anyone who did it perfectly, even the best pros. And I include myself in that list. Looking at other fields like aerospace and nuclear engineering, what is done in IT is not anywhere close to the same level of QA and engineering analysis and testing. 
We cannot assert better results with less work. "Oh, that never happens" - except I've had my stuff in three locations that had catastrophic generator failures. "Oh, that never happens" - except when you're doing power maintenance and the best-rated electrical company in California, in conjunction with the generator vendor and a couple of independent power EEs, mis-balanced the maintenance generator loads between legs and blew the generators and datacenter. "Oh, that never happens" - that the datacenter burns (or starts to burn and then gets flooded). "Oh, that never happens" - that the FM-200 goes off or the preaction breaks and water leaks. "Oh, that never happens" - that well-maintained, monitored, and triple-redundant AC units all trip offline due to a common mode failure over the course of a weekend and the room gets up to 106 degrees. Oh, thank god the next thing didn't go wrong in THAT situation, because the spot temperature meters indicated that the temperature at ceiling height of that particular room peaked 1 degree short of the temp at which the sprinkler heads are supposed to discharge, so we nearly lost that room to flooding rather than just a 10% disk and 15% power supply attrition over the next year... Don't be so confident in the infrastructure. It's not engineered or built or maintained well enough to actually support that assertion. The same can be said of the application software and application architecture and integration. -- -george william herbert george.herb...@gmail.com Greg D. Moore http://greenmountainsoftware.wordpress.com/ CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net
Re: FYI Netflix is down
On Mon, 2 Jul 2012, James Downs wrote: back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic and no back-plane meant they were unable to reconfigure it to route around the problem. Someone needs to define back-plane/control-plane in this case. (and what wasn't working) Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages. I know nothing of the netflix side of it - but that's what -we- saw. (and that caused all us-east RDS instances in every AZ to appear offline..) -- david raistrick http://www.netmeister.org/news/learn2quote.html dr...@icantclick.org
Re: FYI Netflix is down
On Mon, Jul 02, 2012 at 09:09:09AM -0700, Leo Bicknell wrote: In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote: from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility. I want to emphasize _and test_. Work on an infrastructure which is redundant and designed to provide 100% uptime (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported. I used to work with a guy who had a simple test for these things, and if I were a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker. Sounds like something a VP would do. And, actually, it's an important step: make sure the easy failures are covered. But it's really a very small part of resilience. What happens when one instance of a shared service starts performing slowly? What happens when one instance of a redundant database starts timing out queries or returning empty result sets? What happens when the Ethernet interface starts dropping 10% of the packets across it? What happens when the Ethernet switch linecard locks up and stops passing dataplane traffic, but link (physical layer) and/or control plane traffic flows just fine? What happens when the server kernel panics due to bad memory, reboots, gets all the way up, runs for 30 seconds, kernel panics, lather, rinse, repeat? Reliability is hard. And if you stop looking once you get to the point where you can safely toggle the power switch without causing an impact, you're only a very small part of the way there. -- Brett
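One way to catch a few of the partial failures Brett lists - a service that answers slowly, or answers with nothing - is to make the health check measure latency and response content, not just whether the TCP connect succeeds. A minimal sketch (the probe protocol, thresholds, and addresses are invented for illustration):

```python
import socket
import time

def healthy(host, port, probe=b"PING\n", timeout=2.0, max_latency=0.5):
    """Treat 'slow' and 'empty' the same as 'down': a check that only
    tests whether connect() succeeds misses most of the failure modes
    listed above."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(probe)
            reply = s.recv(64)
    except OSError:
        return False  # refused, unreachable, or timed out
    elapsed = time.monotonic() - start
    # An empty reply, or a reply that arrives too late, both count as failure.
    return bool(reply) and elapsed <= max_latency
```

This is the same idea behind the application-level (layer 7) checks mentioned earlier in the thread: verify the service actually does its job within a deadline, and pull it from rotation when it doesn't.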
RE: FYI Netflix is down
-Original Message- From: Greg D. Moore [mailto:moor...@greenms.com] If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. Also, Human Error by James Reason.
Re: FYI Netflix is down
On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore moor...@greenms.com wrote: At 03:08 PM 7/2/2012, George Herbert wrote: If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. The "it can't happen" is almost guaranteed to happen. ;-) And when it does, it'll often interact in ways we can't predict or sometimes even understand. Seconded. There are also aerospace and nuclear and failure analysis books which are good, but I often encourage people to start with that one. As for pulling the plug to test stuff: I recall a demo at NetApp in the early 00's. They were talking about their fault tolerance and how great it was. So I walked up to their demo array and said, "So, it shouldn't be a problem if I pulled this drive right here?" Before I could, the salesperson or tech guy (I can't remember which) told me to stop. He didn't want to risk it. That right there said loads about their confidence in their own system. I worked for a Sun clone vendor (Axil) for a while and took some of our systems and storage to Comdex one year in the 90s. We had a RAID unit (Mylex controller) we had just introduced. Beforehand, I made REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power tricks worked. And showed them to people with the "Please keep in mind that this voids the warranty, but here we" *rip* "go..." All of the other server vendors were giving me dirty looks for that one. Apparently I sold a few systems that way. You have to watch for connector wear-out and things like that, but ... All the clusters I've built, I've insisted on a burn-in time plug-pull test on all the major components. We caught things with those from time to time. Especially with N+1, if it is really N+0 due to a bug or flaw you need to know that... -- -george william herbert george.herb...@gmail.com
Re: FYI Netflix is down
At 05:04 PM 7/2/2012, George Herbert wrote: On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore moor...@greenms.com wrote: At 03:08 PM 7/2/2012, George Herbert wrote: If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. The "it can't happen" is almost guaranteed to happen. ;-) And when it does, it'll often interact in ways we can't predict or sometimes even understand. Seconded. I figured you had probably read it. :-) There are also aerospace and nuclear and failure analysis books which are good, but I often encourage people to start with that one. As for pulling the plug to test stuff: I recall a demo at NetApp in the early 00's. They were talking about their fault tolerance and how great it was. So I walked up to their demo array and said, "So, it shouldn't be a problem if I pulled this drive right here?" Before I could, the salesperson or tech guy (I can't remember which) told me to stop. He didn't want to risk it. That right there said loads about their confidence in their own system. I worked for a Sun clone vendor (Axil) for a while and took some of our systems and storage to Comdex one year in the 90s. We had a RAID unit (Mylex controller) we had just introduced. Beforehand, I made REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power tricks worked. And showed them to people with the "Please keep in mind that this voids the warranty, but here we" *rip* "go..." All of the other server vendors were giving me dirty looks for that one. Apparently I sold a few systems that way. I can imagine. Back when we were testing a cluster from MicronPC, the techs were in our office and they encouraged us to do that. It was re-assuring. You have to watch for connector wear-out and things like that, but ... All the clusters I've built, I've insisted on a burn-in time plug-pull test on all the major components. We caught things with those from time to time. Especially with N+1, if it is really N+0 due to a bug or flaw you need to know that... 
About 7 years back, we were about to move a production platform to a cluster+SAN that an outside vendor had installed. I was brought in at the last minute to lead the project. Before we did the move, I said, "Umm, has anyone tried a remote reboot of the servers?" "Oh, they rebooted fine when we were at the datacenter with the vendor. We're good." I repeated my question and finally did the old, "Ok, I know I'm being a pain, but please, let's just try it once, remotely, before we're committed." So we rebooted, and waited, and waited, and waited. It took a trip out to the datacenter (we couldn't afford good remote KVM tools back then) to see that the server was trying to mount stuff off of something on the network. At first we couldn't figure out what it was. Finally we realized it was looking for files on the vendor's laptop. So of course it had worked fine when the vendor was at the datacenter. Despite all that, the vendor still denied it being their problem. Anyway, enough reminiscing. Things happen. We can only do so much to prevent them, and never assume. -- -george william herbert george.herb...@gmail.com Greg D. Moore http://greenmountainsoftware.wordpress.com/ CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net
Re: FYI Netflix is down
On Jul 2, 2012, at 3:43 PM, Greg D. Moore wrote: At 03:08 PM 7/2/2012, George Herbert wrote: If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. Strong second to that suggestion. --Steve Bellovin, https://www.cs.columbia.edu/~smb
Re: FYI Netflix is down
On Jul 2, 2012, at 1:20 PM, david raistrick wrote: Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages. Right, and other toolkits like boto. Each AZ has a different endpoint (url), and as I have no resources running in East, I saw no problems with the API endpoints I use. So, as you note, US-EAST Region was not controllable. I know nothing of the netflix side of it - but that's what -we- saw. (and that caused all us-east RDS instances in every AZ to appear And, if you lose US-EAST, you need to run *somewhere*. Netflix did not cutover www.netflix.com to another Region. Why not is another question. -j
Re: FYI Netflix is down
On Jul 2, 2012, at 7:03 PM, James Downs e...@egon.cc wrote: On Jul 2, 2012, at 1:20 PM, david raistrick wrote: Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages. Right, and other toolkits like boto. Each AZ has a different endpoint (url), and as I have no resources running in East, I saw no problems with the API endpoints I use. So, as you note, US-EAST Region was not controllable. I know nothing of the netflix side of it - but that's what -we- saw. (and that caused all us-east RDS instances in every AZ to appear And, if you lose US-EAST, you need to run *somewhere*. Netflix did not cutover www.netflix.com to another Region. Why not is another question. At which point are you guys going to realize that no matter how much resiliency, redundancy and fault tolerance you plan into an infrastructure, there is always the unforeseen event that just doesn't make sense to plan for? Four major decision factors are cost, complexity, time and failure rate. At some point a business needs to focus on its core business. IT, like any other business resource, has to be managed efficiently, and its sole purpose is the enablement of said business, nothing more. Some of the posts here are highly laughable and unrealistic. People are acting as if Netflix is part of some critical service; they stream movies, for Christ's sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base. Just like cable, electricity and running water, I suffer a few hours of losses each year from those services. It sucks, yes; is it the end of the world, no. This horse is dead!
Re: FYI Netflix is down
On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote: People are acting as if Netflix is part of some critical service they stream movies for Christ sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base just like cable, electricity and running water I suffer a few hours of losses each year from those services it suck yes, is it the end of the world no.. You missed the point.
Re: FYI Netflix is down
George Herbert george.herb...@gmail.com said: I worked for a Sun clone vendor (Axil) for a while and took some of our systems and storage to Comdex one year in the 90s. We had a RAID unit (Mylex controller) we had just introduced. Beforehand, I made REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power tricks worked. And showed them to people with the Please keep in mind that this voids the warranty, but here we *rip* go All of the other server vendors were giving me dirty looks for that one. Apparently I sold a few systems that way. :) Nice. Thanks. Many years ago, I worked for one of DEC's research groups. We built a network using FDDI 4B/5B link technology based on AMD TAXI chips. (They were state of the art back then.) The switches were 3U(?) boxes with 12 ports. It took a rack of 6 or 8 of them in the phone closet to cover a floor. Workstations had 2 cables plugged into different switches. In theory, we covered any single point of failure. My office was near the phone closet. I got to watch my boss give demos to visiting VIPs. He was pretty good at it. In the middle of explaining things, he would grab a power cord and yank it. Blinka-blinka-blinka, and the remaining switches would reconfigure and go back to work. (It took under a second.) It was interesting to watch the VIPs. Most of them got it: the network really could recover quickly. The interesting ones had a telco background. They were really surprised. The concept of disrupting live traffic for something as insignificant as a demo was off scale in their culture. It was just a research lab. We were used to eating our own dog food. -- Greg D. Moore moor...@greenms.com said: If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. +1 The "it can't happen" is almost guaranteed to happen. ;-) And when it does, it'll often interact in ways we can't predict or sometimes even understand. My memory of that sort of event is roughly... 
(see above for context) The hardware broke and turned a vanilla packet into a super-long packet. My FPGA code was supposed to catch that case and do something sane. It was never tested and didn't work. It poured crap all over memory. Needless to say, things went downhill from there. Easy to spot in hindsight. None of us thought that was an interesting case while we were testing. -- These are my opinions. I hate spam.
Re: [FoRK] FYI Netflix is down
On Sat, Jun 30, 2012 at 03:15:07AM -0400, Andrew D Kirch wrote: On 6/30/2012 3:11 AM, Tyler Haske wrote: How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ Amazon has many datacenters and tries to make it easy to diversify. Based on? Clouds are nothing more than outsourced responsibility. My business has stopped while my IT department explains to me that it's not their fault because Amazon's down snip It *is* their fault. You can blame faulty manufacturing for having a HDD die, but it's IT's fault if it takes out the only copy of your database. AWS 101: Amazon has clearly-marked Availability Zones for a reason. Oh, and business 101: have an exit strategy for every vendor. This outage is mighty interesting. It's surprising how many big operations had poor availability strategies. Also, I've been working on an exit strategy for one of my VM/colo providers, and AWS + colo in NoVa is one of my options. The cloud may be a technological wonder, but as far as business practices go, it's a DISASTER. I wouldn't say so. Like any disruptive service, you're getting an acceptably lower-quality product for significantly less money. And like most disruptors, it operates by different rules than the old stuff. Regards, Aaron ___ FoRK mailing list http://xent.com/mailman/listinfo/fork
It's the end of the world, as we know it (Was: FYI Netflix is down)
- Original Message - From: jamie rishaw j...@arpa.com you know what's happening even more? ..Amazon not learning their lesson. Please stop these crappy practices, people. Do real world DR testing. Play "What If This City Dropped Off The Map" games, because tonight, parts of VA in fact did. You know what I want everyone to do? Go read this. Right now; it's Sunday, and I'll wait: http://interdictor.livejournal.com/2005/08/27/ Start there, and click Next Date a lot, until you get to the end. Entire metropolitan areas can, and do, fall completely off the map. If your audience is larger than that area, then you need to prepare for it. And being reminded of how big it can get is occasionally necessary. The 4ESS in the third subbasement of 1WTC that was a toll switch for most of the northeast reportedly stayed on the air, talking to its SS7 neighbors, until something like 1500 EDT, 11 Sep 2001. It can get *really* big. Are you ready? Cheers, -- jra -- Jay R. Ashworth Baylink j...@baylink.com Designer The Things I Think RFC 2100 Ashworth Associates http://baylink.pitas.com 2000 Land Rover DII St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
Re: FYI Netflix is down
- Original Message - From: Tyler Haske tyler.ha...@gmail.com How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ Not entirely. Datacenters do go down, our best efforts to the contrary notwithstanding. Amazon doesn't guarantee you redundancy on EC2, only the tools to provide it yourself. 25% Amazon; 75% service provider clients; that's my appraisal of the blame. Cheers, -- jra -- Jay R. Ashworth Baylink j...@baylink.com Designer The Things I Think RFC 2100 Ashworth Associates http://baylink.pitas.com 2000 Land Rover DII St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
Re: FYI Netflix is down
On Sun, Jul 1, 2012 at 11:38 AM, Jay Ashworth j...@baylink.com wrote: Not entirely. Datacenters do go down, our best efforts to the contrary notwithstanding. Amazon doesn't guarantee you redundancy on EC2, only the tools to provide it yourself. 25% Amazon; 75% service provider clients; that's my appraisal of the blame. From a Wired article: That’s what was supposed to happen at Netflix Friday night. But it didn’t work out that way. According to Twitter messages from Netflix Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick Branson, it looks like an Amazon Elastic Load Balancing service, designed to spread Netflix’s processing loads across data centers, failed during the outage. Without that ELB service working properly, the Netflix and Pinterest services hosted by Amazon crashed. http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/ The GSLB fail-over that was supposed to take place for the affected services (that had configured their applications to fail over) failed. I heard about this the day after Google announced the Compute Engine addition to the App Engine product lines they have. The demo was awesome. I imagine Google has GSLB down pat by now, so some companies might start looking... ;-] --steve
Re: FYI Netflix is down
On 6/29/2012 10:38 PM, jamie rishaw wrote: you know what's happening even more? ..Amazon not learning their lesson. they just had an outage quite similar.. they performed a full audit on electrical systems worldwide, according to the rfo/post mortem. looks like they need to perform a full and we mean it audit, and like I've been doing/participating in at dot coms for a decade plus: Actually Do Regular Load tests.. Related/equally to blame: companies that rely heavily on one aws zone, or arguably one cloud (period), are asking for it. Please stop these crappy practices, people. Do real world DR testing. Play What If This City Dropped Off The Map games, because tonight, parts of VA in fact did. ... I am not a computer science guy, but I've been around a long time. Data centers and clouds are like software. Once they reach a certain size, it's impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the Northeast power grid, and most other technology disasters.
Re: FYI Netflix is down
well one would think that they could at least get power redundancy right... On Sat, Jun 30, 2012 at 1:07 AM, Roy r.engehau...@gmail.com wrote: On 6/29/2012 10:38 PM, jamie rishaw wrote: you know what's happening even more? ..Amazon not learning their lesson. they just had an outage quite similar.. they performed a full audit on electrical systems worldwide, according to the rfo/post mortem. looks like they need to perform a full and we mean it audit, and like I've been doing/participating in at dot coms for a decade plus: Actually Do Regular Load tests.. Related/equally to blame: companies that rely heavily on one aws zone, or arguably one cloud (period), are asking for it. Please stop these crappy practices, people. Do real world DR testing. Play What If This City Dropped Off The Map games, because tonight, parts of VA in fact did. ... I am not a computer science guy, but I've been around a long time. Data centers and clouds are like software. Once they reach a certain size, it's impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the Northeast power grid, and most other technology disasters.
Re: FYI Netflix is down
I am not a computer science guy, but I've been around a long time. Data centers and clouds are like software. Once they reach a certain size, it's impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the Northeast power grid, and most other technology disasters. How to run a datacenter 101. Have more than one location, preferably far apart. It being Amazon I would expect more. :/
Re: FYI Netflix is down
On 6/30/2012 3:11 AM, Tyler Haske wrote: How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ Based on? Clouds are nothing more than outsourced responsibility. My business has stopped while my IT department explains to me that it's not their fault because Amazon's down, and I can't exactly fire Amazon. The cloud may be a technological wonder, but as far as business practices go, it's a DISASTER. Andrew
Re: FYI Netflix is down
On 6/30/12 12:11 AM, Tyler Haske wrote: I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters. How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ There are 7 regions in EC2: three in North America, two in Asia, one in Europe, and one in South America. US East Coast, the region currently being impacted, is further subdivided into 5 availability zones. us-east-1d appears to be the only one currently being impacted. Distributing your application is left as an exercise to the reader.
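Joel's "distributing your application" exercise can be sketched in a few lines of plain Python. This is an illustrative toy, not an AWS API call: the zone names and the round-robin policy are assumptions for the example.

```python
# Hypothetical sketch: spread N instances round-robin across availability
# zones, so losing any single AZ takes out at most ceil(N / zones) of them.
def spread_instances(num_instances, zones):
    placement = {zone: 0 for zone in zones}
    for i in range(num_instances):
        placement[zones[i % len(zones)]] += 1
    return placement

zones = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1e"]
print(spread_instances(12, zones))
# Losing us-east-1d here costs 2 of 12 instances, not the whole fleet.
```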
Re: FYI Netflix is down
On 6/30/2012 12:11 AM, Tyler Haske wrote: On 6/29/2012 11:07 PM, Roy wrote: I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters. How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ First off. They HAVE more than one location, and they are indeed far apart. That said, it's all mixed together, like some kind of goulash, and the companies who've gone with this particular model for their sites are paying for that fact. Second, and more important. I *was* a computer science guy in a past life, and this is nonsense. You can have astonishingly large software projects that just continue to run smoothly, day in, day out, and they don't hit the news, because they don't break. There are data centers that don't hit the news, in precisely the same way. If I had a business, right now, I would not have chosen Amazon's cloud (or anyone's for that matter). I would also not be using Google docs/services, for precisely the same reason. I'm a fan of controlling risk, where possible, and I'd say that this is all in the wrong direction for doing that. No worries, though. It seems we are doomed to continue making the same mistakes, over and over. -- Politicians are like a Slinky. They're really not good for anything, but they still bring a smile to your face when you push them down a flight of stairs.
Re: FYI Netflix is down
On Sat, 30 Jun 2012, jamie rishaw wrote: you know what's happening even more? ..Amazon not learning their lesson. I was not giving anyone a free pass or attempting to shrug off the outage. I was just stating that there are many reasons why things break. I haven't seen anything official on this yet, but this looks a lot like a cascading failure. jms
Re: FYI Netflix is down
On 6/30/12, Grant Ridder shortdudey...@gmail.com wrote: well one would think that they could at least get power redundancy right... It is very similar to suggesting redundancy within a site against building collapse. Reliable power redundancy is very hard and very expensive. Much harder and much more expensive than achieving network redundancy against switch or router failures. And there are always tradeoffs involved, because there is only one utility grid available. There are always some limitations in the amount of isolation possible. You have devices plugged into both power systems. There is some possibility a random device plugged into both systems creates a short in both branches that it plugs into. Both power systems always have to share the same ground, due to safety considerations. Both power systems always have to have fuses or breakers installed, due to safety considerations, and there is always a possibility that various kinds of anomalies cause fuses to simultaneously blow in both systems. -- -JH
Re: FYI Netflix is down
On Jun 30, 2012 12:25 AM, joel jaeggli joe...@bogus.com wrote: On 6/30/12 12:11 AM, Tyler Haske wrote: I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters. How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ there are 7 regions in ec2 three in north america two in asia one in europe and one in south america. us east coast, the one currently being impacted is further subdivided into 5 availability zones. us east 1d appears to be the only one currently being impacted. distributing your application is left as an exercise to the reader. +1 Sorry to be the Monday morning quarterback, but the sites that went down learned a valuable lesson in single point of failure analysis. A highly redundant and professionally run data center is a single point of failure. Geo-redundancy is key. In fact, I would take distributed data centers over RAID, UPS, or any other fancy pants © mechanisms any day. And, AWS East also seems to be cursed. I would run out of west for a while. :-) I would also look into clouds of clouds. ... Who knows. Amazon could have an Enron moment, at which point a corporate entity with a tax id is now a single point of failure. Pay your money, take your chances. CB
Re: FYI Netflix is down
On 6/30/12, Cameron Byrne cb.li...@gmail.com wrote: On Jun 30, 2012 12:25 AM, joel jaeggli joe...@bogus.com wrote: On 6/30/12 12:11 AM, Tyler Haske wrote: Geo-redundancy is key. In fact, i would take distributed data centers over RAID, UPS, or any other fancy pants © mechanisms any day. Geo-redundancy is more expensive than any of those technologies, because it directly impacts every application and reduces performance. It means that, for example, if an application needs to guarantee something is persisted to a distributed database, such as a record that such-and-such user's credit card has just been charged $X, or that such-and-such user has uploaded this blob to the web service, then the round-trip time of the longest-latency path between any of the redundancy sites is added to the critical path of the WRITE transaction latency during the commit stage. That is because you cannot complete a transaction and ensure you have consistency or correct data until that transaction reaches a system at the remote site managing the persistence and is acknowledged as received intact. For example, suppose your geo sites are a minimum of 250 miles apart; recall that light travels only 186.28 miles per millisecond. That means you have a 500-mile round trip and have therefore added a bare minimum of 2.6 milliseconds of latency to every write transaction, and probably more like 15 milliseconds. If your original transaction latency was 1 millisecond, or 1000 transactions per second, AND you require only that the data reach the remote site and be acknowledged (not that the transaction succeed at the remote site before you commit), you are now at a minimum of 2.6 milliseconds, an average of 384 transactions per second. To actually do it safely, you require 3.6 milliseconds, limiting you to an average of 277 transactions per second. 
If the application is not specially designed for remote site redundancy, then this means you require a scheme such as synchronous storage-level replication to achieve clustering, which has even worse results if there is significant geographic dispersion. RAID transactional latencies are much lower. UPSes and redundant power do not increase transaction latencies at all. -- -JH
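The arithmetic in the post above can be checked directly. Note that 186.28 mi/ms is the vacuum speed of light, as the author uses it; signal speed in fiber is roughly two-thirds of that, which is why "probably more like 15 milliseconds" is the realistic figure.

```python
# Back-of-the-envelope check of the synchronous-replication math above.
LIGHT_MI_PER_MS = 186.28  # vacuum; in fiber it's roughly 2/3 of this

def added_rtt_ms(separation_miles):
    """Round-trip propagation delay between two sites."""
    return 2 * separation_miles / LIGHT_MI_PER_MS

def max_tps(commit_latency_ms):
    """Serialized write transactions per second at a given commit latency."""
    return 1000 / commit_latency_ms

print(round(added_rtt_ms(250), 2))  # ~2.68 ms: the "bare minimum of 2.6 ms"
print(int(max_tps(2.6)))            # 384 tps, ack-on-receipt at the remote site
print(int(max_tps(3.6)))            # 277 tps, waiting for the remote commit
```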
Re: FYI Netflix is down
On 6/30/12 4:50 AM, Justin M. Streiner wrote: On Sat, 30 Jun 2012, jamie rishaw wrote: you know what's happening even more? ..Amazon not learning their lesson. I was not giving anyone a free pass or attempting to shrug off the outage. I was just stating that there are many reasons why things break. I haven't seen anything official on this yet, but this looks a lot like a cascading failure. But haven't they all been cascading failures? One can't just say "well, it's a huge system, therefore hard." Especially when they claimed to have learned their lesson from previous outwardly similar failures; either they were lying, or didn't really learn anything, or the scope simply exceeds their grasp. If it's too hard for entity X to handle a large system (for whatever large means to them), then X needs to break it down into smaller parts that they're capable of handling in a competent manner. ~Seth
Re: FYI Netflix is down
On 6/30/2012 12:11 AM, Tyler Haske wrote: I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters. How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ It doesn't change my theory. You add that complexity, something happens, and the failover routing doesn't work as planned. Been there, done that, have the T-shirt.
Re: FYI Netflix is down
On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote: But haven't they all been cascading failures? No. They have not. That's not what that term means. 'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run. T
Re: FYI Netflix is down
On 6/30/12, Todd Underwood toddun...@gmail.com wrote: On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote: But haven't they all been cascading failures? No. They have not. That's not what that term means. 'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading Not sure where you're going there; cascading failures are common, but fortunately are usually temporary or have some kind of scope limit. Cascading just means you have a dependency between components, where the failure of one component may result in the failure of a second component, the failure of the second component results in the failure of a third component, and this process continues until no more components are dependent or no more components are still operating. This can happen to the small pieces inside of one specific system, causing that system to collapse. It's just as valid to say a cascading failure runs across larger/more complex pieces of different higher-level systems, where the components of one system aren't sufficiently independent of those in other systems, causing both systems to fail. Your application logic can be a point of failure, just as readily as your datacenter can. Cascades can happen at a higher level, where entire systems are dependent upon entire other systems. And it can happen organizationally: external dependency risk occurs when an entire business is dependent on another organization (such as product support) to remotely administer software they sold, and the subcontractor of the product support org stops doing their job, or a smaller component (one member of their staff) becomes a rogue/malicious element. -- -JH
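The dependency-driven definition above can be expressed as a toy model: a failure propagates to any component that depends on a failed one, until nothing more is dependent or nothing is left running. The component names below are invented for illustration.

```python
# Toy model of a cascading failure: a component fails if anything
# it depends on has failed, and the process repeats to a fixed point.
def cascade(initial_failures, depends_on):
    """depends_on maps component -> set of components it requires."""
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        for comp, deps in depends_on.items():
            if comp not in failed and deps & failed:
                failed.add(comp)
                changed = True
    return failed

# Hypothetical dependency graph: a power loss takes out storage,
# which takes out the API, which takes out every customer site.
deps = {
    "storage": {"power"},
    "api": {"storage"},
    "site-a": {"api"},
    "site-b": {"api"},
}
print(sorted(cascade({"power"}, deps)))
```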
Re: FYI Netflix is down
On 6/30/12 9:25 AM, Todd Underwood wrote: On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote: But haven't they all been cascading failures? No. They have not. That's not what that term means. 'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run. I honestly have no idea how to parse that since it doesn't jibe with my practical view of a cascading failure. ~Seth
Re: FYI Netflix is down
This was not a cascading failure. It was a simple power outage. Cascading failures involve interdependencies among components. T On Jun 30, 2012 2:21 PM, Seth Mattinen se...@rollernet.us wrote: On 6/30/12 9:25 AM, Todd Underwood wrote: On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote: But haven't they all been cascading failures? No. They have not. That's not what that term means. 'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run. I honestly have no idea how to parse that since it doesn't jive with my practical view of a cascading failure. ~Seth
Re: FYI Netflix is down
On 6/30/12, Todd Underwood toddun...@gmail.com wrote: This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. Actually, you can't really say that. It's true that it was a simple power outage for Amazon. Power failed, causing the AWS service at certain locations to experience issues. Any of the issues related to services at locations that didn't lose power are a possible result of cascade. But as for the other possible outages being reported... Instagram, Pinterest, Netflix, Heroku, Woot, Pocket, zoneedit. Possibly Amazon's power failure caused AWS problems, which resulted in issues with these services. Some of these services may actually have had redundancy in place, but experience a failure of their service as a result of unexpected cascade from the affected site. T -- -JH
Re: FYI Netflix is down
The last 2 Amazon outages were power issues isolated to just their us-east Virginia data center. I read somewhere that Amazon has something like 70% of their EC2 resources in Virginia, and it's also their oldest EC2 datacenter, so I am guessing they learned a lot of lessons and are stuck with an aged infrastructure there. I think the real problem here is that a large subset of the customers using EC2 misunderstand the redundancy that is built into the Amazon architecture. You are essentially supposed to view individual virtual machines as being entirely disposable and make duplicates of everything across availability zones, and for extra points across regions. Most people instead think that the 2 cents/hour price tag is a massive cost savings and the cloud is invincible... look at the SLA for EC2: Amazon basically doesn't really consider it a real outage unless it's more than one availability zone that is down. What's more surprising is that Netflix was so affected by a single availability zone outage. They are constantly talking about their Chaos Monkey/Simian Army tool that purposely breaks random parts of their infrastructure to prove it's fault tolerant, or to point out weaknesses to fix. (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html) I think the closest thing to a cascading failure they have had was the 4/29/11 outage (http://aws.amazon.com/message/65648/) Mike On Jun 30, 2012 3:05 PM, Todd Underwood toddun...@gmail.com wrote: This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. T On Jun 30, 2012 2:21 PM, Seth Mattinen se...@rollernet.us wrote: On 6/30/12 9:25 AM, Todd Underwood wrote: On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote: But haven't they all been cascading failures? No. They have not. That's not what that term means. 
'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run. I honestly have no idea how to parse that since it doesn't jive with my practical view of a cascading failure. ~Seth
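The Chaos Monkey idea Mike mentions can be sketched as follows. This is a hypothetical toy, not Netflix's actual tool: terminate a randomly chosen instance, then verify the remaining fleet still meets the capacity requirement. The instance names and the seeded RNG are assumptions for a reproducible demo.

```python
import random

# Minimal sketch of the Chaos Monkey idea: kill a random instance,
# then check the service still has enough capacity to carry the load.
def chaos_round(instances, required, rng):
    victim = rng.choice(sorted(instances))
    instances.discard(victim)  # "terminate" the victim
    return victim, len(instances) >= required

rng = random.Random(42)        # seeded so the demo is reproducible
fleet = {"i-1", "i-2", "i-3", "i-4"}
victim, survived = chaos_round(fleet, required=3, rng=rng)
print(victim, survived)        # one instance gone, service still up
```

If `survived` ever comes back False in such a test, the weakness was found on your schedule rather than during an outage.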
Re: FYI Netflix is down
On 6/30/12 12:04 PM, Todd Underwood wrote: This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. I guess I'm assuming there were UPS and generator systems involved (and failing) with powering the critical load, but I suppose it could all be direct to utility power. ~Seth
Re: FYI Netflix is down
If I recall correctly, availability zone (AZ) mappings are specific to an AWS account, and in fact there is no way to know if you are running in the same AZ as another AWS account: http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_am_in_the_same_Availability_Zone_as_another_developer Also, AWS Elastic Load Balancer (and/or CloudWatch) should be able to detect that some instances are not reachable, and thus can start new instances and remap DNS entries automatically: http://aws.amazon.com/elasticloadbalancing/ This time only 1 AZ is affected by the power outage, so sites with fault tolerance built into their AWS infrastructure should be able to handle the issues relatively easily. Rayson == Open Grid Scheduler - The Official Open Source Grid Engine http://gridscheduler.sourceforge.net/ On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder shortdudey...@gmail.com wrote: I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down. On Fri, Jun 29, 2012 at 10:42 PM, James Laszko jam...@mythostech.com wrote: To further expand: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. 8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power. -Original Message- From: Grant Ridder [mailto:shortdudey...@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/ ) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com wrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
Sorry to be the monday morning quarterback, but the sites that went down learned a valuable lesson in single point of failure analysis. as this has happened more than once before, i am less optimistic. or maybe they decided the spof risk was not worth the avoidance costs. randy
Re: FYI Netflix is down
The interesting thing to me is the US population by time zone. If Amazon has 70% of its servers in the Eastern time zone it makes some sense. Mountain + Pacific is smaller than Central, which is a bit more than half of Eastern. These stats are older but a good rough gauge: http://answers.google.com/answers/threadview?id=714986 Jared Mauch On Jun 30, 2012, at 4:03 PM, Seth Mattinen se...@rollernet.us wrote: On 6/30/12 12:04 PM, Todd Underwood wrote: This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. I guess I'm assuming there were UPS and generator systems involved (and failing) with powering the critical load, but I suppose it could all be direct to utility power. ~Seth
Re: FYI Netflix is down
On Sat, Jun 30, 2012 at 12:04 PM, Todd Underwood toddun...@gmail.com wrote: This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then fails itself as a result (in some form or other, but frequently different to the mode of the original failure). Whilst the Amazon outage might have been a simple power outage, it's likely that at least some of the website outages were caused by a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than the Amazon outage alone would have. Scott
Re: FYI Netflix is down
scott, This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then fails itself as a result (in some form or other, but frequently different to the mode of the original failure). indeed. and that is an interdependency among components. in particular, it is a capacity interdependency. Whilst the Amazon outage might have been a simple power outage, it's likely that at least some of the website outages were caused by a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than the Amazon outage alone would have. i think you over-estimate these websites. most of them simply have no redundancy (and obviously have no tested, effective redundancy) and were simply hoping that amazon didn't really go down that much. hope is not the best strategy, as it turns out. i suspect that randy is right though: many of these businesses do not promise perfect uptime and can survive these kinds of failures with little loss to business or reputation. twitter has branded its early failures with a whale that not only didn't hurt it but helped endear the service to millions. when your service fits these criteria, why would you bother doing the complicated systems and application engineering necessary to actually have functional redundancy? it simply isn't worth it. t Scott
Re: FYI Netflix is down
+-- | On 2012-06-30 16:08:40, Rayson Ho wrote: | | If I recall correctly, availability zone (AZ) mappings are specific to | an AWS account, and in fact there is no way to know if you are running | in the same AZ as another AWS account: | | http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_am_in_the_same_Availability_Zone_as_another_developer | | Also, AWS Elastic Load Balancer (and/or CloudWatch) should be able to | detect that some instances are not reachable, and thus can start new | instances and remap DNS entries automatically: | http://aws.amazon.com/elasticloadbalancing/ | | This time only 1 AZ is affected by the power outage, so sites with | fault tolerance built into their AWS infrastructure should be able to | handle the issues relatively easily. Explain Netflix and Heroku last night. Both of whom architect across multiple AZs and have for many years. The API and EBS across the region were also affected. ELB was _also_ affected across the region, and many customers continue to report problems with it. We were told in May of last year after the last massive full-region EBS outage that the control planes for the API and related services were being decoupled so issues in a single AZ would not affect all. Seems to not be the case. Just because they offer these features that should help with resiliency doesn't actually mean they _work_ under duress. -- bdha cyberpunk is dead. long live cyberpunk.
Re: FYI Netflix is down
On Sat, Jun 30, 2012 at 4:45 PM, Bryan Horstmann-Allen b...@mirrorshades.net wrote: Explain Netflix and Heroku last night. Both of whom architect across multiple AZs and have for many years. The API and EBS across the region were also affected. ELB was _also_ affected across the region, and many customers continue to report problems with it. We were told in May of last year after the last massive full-region EBS outage that the control planes for the API and related services were being decoupled so issues in a single AZ would not affect all. Seems to not be the case. Just because they offer these features that should help with resiliency doesn't actually mean they _work_ under duress. -- But in Netflix's case, if they architected their environment the way they said they did, why wouldn't they just fail over to us-west? Especially at their scale, I wouldn't expect them to be dependent on any AWS function in any region. Mike
Re: FYI Netflix is down
+-- | On 2012-06-30 16:55:53, Mike Devlin wrote: | | But in netflix case, if they architected their environment the way they | said they did, why wouldnt they just fail over to us-west? especially at | their scale, I wouldn't expect them to be dependent on any AWS function in | any region. Have a look at Asgard, the AWS management tool they just open sourced. It implies they rely very heavily on many AWS features, some of which are very much region specific. As to their multi-region capability, I have no idea. I don't think I've ever seen them mention it. -- bdha cyberpunk is dead. long live cyberpunk.
Re: FYI Netflix is down
On Sat, Jun 30, 2012 at 5:04 PM, Bryan Horstmann-Allen b...@mirrorshades.net wrote: Have a look at Asgard, the AWS management tool they just open sourced. It implies they rely very heavily on many AWS features, some of which are very much region specific. As to their multi-region capability, I have no idea. I don't think I've ever seen them mention it. -- bdha cyberpunk is dead. long live cyberpunk. Yeah, I am sure I am making some assumptions about how much resilience they have been building into their architecture, but since every year they have been getting rid of more and more of their physical infrastructure and putting it fully in AWS, and given the fact they are a pay service, I would think they would account for a region going down. Mike
Re: FYI Netflix is down
On Sat, Jun 30, 2012 at 01:19:54PM -0700, Scott Howard wrote: On Sat, Jun 30, 2012 at 12:04 PM, Todd Underwood toddun...@gmail.com wrote: This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components. Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then fails itself as a result (in some form or other, but frequently different to the mode of the original failure). That's an interdependency. Environment A is dependent on environment B being up and pulling some of the load away from A; B is dependent on A being up and pulling some of the load away from B. A Crashes for reason X - Load Shifts to B - B Crashes due to load is a classic cascading failure. And it's not limited to software systems. It's how most major blackouts occur (except with more than three steps in the cascade, of course). -- Brett
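Brett's A-crashes, load-shifts, B-crashes sequence is easy to make concrete with a toy model (all capacities and loads here are invented numbers): total load is spread evenly over whatever environments remain up, and any environment pushed past its capacity fails, shifting its share onto the rest.

```python
def simulate_cascade(capacities, load):
    """Toy cascading-failure model.

    `capacities` maps environment name -> capacity; `load` is the total
    demand, spread evenly over the environments still up. Any environment
    pushed past its capacity fails, and its share shifts to the survivors.
    Returns (failed_in_order, survivors).
    """
    up = dict(capacities)
    failed = []
    while up:
        per_env = load / len(up)
        over = [name for name, cap in up.items() if per_env > cap]
        if not over:
            break  # the survivors can carry the load
        for name in over:
            failed.append(name)
            del up[name]
    return failed, sorted(up)

# Two environments each sized for 60% of total demand: fine together.
print(simulate_cascade({"A": 60, "B": 60}, load=100))  # ([], ['A', 'B'])

# A loses power for an unrelated reason: B inherits all 100 units,
# exceeds its capacity of 60, and fails too -- the cascade.
print(simulate_cascade({"B": 60}, load=100))  # (['B'], [])
```

The capacity interdependency Todd points at is visible in the model: B's survival depends on A being up to share the load, even though nothing in B's stack calls A.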
FYI Netflix is down
Seems that they are unreachable at the moment. Called and there's a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.comwrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
RE: FYI Netflix is down
To further expand: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. 8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power. -Original Message- From: Grant Ridder [mailto:shortdudey...@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.comwrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down. On Fri, Jun 29, 2012 at 10:42 PM, James Laszko jam...@mythostech.comwrote: To further expand: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. 8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power. -Original Message- From: Grant Ridder [mailto:shortdudey...@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/ ) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com wrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
Nature is such a PITA. On 6/29/2012 10:42 PM, James Laszko wrote: To further expand: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. 8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power. -Original Message- From: Grant Ridder [mailto:shortdudey...@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.comwrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
Whatever happened to UPSs and generators? On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher ja...@thebaughers.com wrote: Nature is such a PITA. On 6/29/2012 10:42 PM, James Laszko wrote: To further expand: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. 8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power. -Original Message- From: Grant Ridder [mailto:shortdudey...@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com wrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe -- Mike Lyon 408-621-4826 mike.l...@gmail.com http://www.linkedin.com/in/mlyon
Re: FYI Netflix is down
On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder shortdudey...@gmail.com wrote: I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down. It is my understanding that instance zones are randomized between customers -- so your zone C may be my zone A. Ian -- Ian Wilson ian.m.wil...@gmail.com Solving site load issues with database replication is a lot like solving your own personal problems with heroin -- at first, it sorta works, but after a while things just get out of hand.
Re: FYI Netflix is down
Yes, although when you launch an instance, you do have the option of selecting a zone if you want. However, once the instance is started it stays in that zone and does not switch. On Fri, Jun 29, 2012 at 10:47 PM, Ian Wilson ian.m.wil...@gmail.com wrote: On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder shortdudey...@gmail.com wrote: I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down. It is my understanding that instance zones are randomized between customers -- so your zone C may be my zone A. Ian -- Ian Wilson ian.m.wil...@gmail.com Solving site load issues with database replication is a lot like solving your own personal problems with heroin -- at first, it sorta works, but after a while things just get out of hand.
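The per-account randomization Ian mentions behaves like a deterministic per-account permutation of the physical facilities. AWS does not publish its actual scheme; this is purely a hypothetical sketch of the idea, with invented facility names:

```python
import hashlib
import random

PHYSICAL_ZONES = ["dc-1", "dc-2", "dc-3", "dc-4"]  # invented facility names
LETTERS = ["a", "b", "c", "d"]

def zone_map(account_id):
    """Deterministically shuffle the physical zones per account, so that
    'zone a' usually names a different facility for each customer."""
    seed = int(hashlib.sha256(account_id.encode()).hexdigest(), 16)
    shuffled = PHYSICAL_ZONES[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(LETTERS, shuffled))

# The same letter can point at different facilities for two accounts,
# so "zone C is up for me" says little about anyone else's zone C.
print(zone_map("111122223333"))
print(zone_map("444455556666"))
```

Which is why Grant's observation that his zone C was fine could not, by itself, identify which physical zone had lost power.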
Re: FYI Netflix is down
I was wondering the same thing! Also, Reddit appears to be really slow right now and I keep getting "reddit is under heavy load right now, sorry. Try again in a few minutes." I wonder if it's related. I believe they use Amazon for some of their stuff. Derek On 6/29/2012 11:47 PM, Mike Lyon wrote: Whatever happened to UPSs and generators? On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher ja...@thebaughers.com wrote: Nature is such a PITA. On 6/29/2012 10:42 PM, James Laszko wrote: To further expand: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. 8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power. -Original Message- From: Grant Ridder [mailto:shortdudey...@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down From Amazon Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com wrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated. On 6/29/2012 10:22 PM, Joe Blanchard wrote: Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details. -Joe
Re: FYI Netflix is down
On 6/29/12 8:47 PM, Mike Lyon wrote: Whatever happened to UPSs and generators? You don't need them with The Cloud! But seriously, this is something like the third or fourth time AWS fell over flat in recent memory. ~Seth
Re: FYI Netflix is down
They may use it for content, but reddit.com resolves to IPs owned by Qwest. On Fri, Jun 29, 2012 at 10:51 PM, Seth Mattinen se...@rollernet.us wrote: On 6/29/12 8:47 PM, Mike Lyon wrote: Whatever happened to UPSs and generators? You don't need them with The Cloud! But seriously, this is something like the third or fourth time AWS fell over flat in recent memory. ~Seth
Re: FYI Netflix is down
8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online On Fri, Jun 29, 2012 at 10:52 PM, Grant Ridder shortdudey...@gmail.comwrote: They may use it for content, but reddit.com resolves to IPs own by quest On Fri, Jun 29, 2012 at 10:51 PM, Seth Mattinen se...@rollernet.uswrote: On 6/29/12 8:47 PM, Mike Lyon wrote: Whatever happened to UPSs and generators? You don't need them with The Cloud! But seriously, this is something like the third or fourth time AWS fell over flat in recent memory. ~Seth