Re: FYI Netflix is down
The last 2 Amazon outages were power issues isolated to just their us-east Virginia data center. I read somewhere that Amazon has something like 70% of their EC2 resources in Virginia, and it's also their oldest EC2 data center, so I am guessing they learned a lot of lessons and are stuck with an aged infrastructure there.

I think the real problem here is that a large subset of the customers using EC2 misunderstand the redundancy that is built into the Amazon architecture. You are essentially supposed to view individual virtual machines as being entirely disposable and make duplicates of everything across availability zones and, for extra points, across regions. Most people instead think that the 2 cents/hour price tag is a massive cost savings and the cloud is invincible. Look at the SLA for EC2: Amazon basically doesn't consider it a real outage unless more than one availability zone is down.

What's more surprising is that Netflix was so affected by a single availability zone outage. They are constantly talking about their Chaos Monkey/Simian Army tool that purposely breaks random parts of their infrastructure to prove it's fault tolerant, or to point out weaknesses to fix (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html).

I think the closest thing to a cascading failure they have had was the 4/29/11 outage (http://aws.amazon.com/message/65648/).

Mike

On Jun 30, 2012 3:05 PM, Todd Underwood toddun...@gmail.com wrote:
This was not a cascading failure. It was a simple power outage. Cascading failures involve interdependencies among components.
T

On Jun 30, 2012 2:21 PM, Seth Mattinen se...@rollernet.us wrote:
On 6/30/12 9:25 AM, Todd Underwood wrote:
On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote:
But haven't they all been cascading failures?

No. They have not. That's not what that term means. 'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run.

I honestly have no idea how to parse that since it doesn't jibe with my practical view of a cascading failure.
~Seth
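The distinction being argued above (a simple outage versus a failure that spreads through interdependencies among components) can be illustrated with a toy model. This is only a rough sketch of one simplified reading of the term: the component names and the dependency maps are hypothetical and do not describe the actual AWS topology.

```python
from collections import deque


def failed_set(dependents, initial_failures):
    """Return every component that ends up down.

    dependents maps a component to the components that stop working
    when it fails. Failures spread along those edges (breadth-first).
    """
    down = set(initial_failures)
    queue = deque(initial_failures)
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in down:
                down.add(nxt)
                queue.append(nxt)
    return down


# Hypothetical, illustrative topologies only.
# Simple outage: losing power in one zone takes down only that zone's hosts.
isolated = {"az-a-power": ["az-a-hosts"]}

# Cascade: a shared control plane depends on the failed zone, and the
# other zones depend on that control plane.
coupled = {
    "az-a-power": ["az-a-hosts"],
    "az-a-hosts": ["control-plane"],
    "control-plane": ["az-b-hosts", "az-c-hosts"],
}

print(sorted(failed_set(isolated, ["az-a-power"])))
print(sorted(failed_set(coupled, ["az-a-power"])))
```

In the first map the damage stops at the zone that lost power; in the second the same initial failure reaches components in other zones only because of the shared dependency.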
Re: FYI Netflix is down
On Sat, Jun 30, 2012 at 4:45 PM, Bryan Horstmann-Allen b...@mirrorshades.net wrote:
Explain Netflix and Heroku last night. Both of whom architect across multiple AZs and have for many years. The API and EBS across the region were also affected. ELB was _also_ affected across the region, and many customers continue to report problems with it. We were told in May of last year after the last massive full-region EBS outage that the control planes for the API and related services were being decoupled so issues in a single AZ would not affect all. Seems to not be the case. Just because they offer these features that should help with resiliency doesn't actually mean they _work_ under duress.
--

But in Netflix's case, if they architected their environment the way they said they did, why wouldn't they just fail over to us-west? Especially at their scale, I wouldn't expect them to be dependent on any AWS function in any one region.

Mike
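For what "just fail over to us-west" could look like at its very simplest, here is a minimal sketch. It assumes some external health check already reports which regions are usable; the function name, the region labels, and the health map are hypothetical, and nothing here reflects how Netflix actually routes traffic.

```python
def pick_region(preferred, healthy):
    """Return the first region in preference order that is reported healthy.

    preferred: list of region labels, most preferred first.
    healthy:   dict mapping region label -> bool from a health check
               (assumed to exist elsewhere). Unknown regions count as down.
    Returns None when no listed region is healthy.
    """
    for region in preferred:
        if healthy.get(region, False):
            return region
    return None


# Hypothetical health report during an east-coast outage.
report = {"us-east": False, "us-west": True}
print(pick_region(["us-east", "us-west"], report))
```

The selection logic is the easy part. As the quoted message points out, it only helps if nothing the service needs (API, EBS, ELB, control planes) is itself tied to the failed region.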
Re: FYI Netflix is down
On Sat, Jun 30, 2012 at 5:04 PM, Bryan Horstmann-Allen b...@mirrorshades.net wrote:
Have a look at Asgard, the AWS management tool they just open sourced. It implies they rely very heavily on many AWS features, some of which are very much region specific. As to their multi-region capability, I have no idea. I don't think I've ever seen them mention it.
--
bdha
cyberpunk is dead. long live cyberpunk.

Yeah, I am sure I am making some assumptions about how much resilience they have been building into their architecture. But since every year they have been getting rid of more and more of their physical infrastructure and putting it fully in AWS, and given the fact they are a pay service, I would think they would account for a region going down.

Mike
technical contact at ATT Wireless
Hi,

Would anyone happen to know a contact at ATT Wireless who would be able to help diagnose a DNS issue? We are seeing the DNS record for boston.com intermittently resolve to the wrong IP address, but I am having trouble getting through to the correct people through normal support.

Thanks
Mike
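One way to put numbers on an intermittent wrong answer like this is to resolve the name repeatedly and tally what comes back. The sketch below is only a rough aid: the function name and the `resolve` parameter are made up for illustration, and by default it asks whatever resolver the local machine is configured to use (including any caching), so it would need to run from a device on the affected network to say anything about that carrier's DNS.

```python
import socket
from collections import Counter


def tally_answers(hostname, attempts=20, resolve=None):
    """Resolve hostname repeatedly and count each distinct set of addresses.

    resolve is an optional callable taking a hostname and returning a list
    of IP address strings. It defaults to the system resolver.
    """
    if resolve is None:
        def resolve(name):
            # gethostbyname_ex returns (canonical_name, aliases, ip_list)
            return socket.gethostbyname_ex(name)[2]

    counts = Counter()
    for _ in range(attempts):
        try:
            addrs = tuple(sorted(resolve(hostname)))
        except OSError as exc:
            # socket.gaierror and socket.herror are subclasses of OSError
            addrs = ("error: " + type(exc).__name__,)
        counts[addrs] += 1
    return counts
```

Calling `tally_answers("boston.com")` and printing the result gives a count per distinct answer, which is easier to hand to a support contact than "it is sometimes wrong".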
RE: Network diagram app that shows realtime link utilization
Check out InterMapper (http://www.intermapper.com/). It's Java based, but works real well.