Re: FYI Netflix is down

2012-06-30 Thread Mike Devlin
The last 2 Amazon outages were power issues isolated to just their us-east
Virginia data center. I read somewhere that Amazon has something like 70%
of their EC2 resources in Virginia and it's also their oldest EC2
datacenter, so I am guessing they learned a lot of lessons and are stuck
with an aged infrastructure there.

I think the real problem here is that a large subset of the customers using
EC2 misunderstand the redundancy that is built into the Amazon
architecture. You are essentially supposed to view individual virtual
machines as being entirely disposable and make duplicates of everything
across availability zones and, for extra points, across regions.
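
To make the "disposable instances, duplicated everywhere" idea concrete,
here is a minimal sketch in plain Python. It does not call any AWS API; the
zone names, the replica count, and the helpers place_replicas and survivors
are all assumptions made up for illustration. It only shows how spreading
identical copies across zones and regions limits what a single failure can
take out.

```python
from collections import Counter
from itertools import cycle

# Hypothetical zone layout, used only for illustration.
ZONES = {
    "us-east-1": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "us-west-1": ["us-west-1a", "us-west-1b"],
}


def place_replicas(count, zones):
    """Assign `count` identical replicas round-robin across `zones`."""
    rotation = cycle(zones)
    return [next(rotation) for _ in range(count)]


def survivors(placement, failed_zones):
    """Count the replicas left after every zone in `failed_zones` goes down."""
    failed = set(failed_zones)
    return sum(1 for zone in placement if zone not in failed)


all_zones = [zone for region in ZONES.values() for zone in region]
placement = place_replicas(10, all_zones)

print(Counter(placement))                        # replicas per zone
print(survivors(placement, ["us-east-1a"]))      # one availability zone lost
print(survivors(placement, ZONES["us-east-1"]))  # a whole region lost
```

With everything in one zone the first failure leaves nothing; spread out
like this, losing a zone or even a region still leaves running copies,
provided nothing else they depend on lives only in the failed location.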

Most people instead think that the 2 cents/hour price tag is a massive cost
savings and the cloud is invincible. Look at the SLA for EC2: Amazon
basically doesn't consider it a real outage unless it's more than one
availability zone that is down.

What's more surprising is that Netflix was so affected by a single
availability zone outage. They are constantly talking about their chaos
monkey/simian army tool that purposely breaks random parts of their
infrastructure to prove it's fault tolerant, or to point out weaknesses to
fix
(http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html).


I think the closest thing to a cascading failure they have had was the
4/29/11 outage (http://aws.amazon.com/message/65648/).


Mike


On Jun 30, 2012 3:05 PM, Todd Underwood toddun...@gmail.com wrote:

 This was not a cascading failure.  It was a simple power outage

 Cascading failures involve interdependencies among components.

 T
 On Jun 30, 2012 2:21 PM, Seth Mattinen se...@rollernet.us wrote:

  On 6/30/12 9:25 AM, Todd Underwood wrote:
  
   On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote:
  
  
   But haven't they all been cascading failures?
  
   No.  They have not.  That's not what that term means.
  
   'Cascading failure' has a fairly specific meaning that doesn't imply
   resilience in the face of decomposition into smaller parts.  Cascading
   failures can occur even when a system is decomposed into small parts,
   each of which is apparently well run.
  
 
 
  I honestly have no idea how to parse that since it doesn't jive with my
  practical view of a cascading failure.
 
  ~Seth
 
 



Re: FYI Netflix is down

2012-06-30 Thread Mike Devlin
On Sat, Jun 30, 2012 at 4:45 PM, Bryan Horstmann-Allen 
b...@mirrorshades.net wrote:

 Explain Netflix and Heroku last night. Both of whom architect across multiple
 AZs and have for many years.

 The API and EBS across the region were also affected. ELB was _also_ affected
 across the region, and many customers continue to report problems with it.

 We were told in May of last year after the last massive full-region EBS outage
 that the control planes for the API and related services were being decoupled
 so issues in a single AZ would not affect all. Seems to not be the case.

 Just because they offer these features that should help with resiliency doesn't
 actually mean they _work_ under duress.
 --



But in Netflix's case, if they architected their environment the way they
said they did, why wouldn't they just fail over to us-west? Especially at
their scale, I wouldn't expect them to be dependent on any AWS function in
any single region.


Mike


Re: FYI Netflix is down

2012-06-30 Thread Mike Devlin
On Sat, Jun 30, 2012 at 5:04 PM, Bryan Horstmann-Allen 
b...@mirrorshades.net wrote:


 Have a look at Asgard, the AWS management tool they just open sourced. It
 implies they rely very heavily on many AWS features, some of which are very
 much region specific.

 As to their multi-region capability, I have no idea. I don't think I've ever
 seen them mention it.
 --
 bdha
 cyberpunk is dead. long live cyberpunk.



Yeah, I am sure I am making some assumptions about how much resilience they
have been building into their architecture, but since every year they have
been getting rid of more and more of their physical infrastructure and
putting it fully in AWS, and given the fact they are a pay service, I would
think they would account for a region going down.

Mike


technical contact at AT&T Wireless

2012-06-28 Thread Mike Devlin
Hi,

Would anyone happen to know a contact at AT&T Wireless that would be able to
help diagnose a DNS issue? We are seeing the DNS record for boston.com
intermittently resolve to the wrong IP address, but I am having trouble
getting through to the correct people through normal support.
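
For anyone wanting to reproduce this kind of intermittent answer, a rough
sketch of a polling check follows. It is only an illustration: the helper
names tally_answers and unexpected are made up here, and it asks whatever
resolver the local machine is configured with, so it would have to run from
a host on the affected network. A local cache can also hide the flapping.

```python
import socket
import time
from collections import Counter


def system_resolve(hostname):
    """Return the IPv4 addresses the local resolver gives for `hostname`."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def tally_answers(hostname, attempts, resolve=system_resolve, delay=0.0):
    """Resolve `hostname` repeatedly and count how often each answer set appears."""
    seen = Counter()
    for _ in range(attempts):
        try:
            answer = tuple(sorted(resolve(hostname)))
        except OSError as exc:
            # Lookup failures are recorded as their own "answer".
            answer = ("error: " + type(exc).__name__,)
        seen[answer] += 1
        time.sleep(delay)
    return seen


def unexpected(seen, expected_addresses):
    """Keep only answer sets containing something outside the expected set.

    Recorded lookup errors are never in the expected set, so they are kept too.
    """
    expected = set(expected_addresses)
    return {answer: n for answer, n in seen.items() if not set(answer) <= expected}
```

Calling tally_answers("boston.com", 100, delay=1.0) and passing the result
to unexpected() along with the addresses you know to be correct would give
a count of how often the wrong answer shows up, which is useful evidence to
hand to whoever ends up looking at it.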


Thanks

Mike


RE: Network diagram app that shows realtime link utilization

2012-05-03 Thread Mike Devlin
Check out InterMapper (http://www.intermapper.com/). It's Java based, but
works real well.