Re: FYI Netflix is down

2012-07-11 Thread steve pirk [egrep]
On Mon, Jul 9, 2012 at 10:20 AM, Dave Hart daveh...@gmail.com wrote:

 We continue to investigate why these connections were timing out
 during connect, rather than quickly determining that there was no
 route to the unavailable hosts and failing quickly.

 potential translation:

 We continue to shoot ourselves in the foot by filtering all ICMP
 without understanding the implications.


Sorry to mention my favorite hardware vendor again, but that is what I
liked about using F5 BigIP as load balancing devices... They did layer-7
URL checking to see if the service was really responding (instead of just
pinging or opening a connection to the IP).
We performed tests that would do a complete LDAP SSL query to verify a
directory server could actually look up a person. If it failed to answer
within a certain time frame, then it was taken out of rotation.

I do not know if that was ever implemented in production, but we did verify
it worked.
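
As a rough illustration of that kind of layer-7 health check (a sketch only, in Python, with hypothetical backend addresses, and with a simple HTTP probe standing in for the LDAP-over-SSL lookup described above; this is not F5 configuration):

import http.client

BACKENDS = ["10.0.0.11", "10.0.0.12"]   # hypothetical pool members
TIMEOUT_S = 3                           # answer deadline, as in the anecdote

def healthy(addr):
    """True only if the service answers a real application-level request in time."""
    try:
        conn = http.client.HTTPConnection(addr, 80, timeout=TIMEOUT_S)
        conn.request("GET", "/healthz")          # layer-7 probe, not just a ping
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except (OSError, http.client.HTTPException):
        return False

in_rotation = [b for b in BACKENDS if healthy(b)]
print("serving from:", in_rotation)

The point is the same as the BigIP monitor: a backend stays in rotation only if the application itself answers correctly within the deadline, not merely because its TCP port opens.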

On the "software in the hardware can fail" point, my only defense is that you do
redundant testing of the watcher devices, and have enough of them to vote
misbehaving ones out of service. Oh, and it is best if the global load
balancing hardware/software is located somewhere other than the data
centers being monitored.

-- 
steve pirk


Re: FYI Netflix is down

2012-07-09 Thread gb10hkzo-na...@yahoo.co.uk
Steve at pirk,

I fail to grasp the concept in your argument.

You do realise, do you not, that your $ black boxes from your favourite
brand-name vendor have software running inside of them?

Case in point: the recent LINX issues. It wasn't the hardware
that gave them the headaches, but the software running on it sure did!

I am a big believer in using hardware to load balance data centers, and not
leave it up to software in the data center which might fail. 


Re: FYI Netflix is down

2012-07-09 Thread Alain Hebert

Hi,

Well, depending on your black box, your mileage will vary.

Their wide use of ASICs eliminates a lot of the headaches of a pure
software implementation.


Buffer, timing, expected results, etc.

Their real software only represents a small part of the device and
is mostly relegated to management and some L4 to L7 handling.


So yes, ASIC/FPGA devices have software, but their results and behavior
are predictable, and the system is more stable because of it.


PS: Yes, CAM lockout and bad RAM are still a PITA for them.

In short:

It is quite a thing to say that because everything can be
categorized as software, someone's point is invalid.


-
Alain Hebert        aheb...@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911   http://www.pubnix.net   Fax: 514-990-9443


On 07/09/12 07:42, gb10hkzo-na...@yahoo.co.uk wrote:

Steve at pirk,

I fail to grasp the concept in your argument.

You do realise, do you not, that your $ black boxes from your favourite
brand-name vendor have software running inside of them?

Case in point: the recent LINX issues. It wasn't the hardware
that gave them the headaches, but the software running on it sure did!


I am a big believer in using hardware to load balance data centers, and not
leave it up to software in the data center which might fail.






Re: FYI Netflix is down

2012-07-09 Thread valdis . kletnieks
On Mon, 09 Jul 2012 08:07:14 -0400, Alain Hebert said:

  Their wide use of ASICs eliminates a lot of the headaches of a pure
 software implementation.

And gets you, in return, the headaches of buggy hardware, where
bug-fixing is just a bit harder than "load the new release." ;)




Re: FYI Netflix is down

2012-07-09 Thread Rayson Ho
On Sun, Jul 8, 2012 at 8:27 PM, steve pirk [egrep] st...@pirk.com wrote:
 I am pretty sure Netflix and others were trying to do it right, as they
 all had graceful fail-over to a secondary AWS zone defined.
 It looks to me like Amazon uses DNS round-robin to load balance the zones,
 because they mention returning a list of addresses for DNS queries, which
 explains the failure of the services to shunt over to other zones in their
 postmortem.

There are also bugs from the Netflix side uncovered by the AWS outage:

Lessons Netflix Learned from the AWS Storm

http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

For an infrastructure this large, no matter whether you are running your own
datacenter or using the cloud, it is certain that the code is not bug-free.
And another thing is, if everything is too automated, then a failure in one
component can trigger bugs in areas that no one has ever thought of...

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/






 Elastic Load Balancers (ELBs) allow web traffic directed at a single IP
 address to be spread across many EC2 instances. They are a tool for high
 availability as traffic to a single end-point can be handled by many
 redundant servers. ELBs live in individual Availability Zones and front EC2
 instances in those same zones or in other Availability Zones.



 ELBs can also be deployed in multiple Availability Zones. In this
 configuration, each Availability Zone’s end-point will have a separate IP
 address. A single Domain Name will point to all of the end-points’ IP
 addresses. When a client, such as a web browser, queries DNS with a Domain
 Name, it receives the IP address (“A”) records of all of the ELBs in random
 order. While some clients only process a single IP address, many (such as
 newer versions of web-browsers) will retry the subsequent IP addresses if
 they fail to connect to the first. A large number of non-browser clients
 only operate with a single IP address.
 During the disruption this past Friday night, the control plane (which
 encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an
 ELB, and remove traffic from ELBs) began performing traffic shifts to
 account for the loss of load balancers in the affected Availability Zone.
 As the power and systems returned, a large number of ELBs came up in a
 state which triggered a bug we hadn’t seen before. The bug caused the ELB
 control plane to attempt to scale these ELBs to larger ELB instance sizes.
 This resulted in a sudden flood of requests which began to backlog the
 control plane. At the same time, customers began launching new EC2
 instances to replace capacity lost in the impacted Availability Zone,
 requesting the instances be added to existing load balancers in the other
 zones. These requests further increased the ELB control plane backlog.
 Because the ELB control plane currently manages requests for the US East-1
 Region through a shared queue, it fell increasingly behind in processing
 these requests; and pretty soon, these requests started taking a very long
 time to complete.

  http://aws.amazon.com/message/67457/


 *In reality, though, Amazon data centers have outages all the time. In
 fact, Amazon tells its customers to plan for this to happen, and to be
 ready to roll over to a new data center whenever there’s an outage.*

 *That’s what was supposed to happen at Netflix Friday night. But it
 didn’t work out that way. According to Twitter messages from Netflix
 Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick
 Branson, it looks like an Amazon Elastic Load Balancing service, designed
 to spread Netflix’s processing loads across data centers, failed during the
 outage. Without that ELB service working properly, the Netflix and Pinterest
 services hosted by Amazon crashed.*

  http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/

 I am a big believer in using hardware to load balance data centers, and not
 leave it up to software in the data center which might fail.

 Speaking of services like RightScale, Google announced Compute Engine at
 Google I/O this year. BuildFax was an early Adopter, and they gave it great
 reviews...
 http://www.youtube.com/watch?v=LCjSJ778tGU

 It looks like Google has entered into the VPS market. 'bout time... ;-]
 http://cloud.google.com/products/compute-engine.html

 --steve pirk



Re: FYI Netflix is down

2012-07-09 Thread Dave Hart
On Mon, Jul 9, 2012 at 15:50 UTC, Rayson Ho wrote:
 There are also bugs from the Netflix side uncovered by the AWS outage:

 Lessons Netflix Learned from the AWS Storm

 http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

We continue to investigate why these connections were timing out
during connect, rather than quickly determining that there was no
route to the unavailable hosts and failing quickly.

potential translation:

We continue to shoot ourselves in the foot by filtering all ICMP
without understanding the implications.
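
A rough way to see the two failure modes being contrasted here (a sketch only; 192.0.2.1 is a TEST-NET placeholder, not a real endpoint):

import errno
import socket
import time

def try_connect(addr, port, timeout_s=10.0):
    start = time.monotonic()
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout_s)
    try:
        s.connect((addr, port))
        return "connected"
    except socket.timeout:
        # Packets silently dropped and ICMP filtered: nothing comes back,
        # so the caller learns nothing until the deadline expires.
        return f"timed out after {time.monotonic() - start:.1f}s"
    except OSError as e:
        # A TCP RST or an ICMP unreachable becomes ECONNREFUSED /
        # EHOSTUNREACH within milliseconds: the "failing quickly" case.
        name = errno.errorcode.get(e.errno, str(e.errno))
        return f"failed fast ({name}) after {time.monotonic() - start:.2f}s"
    finally:
        s.close()

print(try_connect("192.0.2.1", 443))

Whether the fast-failure branch ever fires depends on the network being allowed to say "unreachable"; filter all ICMP and blackhole the traffic, and every dead backend costs the connecting client its full timeout.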

Cheers,
Dave Hart



Re: FYI Netflix is down

2012-07-08 Thread steve pirk [egrep]
On Tue, Jul 3, 2012 at 1:00 PM, Ryan Malayter malay...@gmail.com wrote:

 Doing it the right way makes the cloud far less cost-effective and far
 less agile. Once you get it all set up just so, change becomes very
 difficult. All the monitoring and fail-over/fail-back operations are
 generally application-specific and provider-specific, so there's a lot
 of lock-in. Tools like RightScale are a step in the right direction,
 but don't really touch the application layer. You also have to worry
 about the availability of yet another provider!


I am pretty sure Netflix and others were trying to do it right, as they
all had graceful fail-over to a secondary AWS zone defined.
It looks to me like Amazon uses DNS round-robin to load balance the zones,
because they mention returning a list of addresses for DNS queries, which
explains the failure of the services to shunt over to other zones in their
postmortem.

 Elastic Load Balancers (ELBs) allow web traffic directed at a single IP
 address to be spread across many EC2 instances. They are a tool for high
 availability as traffic to a single end-point can be handled by many
 redundant servers. ELBs live in individual Availability Zones and front EC2
 instances in those same zones or in other Availability Zones.



 ELBs can also be deployed in multiple Availability Zones. In this
 configuration, each Availability Zone’s end-point will have a separate IP
 address. A single Domain Name will point to all of the end-points’ IP
 addresses. When a client, such as a web browser, queries DNS with a Domain
 Name, it receives the IP address (“A”) records of all of the ELBs in random
 order. While some clients only process a single IP address, many (such as
 newer versions of web-browsers) will retry the subsequent IP addresses if
 they fail to connect to the first. A large number of non-browser clients
 only operate with a single IP address.
 During the disruption this past Friday night, the control plane (which
 encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an
 ELB, and remove traffic from ELBs) began performing traffic shifts to
 account for the loss of load balancers in the affected Availability Zone.
 As the power and systems returned, a large number of ELBs came up in a
 state which triggered a bug we hadn’t seen before. The bug caused the ELB
 control plane to attempt to scale these ELBs to larger ELB instance sizes.
 This resulted in a sudden flood of requests which began to backlog the
 control plane. At the same time, customers began launching new EC2
 instances to replace capacity lost in the impacted Availability Zone,
 requesting the instances be added to existing load balancers in the other
 zones. These requests further increased the ELB control plane backlog.
 Because the ELB control plane currently manages requests for the US East-1
 Region through a shared queue, it fell increasingly behind in processing
 these requests; and pretty soon, these requests started taking a very long
 time to complete.

 http://aws.amazon.com/message/67457/
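
The client behaviour described in the postmortem amounts to roughly the sketch below (elb.example.com is a placeholder name; Python's stdlib socket.create_connection does a similar walk through the returned addresses):

import socket

def connect_any(name, port, timeout_s=3.0):
    """Try each A record the resolver hands back until one answers."""
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            name, port, socket.AF_INET, socket.SOCK_STREAM):
        s = socket.socket(family, socktype, proto)
        s.settimeout(timeout_s)
        try:
            s.connect(sockaddr)
            return s                 # first endpoint that answers wins
        except OSError as e:
            last_err = e             # dead ELB endpoint: fall through to the next
            s.close()
    raise last_err if last_err else OSError("no addresses returned")

# sock = connect_any("elb.example.com", 443)

Clients that stop at the first address returned (the "large number of non-browser clients" above) get none of this resilience, which is why losing one zone's ELB endpoints hurt them disproportionately.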


 *In reality, though, Amazon data centers have outages all the time. In
 fact, Amazon tells its customers to plan for this to happen, and to be
 ready to roll over to a new data center whenever there’s an outage.*

 *That’s what was supposed to happen at Netflix Friday night. But it
 didn’t work out that way. According to Twitter messages from Netflix
 Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick
 Branson, it looks like an Amazon Elastic Load Balancing service, designed
 to spread Netflix’s processing loads across data centers, failed during the
 outage. Without that ELB service working properly, the Netflix and Pinterest
 services hosted by Amazon crashed.*

 http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/

I am a big believer in using hardware to load balance data centers, and not
leave it up to software in the data center which might fail.

Speaking of services like RightScale, Google announced Compute Engine at
Google I/O this year. BuildFax was an early Adopter, and they gave it great
reviews...
http://www.youtube.com/watch?v=LCjSJ778tGU

It looks like Google has entered into the VPS market. 'bout time... ;-]
http://cloud.google.com/products/compute-engine.html

--steve pirk


Re: FYI Netflix is down

2012-07-08 Thread Ryan Malayter


On Jul 8, 2012, at 7:27 PM, steve pirk [egrep] st...@pirk.com wrote:
 
 I am pretty sure Netflix and others were trying to do it right, as they all 
 had graceful fail-over to a secondary AWS zone defined.

Having a single company as an infrastructure supplier is not trying to do it 
right from an engineering OR business perspective. It's lazy. No matter how 
many availability zones the vendor claims.

RE: FYI Netflix is down

2012-07-06 Thread Dan Golding


 -Original Message-
 
 I imagine Netflix is mature enough to track this data as you suggest,
 and that's why they use AWS - downtime isn't a big deal for their
 business unless it gets really, really bad.

There is another possibility that is probably much more widespread amongst AWS 
(and other cloud) customers. Here is the scenario:

You are a small, hungry startup. No capital for servers. Cloud seems great. 
Then, big growth hits! Cloud seems even better - you may have the capital now, 
thanks to friendly VC/public investment/private equity, but you don't have the 
time to catch up. So, keep using cloud. 

Then, the now mid-sized company discovers one day that their use of the cloud 
is no longer economical, if it ever was. They are big enough to use dedicated
hardware in a colocation or wholesale datacenter solution, with blended transit
from some upstreams. But the cost to transition out of cloud is big, too. So, 
they might go with a hybrid strategy, at least for a few years. 

This happens all the time. Not saying Netflix is doing this, but lots of other 
folks are. It’s a trap that’s easy to fall into. Especially with rapid growth.

- Dan


Re: FYI Netflix is down

2012-07-06 Thread James Downs

On Jul 6, 2012, at 1:50 PM, Dan Golding wrote:

 This happens all the time. Not saying Netflix is doing this, but lots of 
 other folks are. It’s a trap that’s easy to fall into. Especially with 

Netflix did the reverse. They moved *to* Amazon, so they could do NoOps.


Re: FYI Netflix is down

2012-07-04 Thread Kyle Creyts
Tell that to people in the third world without utilities.
On Jul 3, 2012 8:32 PM, Randy Bush ra...@psg.com wrote:

  Also, I don't think there is an acceptable level of downtime for
  water.

 coming soon to a planet near you

 randy




Re: FYI Netflix is down

2012-07-04 Thread Randy Bush
 Tell that to people in the third world without utilities.
 Also, I don't think there is an acceptable level of downtime for
 water.
 coming soon to a planet near you

i work there regularly.  the typical nanog kiddie does not.

randy



Re: FYI Netflix is down

2012-07-03 Thread George Herbert




On Jul 2, 2012, at 7:19 PM, Rodrick Brown rodrick.br...@gmail.com wrote:

 People are acting as if Netflix is part of some critical service; they stream
 movies, for Christ's sake.  Some acceptable level of loss is fine for 99.99% of
 Netflix's user base, just like cable, electricity and running water.  I suffer a
 few hours of losses each year from those services.  Does it suck?  Yes.  Is it
 the end of the world?  No.

Actually calculating - understanding - the cost of downtime, and what variations
in it exist over time, is key to reliability engineering.

But if you plan to cover X failure scenarios and only cover X/2 of them
due to implementation glitches, you goofed.

The right answer may be "relax and accept the downtime" and it may be "spend
$10 million to avoid most of these."  If you haven't thought it through
and quantified it, do so...



George William Herbert
Sent from my iPhone


RE: FYI Netflix is down

2012-07-03 Thread Dan Golding
 -Original Message-
 From: James Downs [mailto:e...@egon.cc]
 
 
 On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:
 
  People are acting as if Netflix is part of some critical service; they
 stream movies, for Christ's sake.  Some acceptable level of loss is fine
 for 99.99% of Netflix's user base, just like cable, electricity and
 running water.  I suffer a few hours of losses each year from those
 services.  Does it suck?  Yes.  Is it the end of the world?  No.
 
 You missed the point.

And very publicly missed the point, too. The Netflix issues led to a
large discussion of downtime, testing, and fault tolerance that has been
very useful for the community and could lead to some good content for
NANOG conferences (/pokes PC). For Netflix (and all other similar
services) downtime is money and money is downtime. There is a
quantifiable cost for customer acquisition and a quantifiable churn
during each minute of downtime. Mature organizations actually calculate
and track this. The trick is to ensure that you have balanced the cost
of greater redundancy vs the cost of churn/customer acquisition. If you
are spending too much on redundancy, it's as big a mistake as spending
too little. 
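
A toy version of that calculation, with entirely made-up numbers (nothing here is Netflix data):

expected_downtime_h_per_year = 8.0          # hypothetical
churned_customers_per_downtime_h = 500      # hypothetical
cost_per_acquired_customer = 40.0           # hypothetical, USD
extra_redundancy_cost_per_year = 250000.0   # hypothetical, USD

churn_cost = (expected_downtime_h_per_year
              * churned_customers_per_downtime_h
              * cost_per_acquired_customer)            # 8 * 500 * 40 = 160,000

print(f"expected churn cost: ${churn_cost:,.0f} per year")
print(f"extra redundancy:    ${extra_redundancy_cost_per_year:,.0f} per year")
if extra_redundancy_cost_per_year < churn_cost:
    print("the redundancy pays for itself")
else:
    print("this much redundancy costs more than the churn it prevents")

Change the hypothetical inputs (a stickier customer base, a revenue hit per hour, a cheaper redundancy design) and the answer flips with them; running exactly this comparison is what separates the mature organizations described above from the ones guessing.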

Also, I don't think there is an acceptable level of downtime for water.
Neither do water utilities. 

- Dan



RE: FYI Netflix is down

2012-07-03 Thread Ryan Malayter
James Downs wrote:
 For Netflix (and all other similar
 services) downtime is money and money is downtime. There is a
 quantifiable cost for customer acquisition and a quantifiable churn
 during each minute of downtime. Mature organizations actually calculate
 and track this. The trick is to ensure that you have balanced the cost
 of greater redundancy vs the cost of churn/customer acquisition. If you
 are spending too much on redundancy, it's as big a mistake as spending
 too little.

Actually, for Netflix, so long as downtime is infrequent or short
enough that users don't cancel, it actually saves them money. They're
not paying royalties for movies being streamed during downtime, but
they're still collecting their $8/month. There is no meaningful SLA
for the end user to my knowledge.

I imagine the threshold for *any* user churn based on downtime is very
high for Netflix. So long as they are about as good as
cable/satellite TV in terms of uptime, Netflix will do fine. You would
have to get into 98% uptime or lower before people would really start
getting irritated enough to cancel. Of course multiple short outages
would be more painful than a few longer ones from a customer's
perspective.
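
For scale, here is what those percentages work out to over a 30-day month (simple arithmetic, not any provider's SLA):

HOURS_PER_MONTH = 30 * 24    # 720

for availability in (0.98, 0.99, 0.999, 0.9999):
    downtime_h = HOURS_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} uptime -> {downtime_h:.2f} hours down per month")

# 98.00% uptime -> 14.40 hours down per month
# 99.00% uptime -> 7.20 hours down per month
# 99.90% uptime -> 0.72 hours down per month
# 99.99% uptime -> 0.07 hours down per month

Fourteen-plus hours of dead air a month is the 98% figure mentioned above.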

I imagine Netflix is mature enough to track this data as you suggest,
and that's why they use AWS - downtime isn't a big deal for their
business unless it gets really, really bad.



Re: FYI Netflix is down

2012-07-03 Thread James Downs

On Jul 3, 2012, at 6:11 AM, Dan Golding wrote:

 Also, I don't think there is an acceptable level of downtime for water.
 Neither do water utilities. 

I remember a certain conversation I had with a web-developer. We were talking 
about zero downtime releases. He thought it was acceptable if the website 
went down for 15 minutes, because people will just come back. Naturally, he 
was not as forgiving about the idea that his bank might think the same way, or 
that I might provide DB or server uptimes with that kind of reliability.

Downtime will kill some companies, and not others. Twitter certainly survived 
their fail-whale period. But then, no one pays for twitter.

-j


Re: FYI Netflix is down

2012-07-03 Thread Rodrick Brown

On Jul 3, 2012, at 9:11 AM, Dan Golding dgold...@ragingwire.com wrote:

 -Original Message-
 From: James Downs [mailto:e...@egon.cc]
 
 
 On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:
 
 People are acting as if Netflix is part of some critical service; they
 stream movies, for Christ's sake.  Some acceptable level of loss is fine
 for 99.99% of Netflix's user base, just like cable, electricity and
 running water.  I suffer a few hours of losses each year from those
 services.  Does it suck?  Yes.  Is it the end of the world?  No.
 
 You missed the point.
 
  And very publicly missed the point, too. The Netflix issues led to a
 large discussion of downtime, testing, and fault tolerance that has been
 very useful for the community and could lead to some good content for
 NANOG conferences (/pokes PC). For Netflix (and all other similar
 services) downtime is money and money is downtime. There is a
 quantifiable cost for customer acquisition and a quantifiable churn
 during each minute of downtime. Mature organizations actually calculate
 and track this. The trick is to ensure that you have balanced the cost
 of greater redundancy vs the cost of churn/customer acquisition. If you
  are spending too much on redundancy, it's as big a mistake as spending
 too little. 

I totally got the point and the last bit of my post was just tongue in cheek. 

As I stated in my original response, it's very unrealistic to plan for every
possible failure scenario given the constraints most businesses face when 
implementing BCP today. I doubt Amazon gave much thought to multiple site 
outages and clients not being able to dynamically redeploy their engines 
because of inaccessibility from ELB.


 
 Also, I don't think there is an acceptable level of downtime for water.
 Neither do water utilities. 
 
 - Dan
 



Re: FYI Netflix is down

2012-07-03 Thread Rodrick Brown
On Jul 3, 2012, at 10:58 AM, Ryan Malayter malay...@gmail.com wrote:

 James Downs wrote:
 For Netflix (and all other similar
 services) downtime is money and money is downtime. There is a
 quantifiable cost for customer acquisition and a quantifiable churn
 during each minute of downtime. Mature organizations actually calculate
 and track this. The trick is to ensure that you have balanced the cost
 of greater redundancy vs the cost of churn/customer acquisition. If you
  are spending too much on redundancy, it's as big a mistake as spending
 too little.
 
 Actually, for Netflix, so long as downtime is infrequent or short
 enough that users don't cancel, it actually saves them money. They're
 not paying royalties for movies being streamed during downtime, but
 they're still collecting their $8/month. There is no meaningful SLA
 for the end user to my knowledge.
 
 I imagine the threshold for *any* user churn based on downtime is very
 high for Netflix. So long as they are about as good as
 cable/satellite TV in terms of uptime, Netflix will do fine. You would
 have to get into 98% uptime or lower before people would really start
 getting irritated enough to cancel. Of course multiple short outages
 would be more painful than a few longer ones from a customer's
 perspective.
 
 I imagine Netflix is mature enough to track this data as you suggest,
 and that's why they use AWS - downtime isn't a big deal for their
 business unless it gets really, really bad.
 

My thoughts exactly! 



Re: FYI Netflix is down

2012-07-03 Thread david raistrick

On Tue, 3 Jul 2012, Rodrick Brown wrote:

face when implementing BCP today. I doubt Amazon gave much thought to 
multiple site outages and clients not being able to dynamically redeploy 
their engines because of inaccessibility from ELB.


Considering there's a grand total of -one- tool in the entire AWS
toolkit that supports working across multiple regions at all sanely (that 
would be ec2-migrate-bundle, btw), I'd agree.   Amazon has put nearly zero 
thought into multiple site outages or how their customer base could 
leverage the multiple sites (regions) operated by AWS.




--
david raistrick        http://www.netmeister.org/news/learn2quote.html
dr...@icantclick.org



Re: FYI Netflix is down

2012-07-03 Thread Jon Lewis

On Mon, 2 Jul 2012, Greg D. Moore wrote:

As for pulling the plug to test stuff: I recall a demo at NetApp in the
early 00's.  They were talking about their fault tolerance and how great it
was.  So I walked up to their demo array and said, "So, it shouldn't be a
problem if I pulled this drive right here?"  Before I could, the salesperson
or tech guy (can't remember which) told me to stop.  He didn't want to risk it.


Lightweight.  Your story reminded me of this Sun ZFS demo.
http://www.youtube.com/watch?v=QGIwg6ye1gE

--
 Jon Lewis, MCP :)   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net|
_ http://www.lewis.org/~jlewis/pgp for PGP public key_



Re: FYI Netflix is down

2012-07-03 Thread Jon Lewis

On Mon, 2 Jul 2012, david raistrick wrote:


On Mon, 2 Jul 2012, James Downs wrote:

back-plane / control-plane was unable to cope with the requests.  Netflix 
uses Amazon's ELB to balance the traffic and no back-plane meant they were 
unable to reconfigure it to route around the problem.


Someone needs to define back-plane/control-plane in this case. (and what 
wasn't working)


Amazon resources are controlled (from a consumer viewpoint) by API - that API 
is also used by amazon's internal toolkits that support ELB (and RDS..). 
Those (http accessed) API interfaces were unavailable for a good portion of 
the outages.


It seems like if you're going to outsource your mission-critical
infrastructure to the cloud, you should probably pick at least 2 unrelated
cloud providers and, if at all possible, not outsource the systems that
balance/direct traffic...and if you're really serious about it, have at
least two of these set up at different facilities such that if the primary
goes offline, the secondary takes over.  If a cloud provider fails, you
redirect to another.


--
 Jon Lewis, MCP :)   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net|
_ http://www.lewis.org/~jlewis/pgp for PGP public key_



Re: FYI Netflix is down

2012-07-03 Thread Seth Mattinen
On 6/29/12 8:22 PM, Joe Blanchard wrote:
 Seems that they are unreachable at the moment. Called and there's a recorded
 message stating they are aware of an issue, no details.
 


I didn't see anyone post this yet, so here's Amazon's summary of events:

http://aws.amazon.com/message/67457/



Re: FYI Netflix is down

2012-07-03 Thread Jay Ashworth
- Original Message -
 From: Steven Bellovin s...@cs.columbia.edu

 Subject: Re: FYI Netflix is down
 On Jul 2, 2012, at 3:43 PM, Greg D. Moore wrote:
 
  At 03:08 PM 7/2/2012, George Herbert wrote:
 
  If folks have not read it, I would suggest reading Normal Accidents
  by Charles Perrow.
 
 Strong second to that suggestion.

Quite unfortunately, that book appears not to be in Safari's library.

Does anyone here know anyone at Safari?

Cheers,
-- jra
-- 
Jay R. Ashworth  Baylink   j...@baylink.com
Designer The Things I Think   RFC 2100
Ashworth  Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA  http://photo.imageinc.us +1 727 647 1274



Re: FYI Netflix is down

2012-07-03 Thread George Herbert


On Jul 3, 2012, at 10:38 AM, Jay Ashworth j...@baylink.com wrote:

 - Original Message -
 From: Steven Bellovin s...@cs.columbia.edu
 
 Subject: Re: FYI Netflix is down
 On Jul 2, 2012, at 3:43 PM, Greg D. Moore wrote:
 
 At 03:08 PM 7/2/2012, George Herbert wrote:
 
 If folks have not read it, I would suggest reading Normal Accidents
 by Charles Perrow.
 
 Strong second to that suggestion.
 
 Quite unfortunately, that book appears not to be in Safari's library.
 
 Does anyone here know anyone at Safari?


Not the Safari division, but ORA yes, others at my company do.  Will forward 
the suggestion.

George William Herbert
Sent from my iPhone



Re: FYI Netflix is down

2012-07-03 Thread Ryan Malayter
Jon Lewis wrote:
 It seems like if you're going to outsource your mission critical
 infrastructure to cloud you should probably pick at least 2
 unrelated cloud providers and if at all possible, not outsource the
 systems that balance/direct traffic...and if you're really serious
 about it, have at least two of these setup at different facilities
 such that if the primary goes offline, the secondary takes over. If a
 cloud provider fails, you redirect to another.

Really, you need at least three independent providers. One primary
(A), one backup (B), and one witness to monitor the others for
failure. The witness site can of course be low-powered, as it is not
in the data plane of the applications, but just participates in the
control plane. In the event of a loss of communication, the majority
clique wins, and the isolated environments shut themselves down. This
is of course how any sane clustering setup has protected against
split brain scenarios for decades.
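
A minimal sketch of that witness/quorum arrangement, using hypothetical site names and a canned reachability snapshot in place of real health probes:

SITES = ["provider-a", "provider-b", "witness"]       # hypothetical names

# Hypothetical snapshot: provider-b has lost contact with everyone else.
REACHABLE = {
    ("provider-a", "witness"): True,
    ("witness", "provider-a"): True,
    ("provider-a", "provider-b"): False,
    ("provider-b", "provider-a"): False,
    ("provider-b", "witness"): False,
    ("witness", "provider-b"): False,
}

def should_stay_active(me):
    """Serve only while this site can see a strict majority of the membership."""
    visible = 1 + sum(REACHABLE.get((me, peer), False)
                      for peer in SITES if peer != me)
    return visible > len(SITES) // 2                  # e.g. 2 of 3

for site in SITES:
    verdict = "stays active" if should_stay_active(site) else "shuts itself down"
    print(f"{site}: {verdict}")

provider-a and the witness still see each other (2 of 3) and keep serving; the isolated provider-b sees only itself and steps down, which is the split-brain guard described above.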

Doing it the right way makes the cloud far less cost-effective and far
less agile. Once you get it all set up just so, change becomes very
difficult. All the monitoring and fail-over/fail-back operations are
generally application-specific and provider-specific, so there's a lot
of lock-in. Tools like RightScale are a step in the right direction,
but don't really touch the application layer. You also have to worry
about the availability of yet another provider!
-- 
RPM



Re: FYI Netflix is down

2012-07-03 Thread Randy Bush
 Also, I don't think there is an acceptable level of downtime for
 water.

coming soon to a planet near you

randy



RE: FYI Netflix is down

2012-07-02 Thread Dan Golding
 -Original Message-
 From: Todd Underwood [mailto:toddun...@gmail.com]
 
 scott,
 
 
  This was not a cascading failure.  It was a simple power outage

Actually, it was a very complex power outage. I'm going to assume that what 
happened this weekend was similar to the event that happened at the same 
facility approximately two weeks ago (it's immaterial - the details are probably
different, but it illustrates the complexity of a data center failure)

Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in 
faulty trip of a breaker)

In this case, it was clearly a cascading failure, although only limited in 
scope. The failure in this case, also clearly involved people. There was one 
material failure (the fan), but the system should have been resilient enough to 
deal with it. The system should also have been resilient enough to deal with 
the breaker coordination issue (which should not have occurred), but was not. 
Data centers are not commodities. There is a way to engineer these facilities 
to be much more resilient. Not everyone's business model supports it.

- Dan


 
  Cascading failures involve interdependencies among components.
 
 
  Not always.  Cascading failures can also occur when there is zero
  dependency between components.  The simplest form of this is where
 one
  environment fails over to another, but the target environment is not
  capable of handling the additional load and then fails itself as a
  result (in some form or other, but frequently different to the mode
 of the original failure).
 
 indeed.  and that is an interdependency among components.  in
 particular, it is a capacity interdependency.
 
  Whilst the Amazon outage might have been a simple power outage,
 it's
  likely that at least some of the website outages caused were a
  combination of not just the direct Amazon outage, but also the flow-
 on
  effect of their redundancy attempting (but failing) to kick in -
  potentially making the problem worse than just the Amazon outage
 caused.
 
 i think you over-estimate these websites.  most of them simply have no
 redundancy (and obviously have no tested, effective redundancy) and
 were simply hoping that amazon didn't really go down that much.
 
 hope is not the best strategy, as it turns out.
 
 i suspect that randy is right though:  many of these businesses do not
 promise perfect uptime and can survive these kinds of failures with
 little loss to business or reputation.  twitter has branded its early
 failures with a whale that not only didn't hurt it but helped endear the
 service to millions.  when your service fits these criteria, why would
 you bother doing the complicated systems and application engineering
 necessary to actually have functional redundancy?
 
 it simply isn't worth it.
 
 t
 
 
    Scott



Re: FYI Netflix is down

2012-07-02 Thread Todd Underwood
 Actually, it was a very complex power outage. I'm going to assume that what 
 happened this weekend was similar to the event that happened at the same 
 facility approximately two weeks ago (it's immaterial - the details are
 probably different, but it illustrates the complexity of a data center 
 failure)

 Utility Power Failed
 First Backup Generator Failed (shut down due to a faulty fan)
 Second Backup Generator Failed (breaker coordination problem resulting in 
 faulty trip of a breaker)

 In this case, it was clearly a cascading failure, although only limited in 
 scope. The failure in this case, also clearly involved people. There was one 
 material failure (the fan), but the system should have been resilient enough 
 to deal with it. The system should also have been resilient enough to deal 
 with the breaker coordination issue (which should not have occurred), but was 
 not. Data centers are not commodities. There is a way to engineer these 
 facilities to be much more resilient. Not everyone's business model supports 
 it.

ok, i give in.  at some level of granularity everything is a cascading
failure (since molecules collide and the world is an infinite chain of
causation in which human free will is merely a myth /Spinoza)

of course, this use of 'cascading' is vacuous and not useful anymore
since it applies to nearly every failure, but i'll go along with it.

from the perspective of a datacenter power engineer, this was a
cascading failure of a few small number of components.

from the perspective of every datacenter customer:  this was a power failure.

from the perspective of people watching B-rate movies:  this was a
failure to implement and test a reliable system for streaming those
movies in the face of a power outage at one facility.

from the perspective of nanog mailing list readers:  this was an
interesting opportunity to speculate about failures about which we
have no data (as usual!).

can we all agree on those facts?

:-)

t



Re: FYI Netflix is down

2012-07-02 Thread AP NANOG
While I was working for a wireless telecom company, our primary
datacenter was knocked off the power grid due to weather. The generators
kicked on and everything was fine, till one generator was struck by
lightning and that same strike fried the control panel on the second
one.  Considering the second generator had no control panel, we had no
means of monitoring it for temp, fuel, input voltage (when it came
back), output voltage, surge protection, or ultimately whether the
generator spiked to full voltage due to a regulator failure.  Needless
to say, we had to shut the second generator down for safety reasons.


While in the military I saw many generators struck by lightning as well.

I'm not saying Amazon was not at fault here, but I can see where this is
possible and happens more frequently than one might think.


I hate to play devil's advocate here, but you as the customer should
always have backups to your backups, and practice these fail-overs on a
regular basis.  Otherwise you are at fault here, no one else...


--

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 11:01 AM, Dan Golding wrote:

-Original Message-
From: Todd Underwood [mailto:toddun...@gmail.com]

scott,


This was not a cascading failure.  It was a simple power outage

Actually, it was a very complex power outage. I'm going to assume that what 
happened this weekend was similar to the event that happened at the same 
facility approximately two weeks ago (it's immaterial - the details are probably
different, but it illustrates the complexity of a data center failure)

Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in 
faulty trip of a breaker)

In this case, it was clearly a cascading failure, although only limited in 
scope. The failure in this case, also clearly involved people. There was one 
material failure (the fan), but the system should have been resilient enough to 
deal with it. The system should also have been resilient enough to deal with 
the breaker coordination issue (which should not have occurred), but was not. 
Data centers are not commodities. There is a way to engineer these facilities 
to be much more resilient. Not everyone's business model supports it.

- Dan



Cascading failures involve interdependencies among components.


Not always.  Cascading failures can also occur when there is zero
dependency between components.  The simplest form of this is where

one

environment fails over to another, but the target environment is not
capable of handling the additional load and then fails itself as a
result (in some form or other, but frequently different to the mode

of the original failure).

indeed.  and that is an interdependency among components.  in
particular, it is a capacity interdependency.


Whilst the Amazon outage might have been a simple power outage,

it's

likely that at least some of the website outages caused were a
combination of not just the direct Amazon outage, but also the flow-

on

effect of their redundancy attempting (but failing) to kick in -
potentially making the problem worse than just the Amazon outage

caused.

i think you over-estimate these websites.  most of them simply have no
redundancy (and obviously have no tested, effective redundancy) and
were simply hoping that amazon didn't really go down that much.

hope is not the best strategy, as it turns out.

i suspect that randy is right though:  many of these businesses do not
promise perfect uptime and can survive these kinds of failures with
little loss to business or reputation.  twitter has branded its early
failures with a whale that not only didn't hurt it but helped endear the
service to millions.  when your service fits these criteria, why would
you bother doing the complicated systems and application engineering
necessary to actually have functional redundancy?

it simply isn't worth it.

t


   Scott




Re: FYI Netflix is down

2012-07-02 Thread Leo Bicknell
In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood 
wrote:
 from the perspective of people watching B-rate movies:  this was a
 failure to implement and test a reliable system for streaming those
 movies in the face of a power outage at one facility.

I want to emphasize _and test_.

Work on an infrastructure which is redundant and designed to provide
100% uptime (which is impossible, but that's another story) means
that there should be confidence in a failure being automatically
worked around, detected, and reported.

I used to work with a guy who had a simple test for these things,
and if I was a VP at Amazon, Netflix, or any other large company I
would do the same.  About once a month he would walk out on the
floor of the data center and break something.  Pull out an ethernet.
Unplug a server.  Flip a breaker.

Then he would wait, to see how long before a technician came to fix
it.

If these activities were service impacting to customers, the engineering
or implementation was faulty, and remediation was performed.  Assuming
they acted as designed and the customers saw no faults, the team was
graded on how quickly they detected and corrected the outage.

I've seen too many companies whose test is planned months in advance,
and who exclude the parts they think aren't up to scratch from the test.
Then an event occurs, and they fail, and take down customers.

TL;DR If you're not confident your operation could withstand someone
walking into your data center and randomly doing something, you are
NOT redundant.
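
A bare-bones software analogue of that walk-the-floor drill might look like the sketch below; the target list and the terminate() helper are hypothetical stand-ins for whatever your environment actually exposes (a PDU outlet, a hypervisor API, a switch port):

import random
import time

TARGETS = ["web-01", "web-02", "db-01", "lb-01"]    # hypothetical inventory

def terminate(target):
    """Placeholder for actually pulling the (virtual) plug on the target."""
    print(f"breaking {target}, go watch the monitoring")

victim = random.choice(TARGETS)
injected_at = time.strftime("%Y-%m-%d %H:%M:%S")
terminate(victim)
print(f"fault injected into {victim} at {injected_at}; the clock is running")

The interesting output isn't the script's; it's how long it takes the on-call team, and the customers, to notice.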

-- 
   Leo Bicknell - bickn...@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/




Re: FYI Netflix is down

2012-07-02 Thread david raistrick

On Mon, 2 Jul 2012, Leo Bicknell wrote:


I used to work with a guy who had a simple test for these things,
and if I was a VP at Amazon, Netflix, or any other large company I
would do the same.  About once a month he would walk out on the


you mean like this?

http://techblog.netflix.com/2011/07/netflix-simian-army.html



--
david raistrick        http://www.netmeister.org/news/learn2quote.html
dr...@icantclick.org



Re: FYI Netflix is down

2012-07-02 Thread Leo Bicknell
In a message written on Mon, Jul 02, 2012 at 12:13:22PM -0400, david raistrick 
wrote:
 you mean like this?
 
 http://techblog.netflix.com/2011/07/netflix-simian-army.html

Yes, Netflix seems to get it, and I think their Simian Army is a
great QA tool.  However, it is not a complete testing system; I
have never seen them talk about testing non-software components,
and I hope they do that as well.  As we saw in the previous Amazon
outage, part of the problem was a circuit breaker configuration.

-- 
   Leo Bicknell - bickn...@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/




Re: FYI Netflix is down

2012-07-02 Thread david raistrick

On Mon, 2 Jul 2012, Leo Bicknell wrote:


http://techblog.netflix.com/2011/07/netflix-simian-army.html


Yes, Netflix seems to get it, and I think their Simian Army is a
great QA tool.  However, it is not a complete testing system, I
have never seen them talk about testing non-software components,
and I hope they do that as well.  As we saw in the previous Amazon
outage, part of the problem was a circuit breaker configuration.



When the hardware is outsourced how would you propose testing the 
non-software components?  They do simulate availability zone issues (and 
AZ is as close as you get to controlling which internal power/network/etc 
grid you're attached to).


I suppose they could introduce artificial network latency/loss @ each 
instance - and could add testing around what happens when amazon's API 
disappears (as was the case friday).


Beyond that, the rest of it is up to the hardware provider (Amazon, in
this case).


..david (who also relies on outsourced hardware these days)



--
david raistrick        http://www.netmeister.org/news/learn2quote.html
dr...@icantclick.org



Re: FYI Netflix is down

2012-07-02 Thread AP NANOG
This is an excellent example of how tests should be run; unfortunately,
far too many places don't do this...


--

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 12:09 PM, Leo Bicknell wrote:

In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood 
wrote:

from the perspective of people watching B-rate movies:  this was a
failure to implement and test a reliable system for streaming those
movies in the face of a power outage at one facility.

I want to emphasize _and test_.

Work on an infrastructure which is redundant and designed to provide
100% uptime (which is impossible, but that's another story) means
that there should be confidence in a failure being automatically
worked around, detected, and reported.

I used to work with a guy who had a simple test for these things,
and if I was a VP at Amazon, Netflix, or any other large company I
would do the same.  About once a month he would walk out on the
floor of the data center and break something.  Pull out an ethernet.
Unplug a server.  Flip a breaker.

Then he would wait, to see how long before a technician came to fix
it.

If these activities were service impacting to customers the engineering
or implementation was faulty, and remediation was performed.  Assuming
they acted as designed and the customers saw no faults the team was
graded on how quickly they detected and corrected the outage.

I've seen too many companies whose test is planned months in advance,
and who exclude the parts they think aren't up to scratch from the test.
Then an event occurs, and they fail, and take down customers.

TL;DR If you're not confident your operation could withstand someone
walking into your data center and randomly doing something, you are
NOT redundant.





Re: FYI Netflix is down

2012-07-02 Thread Grant Ridder
The problem is large scale tests take a lot of time and planning.  For it
to be done right, you really need a dedicated DR team.

-Grant

On Mon, Jul 2, 2012 at 11:31 AM, AP NANOG na...@armoredpackets.com wrote:

 This is an excellent example of how tests should be run; unfortunately,
 far too many places don't do this...


 --

 Thank you,

 Robert Miller
 http://www.armoredpackets.com

 Twitter: @arch3angel

 On 7/2/12 12:09 PM, Leo Bicknell wrote:

 In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd
 Underwood wrote:

 from the perspective of people watching B-rate movies:  this was a
 failure to implement and test a reliable system for streaming those
 movies in the face of a power outage at one facility.

 I want to emphasize _and test_.

 Work on an infrastructure which is redundant and designed to provide
 100% uptime (which is impossible, but that's another story) means
 that there should be confidence in a failure being automatically
 worked around, detected, and reported.

 I used to work with a guy who had a simple test for these things,
 and if I was a VP at Amazon, Netflix, or any other large company I
 would do the same.  About once a month he would walk out on the
 floor of the data center and break something.  Pull out an ethernet.
 Unplug a server.  Flip a breaker.

 Then he would wait, to see how long before a technician came to fix
 it.

 If these activities were service impacting to customers the engineering
 or implementation was faulty, and remediation was performed.  Assuming
 they acted as designed and the customers saw no faults the team was
 graded on how quickly they detected and corrected the outage.

 I've seen too many companies whose test is planned months in advance,
 and who exclude the parts they think aren't up to scratch from the test.
 Then an event occurs, and they fail, and take down customers.

 TL;DR If you're not confident your operation could withstand someone
 walking into your data center and randomly doing something, you are
 NOT redundant.





Re: FYI Netflix is down

2012-07-02 Thread Leo Bicknell
In a message written on Mon, Jul 02, 2012 at 12:23:57PM -0400, david raistrick 
wrote:
 When the hardware is outsourced how would you propose testing the 
 non-software components?  They do simulate availability zone issues (and 
 AZ is as close as you get to controlling which internal power/network/etc 
 grid you're attached to).

Find a provider with a similar methodology.  Perhaps Netflix never
conducts a power test, but their colo vendor would perform such
testing.

If no colo providers exist that share their values on testing, that
may be a sign that outsourcing it isn't the right answer...

-- 
   Leo Bicknell - bickn...@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/




Re: FYI Netflix is down

2012-07-02 Thread Cameron Byrne
On Jul 2, 2012 10:53 AM, Leo Bicknell bickn...@ufp.org wrote:

 In a message written on Mon, Jul 02, 2012 at 12:23:57PM -0400, david
raistrick wrote:
  When the hardware is outsourced how would you propose testing the
  non-software components?  They do simulate availability zone issues (and
  AZ is as close as you get to controlling which internal
power/network/etc
  grid you're attached to).

 Find a provider with a similar methodology.  Perhaps Netflix never
 conducts a power test, but their colo vendor would perform such
 testing.

 If no colo providers exist that share their values on testing, that
 may be a sign that outsourcing it isn't the right answer...

 --
Leo Bicknell - bickn...@ufp.org - CCIE 3440
 PGP keys at http://www.ufp.org/~bicknell/

I suggest using RAIC

Redundant array of inexpensive clouds.

Make your chaos animal go after sites and regions instead of individual VMs.

CB


Re: FYI Netflix is down

2012-07-02 Thread James Downs

On Jul 2, 2012, at 9:23 AM, david raistrick wrote:

 When the hardware is outsourced how would you propose testing the 
 non-software components?  They do simulate availability zone issues (and AZ 
 is as close as you get to controlling which internal power/network/etc grid 
 you're attached to).

We all know what netflix *says* they do, but they *did* have an outage.

-j


Re: FYI Netflix is down

2012-07-02 Thread Tony McCrory
On 2 July 2012 19:20, Cameron Byrne cb.li...@gmail.com wrote:


 Make your chaos animal go after sites and regions instead of individual
 VMs.

 CB


From a previous post mortem
http://techblog.netflix.com/2011_04_01_archive.html


Create More Failures
Currently, Netflix uses a service called Chaos Monkey
(http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)
to simulate service failure. Basically, Chaos Monkey is a service that
kills other services. We run this service because we want engineering teams
to be used to a constant level of failure in the cloud. Services should
automatically recover without any manual intervention. We don't however,
simulate what happens when an entire AZ goes down and therefore we haven't
engineered our systems to automatically deal with those sorts of failures.
Internally we are having discussions about doing that and people are
already starting to call this service Chaos Gorilla.
**

It would seem the Gorilla hasn't quite matured.

Tony


Re: FYI Netflix is down

2012-07-02 Thread Paul Graydon

On 07/02/2012 08:53 AM, Tony McCrory wrote:

On 2 July 2012 19:20, Cameron Byrne cb.li...@gmail.com wrote:


Make your chaos animal go after sites and regions instead of individual
VMs.

CB


 From a previous post mortem
http://techblog.netflix.com/2011_04_01_archive.html


Create More Failures
Currently, Netflix uses a service called Chaos Monkey
(http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)
to simulate service failure. Basically, Chaos Monkey is a service that
kills other services. We run this service because we want engineering teams
to be used to a constant level of failure in the cloud. Services should
automatically recover without any manual intervention. We don't however,
simulate what happens when an entire AZ goes down and therefore we haven't
engineered our systems to automatically deal with those sorts of failures.
Internally we are having discussions about doing that and people are
already starting to call this service Chaos Gorilla.
**

It would seem the Gorilla hasn't quite matured.

Tony
From conversations with Adrian Cockcroft this weekend, it wasn't the
result of Chaos Gorilla or Chaos Monkey failing to prepare them
adequately.  All their automated stuff worked perfectly; the
infrastructure tried to self-heal.  The problem was that yet again
Amazon's back-plane / control-plane was unable to cope with the
requests.  Netflix uses Amazon's ELB to balance the traffic, and no
back-plane meant they were unable to reconfigure it to route around the
problem.


Paul



RE: FYI Netflix is down

2012-07-02 Thread Dan Golding


 -Original Message-
 From: Leo Bicknell [mailto:bickn...@ufp.org]
 

 
 I want to emphasize _and test_.
[snip]
 
 I used to work with a guy who had a simple test for these things, and
 if I was a VP at Amazon, Netflix, or any other large company I would
do
 the same.  About once a month he would walk out on the floor of the
 data center and break something.  Pull out an ethernet.
 Unplug a server.  Flip a breaker.
 

*DING DING* - we have a winner! In a previous life, I used to spend a
lot of time in other people's data centers. The key question to ask was
how often they pulled the plug - i.e. disconnected utility power without
having backup generators running. Simulating an actual failure. That
goes for pulling out an Ethernet cord or unplugging a server, or
flipping a breaker. It's all the same. The problem is that if you don't
do this for a while, you get SCARED of doing it, and you stop doing it.
The longer you go without, the scarier it gets, to the point where you
will never do it, because you have no idea what will happen, other than
that you will probably get fired. This is called horrible engineering
management, and is very common.

The other problem, of course, is that people design under the assumption
that everything will always work, and that failure modes, when they
occur, are predictable and fall into a narrow set. Multiple failure
modes? Not tested. Failure modes including operator error? Never tested.


When was the last time you had a drill?

- Dan


 Then he would wait, to see how long before a technician came to fix
it.
 
 If these activities were service impacting to customers the
engineering
 or implementation was faulty, and remediation was performed.  Assuming
 they acted as designed and the customers saw no faults the team was
 graded on how quickly the detected and corrected the outage.
 
 I've seen too many companies who's test is planned months in
advance,
 and who exclude the parts they think aren't up to scratch from the
 test.
 Then an event occurs, and they fail, and take down customers.
 
 TL;DR If you're not confident your operation could withstand someone
 walking into your data center and randomly doing something, you are
NOT
 redundant.
 
 --
Leo Bicknell - bickn...@ufp.org - CCIE 3440
 PGP keys at http://www.ufp.org/~bicknell/



Re: FYI Netflix is down

2012-07-02 Thread AP NANOG
I believe in my dictionary "Chaos Gorilla" translates into "Time To Go
Home", with a rough definition of "Everything just crapped out - the
world is ending"; but then again I may have that incorrect :-)


--

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 2:59 PM, Paul Graydon wrote:

On 07/02/2012 08:53 AM, Tony McCrory wrote:

On 2 July 2012 19:20, Cameron Byrne cb.li...@gmail.com wrote:


Make your chaos animal go after sites and regions instead of individual
VMs.

CB


 From a previous post mortem
http://techblog.netflix.com/2011_04_01_archive.html


Create More Failures
Currently, Netflix uses a service called Chaos Monkey
(http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)


to simulate service failure. Basically, Chaos Monkey is a service that
kills other services. We run this service because we want engineering 
teams

to be used to a constant level of failure in the cloud. Services should
automatically recover without any manual intervention. We don't however,
simulate what happens when an entire AZ goes down and therefore we 
haven't
engineered our systems to automatically deal with those sorts of 
failures.

Internally we are having discussions about doing that and people are
already starting to call this service Chaos Gorilla.
**

It would seem the Gorilla hasn't quite matured.

Tony
From conversations with Adrian Cockcroft this weekend it wasn't the 
result of Chaos Gorilla or Chaos Monkey failing to prepare them 
adequately.  All their automated stuff worked perfectly, the 
infrastructure tried to self heal.  The problem was that yet again 
Amazon's back-plane / control-plane was unable to cope with the 
requests.  Netflix uses Amazon's ELB to balance the traffic and no 
back-plane meant they were unable to reconfigure it to route around 
the problem.


Paul







Re: FYI Netflix is down

2012-07-02 Thread Joly MacFie
Good band name.

  Chaos Gorilla

-- 
---
Joly MacFie  218 565 9365 Skype:punkcast
WWWhatsup NYC - http://wwwhatsup.com
 http://pinstand.com - http://punkcast.com
 VP (Admin) - ISOC-NY - http://isoc-ny.org
--
-



Re: FYI Netflix is down

2012-07-02 Thread Greg D. Moore

At 03:08 PM 7/2/2012, George Herbert wrote:

If folks have not read it, I would suggest reading Normal Accidents 
by Charles Perrow.


The "it can't happen" is almost guaranteed to happen. ;-)  And when
it does, it'll often interact in ways we can't predict or sometimes 
even understand.


As for pulling the plug to test stuff: I recall a demo at NetApp in
the early 00's.  They were talking about their fault tolerance and
how great it was.  So I walked up to their demo array and said, "So,
it shouldn't be a problem if I pulled this drive right here?"  Before
I could, the salesperson or tech guy (can't remember which) told me to
stop.  He didn't want to risk it.


That right there said loads about their confidence in their own system.




Late reply, but:

On Sat, Jun 30, 2012 at 12:30 AM, Lynda shr...@deaddrop.org wrote:
...
 Second, and more important. I *was* a computer science guy in a 
past life,

 and this is nonsense. You can have astonishingly large software projects
 that just continue to run smoothly, day in, day out, and they don't hit the
 news, because they don't break. There are data centers that don't hit the
 news, in precisely the same way.

I really need to write the book on IT reliability I keep meaning to.

There's reliability - backwards looking statistical, which can be 100%
for a given service or datacenter - and then there's dependability,
forwards-predicted outage risks, which people often *assert* equals
the prior reliability record, but in reality you often have a number
of latent failures (and latent cascade paths) that you do not
understand, did not identify previously, and are not aware of.

I've had or had to respond to over a billion dollars of cumulative IT
disaster loss over my consulting career so far; I have NEVER seen
anyone who did it perfect, even the best pros.  And I include myself
in that list.

Looking at other fields like aerospace and nuclear engineering, what
is done in IT is not anywhere close to the same level of QA and
engineering analysis and testing.  We cannot assert better results
with less work.

Oh, that never happens, except I've had my stuff in three locations
that had catastrophic generator failures.  Oh, that never happens
when you're doing power maintenance and the best-rated electrical
company in California, in conjunction with the generator vendor and a
couple of independent power EEs, mis-balance the maintenance generator
loads between legs and blow the generators and datacenter.  Oh, that
never happens that the datacenter burns (or starts to burn and then
gets flooded).  Oh, that never happens that the FM-200 goes off or
preaction breaks and water leaks.  Oh, that never happens that well
maintained and monitored and triple-redundant AC units all trip
offline due to a common mode failure over the course of a weekend and
the room gets up to 106 degrees.  Oh thank god the next thing didn't
go wrong in THAT situation, because the spot temperature meters
indicated that the ceiling height of that particular room peaked at 1
degree short of the temp at which the sprinkler heads are supposed to
discharge, so we nearly lost that room to flooding rather than just a
10% disk and 15% power supply attrition over the next year...

Don't be so confident in the infrastructure.  It's not engineered or
built or maintained well enough to actually support that assertion.
The same can be said of the application software and application
architecture and integration.


--
-george william herbert
george.herb...@gmail.com


Greg D. 
Moore 
http://greenmountainsoftware.wordpress.com/

CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net






Re: FYI Netflix is down

2012-07-02 Thread david raistrick

On Mon, 2 Jul 2012, James Downs wrote:


back-plane / control-plane was unable to cope with the requests.  Netflix uses 
Amazon's ELB to balance the traffic and no back-plane meant they were unable to 
reconfigure it to route around the problem.


Someone needs to define back-plane/control-plane in this case. (and what 
wasn't working)


Amazon resources are controlled (from a consumer viewpoint) by API - that 
API is also used by amazon's internal toolkits that support ELB (and 
RDS..).   Those (http accessed) API interfaces were unavailable for a good 
portion of the outages.


I know nothing of the netflix side of it - but that's what -we- saw. (and 
that caused all us-east RDS instances in every AZ to appear offline..)




--
david raistrick    http://www.netmeister.org/news/learn2quote.html
dr...@icantclick.org



Re: FYI Netflix is down

2012-07-02 Thread Brett Frankenberger
On Mon, Jul 02, 2012 at 09:09:09AM -0700, Leo Bicknell wrote:
 In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood 
 wrote:
  from the perspective of people watching B-rate movies:  this was a
  failure to implement and test a reliable system for streaming those
  movies in the face of a power outage at one facility.
 
 I want to emphasize _and test_.
 
 Work on an infrastructure which is redundant and designed to provide
 100% uptime (which is impossible, but that's another story) means
 that there should be confidence in a failure being automatically
 worked around, detected, and reported.
 
 I used to work with a guy who had a simple test for these things,
 and if I was a VP at Amazon, Netflix, or any other large company I
 would do the same.  About once a month he would walk out on the
 floor of the data center and break something.  Pull out an ethernet.
 Unplug a server.  Flip a breaker.

Sounds like something a VP would do.  And, actually, it's an important
step: make sure the easy failures are covered. 

But it's really a very small part of resilience.  What happens when one
instance of a shared service starts performing slowly?  What happens
when one instance of a redundant database starts timing out queries or
returning empty result sets?  What happens when the Ethernet interface
starts dropping 10% of the packets across it?  What happens when the
Ethernet switch linecard locks up and stops passing dataplane traffic,
but link (physical layer) and/or control plane traffic flows just fine? 
What happens when the server kernel panics due to bad memory, reboots,
gets all the way up, runs for 30 seconds, kernel panics, lather, rinse,
repeat.

Reliability is hard.  And if you stop looking once you get to the point
where you can safely toggle the power switch without causing an impact,
you're only a very small part of the way there.  

 -- Brett



RE: FYI Netflix is down

2012-07-02 Thread Dan Golding

 -Original Message-
 From: Greg D. Moore [mailto:moor...@greenms.com]
 
 
 If folks have not read it, I would suggest reading Normal Accidents by
 Charles Perrow.
 

Also, Human Error by James Reason.

 




Re: FYI Netflix is down

2012-07-02 Thread George Herbert
On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore moor...@greenms.com wrote:
 At 03:08 PM 7/2/2012, George Herbert wrote:

 If folks have not read it, I would suggest reading Normal Accidents by
 Charles Perrow.

 The it can't happen is almost guaranteed to happen. ;-)  And when it does,
 it'll often interact in ways we can't predict or sometimes even understand.

Seconded.

There are also aerospace and nuclear and failure analysis books which
are good, but I often encourage people to start with that one.

 As for pulling the plug to test stuff. I recall a demo at Netapps in the
 early 00's.  They were talking about their fault tolerance and how great it
 was.  So I walked up to their demo array and said, So, it shouldn't be a
 problem if I pulled this drive right here?  Before I could the salesperson
 or tech guy, can't remember,  told me to stop.  He didn't want to risk it.

 That right there said loads about their confidence in their own system.

I worked for a Sun clone vendor (Axil) for a while and took some of
our systems and storage to Comdex one year in the 90s.  We had a RAID
unit (Mylex controller) we had just introduced.  Beforehand, I made
REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
tricks worked.  And showed them to people with the "Please keep in
mind that this voids the warranty, but here we *rip* go"  All of
the other server vendors were giving me dirty looks for that one.
Apparently I sold a few systems that way.

You have to watch for connector wear-out and things like that, but ...

All the clusters I've built, I've insisted on a burn-in time plug pull
test on all the major components.  We caught things with those from
time to time.  Especially with N+1, if it is really N+0 due to a bug
or flaw you need to know that...


-- 
-george william herbert
george.herb...@gmail.com



Re: FYI Netflix is down

2012-07-02 Thread Greg D. Moore

At 05:04 PM 7/2/2012, George Herbert wrote:

On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore moor...@greenms.com wrote:
 At 03:08 PM 7/2/2012, George Herbert wrote:

 If folks have not read it, I would suggest reading Normal Accidents by
 Charles Perrow.

 The it can't happen is almost guaranteed to happen. ;-)  And 
when it does,

 it'll often interact in ways we can't predict or sometimes even understand.

Seconded.



I figured you had probably read it. :-)



There are also aerospace and nuclear and failure analysis books which
are good, but I often encourage people to start with that one.

 As for pulling the plug to test stuff. I recall a demo at Netapps in the
 early 00's.  They were talking about their fault tolerance and how great it
 was.  So I walked up to their demo array and said, So, it shouldn't be a
 problem if I pulled this drive right here?  Before I could the salesperson
 or tech guy, can't remember,  told me to stop.  He didn't want to risk it.

 That right there said loads about their confidence in their own system.

I worked for a Sun clone vendor (Axil) for a while and took some of
our systems and storage to Comdex one year in the 90s.  We had a RAID
unit (Mylex controller) we had just introduced.  Beforehand, I made
REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
tricks worked.  And showed them to people with the Please keep in
mind that this voids the warranty, but here we *rip* go  All of
the other server vendors were giving me dirty looks for that one.
Apparently I sold a few systems that way.



I can imagine. Back when we were testing a cluster from MicronPC, the 
techs were in our office and they encouraged us to do that.  It was 
re-assuring.




You have to watch for connector wear-out and things like that, but ...

All the clusters I've built, I've insisted on a burn-in time plug pull
test on all the major components.  We caught things with those from
time to time.  Especially with N+1, if it is really N+0 due to a bug
or flaw you need to know that...



About 7 years back, we were about to move a production platform to a 
cluster+SAN that an outside vendor had installed.  I was brought in 
at the last minute to lead the project.  Before we did the move, I 
said, "Umm, has anyone tried a remote reboot of the servers?"


"Oh, they rebooted fine when we were at the datacenter with the 
vendor.  We're good."


I repeated my question and finally did the old, "Ok, I know I'm being 
a pain, but please, let's just try it once, remotely, before we're 
committed."  So we rebooted, and waited, and waited, and waited.


It took a trip out to the datacenter (we couldn't afford good remote 
KVM tools back then) to see that the server was trying to mount stuff 
off of something on the network.  At first we couldn't figure out 
what it was.  Finally realized it was looking for files on the 
vendor's laptop.  So of course it had worked fine when the vendor was 
at the datacenter.


Despite all that, the vendor still denied it being their problem.

Anyway, enough reminiscing.  Things happen.  We can only do so much 
to prevent them, and never assume.






--
-george william herbert
george.herb...@gmail.com


Greg D. 
Moore 
http://greenmountainsoftware.wordpress.com/

CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net






Re: FYI Netflix is down

2012-07-02 Thread Steven Bellovin

On Jul 2, 2012, at 3:43 PM, Greg D. Moore wrote:

 At 03:08 PM 7/2/2012, George Herbert wrote:
 
 If folks have not read it, I would suggest reading Normal Accidents by 
 Charles Perrow.


Strong second to that suggestion.

--Steve Bellovin, https://www.cs.columbia.edu/~smb








Re: FYI Netflix is down

2012-07-02 Thread James Downs

On Jul 2, 2012, at 1:20 PM, david raistrick wrote:

 Amazon resources are controlled (from a consumer viewpoint) by API - that API 
 is also used by amazon's internal toolkits that support ELB (and RDS..).   
 Those (http accessed) API interfaces were unavailable for a good portion of 
 the outages.

Right, and other toolkits like boto. Each AZ has a different endpoint (url), 
and as I have no resources running in East, I saw no problems with the API 
endpoints I use. So, as you note, US-EAST Region was not controllable.

 I know nothing of the netflix side of it - but that's what -we- saw. (and 
 that caused all us-east RDS instances in every AZ to appear 


And, if you lose US-EAST, you need to run *somewhere*. Netflix did not cutover 
www.netflix.com to another Region. Why not is another question.

-j


Re: FYI Netflix is down

2012-07-02 Thread Rodrick Brown

On Jul 2, 2012, at 7:03 PM, James Downs e...@egon.cc wrote:

 
 On Jul 2, 2012, at 1:20 PM, david raistrick wrote:
 
 Amazon resources are controlled (from a consumer viewpoint) by API - that 
 API is also used by amazon's internal toolkits that support ELB (and RDS..). 
   Those (http accessed) API interfaces were unavailable for a good portion 
 of the outages.
 
 Right, and other toolkits like boto. Each AZ has a different endpoint (url), 
 and as I have no resources running in East, I saw no problems with the API 
 endpoints I use. So, as you note, US-EAST Region was not controllable.
 
 I know nothing of the netflix side of it - but that's what -we- saw. (and 
 that caused all us-east RDS instances in every AZ to appear 
 
 
 And, if you lose US-EAST, you need to run *somewhere*. Netflix did not 
 cutover www.netflix.com to another Region. Why not is another question.

At what point are you guys going to realize that no matter how much 
resiliency, redundancy and fault tolerance you plan into an infrastructure, 
there is always the unforeseen that just doesn't make any sense to plan for? 

Four major decision factors are cost, complexity, time and failure rate. At 
some point a business needs to focus on its core business. IT, like any other 
business resource, has to be managed efficiently, and its sole purpose is the 
enablement of said business, nothing more. 

Some of the posts here are highly laughable and so unrealistic. 

People are acting as if Netflix is part of some critical service; they stream 
movies, for Christ's sake.  Some acceptable level of loss is fine for 99.99% of 
Netflix's user base.  Just like with cable, electricity and running water, I suffer a 
few hours of losses each year from those services.  It sucks, yes, but is it the end of 
the world?  No. 

This horse is dead! 

 



Re: FYI Netflix is down

2012-07-02 Thread James Downs

On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:

 People are acting as if Netflix is part of some critical service they stream 
 movies for Christ sake.  Some acceptable level of loss is fine for 99.99% of 
 Netflix's user base just like cable, electricity and running water I suffer a 
 few hours of losses each year from those services it suck yes, is it the end 
 of the world no.. 

You missed the point.


Re: FYI Netflix is down

2012-07-02 Thread Hal Murray

George Herbert george.herb...@gmail.com said:

 I worked for a Sun clone vendor (Axil) for a while and took some of our
 systems and storage to Comdex one year in the 90s.  We had a RAID unit
 (Mylex controller) we had just introduced.  Beforehand, I made REALLY REALLY
 SURE that the pull-the-disk and pull-the-redundant-power tricks worked.  And
 showed them to people with the Please keep in mind that this voids the
 warranty, but here we *rip* go  All of the other server vendors were
 giving me dirty looks for that one. Apparently I sold a few systems that
 way. 

:)  Nice.  Thanks.

Many years ago, I worked for one of DEC's research groups.  We built a 
network using FDDI 4B/5B link technology based on AMD TAXI chips.  (They were 
state of the art back then.)  The switches were 3U(?) boxes with 12 ports.  
It took a rack of 6 or 8 of them in the phone closet to cover a floor.  
Workstations had 2 cables plugged into different switches.  In theory, we 
covered any single point of failure.

My office was near the phone closet.  I got to watch my boss give demos to 
visiting VIPs.  He was pretty good at it.  In the middle of explaining 
things, he would grab a power cord and yank it.  Blinka-blinka-blinka and the 
remaining switches would reconfigure and go back to work.  (It took under a 
second.)

It was interesting to watch the VIPs.  Most of them got it: the network 
really could recover quickly. The interesting ones had a telco background.  
They were really surprised.  The concept of disrupting live traffic for 
something as insignificant as a demo was off scale in their culture.

It was just a research lab.  We were used to eating our own dog food.

--

Greg D. Moore moor...@greenms.com said:

 If folks have not read it, I would suggest reading Normal Accidents  by
 Charles Perrow.

+1

 The it can't happen is almost guaranteed to happen. ;-)  And when  it
 does, it'll often interact in ways we can't predict or sometimes  even
 understand. 

My memory of that sort of event is roughly...  (see above for context)

The hardware broke and turned a vanilla packet into a super-long packet.  My 
FPGA code was supposed to catch that case and do something sane.  It was 
never tested and didn't work.  It poured crap all over memory.  Needless to 
say, things went downhill from there.

Easy to spot in hindsight.  None of us thought that was an interesting case 
while we were testing.


-- 
These are my opinions.  I hate spam.






Re: [FoRK] FYI Netflix is down

2012-07-01 Thread Aaron Burt
On Sat, Jun 30, 2012 at 03:15:07AM -0400, Andrew D Kirch wrote:
 On 6/30/2012 3:11 AM, Tyler Haske wrote:
 How to run a datacenter 101. Have more then one location, preferably
 far apart. It being Amazon I would expect more. :/

Amazon has many datacenters and tries to make it easy to diversify.

 Based on?  Clouds are nothing more than outsourced responsibility.
 My business has stopped while my IT department explains to me that
 it's not their fault because Amazon's down snip

It *is* their fault.  You can blame faulty manufacturing for having a HDD
die, but it's IT's fault if it takes out the only copy of your database.

AWS 101:  Amazon has clearly-marked Availability Zones for a reason.

Oh, and business 101: have an exit strategy for every vendor.

This outage is mighty interesting.  It's surprising how many big operations
had poor availability strategies.  Also, I've been working on an exit
strategy for one of my VM/colo providers, and AWS + colo in NoVa is one of
my options.

 The cloud may be a technological wonder, but as far as
 business practices go, it's a DISASTER.

I wouldn't say so.  Like any disruptive service, you're getting an
acceptably lower-quality product for significantly less money.  And like
most disruptors, it operates by different rules than the old stuff.

Regards,
  Aaron
___
FoRK mailing list
http://xent.com/mailman/listinfo/fork



It's the end of the world, as we know it (Was: FYI Netflix is down)

2012-07-01 Thread Jay Ashworth
- Original Message -
 From: jamie rishaw j...@arpa.com

 you know what's happening even more?
 
 ..Amazon not learning their lesson.

 Please stop these crappy practices, people. Do real world DR testing.
 Play What If This City Dropped Off The Map games, because tonight,
 parts of VA infact did.

You know what I want everyone to do?

Go read this.  Right now; it's Sunday, and I'll wait:

  http://interdictor.livejournal.com/2005/08/27/

Start there, and click Next Date a lot, until you get to the end.

Entire metropolitan areas can, and do, fall completely off the map.  If
your audience is larger than that area, then you need to prepare for it.

And being reminded of how big it can get is occasionally necessary.

The 4ESS in the third subbasement of 1WTC that was a toll switch for most of 
the northeast reportedly stayed on the air, talking to its SS7 neighbors,
until something like 1500EDT, 11 Sep 2001.

It can get *really* big.  Are you ready?

Cheers,
-- jra
-- 
Jay R. Ashworth  Baylink   j...@baylink.com
Designer The Things I Think   RFC 2100
Ashworth  Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA  http://photo.imageinc.us +1 727 647 1274



Re: FYI Netflix is down

2012-07-01 Thread Jay Ashworth
- Original Message -
 From: Tyler Haske tyler.ha...@gmail.com

 How to run a datacenter 101. Have more then one location, preferably
 far apart. It being Amazon I would expect more. :/

Not entirely.  Datacenters do go down, our best efforts to the contrary 
notwithstanding.  Amazon doesn't guarantee you redundancy on EC2, only
the tools to provide it yourself.  25% Amazon; 75% service provider clients;
that's my appraisal of the blame.

Cheers,
-- jra
-- 
Jay R. Ashworth  Baylink   j...@baylink.com
Designer The Things I Think   RFC 2100
Ashworth  Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA  http://photo.imageinc.us +1 727 647 1274



Re: FYI Netflix is down

2012-07-01 Thread steve pirk [egrep]
On Sun, Jul 1, 2012 at 11:38 AM, Jay Ashworth j...@baylink.com wrote:

 Not entirely.  Datacenters do go down, our best efforts to the contrary
 notwithstanding.  Amazon doesn't guarantee you redundancy on EC2, only
 the tools to provide it yourself.  25% Amazon; 75% service provider
 clients;
 that's my appraisal of the blame.


From a Wired article:

 That’s what was supposed to happen at Netflix Friday night. But it didn’t
 work out that way. According to Twitter messages from Netflix Director of
 Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick Branson, it
 looks like an Amazon Elastic Load Balancing service, designed to spread
 Netflix’s processing loads across data centers, failed during the outage.
 Without that ELB service working properly, the Netflix and Pintrest
 services hosted by Amazon crashed.

http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/

The GSLB fail-over that was supposed to take place for the affected
services (that had configured their applications to fail-over) failed.

I heard about this the day after Google announced the Compute Engine
addition to the App Engine product lines they have. The demo was awesome.
I imagine Google has GSLB down pat by now, so some companies might start
looking... ;-]

--steve


Re: FYI Netflix is down

2012-06-30 Thread Roy

On 6/29/2012 10:38 PM, jamie rishaw wrote:

you know what's happening even more?

..Amazon not learning their lesson.

they just had an outage quite similar.. they performed a full audit on
electrical systems worldwide, according to the rfo/post mortem.

looks like they need to perform a full and we mean it audit, and like
I've been doing/participating in at dot coms for a decade plus: Actually Do
Regular Load tests..

Related/equally to blame: companies that rely heavily on one aws zone, or
arguably one cloud (period), are asking for it.

Please stop these crappy practices, people.  Do real world DR testing.
  Play What If This City Dropped Off The Map games, because tonight, parts
of VA infact did.

...


I am not a computer science guy but been around a long time.  Data 
centers and clouds are like software.  Once they reach a certain size, 
it's impossible to keep the bugs out.  You can test and test your heart 
out and something will slip by.  You can say the same thing about 
nuclear reactors, Apollo moon missions, the NorthEast power grid, and 
most other technology disasters.






Re: FYI Netflix is down

2012-06-30 Thread Grant Ridder
well one would think that they could at least get power redundancy right...

On Sat, Jun 30, 2012 at 1:07 AM, Roy r.engehau...@gmail.com wrote:

 On 6/29/2012 10:38 PM, jamie rishaw wrote:

 you know what's happening even more?

 ..Amazon not learning their lesson.

 they just had an outage quite similar.. they performed a full audit on
 electrical systems worldwide, according to the rfo/post mortem.

 looks like they need to perform a full and we mean it audit, and like
 I've been doing/participating in at dot coms for a decade plus: Actually
 Do
 Regular Load tests..

 Related/equally to blame: companies that rely heavily on one aws zone, or
 arguably one cloud (period), are asking for it.

 Please stop these crappy practices, people.  Do real world DR testing.
  Play What If This City Dropped Off The Map games, because tonight,
 parts
 of VA infact did.

 ...


 I am not a computer science guy but been around a long time.  Data centers
 and clouds are like software.  Once they reach a certain size, its
 impossible to keep the bugs out.  You can test and test your heart out and
 something will slip by.  You can say the same thing about nuclear reactors,
 Apollo moon missions, the NorthEast power grid, and most other technology
 disasters.






Re: FYI Netflix is down

2012-06-30 Thread Tyler Haske
 I am not a computer science guy but been around a long time.  Data centers
 and clouds are like software.  Once they reach a certain size, its
 impossible to keep the bugs out.  You can test and test your heart out and
 something will slip by.  You can say the same thing about nuclear reactors,
 Apollo moon missions, the NorthEast power grid, and most other technology
 disasters.

How to run a datacenter 101. Have more than one location, preferably
far apart. It being Amazon I would expect more. :/



Re: FYI Netflix is down

2012-06-30 Thread Andrew D Kirch

On 6/30/2012 3:11 AM, Tyler Haske wrote:

How to run a datacenter 101. Have more then one location, preferably
far apart. It being Amazon I would expect more. :/



Based on?  Clouds are nothing more than outsourced responsibility. My 
business has stopped while my IT department explains to me that it's not 
their fault because Amazon's down, and I can't exactly fire Amazon.  The 
cloud may be a technological wonder, but as far as business practices 
go, it's a DISASTER.


Andrew



Re: FYI Netflix is down

2012-06-30 Thread joel jaeggli

On 6/30/12 12:11 AM, Tyler Haske wrote:

I am not a computer science guy but been around a long time.  Data centers
and clouds are like software.  Once they reach a certain size, its
impossible to keep the bugs out.  You can test and test your heart out and
something will slip by.  You can say the same thing about nuclear reactors,
Apollo moon missions, the NorthEast power grid, and most other technology
disasters.

How to run a datacenter 101. Have more then one location, preferably
far apart. It being Amazon I would expect more. :/
there are 7 regions in ec2: three in north america, two in asia, one in 
europe and one in south america.


us east coast, the one currently being impacted, is further subdivided 
into 5 availability zones.


us east 1d appears to be the only one currently being impacted.

distributing your application is left as an exercise to the reader.




Re: FYI Netflix is down

2012-06-30 Thread Lynda

On 6/30/2012 12:11 AM, Tyler Haske wrote:
 On 6/29/2012 11:07 PM, Roy wrote:

I am not a computer science guy but been around a long time.  Data centers
and clouds are like software.  Once they reach a certain size, its
impossible to keep the bugs out.  You can test and test your heart out and
something will slip by.  You can say the same thing about nuclear reactors,
Apollo moon missions, the NorthEast power grid, and most other technology
disasters.


How to run a datacenter 101. Have more then one location, preferably
far apart. It being Amazon I would expect more. :/


First off. They HAVE more than one location, and they are indeed far 
apart. That said, it's all mixed together, like some kind of goulash, 
and the companies who've gone with this particular model for their sites 
are paying for that fact.


Second, and more important. I *was* a computer science guy in a past 
life, and this is nonsense. You can have astonishingly large software 
projects that just continue to run smoothly, day in, day out, and they 
don't hit the news, because they don't break. There are data centers 
that don't hit the news, in precisely the same way.


If I had a business, right now, I would not have chosen Amazon's cloud 
(or anyone's for that matter). I would also not be using Google 
docs/services, for precisely the same reason. I'm a fan of controlling 
risk, where possible, and I'd say that this is all in the wrong 
direction for doing that.


No worries, though. It seems we are doomed to continue making the same 
mistakes, over and over.


--
Politicians are like a Slinky.
They're really not good for anything,
but they still bring a smile to your face
when you push them down a flight of stairs.



Re: FYI Netflix is down

2012-06-30 Thread Justin M. Streiner

On Sat, 30 Jun 2012, jamie rishaw wrote:


you know what's happening even more?

..Amazon not learning their lesson.


I was not giving anyone a free pass or attempting to shrug off the outage. 
I was just stating that there are many reasons why things break.  I 
haven't seen anything official on this yet, but this looks a lot like a 
cascading failure.


jms



Re: FYI Netflix is down

2012-06-30 Thread Jimmy Hess
On 6/30/12, Grant Ridder shortdudey...@gmail.com wrote:
 well one would think that they could at least get power redundancy right...

It is very similar to suggesting redundancy within a site against
building collapse.

Reliable power redundancy is very hard and very expensive.  Much
harder and much more expensive than achieving network redundancy
against switch or router failures.
And there are always tradeoffs involved,   because there is only one
utility grid available.
There are always some limitations in the amount of isolation possible.

You have devices plugged into both power systems.
There is some possibility a random device plugged into both systems
creates a short in both branches that it plugs into.

Both power systems always have to share the same ground, due to safety
considerations.

Both power systems always have to have fuses or breakers installed,
due to safety considerations, and there is always a possibility
that various kinds of anomalies
cause fuses to simultaneously blow in both systems.

--
-JH



Re: FYI Netflix is down

2012-06-30 Thread Cameron Byrne
On Jun 30, 2012 12:25 AM, joel jaeggli joe...@bogus.com wrote:

 On 6/30/12 12:11 AM, Tyler Haske wrote:

 I am not a computer science guy but been around a long time.  Data
centers
 and clouds are like software.  Once they reach a certain size, its
 impossible to keep the bugs out.  You can test and test your heart out
and
 something will slip by.  You can say the same thing about nuclear
reactors,
 Apollo moon missions, the NorthEast power grid, and most other
technology
 disasters.

 How to run a datacenter 101. Have more then one location, preferably
 far apart. It being Amazon I would expect more. :/

 there are 7 regions  in ec2 three in north  america two in asia one in
europe and one in south america.

 us east coast, the one currently being impacted is further subdivided
into 5 availability zones.

 us east 1d appears to be the only one currently being impacted.

 distributing your application is left as an exercise to the reader.



+1

Sorry to be the monday morning quarterback, but the sites that went down
learned a valuable lesson in single point of failure analysis.  A highly
redundant and professionally run data center is a single point of failure.

Geo-redundancy is key. In fact, I would take distributed data centers over
RAID, UPS, or any other fancy pants © mechanisms any day.

And, AWS East also seems to be cursed. I would run out of west for a
while. :-)

I would also look into clouds of clouds. ... Who knows. Amazon could have
an Enron moment, at which point a corporate entity with a tax id is now a
single point of failure.

Pay your money, take your chances.

CB


Re: FYI Netflix is down

2012-06-30 Thread Jimmy Hess
On 6/30/12, Cameron Byrne cb.li...@gmail.com wrote:
 On Jun 30, 2012 12:25 AM, joel jaeggli joe...@bogus.com wrote:
 On 6/30/12 12:11 AM, Tyler Haske wrote:
 Geo-redundancy is key. In fact, i would take distributed data centers over
 RAID, UPS, or any other fancy pants © mechanisms any day.

Geo-redundancy is more expensive than any of those technologies, because it
directly impacts every application and reduces performance.  It means
that, for example, if an application needs to guarantee something is
persisted to a distributed database, such as a record that such and
such user's credit card has just been charged $X, or such and such
user has uploaded this blob to the web service; the round trip
time of the longest latency path between any of the redundancy sites
is added to the critical path of the WRITE transaction latency during
the commit stage.  Because you cannot complete a transaction and
ensure you have consistency or correct data until that transaction
reaches a system at the remote site managing the persistence, and is
acknowledged as received intact.

For example, if you have geo sites, which are a minimum of 250
miles apart; if you recall, light only travels 186.28 miles per
millisecond.  That means you have a 500 mile round-trip and therefore
have added a bare minimum of 2.6 milliseconds of latency to every
write transaction, and probably more like 15 milliseconds.

If your original transaction latency was 1 millisecond, or
1000 transactions per second, AND you require only that the data
reaches the remote site and is acknowledged (not that the
transaction succeeds at the remote site) before you commit, you are
now at a minimum of 2.6 milliseconds, an average of 384 transactions per
second.

To actually do it safely, you require 3.6 milliseconds, limiting
you to an average of 277 transactions per second.
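
To make the arithmetic concrete, here is a rough Python sketch using the
illustrative figures above (the exact round trip works out to about 2.68 ms;
the 2.6 ms above rounds that down, which is where the 384 and 277 come from):

    # Back-of-the-envelope cost of synchronous geo-replication on write throughput.
    # All numbers are the illustrative ones from the paragraph above, not measurements.
    SPEED_OF_LIGHT_MILES_PER_MS = 186.28     # best-case propagation speed
    site_separation_miles = 250.0
    local_commit_ms = 1.0                    # original single-site write latency

    # Round trip to the remote site before the local commit can be acknowledged.
    added_rtt_ms = 2 * site_separation_miles / SPEED_OF_LIGHT_MILES_PER_MS   # ~2.68 ms

    ack_on_receipt_ms = added_rtt_ms                       # remote site only has to ack receipt
    commit_both_sites_ms = local_commit_ms + added_rtt_ms  # "do it safely" case, ~3.7 ms

    for label, latency_ms in (("ack on receipt", ack_on_receipt_ms),
                              ("commit at both sites", commit_both_sites_ms)):
        print("%-22s %.2f ms/write, ~%d serial writes/sec"
              % (label, latency_ms, 1000.0 / latency_ms))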

If the application is not specially designed for remote site
redundancy, then this means you require a scheme such as synchronous
storage-level replication to achieve clustering, which has even
worse results if there is significant geographic dispersion.


RAID transactional latencies are much lower.

UPSes and redundant power  do not increase transaction latencies at all.

--
-JH



Re: FYI Netflix is down

2012-06-30 Thread Seth Mattinen
On 6/30/12 4:50 AM, Justin M. Streiner wrote:
 On Sat, 30 Jun 2012, jamie rishaw wrote:
 
 you know what's happening even more?

 ..Amazon not learning their lesson.
 
 I was not giving anyone a free pass or attempting to shrug off the
 outage. I was just stating that there are many reasons why things
 break.  I haven't seen anything official on this yet, but this looks a
 lot like a cascading failure.


But haven't they all been cascading failures?

One can't just say "well, it's a huge system, therefore hard."  Especially
when they claimed to have learned their lesson from previous outwardly
similar failures; either they were lying, or didn't really learn
anything, or the scope simply exceeds their grasp.

If it's too hard for entity X to handle a large system (for whatever
large means to them), then X needs to break it down into smaller parts
that they're capable of handling in a competent manner.

~Seth



Re: FYI Netflix is down

2012-06-30 Thread Roy

On 6/30/2012 12:11 AM, Tyler Haske wrote:

I am not a computer science guy but been around a long time.  Data centers
and clouds are like software.  Once they reach a certain size, its
impossible to keep the bugs out.  You can test and test your heart out and
something will slip by.  You can say the same thing about nuclear reactors,
Apollo moon missions, the NorthEast power grid, and most other technology
disasters.

How to run a datacenter 101. Have more then one location, preferably
far apart. It being Amazon I would expect more. :/
.



It doesn't change my theory.  You add that complexity, something happens 
and the failover routing doesn't work as planned.  Been there, done 
that, have the T-shirt.





Re: FYI Netflix is down

2012-06-30 Thread Todd Underwood
On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote:


 But haven't they all been cascading failures?

No.  They have not.  That's not what that term means.

'Cascading failure' has a fairly specific meaning that doesn't imply
resilience in the face of decomposition into smaller parts.  Cascading
failures can occur even when a system is decomposed into small parts, each
of which is apparently well run.

T


Re: FYI Netflix is down

2012-06-30 Thread Jimmy Hess
On 6/30/12, Todd Underwood toddun...@gmail.com wrote:
 On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us wrote:
 But haven't they all been cascading failures?
 No.  They have not.  That's not what that term means.

 'Cascading failure' has a fairly specific meaning that doesn't imply
 resilience in the face of decomposition into smaller parts.  Cascading

Not sure where you're going there; cascading failures are common, but
fortunately are usually temporary or have some kind of scope limit.
Cascading just means you have a dependency between components,
where the failure of one component may result in the failure of a
second component,
the failure of the second component results in failure of a third component,
and this process continues until no more components are dependent or no
more components are still operating.

This can happen to the small pieces inside of one specific system,
causing that system to collapse.
It's just as valid to say cascading failure occurs across
larger/more complex pieces of different higher-level systems, where
the components of one system aren't sufficiently independent of those
in other systems, causing both systems to fail.

Your application logic can be a point of failure, just as readily as
your datacenter can.
Cascades can happen at a higher level where entire systems are
dependent upon entire other systems.
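
A minimal sketch of that definition, with made-up component names (nothing
here is specific to Amazon or Netflix): start from one failed component and
keep walking the depends-on edges until nothing new fails.

    # Toy cascade: edges point from a component to the things it depends on.
    DEPENDS_ON = {
        "web":  ["api", "cdn"],
        "api":  ["db", "auth"],
        "auth": ["db"],
        "cdn":  [],
        "db":   [],
    }

    def cascade(initial_failure, depends_on):
        """Return every component that ends up failed once initial_failure goes
        down, assuming a component fails if anything it depends on has failed."""
        failed = {initial_failure}
        changed = True
        while changed:
            changed = False
            for component, deps in depends_on.items():
                if component not in failed and any(d in failed for d in deps):
                    failed.add(component)
                    changed = True
        return failed

    print(sorted(cascade("db", DEPENDS_ON)))   # ['api', 'auth', 'db', 'web']; 'cdn' survives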

And it can happen organizationally.  External dependency risk occurs
when an entire business is dependent on another organization (such as
product support) to remotely administer software they sold, and the
subcontractor of the product support org stops doing their job, or
a smaller component (one member of their staff) becomes a
rogue/malicious element.

--
-JH



Re: FYI Netflix is down

2012-06-30 Thread Seth Mattinen
On 6/30/12 9:25 AM, Todd Underwood wrote:
 
 On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us
 mailto:se...@rollernet.us wrote:


 But haven't they all been cascading failures?
 
 No.  They have not.  That's not what that term means.
 
 'Cascading failure' has a fairly specific meaning that doesn't imply
 resilience in the face of decomposition into smaller parts.  Cascading
 failures can occur even when a system is decomposed into small parts,
 each of which is apparently well run.
 


I honestly have no idea how to parse that since it doesn't jibe with my
practical view of a cascading failure.

~Seth



Re: FYI Netflix is down

2012-06-30 Thread Todd Underwood
This was not a cascading failure.  It was a simple power outage.

Cascading failures involve interdependencies among components.

T
On Jun 30, 2012 2:21 PM, Seth Mattinen se...@rollernet.us wrote:

 On 6/30/12 9:25 AM, Todd Underwood wrote:
 
  On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us
  mailto:se...@rollernet.us wrote:
 
 
  But haven't they all been cascading failures?
 
  No.  They have not.  That's not what that term means.
 
  'Cascading failure' has a fairly specific meaning that doesn't imply
  resilience in the face of decomposition into smaller parts.  Cascading
  failures can occur even when a system is decomposed into small parts,
  each of which is apparently well run.
 


 I honestly have no idea how to parse that since it doesn't jive with my
 practical view of a cascading failure.

 ~Seth




Re: FYI Netflix is down

2012-06-30 Thread Jimmy Hess
On 6/30/12, Todd Underwood toddun...@gmail.com wrote:
 This was not a cascading failure.  It was a simple power outage
 Cascading failures involve interdependencies among components.

Actually, you can't really say that.  It's true that it was a simple
power outage for Amazon.
Power failed, causing the AWS service at certain locations to experience issues.

Any of the issues related to services at locations that didn't lose
power are a possible result of cascade.

But as for the other possible outages being reported...  Instagram,
Pinterest, Netflix, Heroku, Woot, Pocket, zoneedit.

Possibly Amazon's power failure caused AWS problems, which resulted in
issues with these services.

Some of these services may actually have had redundancy in place,
but experienced a failure of their service as a result of an unexpected
cascade from the affected site.


 T
--
-JH



Re: FYI Netflix is down

2012-06-30 Thread Mike Devlin
The last 2 Amazon outages were power issues isolated to just their us-east
Virginia data center. I read somewhere that Amazon has something like 70%
of their ec2 resources in Virginia and it's also their oldest ec2
datacenter, so I am guessing they learned a lot of lessons and are stuck
with an aged infrastructure there.

I think the real problem here is that a large subset of the customers using
ec2 misunderstand the redundancy that is built into the Amazon
architecture. You are essentially supposed to view individual virtual
machines as being entirely disposable and make duplicates of everything
across availability zones and for extra points across regions.
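
A sketch of that model, with hypothetical zone names and counts: if replicas
are interchangeable and spread round-robin across zones, losing any single
zone costs you at most a ceiling-of-N-over-zones share of them.

    from collections import Counter

    def place_replicas(n_replicas, zones):
        """Round-robin placement of disposable, interchangeable replicas across zones."""
        return [zones[i % len(zones)] for i in range(n_replicas)]

    zones = ["us-east-1a", "us-east-1b", "us-east-1d"]   # illustrative zone names only
    placement = place_replicas(10, zones)
    print(Counter(placement))
    # -> 4 + 3 + 3 replicas; the worst single-zone loss still leaves 6 of 10 serving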

most people instead think that the 2 cents/hour price tag is a massive cost
savings and the cloud is invincible... look at the SLA for ec2... Amazon
basically doesn't really consider it a real outage unless it's more than one
availability zone that is down.

What's more surprising is that Netflix was so affected by a single
availability zone outage. They are constantly talking about their chaos
monkey/simian army tool that purposely breaks random parts of their
infrastructure to prove it's fault tolerant, or to point out weaknesses to
fix. (
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)
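
In spirit the tool boils down to something like the sketch below;
list_instances and terminate_instance here are placeholders for whatever your
platform's inventory and kill calls are, not Netflix's actual code or the AWS
API, and you would only point it at an environment built to survive it.

    import random
    import time

    def chaos_monkey(list_instances, terminate_instance,
                     kill_probability=0.1, interval_seconds=3600):
        """Periodically pick one running instance at random and kill it, so
        recovery paths get exercised continuously rather than only during a
        real outage.  Both callables are stand-ins supplied by the caller."""
        while True:
            instances = list_instances()
            if instances and random.random() < kill_probability:
                victim = random.choice(instances)
                print("chaos: terminating %s" % victim)
                terminate_instance(victim)
            time.sleep(interval_seconds)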


I think the closest thing to a cascading failure they have had was the 4/29/11
outage (http://aws.amazon.com/message/65648/)


Mike


On Jun 30, 2012 3:05 PM, Todd Underwood toddun...@gmail.com wrote:

 This was not a cascading failure.  It was a simple power outage

 Cascading failures involve interdependencies among components.

 T
 On Jun 30, 2012 2:21 PM, Seth Mattinen se...@rollernet.us wrote:

  On 6/30/12 9:25 AM, Todd Underwood wrote:
  
   On Jun 30, 2012 11:23 AM, Seth Mattinen se...@rollernet.us
   mailto:se...@rollernet.us wrote:
  
  
   But haven't they all been cascading failures?
  
   No.  They have not.  That's not what that term means.
  
   'Cascading failure' has a fairly specific meaning that doesn't imply
   resilience in the face of decomposition into smaller parts.  Cascading
   failures can occur even when a system is decomposed into small parts,
   each of which is apparently well run.
  
 
 
  I honestly have no idea how to parse that since it doesn't jive with my
  practical view of a cascading failure.
 
  ~Seth
 
 



Re: FYI Netflix is down

2012-06-30 Thread Seth Mattinen
On 6/30/12 12:04 PM, Todd Underwood wrote:
 This was not a cascading failure.  It was a simple power outage
 
 Cascading failures involve interdependencies among components.
 

I guess I'm assuming there were UPS and generator systems involved (and
failing) with powering the critical load, but I suppose it could all be
direct to utility power.

~Seth



Re: FYI Netflix is down

2012-06-30 Thread Rayson Ho
If I recall correctly, availability zone (AZ) mappings are specific to
an AWS account, and in fact there is no way to know if you are running
in the same AZ as another AWS account:

http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_am_in_the_same_Availability_Zone_as_another_developer


Also, AWS Elastic Load Balancer (and/or CloudWatch) should be able to
detect that some instances are not reachable, and thus can start new
instances and remap DNS entries automatically:
http://aws.amazon.com/elasticloadbalancing/
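
The mechanism being described is roughly the sketch below; the health-check
URL layout and the replace_instance hook are placeholders for illustration,
not the ELB or CloudWatch API.

    import urllib.request

    def healthy(url, timeout=2.0):
        """True if the instance answers its health-check URL with a 200 in time."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def sweep(instances, replace_instance):
        """Keep reachable instances in rotation; ask for replacements for the rest.
        `instances` maps instance id -> health-check URL; `replace_instance` is a
        placeholder for whatever actually launches a new copy and updates DNS."""
        in_rotation = []
        for instance_id, url in instances.items():
            if healthy(url):
                in_rotation.append(instance_id)
            else:
                replace_instance(instance_id)
        return in_rotation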


This time only 1 AZ is affected by the power outage, so sites with
fault tolerance built into their AWS infrastructure should be able to
handle the issues relatively easily.

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/



On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder shortdudey...@gmail.com wrote:
 I have an instance in zone C and it is up and fine, so it must be A, B, or
 D that is down.

 On Fri, Jun 29, 2012 at 10:42 PM, James Laszko jam...@mythostech.comwrote:

 To further expand:

 8:21 PM PDT We are investigating connectivity issues for a number of
 instances in the US-EAST-1 Region.

  8:31 PM PDT We are investigating elevated errors rates for APIs in the
 US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
 instances in a single availability zone.

  8:40 PM PDT We can confirm that a large number of instances in a single
 Availability Zone have lost power due to electrical storms in the area. We
 are actively working to restore power.

 -Original Message-
 From: Grant Ridder [mailto:shortdudey...@gmail.com]
 Sent: Friday, June 29, 2012 8:42 PM
 To: Jason Baugher
 Cc: nanog@nanog.org
 Subject: Re: FYI Netflix is down

 From Amazon

 Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/
 )
 8:21 PM PDT We are investigating connectivity issues for a number of
 instances in the US-EAST-1 Region.
 8:31 PM PDT We are investigating elevated errors rates for APIs in the
 US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
 instances in a single availability zone.

 -Grant

 On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com
 wrote:

  Seeing some reports of Pinterest and Instagram down as well. Amazon
  cloud services being implicated.
 
 
  On 6/29/2012 10:22 PM, Joe Blanchard wrote:
 
  Seems that they are unreachable at the moment. Called and theres a
  recorded message stating they are aware of an issue, no details.
 
  -Joe
 
 
 
 
 
 




Re: FYI Netflix is down

2012-06-30 Thread Randy Bush
 Sorry to be the monday morning quarterback, but the sites that went
 down learned a valuable lesson in single point of failure analysis.

as this has happened more than once before, i am less optimistic.

or maybe they decided the spof risk was not worth the avoidance costs.

randy



Re: FYI Netflix is down

2012-06-30 Thread Jared Mauch
The interesting thing to me is the us population by time zone. If amazon has 
70% of servers in the eastern time zone it makes some sense. 

Mountain + pacific is smaller than central, which is a bit more than half 
eastern. These stats are older but a good rough gauge:

http://answers.google.com/answers/threadview?id=714986

Jared Mauch

On Jun 30, 2012, at 4:03 PM, Seth Mattinen se...@rollernet.us wrote:

 On 6/30/12 12:04 PM, Todd Underwood wrote:
 This was not a cascading failure.  It was a simple power outage
 
 Cascading failures involve interdependencies among components.
 
 
 I guess I'm assuming there were UPS and generator systems involved (and
 failing) with powering the critical load, but I suppose it could all be
 direct to utility power.
 
 ~Seth



Re: FYI Netflix is down

2012-06-30 Thread Scott Howard
On Sat, Jun 30, 2012 at 12:04 PM, Todd Underwood toddun...@gmail.comwrote:

 This was not a cascading failure.  It was a simple power outage

 Cascading failures involve interdependencies among components.


Not always.  Cascading failures can also occur when there is zero
dependency between components.  The simplest form of this is where one
environment fails over to another, but the target environment is not
capable of handling the additional load and then fails itself as a result
(in some form or other, but frequently different to the mode of the
original failure).

Whilst the Amazon outage might have been a simple power outage, it's
likely that at least some of the website outages caused were a combination
of not just the direct Amazon outage, but also the flow-on effect of their
redundancy attempting (but failing) to kick in - potentially making the
problem worse than just the Amazon outage caused.

  Scott


Re: FYI Netflix is down

2012-06-30 Thread Todd Underwood
scott,


 This was not a cascading failure.  It was a simple power outage

 Cascading failures involve interdependencies among components.


 Not always.  Cascading failures can also occur when there is zero dependency
 between components.  The simplest form of this is where one environment
 fails over to another, but the target environment is not capable of handling
 the additional load and then fails itself as a result (in some form or
 other, but frequently different to the mode of the original failure).

indeed.  and that is an interdependency among components.  in
particular, it is a capacity interdependency.

 Whilst the Amazon outage might have been a simple power outage, it's
 likely that at least some of the website outages caused were a combination
 of not just the direct Amazon outage, but also the flow-on effect of their
 redundancy attempting (but failing) to kick in - potentially making the
 problem worse than just the Amazon outage caused.

i think you over-estimate these websites.  most of them simply have no
redundancy (and obviously have no tested, effective redundancy) and
were simply hoping that amazon didn't really go down that much.

hope is not the best strategy, as it turns out.

i suspect that randy is right though:  many of these businesses do not
promise perfect uptime and can survive these kinds of failures with
little loss to business or reputation.  twitter has branded its early
failures with a whale that not only didn't hurt it but helped endear
the service to millions.  when your service fits these criteria, why
would you bother doing the complicated systems and application
engineering necessary to actually have functional redundancy?

it simply isn't worth it.

t


   Scott



Re: FYI Netflix is down

2012-06-30 Thread Bryan Horstmann-Allen
+--
| On 2012-06-30 16:08:40, Rayson Ho wrote:
| 
| If I recall correctly, availability zone (AZ) mappings are specific to
| an AWS account, and in fact there is no way to know if you are running
| in the same AZ as another AWS account:
| 
| 
http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_am_in_the_same_Availability_Zone_as_another_developer
| 
| Also, AWS Elastic Load Balancer (and/or CloudWatch) should be able to
| detect that some instances are not reachable, and thus can start new
| instances and remap DNS entries automatically:
| http://aws.amazon.com/elasticloadbalancing/
| 
| This time only 1 AZ is affected by the power outage, so sites with
| fault tolerance built into their AWS infrastructure should be able to
| handle the issues relatively easily.

Explain Netflix and Heroku last night. Both of whom architect across multiple
AZs and have for many years.

The API and EBS across the region were also affected. ELB was _also_ affected
across the region, and many customers continue to report problems with it.

We were told in May of last year after the last massive full-region EBS outage
that the control planes for the API and related services were being decoupled
so issues in a single AZ would not affect all. Seems to not be the case.

Just because they offer these features that should help with resiliency doesn't
actually mean they _work_ under duress.
-- 
bdha
cyberpunk is dead. long live cyberpunk.



Re: FYI Netflix is down

2012-06-30 Thread Mike Devlin
On Sat, Jun 30, 2012 at 4:45 PM, Bryan Horstmann-Allen 
b...@mirrorshades.net wrote:

 Explain Netflix and Heroku last night. Both of whom architect across
 multiple
 AZs and have for many years.

 The API and EBS across the region were also affected. ELB was _also_
 affected
 across the region, and many customers continue to report problems with it.

 We were told in May of last year after the last massive full-region EBS
 outage
 that the control planes for the API and related services were being
 decoupled
 so issues in a single AZ would not affect all. Seems to not be the case.

 Just because they offer these features that should help with resiliency
 doesn't
 actually mean they _work_ under duress.
 --



But in Netflix's case, if they architected their environment the way they
said they did, why wouldn't they just fail over to us-west? Especially at
their scale, I wouldn't expect them to be dependent on any AWS function in
any region.


Mike


Re: FYI Netflix is down

2012-06-30 Thread Bryan Horstmann-Allen
+--
| On 2012-06-30 16:55:53, Mike Devlin wrote:
| 
| But in netflix case, if they architected their environment the way they
| said they did, why wouldnt they just fail over to us-west? especially at
| their scale, I wouldn't expect them to be dependent on any AWS function in
| any region.
 
Have a look at Asgard, the AWS management tool they just open sourced. It
implies they rely very heavily on many AWS features, some of which are very
much region specific.

As to their multi-region capability, I have no idea. I don't think I've ever
seen them mention it.
-- 
bdha
cyberpunk is dead. long live cyberpunk.



Re: FYI Netflix is down

2012-06-30 Thread Mike Devlin
On Sat, Jun 30, 2012 at 5:04 PM, Bryan Horstmann-Allen 
b...@mirrorshades.net wrote:


 Have a look at Asgard, the AWS management tool they just open sourced. It
 implies they rely very heavily on many AWS features, some of which are very
 much region specific.

 As to their multi-region capability, I have no idea. I don't think I've
 ever
 seen the mention it.
 --
 bdha
 cyberpunk is dead. long live cyberpunk.



yeah, I am sure I am making some assumptions about how much resilience they
have been building into their architecture, but since every year they have
been getting rid of more and more of their physical infrastructure and
putting it fully in AWS, and given the fact they are a pay service, I would
think they would account for a region going down

Mike


Re: FYI Netflix is down

2012-06-30 Thread Brett Frankenberger
On Sat, Jun 30, 2012 at 01:19:54PM -0700, Scott Howard wrote:
 On Sat, Jun 30, 2012 at 12:04 PM, Todd Underwood toddun...@gmail.comwrote:
 
  This was not a cascading failure.  It was a simple power outage
 
  Cascading failures involve interdependencies among components.
 
 
 Not always.  Cascading failures can also occur when there is zero
 dependency between components.  The simplest form of this is where one
 environment fails over to another, but the target environment is not
 capable of handling the additional load and then fails itself as a result
 (in some form or other, but frequently different to the mode of the
 original failure).

That's an interdependency.  Environment A is dependent on environment B
being up and pulling some of the load away from A; B is dependent on A
being up and pulling some of the load away from B.
"A Crashes for reason X - Load Shifts to B - B Crashes due to load"
is a classic cascading failure.  And it's not limited to software
systems.  It's how most major blackouts occur (except with more than
three steps in the cascade, of course).
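
That three-step pattern in miniature, with made-up capacities and loads: each
environment can carry its own traffic plus some headroom, but not both loads
at once.

    def simulate(environments, first_failure):
        """Shift load from failed environments onto the survivors; any survivor
        pushed past its capacity fails too, until things settle (or nothing is left)."""
        down = {first_failure}
        while True:
            survivors = [name for name in environments if name not in down]
            if not survivors:
                break
            orphaned = sum(environments[name]["load"] for name in down)
            newly_down = [name for name in survivors
                          if environments[name]["load"] + orphaned / len(survivors)
                          > environments[name]["capacity"]]
            if not newly_down:
                break
            down.update(newly_down)
        return down

    environments = {"A": {"capacity": 100, "load": 70},
                    "B": {"capacity": 100, "load": 70}}
    print(sorted(simulate(environments, "A")))   # ['A', 'B']: B inherits 140 units and follows A down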

 -- Brett



FYI Netflix is down

2012-06-29 Thread Joe Blanchard
Seems that they are unreachable at the moment. Called and there's a recorded
message stating they are aware of an issue, no details.

-Joe


Re: FYI Netflix is down

2012-06-29 Thread Jason Baugher
Seeing some reports of Pinterest and Instagram down as well. Amazon 
cloud services being implicated.


On 6/29/2012 10:22 PM, Joe Blanchard wrote:

Seems that they are unreachable at the moment. Called and theres a recorded
message stating they are aware of an issue, no details.

-Joe








Re: FYI Netflix is down

2012-06-29 Thread Grant Ridder
From Amazon

Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/)
8:21 PM PDT We are investigating connectivity issues for a number of
instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
instances in a single availability zone.

-Grant

On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.comwrote:

 Seeing some reports of Pinterest and Instagram down as well. Amazon cloud
 services being implicated.


 On 6/29/2012 10:22 PM, Joe Blanchard wrote:

 Seems that they are unreachable at the moment. Called and theres a
 recorded
 message stating they are aware of an issue, no details.

 -Joe








RE: FYI Netflix is down

2012-06-29 Thread James Laszko
To further expand:

8:21 PM PDT We are investigating connectivity issues for a number of instances 
in the US-EAST-1 Region.

 8:31 PM PDT We are investigating elevated error rates for APIs in the
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to 
instances in a single availability zone.

 8:40 PM PDT We can confirm that a large number of instances in a single 
Availability Zone have lost power due to electrical storms in the area. We are 
actively working to restore power.

-Original Message-
From: Grant Ridder [mailto:shortdudey...@gmail.com] 
Sent: Friday, June 29, 2012 8:42 PM
To: Jason Baugher
Cc: nanog@nanog.org
Subject: Re: FYI Netflix is down

From Amazon

Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/)
8:21 PM PDT We are investigating connectivity issues for a number of instances 
in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to 
instances in a single availability zone.

-Grant

On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com wrote:

 Seeing some reports of Pinterest and Instagram down as well. Amazon 
 cloud services being implicated.


 On 6/29/2012 10:22 PM, Joe Blanchard wrote:

 Seems that they are unreachable at the moment. Called and theres a 
 recorded message stating they are aware of an issue, no details.

 -Joe









Re: FYI Netflix is down

2012-06-29 Thread Grant Ridder
I have an instance in zone C and it is up and fine, so it must be A, B, or
D that is down.

On Fri, Jun 29, 2012 at 10:42 PM, James Laszko jam...@mythostech.com wrote:

 To further expand:

 8:21 PM PDT We are investigating connectivity issues for a number of
 instances in the US-EAST-1 Region.

  8:31 PM PDT We are investigating elevated errors rates for APIs in the
 US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
 instances in a single availability zone.

  8:40 PM PDT We can confirm that a large number of instances in a single
 Availability Zone have lost power due to electrical storms in the area. We
 are actively working to restore power.

 -Original Message-
 From: Grant Ridder [mailto:shortdudey...@gmail.com]
 Sent: Friday, June 29, 2012 8:42 PM
 To: Jason Baugher
 Cc: nanog@nanog.org
 Subject: Re: FYI Netflix is down

 From Amazon

 Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/
 )
 8:21 PM PDT We are investigating connectivity issues for a number of
 instances in the US-EAST-1 Region.
 8:31 PM PDT We are investigating elevated errors rates for APIs in the
 US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
 instances in a single availability zone.

 -Grant

 On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com
 wrote:

  Seeing some reports of Pinterest and Instagram down as well. Amazon
  cloud services being implicated.
 
 
  On 6/29/2012 10:22 PM, Joe Blanchard wrote:
 
  Seems that they are unreachable at the moment. Called and theres a
  recorded message stating they are aware of an issue, no details.
 
  -Joe
 
 
 
 
 
 



Re: FYI Netflix is down

2012-06-29 Thread Jason Baugher

Nature is such a PITA.

On 6/29/2012 10:42 PM, James Laszko wrote:

To further expand:

8:21 PM PDT We are investigating connectivity issues for a number of instances 
in the US-EAST-1 Region.

  8:31 PM PDT We are investigating elevated errors rates for APIs in the 
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to 
instances in a single availability zone.

  8:40 PM PDT We can confirm that a large number of instances in a single 
Availability Zone have lost power due to electrical storms in the area. We are 
actively working to restore power.

-Original Message-
From: Grant Ridder [mailto:shortdudey...@gmail.com]
Sent: Friday, June 29, 2012 8:42 PM
To: Jason Baugher
Cc: nanog@nanog.org
Subject: Re: FYI Netflix is down

From Amazon

Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/)
8:21 PM PDT We are investigating connectivity issues for a number of instances 
in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to 
instances in a single availability zone.

-Grant

On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com wrote:


Seeing some reports of Pinterest and Instagram down as well. Amazon
cloud services being implicated.


On 6/29/2012 10:22 PM, Joe Blanchard wrote:


Seems that they are unreachable at the moment. Called and theres a
recorded message stating they are aware of an issue, no details.

-Joe














Re: FYI Netflix is down

2012-06-29 Thread Mike Lyon
Whatever happened to UPSs and generators?

On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher ja...@thebaughers.com wrote:

 Nature is such a PITA.


 On 6/29/2012 10:42 PM, James Laszko wrote:

 To further expand:

 8:21 PM PDT We are investigating connectivity issues for a number of
 instances in the US-EAST-1 Region.

  8:31 PM PDT We are investigating elevated errors rates for APIs in the
 US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
 instances in a single availability zone.

  8:40 PM PDT We can confirm that a large number of instances in a single
 Availability Zone have lost power due to electrical storms in the area. We
 are actively working to restore power.

 -Original Message-
 From: Grant Ridder [mailto:shortdudey...@gmail.com]
 Sent: Friday, June 29, 2012 8:42 PM
 To: Jason Baugher
 Cc: nanog@nanog.org
 Subject: Re: FYI Netflix is down

 From Amazon

 Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/)
 8:21 PM PDT We are investigating connectivity issues for a number of
 instances in the US-EAST-1 Region.
 8:31 PM PDT We are investigating elevated errors rates for APIs in the
 US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
 instances in a single availability zone.

 -Grant

 On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com
 wrote:

  Seeing some reports of Pinterest and Instagram down as well. Amazon
 cloud services being implicated.


 On 6/29/2012 10:22 PM, Joe Blanchard wrote:

  Seems that they are unreachable at the moment. Called and theres a
 recorded message stating they are aware of an issue, no details.

 -Joe












-- 
Mike Lyon
408-621-4826
mike.l...@gmail.com

http://www.linkedin.com/in/mlyon


Re: FYI Netflix is down

2012-06-29 Thread Ian Wilson
On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder shortdudey...@gmail.com wrote:
 I have an instance in zone C and it is up and fine, so it must be A, B, or
 D that is down.

It is my understanding that instance zones are randomized between
customers -- so your zone C may be my zone A.

Ian
-- 
Ian Wilson
ian.m.wil...@gmail.com

Solving site load issues with database replication is a lot like
solving your own personal problems with heroin -- at first, it sorta
works, but after a while things just get out of hand.
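
Ian's understanding is correct: the letter-to-zone mapping is per account,
precisely so that everyone does not pile into "a". With today's API you can
see through the mapping, because describe_availability_zones also returns a
ZoneId that is consistent across accounts (that field did not exist in 2012;
this is a present-day sketch assuming boto3):

    import boto3

    # Print this account's zone letters alongside the account-independent
    # zone IDs (ZoneId is a later addition to the API).
    ec2 = boto3.client("ec2", region_name="us-east-1")
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], "->", zone["ZoneId"], zone["State"])

So "zone C is fine here" only tells you that one of the region's zones is
fine; it does not identify which physical zone lost power.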



Re: FYI Netflix is down

2012-06-29 Thread Grant Ridder
Yes, although when you launch an instance you do have the option of
selecting a zone if you want.  However, once the instance is started, it
stays in that zone and does not switch.
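
For reference, pinning a new instance to a particular zone is just a placement
parameter at launch time. A minimal sketch assuming boto3; the AMI ID is a
placeholder, and the zone letter is, per the above, only meaningful within your
own account:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch one instance pinned to a specific availability zone.
    resp = ec2.run_instances(
        ImageId="ami-00000000000000000",   # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": "us-east-1c"},
    )
    inst = resp["Instances"][0]
    print(inst["InstanceId"], inst["Placement"]["AvailabilityZone"])

Leave Placement out and EC2 picks a zone for you, which is why spreading
instances across zones deliberately takes a little planning.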

On Fri, Jun 29, 2012 at 10:47 PM, Ian Wilson ian.m.wil...@gmail.com wrote:

 On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder shortdudey...@gmail.com
 wrote:
  I have an instance in zone C and it is up and fine, so it must be A, B,
 or
  D that is down.

 It is my understanding that instance zones are randomized between
 customers -- so your zone C may be my zone A.

 Ian
 --
 Ian Wilson
 ian.m.wil...@gmail.com

 Solving site load issues with database replication is a lot like
 solving your own personal problems with heroin -- at first, it sorta
 works, but after a while things just get out of hand.



Re: FYI Netflix is down

2012-06-29 Thread Derek Ivey
I was wondering the same thing! Also, Reddit appears to be really slow
right now and I keep getting "reddit is under heavy load right now,
sorry. Try again in a few minutes."


I wonder if it's related. I believe they use Amazon for some of their stuff.

Derek

On 6/29/2012 11:47 PM, Mike Lyon wrote:

Whatever happened to UPSs and generators?

On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher ja...@thebaughers.com wrote:


Nature is such a PITA.


On 6/29/2012 10:42 PM, James Laszko wrote:


To further expand:

8:21 PM PDT We are investigating connectivity issues for a number of
instances in the US-EAST-1 Region.

  8:31 PM PDT We are investigating elevated errors rates for APIs in the
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
instances in a single availability zone.

  8:40 PM PDT We can confirm that a large number of instances in a single
Availability Zone have lost power due to electrical storms in the area. We
are actively working to restore power.

-Original Message-
From: Grant Ridder [mailto:shortdudey...@gmail.com]
Sent: Friday, June 29, 2012 8:42 PM
To: Jason Baugher
Cc: nanog@nanog.org
Subject: Re: FYI Netflix is down

From Amazon

Amazon Elastic Compute Cloud (N. Virginia)  (http://status.aws.amazon.com/)
8:21 PM PDT We are investigating connectivity issues for a number of
instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the
US-EAST-1 (Northern Virginia) region, as well as connectivity issues to
instances in a single availability zone.

-Grant

On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher ja...@thebaughers.com

wrote:

  Seeing some reports of Pinterest and Instagram down as well. Amazon

cloud services being implicated.


On 6/29/2012 10:22 PM, Joe Blanchard wrote:

  Seems that they are unreachable at the moment. Called and theres a

recorded message stating they are aware of an issue, no details.

-Joe

















Re: FYI Netflix is down

2012-06-29 Thread Seth Mattinen
On 6/29/12 8:47 PM, Mike Lyon wrote:
 Whatever happened to UPSs and generators?
 

You don't need them with The Cloud!

But seriously, this is something like the third or fourth time AWS has fallen
over in recent memory.

~Seth





Re: FYI Netflix is down

2012-06-29 Thread Grant Ridder
They may use it for content, but reddit.com resolves to IPs owned by Qwest.
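
That sort of claim is easy to check from any host: resolve the name and then
look the addresses up in whois. A minimal sketch using only the Python
standard library (the ownership lookup itself is left to the whois command
line tool):

    import socket

    def resolve(host):
        # Return the unique addresses the name currently resolves to.
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    for addr in resolve("reddit.com"):
        print(addr)   # feed each address to `whois <addr>` to see who owns it

Note that what a name resolves to (often a CDN or load balancer) says little
about where the origin servers or backing services actually run.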

On Fri, Jun 29, 2012 at 10:51 PM, Seth Mattinen se...@rollernet.us wrote:

 On 6/29/12 8:47 PM, Mike Lyon wrote:
  Whatever happened to UPSs and generators?
 

 You don't need them with The Cloud!

 But seriously, this is something like the third or fourth time AWS fell
 over flat in recent memory.

 ~Seth






Re: FYI Netflix is down

2012-06-29 Thread Grant Ridder
8:49 PM PDT Power has been restored to the impacted Availability Zone and
we are working to bring impacted instances and volumes back online

On Fri, Jun 29, 2012 at 10:52 PM, Grant Ridder shortdudey...@gmail.com wrote:

 They may use it for content, but reddit.com resolves to IPs own by quest


 On Fri, Jun 29, 2012 at 10:51 PM, Seth Mattinen se...@rollernet.us wrote:

 On 6/29/12 8:47 PM, Mike Lyon wrote:
  Whatever happened to UPSs and generators?
 

 You don't need them with The Cloud!

 But seriously, this is something like the third or fourth time AWS fell
 over flat in recent memory.

 ~Seth






