Re: Amazon diagnosis

2011-05-01 Thread Mike

On 04/29/2011 12:35 PM, Joly MacFie wrote:

http://aws.amazon.com/message/65648/




So, in a nutshell, Amazon had a single point of failure that touched 
off this entire incident.


I am still waiting for proof that single points of failure can 
realistically be completely eliminated from any moderately complicated 
network environment / application. So far, I think Murphy is still 
winning on this one.


Good job by the AWS team. However, I am sure your new procedures and 
processes will receive a shakeout again, and it will be interesting to 
see how that goes. I bet there will be more to learn along this road for 
us all.


Mike-



Re: Amazon diagnosis

2011-05-01 Thread Jay Ashworth
- Original Message -
 From: Mike mike-na...@tiedyenetworks.com

 On 04/29/2011 12:35 PM, Joly MacFie wrote:
  http://aws.amazon.com/message/65648/
 
 So, in a nutshell, Amazon had a single point of failure that touched
 off this entire incident.
 
 I am still waiting for proof that single points of failure can
 realistically be completely eliminated from any moderately complicated
 network environment / application. So far, I think Murphy is still
 winning on this one.

Well, in fairness to Amazon, let's ask this: did the failure occur *behind
a component interface they advertise as Reliable*?  Either way, was it possible
for a single customer to avoid that possible failure, and at what cost in
expansion of scope and money?

Cheers,
-- jra



Re: Amazon diagnosis

2011-05-01 Thread Andrew Kirch
On 5/1/2011 2:07 PM, Mike wrote:
 I am still waiting for proof that single points of failure can
 realistically be completely eliminated from any moderately complicated
 network environment / application. So far, I think Murphy is still
 winning on this one.

Sure they can, but as a thought exercise, full 2N redundancy is
difficult on a small scale for anything web-facing.  I've seen a very
simple implementation for a website requiring five 9's that consumed over
$50k in equipment, and that wasn't even geographically diverse.  I have
to believe that scaling up the concept of doing it right results in
exponential cost increases.  To illustrate the problem, I'll give you
the first step in the thought exercise: first, find two datacenters with
diverse carriers that aren't on the same regional power grid.  (As we
learned in the 2003 power outage, IIRC, New York and DC won't work, nor
will Ohio, so you need redundant teams to cover a very remote site.)
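
A rough sketch of the availability math (illustrative figures, and it
assumes the two chains fail independently - which is exactly what a
shared power grid breaks):

#!/usr/bin/env python
# Back-of-the-envelope availability math. Components in series all have
# to work, so their availabilities multiply; N redundant copies fail
# only if all N fail at once. Figures are illustrative, not measured.

def series(*avails):
    """Availability of a chain where every component must work."""
    a = 1.0
    for x in avails:
        a *= x
    return a

def parallel(avail, n):
    """Availability of n independent, redundant copies of a chain."""
    return 1.0 - (1.0 - avail) ** n

# One chain: router, switch, server, each 99.9% available.
single = series(0.999, 0.999, 0.999)  # ~0.9970 -> roughly 26 hours/year down
# Duplicate the entire chain (2N), assuming independent failures.
dual = parallel(single, 2)            # ~0.999991 -> roughly 5 minutes/year

print(f"single chain: {single:.4f}")
print(f"2N chains:    {dual:.6f}")

Note that the independence assumption does all the work there; if both
chains sit on the same grid, that exponent quietly becomes a lie.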



RE: Amazon diagnosis

2011-05-01 Thread George Bonser
 I am still waiting for proof that single points of failure can
 realistically be completely eliminated from any moderately complicated
 network environment / application. So far, I think Murphy is still
 winning on this one.
 
 Good job by the AWS team. However, I am sure your new procedures and
 processes will receive a shakeout again, and it will be interesting to
 see how that goes. I bet there will be more to learn along this road
 for us all.
 
 Mike-

From my reading of what happened, it looks like they didn't have a
single point of failure but ended up routing around their own
redundancy.

They apparently had a redundant primary network and, on top of that, a
secondary network.  The secondary network, however, did not have the
capacity of the primary network.

Rather than failing over from the active portion of the primary network
to the standby portion of the primary network, they inadvertently failed
the entire primary network over to the secondary.  This resulted in the
secondary network reaching saturation and becoming unusable.
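
Purely as a sketch of the missing safeguard (hypothetical names and
numbers; this is not Amazon's actual tooling):

#!/usr/bin/env python
# Sketch: refuse a failover target that cannot carry the offered load,
# so a fat-fingered traffic shift fails loudly instead of saturating
# the smaller network. Illustrative only.

from dataclasses import dataclass

@dataclass
class Network:
    name: str
    capacity_gbps: float

def fail_over(offered_gbps: float, target: Network) -> None:
    """Shift traffic onto the target only if it has the capacity."""
    if offered_gbps > target.capacity_gbps:
        raise RuntimeError(
            f"refusing failover: {target.name} has {target.capacity_gbps} "
            f"Gbps, offered load is {offered_gbps} Gbps")
    print(f"shifting {offered_gbps} Gbps onto {target.name}")

primary_standby = Network("primary-standby", capacity_gbps=100.0)
secondary       = Network("secondary", capacity_gbps=10.0)

fail_over(80.0, primary_standby)   # fine
try:
    fail_over(80.0, secondary)     # would saturate the secondary
except RuntimeError as err:
    print(err)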

There isn't anything that can be done to fully mitigate human error.
You can TRY, but as history shows us, it all boils down to the human who
implements the procedure.  All the redundancy in the world will not do
you an iota of good if someone explicitly does the wrong thing.  In this
case, it is my opinion that Amazon should not have considered their
secondary network to be a true secondary if it was not capable of
handling the traffic.  A completely broken network might have been an
easier failure mode to handle than a saturated one (high packet loss,
but the network is still there).

This looks like it was a procedural error and not an architectural
problem.  They seem to have had standby capability on the primary
network and, from the way I read their statement, did not use it.





Re: Amazon diagnosis

2011-05-01 Thread Paul Graydon

On 5/1/2011 9:29 AM, Jeff Wheeler wrote:

On Sun, May 1, 2011 at 2:18 PM, Andrew Kirchtrel...@trelane.net  wrote:

Sure they can, but as a thought exercise fully 2n redundancy is
difficult on a small scale for anything web facing.  I've seen a very
simple implementation for a website requiring 5 9's that consumed over
$50k in equipment, and this wasn't even geographically diverse.  I have

What it really boils down to is this: if application developers are
doing their jobs, a given service can be easy and inexpensive to
distribute to unrelated systems/networks without a huge infrastructure
expense.  If the developers are not, you end up spending a lot of
money on infrastructure to make up for code, databases, and APIs which
were not designed with this in mind.

These same developers who do not design and implement services with
diversity and redundancy in mind will fare little better with AWS than
any other platform.  Look at Reddit, for example.  This is an
application/service which is utterly trivial to implement in a cheap,
distributed manner, yet they have failed to do so for years, and
suffer repeated, long-duration outages as a result.  They probably buy
a lot more AWS services than would otherwise be needed, and truly have
a more complex infrastructure than such a simple service should.

IT managers would do well to understand that a few smart programmers,
who understand how all their tools (web servers, databases,
filesystems, load-balancers, etc.) actually work, can often do more to
keep infrastructure cost under control, and improve the reliability of
services, than any other investment in IT resources.
If you want a perfect example of this, consider Netflix.  Their 
infrastructure runs on AWS and we didn't see any downtime from them 
throughout the entire affair.
One of the interesting things they've done to try to enforce 
reliability of services is an in-house service called Chaos Monkey, whose 
sole purpose is to randomly kill instances and services inside the 
infrastructure.  Courtesy of Chaos Monkey and the defensive programming 
it enforces, no service depends on any other being up; you will always 
get at least some form of service.  For example, if the recommendation 
engine dies, the application is smart enough to catch that and instead 
return a list of the most popular movies, and so on.  There is an 
interesting blog post from their Director of Engineering about what they 
learned in their migration to AWS, including using less chatty APIs to 
reduce the impact of typical AWS latency:

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
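
The degrade-gracefully pattern is simple enough to sketch (function
names here are hypothetical, not Netflix's actual code):

#!/usr/bin/env python
# Sketch of graceful degradation: if the personalized path fails, serve
# a generic answer rather than an error. Names are made up for
# illustration.

def fetch_recommendations(user_id):
    """Personalized recommendations; fails when the engine is down."""
    raise TimeoutError("recommendation engine unreachable")  # simulate outage

def fetch_most_popular():
    """Cheap fallback needing no per-user state (easily cached)."""
    return ["Most Popular Movie", "Another Crowd Pleaser"]

def recommendations_for(user_id):
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # Degrade instead of failing: the user still gets *a* list.
        return fetch_most_popular()

print(recommendations_for(42))   # falls back to the popular list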

Paul



Re: Amazon diagnosis

2011-05-01 Thread Brett Frankenberger
On Sun, May 01, 2011 at 12:50:37PM -0700, George Bonser wrote:
 
 From my reading of what happened, it looks like they didn't have a
 single point of failure but ended up routing around their own
 redundancy.
 
 They apparently had a redundant primary network and, on top of that, a
 secondary network.  The secondary network, however, did not have the
 capacity of the primary network.
 
 Rather than failing over from the active portion of the primary network
 to the standby portion of the primary network, they inadvertently failed
 the entire primary network to the secondary.  This resulted in the
 secondary network reaching saturation and becoming unusable.
 
 There isn't anything that can be done to fully mitigate human error.
 You can TRY, but as history shows us, it all boils down to the human who
 implements the procedure.  All the redundancy in the world will not do
 you an iota of good if someone explicitly does the wrong thing.
   [ ... ]
 
 This looks like it was a procedural error and not an architectural
 problem.  They seem to have had standby capability on the primary
 network and, from the way I read their statement, did not use it.

The procedural error was putting all the traffic on the secondary
network.  They promptly recognized that error, and fixed it.  It's
certainly true that you can't eliminate human error.

The architectural problem is that they had insufficient error recovery
capability.  Initially, the system was trying to use a network that was
too small; that situation lasted for some number of minutes; it's no
surprise that the system couldn't operate under those conditions and
that isn't an indictment of the architecture.  However, after they put
it back on a network that wasn't too small, the service stayed
down/degraded for many, many hours.  That's an architectural problem. 
(And a very common one.  Error recovery is hard and tedious and more
often than not, not done well.)

Procedural error isn't the only way to get into that boat.  If the
wrong pair of redundant equipment in their primary network had failed
simultaneously, they'd likely have found themselves in the same boat: a
short outage caused by a risk they accepted (loss of a pair of
redundant hardware), followed by a long outage (after they restored the
network) caused by insufficient recovery capability.

Their writeup suggests they fully understand these issues and are doing
the right thing by seeking to have better recovery capability.  They
spent one sentence saying they'll look at their procedures to reduce
the risk of a similar procedural error in the future, and then spent
paragraphs on what they are going to do to have better recovery should
something like this occur in the future.

(One additional comment, for whoever posted that Netflix had a better
architecture and wasn't impacted by this outage.  It might well be that
Netflix does have a better architecture, and that might be why they
weren't impacted ... but there's also the possibility that they just
run in a different region.  Lots of entities with poor architecture
running on AWS survived this outage just fine, simply by not being in
the region that had the problem.)

 -- Brett



Re: Amazon diagnosis

2011-05-01 Thread Robert Bonomi

 Date: Sun, 01 May 2011 11:07:56 -0700
 From: Mike mike-na...@tiedyenetworks.com
 To: nanog@nanog.org
 Subject: Re: Amazon diagnosis

 On 04/29/2011 12:35 PM, Joly MacFie wrote:
  http://aws.amazon.com/message/65648/
 


 So, in a nutshell, Amazon had a single point of failure that touched 
 off this entire incident.

 I am still waiting for proof that single points of failure can 
 realistically be completely eliminated from any moderately complicated 
 network environment / application. So far, I think Murphy is still 
 winning on this one.

This was a classic case of _O'Brien's_Law_ in action -- which states,
rather pithily:

Murphy...

 was an OPTIMIST!!





Re: Amazon diagnosis

2011-05-01 Thread Valdis . Kletnieks
On Sun, 01 May 2011 11:07:56 PDT, Mike said:

 I am still waiting for proof that single points of failure can 
 realistically be completely eliminated from any moderately complicated 
 network environment / application. So far, I think Murphy is still 
 winning on this one.

For starters, you almost always screw up and have one NOC full of
chuckle-headed banana eaters. And if you have two NOCs, that implies
one entity deciding which one takes lead on a problem. ;)






RE: Amazon diagnosis

2011-05-01 Thread Robert Bonomi

 Subject: RE: Amazon diagnosis
 Date: Sun, 1 May 2011 12:50:37 -0700
 From: George Bonser gbon...@seven.com

 They apparently had a redundant primary network and, on top of that, a
 secondary network.  The secondary network, however, did not have the
 capacity of the primary network.

 Rather than failing over from the active portion of the primary network
 to the standby portion of the primary network, they inadvertently failed
 the entire primary network to the secondary.  This resulted in the
 secondary network reaching saturation and becoming unusable.

 There isn't anything that can be done to fully mitigate human error.
 You can TRY, but as history shows us, it all boils down to the human who
 implements the procedure.  All the redundancy in the world will not do
 you an iota of good if someone explicitly does the wrong thing.  ...

 This looks like it was a procedural error and not an architectural
 problem.  

A sage sayeth sooth: 

  For any 'fool-proof' system, there exists 
   a *sufficiently determined* fool capable of
   breaking it.

It would seem that the validity of that has just been re-confirmed.  <wry grin>


It is worthy of note that it is considerably harder to protect against
accidental stupidity than it is to protect against intentional malice.
('malice' is _much_ more predictable, in general.  <wry grin>)





Re: Amazon diagnosis

2011-05-01 Thread Stefan
On Fri, Apr 29, 2011 at 2:35 PM, Joly MacFie j...@punkcast.com wrote:
 http://aws.amazon.com/message/65648/

 --
 Joly MacFie  218 565 9365 Skype:punkcast
 WWWhatsup NYC - http://wwwhatsup.com
 http://pinstand.com - http://punkcast.com
 VP (Admin) - ISOC-NY - http://isoc-ny.org


http://storagemojo.com/2011/04/29/amazons-ebs-outage/

Stefan Mititelu
http://twitter.com/netfortius
http://www.linkedin.com/in/netfortius



Multitenant FWs

2011-05-01 Thread David Oramas
Hi,
What do you guys recommend for multitenant firewalls with support for 
1,000+ users/contexts?
I have looked at Centrinet's Accessmanager and Barracuda NG Firewall. Any other 
players/products?
Many thanks in advance for the input,





RE: Multitenant FWs

2011-05-01 Thread Mark Gauvin
Palo Alto Networks builds some nice gear.

From: David Oramas [david.ora...@aptel.com.au]
Sent: Sunday, May 01, 2011 8:42 PM
To: nanog@nanog.org
Subject: Multitenant FWs

Hi,
What do you guys recommend for multitenant firewalls with support for 
1,000+ users/contexts?
I have looked at Centrinet's Accessmanager and Barracuda NG Firewall. Any other 
players/products?
Many thanks in advance for the input,






RE: Multitenant FWs

2011-05-01 Thread Stefan Fouant
 -Original Message-
 From: David Oramas [mailto:david.ora...@aptel.com.au]
 Sent: Sunday, May 01, 2011 9:42 PM
 To: nanog@nanog.org
 Subject: Multitenant FWs
 
 Hi,
 What do you guys recommend for multitenant firewalls with support for
 1,000+ users/contexts?
 I have looked at Centrinet's Accessmanager and Barracuda NG Firewall.
 Any other players/products?
 Many thanks in advance for the input,

When I worked on building out Verizon's Network Based Firewall solution many
years ago, I chose the Juniper NS-5400 platform due to its multitenancy
capabilities and its ability to support literally thousands of virtual firewall
contexts, and many times that for users.  This decision was made after an
exhaustive analysis of competing solutions from Checkpoint, Cisco, and
Juniper.  Juniper's SRX line of products might make a good fit, but it
currently doesn't have full Logical System support, which would certainly be a
requirement for any multi-tenant offering.  However, Logical System support
is on the roadmap, so you might want to look into it depending on your
timeframe for deployment.

As the other list member pointed out, Palo Alto does make some really nice
gear, and I have been really impressed with their application-layer
firewalling capability (application identification, web firewalling, etc.);
however, I was suitably unimpressed with their multitenant capability and
think you might be hard pressed to extend such an offering to more than one
customer using such a device.

Stefan Fouant





Re: Multitenant FWs

2011-05-01 Thread Christopher Morrow
On Sun, May 1, 2011 at 11:05 PM, Stefan Fouant
sfou...@shortestpathfirst.net wrote:
 -Original Message-
 From: David Oramas [mailto:david.ora...@aptel.com.au]
 Sent: Sunday, May 01, 2011 9:42 PM
 To: nanog@nanog.org
 Subject: Multitenant FWs

 Hi,
 What do you guys recommend for Multitenant Firewalls with support for
 over 1,000+ users/contexts?
 I have looked at Centrinet's Accessmanager and Barracuda NG Firewall.
 Any other players/products?
 Many Thanks in advance for the input,

one thing to keep in mind is that, as near as I can tell, no vendor (not
a single one) has actual hard limits configurable for each tenant
firewall instance. So, one tenant can use all of the 'firewall rule'
resources, or all of the 'route memory' ... leaving other
instances flailing :(

In my mind, unless you have very loose SLAs or are highly
overprovisioned... until vendors address this basic problem, this model
is a failure.

 When I worked on building out Verizon's Network Based Firewall solution many
 years ago, I chose Juniper NS-5400 platforms due to their multitenancy
 capabilities and ability to support literally thousands of virtual firewall
 contexts and many times that for users.  This decision was made after an

yup.. too bad no actual customers showed up :( (well, not any in real
numbers... though not due to the tech on the FW side, nor the
engineering work)

 As the other list member pointed out, Palo Alto does make some really nice
 gear and I have really been impressed with their Application Layer
 Firewalling capability (Application Identification, Web Firewalling, etc),
 however, I was suitably unimpressed with their multitenant capability and
 think you might be hard pressed to offer such an offering to more than one
 customer using such a device.

no support for actual limits on resources, eh? :( nothing on at least:

memory dedicated to a tenant
routing resources
packet processing resources
inspection rule resources
bandwidth/through-put
management operations

(I'm sure I left some off, but the above would be an excellent thing
to see vendors support with hard limits THAT I CAN CONFIGURE!!)
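
What I mean by hard limits, as a vendor-neutral sketch (nothing below
is any vendor's actual CLI or API):

#!/usr/bin/env python
# Vendor-neutral sketch of per-tenant hard limits: each tenant draws
# from its own fixed quota, so exhausting one tenant's rules or
# sessions cannot starve the others. Numbers are illustrative.

class TenantQuota:
    def __init__(self, name: str, max_rules: int, max_sessions: int):
        self.name = name
        self.max_rules = max_rules
        self.max_sessions = max_sessions
        self.rules = 0
        self.sessions = 0

    def add_rule(self) -> None:
        if self.rules >= self.max_rules:
            raise RuntimeError(f"{self.name}: rule quota exhausted")
        self.rules += 1

    def open_session(self) -> None:
        if self.sessions >= self.max_sessions:
            raise RuntimeError(f"{self.name}: session quota exhausted")
        self.sessions += 1

tenant_a = TenantQuota("tenant-a", max_rules=1000, max_sessions=50000)
tenant_b = TenantQuota("tenant-b", max_rules=1000, max_sessions=50000)

for _ in range(1000):
    tenant_a.add_rule()    # tenant A hits its own ceiling...
tenant_b.add_rule()        # ...while tenant B is unaffected
# tenant_a.add_rule()      # would raise: quota exhausted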

-chris



RE: Multitenant FWs

2011-05-01 Thread Stefan Fouant
 -Original Message-
 From: christopher.mor...@gmail.com
 [mailto:christopher.mor...@gmail.com] On Behalf Of Christopher Morrow
 
 one thing to keep in mind is that, as near as I can tell, no vendor (not
 a single one) has actual hard limits configurable for each tenant
 firewall instance. So, one tenant can use all of the 'firewall rule'
 resources, or all of the 'route memory' ... leaving other
 instances flailing :(

Ahem, actually ScreenOS does support just such a thing through the use of
resource profiles - with these you can limit the amount of CPU, Sessions,
Policies, MIPs and DIPs (used for NAT), and other user-defined objects, such
as address book entries, that each VSYS can use.  This was one of the
primary drivers behind our decision to utilize the NS-5400 for Verizon's
NBFW (you remember that place, right Chris? heh)

Stefan Fouant





Re: Multitenant FWs

2011-05-01 Thread Christopher Morrow
On Mon, May 2, 2011 at 12:20 AM, Stefan Fouant
sfou...@shortestpathfirst.net wrote:
 -Original Message-
 From: christopher.mor...@gmail.com
 [mailto:christopher.mor...@gmail.com] On Behalf Of Christopher Morrow

 one thing to keep in mind is that, as near as I can tell, no vendor (not
 a single one) has actual hard limits configurable for each tenant
 firewall instance. So, one tenant can use all of the 'firewall rule'
 resources, or all of the 'route memory' ... leaving other
 instances flailing :(

 Ahem, actually ScreenOS does support just such a thing through the use of
 resource profiles - with these you can limit the amount of CPU, Sessions,
 Policies, MIPs and DIPs (used for NAT), and other user-defined objects, such
 as address book entries, that each VSYS can use.  This was one of the

good to know... I wonder how well it isolates.

 primary drivers behind our decision to utilize the NS-5400 for Verizon's
 NBFW (you remember that place, right Chris? heh)

i do, occasionally via the twitching :)

 Stefan Fouant






RE: Multitenant FWs

2011-05-01 Thread Stefan Fouant
 -Original Message-
 From: christopher.mor...@gmail.com
 [mailto:christopher.mor...@gmail.com] On Behalf Of Christopher Morrow
 
  Ahem, actually ScreenOS does support just such a thing through the use of
  resource profiles - with these you can limit the amount of CPU, Sessions,
  Policies, MIPs and DIPs (used for NAT), and other user-defined objects, such
  as address book entries, that each VSYS can use.  This was one of the
 
 good to know... I wonder how well it isolates.

Ask the Vz marketing folks... oh, wait, 1 customer isn't really enough to
demonstrate how well it isolates after all, I guess ;)

  primary drivers behind our decision to utilize the NS-5400 for Verizon's
  NBFW (you remember that place, right Chris? heh)
 
 i do, occasionally via the twitching :)

Hehe...

Stefan Fouant