RE: [SLUG] Redundant Web Servers

2003-06-09 Thread Jon Biddell
Well, we've solved the problem, and it works perfectly.

With much thanks to all the SLUGgers (and squid-ers) who gave
suggestions, we presented a plan to marketing that was almost
foolproof.

And we costed it at approx. $200k initially, plus another $50k in
recurring costs.

They saw our point...:-)

Jon

-= -Original Message-
-= From: [EMAIL PROTECTED]
-= [mailto:[EMAIL PROTECTED]
-= Behalf Of [EMAIL PROTECTED]
-= Sent: Saturday, 7 June 2003 7:58 AM
-= To: Robert Collins
-= Cc: [EMAIL PROTECTED]
-= Subject: Re: [SLUG] Redundant Web Servers
-=
-=
-= On 2 Jun 2003, Robert Collins wrote:
-=
-=  Dual homed connection to them, using two separate
-= exchanges and/or
-=  connection technologies - on two power grids... you
-= may need to rent
-=  facilities to get these two things.
-=
-= And note that not even this will guarantee absolutely
-= seamless failover.
-=
-= A flapping link on one of the redundant links will give
-= BGP heart attacks,
-= and result in route advertisements changing frequently
-= until someone gets
-= in the middle of it and stops the flapping link from
-= advertising anything
-= until it's repaired.
-=
-= {wry grin} been there, done that.
-=
-= DaZZa
-=
-= --
-= SLUG - Sydney Linux User's Group - http://slug.org.au/
-= More Info: http://lists.slug.org.au/listinfo/slug
-=

-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-06 Thread dazza
On 2 Jun 2003, Robert Collins wrote:

> Dual homed connection to them, using two separate exchanges and/or
> connection technologies - on two power grids... you may need to rent
> facilities to get these two things.

And note that not even this will guarantee absolutely seamless failover.

A flapping link on one of the redundant links will give BGP heart attacks,
and result in route advertisements changing frequently until someone gets
in the middle of it and stops the flapping link from advertising anything
until it's repaired.

{wry grin} been there, done that.

DaZZa

-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-03 Thread Glen Turner
Jon Biddell wrote:

> Our marketing types want 24/7 availability of our corporate web
> site - a fair enough request, I guess...
>
> However we have a number of restrictions on what we can do;
>
> 1. Must (presently) remain with IIS - moving to a Linux/Apache
> solution may become possible later, but it's political
>
> 2. Servers must be physically located on different campuses -
> because we connect to the 'net through AARNET, we want them on
> different RNOs.

Hi Jon,

There are some AARNet network changes in the works you
need to be aware of.

The network will be rebuilt across the coming six months
(ie: AARNet3).  There will be two routing cores in each
state capital, with dual connections to campuses which
have fiber diversity options.  So there will be no need
to connect to multiple RNOs for equipment diversity.

This makes your problem significantly simpler: a campus
resilience problem rather than a global availability
problem.

I'd strongly suggest not connecting the same web server
to multiple AARNet2 RNOs, as each RNO has its own BGP
autonomous system.  You'll recall that BGP events for
each prefix are counted by backbone routers, and when
too many events occur the route is dampened.  So if
there is a short outage the route will appear to move
between RNOs, adding to its likelihood of being dampened.
You've just made a 10s outage into a 30m outage.

You're much better off not moving between AS numbers when
doing recovery; then there are no global BGP issues to
bite you.  This is exactly the reason that the AARNet3
network will use MPLS for recovery rather than an IP-layer
mechanism.  We're also likely to revisit the one-AS-per-state
design and have a single big AS for the whole of AARNet
(this wasn't practical for AARNet2, as MPLS didn't exist
then and BGP was the only sane way to express network
engineering policies as complex as AARNet's).

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
> fails. Doing a shift-reload in the browser is NOT an option. It
> must be TOTALLY TRANSPARENT.

This means sharing TCP state between the two machines (or
between some front-ends).  This is certainly do-able, but
not something you'd want to do across a WAN (because you
need to control the jitter of the Hello probes within
the cluster to prevent false triggering).

> The only other solution I can come up with, given the above anal
> restrictions, is to use a round robin DNS setup, but this will
> involve doing a reload if the primary server fails to pick up the
> secondary DNS entry.

There's nothing to prevent DNS being aware of the server
state.  It's the cached DNS responses that will go to the
wrong server.

Since you're forced to use IIS you might want to look into
the clustering technologies in Windows Server 2003.  And
the HSRP protocol for your dual campus routers (one of which
can be connected to each AARNet3 PoP).  This would bring
high availability to the entire campus, not just the web site.
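
For reference, HSRP on a pair of campus routers is only a few lines of
IOS configuration; a rough sketch (interface names and addresses are
invented for illustration):

    ! Router A - preferred gateway for the web server VLAN
    interface Vlan100
     ip address 10.1.100.2 255.255.255.0
     standby 1 ip 10.1.100.1        ! shared virtual gateway address
     standby 1 priority 110         ! higher priority = active router
     standby 1 preempt              ! take the active role back after recovery

    ! Router B - standby gateway
    interface Vlan100
     ip address 10.1.100.3 255.255.255.0
     standby 1 ip 10.1.100.1
     standby 1 priority 100

The servers simply point their default gateway at 10.1.100.1 and never
need to know which physical router is currently answering for it.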

The networking companies also all offer products to address
this problem.  A typical example is Cisco LocalDirector 417G,
list price A$37,000.  It neatly addresses the TCP issues,
but you still need to provide the resilient network
infrastructure, which is expensive if you don't already
have it.

These are pretty stiff prices, which is why a lot of
firms outsource the problem to content delivery providers
such as Akamai.  Unfortunately your web people will
need to get over their IIS addiction to use these services
(they generally use a sandboxed Java to run any server-side
applications).

Since you're paying my salary, feel free to call :-)
I'm in Brisbane today arranging a link to Townsville,
but I should be contactable in the afternoon and am
back in my office on Wednesday.

Regards,
Glen
--
 Glen Turner Tel: (08) 8303 3936 or +61 8 8303 3936
 Network Engineer  Email: [EMAIL PROTECTED]
 Australian Academic  Research Network   www.aarnet.edu.au
--
 linux.conf.au 2004, Adelaide  lca2004.linux.org.au
 Main conference 14-17 January 2004   Miniconfs from 12 Jan
--
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


[SLUG] Redundant Web Servers

2003-06-02 Thread Jon Biddell
Hi all,

Our marketing types want 24/7 availability of our corporate web 
site - a fair enough request, I guess...

However we have a number of restrictions on what we can do;

1. Must (presently) remain with IIS - moving to a Linux/Apache 
solution may become possible later, but it's political

2. Servers must be physically located on different campuses - 
because we connect to the 'net through AARNET, we want them on 
different RNOs.

3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one 
fails. Doing a shift-reload in the browser is NOT an option. It 
must be TOTALLY TRANSPARENT.

Keeping the boxes in sync is no problem.

I was thinking of a Linux box with 3 NICs - one to each server and 
one to the 'net, but this will only work if the servers are 
physically located on the same network.
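
Something like LVS (the Linux Virtual Server code, driven by ipvsadm)
in NAT mode would be one way to build that box - a rough sketch, with
made-up addresses:

    # On the director (public IP 203.0.113.10, servers on 10.0.0.0/24 behind it)
    ipvsadm -A -t 203.0.113.10:80 -s rr              # virtual HTTP service, round robin
    ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.1:80 -m  # real server 1, NAT (masquerade) mode
    ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.2:80 -m  # real server 2
    echo 1 > /proc/sys/net/ipv4/ip_forward           # the director must forward packets

In NAT mode the real servers have to route their replies back through
the director, which is why it only works with both servers on the same
network - and the director itself becomes a single point of failure.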

The only other solution I can come up with, given the above anal 
restrictions, is to use a round robin DNS setup, but this will 
involve doing a reload if the primary server fails to pick up the 
secondary DNS entry.

I'm open to suggestions if anyone knows of a more elegant way of 
doing it - hell, if anyone knows how to make it work, I'll listen 
!!

Jon
-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread Steve Kowalik
At  9:17 am, Monday, June  2 2003, Jon Biddell mumbled:
> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one 
> fails. Doing a shift-reload in the browser is NOT an option. It 
> must be TOTALLY TRANSPARENT.
 
You're going to get one anyway. If the machine falls over, you're not going
to get any more data, and the client will have to re-request.

One solution is mod_backhand with apache, and the IIS servers behind it.

That may conflict with the politics, but whatever.

-- 
   Steve
* StevenK laughs at Joy's connection.
* Joy spits on StevenK 
* StevenK sees the spit coming at him slowly and ducks in time.
jaiger StevenK: how did you do that?  you moved like *them*
StevenK jaiger: Can you fly that thing? *points*
jaiger not yet
jaiger apt-get install libpilot-chopper
-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread James Gregory
Let me prefix this: I don't really know what I'm talking about, double
check anything I say.

On Mon, 2003-06-02 at 09:16, Jon Biddell wrote:

> 2. Servers must be physically located on different campuses - 
> because we connect to the 'net through AARNET, we want them on 
> different RNOs.
> 
> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one 
> fails. Doing a shift-reload in the browser is NOT an option. It 
> must be TOTALLY TRANSPARENT.

Wow. Well, point 3 makes it pretty hard. As I understand it, that's an
intentional design decision of tcp/ip -- if it were easy to have another
computer interrupt an existing tcp connection and just take it over,
then I'm sure it would be exploited. Thus to keep a tcp connection open
you need to have a certain amount of state information; I think it does
this through so-called sequence numbers, but I'm not a network ninja,
so I'm not sure. The point is that to be able to have another computer
step in half way through a transaction, you'll need to have state
information being transferred between the two computers constantly.

Now, the other option is to have some sort of proxying server which just
farms requests out to each server, but then you have a single point of
failure and you're right back where you started.

I believe that there are boxes that do this, but they're hugely
expensive. Like hundreds of thousands of dollars.

So, I suppose you need to analyze the risks that you're trying to
minimise. It would be easier to have a single box in a single building
with multiple connections that were arbitrated by bgp. I still think
you'd need to do a reload in most real situations.

I'll be interested to hear what you come up with. Sorry I can't be more
help.

James.


-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread Peter Chubb
>>>>> "James" == James Gregory [EMAIL PROTECTED] writes:


>> 2. Servers must be physically located on different campuses -
>> because we connect to the 'net through AARNET, we want them on
>> different RNOs.
>>
>> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
>> fails. Doing a shift-reload in the browser is NOT an option. It
>> must be TOTALLY TRANSPARENT.

James> Wow. Well, point 3 makes it pretty hard. As I understand it,
James> that's an intentional design decision of tcp/ip -- if it were
James> easy to have another computer interrupt an existing tcp
James> connection and just take it over, then I'm sure it would be

If you're only serving static content, that's not an issue:  HTTP
version 1.0 uses a new tcp/ip connexion for each request anyway.
With round-robin DNS you may end up with different images on the same
page being served from different servers anyway.

Personally I'd go with round-robin DNS, and try to detect failure and
update the DNS fast.  Some people's browsers would appear to hang
for a short while when attempting to access the next page, until the
DNS caught up (this implies using a short timeout on the name).
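
In BIND terms that's just multiple A records under a deliberately short
TTL; a minimal zone-file fragment (names and addresses invented):

    $TTL 60                         ; 60-second TTL so caches re-ask quickly
    www     IN  A   203.0.113.10    ; server on campus A
    www     IN  A   198.51.100.20   ; server on campus B

The failure-detection script then only has to delete the dead box's A
record and bump the zone serial, and within a minute or so most caches
have forgotten it.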


Peter C
-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread Andrew McNaughton

On Mon, 2 Jun 2003, James Gregory wrote:

>>>  3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
>>>  fails. Doing a shift-reload in the browser is NOT an option. It
>>>  must be TOTALLY TRANSPARENT.
>>
>> James> Wow. Well, point 3 makes it pretty hard. As I understand it,
>> James> that's an intentional design decision of tcp/ip -- if it were
>> James> easy to have another computer interrupt an existing tcp
>> James> connection and just take it over, then I'm sure it would be
>>
>> If you're only serving static content, that's not an issue:  HTTP
>> version 1.0 uses a new tcp/ip connexion for each request anyway.
>> With round-robin DNS you may end up with different images on the same
>> page being served from different servers anyway.
>
> Sure, that's a given. I thought the problem was that it had to happen
> without a reload - server crashing halfway through serving a particular
> html page. I considered 0 ttl dns as well, but it only works if you can
> afford reloads.

I suppose you might be able to hack something together with MIME's
multipart/x-mixed-replace in a proxy which monitored content length and
was ready to fetch a second MIME part where required.  It would be a bit
messy though, not necessarily compatible with all browsers, and the proxy
is still going to be a single point of failure.
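
For the curious, the wire format such a proxy would have to emit looks
roughly like this (sketch only - each part replaces the one before it,
and browser support is patchy):

    HTTP/1.0 200 OK
    Content-Type: multipart/x-mixed-replace; boundary=frame

    --frame
    Content-Type: text/html

    ...the page as served by the first back end...
    --frame
    Content-Type: text/html

    ...a complete copy re-fetched from the surviving back end...
    --frame--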

Andrew McNaughton



--

No added Sugar.  Not tested on animals.  If irritation occurs,
discontinue use.

---
Andrew McNaughton   In Sydney
Working on a Product Recommender System
[EMAIL PROTECTED]
Mobile: +61 422 753 792 http://staff.scoop.co.nz/andrew/cv.doc



-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread Luke Burton

On Monday, June 2, 2003, at 09:16  AM, Jon Biddell wrote:

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
> fails. Doing a shift-reload in the browser is NOT an option. It
> must be TOTALLY TRANSPARENT.

The marketing types have to understand that nothing is perfect, for 
starters. HTTP and browsers aren't intelligent enough to go "oh, this 
feed stopped midway through - let's see whether there are any secondary 
sites for this". Ultimately, you may end up with broken portions of the 
page, should something halt midway through serving a client.

That being said, they are probably not thinking of it in such a 
fine-grained manner. That's worth clarifying though. Don't let them slip one 
past you!!

On to some technical stuff. I'm not really up to speed with exactly how 
squid works, but couldn't a round robin DNS present issues for clients 
accessing through a proxy? If squid has cached a DNS reply, it might 
query a stale IP address. Any squid boffins got comments on that one? 
I'm thinking of say, Telstra's proxy farm that all bigpond people go 
through for instance.
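
For what it's worth, squid does keep its own DNS cache; from memory the
relevant squid.conf knobs look something like this (check
squid.conf.default for your version):

    # upper bound on how long squid trusts a successful DNS lookup
    positive_dns_ttl 6 hours
    # how long squid remembers a failed lookup
    negative_dns_ttl 5 minutes

Depending on the squid version these either cap or replace the record's
own TTL, so a short TTL on the name won't necessarily help a client
sitting behind a big proxy farm that has already cached the old answer.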

A good compromise might be to have a 'forwarder' machine hosted on a 
highly available, redundant network of your choosing. You make sure 
that the logic in this thing is as simple as possible, so that there is 
a minimised risk of it going wrong. You pay a few $$ to make sure that 
it's on failover hardware, redundant net connections, etc.

Its job is to forward requests to your bulkier, more failure-prone IIS 
installations at your two campuses. It will know whether either of them 
has gone down or had performance unacceptably degraded, and start 
forwarding to your other box. There will be two processes - one will be 
a little httpd that executes a simple loop to decide where to forward 
the request; the other will be something that polls your servers to 
determine health (maybe even via SNMP + ping + HTTP GET). The second 
process feeds a small table that the first process uses to make 
decisions on.
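
A rough sketch of the polling half, assuming an invented health URL and
state file (the little httpd would just read /var/run/healthy-backend
on each request):

    #!/bin/sh
    # Poll each IIS box and record the first one that answers its health URL.
    BACKENDS="10.1.1.10 10.2.2.10"      # campus A and campus B servers (invented)
    STATE=/var/run/healthy-backend

    while true; do
        for ip in $BACKENDS; do
            # -q quiet, -T 5 short timeout so a dead box fails fast,
            # -O /dev/null discards the body - we only care about success
            if wget -q -T 5 -O /dev/null "http://$ip/health.html"; then
                echo "$ip" > $STATE.tmp && mv $STATE.tmp $STATE
                break
            fi
        done
        sleep 10
    done

Writing to a temporary file and mv-ing it into place means the forwarder
never reads a half-written entry.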

Yes, this is technically a single point of failure system - but you are 
mitigating that by 1. keep its job very, very simple; 2. putting it on 
a dedicated, simple Linux machine; 3. hosting it somewhere very highly 
available.

Regards,

Luke.

--
Luke Burton.
(PGP keys: http://www.hagus.net/pgp)

Yes, questions. Morphology, longevity, incept dates.



-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread Chris D.
This one time, at band camp, Luke Burton wrote:
> On Monday, June 2, 2003, at 09:16  AM, Jon Biddell wrote:
>
> >  3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
> >  fails. Doing a shift-reload in the browser is NOT an option. It
> >  must be TOTALLY TRANSPARENT.
>
> A good compromise might be to have a 'forwarder' machine hosted on a 
> highly available, redundant network of your choosing. You make sure 
> that the logic in this thing is as simple as possible, so that there is 
> a minimised risk of it going wrong. You pay a few $$ to make sure that 
> it's on failover hardware, redundant net connections, etc.

Just a thought, maybe just an old Pentium box that does port forwarding.
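
Something along those lines, presumably - a couple of netfilter rules do
the actual forwarding (addresses invented):

    # Send incoming port 80 to whichever campus server is currently live
    iptables -t nat -A PREROUTING -p tcp --dport 80 \
        -j DNAT --to-destination 10.1.1.10:80
    # Rewrite the source address so replies come back through this box
    iptables -t nat -A POSTROUTING -p tcp -d 10.1.1.10 --dport 80 -j MASQUERADE
    echo 1 > /proc/sys/net/ipv4/ip_forward

Failover is then just a script swapping the DNAT target when a health
check fails - but the old Pentium is, again, a single point of failure.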

- Chris
[EMAIL PROTECTED]
-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug


Re: [SLUG] Redundant Web Servers

2003-06-02 Thread Robert Collins
On Mon, 2003-06-02 at 09:16, Jon Biddell wrote:
> Hi all,
>
> Our marketing types want 24/7 availability of our corporate web 
> site - a fair enough request, I guess...
>
> However we have a number of restrictions on what we can do;
>
> 1. Must (presently) remain with IIS - moving to a Linux/Apache 
> solution may become possible later, but it's political

:}. I suppose windows/Apache is also political?

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one 
> fails. Doing a shift-reload in the browser is NOT an option. It 
> must be TOTALLY TRANSPARENT.

This is (as has already been mentioned) tricky. See below for a
discussion.

> Keeping the boxes in sync is no problem.
>
> I was thinking of a Linux box with 3 NICs - one to each server and 
> one to the 'net, but this will only work if the servers are 
> physically located on the same network.

That box becomes a single point of failure.

> The only other solution I can come up with, given the above anal 
> restrictions, is to use a round robin DNS setup, but this will 
> involve doing a reload if the primary server fails to pick up the 
> secondary DNS entry.

Much more than a reload: if you encounter a flapping situation with both
servers, you may actually increase the perceived downtime (as a worst
case...).

> I'm open to suggestions if anyone knows of a more elegant way of 
> doing it - hell, if anyone knows how to make it work, I'll listen 
> !!

Firstly, you haven't clearly identified to us, your free consultants, the
current greatest failure risks. I.e. if the mean time between failures
for the various components is (using arbitrary figures):
Firewalls 60,000hrs.
Lan switches 100,000 hrs.
IIS 48 hrs.
Windows 200hrs.
Linux front end server 30,000 hrs.

And for simplicity we'll assume that failure here is catastrophic: you
put in a cold spare in the event of a failure. It's easy to see in the
above scenario that anything that encapsulates IIS will give you huge
uptimes relative to the naked beast being directly visible. That said,
you can start to plan how you make it all hang together.

ASSUMING that you are only concerned about IIS, not about NIC failures,
switch failures, firewall or router failures, it's really quite trivial:
front end IIS with squid, with a couple of hacks. The hacks will be to
buffer entire objects before sending full headers to the client; that
way a crashed server results in squid retrying from the other server,
not in the client receiving an error.
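
From memory of squid 2.5's configuration (check squid.conf.default), the
un-hacked accelerator part of that looks roughly like this, with an
invented back-end hostname:

    # squid.conf fragment - run squid as an HTTP accelerator for IIS
    http_port 80
    httpd_accel_host iis-backend.example.com   # the back-end IIS server
    httpd_accel_port 80
    httpd_accel_single_host on
    httpd_accel_uses_host_header on            # needed if IIS serves several vhosts
    httpd_accel_with_proxy off                 # pure accelerator, not a general proxy

The buffer-the-whole-object-and-retry behaviour described above is the
part that needs patching; stock squid will happily pass a half-finished
reply straight through to the client.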

If you want to protect against network failures, multihomed connectivity
*at each site* is the way to go. Unless you have a large network, many
core routers won't propagate dual-homed routes (because of the filtering
of long prefixes) - so get your ISP to dual-home you to their network at
each site.

That protects you against transient link failures at each site, and the
multiple sites allow you to fail over. You'll need a hacked DNS setup
to dynamically add and remove virtual servers as each site comes online or
suffers a failure, and that means you'll want your TTL way down. Be sure
to have the DNS servers located far away from your hosting site.
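
That DNS hack needn't be exotic - with BIND dynamic updates it can be a
few lines of nsupdate driven by the heartbeat script (zone and addresses
are made up; TSIG keys and allow-update ACLs omitted for brevity):

    # failover.ns - fed to "nsupdate failover.ns" by the heartbeat script
    server ns1.example.edu.au
    update delete www.example.edu.au. A
    update add www.example.edu.au. 60 A 203.0.113.10
    send

This drops the failed site's record and leaves only the survivor, with a
60-second TTL so caches pick up the change quickly.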

The above will not get you your requirement to 'not have to reload'. To
do that you need another hack to the front end we've introduced - you
need to convert all dynamic content to fixed length content...

Here's why:
1) You cannot realistically force everyone to use HTTP/1.1.
2) HTTP/1.0 treats a TCP connection close as 'EOF' on dynamic content -
unless you have -only- static content, browsers WILL end up with corrupt
files from time to time.

So, the above covers:
- unrecoverable front end server failure mid transmission (convert all
  responses to static length)
- back end server failure (front end reattempts from the failover server)
- simple router failures (dual-homed network links)
- site failures (multiple sites, with DNS updates triggered on link down /
  heartbeat failure (link down is better - faster updates))
- round robin cache time issues (low DNS TTL)
There's more that can be done, but the above should keep you nice and
busy.

Lastly, let me add that in all the large scale sites I've been involved
with (usually web application hosting of some sort), the business folk
do not ACTUALLY want 100% 24/7 availability - which is what all your
requirements add up to - once the cost is detailed (with reasons).
Usually, 4 nines (99.99% uptime - about an hour of unscheduled downtime
per year) is more than enough to keep clients paying large $$$ happy. IIRC
the rule of thumb is: for each 9 you add, multiply the total project
cost by 10. And, 4 nines is 'trivially' achievable from a single site
with the appropriate resources.
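
As a quick back-of-the-envelope check on those numbers:

    # minutes of unscheduled downtime allowed per year at four nines
    echo '365.25 * 24 * 60 * (1 - 0.9999)' | bc -l   # => roughly 52.6 minutes
    # and at three nines
    echo '365.25 * 24 * 60 * (1 - 0.999)' | bc -l    # => roughly 526 minutes (~8.8 hours)

Each extra nine divides the allowable downtime by ten, while the rule of
thumb above multiplies the cost by about the same factor.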

My suggestion for you:
A good ISP with an end-to-end redundant network (including standby
routers within each LAN and redundant switches).
Dual homed connection to them, using two separate exchanges and/or
connection technologies - on two power grids... you may need to rent
facilities to get these two things.
Solid UPS's and