Our "marketing types" want 24/7 availability of our corporate web site - a fair enough request, I guess...
However, we have a number of restrictions on what we can do:
1. Must (presently) remain with IIS - moving to a Linux/Apache solution may become possible later, but it's "political"
2. Servers must be physically located on different campuses - because we connect to the 'net through AARNET, we want them on different RNOs.
Hi Jon,
There are some AARNet network changes in the works you need to be aware of.
The network will be rebuilt over the coming six months (i.e. AARNet3). There will be two routing cores in each state capital, with dual connections to campuses which have fiber diversity options.
So there will be no need to connect to multiple RNOs for equipment diversity.
This makes your problem significantly simpler, a campus resilience problem rather than a global availability problem.
I'd strongly suggest not connecting the same web server to multiple AARNet2 RNOs, as each RNO has its own BGP autonomous system. You'll recall that BGP events for each prefix are counted by backbone routers and when too many events occur the route is dampened.
So if there is a short outage, the route will appear to move between RNOs, adding to its likelihood of being dampened. So you've just turned a 10s outage into a 30m outage.
You're much better off not moving between AS numbers when doing recovery; then there are no global BGP issues to bite you.
This is exactly the reason that the AARNet3 network will use MPLS for recovery rather than an IP-layer mechanism. We're also likely to revisit the one-AS-per-state design and have a single big AS for the whole of AARNet (this wasn't practical for AARNet2, as MPLS didn't exist then and BGP was the only sane way to express network engineering policies that were as complex as AARNet's).
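To put rough numbers on the "10s outage into a 30m outage" remark, here is a minimal sketch of route flap dampening arithmetic. The figures (1000 penalty per event, suppress above 2000, reuse at 750, 15-minute half-life) are commonly cited vendor defaults and are an assumption here, not values taken from AARNet's routers. The sketch also treats every event as a full flap and only checks the penalty after the last event, which real implementations do not do.

```python
import math

# Assumed dampening parameters (commonly cited defaults, not AARNet's config).
PENALTY_PER_FLAP = 1000   # penalty added per route event
SUPPRESS_LIMIT = 2000     # route is suppressed once penalty exceeds this
REUSE_LIMIT = 750         # route is re-advertised once penalty decays to this
HALF_LIFE_MIN = 15.0      # penalty halves every 15 minutes


def decayed(penalty, minutes):
    """Penalty remaining after exponential decay over `minutes`."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)


def minutes_until_reuse(penalty):
    """Minutes the route stays suppressed; 0 if it was never suppressed."""
    if penalty <= SUPPRESS_LIMIT:
        return 0.0
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT)


def suppression_after_flaps(flap_times_min):
    """Accumulate penalty over event timestamps (in minutes).

    Returns (penalty after the last event, minutes of suppression from then).
    """
    penalty = 0.0
    last = None
    for t in flap_times_min:
        if last is not None:
            penalty = decayed(penalty, t - last)
        penalty += PENALTY_PER_FLAP
        last = t
    return penalty, minutes_until_reuse(penalty)
```

Under these assumptions, three events in quick succession, as in `suppression_after_flaps([0, 0, 0])`, push the penalty to 3000, and it takes two half-lives (30 minutes) to decay back to 750. A single event stays under the suppress limit and costs nothing extra.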
3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails. Doing a "shift-reload" in the browser is NOT an option. It must be TOTALLY TRANSPARENT.
This means sharing TCP state between the two machines (or between some front-ends). This is certainly do-able, but not something you'd want to do across a WAN (because you need to control the jitter of the Hello probes within the cluster to prevent false triggering).
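The false-triggering concern can be illustrated with a small sketch. It assumes a generic cluster heartbeat where the standby declares its peer dead if no Hello arrives within a hold time; the function names, the 100 ms interval, the 300 ms hold time and the delay figures are all made up for illustration and do not describe any particular clustering product.

```python
def arrivals(send_interval_ms, delays_ms):
    """Arrival times of Hellos sent every send_interval_ms, given per-packet delay.

    Sorted because heavily delayed packets can arrive after later ones.
    """
    return sorted(i * send_interval_ms + d for i, d in enumerate(delays_ms))


def false_failovers(arrival_times_ms, hold_time_ms):
    """Count gaps between consecutive Hellos that exceed the hold time.

    The peer is alive throughout, so every such gap is a false trigger.
    """
    pairs = zip(arrival_times_ms, arrival_times_ms[1:])
    return sum(1 for earlier, later in pairs if later - earlier > hold_time_ms)


# Hypothetical delay samples in milliseconds.
lan_delays = [1, 2, 1, 2, 1, 1, 2]             # tight jitter on a LAN
wan_delays = [20, 20, 600, 500, 450, 20, 20]   # a congestion burst on a WAN

lan_triggers = false_failovers(arrivals(100, lan_delays), hold_time_ms=300)
wan_triggers = false_failovers(arrivals(100, wan_delays), hold_time_ms=300)
```

With these sample numbers the LAN case never trips the hold timer, while the WAN burst opens a 400 ms gap and causes one spurious failover even though no Hello was lost. Raising the hold time avoids that, at the cost of slower detection of real failures.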
The only other solution I can come up with, given the above anal restrictions, is to use a "round robin" DNS setup, but if the primary server fails this will involve doing a reload to pick up the secondary DNS entry.
There's nothing to prevent DNS being aware of the server state. It's the cached DNS responses that will go to the wrong server.
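The distinction between a state-aware DNS server and stale cached answers can be sketched as a toy model. This is not a real DNS implementation: the class names, the 300-second TTL and the addresses (taken from the documentation ranges) are assumptions for illustration only.

```python
class HealthAwareDNS:
    """Toy authoritative server that only hands out addresses marked healthy."""

    def __init__(self, addresses, ttl):
        self.addresses = list(addresses)   # in order of preference
        self.healthy = set(addresses)
        self.ttl = ttl                     # seconds a resolver may cache an answer

    def mark_down(self, address):
        self.healthy.discard(address)

    def resolve(self):
        for address in self.addresses:
            if address in self.healthy:
                return address, self.ttl
        raise LookupError("no healthy server")


class CachingClient:
    """Toy resolver that reuses an answer until its TTL expires."""

    def __init__(self, dns):
        self.dns = dns
        self.cached = None
        self.expires = 0

    def lookup(self, now):
        if self.cached is None or now >= self.expires:
            address, ttl = self.dns.resolve()
            self.cached, self.expires = address, now + ttl
        return self.cached


PRIMARY, SECONDARY = "192.0.2.10", "198.51.100.10"   # hypothetical addresses
dns = HealthAwareDNS([PRIMARY, SECONDARY], ttl=300)

old_client = CachingClient(dns)
first = old_client.lookup(now=0)        # caches the primary
dns.mark_down(PRIMARY)                  # primary fails shortly afterwards
stale = old_client.lookup(now=60)       # still inside the TTL
fresh = CachingClient(dns).lookup(now=60)
recovered = old_client.lookup(now=300)  # TTL expired, asks again
```

In this model a client that resolved the name before the failure keeps going to the dead primary until its cached answer expires, while a client with no cache gets the secondary straight away. A shorter TTL narrows that window but does not remove it, and some resolvers and browsers are known to hold answers longer than the TTL says.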
Since you're forced to use IIS you might want to look into the clustering technologies in Windows Server 2003, and the HSRP protocol for your dual campus routers (one of which can be connected to each AARNet3 PoP). This would bring high availability to the entire campus, not just the web site.
The networking companies also all offer products to address this problem. A typical example is Cisco LocalDirector 417G, list price A$37,000. It neatly addresses the TCP issues, but you still need to provide the resilient network infrastructure, which is expensive if you don't already have it.
These are pretty stiff prices, which is why a lot of firms outsource the problem to content delivery providers such as Akamai. Unfortunately your web people will need to get over their IIS addiction to use these services (they generally use a sandboxed Java environment to run any server-side applications).
Since you're paying my salary, feel free to call :-) I'm in Brisbane today arranging a link to Townsville, but I should be contactable in the afternoon and will be back in my office on Wednesday.
Regards, Glen
--
Glen Turner          Tel: (08) 8303 3936 or +61 8 8303 3936
Network Engineer     Email: [EMAIL PROTECTED]
Australian Academic & Research Network   www.aarnet.edu.au
--
linux.conf.au 2004, Adelaide   lca2004.linux.org.au
Main conference 14-17 January 2004   Miniconfs from 12 Jan
