RE: [SLUG] Redundant Web Servers
Well, we've solved the problem, and it works perfectly. Many thanks to all the SLUGgers (and squid-ers) who gave suggestions. We presented a plan to marketing that was almost foolproof, and we costed it at approx. $200k initially, plus another $50k in recurring costs. They saw our point... :-)

Jon

-= -----Original Message-----
-= From: [EMAIL PROTECTED]
-= [mailto:[EMAIL PROTECTED]
-= On Behalf Of [EMAIL PROTECTED]
-= Sent: Saturday, 7 June 2003 7:58 AM
-= To: Robert Collins
-= Cc: [EMAIL PROTECTED]
-= Subject: Re: [SLUG] Redundant Web Servers
-=
-= On 2 Jun 2003, Robert Collins wrote:
-=
-= Dual homed connection to them, using two separate exchanges and/or
-= connection technologies - on two power grids... you may need to rent
-= facilities to get these two things.
-=
-= And note that not even this will guarantee absolutely seamless failover.
-=
-= A flapping link on one of the redundant links will give BGP heart
-= attacks, and result in route advertisements changing frequently until
-= someone gets in the middle of it and stops the flapping link from
-= advertising anything until it's repaired.
-=
-= {wry grin} been there, done that.
-=
-= DaZZa
--
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug
Re: [SLUG] Redundant Web Servers
On 2 Jun 2003, Robert Collins wrote:

> Dual homed connection to them, using two separate exchanges and/or
> connection technologies - on two power grids... you may need to rent
> facilities to get these two things.

And note that not even this will guarantee absolutely seamless failover.

A flapping link on one of the redundant links will give BGP heart attacks, and result in route advertisements changing frequently until someone gets in the middle of it and stops the flapping link from advertising anything until it's repaired.

{wry grin} been there, done that.

DaZZa
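The flap-dampening behaviour DaZZa alludes to can be sketched numerically. This is a simplified model of RFC 2439-style route flap dampening; the constants below are assumptions following commonly cited Cisco defaults (penalty 1000 per flap, suppress above 2000, reuse below 750, 15-minute half-life), not any particular router's actual configuration:

```python
import math

# Simplified route-flap dampening model. Constants are assumed
# Cisco-like defaults, purely for illustration.
PENALTY_PER_FLAP = 1000
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750
HALF_LIFE_MIN = 15.0

def penalty_after(flaps, minutes_since_last_flap):
    """Accumulated penalty after `flaps` flaps, decayed exponentially."""
    p = flaps * PENALTY_PER_FLAP
    return p * math.exp(-math.log(2) * minutes_since_last_flap / HALF_LIFE_MIN)

def minutes_until_reuse(penalty):
    """How long a suppressed route stays invisible before re-advertisement."""
    if penalty <= REUSE_LIMIT:
        return 0.0
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT)

# Two quick flaps are enough to cross the suppress threshold...
p = penalty_after(flaps=2, minutes_since_last_flap=0)
print(p >= SUPPRESS_LIMIT)            # True
# ...and the route then stays dampened for roughly 20 minutes.
print(round(minutes_until_reuse(p)))  # 21
```

This is why a flapping redundant link hurts so much: a couple of withdrawals and re-advertisements in quick succession buy you twenty-odd minutes of global invisibility, far longer than the outage itself.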
Re: [SLUG] Redundant Web Servers
Jon Biddell wrote:

> Our marketing types want 24/7 availability of our corporate web site - a
> fair enough request, I guess... However we have a number of restrictions
> on what we can do;
>
> 1. Must (presently) remain with IIS - moving to a Linux/Apache solution
> may become possible later, but it's political
>
> 2. Servers must be physically located on different campuses - because we
> connect to the 'net through AARNET, we want them on different RNOs.

Hi Jon,

There are some AARNet network changes in the works you need to be aware of. The network will be rebuilt across the coming six months (ie: AARNet3). There will be two routing cores in each state capital, with dual connections to campuses which have fiber diversity options. So there will be no need to connect to multiple RNOs for equipment diversity. This makes your problem significantly simpler: a campus resilience problem rather than a global availability problem.

I'd strongly suggest not connecting the same web server to multiple AARNet2 RNOs, as each RNO has its own BGP autonomous system. You'll recall that BGP events for each prefix are counted by backbone routers, and when too many events occur the route is dampened. So if there is a short outage the route will appear to move between RNOs, adding to its likelihood of being dampened. So you've just made a 10s outage into a 30m outage. You're much better off not moving between AS numbers when doing recovery; then there are no global BGP issues to bite you.

This is exactly the reason that the AARNet3 network will use MPLS for recovery rather than an IP-layer mechanism. We're also likely to revisit the one-AS-per-state design and have a single big AS for the whole of AARNet (this wasn't practical for AARNet2, as MPLS didn't exist then and BGP was the only sane way to express network engineering policies as complex as AARNet's).

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails.
> Doing a shift-reload in the browser is NOT an option.
> It must be TOTALLY TRANSPARENT.

This means sharing TCP state between the two machines (or between some front-ends). This is certainly doable, but not something you'd want to do across a WAN (because you need to control the jitter of the Hello probes within the cluster to prevent false triggering).

> The only other solution I can come up with, given the above anal
> restrictions, is to use a round robin DNS setup, but this will involve
> doing a reload if the primary server fails to pick up the secondary DNS
> entry.

There's nothing to prevent DNS being aware of the server state. It's the cached DNS responses that will go to the wrong server.

Since you're forced to use IIS, you might want to look into the clustering technologies in Windows Server 2003. And the HSRP protocol for your dual campus routers (one of which can be connected to each AARNet3 PoP). This would bring high availability to the entire campus, not just the web site.

The networking companies also all offer products to address this problem. A typical example is the Cisco LocalDirector 417G, list price A$37,000. It neatly addresses the TCP issues, but you still need to provide the resilient network infrastructure, which is expensive if you don't already have it.

These are pretty stiff prices, which is why a lot of firms outsource the problem to content delivery providers such as Akamai. Unfortunately your web people will need to get over their IIS addiction to use these services (they generally use a sandboxed Java to run any server-side applications).

Since you're paying my salary, feel free to call :-) I'm in Brisbane today arranging a link to Townsville, but I should be contactable in the afternoon and am back in my office on Wednesday.
Regards,
Glen

--
Glen Turner          Tel: (08) 8303 3936 or +61 8 8303 3936
Network Engineer     Email: [EMAIL PROTECTED]
Australian Academic and Research Network   www.aarnet.edu.au
--
linux.conf.au 2004, Adelaide         lca2004.linux.org.au
Main conference 14-17 January 2004   Miniconfs from 12 Jan
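Glen's point about Hello-probe jitter across a WAN can be illustrated with a toy failure detector: a peer is declared dead when no hello arrives within a dead interval, so if network jitter approaches that interval a perfectly healthy peer triggers a spurious failover. All the numbers below are illustrative assumptions, not taken from any real clustering product:

```python
import random

# Toy cluster failure detector: a hello is expected every HELLO_INTERVAL
# seconds; the peer is declared dead after DEAD_INTERVAL seconds of silence.
HELLO_INTERVAL = 1.0
DEAD_INTERVAL = 3.0

def false_failover(jitter, hellos=1000, seed=42):
    """Simulate hello arrivals delayed by +/- `jitter` seconds; return True
    if a healthy peer would ever be (falsely) declared dead."""
    rng = random.Random(seed)
    last_arrival = 0.0
    for i in range(1, hellos + 1):
        arrival = i * HELLO_INTERVAL + rng.uniform(-jitter, jitter)
        if arrival - last_arrival > DEAD_INTERVAL:
            return True  # silence exceeded the dead interval: false trigger
        last_arrival = max(last_arrival, arrival)
    return False

print(false_failover(jitter=0.2))  # LAN-like jitter: never triggers
print(false_failover(jitter=2.5))  # WAN-like jitter: spurious failover
```

With LAN-like jitter the inter-hello gap can never exceed the dead interval, while WAN-like jitter makes a false trigger near-certain over a long enough run, which is exactly why you wouldn't stretch this kind of cluster across a WAN.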
[SLUG] Redundant Web Servers
Hi all,

Our marketing types want 24/7 availability of our corporate web site - a fair enough request, I guess... However we have a number of restrictions on what we can do;

1. Must (presently) remain with IIS - moving to a Linux/Apache solution may become possible later, but it's political

2. Servers must be physically located on different campuses - because we connect to the 'net through AARNET, we want them on different RNOs.

3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails. Doing a shift-reload in the browser is NOT an option. It must be TOTALLY TRANSPARENT.

Keeping the boxes in sync is no problem. I was thinking of a Linux box with 3 NICs - one to each server and one to the 'net - but this will only work if the servers are physically located on the same network.

The only other solution I can come up with, given the above anal restrictions, is to use a round robin DNS setup, but this will involve doing a reload if the primary server fails to pick up the secondary DNS entry.

I'm open to suggestions if anyone knows of a more elegant way of doing it - hell, if anyone knows how to make it work, I'll listen!!

Jon
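For reference, the round robin setup Jon mentions is just multiple A records on the same name. A hypothetical BIND-style zone fragment (the name, addresses, and TTL here are all invented for illustration) with a short TTL so a dead server can be pulled from the rotation quickly:

```
; Hypothetical zone fragment - round-robin A records with a short TTL.
$TTL 60                              ; 60s so a pulled record ages out fast
www     60  IN  A   203.0.113.10     ; campus A server (placeholder)
www     60  IN  A   198.51.100.10    ; campus B server (placeholder)
```

The catch, as the replies in this thread point out, is that resolvers and proxy farms don't always honour short TTLs, so this alone can't deliver the "no discernable interruption" requirement.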
Re: [SLUG] Redundant Web Servers
At 9:17 am, Monday, June 2 2003, Jon Biddell mumbled:

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails.
> Doing a shift-reload in the browser is NOT an option. It must be TOTALLY
> TRANSPARENT.

You're going to get one anyway. If the machine falls over, you're not going to get any more data, and the client will have to re-request.

One solution is mod_backhand with Apache, and the IIS servers behind it. That may conflict with the politics, but whatever.

--
Steve

* StevenK laughs at Joy's connection.
* Joy spits on StevenK
* StevenK sees the spit coming at him slowly and ducks in time.
<jaiger> StevenK: how did you do that? you moved like *them*
<StevenK> jaiger: Can you fly that thing? *points*
<jaiger> not yet
<jaiger> apt-get install libpilot-chopper
Re: [SLUG] Redundant Web Servers
Let me prefix this: I don't really know what I'm talking about; double check anything I say.

On Mon, 2003-06-02 at 09:16, Jon Biddell wrote:

> 2. Servers must be physically located on different campuses - because we
> connect to the 'net through AARNET, we want them on different RNOs.
>
> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails.
> Doing a shift-reload in the browser is NOT an option. It must be TOTALLY
> TRANSPARENT.

Wow. Well, point 3 makes it pretty hard. As I understand it, that's an intentional design decision of TCP/IP -- if it were easy to have another computer interrupt an existing TCP connection and just take it over, then I'm sure it would be exploited. Thus to keep a TCP connection open you need to have a certain amount of state information; I think it does this through so-called sequence numbers, but I'm not a network ninja, so I'm not sure. The point is that to be able to have another computer step in halfway through a transaction, you'll need to have state information being transferred between the two computers constantly.

Now, the other option is to have some sort of proxying server which just farms requests out to each server, but then you have a single point of failure and you're right back where you started. I believe that there are boxes that do this, but they're hugely expensive. Like hundreds of thousands of dollars.

So, I suppose you need to analyse the risks that you're trying to minimise. It would be easier to have a single box in a single building with multiple connections that were arbitrated by BGP. I still think you'd need to do a reload in most real situations.

I'll be interested to hear what you come up with. Sorry I can't be more help.

James.
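James is right about the shape of the problem: for a standby box to take over a live connection it must know, at minimum, the addresses and the current sequence numbers. A minimal sketch of what "state information being transferred constantly" means in practice; the field names are invented for illustration (real replication schemes, e.g. connection-tracking sync daemons, carry much more):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TcpConnState:
    src: str       # client address:port
    dst: str       # server (virtual) address:port
    snd_nxt: int   # next sequence number we will send
    rcv_nxt: int   # next sequence number we expect from the peer
    window: int    # current receive window

def replicate(state, send):
    """Ship the connection state to the standby after every segment.
    `send` would be a UDP/multicast sender in a real cluster."""
    send(json.dumps(asdict(state)).encode())

received = []  # stands in for the standby's receive queue
conn = TcpConnState("192.0.2.7:51312", "203.0.113.10:80",
                    snd_nxt=1_000_050, rcv_nxt=7_700_121, window=65535)
replicate(conn, received.append)

# The standby reconstructs the exact connection from the latest update:
standby = TcpConnState(**json.loads(received[0]))
print(standby.snd_nxt)  # 1000050
```

The cost is the "constantly": every acknowledged segment changes `snd_nxt`/`rcv_nxt`, so the replication traffic scales with the web traffic itself, which is why this is plausible on a LAN and painful across a WAN.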
Re: [SLUG] Redundant Web Servers
>>>>> "James" == James Gregory <[EMAIL PROTECTED]> writes:

> 2. Servers must be physically located on different campuses - because we
> connect to the 'net through AARNET, we want them on different RNOs.
>
> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails.
> Doing a shift-reload in the browser is NOT an option. It must be TOTALLY
> TRANSPARENT.

James> Wow. Well, point 3 makes it pretty hard. As I understand it,
James> that's an intentional design decision of tcp/ip -- if it were
James> easy to have another computer interrupt an existing tcp
James> connection and just take it over, then I'm sure it would be

If you're only serving static content, that's not an issue: HTTP version 1 uses a new TCP/IP connexion for each request anyway. With round-robin DNS you may end up with different images on the same page being served from different servers anyway.

Personally I'd go with round-robin DNS, and try to detect failure and update the DNS fast. Some people's browsers would appear to hang for a short while when attempting to access the next page, until the DNS caught up (this implies using a short timeout on the name).

Peter C
Re: [SLUG] Redundant Web Servers
On Mon, 2 Jun 2003, James Gregory wrote:

> > 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
> > fails. Doing a shift-reload in the browser is NOT an option. It must
> > be TOTALLY TRANSPARENT.
>
> James> Wow. Well, point 3 makes it pretty hard. As I understand it,
> James> that's an intentional design decision of tcp/ip -- if it were
> James> easy to have another computer interrupt an existing tcp
> James> connection and just take it over, then I'm sure it would be
>
> If you're only serving static content, that's not an issue: HTTP version
> 1 uses a new TCP/IP connexion for each request anyway. With round-robin
> DNS you may end up with different images on the same page being served
> from different servers anyway.

Sure, that's a given. I thought the problem was that it had to happen without a reload - server crashing halfway through serving a particular HTML page. I considered 0-TTL DNS as well, but it only works if you can afford reloads.

I suppose you might be able to hack something together with MIME's multipart/x-mixed-replace in a proxy which monitored content length and was ready to fetch a second MIME part where required. It would be a bit messy though, not necessarily compatible with all browsers, and the proxy is still going to be a single point of failure.

Andrew McNaughton

--
No added Sugar. Not tested on animals. If irritation occurs, discontinue use.
---
Andrew McNaughton
In Sydney Working on a Product Recommender System
[EMAIL PROTECTED]   Mobile: +61 422 753 792
http://staff.scoop.co.nz/andrew/cv.doc
Re: [SLUG] Redundant Web Servers
On Monday, June 2, 2003, at 09:16 AM, Jon Biddell wrote:

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails.
> Doing a shift-reload in the browser is NOT an option. It must be TOTALLY
> TRANSPARENT.

The marketing types have to understand that nothing is perfect, for starters. HTTP and browsers aren't intelligent enough to go "oh, this feed stopped midway through. Let's see whether there are any secondary sites for this." Ultimately, you may end up with broken portions of the page, should something halt midway through serving a client.

That being said, they are probably not thinking of it in such a finely grained manner. That's worth clarifying though. Don't let them slip one past you!!

On to some technical stuff. I'm not really up to speed with exactly how squid works, but couldn't a round robin DNS present issues for clients accessing through a proxy? If squid has cached a DNS reply, it might query a stale IP address. Any squid boffins got comments on that one? I'm thinking of, say, Telstra's proxy farm that all Bigpond people go through, for instance.

A good compromise might be to have a 'forwarder' machine hosted on a highly available, redundant network of your choosing. You make sure that the logic in this thing is as simple as possible, so that there is a minimised risk of it going wrong. You pay a few $$ to make sure that it's on failover hardware, redundant net connections, etc. Its job is to forward requests to your bulkier, more failure-prone IIS installations at your two campuses. It will know whether either of them has gone down or had performance unacceptably degraded, and start forwarding to your other box.

There will be two processes - one will be a little httpd that executes a simple loop to decide where to forward the request; the other will be something that polls your servers to determine health (maybe even via SNMP + ping + HTTP GET).
The second process feeds a small table that the first process uses to make decisions on.

Yes, this is technically a single point of failure system - but you are mitigating that by:

1. keeping its job very, very simple;
2. putting it on a dedicated, simple Linux machine;
3. hosting it somewhere very highly available.

Regards,
Luke.

--
Luke Burton. (PGP keys: http://www.hagus.net/pgp)
"Yes, questions. Morphology, longevity, incept dates."
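Luke's two-process design can be sketched in a few dozen lines. This is a minimal sketch, not a production forwarder: the hostnames, the `/healthz` probe path, and the port are all invented, and the two "processes" are compressed into threads of one program for brevity:

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical campus back ends and the "small table" of their health.
BACKENDS = ["http://campus-a.example.edu.au", "http://campus-b.example.edu.au"]
health = {url: True for url in BACKENDS}

def poller(interval=5):
    """Second process: probe each backend with an HTTP GET, update the table."""
    while True:
        for url in BACKENDS:
            try:
                urllib.request.urlopen(url + "/healthz", timeout=2)
                health[url] = True
            except OSError:
                health[url] = False
        time.sleep(interval)

def pick_backend():
    """First process's simple loop: first healthy backend wins."""
    for url in BACKENDS:
        if health[url]:
            return url
    return BACKENDS[0]  # everything down: degrade rather than crash

class Forwarder(BaseHTTPRequestHandler):
    def do_GET(self):
        # Redirect-based forwarding keeps the logic trivially simple.
        self.send_response(302)
        self.send_header("Location", pick_backend() + self.path)
        self.end_headers()

def main():
    threading.Thread(target=poller, daemon=True).start()
    HTTPServer(("", 8080), Forwarder).serve_forever()
```

Note the design trade-off: a 302 redirect exposes the per-campus names to clients and still costs them a round trip, whereas a true reverse proxy (fetching and relaying the body itself, as squid does) keeps one public name but carries all the traffic.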
Re: [SLUG] Redundant Web Servers
This one time, at band camp, Luke Burton wrote:

> On Monday, June 2, 2003, at 09:16 AM, Jon Biddell wrote:
>
> > 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one
> > fails. Doing a shift-reload in the browser is NOT an option. It must
> > be TOTALLY TRANSPARENT.
>
> A good compromise might be to have a 'forwarder' machine hosted on a
> highly available, redundant network of your choosing. You make sure
> that the logic in this thing is as simple as possible, so that there is
> a minimised risk of it going wrong. You pay a few $$ to make sure that
> it's on failover hardware, redundant net connections, etc.

Just a thought: maybe just an old Pentium box that does port forwarding.

- Chris
[EMAIL PROTECTED]
Re: [SLUG] Redundant Web Servers
On Mon, 2003-06-02 at 09:16, Jon Biddell wrote:

> Hi all,
>
> Our marketing types want 24/7 availability of our corporate web site - a
> fair enough request, I guess... However we have a number of restrictions
> on what we can do;
>
> 1. Must (presently) remain with IIS - moving to a Linux/Apache solution
> may become possible later, but it's political

:}. I suppose Windows/Apache is also political?

> 3. There must be NO DISCERNABLE INTERRUPTION TO SERVICE when one fails.
> Doing a shift-reload in the browser is NOT an option. It must be TOTALLY
> TRANSPARENT.

This is (as has already been mentioned) tricky. See below for a discussion.

> Keeping the boxes in sync is no problem. I was thinking of a Linux box
> with 3 NICs - one to each server and one to the 'net, but this will only
> work if the servers are physically located on the same network.

That box becomes a single point of failure.

> The only other solution I can come up with, given the above anal
> restrictions, is to use a round robin DNS setup, but this will involve
> doing a reload if the primary server fails to pick up the secondary DNS
> entry.

Much more than a reload: if you encounter a flapping situation with both servers, you may actually increase the perceived downtime (as a worst case...).

> I'm open to suggestions if anyone knows of a more elegant way of doing
> it - hell, if anyone knows how to make it work, I'll listen !!

Firstly, you haven't clearly identified to us, your free consultants, the current greatest failure risks. I.e., if the mean time between failures for the various components is (using arbitrary figures):

  Firewalls:               60,000 hrs
  LAN switches:           100,000 hrs
  IIS:                         48 hrs
  Windows:                    200 hrs
  Linux front end server:  30,000 hrs

and for simplicity we'll assume that failure here is catastrophic (you put in a cold spare in the event of a failure), it's easy to see in the above scenario that anything that encapsulates IIS will give you huge uptimes relative to the naked beast being directly visible.
That said, you can start to plan how you make it all hang together. ASSUMING that you are only concerned about IIS, not about NIC failures, switch failures, firewall or router failures, it's really quite trivial: front end IIS with squid, with a couple of hacks. The hacks will be to buffer entire objects before sending full headers to the client; that way a crashed server can result in squid retrying from the other server, not in the client receiving an error.

If you want to protect against network failures, multihomed connectivity *at each site* is the way to go. Unless you have a large network, many core routers won't propagate dual homed routes (because of the filtering of long prefixes) - so get your ISP to dual-home you to their network at each site. That protects you against transient link failures at each site, and the multiple sites allow you to fail over. You'll need a hacked DNS setup to dynamically add and remove virtual servers as each site comes online or suffers a failure, and that means you'll want your TTL way down. Be sure to have the DNS servers located far away from your hosting site.

The above will not get you your requirement to 'not have to reload'. To do that you need another hack to the front end we've introduced - you need to convert all dynamic content to fixed length content... Here's why:

1) You cannot realistically force everyone to use HTTP/1.1.
2) HTTP/1.0 treats a TCP connection close as 'EOF' on dynamic content - unless you have -only- static content, browsers WILL end up with corrupt files from time to time.

So, the above covers:

- unrecoverable front end server failure mid transmission (convert all responses to static length)
- back end server failure (front end reattempts from failover server)
- simple router failures (dual homed network links)
- site failures (multiple sites, with DNS updates triggered on link down / heartbeat failure - link down is better: faster updates)
- round robin cache time issues (low DNS TTL)
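The buffer-then-retry hack described above can be sketched as a small function. This is a stand-in for what a patched squid front end would do, with invented backend names and a toy `fetch` callback; the point is the ordering: buffer the entire object first, emit a Content-Length, and only then send anything to the client, so a mid-transfer crash becomes a retry rather than an undetectable HTTP/1.0 truncation:

```python
def serve_with_failover(path, backends, fetch):
    """Return (headers, body) once some backend delivers a complete object."""
    last_error = None
    for backend in backends:
        try:
            body = fetch(backend, path)   # buffer the ENTIRE object first
            headers = {
                # Fixed length means HTTP/1.0 clients can tell a complete
                # response from a connection that merely closed early.
                "Content-Length": str(len(body)),
            }
            return headers, body
        except ConnectionError as err:    # backend died mid-transfer
            last_error = err              # the client saw nothing; try next
    raise last_error

# Hypothetical back ends: the first crashes mid-object, the second works.
def flaky_fetch(backend, path):
    if backend == "iis-a":
        raise ConnectionError("reset by peer")
    return b"<html>complete page</html>"

headers, body = serve_with_failover("/", ["iis-a", "iis-b"], flaky_fetch)
print(headers["Content-Length"])  # 26
```

The client never sees the first backend's failure at all, which is the 'no discernable interruption' property - bought at the price of buffering whole objects and giving every response a fixed length.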
There's more that can be done, but the above should keep you nice and busy.

Lastly, let me add that in all the large scale sites I've been involved with (usually web application hosting of some sort), the business folk do not ACTUALLY want 100% 24/7 availability - which is what all your requirements add up to - once the cost is detailed (with reasons). Usually, four nines (99.99% uptime - roughly an hour of unscheduled downtime per year) is more than enough to keep clients paying large $$$ happy. IIRC the rule of thumb is: for each 9 you add, multiply the total project cost by 10. And four nines is 'trivially' achievable from a single site with the appropriate resources.

My suggestion for you: a good ISP with an end to end redundant network (including standby routers within each LAN and redundant switches). Dual homed connection to them, using two separate exchanges and/or connection technologies - on two power grids... you may need to rent facilities to get these two things. Solid UPS's and