> Subject: Details of using tclhttpd as a proxy server.
> Date: Thu, 13 Apr 2000 00:12:43 -0700 (PDT)
> From: David LeBlanc <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
>
> Hi;
>
> I've decided to take a stab at making tclhttpd into a caching proxy server,
> but need some details on how to go about it and i'm hoping someone in the
> group can point me in the right direction.

Good on you!  My first piece of advice would be to read the HTTP/1.1 spec.

> I'm guessing that under normal conditions, the webserver will see only the
> path part of the http://somesite.com/index.html - where .com could be any
> of .com, .edu, .gov etc. and index.html could be anything from blank to a
> deep path like /root/branch/twig/ to /root/branch/twig/index.htm(l) and
> also include things like :ports and cgi strings.  My first question is what
> does a proxy server see, and what does it look like?  I guess the proxy must
> see the whole thing from "http..." on so that it can look in its cache to
> see if that site:path:page is in the cache.

When a user agent sends a request to a server, it knows whether it is
talking to a proxy server or an origin server.  In the latter case, the
request includes just the path, eg.

    GET /path/to/resource HTTP/1.1

In the former case, the request includes the whole URI, eg.

    GET http://www.origin.net:8888/path/to/resource HTTP/1.1

The request URI may include a query string, but it will not include a
fragment identifier.  (There's a rough sketch of telling the two forms
apart further down.)

HTTP/1.1 changes this: it recommends that the full URI be included in all
requests, whether or not the request is being made to an origin server.
Origin servers know who they are, so it is easy enough for them to
identify that the request is for a local resource.

> Secondly, it's fairly obvious that if it's not in the cache (i'd use
> domain/root/branch/twig/page.html in the file directory structure)

It may be better to use an MD5 checksum of the URI (a one-liner, sketched
below).  Just a suggestion ;-)

> that one
> should either pass it out to the web and somehow capture the fetched page
> before sending it on to the client, or if the computer is offline, return a
> document unavailable page.

Yes, that's basically how a caching proxy works.  In fact, a plain
(non-caching) proxy works almost the same way, except that it does not
bother saving a local copy.  Maybe the first step could be to implement a
non-caching proxy?

> Would one use the socket command and the http
> package to fetch the page and its images etc and then return it to the
> proxy?

You bet.  Of course, for requests using non-http schemes you'd have to
make use of other packages, eg. ftp.  Isn't tcllib going to have an ftp
package?

The trick, of course, is to do the fetch asynchronously - it is extremely
likely that several requests will be made in a short timeframe.  (A
bare-bones sketch of an asynchronous fetch appears below.)

Bear in mind that there is no distinction between "pages" and their
"images".  As far as HTTP is concerned they are all just documents.

> Steve Ball also mentioned something about how to check the freshness of the
> cached page using the 'head' fetch - how is that done?

Indeed I did.  HTTP/1.1 has a whole bunch of stuff for cache control.
Sometimes documents are marked as being uncacheable (like a stock quote
that is constantly updated), and so on.  You need to be able to identify
things like that.

As I recall, with HTTP/1.1 you can do a conditional GET/POST.  Something
like, "GET this document, but only if it is newer than this date".  Your
reply will either be "200 - here's the document data" or "XXX - the
document hasn't changed" (I can't remember the status code used off-hand).
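(For the record, the "hasn't changed" reply is 304, "Not Modified".)  Here
is a rough, illustrative sketch of that conditional GET using the http
package - the proc name and arguments are just mine for the example, and
it is synchronous here only for brevity:

    package require http

    # Revalidate a cached copy: fetch the document only if it has changed
    # since the Last-Modified date we recorded when we cached it.
    proc Revalidate {url lastModified} {
        set token [::http::geturl $url \
            -headers [list If-Modified-Since $lastModified]]
        set code [::http::ncode $token]
        if {$code == 304} {
            # Not Modified - the cached copy is still good, serve it as-is
        } elseif {$code == 200} {
            # Fresh data - replace the cached copy with [::http::data $token]
        }
        ::http::cleanup $token
    }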
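To make the proxy-vs-origin distinction concrete, here's one rough way of
classifying a request line (the names are mine, not tclhttpd's, and it
only bothers with http URIs):

    # An absolute URI in the request line means the client thinks it is
    # talking to a proxy; a bare path means an ordinary origin-server request.
    proc ClassifyRequest {line} {
        if {![regexp {^(\S+) (\S+) HTTP/([0-9.]+)$} $line -> method uri version]} {
            error "malformed request line"
        }
        if {[regexp {^http://([^/:]+)(:[0-9]+)?(/[^#]*)?$} $uri -> host port path]} {
            set port [expr {$port eq "" ? 80 : [string range $port 1 end]}]
            if {$path eq ""} { set path / }
            return [list proxy $method $host $port $path]
        }
        return [list origin $method $uri]
    }

For example, the proxy-style request line above comes back as
"proxy GET www.origin.net 8888 /path/to/resource", and the origin-style
one as "origin GET /path/to/resource".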
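The MD5 trick is a one-liner with the md5 package from tcllib (assuming a
recent md5 where the -hex option returns the digest as a hex string; the
proc is just for illustration):

    package require md5   ;# from tcllib

    # Map any URI onto a flat cache filename, so you don't have to mirror
    # the remote site's directory structure on disk.
    proc CacheFile {cacheDir uri} {
        return [file join $cacheDir [::md5::md5 -hex $uri]]
    }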
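And the bare-bones shape of an asynchronous fetch with the http package -
the -command callback is what keeps the server from blocking while the
origin server responds.  "client" stands for whatever channel tclhttpd
hands you for the downstream connection, and the proc names are made up:

    package require http

    # Start a fetch on behalf of a client connection and return at once;
    # FetchDone fires when the transaction completes.
    proc FetchForClient {client url} {
        ::http::geturl $url -command [list FetchDone $client]
    }

    proc FetchDone {client token} {
        if {[::http::status $token] eq "ok"} {
            # Write the body back to the client.  A real proxy would also
            # relay the status line and headers, and stash a copy in the cache.
            puts -nonewline $client [::http::data $token]
        } else {
            # Offline or unreachable: send back a "document unavailable" page.
        }
        ::http::cleanup $token
    }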
As you will have noticed, I talk about HTTP/1.1 a lot.  That's because it
has an awful lot of useful features over HTTP/1.0.  With HTTP/1.0 you end
up not being able to make decisions about documents, resulting in
unnecessary fetches.  If possible, try to implement the HTTP/1.1
features - most browsers use HTTP/1.1 these days.

> There is also the
> matter of purging the cache of stale pages, but that seems pretty
> straightforward using the file dates in the cache and running a utility
> (thread) to clean it up periodically.

In order to keep disk/memory usage within some upper limit you will have
to run a "thread" to purge expired documents, as well as documents that
aren't expired but have to be ditched to make room.  I imagine you'll need
to set low- and high-water marks to trigger the purge thread into action.

Your purge thread's algorithm would be something like this: if a document
has expired then delete it; otherwise, if you're still above the low-water
mark, you need to decide which still-fresh documents to delete.  If
possible, keep a hit count.  I would then favour deleting documents that
are hit less frequently, are bigger, and are older.  You want to establish
some metric that takes all of these factors into account (there's a toy
example tacked on after my signature).  IOW, if a document in your cache
is quite old and big but is very popular (it might be a home page, for
example), then you would rather keep it in your cache in order to increase
the hit rate.

If you are really interested in the theory behind making better proxies
then I'd suggest looking through the proceedings of the int'l WWW
conferences for the last five years.

> I guess what i'm looking for is a general document about the http protocol.
> I'm going to be looking at the http recommendation at W3.org, but i'd like
> to find something that is a more dynamic description of how to use http.
> Can anyone point me to something like that, either web, sources or book?

As I said above, the HTTP/1.1 spec is the first place to look.  Like any
spec, it is rather dry reading.  http://www.w3.org/Protocols/ has lots of
links to papers, reports, related standards and specs, etc.

Cheers,
Steve

-- 
Steve Ball            | Swish XML Editor      | Training & Seminars
Zveno Pty Ltd         | Web Tcl Complete      | XML XSL
http://www.zveno.com/ | TclXML TclDOM         | Tcl, Web Development
[EMAIL PROTECTED]     +-----------------------+---------------------
Ph. +61 2 6242 4099   | Mobile (0413) 594 462 | Fax +61 2 6242 4099
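P.S. Since I mentioned a metric for choosing eviction victims, here is a
toy example of the kind of thing I mean - the weights are completely
arbitrary, the proc name is mine, and a higher score means the document is
more worth keeping:

    # Popular documents score high; big, stale ones score low.  Once the
    # cache grows past the high-water mark, evict the lowest-scoring
    # documents until you drop back under the low-water mark.
    proc KeepScore {hits bytes ageSeconds} {
        expr {double($hits) /
              (($bytes / 1024.0 + 1.0) * ($ageSeconds / 3600.0 + 1.0))}
    }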
