I have recently come across two on-line articles on Web-usage analysis that cast serious doubt on the validity of attempting to identify user sessions from the type of data currently recorded in Web server logs. User-session identification is complicated by a number of factors, including caching, load balancing (which can cause requests from a single user session to arrive from multiple IP addresses), and the activity of spiders. One of these critical articles is by Stephen Turner (Cambridge University) [1]; the other is from Susan Haigh and Janette Megarity (National Library of Canada) [2].
Haigh & Megarity have described user-session estimations as "at best, gross estimates". It seems to me that what is needed is a systematic validation of the efficacy of the various Web-analysis algorithms currently available. This could be done by simulating log-file data from known transactions and comparing how well an algorithm is able to recover the transactions from the data. This should be repeated using a wide range of hypothetical scenarios, such as very frequent load balancing (as occurs in reality with AOL users). Does anyone know if such a validation has been done? Richard References --------------- [1] S. Turner. "Analog 5.03: How the Web Works". http://www.analog.cx/docs/webworks.html [7 July 2001] [2] S. Haigh, J. Megarity. "Measuring Web Site Usage: Log File Analysis". http://www.nlc-bnc.ca/9/1/p1-256-e.html [4 August 1998] ------------------------------- Richard Dybowski, 143 Village Way, Pinner, Middlesex HA5 5AA, UK Tel (mobile): 079 76 25 00 92
