I have recently come across two on-line articles on Web-usage analysis that 
cast considerable doubt on the validity of attempting to identify user 
sessions from the type of data currently recorded in Web server logs. 
User-session identification is hampered by a number of factors, including 
caching, load balancing (which can assign multiple IP addresses during a 
single user session), and the activity of spiders. One of these critical 
articles is by Stephen Turner (Cambridge University) [1]; the other is from 
Susan Haigh and Janette Megarity (National Library of Canada) [2].

Haigh & Megarity have described user-session counts as "at best, gross 
estimates". It seems to me that what is needed is a systematic validation 
of the efficacy of the various Web-log analysis algorithms currently 
available. This could be done by simulating log-file data from known 
transactions and measuring how well each algorithm recovers those 
transactions from the simulated data. The exercise should be repeated 
across a wide range of hypothetical scenarios, such as very frequent load 
balancing (as occurs in reality with AOL users).
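
To make the idea concrete, here is a minimal Python sketch of such a 
validation, under illustrative assumptions of my own (the IP addresses, 
page names, and the common 30-minute inactivity timeout are invented, and 
the heuristic tested is the usual "group by IP, split on a long gap" rule 
rather than any particular package's algorithm). We generate a toy log 
from two known sessions, one of which rotates through a proxy pool the way 
AOL-style load balancing does, and then count how many sessions the 
heuristic recovers:

```python
from datetime import datetime, timedelta
from itertools import cycle

start = datetime(2001, 7, 7, 12, 0, 0)

# Two "true" user sessions, each a sequence of (timestamp, page) requests.
true_sessions = {
    "A": [(start + timedelta(seconds=30 * i), f"/page{i}") for i in range(4)],
    "B": [(start + timedelta(seconds=45 * i), f"/doc{i}") for i in range(4)],
}

# User A keeps one IP address; user B's requests rotate through a proxy
# pool, as happens with load-balanced ISPs such as AOL.
proxies = cycle(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
log = [("192.168.1.5", t, url) for t, url in true_sessions["A"]]
log += [(next(proxies), t, url) for t, url in true_sessions["B"]]
log.sort(key=lambda rec: rec[1])  # merge into time order, as in a real log

def sessionize(log, timeout=timedelta(minutes=30)):
    """Common heuristic: group hits by IP, split on a gap > timeout."""
    by_ip = {}
    for ip, t, url in log:
        by_ip.setdefault(ip, []).append((t, url))
    sessions = []
    for hits in by_ip.values():
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions

recovered = sessionize(log)
print(len(true_sessions), len(recovered))  # 2 true sessions, 4 recovered
```

In this toy scenario the heuristic splits user B's single session into 
three (one per IP address), so two true sessions come out as four 
estimated ones. A systematic validation would run many such simulated 
scenarios and report the recovery error for each algorithm.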

Does anyone know if such a validation has been done?

Richard

References
----------

[1] S. Turner. "Analog 5.03: How the Web Works". 
http://www.analog.cx/docs/webworks.html [7 July 2001]

[2] S. Haigh, J. Megarity. "Measuring Web Site Usage: Log File Analysis". 
http://www.nlc-bnc.ca/9/1/p1-256-e.html [4 August 1998]

-------------------------------
Richard Dybowski, 143 Village Way, Pinner, Middlesex  HA5 5AA, UK
Tel (mobile): 079 76 25 00 92
