On Sat, Dec 22, 2012 at 4:33 AM, Kai R. Larsen <[email protected]> wrote:
> ...
> I'm not quite sure that your answer is directly responsive to the
> question

That would definitely not be the first time that I have missed the point.

> ...
> 1. Goal is to examine relationship between 250 web pages, so we extract
> the user sessions (they end after 1/2 hour of inactivity), remove bot
> entries, and input looks like this:
> User# Page#
> 1 5
> 1 8
> 2 1

This looks very good.

> We do not include number of hits on a page or a star rating for each page
> (we have none). Sounds like you're saying that this is where the problem
> lies.

No. I think that this is a good idea.

Occasionally, a threshold on the number of hits is useful, but normally not.
It is common for some additional measure of engagement to be helpful as
well. For instance, if you can tell that the page survived for some number
of seconds in the user's browser, that might be better than a simple page
load (JS-based beacons work well for this). You might also get clues from an
unload event (not usually reliable) or evidence that the user went somewhere
else right away (this is very tricky to get right in the presence of
multiple tabs). But the idea of a binary they-did-it feature is good in
principle.

> Mahout expecting either a binary variable or a count of number of
> accesses would explain the weird results.

Yes. You can check for data format sensitivity by just putting a 1 as the
third value on each line.

> Doing some kind of log-entropy weighting makes further sense, thanks!
> Is what you shared log-entropy, by the way?

It is closely related. See

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

for my preferred method. This is available in Mahout.
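To make the binary-preference check concrete, here is a minimal sketch (the function name and deduplication choice are my own, not from Mahout) that turns the "User# Page#" session lines above into userID,itemID,value triples with a constant 1 as the third value:

```python
import csv
import io

def to_binary_triples(lines, out):
    """Convert 'user page' lines into 'user,page,1' CSV triples.

    Repeated views of the same page by the same user are collapsed to a
    single row, since the preference is binary (they saw it or they didn't).
    """
    writer = csv.writer(out, lineterminator="\n")
    seen = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed rows
        user, page = parts
        if (user, page) not in seen:
            seen.add((user, page))
            writer.writerow([user, page, 1])

# Example with the session data from above:
buf = io.StringIO()
to_binary_triples(["1 5", "1 8", "2 1", "1 5"], buf)
print(buf.getvalue())
```

If the weird results disappear with this input, that confirms the recommender was interpreting the second column as a preference value rather than an item ID.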

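For reference, the method described in the linked blog post is the log-likelihood ratio (G-squared) test on a 2x2 contingency table of co-occurrence counts. A small sketch of the computation (this mirrors the entropy-based formulation in the post; variable names are my own):

```python
import math

def xlogx(x):
    """x * log(x), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts (scaled by N)."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table, where
    k11 = times both events occurred together,
    k12 = times event A occurred without B,
    k21 = times event B occurred without A,
    k22 = times neither occurred."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)
```

A table whose rows and columns are independent scores near zero (no surprise), while strongly associated pages score high, which is what makes this useful for weighting co-occurring page views.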