On Sat, Dec 22, 2012 at 4:33 AM, Kai R. Larsen <[email protected]> wrote:

> ...
> I'm not quite sure that your answer is directly responsive to the
> question


That would definitely not be the first time that I have missed the point.


> ...
> 1. Goal is to examine relationship between 250 web pages, so we extract
> the user sessions (they end after 1/2 hour of inactivity), remove bot
> entries, and input looks like this:
> User#   Page#
> 1       5
> 1       8
> 2       1
>

This looks very good.
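For what it's worth, the sessionization step can be sketched in a few lines (a minimal sketch, assuming you have time-ordered (timestamp, page) events per user; the function name and input shape are my own, not anything in Mahout):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # sessions end after 1/2 hour of inactivity

def sessionize(events):
    """Split one user's (timestamp, page) events into sessions.

    A new session starts whenever the gap since the previous
    event exceeds SESSION_GAP.
    """
    sessions = []
    current = []
    last_ts = None
    for ts, page in sorted(events):
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Bot removal would happen before this step, typically by filtering on user-agent and on implausibly fast request rates.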


> We do not include number of hits on a page or a star rating for each page
> (we have none). Sounds like you're saying that this is where the problem
> lies.


No.  I think that this is a good idea.  Occasionally, a threshold on number
of hits is useful, but normally not.  It is common for some additional
measure of engagement to be helpful as well.  For instance, if you can tell
that the page survived for some number of seconds in the user's browser,
that might be better than a simple page load (JS-based beacons work well for
this).  You might also get clues from an unload event (not usually
reliable) or evidence that the user went somewhere else right away (this is
very tricky to get right in the presence of multiple tabs).

But the idea of a binary they-did-it feature is good in principle.


> Mahout expecting either a binary variable or a count of number of
> accesses would explain the weird results.


Yes.  You can check for data format sensitivity by just putting a 1 as the
third value on each line.
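Concretely, something like this (a sketch; the idea is just to force every preference to 1 in userID,itemID,pref CSV input, whatever the existing third column says):

```python
def binarize(lines):
    """Rewrite 'user,page' or 'user,page,count' lines as 'user,page,1'."""
    out = []
    for line in lines:
        user, page = line.strip().split(",")[:2]
        out.append("%s,%s,1" % (user, page))
    return out
```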

> Doing some kind of log-entropy
> weighting makes further sense, thanks! Is what you shared log-entropy, by
> the way?
>

It is closely related.

See http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for
my preferred method.  This is available in Mahout.
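For reference, the log-likelihood ratio (G^2) score from that post can be computed like this (a sketch mirroring Mahout's LogLikelihood.logLikelihoodRatio; for two pages, k11 = sessions containing both, k12 and k21 = sessions containing one but not the other, k22 = sessions containing neither):

```python
from math import log

def x_log_x(x):
    # x * log(x), with the convention 0 * log(0) = 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized Shannon entropy of the counts, in the xLogX form
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio test score for a 2x2 cooccurrence table."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)
```

Counts consistent with independence score near zero; strong cooccurrence scores large, so you keep only the pairs whose score exceeds some threshold.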
