Thanks so much for this, Ted.

I'm not quite sure that your answer is directly responsive to the
question, so let me try to clarify. As far as I understand Mahout, this is
our process:
1. The goal is to examine the relationships among 250 web pages, so we
extract the user sessions (a session ends after half an hour of inactivity),
remove bot entries, and produce input that looks like this:
User#   Page#
1       5
1       8
2       1
...
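
To make the input shape concrete, here is a minimal sketch of how such (user, page) pairs could be turned into binary visit sets and per-pair co-occurrence counts. The sample pairs and the function names `pages_by_user` and `cooccurrence_counts` are made up for illustration; this is plain Python and not Mahout's API or its internal processing.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, page) pairs in the same shape as the sample input above.
pairs = [(1, 5), (1, 8), (2, 1), (2, 5), (2, 5)]


def pages_by_user(pairs):
    """Group pages per user; a set makes repeat visits count once (binary)."""
    seen = defaultdict(set)
    for user, page in pairs:
        seen[user].add(page)
    return dict(seen)


def cooccurrence_counts(seen):
    """Count, for each unordered page pair, how many users visited both."""
    counts = defaultdict(int)
    for pages in seen.values():
        for a, b in combinations(sorted(pages), 2):
            counts[(a, b)] += 1
    return dict(counts)


seen = pages_by_user(pairs)
counts = cooccurrence_counts(seen)
```

Because each user's pages are stored in a set, the duplicate (2, 5) row has no effect, which is what "binary" input means here.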

We do not include number of hits on a page or a star rating for each page
(we have none). Sounds like you're saying that this is where the problem
lies.  Mahout expecting either a binary variable or a count of number of
accesses would explain the weird results. Doing some kind of log-entropy
weighting makes further sense, thanks! Is what you shared log-entropy, by
the way?
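
For reference, the weighting formula in the quoted message below might be sketched as follows. This assumes the two log terms are multiplied, that natural logarithms are meant (the message does not state a base), and that the input is a binary matrix given as a list of lists; the name `weight_matrix` is hypothetical and this is not Mahout code.

```python
import math


def weight_matrix(binary):
    """Apply weight(i,j) = log((N_rows+1)/(rowSum_i+1)) * log((N_cols+1)/(colSum_j+1)).

    Cells that are not 1 in the binary matrix keep a weight of zero.
    Natural log is an assumption; the original formula names no base.
    """
    n_rows = len(binary)
    n_cols = len(binary[0]) if binary else 0
    row_sums = [sum(row) for row in binary]
    col_sums = [sum(row[j] for row in binary) for j in range(n_cols)]

    weights = [[0.0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(binary):
        for j, cell in enumerate(row):
            if cell == 1:
                row_term = math.log((n_rows + 1) / (row_sums[i] + 1))
                col_term = math.log((n_cols + 1) / (col_sums[j] + 1))
                weights[i][j] = row_term * col_term
    return weights


# Made-up 3x3 binary matrix, for illustration only.
example = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
weights = weight_matrix(example)
```

Rows and columns with many 1s get smaller weights, so very popular pages contribute less to each score than rarely visited ones.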

Kai :-)


On 12/22/12 4:47 AM, "Ted Dunning" <[email protected]> wrote:

>The basic reason that it is common to binarize the relationships is that
>putting weights on these relationships makes it really easy to over-fit,
>thus giving you very goofy results.
>
>One method for putting weights on these elements is to simply use
>
>weight(i,j) = log((N_rows + 1) / (rowSum_i + 1)) * log((N_cols + 1) / (colSum_j + 1))
>
>Where all weights are set to zero if you don't have a 1 in that cell of
>the item-item matrix.
>
>Another reasonable weighting is to simply use row or column counts
>(depending on the application).  You get something very similar to this
>weighting when you use a text retrieval engine to produce recommendations
>where documents are columns of the item-item matrix and you multiply by a
>user history expressed in items.
>
>On Fri, Dec 21, 2012 at 3:45 PM, Kai R. Larsen
><[email protected]> wrote:
>
>> Hi,
>>
>> My sincere apologies if this is a naïve question (I'm sure it is).
>>
>> I've engaged a programmer to take a weblog and focus on 250 pages
>> containing items that may be similar (or not).  The goal is to create
>> item-item relationship tables where every cell contains a score for how
>> similar two items are.  He now tells me that only two of the (many) Mahout
>> algorithms can be used to generate such tables, and those that do generate
>> a distance of 1 or some other constant value between all pairs.
>>
>> This can't be true, can it?  There must be a way to tease out such
>> information from the algorithms.  Any advice?  Any ideas why all
>> relationships would be one?  While it is common for the website users to
>> have visited only one page at a time, it is not pervasive.
>>
>> Best,
>>
>> Kai Larsen
>>
