Thanks so much for this, Ted. I'm not quite sure that your answer is directly
responsive to the question, so let me try to clarify. As far as I understand
Mahout, this is our process:

1. The goal is to examine the relationships among 250 web pages, so we extract
   the user sessions (a session ends after half an hour of inactivity), remove
   bot entries, and end up with input that looks like this:

   User#  Page#
   1      5
   1      8
   2      1
   ...

We do not include the number of hits on a page or a star rating for each page
(we have none). It sounds like you're saying that this is where the problem
lies. If Mahout expects either a binary variable or a count of accesses, that
would explain the weird results. Doing some kind of log-entropy weighting
makes further sense, thanks! Is what you shared log-entropy, by the way?

Kai :-)

On 12/22/12 4:47 AM, "Ted Dunning" <[email protected]> wrote:

> The basic reason that it is common to binarize the relationships is that
> putting weights on these relationships makes it really easy to over-fit,
> thus giving you very goofy results.
>
> One method for putting weights on these elements is to simply use
>
> weight(i,j) = log((N_rows + 1) / (rowSum_i + 1)) log((N_cols + 1) / (colSum_j + 1))
>
> Where all weights are set to zero if you don't have a 1 in that cell of the
> item-item matrix.
>
> Another reasonable weighting is to simply use row or column counts
> (depending on the application). You get something very similar to this
> weighting when you use a text retrieval engine to produce recommendations
> where documents are columns of the item-item matrix and you multiply by a
> user history expressed in items.
>
> On Fri, Dec 21, 2012 at 3:45 PM, Kai R. Larsen <[email protected]> wrote:
>
>> Hi,
>>
>> My sincere apologies if this is a naïve question (I'm sure it is).
>>
>> I've engaged a programmer to take an weblog and focus on 250 pages
>> containing items that may be similar (or not). The goal is create
>> item-item relationship tables where every cell contains a score for how
>> similar two items are. He now tells me that only two of the (many) Mahout
>> algorithms can be used to generate such tables, and those that do generate
>> a distance of 1 or some other constant value between all pairs.
>>
>> This can't be true, can it? There must be a way to tease out such
>> information from the algorithms. Any advice? Any ideas why all
>> relationships would be one?
>> While it is common for the website users to
>> have visited only one page at a time, it is not pervasive.
>>
>> Best,
>>
>> Kai Larsen
>>
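
For reference, the weighting Ted describes in the quoted message can be
sketched in a few lines of Python. This is only a sketch under stated
assumptions, not Mahout code: the two adjacent log terms are read as a
product, the binary item-item matrix is assumed to come from raw
co-occurrence (two pages visited by the same user), which the thread does
not specify, and the names binary_cooccurrence and weight_matrix as well as
the sample pairs are made up for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def binary_cooccurrence(pairs):
    """Build a binary item-item matrix from (user, page) pairs.

    Assumption: cell (i, j) is 1 if at least one user visited both pages.
    """
    pages_by_user = defaultdict(set)
    for user, page in pairs:
        pages_by_user[user].add(page)

    matrix = defaultdict(dict)
    for pages in pages_by_user.values():
        for i, j in combinations(sorted(pages), 2):
            matrix[i][j] = 1
            matrix[j][i] = 1
    return matrix


def weight_matrix(matrix, n_items):
    """weight(i,j) = log((N_rows+1)/(rowSum_i+1)) * log((N_cols+1)/(colSum_j+1)).

    Cells that are not 1 in the binary matrix are left out, i.e. weight zero.
    """
    row_sum = {i: len(cols) for i, cols in matrix.items()}
    col_sum = defaultdict(int)
    for cols in matrix.values():
        for j in cols:
            col_sum[j] += 1

    weighted = {}
    for i, cols in matrix.items():
        row_factor = math.log((n_items + 1) / (row_sum[i] + 1))
        weighted[i] = {
            j: row_factor * math.log((n_items + 1) / (col_sum[j] + 1))
            for j in cols
        }
    return weighted


# Made-up sample in the "User# Page#" layout; 250 is the page count from the thread.
pairs = [(1, 5), (1, 8), (2, 1), (2, 5), (3, 5), (3, 8), (3, 9)]
weights = weight_matrix(binary_cooccurrence(pairs), n_items=250)
```

With this scheme, a page that co-occurs with many others (page 5 in the
sample) contributes a smaller factor than a page that co-occurs with few, so
the scores differ between pairs instead of collapsing to a constant.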
