On Nov 15, 2012, at 6:02 PM, Pat Ferrel <[email protected]> wrote:
> Trying to catch up.
>
> Isn't the sum of similarities actually a globally comparable number for
> strength of preference in a boolean model? I was thinking it wasn't but it is
> really. It may not be ideal but as an ordinal it should work, right?

Not sure without details.

> Is the logic behind the IDF idea that very popular items are of less value in
> calculating recommendations?

Roughly yes.

> If an IDF weight is to be applied isn't it to the preference values (0,1)
> before the similarity is calculated between users?

Not directly in my approach. In my approach LLR is used to produce a binary
item-to-item matrix. The LLR computation has terms that are similar to IDF and
which do something similar to what you say, though with corrections to avoid
excess contribution by singletons and such. I then weight the item-item matrix
according to the frequency of the remaining non-zero elements in it.

I often recommend achieving this end by simply creating a document per row of
the item-item matrix and constructing a query from the user's item history. A
text retrieval engine then does the necessary weighting as part of its normal
query operations. (A rough code sketch of both steps follows at the end of
this message.)

> The intuition would be that people aren't all that similar just because they
> have puppy liking in common.
>
> I'm afraid I got lost applying IDFish weighting to similarity strengths
> themselves.
>
> On Nov 15, 2012, at 10:50 AM, Sean Owen <[email protected]> wrote:
>
> That's kind of what it does now... though it weights everything as "1". Not
> so smart, but for sparse-ish data it is not far off from a smarter answer.
>
>
> On Thu, Nov 15, 2012 at 6:47 PM, Ted Dunning <[email protected]> wrote:
>
>> My own preference (pun intended) is to use log-likelihood score for
>> determining which similarities are non-zero and then use simple frequency
>> weighting such as IDF for weighting the similarities. This doesn't make
>> direct use of cooccurrence frequencies, but it works really well. One
>> reason that it seems to work well is that using only general occurrence
>> frequencies makes it *really* hard to overfit.
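To make the two steps above concrete, here is a rough, self-contained Java
sketch. This is not Mahout code; the class, method, and variable names
(LlrIndicatorSketch, indicators, recommendScores, llrThreshold, and so on) are
invented for illustration. The first half applies Dunning's 2x2 log-likelihood
ratio test to decide which item-item cooccurrences survive as non-zero entries
of a binary indicator matrix. The second half scores candidates for one user
roughly the way a retrieval engine would if each matrix row were indexed as a
document and the user's history were the query, weighting matches by an
IDF-style factor computed over the non-zero entries of the matrix.

import java.util.*;

// Illustrative sketch only -- not the Mahout API; names are invented for the example.
public class LlrIndicatorSketch {

  // Dunning's 2x2 log-likelihood ratio (G^2) over raw counts:
  // k11 = users with both items, k12 = item A only, k21 = item B only, k22 = neither.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    double llr = 2.0 * (rowEntropy + colEntropy - matEntropy);
    return llr < 0.0 ? 0.0 : llr;  // guard against round-off
  }

  // Unnormalized entropy: xLogX(sum) minus the sum of xLogX(counts).
  static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long c : counts) { sum += c; sumXLogX += xLogX(c); }
    return xLogX(sum) - sumXLogX;
  }

  static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }

  // Step 1: binary item-item indicator matrix. Item a is linked to item b only
  // when their cooccurrence across user histories is anomalous by the LLR test.
  // Item ids are assumed to be non-negative ints; llrThreshold is a free parameter.
  static Map<Integer, Set<Integer>> indicators(Collection<Set<Integer>> userHistories,
                                               double llrThreshold) {
    Map<Integer, Integer> itemCount = new HashMap<>();
    Map<Long, Integer> pairCount = new HashMap<>();
    long numUsers = userHistories.size();
    for (Set<Integer> history : userHistories) {
      for (int a : history) {
        itemCount.merge(a, 1, Integer::sum);
        for (int b : history) {
          if (a < b) pairCount.merge(((long) a << 32) | b, 1, Integer::sum);
        }
      }
    }
    Map<Integer, Set<Integer>> links = new HashMap<>();
    for (Map.Entry<Long, Integer> e : pairCount.entrySet()) {
      int a = (int) (e.getKey() >> 32);
      int b = e.getKey().intValue();
      long k11 = e.getValue();
      long k12 = itemCount.get(a) - k11;
      long k21 = itemCount.get(b) - k11;
      long k22 = numUsers - k11 - k12 - k21;
      if (logLikelihoodRatio(k11, k12, k21, k22) > llrThreshold) {
        links.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        links.computeIfAbsent(b, k -> new HashSet<>()).add(a);
      }
    }
    return links;
  }

  // Step 2: score candidates as a retrieval engine would if every row of the
  // indicator matrix were a document and the user's history were the query.
  // Each matching history item contributes an IDF-style weight computed over
  // the non-zero entries of the matrix, so very common indicators count for less.
  static Map<Integer, Double> recommendScores(Set<Integer> userHistory,
                                              Map<Integer, Set<Integer>> links) {
    Map<Integer, Integer> docFreq = new HashMap<>();
    for (Set<Integer> row : links.values()) {
      for (int item : row) docFreq.merge(item, 1, Integer::sum);
    }
    double numDocs = links.size();
    Map<Integer, Double> scores = new HashMap<>();
    for (Map.Entry<Integer, Set<Integer>> row : links.entrySet()) {
      int candidate = row.getKey();
      if (userHistory.contains(candidate)) continue;  // don't recommend the history itself
      double score = 0.0;
      for (int h : userHistory) {
        if (row.getValue().contains(h)) {
          score += Math.log((numDocs + 1.0) / (docFreq.get(h) + 1.0));
        }
      }
      if (score > 0.0) scores.put(candidate, score);
    }
    return scores;
  }
}

In a real deployment the rows would be indexed in a search engine (Lucene,
Solr, etc.) and the engine's own query-time weighting and ranking would stand
in for recommendScores, but the arithmetic is essentially what is sketched
here.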
