I concur. Log is a natural choice for lots of measures like this.

On Tue, Nov 23, 2010 at 7:12 AM, Sean Owen <[email protected]> wrote:
> Yes, I think the logarithm is a fine choice. The base doesn't matter, as
> the scale of the ratings doesn't make a difference.
>
> On Tue, Nov 23, 2010 at 2:07 PM, Sebastian Schelter <[email protected]> wrote:
> > Hi,
> >
> > I'm currently looking into the last.fm dataset (from
> > http://denoiserthebetter.posterous.com/music-recommendation-datasets) as I'm
> > planning to write a magazine article or blog post on how to create a simple
> > music recommender with Mahout. It should be an easy-to-follow tutorial that
> > encourages people to download Mahout and play a little with the recommender
> > stuff.
> >
> > The dataset consists of several million
> > (userID, artist, numberOfPlays) tuples, and my goal is to find the most
> > similar artists and recommend new artists to users. I extracted a 20% sample
> > of the data, ignored the numberOfPlays, and used an ItemBasedRecommender with
> > LogLikelihoodSimilarity. I did some random tests and got reasonable results.
> >
> > Now I want to go on and include the "strength" of the preference in the
> > computation. What would be the best way to deal with the numberOfPlays? I
> > thought about using the log of the numberOfPlays as the rating value and
> > applying PearsonCorrelationSimilarity as the measure. Would that be a viable
> > way to approach this problem?
> >
> > --sebastian
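The claim that the log base doesn't matter can be sketched quickly. Changing the base multiplies every log-rating by the same positive constant (log_b x = ln x / ln b), and Pearson correlation is invariant under such rescaling. The snippet below is only an illustration, not Mahout code: the play counts are made up, and pearson() is a plain stdlib implementation rather than Mahout's PearsonCorrelationSimilarity.

```java
public class LogBaseDemo {

    // Plain Pearson correlation of two equal-length vectors.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx  += (x[i] - mx) * (x[i] - mx);
            dy  += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        // Hypothetical play counts of two users over the same five artists.
        double[] playsA = {3, 40, 7, 120, 1};
        double[] playsB = {5, 25, 9, 200, 2};
        int n = playsA.length;
        double[] lnA = new double[n], lnB = new double[n];   // natural log
        double[] lgA = new double[n], lgB = new double[n];   // log base 2
        for (int i = 0; i < n; i++) {
            lnA[i] = Math.log(playsA[i]);
            lnB[i] = Math.log(playsB[i]);
            // log_2(x) = ln(x) / ln(2): a constant positive rescaling.
            lgA[i] = lnA[i] / Math.log(2);
            lgB[i] = lnB[i] / Math.log(2);
        }
        // Both bases yield the same correlation.
        System.out.printf("ln:   %.6f%n", pearson(lnA, lnB));
        System.out.printf("log2: %.6f%n", pearson(lgA, lgB));
    }
}
```

The same argument covers any fixed base (10, 2, e), so for a Mahout DataModel one can simply store Math.log(numberOfPlays) as the preference value and not worry about the choice.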
