Hi,
I'm currently looking into the last.fm dataset (from
http://denoiserthebetter.posterous.com/music-recommendation-datasets) as
I'm planning to write a magazine article or blog post on how to create a
simple music recommender with Mahout. It should be an easy-to-follow
tutorial that encourages people to download Mahout and play a little
with the recommender stuff.
The dataset consists of several million
(userID, artist, numberOfPlays) tuples, and my goal is to find the most
similar artists and recommend new artists to users. I extracted a 20%
sample of the data, ignored numberOfPlays, and used an
ItemBasedRecommender with LogLikelihoodSimilarity. A few random spot
checks gave reasonable results.
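For readers following along, here is a minimal plain-Java sketch of the log-likelihood ratio statistic that this kind of boolean (played / not played) similarity is built on. It does not call Mahout; the class name LlrSketch, the example counts, and the 1 - 1/(1 + LLR) mapping to a similarity value are assumptions made for illustration only.

```java
// Sketch only: plain Java, no Mahout dependency.
// Class name and example counts are hypothetical.
class LlrSketch {

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N*log(N) - sum(c*log(c)).
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // k11: users who played both artists, k12: only artist A,
    // k21: only artist B, k22: users who played neither.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    // Assumed mapping of the unbounded ratio into [0, 1).
    static double similarity(long k11, long k12, long k21, long k22) {
        double llr = logLikelihoodRatio(k11, k12, k21, k22);
        return 1.0 - 1.0 / (1.0 + llr);
    }

    public static void main(String[] args) {
        // Two artists with far more shared listeners than chance predicts.
        System.out.println(similarity(100, 50, 40, 10000));
        // Two artists whose listeners overlap exactly as chance predicts.
        System.out.println(similarity(10, 90, 90, 810));
    }
}
```

Note that the number of plays never enters this computation; only the overlap of listener sets does.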
Now I want to go on and include the "strength" of the preference in the
computation. What would be the best way to deal with numberOfPlays?
I thought about using the log of numberOfPlays as the rating value and
applying PearsonCorrelationSimilarity as the similarity measure. Would
that be a viable way to approach this problem?
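A rough sketch of that idea in plain Java, without Mahout: play counts are log-transformed into rating values, and the Pearson correlation is computed between two artists over the users who played both. The class name PlayCountPearsonSketch and the play counts are hypothetical, and this is not Mahout's PearsonCorrelationSimilarity implementation.

```java
// Sketch only: plain Java, no Mahout dependency.
// Class name and play counts are hypothetical.
class PlayCountPearsonSketch {

    // Natural log of each play count; a single play maps to a rating of 0.
    static double[] toRatings(int[] numberOfPlays) {
        double[] ratings = new double[numberOfPlays.length];
        for (int i = 0; i < numberOfPlays.length; i++) {
            ratings[i] = Math.log(numberOfPlays[i]);
        }
        return ratings;
    }

    // Pearson correlation of two equally long rating vectors, where index i
    // refers to the same user in both. Returns NaN if either vector is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two vectors of equal length >= 2");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }
        double denom = Math.sqrt(sumX2 * sumY2);
        return denom == 0.0 ? Double.NaN : sumXY / denom;
    }

    public static void main(String[] args) {
        // Play counts of four users who listened to both artists.
        int[] playsArtistA = {1, 10, 100, 1000};
        int[] playsArtistB = {2, 20, 200, 2000};
        System.out.println(pearson(toRatings(playsArtistA), toRatings(playsArtistB)));
    }
}
```

The log compresses heavy listeners, so one user with thousands of plays does not dominate the correlation. Because Pearson is computed only over users who played both artists, pairs with very few shared listeners produce unstable values.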
--sebastian
- Playing with the last.fm dataset Sebastian Schelter