Hi, I'm working on an experiment where I have a catalog of movies from IMDB containing all the metadata for each movie (title, description, year, director, actors, etc.), and I would like to solve the following problem:
INPUT: a movie title (or its IMDB id)
OUTPUT: the most "similar" movies

I have no user base or user activity, just the movie items themselves. So by "similar" I mean the movies with the most similar title and/or description and/or director, etc.

I'm not sure how to build an appropriate global similarity measure. For the description, I could for example build term vectors from the most frequent words (using TF-IDF) or use LDA. But beyond that I have nothing other than intuition to decide how much weight to give to the similarity between descriptions versus the similarity between actors, or to two movies being from approximately the same year, and so on.

Has anyone dealt with a similar problem, or does anyone have insights on how to approach it? Also, does Mahout contain any tools that would help me build such a (weighted) similarity measure and, most importantly, let me test whether one similarity measure is better than another?

Thanks a lot in advance for any insights.

Eric
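
P.S. To make the question concrete, here is a minimal Python sketch of the kind of weighted similarity I have in mind. Everything in it is an assumption for illustration only: the toy catalog, the field names, the hand-picked weights and the year decay scale are made up, and it uses plain Python rather than Mahout. The weights are exactly the part I don't know how to choose or evaluate.

```python
import math
from collections import Counter

# Toy catalog: hypothetical records, not real IMDB data.
MOVIES = {
    "m1": {"description": "a space crew fights an alien creature",
           "director": "A", "actors": {"x", "y"}, "year": 1979},
    "m2": {"description": "a space crew fights alien creatures on a colony",
           "director": "B", "actors": {"x", "z"}, "year": 1986},
    "m3": {"description": "two friends open a bakery in paris",
           "director": "C", "actors": {"q"}, "year": 2005},
}

# Hand-picked weights (an assumption); they sum to 1 so scores stay in [0, 1].
WEIGHTS = {"description": 0.5, "actors": 0.2, "director": 0.2, "year": 0.1}


def tfidf_vectors(docs):
    """Map each doc id to a sparse {term: tf * idf} vector (smoothed idf)."""
    tokenized = {k: text.lower().split() for k, text in docs.items()}
    n = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    return {
        k: {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in Counter(toks).items()}
        for k, toks in tokenized.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Overlap of two sets, e.g. the casts of two movies."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def year_similarity(y1, y2, scale=10.0):
    """1.0 for the same year, decaying as the gap grows (scale is a guess)."""
    return math.exp(-abs(y1 - y2) / scale)


def similarity(a, b, vectors, weights=WEIGHTS):
    """Weighted sum of per-field similarities, each in [0, 1]."""
    ma, mb = MOVIES[a], MOVIES[b]
    parts = {
        "description": cosine(vectors[a], vectors[b]),
        "actors": jaccard(ma["actors"], mb["actors"]),
        "director": 1.0 if ma["director"] == mb["director"] else 0.0,
        "year": year_similarity(ma["year"], mb["year"]),
    }
    return sum(weights[k] * parts[k] for k in weights)


def most_similar(movie_id, top_n=2):
    """Rank all other movies by weighted similarity to movie_id."""
    vectors = tfidf_vectors({k: m["description"] for k, m in MOVIES.items()})
    scores = [(other, similarity(movie_id, other, vectors))
              for other in MOVIES if other != movie_id]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:top_n]
```

Calling most_similar("m1") ranks the other two toy movies against "m1". Changing WEIGHTS changes the ranking, and what I'm missing is a principled way to compare one choice of weights against another.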