Hi mc tell,

To featurize the description, if you have all the descriptions on hand, you could use LDA to extract a set of topics and then decide, for each description, which topics are relevant to it and to what extent. The current LDA implementation provides this as a per-document topic-probability distribution (see MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458>).
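
As a rough illustration of how those per-document topic distributions could then feed a similarity measure, here is a minimal Python sketch. It does not call Mahout: the item names and topic probabilities are made-up values standing in for whatever LDA would output, and cosine similarity is only one possible choice of measure.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # no topic mass at all: treat as "not similar"
    return dot / (norm_p * norm_q)


# Hypothetical per-description topic distributions over 4 topics.
# In practice these would come from the LDA output; the numbers here
# are invented purely for illustration.
doc_topics = {
    "item_a": [0.70, 0.20, 0.05, 0.05],
    "item_b": [0.65, 0.25, 0.05, 0.05],
    "item_c": [0.05, 0.05, 0.10, 0.80],
}

sim_ab = cosine_similarity(doc_topics["item_a"], doc_topics["item_b"])
sim_ac = cosine_similarity(doc_topics["item_a"], doc_topics["item_c"])

print("a vs b:", round(sim_ab, 3))
print("a vs c:", round(sim_ac, 3))
```

The topic vector could also be concatenated with the other attributes (genre, year, ratings) to form one feature vector per item, but how to weight those parts against each other is a separate question that this sketch does not address.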

Regards,
Vasil

On Tue, Jun 7, 2011 at 3:23 PM, mc tell <[email protected]> wrote:

> Hi,
>
> I would like to build a system able to say how similar two items are from a
> set of attributes including: title, genre, ratings, year, description and
> more.
> So i guess i could build a feature vector for each item and then come up
> with some similarity measures.
>
> However i have no clue on which method i could use to:
> - determine a weight to put on each feature (other than intuitive)
> - how to deal with the 'description' attribute (i.e. a more or less long
> free text) and to transform it into a relevant set of features.
> - what algorithms in mahout could be adapted to build such things
>
> Thanks a lot in advance for any insights, links or anything related to
> that.
>
