The output of RowSimilarityJob can be loaded by the FileItemSimilarity.
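
To illustrate the kind of in-memory lookup being asked about, here is a minimal sketch in plain Java. It assumes the similarities are already available as text lines of the form itemID1,itemID2,value (comma or tab separated); that format is an assumption here, so check what your job actually writes. The class InMemorySimilarity and its methods are hypothetical names, not part of Mahout. Pairs missing from the input come back as NaN, which stands in for "undefined".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Mahout class.
class InMemorySimilarity {

    // itemID -> (other itemID -> similarity value)
    private final Map<Long, Map<Long, Double>> rows = new HashMap<>();

    // Parses lines of the assumed form "itemID1,itemID2,value".
    static InMemorySimilarity load(Reader source) {
        InMemorySimilarity result = new InMemorySimilarity();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] fields = line.split("[,\t]");
                long itemA = Long.parseLong(fields[0].trim());
                long itemB = Long.parseLong(fields[1].trim());
                double value = Double.parseDouble(fields[2].trim());
                // Stored in both directions: this treats similarity as
                // symmetric, which is an assumption of this sketch.
                result.put(itemA, itemB, value);
                result.put(itemB, itemA, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    private void put(long from, long to, double value) {
        rows.computeIfAbsent(from, k -> new HashMap<>()).put(to, value);
    }

    // Returns the stored similarity, or NaN when the pair is unknown.
    double itemSimilarity(long itemA, long itemB) {
        Map<Long, Double> row = rows.get(itemA);
        Double value = (row == null) ? null : row.get(itemB);
        return (value == null) ? Double.NaN : value;
    }
}
```

Calling InMemorySimilarity.load(new FileReader(path)) and then itemSimilarity(a, b) gives the lookup; if the input only holds the top 100 neighbours per item, every other pair is simply absent and yields NaN.
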
--sebastian

On 02/24/2014 08:31 PM, Juan José Ramos wrote:
Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code, without using the command line tool?

I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender.

What I want exactly is a matrix in memory where, for every doc in my catalogue, I have the similarity with the 100 most similar items (that is the threshold I am using) and an undefined similarity for the rest.

Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code and, I guess, also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the in-memory representation I intend to use.

Thanks a lot.
