Correct me if I'm wrong, but isn't ItemSimilarityJob meant for item-based CF? In particular, the documentation says that preferences in the input file should look like: userID,itemID[,preferencevalue]
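For reference, a minimal preference file in that format might look like the following (the file name, IDs and values here are made up for illustration; the preference value is optional, and when it is omitted a "boolean" preference is assumed):

```shell
# hypothetical prefs.csv in the userID,itemID[,preferencevalue] format
cat > prefs.csv <<'EOF'
1,101,3.5
1,102
2,101,5.0
2,103,2.0
EOF
wc -l < prefs.csv
```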
In my case, though, the input is just text documents, and I want to pre-compute the similarities between them beforehand, even before any user has expressed a preference for any item. To use ItemSimilarityJob for this purpose, what input should I provide? Would it be the output of seq2sparse? Thanks again.

On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter <[email protected]> wrote:

> You're right, my bad. If you don't use RowSimilarityJob directly, but
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> (which calls RowSimilarityJob under the covers), your output will be a
> text file that is directly usable with FileItemSimilarity.
>
> --sebastian
>
> On 02/24/2014 09:30 PM, Juan José Ramos wrote:
>
>> Thanks for the prompt reply.
>>
>> RowSimilarityJob produces output of the form:
>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>
>> whereas FileItemSimilarity expects comma- or tab-separated input.
>>
>> I assume you meant that the output of RowSimilarityJob can be loaded
>> by FileItemSimilarity after the appropriate parsing. Is that correct,
>> or is there actually a way to load the raw output of RowSimilarityJob
>> into FileItemSimilarity?
>>
>> Thanks.
>>
>> On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter <[email protected]> wrote:
>>
>>> The output of RowSimilarityJob can be loaded by FileItemSimilarity.
>>>
>>> --sebastian
>>>
>>> On 02/24/2014 08:31 PM, Juan José Ramos wrote:
>>>
>>>> Is there a way to reproduce this process:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>> in Java code rather than with the command-line tools? I am not
>>>> interested in the clustering part, but in 'Calculate several similar
>>>> docs to each doc in the data'. In particular, I am interested in
>>>> loading the output of the rowsimilarity tool into memory to be used
>>>> as my custom ItemSimilarity implementation for an ItemBasedRecommender.
>>>>
>>>> What I want exactly is a matrix in memory where, for every doc in my
>>>> catalogue, I have the similarity to the 100 most similar items (that
>>>> is the threshold I am using) and an undefined similarity for the rest.
>>>>
>>>> Is this possible with the Java API? I know it can be done by calling
>>>> the commands from inside Java code, and I guess also by using the
>>>> corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and
>>>> RowSimilarityJob classes. But I still cannot see an easy way of parsing
>>>> the output of RowSimilarityJob into the in-memory representation I
>>>> intend to use.
>>>>
>>>> Thanks a lot.
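[Editor's note: a minimal sketch of the parsing step discussed above. It assumes RowSimilarityJob's output has been dumped to text lines of the form `Key: 0: Value: {61112:0.211...,52144:0.237...}` as shown earlier in the thread; the class name, regex, and exact line format are assumptions you would need to adapt to your actual dump.]

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: parse one text line of dumped RowSimilarityJob output, e.g.
 *   Key: 0: Value: {61112:0.2113938,52144:0.2379784}
 * into (rowId, {columnId -> similarity}). Hypothetical helper, not Mahout API.
 */
public class RowSimilarityParser {

    // Matches "Key: <id>: Value: {<id>:<sim>,<id>:<sim>,...}"
    private static final Pattern LINE =
            Pattern.compile("Key:\\s*(\\d+):\\s*Value:\\s*\\{([^}]*)\\}");

    /** Returns the row id and its similarity vector, or null if the line does not match. */
    public static Map.Entry<Long, Map<Long, Double>> parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        long rowId = Long.parseLong(m.group(1));
        Map<Long, Double> sims = new LinkedHashMap<>();
        for (String pair : m.group(2).split(",")) {
            if (pair.isEmpty()) {
                continue;
            }
            int colon = pair.indexOf(':');
            sims.put(Long.parseLong(pair.substring(0, colon).trim()),
                     Double.parseDouble(pair.substring(colon + 1).trim()));
        }
        return new AbstractMap.SimpleImmutableEntry<>(rowId, sims);
    }

    public static void main(String[] args) {
        Map.Entry<Long, Map<Long, Double>> row =
                parseLine("Key: 0: Value: {61112:0.211,52144:0.238}");
        System.out.println(row.getKey() + " -> " + row.getValue());
    }
}
```

From the parsed (itemA, itemB, similarity) triples you could either write out the `itemID1,itemID2,similarity` lines that FileItemSimilarity reads, or skip the file entirely and build a GenericItemSimilarity in memory, which accepts a collection of GenericItemSimilarity.ItemItemSimilarity objects and can then be passed to an ItemBasedRecommender.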
