Thanks for the answer. That was the approach I had in mind in the first place; the only difference is that I will write the output to a file that can later be used to create a FileItemSimilarity.
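A minimal sketch of that conversion step in plain Java, in case it is useful to anyone else. It assumes the RowSimilarityJob output has already been dumped to text in the "Key: 0: Value: {...}" form quoted below, and that FileItemSimilarity accepts lines of the form itemID1,itemID2,similarity. The class and method names (RowSimilarityToCsv, toCsvLines) are made up for illustration and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: turns one dumped similarity row
// into comma-separated "rowId,otherId,similarity" lines.
class RowSimilarityToCsv {

    // Matches one dumped row, e.g. "Key: 0: Value: {61112:0.21,52144:0.23}".
    // Assumption: this is the text form shown in the thread.
    private static final Pattern ROW =
        Pattern.compile("Key:\\s*(\\d+):\\s*Value:\\s*\\{(.*)\\}");

    static List<String> toCsvLines(String dumpedRow) {
        List<String> lines = new ArrayList<>();
        Matcher m = ROW.matcher(dumpedRow.trim());
        if (!m.matches()) {
            return lines; // not a similarity row, nothing to emit
        }
        String rowId = m.group(1);
        for (String entry : m.group(2).split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                continue; // skips empty vectors and a literal "..." placeholder
            }
            String otherId = parts[0].trim();
            String similarity = parts[1].trim();
            // Validate both fields; a NumberFormatException signals bad input.
            Long.parseLong(otherId);
            Double.parseDouble(similarity);
            // Keep the original text of the value so no precision is lost.
            lines.add(rowId + "," + otherId + "," + similarity);
        }
        return lines;
    }

    public static void main(String[] args) {
        String row =
            "Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565}";
        for (String line : toCsvLines(row)) {
            System.out.println(line);
        }
    }
}
```

Writing these lines to a file, one per line, should give something that can be handed to the FileItemSimilarity constructor that takes a java.io.File, though I have not checked this against every Mahout version.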
I think that would be a very nice feature to have in the API.

Thanks again.

On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter <[email protected]> wrote:

> I overlooked that you're interested in document similarities. Sorry again :)
>
> Another way would be to read the output of RowSimilarityJob with a
> o.a.m.common.iterator.sequencefile.SequenceFileDirIterable
>
> You create a list of instances of
> o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity
>
> e.g. for the output
>
> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>
> you would do
>
> list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
> list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
> ...
>
> After that you create a GenericItemSimilarity from the list of
> ItemItemSimilarities, which is the in-memory item similarity you asked for.
>
> Hope that helps,
> Sebastian
>
> On 02/24/2014 10:04 PM, Juan José Ramos wrote:
>
>> Correct me if I'm wrong, but is the ItemSimilarityJob not meant to be
>> for item-based CF? In particular, in the documentation I can read that:
>> Preferences in the input file should look like
>> userID,itemID[,preferencevalue]
>>
>> And in my case the input I have is just text documents and I want to
>> pre-compute similarities between them beforehand, even before any user
>> has expressed any preference value for any item.
>>
>> In order to use ItemSimilarityJob for this purpose, what should be the
>> input I need to provide? Would it be the output of seq2sparse?
>>
>> Thanks again.
>>
>> On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter <[email protected]>
>> wrote:
>>
>>> You're right, my bad. If you don't use RowSimilarityJob directly, but
>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>> (which calls RowSimilarityJob under the covers), your output will be a
>>> text file that is directly usable with FileItemSimilarity.
>>>
>>> --sebastian
>>>
>>> On 02/24/2014 09:30 PM, Juan José Ramos wrote:
>>>
>>>> Thanks for the prompt reply.
>>>>
>>>> RowSimilarityJob produces an output in the form of:
>>>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>>>
>>>> whereas FileItemSimilarity expects comma- or tab-separated input.
>>>>
>>>> I assume that you meant that the output of RowSimilarityJob can be
>>>> loaded by the FileItemSimilarity after doing the appropriate parsing.
>>>> Is that correct, or is there actually a way to load the raw output of
>>>> RowSimilarityJob into FileItemSimilarity?
>>>>
>>>> Thanks.
>>>>
>>>> On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter <[email protected]>
>>>> wrote:
>>>>
>>>>> The output of RowSimilarityJob can be loaded by the
>>>>> FileItemSimilarity.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 02/24/2014 08:31 PM, Juan José Ramos wrote:
>>>>>
>>>>>> Is there a way to reproduce this process:
>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>>> inside Java code and not using the command line tool? I am not
>>>>>> interested in the clustering part but in 'Calculate several similar
>>>>>> docs to each doc in the data'. In particular, I am interested in
>>>>>> loading the output of the rowsimilarity tool into memory to be used
>>>>>> as my custom ItemSimilarity implementation for an
>>>>>> ItemBasedRecommender.
>>>>>>
>>>>>> What I want exactly is to have a matrix in memory where, for every
>>>>>> doc in my catalogue, I have the similarity with the 100 (that is the
>>>>>> threshold I am using) most similar items and an undefined similarity
>>>>>> for the rest.
>>>>>>
>>>>>> Is it possible to do this with the Java API? I know it can be done by
>>>>>> calling the commands from inside the Java code, and I guess also by
>>>>>> using the corresponding SparseVectorsFromSequenceFiles,
>>>>>> DistributedRowMatrix and RowSimilarityJob. But I still cannot see an
>>>>>> easy way of parsing the output of RowSimilarityJob into the memory
>>>>>> representation I intend to use.
>>>>>>
>>>>>> Thanks a lot.
