Regarding the parsing of a VectorWritable object, what is the recommended approach to access the different 'DocID: similarity' pairs?
I can see that if I get the String representation of the org.apache.mahout.math.Vector object, it should not be hard to parse the text representation. However, is there a way to access the individual elements of each 'DocID: similarity' pair? I tried iterating through the individual Vector.Element objects and calling get(), but that does not return what I expect.

More than happy to contribute to the project once I get this working.

Thanks a lot.

On Tue, Feb 25, 2014 at 9:52 AM, Juan José Ramos wrote:

> Thanks for the answer.
>
> That was the approach I had in mind in the first place; the only
> difference would be that I will write the output to a file that can
> later be used to create a FileItemSimilarity.
>
> I think that would be a very nice feature to have in the API.
>
> Thanks again.
>
> On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter wrote:
>
>> I overlooked that you're interested in document similarities. Sry again :)
>>
>> Another way would be to read the output of RowSimilarityJob with a
>> o.a.m.common.iterator.sequencefile.SequenceFileDirIterable
>>
>> You create a list of instances of
>> o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity
>>
>> e.g. for the output
>>
>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>
>> you would do
>>
>> list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
>> list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
>> ...
>>
>> After that you create a GenericItemSimilarity from the list of
>> ItemItemSimilarities, which is the in-memory item similarity you asked for.
>>
>> Hope that helps,
>> Sebastian
>>
>> On 02/24/2014 10:04 PM, Juan José Ramos wrote:
>>
>>> Correct me if I'm wrong, but is ItemSimilarityJob not meant for
>>> item-based CF? In particular, in the documentation I can read that:
>>>
>>> Preferences in the input file should look like
>>> userID,itemID[,preferencevalue]
>>>
>>> In my case the input I have is just text documents, and I want to
>>> pre-compute similarities between them beforehand, even before any
>>> user has expressed any preference value for any item.
>>>
>>> In order to use ItemSimilarityJob for this purpose, what input do I
>>> need to provide? Would it be the output of seq2sparse?
>>>
>>> Thanks again.
>>>
>>> On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter wrote:
>>>
>>>> You're right, my bad. If you don't use RowSimilarityJob directly, but
>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>> (which calls RowSimilarityJob under the covers), your output will be a
>>>> textfile that is directly usable with FileItemSimilarity.
>>>>
>>>> --sebastian
>>>>
>>>> On 02/24/2014 09:30 PM, Juan José Ramos wrote:
>>>>
>>>>> Thanks for the prompt reply.
>>>>>
>>>>> RowSimilarityJob produces output in the form of:
>>>>>
>>>>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>>>>
>>>>> whereas FileItemSimilarity expects comma- or tab-separated input.
>>>>>
>>>>> I assume you meant that the output of RowSimilarityJob can be
>>>>> loaded by FileItemSimilarity after doing the appropriate parsing.
>>>>> Is that correct, or is there actually a way to load the raw output
>>>>> of RowSimilarityJob into FileItemSimilarity?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter wrote:
>>>>>
>>>>>> The output of RowSimilarityJob can be loaded by the
>>>>>> FileItemSimilarity.
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>> On 02/24/2014 08:31 PM, Juan José Ramos wrote:
>>>>>>
>>>>>>> Is there a way to reproduce this process:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>>>>
>>>>>>> inside Java code and not using the command line tool? I am not
>>>>>>> interested in the clustering part but in 'Calculate several
>>>>>>> similar docs to each doc in the data'. In particular, I am
>>>>>>> interested in loading the output of the rowsimilarity tool into
>>>>>>> memory to be used as my custom ItemSimilarity implementation for
>>>>>>> an ItemBasedRecommender.
>>>>>>>
>>>>>>> What I want exactly is a matrix in memory where, for every doc in
>>>>>>> my catalogue, I have the similarity with the 100 most similar
>>>>>>> items (that is the threshold I am using) and an undefined
>>>>>>> similarity for the rest.
>>>>>>>
>>>>>>> Is this possible with the Java API? I know it can be done by
>>>>>>> calling the commands from inside the Java code, and I guess also
>>>>>>> by using the corresponding SparseVectorsFromSequenceFiles,
>>>>>>> DistributedRowMatrix and RowSimilarityJob. But I still cannot see
>>>>>>> an easy way of parsing the output of RowSimilarityJob into the
>>>>>>> memory representation I intend to use.
>>>>>>>
>>>>>>> Thanks a lot.
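
For the question at the top of this thread: Vector.Element exposes index() alongside get(), so the DocID half of each pair would normally come from index() and the similarity from get(). If you go the text-parsing route instead, below is a minimal sketch using only the JDK. It assumes the textual form shown earlier in the thread, {id:value,id:value,...}, and the class and method names (SimilarityRowParser, Pair, parse, toCsv) are hypothetical, not part of Mahout. Each parsed triple can be written out as a comma-separated line, which is the kind of input FileItemSimilarity was described as expecting above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Mahout class): parses "{id:sim,id:sim,...}" text. */
public class SimilarityRowParser {

    /** One (rowId, otherId, similarity) triple taken from a similarity row. */
    public static final class Pair {
        public final long rowId;
        public final long otherId;
        public final double similarity;

        Pair(long rowId, long otherId, double similarity) {
            this.rowId = rowId;
            this.otherId = otherId;
            this.similarity = similarity;
        }

        /** Comma-separated form: rowId,otherId,similarity. */
        public String toCsv() {
            return rowId + "," + otherId + "," + similarity;
        }
    }

    /**
     * Parses the assumed text form of one row, e.g. "{61112:0.21,52144:0.23}".
     * Entries without a colon (such as an elided "...") are skipped.
     */
    public static List<Pair> parse(long rowId, String vectorText) {
        String body = vectorText.trim();
        if (body.startsWith("{") && body.endsWith("}")) {
            body = body.substring(1, body.length() - 1);
        }
        List<Pair> pairs = new ArrayList<>();
        for (String rawEntry : body.split(",")) {
            String entry = rawEntry.trim();
            int colon = entry.indexOf(':');
            if (colon < 0) {
                continue; // empty or elided entry, nothing to parse
            }
            long otherId = Long.parseLong(entry.substring(0, colon).trim());
            double similarity = Double.parseDouble(entry.substring(colon + 1).trim());
            pairs.add(new Pair(rowId, otherId, similarity));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String row = "{61112:0.21139380179557016,52144:0.23797846026935565}";
        for (Pair p : parse(0, row)) {
            System.out.println(p.toCsv()); // one rowId,otherId,similarity line per pair
        }
    }
}
```

The same triples could instead be passed to the ItemItemSimilarity constructor quoted above, if an in-memory GenericItemSimilarity is preferred over a file.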
