Re: Load output of rowsimilarity to memory

Juan José Ramos Tue, 25 Feb 2014 02:59:27 -0800

Regarding the parsing of a VectorWritable object, what is the recommended
approach to access the different 'DocID: similarity' pairs?


I can see that if I get the String representation of the
org.apache.mahout.math.Vector object, it should not be hard to parse.

However, is there a way to access the individual elements of the 'DocID:
similarity' pair directly? I tried iterating through the individual
Vector.Element objects and calling get(), but that does not return what I
need.
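
For the text-parsing route, a minimal sketch in plain Java might look like
the following. It assumes the vector prints as {index:value,...}, as in the
RowSimilarityJob output quoted further down this thread. The class name
SimilarityRowParser and its parse method are hypothetical helpers, not part
of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: parses the printed form of one row.
class SimilarityRowParser {

    // Parses text such as "{61112:0.21,52144:0.23}" into docID -> similarity,
    // keeping the pairs in the order in which they appear.
    static Map<Integer, Double> parse(String text) {
        Map<Integer, Double> result = new LinkedHashMap<>();
        String body = text.trim();
        // Strip the surrounding braces if both are present.
        if (body.startsWith("{") && body.endsWith("}")) {
            body = body.substring(1, body.length() - 1);
        }
        if (body.trim().isEmpty()) {
            return result;
        }
        for (String pair : body.split(",")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Malformed pair: " + pair);
            }
            int docId = Integer.parseInt(pair.substring(0, colon).trim());
            double similarity = Double.parseDouble(pair.substring(colon + 1).trim());
            result.put(docId, similarity);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> row =
            parse("{61112:0.21139380179557016,52144:0.23797846026935565}");
        row.forEach((docId, similarity) ->
            System.out.println(docId + " -> " + similarity));
    }
}
```

On the Vector.Element question: besides get(), which returns the value, the
interface also has index(), so each 'DocID: similarity' pair should
correspond to (element.index(), element.get()) with no string parsing.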

More than happy to contribute to the project once I get this working.

Thanks a lot.

On Tue, Feb 25, 2014 at 9:52 AM, Juan José Ramos <[email protected]> wrote:

> Thanks for the answer.
>
> That was the approach I had in mind in the first place; the only
> difference would be that I will write the output to a file that can later
> be used to create a FileItemSimilarity.
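
Writing that file could be sketched as below, assuming FileItemSimilarity
reads one itemID1,itemID2,similarity triple per line (comma separated, as
discussed further down this thread). SimilarityCsvFormatter and toCsvLines
are hypothetical names, not Mahout classes, and the map of similarities is
assumed to have been read from the job output already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: formats one similarity row as
// comma-separated "itemID1,itemID2,similarity" lines.
class SimilarityCsvFormatter {

    // rowId is the key of the row; similarities maps the other item's ID
    // to its similarity with rowId.
    static List<String> toCsvLines(long rowId, Map<Long, Double> similarities) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarities.entrySet()) {
            lines.add(rowId + "," + entry.getKey() + "," + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<Long, Double> similarities = new LinkedHashMap<>();
        similarities.put(61112L, 0.21139380179557016);
        similarities.put(52144L, 0.23797846026935565);
        // Print the lines for row 0; a real run would collect all rows.
        for (String line : toCsvLines(0L, similarities)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines can then be written out with
java.nio.file.Files.write(path, lines) and the file handed to
FileItemSimilarity.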
>
> I think that would be a very nice feature to have in the API.
>
> Thanks again.
>
>
> On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter <[email protected]> wrote:
>
>> I overlooked that you're interested in document similarities. Sorry again :)
>>
>> Another way would be to read the output of RowSimilarityJob with a
>> o.a.m.common.iterator.sequencefile.SequenceFileDirIterable
>>
>> You create a list of instances of
>> o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity
>>
>> e.g. for the output
>>
>>
>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>
>> you would do
>>
>> list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
>> list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
>> ...
>>
>> After that you create a GenericItemSimilarity from the list of
>> ItemItemSimilarities, which is the in-memory item similarity you asked for.
>>
>> Hope that helps,
>> Sebastian
>>
>>
>>
>> On 02/24/2014 10:04 PM, Juan José Ramos wrote:
>>
>>> Correct me if I'm wrong, but isn't ItemSimilarityJob meant for
>>> item-based CF? In particular, the documentation says:
>>> Preferences in the input file should look like
>>> userID,itemID[,preferencevalue]
>>>
>>> In my case the input is just text documents, and I want to pre-compute
>>> similarities between them beforehand, even before any user has
>>> expressed a preference value for any item.
>>>
>>> In order to use ItemSimilarityJob for this purpose, what should be the
>>> input I need to provide? Would it be the output of seq2sparse?
>>>
>>> Thanks again.
>>>
>>>
>>> On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter <[email protected]>
>>> wrote:
>>>
>>>> You're right, my bad. If you don't use RowSimilarityJob directly, but
>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>> (which calls RowSimilarityJob under the covers), your output will be a
>>>> text file that is directly usable with FileItemSimilarity.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 02/24/2014 09:30 PM, Juan José Ramos wrote:
>>>>
>>>>> Thanks for the prompt reply.
>>>>>
>>>>> RowSimilarityJob produces output of the form:
>>>>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>>>>
>>>>> whereas FileItemSimilarity expects comma- or tab-separated input.
>>>>>
>>>>> I assume you meant that the output of RowSimilarityJob can be loaded
>>>>> by the FileItemSimilarity after doing the appropriate parsing. Is that
>>>>> correct, or is there actually a way to load the raw output of
>>>>> RowSimilarityJob into FileItemSimilarity?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The output of RowSimilarityJob can be loaded by the FileItemSimilarity.
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>>
>>>>>> On 02/24/2014 08:31 PM, Juan José Ramos wrote:
>>>>>>
>>>>>>> Is there a way to reproduce this process:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>>>>
>>>>>>> inside Java code, without using the command-line tool? I am not
>>>>>>> interested in the clustering part but in 'Calculate several similar
>>>>>>> docs to each doc in the data'. In particular, I am interested in
>>>>>>> loading the output of the rowsimilarity tool into memory to be used as
>>>>>>> my custom ItemSimilarity implementation for an ItemBasedRecommender.
>>>>>>>
>>>>>>> What I want exactly is a matrix in memory where, for every doc in my
>>>>>>> catalogue, I have the similarity with the 100 most similar items (100
>>>>>>> is the threshold I am using) and an undefined similarity for the rest.
>>>>>>>
>>>>>>> Is it possible to do this with the Java API? I know it can be done by
>>>>>>> calling the commands from inside the Java code, and I guess also by
>>>>>>> using the corresponding SparseVectorsFromSequenceFiles,
>>>>>>> DistributedRowMatrix and RowSimilarityJob. But I still cannot see an
>>>>>>> easy way of parsing the output of RowSimilarityJob into the memory
>>>>>>> representation I intend to use.
>>>>>>>
>>>>>>> Thanks a lot.
