Hi Juan,

It would definitely be nice to have that in the API! It would be great if you could submit a patch once you have implemented this.

Best,
Sebastian

On 02/25/2014 10:52 AM, Juan José Ramos wrote:
Thanks for the answer.

That was the approach I had in mind in the first place. The only difference
is that I will write the output to a file that can later be used to
create a FileItemSimilarity.
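
A minimal sketch of that file-writing step, assuming FileItemSimilarity
reads lines of the form itemID1,itemID2,value (comma or tab separated, as
noted further down the thread). SimilarityCsvWriter and its methods are
hypothetical names, not part of Mahout, and the temp-file location is only
for illustration. The two sample similarities are the ones quoted in this
thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Mahout class): writes item-item similarities
// as comma-separated lines, one pair per line.
public class SimilarityCsvWriter {

    // Formats one similarity as "itemA,itemB,value".
    public static String toCsvLine(long itemA, long itemB, double value) {
        return itemA + "," + itemB + "," + value;
    }

    // Writes all lines to the target file as UTF-8 text.
    public static void write(Path target, List<String> lines) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(toCsvLine(0, 61112, 0.21139380179557016));
        lines.add(toCsvLine(0, 52144, 0.23797846026935565));

        // Assumption: a temp file stands in for the real output location.
        Path target = Files.createTempFile("similarities", ".csv");
        write(target, lines);
        System.out.println("Wrote " + lines.size() + " lines to " + target);
    }
}
```

The resulting file could then be handed to FileItemSimilarity, which is
the step described in the paragraph above.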

I think that would be a very nice feature to have in the API.

Thanks again.


On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter <[email protected]> wrote:

I overlooked that you're interested in document similarities. Sorry again :)

Another way would be to read the output of RowSimilarityJob with an
o.a.m.common.iterator.sequencefile.SequenceFileDirIterable.

You create a list of instances of
o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity,

e.g. for the output


Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

you would do

list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
...

After that you create a GenericItemSimilarity from the list of
ItemItemSimilarities, which is the in-memory item similarity you asked for.
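
As a rough illustration of the mapping described above, the following
sketch turns one line in the "Key: ... Value: {...}" text form into
(row, column, value) triples, using only the Java standard library.
RowSimilarityLineParser and its Triple type are hypothetical names, not
part of Mahout; Triple merely stands in for ItemItemSimilarity so that the
sketch runs without Mahout on the classpath. It assumes the similarities
are available as text lines in exactly that shape (the job's own output
would be read with SequenceFileDirIterable instead), and the sample line
drops the elided "..." entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser (not a Mahout class) for lines such as
// "Key: 0: Value: {61112:0.21,52144:0.23}".
public class RowSimilarityLineParser {

    // Stand-in for ItemItemSimilarity: one (itemA, itemB, value) triple.
    public static final class Triple {
        public final long itemA;
        public final long itemB;
        public final double value;

        Triple(long itemA, long itemB, double value) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.value = value;
        }
    }

    // Captures the row id and the comma-separated "id:value" entries.
    private static final Pattern LINE =
        Pattern.compile("^Key:\\s*(\\d+):\\s*Value:\\s*\\{(.*)\\}\\s*$");

    public static List<Triple> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        long row = Long.parseLong(m.group(1));
        List<Triple> triples = new ArrayList<>();
        String body = m.group(2).trim();
        if (body.isEmpty()) {
            return triples; // row without any similar items
        }
        for (String entry : body.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Unexpected entry: " + entry);
            }
            triples.add(new Triple(row,
                Long.parseLong(parts[0].trim()),
                Double.parseDouble(parts[1].trim())));
        }
        return triples;
    }

    public static void main(String[] args) {
        String line =
            "Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565}";
        for (Triple t : parse(line)) {
            System.out.println(t.itemA + "," + t.itemB + "," + t.value);
        }
    }
}
```

Each triple corresponds to one list.add(new ItemItemSimilarity(...)) call
in the example above.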

Hope that helps,
Sebastian



On 02/24/2014 10:04 PM, Juan José Ramos wrote:

Correct me if I'm wrong, but isn't ItemSimilarityJob meant for
item-based CF? In particular, the documentation says:
Preferences in the input file should look like
userID,itemID[,preferencevalue]

In my case, the input is just text documents, and I want to pre-compute
similarities between them before any user has expressed a preference
value for any item.

In order to use ItemSimilarityJob for this purpose, what input should I
provide? Would it be the output of seq2sparse?

Thanks again.


On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter <[email protected]> wrote:

You're right, my bad. If you don't use RowSimilarityJob directly, but
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
(which calls RowSimilarityJob under the covers), your output will be a
text file that is directly usable with FileItemSimilarity.

--sebastian


On 02/24/2014 09:30 PM, Juan José Ramos wrote:

Thanks for the prompt reply.

RowSimilarityJob produces output of the form:
Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

whereas FileItemSimilarity expects comma- or tab-separated input.

I assume that you meant that the output of RowSimilarityJob can be loaded
by the FileItemSimilarity after doing the appropriate parsing. Is that
correct, or is there actually a way to load the raw output of
RowSimilarityJob into FileItemSimilarity?

Thanks.


On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter <[email protected]> wrote:

The output of RowSimilarityJob can be loaded by the FileItemSimilarity.


--sebastian


On 02/24/2014 08:31 PM, Juan José Ramos wrote:

Is there a way to reproduce this process:

https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

inside Java code and not using the command line tool? I am not interested
in the clustering part but in 'Calculate several similar docs to each doc
in the data'. In particular, I am interested in loading the output of the
rowsimilarity tool into memory to be used as my custom ItemSimilarity
implementation for an ItemBasedRecommender.

What I want exactly is to have a matrix in memory where, for every doc in
my catalogue, I have the similarity with the 100 most similar items (100
is the threshold I am using) and an undefined similarity for the rest.

Is it possible to do this with the Java API? I know it can be done by
calling the commands from inside the Java code and, I guess, also by using
the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and
RowSimilarityJob. But I still cannot see an easy way of parsing the output
of RowSimilarityJob into the in-memory representation I intend to use.

Thanks a lot.










