Calculating cosine similarity for vectors extracted from Lucene

Andrew Clegg Sun, 12 Jun 2011 03:35:19 -0700

Hi,

I extracted the contents of a Lucene index like so:


bin/mahout lucene.vector --dir /path/to/index/ --output /path/to/vectors \
  --dictOut /path/to/dict --field text --idField id \
  --weight TFIDF --maxDFPercent 90 --minDF 10

And then I tried to get the cosine similarity between the docs like so:

bin/mahout rowsimilarity -i /path/to/vectors -o /path/to/25nn-matrix \
  -s SIMILARITY_UNCENTERED_COSINE -m 25 -r 10000000

But I got this:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob$RowWeightMapper.map(RowSimilarityJob.java:198)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

I assume this refers to the row key (the document ID), since the
actual TF-IDF scores will be doubles, right? Is there an easy way to
convert the keys on the fly, or do I need to write something to do it?
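
To make the question concrete: the trace suggests that rowsimilarity reads
IntWritable row keys while the extracted vectors carry LongWritable keys.
Below is a minimal Python sketch of the re-keying step being asked about.
It is illustrative only: it works on in-memory pairs rather than Hadoop
SequenceFiles, and rekey_rows and the sample data are hypothetical names,
not part of Mahout.

```python
INT_MAX = 2**31 - 1  # largest value a signed 32-bit int key can hold


def rekey_rows(rows):
    """Yield (int_key, vector) pairs from (long_key, vector) pairs.

    Raises ValueError if a key does not fit in a signed 32-bit int,
    because silently truncating it could merge unrelated rows.
    """
    for key, vector in rows:
        if not 0 <= key <= INT_MAX:
            raise ValueError(f"key {key} does not fit in a 32-bit int")
        yield int(key), vector


# Hypothetical sample: document keys mapped to sparse tf-idf vectors
# (term index -> weight).
rows = [(0, {3: 0.5, 7: 1.2}), (1, {7: 0.4})]
rekeyed = list(rekey_rows(rows))
```

The range check is the part that matters: a long key above 2^31 - 1 has no
int equivalent, so a real conversion would need to fail or renumber the rows
rather than cast blindly.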

Also, another (somewhat unrelated) question... The -r param to
rowsimilarity specifies "Number of columns in the input matrix".
What's the recommended approach when you don't know this in advance?
Just set it much higher than you'll need (as I did above)?
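
One way to avoid guessing would be to derive the column count from the
vectors themselves: the highest term index that occurs, plus one. The sketch
below shows that calculation on in-memory sparse vectors. It assumes term
indices start at 0, and column_count and the sample data are hypothetical
names, not a Mahout API. Whether an oversized -r value has any side effects
in rowsimilarity is not something this sketch addresses.

```python
def column_count(vectors):
    """Return the smallest column count that covers every index in use.

    Each vector is a dict mapping a zero-based term index to its weight.
    An empty collection (or only empty vectors) yields 0.
    """
    max_index = -1
    for vector in vectors:
        if vector:  # skip empty vectors; max() of an empty dict would fail
            max_index = max(max_index, max(vector))
    return max_index + 1


# Hypothetical sample: three sparse tf-idf vectors, one of them empty.
vectors = [{3: 0.5, 7: 1.2}, {7: 0.4}, {}]
n_columns = column_count(vectors)
```

If the dictionary written by --dictOut holds one entry per term, its size
would presumably give the same number without scanning the vectors, but that
is an assumption about the file's contents.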

Many thanks from a Mahout noob!

Andrew.

-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
