Hi, I extracted the contents of a Lucene index like so:

bin/mahout lucene.vector --dir /path/to/index/ --output /path/to/vectors --dictOut /path/to/dict --field text --idField id --weight TFIDF --maxDFPercent 90 --minDF 10

And then I tried to get the cosine similarity between the docs like so:

bin/mahout rowsimilarity -i /path/to/vectors -o /path/to/25nn-matrix -s SIMILARITY_UNCENTERED_COSINE -m 25 -r 10000000

But I got this:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob$RowWeightMapper.map(RowSimilarityJob.java:198)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

I assume this refers to the document ID or something -- since the actual tf.idf scores will be doubles, right?

Is there an easy way to convert these on the fly, or do I need to write something to do it?

Also, another (somewhat unrelated) question... The -r param to rowsimilarity specifies "Number of columns in the input matrix". What's the recommended approach when you don't know this in advance? Just set it much higher than you'll need (as I did above)?

Many thanks from a Mahout noob!

Andrew.

-- 
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg