Hi,
I'm trying to use RowSimilarityJob (current trunk) to calculate pairwise
similarities between feature vectors but I'm struggling a bit with the
correct input format.
I used SparseVectorsFromSequenceFiles to create a set of vectors from my
documents. However, using the resulting tfidf vectors directly as input
to RowSimilarityJob doesn't work: SparseVectorsFromSequenceFiles writes
vectors with Strings as keys, while RowSimilarityJob seems to expect
IntWritable keys.
Some older docs also mention a DistributedRowMatrix as the input, but
I'm not sure whether that still applies.
Any hints? Is RowSimilarityJob a good choice for this task at all?
Thanks for your help,
Sören