Hi all. I've got some documents described by binary features with
integer ids, and I want to read them into sparse Mahout vectors to do
TF-IDF weighting and clustering. I don't want to paste the features back
together into text and run a Lucene tokenizer. What's the clean way to do this?
I'm thinking that I need to write out SequenceFile objects, with a
document-id key and an IntTuple value. Is that right? Or should I use
an IntegerTuple instead? It feels wrong to use either, actually,
because these tuples claim to be ordered, but my features are not.
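For what it's worth, here's a minimal sketch (plain Python, not Mahout; the record layout and helper name are my own invention) of the shape I have in mind: one record per document, with the unordered binary features stored as a deduplicated, sorted list of integer ids. The sort is only for determinism; if the vectorizer just counts occurrences, the order inside the tuple shouldn't matter.

```python
# Sketch of the record shape: each document becomes a
# (doc_id, feature_ids) pair. Features are binary, so repeated
# ids collapse to a single occurrence.

def make_record(doc_id, feature_ids):
    """Normalize one document's binary features into a record."""
    return (doc_id, sorted(set(feature_ids)))  # dedupe: features are binary

docs = {
    "doc1": [42, 7, 7, 99],  # the repeated 7 collapses to one feature
    "doc2": [7, 13],
}

records = [make_record(d, f) for d, f in sorted(docs.items())]
# records == [("doc1", [7, 42, 99]), ("doc2", [7, 13])]
```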
I would then use DictionaryVectorizer.createTermFrequencyVectors and
TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
Am I on the right track?
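To make it concrete, here's the weighting I'd expect the pipeline to end up producing, sketched in plain Python. This is not Mahout's exact formula (I believe TFIDFConverter follows a Lucene-style similarity with smoothing that I'm not reproducing); I'm just using the textbook idf = ln(N/df). Since the features are binary, tf is always 1, so each weight reduces to the feature's idf.

```python
import math

def tfidf_weights(docs):
    """Textbook tf-idf for binary features: tf == 1, so weight == idf."""
    n = len(docs)
    df = {}  # document frequency per feature id
    for feats in docs.values():
        for f in set(feats):
            df[f] = df.get(f, 0) + 1
    # one sparse vector per doc: {feature_id: idf}
    return {d: {f: math.log(n / df[f]) for f in set(feats)}
            for d, feats in docs.items()}

docs = {"doc1": [7, 42], "doc2": [7, 13]}
w = tfidf_weights(docs)
# feature 7 appears in both docs  -> idf = ln(2/2) = 0.0
# features 42 and 13 appear once  -> idf = ln(2/1)
```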
(subject: vectors from pre-tokenized terms)

- Jack Tanner