Hi all. I've got some documents described by binary features with integer ids, and I want to read them into sparse Mahout vectors to do TF-IDF weighting and clustering. I do not want to paste the features back together into text and run a Lucene tokenizer over them. What's the clean way to do this?

I'm thinking that I need to write out SequenceFiles, with a document-id key and an IntTuple value. Is that right? Or should I use an IntegerTuple instead? Actually, it feels wrong to use either, because these tuples claim to be ordered, but my features are not ordered.
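For concreteness, here's a minimal sketch of what I'm imagining: a &lt;Text, StringTuple&gt; SequenceFile, with the integer feature ids stringified, since (if I understand the code correctly) that's the shape the tokenization stage of SparseVectorsFromSequenceFiles hands to the vectorizer. The paths, class name, and doc-id scheme are just placeholders I made up, and this needs Hadoop and Mahout on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class WriteFeatureDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Assumed output location; adjust to taste.
    Path out = new Path("feature-docs/part-00000");

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, StringTuple.class);
    try {
      // One record per document: each integer feature id becomes a "token".
      int[][] docs = { {3, 17, 42}, {5, 17} };
      for (int i = 0; i < docs.length; i++) {
        StringTuple features = new StringTuple();
        for (int f : docs[i]) {
          features.add(Integer.toString(f));
        }
        writer.append(new Text("doc-" + i), features);
      }
    } finally {
      writer.close();
    }
  }
}
```

The ordering worry goes away here in practice, I think, since for binary features the tuple order shouldn't matter once the features are counted into term-frequency vectors.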

I would then run DictionaryVectorizer.createTermFrequencyVectors followed by TFIDFConverter.processTfIdf over that output, just as SparseVectorsFromSequenceFiles does.

Am I on the right track?