I think createDictionaryChunks is the first thing that runs inside createTermFrequencyVectors. It takes its input from DocumentProcessor.tokenizeDocuments, which outputs a Text key and a StringTuple value, so I suspect you need Text, StringTuple pairs as input. See SequenceFileTokenizerMapper.java.
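To make that concrete, the transformation from your term_id,doc_id input to Text/StringTuple pairs can be sketched in two steps: group term_ids by doc_id, then write each group as one SequenceFile record. Here's the grouping step in plain Java (class and method names are illustrative, not from Mahout; the SequenceFile.Writer lines are shown as comments and assume Hadoop and Mahout on the classpath):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermGrouping {

    // Group term_ids by doc_id, as in the example input in the thread.
    // Each resulting entry would become one SequenceFile record: the
    // doc_id as a Text key and the term_ids wrapped in a StringTuple value.
    public static Map<String, List<String>> groupTermsByDoc(List<String> csvLines) {
        Map<String, List<String>> termsByDoc = new LinkedHashMap<>();
        for (String line : csvLines) {
            // Each line is "term_id,doc_id"
            String[] parts = line.split(",");
            String termId = parts[0];
            String docId = parts[1];
            termsByDoc.computeIfAbsent(docId, k -> new ArrayList<>()).add(termId);
        }
        // With Hadoop/Mahout available, each entry could then be written
        // roughly like this (sketch, not compiled here):
        //   SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        //       path, Text.class, StringTuple.class);
        //   writer.append(new Text(docId), new StringTuple(termIds));
        return termsByDoc;
    }

    public static void main(String[] args) {
        List<String> input = List.of("55,1", "61,1", "29,2", "98,3");
        Map<String, List<String>> grouped = groupTermsByDoc(input);
        System.out.println(grouped); // {1=[55, 61], 2=[29], 3=[98]}
    }
}
```

The output of this grouping matches the key/value pairs Jack listed (doc 1 → {55, 61}, doc 2 → {29}, doc 3 → {98}), ready to be fed to DictionaryVectorizer.createTermFrequencyVectors once written out as Text/StringTuple.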
On Sep 13, 2011, at 10:52 AM, Jack Tanner wrote:

> Ping? Please help if you can. Maybe I was unclear the first time; let me try
> again.
>
> I have input like this:
>
> term_id,doc_id
> 55,1
> 61,1
> 29,2
> 98,3
>
> I want to do clustering, so (I think) I need to transform that into a bunch
> of SequenceFile objects:
>
> key:1, value:<55,61>
> key:2, value:<29>
> key:3, value:<98>
>
> What's the format of the SequenceFile value? IntTuple? IntegerTuple?
> Something else?
>
> The next step would be to use DictionaryVectorizer.createTermFrequencyVectors
> and TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
>
> On 9/9/2011 12:17 PM, Jack Tanner wrote:
>> Hi all. I've got some documents described by binary features with
>> integer ids, and I want to read them into sparse Mahout vectors to do
>> TF-IDF weighting and clustering. I do not want to paste them back
>> together and run a Lucene tokenizer. What's the clean way to do this?
>>
>> I'm thinking that I need to write out SequenceFile objects, with a
>> document id key and a value that's an IntTuple. Is that right? Or
>> should I use an IntegerTuple instead? It feels wrong to use either,
>> actually, because these tuples claim to be ordered, but my features are
>> not ordered.
>>
>> I would then use DictionaryVectorizer.createTermFrequencyVectors and
>> TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
>>
>> Am I on the right track?

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
