Yes. You create a sequence file using, for example, the SequenceFile.Writer class, writing the name of each file as the key and its whole content as the value. You can write as many files as you want into the sequence file.
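To make the key/value layout concrete, here is a minimal sketch that only collects the (file name, content) pairs from a directory, using plain Java and no Hadoop dependency. The class name SequencePairs and the method collect are hypothetical names chosen for illustration. In a real job, each pair would then be passed to SequenceFile.Writer.append as a Text key and a Text value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of Hadoop or Mahout.
public class SequencePairs {

    // Returns file name -> full content for every regular file directly
    // inside dir, in sorted path order. Assumes the files are UTF-8 text.
    public static Map<String, String> collect(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        Map<String, String> pairs = new LinkedHashMap<>();
        for (Path file : files) {
            String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            // Key: the file name. Value: the whole document.
            // With Hadoop on the classpath, this is where you would call
            // writer.append(new Text(name), new Text(content)).
            pairs.put(file.getFileName().toString(), content);
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // Uses the current directory when no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, String> e : collect(dir).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue().length() + " chars");
        }
    }
}
```

Using the file name as the key means the document id carries through the later tokenizing and vectorizing steps, so you can still tell which vector belongs to which file.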
Then you use this *.seq file as the input to DocumentProcessor.tokenizeDocuments to tokenize it (you can use a stemmer here). The result is a folder with files containing the tokens. This folder must be the input to the DictionaryVectorizer.createTermFrequencyVectors method, which creates the TF vectors of the corpus. Finally, the resulting folder of TF vectors is the input to the LDA algorithm, which you can run with the "bin/mahout lda" tool or call from a Java program. You do not need to run clustering before the LDA algorithm, because it performs a clustering process itself.

[https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html]

On Sun, Apr 29, 2012 at 1:11 AM, Aneesha <[email protected]> wrote:
> I create sequential file and create vector for k-means. Is it the same input we
> need to use for Latent Dirichlet Allocation????
